Get a step-by-step guide to building a free solution for Day 1 Vault logging and alerting on AWS.
This blog post will answer two challenging questions operators often face after deploying a HashiCorp Vault cluster:
… all without impacting Vault service availability or depending on legacy IT systems.
The post outlines a free solution for Day 1 logging and alerting that runs on Amazon Web Services (AWS). The configuration files can be applied anywhere you run Unix/Linux systems on a small compute instance. Follow along by building this solution yourself using the HashiCorp Terraform provisioning code in this GitHub repo: vault-syslog-ng-audit.
Here's a preview of what we'll build:
Vault logging to local syslog-ng socket buffer. Forwards to remote syslog-ng.
The ability to audit secrets access and administrative actions is a core element of Vault's security model. HashiCorp follows the Unix philosophy of building simple, modular tools that can be connected together. Rather than building security information and event management (SIEM) features into Vault, HashiCorp focuses engineering resources on core secrets management capabilities and on generating a robust audit log that can be sent to purpose-built log management and alerting tools.
However, many companies still struggle with log management and alerting. Small companies may not have a shared logging service yet, and using an existing logging service in a large company can be cumbersome or expensive. So this post presents an effective, no-cost Day 1 solution to this problem using syslog-ng that might be a good fit for your organization.
Vault generates two types of logs:
Audit logs can be sent to one or more audit devices, including local files, the local operating system's syslog framework, and remote log servers over raw UDP or TCP.
Server logs are sent to the systemd journal when using our standard OS packages for Red Hat and Ubuntu, or printed to the stdout/stderr file handles when starting Vault in a terminal.
Vault prioritizes audit log accuracy over service availability. Once you configure an audit device, Vault must be able to persist an audit entry to at least one audit device before each client request is serviced. If all of your configured audit devices are unavailable, client requests will hang until their audit entries can be written. This means that Vault availability is only as good as the combined availability of the configured audit device(s).
This flowchart illustrates a simplified view of how client requests, responses, and their associated audit logs are handled in Vault. A working audit device is required at each purple box for request handling to proceed.
Vault API request and audit log process flow.
Enabling a local audit file is where most people start because it's the easiest option with no external dependencies, and it requires only one command.
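For reference, enabling a file audit device takes a single CLI command; the path below is illustrative:

vault audit enable file file_path=/var/log/vault/vault_audit.log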
But this approach also has downsides:
Reading local audit files requires SSH access to Vault nodes, which violates the extended Vault production hardening guidelines.
Reviewing separate audit log files individually is cumbersome.
Filesystems can fill quickly when Vault is under high load, when executing benchmark tests, or if misbehaving clients flood the cluster with requests in an infinite loop. This causes a Vault service outage.
Forwarding logs from a local audit file to a network log server (with tools like Logstash or Promtail) simplifies analysis and review in a central location, but filesystems can still fill quickly under load, especially if the local file is rotated only hourly or daily.
Sending audit logs through the operating system syslog interface prepends messages with hostname and timestamp headers, which prevents operators from easily parsing raw logs with standard JSON tools like jq.
Pointing Vault nodes directly to a network log server with the socket audit device provides easy centralized review, retains valid JSON format, and eliminates the potential to fill a local filesystem, but it causes Vault downtime when the log server is not responsive. This is unacceptable for many organizations.
Vault cluster with socket audit devices enabled for central log server.
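For reference, a socket audit device pointing at a remote log server is enabled like this (the address and port are illustrative):

vault audit enable socket address=10.0.0.10:1515 socket_type=tcp

In the buffered design described below, that same device type points at a local syslog-ng listener (127.0.0.1:1515 in this post's configuration) instead of a remote server.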
Logging strategies are debated among HashiCorp engineers, but perhaps the best option is to run a forwarding syslog-ng daemon on each Vault node. This daemon forwards audit logs directly to a centralized log server and buffers log messages to disk only when the remote server is unavailable.
Analysis, alerting, and any other resource-intensive tasks can be performed on the remote log server while keeping the Vault node as simple and reliable as possible.
Vault node logging to network log server through local syslog-ng buffer.
This provides several benefits: audit logs stay valid JSON end to end, the local filesystem cannot fill without bound, analysis and alerting run off the Vault nodes, and an unreachable log server no longer blocks Vault requests.
Because Vault logs audit messages in JSON format and server logs as traditional unstructured text, sending them to separate listeners on the log server simplifies parsing and alerting. This example uses syslog-ng for the log server.
Syslog-ng forwarding Vault audit logs and systemd journal to remote syslog-ng server.
Once the cluster logs are received and persisted to disk by the central syslog-ng instance, syslog-ng's program() driver can send log messages to separate processes for analysis and alerting. One Python script performs regex matching on Vault's unstructured server logs. A second script parses JSON audit logs into Python dictionary objects and alerts on conditions in each message.
When the Python scripts identify security events, they post notifications to a Slack webhook for real-time operator visibility and write them to an alert log for long-term archiving on the syslog-ng instance.
Vault server to remote syslog-ng with Slack notifications.
I initially attempted to do this with syslog-ng filters and the included JSON parser, but found the documentation difficult to follow and the parser hard to configure correctly. I chose to implement the message parsing and alerting logic in Python to keep the syslog-ng configs as simple as possible, easy to understand, and reliable.
This can easily be expanded to send messages into:
syslog-ng program() handler sending to multiple destinations, AWS SNS, Azure Message Bus, Slack, etc.
The Terraform code in this post’s companion GitHub repo will deploy a single Vault node with a local socket buffer and a syslog-ng network log server so you can see this solution in action and test it in your environment:
The deployment includes the following resources:
Update terraform.tfvars with the required values for your account:
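The values below are placeholders only; check variables.tf in the repo for the actual variable names it expects:

# Placeholder values; see variables.tf for the real variable names.
aws_region    = "us-east-1"
key_pair_name = "my-ssh-keypair"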
If you want to send event notifications to Slack, set the webhook URL as an environment variable. This URL contains an embedded authorization token tied to your workspace, so don't check this into version control.
If you don't already have a webhook for your Slack workspace, you can use this URL to create one: https://api.slack.com/apps?new_app=1&ref=bolt_start_hub
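For example, assuming the repo exposes a Terraform variable for the webhook (check variables.tf for the exact name), you can pass it in through an environment variable rather than writing it to disk:

# Variable name is an assumption; adjust to match the repo's variables.tf.
export TF_VAR_slack_webhook_url="https://hooks.slack.com/services/T000/B000/XXXX"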
Run terraform apply:
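If this is a fresh checkout, initialize the providers first, then apply:

terraform init
terraform apply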
Wait a few minutes for the cloud-init scripts to complete. If everything worked correctly, you should see notifications in your Slack channel:
Sample Vault event notifications sent to Slack.
Note: These configs are known to work with Ubuntu 22.04 LTS. Adjustments may be needed for other releases or distributions.
The syslog-ng configs are commented to explain the purpose of key parameters. Here are some sections for review:
Navigate to /etc/syslog-ng/conf.d on the Vault instance and view vault.conf.
The configuration block below creates a raw TCP listener on port 1515 without applying any syslog message parsing. It listens only on the loopback interface, so unauthorized clients outside the Vault instance can't send messages to it.
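A sketch of what that listener looks like in syslog-ng syntax (the repo's actual block may differ in names and details):

# Local-only raw TCP listener; flags(no-parse) keeps Vault's JSON message intact.
source s_vault_audit {
    network(
        ip("127.0.0.1")
        port(1515)
        transport("tcp")
        flags(no-parse)
    );
};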
The next block creates a remote logging destination pointing to the syslog-ng instance. The ${syslog_ip} variable is populated during terraform apply and contains the syslog-ng instance's private IP address. The block also creates a disk buffer that will store up to 4GB of audit log messages in the event the downstream syslog-ng service is unavailable. This is the key feature that provides Vault service resiliency during log server outages.
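A sketch of such a destination and its log path; the port and buffer sizes are illustrative, and the 4GB disk buffer is the important part:

# Forward to the central syslog-ng server; buffer to disk if it is unreachable.
destination d_remote_syslog {
    network(
        "${syslog_ip}"              # filled in by Terraform at deploy time
        port(1514)                  # illustrative; match the server's listener
        transport("tcp")
        disk-buffer(
            reliable(yes)
            mem-buf-size(10485760)          # ~10MB held in memory
            disk-buf-size(4294967296)       # up to 4GB on disk during outages
        )
    );
};

# Tie the local listener to the remote destination.
log { source(s_vault_audit); destination(d_remote_syslog); };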
The Vault instance syslog-ng config includes a template reference to the private IP address of the syslog-ng server and is defined here.
Navigate to /etc/syslog-ng/conf.d on the syslog-ng instance and view vault.conf:
The destination … program() lines create destination targets that launch the specified Python scripts and pass new messages into the stdin file handles of each script as the messages arrive. The scripts will be restarted automatically if they exit.
syslog-ng retains the raw audit log message body according to the specified template, so stored messages can be parsed with JSON tools. We include standard syslog fields for the unstructured server logs.
The syslog-ng server config is defined here.
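For orientation, server-side destinations of that shape might look like the sketch below. Script paths and the audit handler's file name are assumptions; vault-server-log-handler.py is the name referenced later in this post:

# Store the raw JSON audit message body so jq and friends can parse it.
destination d_vault_audit_file {
    file("/var/log/vault/audit" template("${MESSAGE}\n"));
};

# Pipe each audit message into the alerting script; syslog-ng restarts it if it exits.
destination d_vault_audit_alerts {
    program("/usr/local/bin/vault-audit-log-handler.py" template("${MESSAGE}\n"));
};

# Keep standard syslog fields on the unstructured server logs.
destination d_vault_server_alerts {
    program("/usr/local/bin/vault-server-log-handler.py" template("${ISODATE} ${HOST} ${MESSAGE}\n"));
};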
syslog-ng captures Vault audit logs, server logs, and generated alert logs to /var/log/vault on the syslog-ng instance.
audit contains Vault audit logs.
audit.alert contains a record of alerts generated from the audit log.
messages contains the Vault application logs from the systemd journal.
messages.alert contains a record of alerts generated from the Vault application logs.
Vault logs in /var/log/vault.
Each of the above log files will be rotated, compressed, and stored in the archive sub-directory.
Archived logs in /var/log/vault/archive.
A logrotate cron entry is used to rotate all the Vault-related log and alert files into /var/log/vault/archive:
ubuntu@syslogng:/etc 2022-06-16 20:26:20
$ cat /etc/cron.d/logrotate-vault
* * * * * root /usr/sbin/logrotate /etc/logrotate.d/vault-syslog-ng
ubuntu@syslogng:/etc 2022-06-16 20:26:28
$ cat /etc/logrotate.d/vault-syslog-ng
compress
compresscmd /usr/bin/zstd
compressext .zst
uncompresscmd /usr/bin/unzstd
/var/log/vault/audit
/var/log/vault/audit.alert
/var/log/vault/messages
/var/log/vault/messages.alert
{
rotate 2300
hourly
maxsize 1G
dateext
dateformat -%Y%m%d_%H:%M:%S
missingok
olddir /var/log/vault/archive
postrotate
invoke-rc.d syslog-ng reload > /dev/null
endscript
}
This logrotate configuration rotates log files hourly, or when they reach 1GB, and compresses them with Facebook's zstandard compression algorithm for better performance than gzip. You might prefer daily rotation for lower-volume environments. In that case, keep the "every minute" cron schedule and change the logrotate config from "hourly" to "daily".
Vault's KV engine is fast and can generate gigabytes of audit logs in less than a minute on fast hardware. This can fill a small filesystem during performance testing, since cron can execute logrotate only once per minute.
This should be less of an issue with a centralized log server, which can be easily (re)sized appropriately. The log server's filesystem size is specified in this section of the solution repo:
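As a rough illustration of the kind of setting involved (the resource layout, instance type, and sizes below are placeholders, not the repo's actual code):

# Placeholder sketch; see the linked section of the repo for the real definition.
resource "aws_instance" "syslog_ng" {
  ami           = var.ami_id
  instance_type = "t3.small"

  root_block_device {
    volume_size = 100    # GB of log storage; resize as retention needs grow
    volume_type = "gp3"
  }
}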
Resizing Note: Amazon EC2 block devices and instance types can be resized without destroying the instance, provided you keep the same AMI and CPU architecture. The Terraform AWS provider stops the instance, resizes, and restarts the instance, which comes back up on the same IP address.
You can also add a cron entry to sync the archive directory to S3 Infrequent Access or Glacier for cheap long-term storage:
Archiving logs and alerts to an S3 bucket for inexpensive long-term storage.
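For example, a cron entry along these lines (the bucket name is hypothetical) would sync the archive hourly:

# Sync archived logs to S3 Infrequent Access once an hour; bucket name is a placeholder.
0 * * * * root aws s3 sync /var/log/vault/archive s3://example-vault-log-archive/ --storage-class STANDARD_IA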
syslog-ng's program() driver streams incoming messages for each input source to the stdin file handle of the appropriate Python script. This is a common pattern for looping over stdin lines in Python:
Repo file: vault-server-log-handler.py (line 85)
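A minimal sketch of that pattern (not the repo's exact code):

import sys

def handle(message: str) -> None:
    # The real scripts apply regex matching or JSON parsing here.
    print(message, file=sys.stderr)

if __name__ == "__main__":
    # syslog-ng's program() driver writes one log message per line to stdin.
    for line in sys.stdin:
        message = line.rstrip("\n")
        if message:
            handle(message)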
The unstructured log handler uses a list of regular expressions and event labels to define which events should generate alerts and how to label them. The defaults cover log entries that matter in every environment. Check the main branch of the repo for the latest alert definitions.
Repo file: vault-server-log-handler.py (line 12)
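The patterns below are illustrative stand-ins for that list, not the repo's maintained definitions:

import re

# (pattern, label) pairs; illustrative only -- see the repo for the real definitions.
SERVER_LOG_ALERTS = [
    (re.compile(r"vault is sealed"), "vault-sealed"),
    (re.compile(r"unsealed"), "vault-unsealed"),
    (re.compile(r"audit.*failed"), "audit-device-failure"),
]

def match_alerts(message: str) -> list[str]:
    """Return the label of every alert pattern that matches this log line."""
    return [label for pattern, label in SERVER_LOG_ALERTS if pattern.search(message)]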
The JSON audit log handler makes all the audit log fields available as a Python dictionary so you can create arbitrary alert conditions, such as field values or time of day. You can also interface with external systems through other Python SDKs, REST APIs, etc.
To make things easier, I’ve created a couple of rules in the GitHub repository, including a simple string match looking for root token generation:
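A hedged sketch of what such a rule could look like, keyed on the request.path field in Vault's audit entries (not the repo's exact code):

import json

def check_root_token_generation(raw: str):
    """Illustrative rule: flag audit entries that touch the root token generation endpoints."""
    entry = json.loads(raw)
    path = entry.get("request", {}).get("path", "")
    if path.startswith("sys/generate-root"):
        return f"root token generation activity: {path}"
    return None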
And a match looking for a sensitive KV path in a specific namespace:
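And in the same spirit, a sketch of a namespace-plus-path match; the namespace and KV path values here are made up:

import json

SENSITIVE_NAMESPACE = "payments/"               # hypothetical namespace
SENSITIVE_PATH = "kv/data/prod/db-credentials"  # hypothetical KV path

def check_sensitive_kv_access(raw: str):
    """Illustrative rule: flag access to a sensitive KV path in a specific namespace."""
    entry = json.loads(raw)
    request = entry.get("request", {})
    namespace = request.get("namespace", {}).get("path", "")
    path = request.get("path", "")
    if namespace == SENSITIVE_NAMESPACE and path.startswith(SENSITIVE_PATH):
        return f"sensitive KV access: {namespace}{path}"
    return None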
Sending an alert to Slack is simple with the Python requests module. You append the alert message to a local file for permanent storage, then POST to the configured Slack webhook:
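In outline, that looks something like the sketch below; the environment variable name and alert log path are assumptions, and the repo's script may read them differently:

import os
import requests

# Assumed names; adjust to match the deployed script's configuration.
SLACK_WEBHOOK_URL = os.environ.get("SLACK_WEBHOOK_URL", "")
ALERT_LOG = "/var/log/vault/audit.alert"

def send_alert(message: str) -> None:
    # Append to the local alert log for long-term archiving...
    with open(ALERT_LOG, "a") as f:
        f.write(message + "\n")
    # ...then notify Slack if a webhook is configured.
    if SLACK_WEBHOOK_URL:
        requests.post(SLACK_WEBHOOK_URL, json={"text": message}, timeout=5)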
This solution provides centralized network log storage, durable long-term archiving, and alert notifications for critical Vault events. It costs only a few dollars a month in AWS compute charges for small installations.
This is an opinionated framework and is intended to serve as a starting point for teams that don't have access to a highly available logging and alerting system or that aren't sure which events should be monitored. It can also be useful for anyone who needs centralized logging without impacting Vault availability. I hope you find it valuable. Feel free to reach out to me in this HashiCorp Discuss Forum thread or your regional HashiCorp solutions engineer if you have any questions!