Learn how to implement a mature Vault monitoring and observability strategy to simplify finding answers to important Vault questions.
Observability is the ability to measure the internal states of a system by examining its outputs. In the context of HashiCorp Vault, the key outputs to examine are log files, telemetry metrics, and data scraped from API endpoints.
A mature Vault monitoring and observability strategy simplifies finding answers to important Vault questions. For example:
This post walks through how to architect a well-rounded Vault monitoring strategy with log analysis, telemetry analysis, and API/synthetic monitoring.
A comprehensive, production-grade HashiCorp Vault monitoring strategy should include three major components:
While there is some overlap in capabilities between the components, they are each focused on a different aspect of observability. When combined, they enable quick identification, analysis, and resolution of issues.
For example, a Vault operator might receive an alert from a synthetic monitor that detected a breached SLA in authenticating to Vault and reading a secret. Telemetry data might show that Vault is experiencing an abnormally high login request volume, resulting in high disk I/O. A subsequent audit log analysis would identify the source of the login traffic.
Perhaps the offender is a runaway application or team not conforming to Vault usage best practices. The operator could then have a conversation with that team or quickly push a fix to that app. They could also use telemetry and log data to set a reasonable rate limit for the app in question, preventing this situation from occurring again.
Infrastructure monitoring is a critical component of a comprehensive monitoring/observability strategy. Infrastructure and host-level events that should be adequately logged and observed include:
If the infrastructure hosting Vault is not healthy or stable, it will most likely impact its reliability and performance. Logs provided by the operating system in use, as well as metrics/telemetry agents, can be used as a source for these events.
There are many publicly available guides to system/host-level monitoring, and organizations should follow industry standards and best practices. If an organization already uses a specific log and metrics analysis solution, that vendor likely provides useful guidance. Here are some helpful guides:
To ensure adherence to SLAs defined by the organization, it is important to know how the Vault software is performing on top of the platform or infrastructure on which it is hosted. This includes an understanding of how much of the allocated system resources Vault is using on average and during busy periods such as a large deployment or other events that drive high request rates to Vault. It is also important to know the rate and timing of requests being made to Vault. The speed of Vault request handling and the times of high request volume are key factors to monitor for anomalous activity.
In order to know how Vault is being used, it is important to understand that Vault handles requests from both applications and users. Analyzing the requests that Vault is handling can help answer some important questions, such as:
An understanding of Vault service consumption can help Vault operators be sure that teams and applications are using Vault properly. It can help them discover patterns of use that are not efficient or that go against best practices. For example, teams may not be practicing proper token hygiene or setting reasonable time-to-live (TTL) values. A team might be unnecessarily relying on Vault during the run time of their application instead of just fetching secrets once at deploy time, causing a high reliance on Vault and a high volume of requests.
If a company uses chargebacks to recoup the cost of running Vault within the organization, Vault service consumption data can help determine each team’s bill. Service consumption data can help Vault operators identify and partner with the top teams by usage or with teams that use a particular feature to test Vault changes or upgrades in a development environment before going to production.
There are two types of Vault logs: the Vault operational log and the Vault audit log. Both logs contain useful and important information for teams operating a Vault service. It is important to understand what information exists in each of these logs to understand why and how they should be monitored.
The Vault logs should be sent to a log-analysis tool that allows analysis, search, and building of reports and dashboards using the log data. Examples of suitable log aggregation and analysis tools include:
Like many modern apps, Vault writes details about its internal operation and subsystems to standard output and standard error. On systemd-based Linux distributions, the journald daemon automatically captures Vault’s output to the system journal. Depending on the Linux distribution and specific journald configuration, the journald logs are typically found in log files matching one of these patterns:
It’s also possible to configure systemd to send the logs from a specific unit to a separate file, such as
The events logged in the Vault operational log match the format of many other common system logs and are time-stamped and categorized by severity.
Important event types logged in the Vault operational log include:
It is important to note that some of these events are also exposed in Vault telemetry. When the events are available in both logs and telemetry, it is up to the team implementing the monitoring to determine which source to use for monitoring/alerting purposes. The logs will likely have more context as they are often more verbose and can be compared with other log events occurring just before or just after the triggered event. For more information, see Vault operational log details in our documentation.
This log keeps a detailed record of all requests to Vault, and the associated responses, in JSON format. Sensitive fields in the request and response events are hashed with a salt value using HMAC-SHA256 before being written to the log.
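To illustrate what this hashing means in practice, the sketch below computes an HMAC-SHA256 digest with Python's standard library. The salt and secret values are made up for demonstration, and the hmac-sha256: prefix is included only to resemble how hashed values typically appear in audit entries. Vault's real salt is internal to each audit device, so this is not a way to reproduce values from a live log; the sys/audit-hash API endpoint exists for that purpose.

```python
import hashlib
import hmac


def audit_style_hash(salt: bytes, value: str) -> str:
    """Return an HMAC-SHA256 digest of value keyed with salt (illustrative only)."""
    digest = hmac.new(salt, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"hmac-sha256:{digest}"


# Hypothetical salt and secret, not real Vault values.
example_salt = b"example-salt"
print(audit_style_hash(example_salt, "my-secret-value"))
```

Because the same salt and input always produce the same digest, an operator can check whether a known value appears in the log without the log ever containing the value in plaintext.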
The Vault audit log is not enabled by default; it will need to be specifically configured and enabled using audit logging settings within Vault. Supported audit devices include file, syslog, and socket.
Because audit device failures can block Vault from processing further requests, we recommend configuring at least two audit devices. The audit devices can be of the same type. For example, it is possible to configure two file audit devices with each on a separate and independent disk volume.
Events that can be found in the Vault audit log include:
Typically, the events logged in the Vault audit log are authenticated requests or attempts to authenticate. Unauthenticated actions can be found in the Vault operational log.
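As a rough sketch of the kind of audit log analysis described here, the snippet below counts request entries by caller and path to surface the heaviest consumers. The sample entries are invented and heavily trimmed, and the field names used (type, auth.display_name, request.path) are assumptions based on the usual shape of audit entries; verify them against your own log output before relying on this.

```python
import json
from collections import Counter

# Invented, trimmed-down audit entries; real entries carry many more fields.
SAMPLE_LOG = """
{"type": "request", "auth": {"display_name": "approle-app1"}, "request": {"operation": "read", "path": "secret/data/app1/db"}}
{"type": "response", "auth": {"display_name": "approle-app1"}, "request": {"operation": "read", "path": "secret/data/app1/db"}}
{"type": "request", "auth": {"display_name": "approle-app1"}, "request": {"operation": "read", "path": "secret/data/app1/db"}}
{"type": "request", "auth": {"display_name": "ldap-alice"}, "request": {"operation": "update", "path": "secret/data/team-a/config"}}
"""


def top_requesters(lines, limit=5):
    """Count 'request' entries per (display_name, path) and return the most common."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if entry.get("type") != "request":
            continue  # count each request once, ignoring its paired response
        who = (entry.get("auth") or {}).get("display_name") or "unknown"
        path = (entry.get("request") or {}).get("path") or "unknown"
        counts[(who, path)] += 1
    return counts.most_common(limit)


for (who, path), count in top_requesters(SAMPLE_LOG.splitlines()):
    print(f"{count:>5}  {who}  {path}")
```

In practice a log-analysis tool would run this kind of aggregation as a saved query; the script form is mainly useful for ad hoc investigation.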
For more information, see Audit device notes in our documentation.
Vault telemetry provides both real-time and interval-based metrics about the status and usage of each Vault deployment. It is useful for determining current cluster health and identifying issues before they become critical.
Some metrics should be observed for anomalous values whereas others have specific recommended values on which to trigger alerts. Profile telemetry data over time to observe any abnormalities in resource usage, consumption patterns, and overall load. Establish baselines and trends, with any significant deviation indicating a potential problem.
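One simple way to express "significant deviation from a baseline" is a standard-deviation check against recent history. The sketch below is a minimal illustration of that idea, not a recommendation from the Vault documentation; the threshold of three standard deviations and the sample values are arbitrary assumptions, and most monitoring tools provide more robust anomaly detection built in.

```python
from statistics import mean, stdev


def is_anomalous(history, current, threshold=3.0):
    """Flag current if it is more than threshold standard deviations from the mean of history."""
    if len(history) < 2:
        return False  # not enough data to form a baseline
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        return current != baseline  # a flat baseline makes any change notable
    return abs(current - baseline) / spread > threshold


# Hypothetical requests-per-second samples from a quiet period.
recent = [100, 102, 98, 101, 99]
print(is_anomalous(recent, 101))
print(is_anomalous(recent, 150))
```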
Please note that new metrics are added periodically in new Vault releases, so some may be unavailable for teams using older versions of Vault. You can view available metrics by selecting your Vault version from the dropdown on the Vault telemetry internals page.
Telemetry is enabled and configured using the telemetry stanza in Vault’s configuration file. Like most changes to the config file, enabling telemetry requires a restart of the Vault service on each node. Vault provides built-in support for multiple telemetry providers, including:
It is up to each organization to select an appropriate telemetry provider compatible with its chosen monitoring tool. Depending on the selection, one can either stream telemetry to an available monitoring endpoint or scrape this data from the Prometheus-compatible /v1/sys/metrics API endpoint.
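For teams that scrape rather than stream, the sketch below shows one way to request the endpoint in Prometheus format and parse the response using only Python's standard library. The Vault address, token, and sample metric lines are placeholders, and the parser is deliberately simplified: it assumes each sample line ends with its value and carries no trailing timestamp. A real deployment would normally let Prometheus or a monitoring agent do the scraping.

```python
import urllib.request


def build_metrics_request(vault_addr: str, token: str) -> urllib.request.Request:
    """Build a request for Vault's metrics endpoint in Prometheus text format."""
    url = f"{vault_addr.rstrip('/')}/v1/sys/metrics?format=prometheus"
    return urllib.request.Request(url, headers={"X-Vault-Token": token})


def parse_prometheus_text(text: str) -> dict:
    """Parse 'series value' sample lines into {series: float}, skipping comments."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        try:
            samples[series] = float(value)
        except ValueError:
            continue  # ignore lines that do not end in a number
    return samples


# Illustrative response body; real output contains many more series.
sample_body = """\
# TYPE vault_core_unsealed gauge
vault_core_unsealed{cluster="c1"} 1
vault_runtime_num_goroutines 42
"""
print(parse_prometheus_text(sample_body))

# Against a live cluster (placeholder address and token):
# req = build_metrics_request("https://vault.example.com:8200", "<token>")
# with urllib.request.urlopen(req) as resp:
#     metrics = parse_prometheus_text(resp.read().decode("utf-8"))
```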
Vault’s server process aggregates runtime metrics about performance every 10 seconds. It also includes high-cardinality usage data such as token, entity, and secret counts. These high-cardinality items are aggregated every 10 minutes by default, but this frequency is tunable by adjusting the usage_gauge_period property in the telemetry stanza. Bear in mind that high-cardinality metrics put a larger load on Vault than real-time metrics. For this reason, it is best to avoid collecting them more frequently than the default without performance testing and a good reason.
We also recommend avoiding providers that don’t support labels (such as vanilla StatsD), as this results in a flattened metric key that requires additional processing to be useful. For example, the vault.token.count.by_policy metric would display as separate metrics (shown below) instead of a single metric with multiple labels that can be split or filtered on.
vault.token.count.by_policy.mycluster.ns1.policy1
vault.token.count.by_policy.mycluster.ns1.policy2
vault.token.count.by_policy.mycluster.ns2.policy3
vault.token.count.by_policy.mycluster.ns2.policy4
…
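To give a sense of the additional processing a flattened key requires, the sketch below splits such a key back into a metric name and labels. The label names (cluster, namespace, policy) are assumptions inferred from the example keys above, and the approach breaks down if any label value itself contains a dot, which is one reason a label-aware provider is preferable.

```python
def split_flattened_key(key: str, metric: str, label_names: list) -> dict:
    """Split a flattened StatsD-style key into its metric name and labels."""
    prefix = metric + "."
    if not key.startswith(prefix):
        raise ValueError(f"{key!r} does not belong to metric {metric!r}")
    parts = key[len(prefix):].split(".")
    if len(parts) != len(label_names):
        raise ValueError(f"expected {len(label_names)} label values, got {len(parts)}")
    return {"metric": metric, "labels": dict(zip(label_names, parts))}


print(
    split_flattened_key(
        "vault.token.count.by_policy.mycluster.ns1.policy1",
        "vault.token.count.by_policy",
        ["cluster", "namespace", "policy"],
    )
)
```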
A detailed write-up on one Vault monitoring pattern option is available in our documentation at Monitor telemetry & audit device log data.
As a starting point, the most critical metrics that could indicate an immediate threat to Vault stability are listed below. Create alerts for these metrics.
To better understand the request load on Vault, start with the metrics below. You might alert on anomalous changes and sudden spikes in request load.
Note: Unauthenticated requests against endpoints that are not handled at Vault’s outer HTTP layer, like sys/replication/status, are also captured in the vault_core_handle_login_request metric. This means the metric may display authentication requests in an otherwise idle cluster that is not receiving any client authentication requests.
For specific telemetry monitoring recommendations, please see our Telemetry metrics reference. Specifics on values that should trigger an alert are called out in the “what to look for” section of key metrics on that page.
Metric names and how they are formatted can vary depending on monitoring tool, telemetry provider, and whether the metrics are coming from HashiCorp-managed HCP Vault or from a self-managed Vault deployment.
HCP Vault emits a subset of the metrics available in the self-hosted Vault Enterprise release. This is meant to simplify monitoring by exposing only metrics that are actionable by operators while abstracting away those that are ultimately HashiCorp’s responsibility as a service provider. These metric names may appear slightly different from those emitted by a self-managed Vault and will be prefixed with hcp. For more information, please reference the HCP Vault metrics guide.
While each node in a cluster emits many metrics, there are exceptions.
Certain metrics are emitted only when there is a matching event, so it is normal to be “missing” data in some areas. Examples include the
Furthermore, some metrics emit only from the current cluster leader node because only the leader actively handles write operations and various other tasks. In a typical Vault cluster, non-leader nodes are in a standby state where they service read requests and forward all write requests to the leader.
Examples of metrics emitted only by the leader include:
In Vault versions prior to 1.13.0, 1.12.3, and 1.11.12, a further metrics subset is not emitted by non-performance standby nodes and is only emitted from the leader. This applies to all disaster recovery (DR) secondary clusters on earlier versions. One such example is the vault.core.unsealed metric, which is reported only by the leader in a DR secondary cluster. This is important to note when viewing dashboards and configuring alerts.
Synthetic monitoring involves simulating user interactions instead of relying on real user traffic to a service. This type of monitoring is valuable in measuring the performance of, and detecting issues with, the Vault service. The data collected provides a snapshot of what users are actually experiencing when interacting with Vault, which is particularly useful for SLA/OLA reporting.
Some monitoring solutions (e.g. Datadog, Dynatrace, and Splunk) provide out-of-the-box support for synthetic monitoring. If your chosen tool does not, you can build a simple script, run it on a recurring schedule, and stream the results to a service endpoint such as the Splunk HTTP Event Collector.
For each run, measure and track:
Using these measures, you can build an accurate understanding of how the Vault service is performing, from the perspective of the people and machines consuming it.
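A home-grown synthetic check can be as small as a wrapper that times an operation and records whether it succeeded. The sketch below illustrates that shape under stated assumptions: fake_login and fake_read_secret are hypothetical stand-ins for real Vault calls, and the result fields are just one possible layout. In a real script, the operations would call Vault and the JSON output would be sent to your monitoring endpoint.

```python
import json
import time


def run_check(name, operation):
    """Time one synthetic operation and record whether it succeeded."""
    started = time.perf_counter()
    try:
        operation()
        success, error = True, None
    except Exception as exc:  # record the failure instead of aborting the run
        success, error = False, str(exc)
    latency_ms = (time.perf_counter() - started) * 1000
    return {
        "check": name,
        "success": success,
        "latency_ms": round(latency_ms, 2),
        "error": error,
    }


def fake_login():
    """Hypothetical placeholder for a real Vault login call."""
    time.sleep(0.01)


def fake_read_secret():
    """Hypothetical placeholder that simulates a failed secret read."""
    raise TimeoutError("simulated timeout")


results = [
    run_check("login", fake_login),
    run_check("read_secret", fake_read_secret),
]
print(json.dumps(results, indent=2))
```

Catching the exception inside run_check keeps one failed step from hiding the results of the others, so every scheduled run still reports a complete set of measurements.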
To effectively monitor HashiCorp Vault, the organization’s platform team should design comprehensive synthetic monitoring scenarios. These scenarios should mimic real-world user interactions and include critical functionality and features of Vault. Here are some starter examples:
When designing a dashboard, focus on how to visualize data in an easily consumable format. Consider the target user group of each dashboard. The best way to construct and break out dashboards depends on an enterprise’s choice of tools, service architecture, and team skill set. What is natural and obvious to one team may be unclear to another.
A dashboard should satisfy a particular need or answer a particular question, such as:
Include dashboard-wide filters for different dimensions such as:
Consider also including the following dashboard-wide filters on a consumption-focused dashboard:
After completing prototype dashboards, build automation around the deployment and ongoing maintenance of them. It is important to drive consistency in naming and tile configuration across environments (i.e. dev should look the same as prod). An operator having to hunt for important data complicates analysis and wastes valuable time during an outage.
Similar to dashboarding, generating reports based on Vault data provides operators and management insights into compliance, security, access patterns, performance, and Vault adoption. Teams typically generate reports using a combination of the previously mentioned observability mechanisms and, in some cases, custom scripts that extract data directly from Vault.
Most enterprises find it valuable to correlate key indicators with organizational constructs (e.g. team, business unit, application, service, etc.). This is considerably easier when working with a well-defined path structure, naming convention, and tagging standard.
Path structure should tie back to how teams are organized within the enterprise. This makes it easier to generate a report that is split by each team or business unit. Reference the Vault namespace and mount structuring guide for more information.
Make naming consistent wherever possible. The most critical items are paths, policies, and namespaces. This not only eases report generation, but also allows humans to more quickly understand the structure and data within Vault.
Many Vault constructs — like namespaces, entities, entity aliases, and KV secrets — support custom metadata tagging. We recommend seeding key organizational information associated with each of these constructs via custom tags. This provides another way to map data back to the many dimensions of your business.
The reporting needs of each organization vary, but the examples below are common across many of our customers.
Comprehensive monitoring and observability of Vault is one of the most important components for operating Vault successfully as a shared service within an organization. If issues arise, proper monitoring can help a platform team confidently identify the source and the impact, enabling quicker issue resolution.
With the right strategy, organizations proactively discover risks and address them before they impact Vault consumers. A complete monitoring strategy offers a clear overview of how Vault is being implemented and utilized throughout the organization, enabling platform teams and leadership to make informed decisions based on data-driven insights.