How resilient is HCP Vault during real AWS regional outages?

Learn how HCP Vault Dedicated's design kept its data plane resilient during the AWS regional outage, and the lessons we’re applying to further strengthen our platform.

Jan 26, 2026

On October 20, 2025, a significant service disruption affected AWS us-east-1, impacting numerous services across the cloud ecosystem. This event provided a real-world validation of HashiCorp Cloud Platform (HCP) Vault Dedicated's architectural design principles.

At approximately 7:00 AM UTC, the HCP Vault control plane (hosted in us-east-1 and serving both AWS and Azure data planes) experienced elevated HTTP 500 error rates and intermittent panics. Despite the us-east-1 disruption, HCP Vault Dedicated customer clusters maintained 100% uptime throughout the event and continued to serve workloads without interruption. The affected control plane workflows were restored quickly with minimal impact, which reflects both our operational procedures and the resilience of building on foundational AWS compute and networking services.

This post shares our experience during this event, the architectural decisions that contributed to our data plane resilience, and the lessons we're applying to further strengthen our platform. Our goal is to provide transparency into how cloud-native services can be designed for resilience and to share practical insights with the broader community.

»Understanding HCP Vault Dedicated architecture

HCP Vault Dedicated is a SaaS version of Vault Enterprise that consists of two primary components:

»Control plane

This is a centralized management layer where customers perform operations such as:

Creating Vault clusters
Adjusting cluster configurations
Managing cluster tiers
Retrieving audit logs
Configuring backups or secondary regions

»Data plane

This consists of dedicated Vault clusters that handle all secrets management operations, including the CRUD-L (Create, Read, Update, Delete, List) API calls that applications depend on for accessing secrets, encryption keys, and other sensitive data.

This separation of concerns is fundamental to our architecture:

Administrative functions are centralized in the control plane for operational efficiency
Production workloads interact with Vault in the data plane and run independently in dedicated, customer-specific infrastructure
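
To make the separation concrete, here is a minimal sketch of a data plane read as an application might issue it. The request goes to the cluster's own address and never passes through the control plane. The cluster address, token, and secret path are placeholders, and the sketch assumes a KV version 2 secrets engine mounted at `secret` in the `admin` namespace; adjust all of these for a real cluster. It only builds the request, so it can be inspected without a live cluster.

```python
import urllib.request


def build_secret_read(cluster_addr: str, token: str, path: str,
                      mount: str = "secret",
                      namespace: str = "admin") -> urllib.request.Request:
    """Build a KV v2 read request aimed at a Vault cluster (the data plane)."""
    # KV v2 reads use /v1/<mount>/data/<path>; the mount name is an assumption.
    url = f"{cluster_addr.rstrip('/')}/v1/{mount}/data/{path.lstrip('/')}"
    headers = {
        "X-Vault-Token": token,          # client token, placeholder value below
        "X-Vault-Namespace": namespace,  # assumed namespace
    }
    return urllib.request.Request(url, headers=headers, method="GET")


# Placeholder values for illustration only.
request = build_secret_read(
    "https://vault-cluster.example.com:8200", "example-token", "app/db"
)
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would return the secret as JSON from the cluster itself. Because this path does not depend on the control plane, control plane errors should not interrupt reads like this one.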

»Impact summary

The impact reporting is broken down by the effects on the control plane and the data plane.

»Control plane impact

During the AWS service disruption, the HCP Vault control plane was partially impacted in specific administrative workflows, such as creating new snapshots, fetching audit logs, updating tiers, and adding new backup or secondary regions. Two primary workflows experienced transient issues:

Create snapshot workflow: This periodic backup process, which runs every 24 hours, failed due to errors when the control plane attempted to create a snapshot blob on S3.
Forward update workflow: This routine workflow for base image and certificate refresh operations experienced transient failures because it requires bringing up new EC2 instances for Vault clusters.

However, no customers reported issues accessing control plane functionality for their day-to-day administrative needs during the incident. After AWS services stabilized, these workflows resumed automatically, and no customer clusters were degraded. The HCP Vault team also retriggered the create snapshot workflow manually so that a backup of each customer cluster was still taken within every 24-hour period, as our SOC 2 compliance requires.

»Data plane impact

Zero customer impact. All HCP Vault Dedicated clusters across all regions and cloud providers (including those running in us-east-1) remained fully operational throughout the event.

»Key successes

We observed the following success indicators during the outage:

Zero downtime for the data plane despite regional cloud failure: Even with the AWS us-east-1 region-wide incident, HCP Vault clusters across various regions (including us-east-1) and cloud providers remained operational, meeting our 99.99% SLA across tiers. This confirms that our reliance on core AWS services like EC2 and VPC, and our minimal dependency on auxiliary services, provides stability benefits.
Multi-cloud fault tolerance: Azure data planes, despite their dependency on the control plane hosted in AWS us-east-1, continued functioning without interruption, validating our cross-cloud resiliency. This demonstrates that even when our control plane faces issues, individual customer Vault clusters remain online and fully functional.
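
For a sense of scale, a 99.99% availability target leaves very little room for downtime. The short calculation below works out the budget for two example windows. The 30-day and 365-day windows are assumptions chosen for illustration; this post does not state how the SLA is measured.

```python
def downtime_budget_minutes(sla_percent: float, window_days: float) -> float:
    """Minutes of downtime permitted in a window at the given availability target."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - sla_percent) / 100.0


# Example windows only; the real SLA measurement period may differ.
for days in (30, 365):
    print(f"{days} days: {downtime_budget_minutes(99.99, days):.2f} minutes")
```

That is roughly 4.3 minutes in a 30-day window and about 52.6 minutes over a year, which is why data plane uptime through a multi-hour regional event is significant.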

»Lessons learned and improvements

While the outage reinforced confidence in HCP Vault Dedicated's architecture, it also brought to light opportunities for long-term enhancements in resilience, particularly for the control plane:

»Automated workflow recovery

Current state: Some workflows required manual intervention to restart after the event.

Improvement: We're implementing intelligent retry mechanisms that will automatically resume failed workflows once dependencies are healthy, eliminating the need for manual intervention and ensuring continuous compliance with our backup schedules.
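
One common way to build this kind of recovery is to gate each retry on a dependency health check and back off exponentially between attempts. The sketch below illustrates that general pattern only; it is not the HCP implementation, and the function name, parameters, and default values are all hypothetical.

```python
import time
from typing import Callable, Optional


def run_with_recovery(
    workflow: Callable[[], None],
    dependencies_healthy: Callable[[], bool],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run a workflow, retrying with exponential backoff.

    The workflow is only invoked when dependencies report healthy.
    Returns the attempt number that succeeded, or raises RuntimeError.
    """
    delay = base_delay
    last_error: Optional[Exception] = None
    for attempt in range(1, max_attempts + 1):
        if dependencies_healthy():
            try:
                workflow()
                return attempt
            except Exception as exc:  # keep the failure and retry later
                last_error = exc
        if attempt < max_attempts:
            sleep(delay)
            delay = min(delay * 2, max_delay)  # double the wait, up to a cap
    raise RuntimeError(
        f"workflow did not complete after {max_attempts} attempts"
    ) from last_error


# Usage with a simulated dependency that recovers on the third health check.
health_checks = iter([False, False, True])
waits = []
attempts = run_with_recovery(
    workflow=lambda: None,  # stand-in for a real workflow
    dependencies_healthy=lambda: next(health_checks),
    sleep=waits.append,     # record waits instead of actually sleeping
)
print(attempts, waits)
```

Checking dependency health before each attempt avoids hammering a service that is still recovering, and the capped backoff keeps retries from piling up once it returns.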

»Fleet-wide workflow management

Current state: Workflow management during service disruptions required individual cluster attention.

Improvement: We're developing fleet-wide automation for snapshot creation and cluster update workflows, enabling faster, more comprehensive recovery from future regional events and reducing operational overhead.
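
As a rough illustration of what fleet-wide handling can look like, the sketch below scans every cluster for a snapshot older than the 24-hour interval and retriggers it in a single pass. The `Cluster` type, its fields, and the `trigger_snapshot` callback are hypothetical stand-ins, not HCP APIs.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable, Iterable, List, Tuple

SNAPSHOT_INTERVAL = timedelta(hours=24)  # backup cadence described in this post


@dataclass
class Cluster:
    cluster_id: str
    last_snapshot_at: datetime


def clusters_needing_snapshot(clusters: Iterable[Cluster], now: datetime) -> List[Cluster]:
    """Return clusters whose latest snapshot is at least one interval old."""
    return [c for c in clusters if now - c.last_snapshot_at >= SNAPSHOT_INTERVAL]


def backfill_snapshots(
    clusters: Iterable[Cluster],
    trigger_snapshot: Callable[[Cluster], None],
    now: datetime,
) -> Tuple[List[str], List[str]]:
    """Retrigger overdue snapshots across the fleet; one failure does not stop the rest."""
    succeeded: List[str] = []
    failed: List[str] = []
    for cluster in clusters_needing_snapshot(clusters, now):
        try:
            trigger_snapshot(cluster)
            succeeded.append(cluster.cluster_id)
        except Exception:
            failed.append(cluster.cluster_id)  # left for a later pass
    return succeeded, failed


# Usage with made-up clusters and a trigger that only records the call.
now = datetime(2025, 10, 21, 12, 0, tzinfo=timezone.utc)
fleet = [
    Cluster("cluster-a", now - timedelta(hours=25)),
    Cluster("cluster-b", now - timedelta(hours=2)),
    Cluster("cluster-c", now - timedelta(hours=30)),
]
triggered: List[str] = []
done, pending = backfill_snapshots(fleet, lambda c: triggered.append(c.cluster_id), now)
print(done, pending)
```

Collecting failures separately, instead of stopping at the first error, lets a single sweep cover the whole fleet and leaves a clear list for the next retry.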

»Looking forward

Cloud service disruptions are an industry reality, and no cloud provider is immune to complex system failures. What matters is how services are architected to handle these scenarios and what we learn from them.

This event demonstrated that HCP Vault Dedicated's architecture successfully protected customer workloads during a regional disruption. It also showed us where we can improve our administrative operations to provide an even more seamless experience during future events.

We're committed to continuous improvement and transparency. As we implement the enhancements described above, we'll continue sharing our learnings with the community.

»More information and references around the outage

For those interested in the technical details of the AWS event, Amazon has published a comprehensive post-event analysis: "Summary of the Amazon DynamoDB Service Disruption in the Northern Virginia (US-EAST-1) Region."

The report details the cascading impacts across DynamoDB, EC2, NLB, and other services, providing valuable insights into the complexity of operating large-scale distributed systems. Some key points:

Amazon DynamoDB experienced increased API error rates in the N. Virginia (us-east-1) region.
Network Load Balancer (NLB) experienced increased connection errors for some newly launched instances. This was caused by health check failures in the NLB fleet, which resulted in increased connection errors on some NLBs.
New EC2 instance launches failed. Launches began to succeed from 10:37 AM, but some newly launched instances experienced connectivity issues.

If you’re interested in testing HCP Vault Dedicated yourself, visit our HCP portal and sign up for the service. You can apply the $500 HCP trial credit toward it.