How does Vault handle various failure scenarios?

Co-founder Armon Dadgar describes HashiCorp Vault's three resilient failure modes, in a nutshell.


  • Armon Dadgar, Co-founder & CTO, HashiCorp


Vault is designed to operate in a data center, in an online fashion, so from the very beginning it was important to us to build in multiple layers of resiliency.

Single cluster failure

Where this starts is, when we talk about a single Vault cluster, we're not talking about one node running Vault. Typically, you're running three or five instances of Vault, and internally they're doing a leader election, so one of those instances is active and servicing traffic. If that node dies, the remaining nodes will detect the failure and automatically take over operation. So within one data center, if we lose the instance that's active, we're back up and servicing traffic again within 10 to 20 seconds.
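
One way to observe this failover from the outside is to poll each node's health endpoint and see which node reports itself as active. The sketch below is a minimal illustration, not an official client. It relies on the default status codes Vault documents for `GET /v1/sys/health` (200 for an active node, 429 for an unsealed standby, 503 for a sealed node). The helper names `classify`, `probe`, and `find_active`, and any node addresses passed to them, are assumptions made up for this example.

```python
import urllib.error
import urllib.request

# Default status codes returned by GET /v1/sys/health.
HEALTH_STATES = {
    200: "active",
    429: "standby",
    472: "dr-secondary",
    473: "performance-standby",
    501: "uninitialized",
    503: "sealed",
}


def classify(status_code):
    """Map a sys/health status code to a coarse node state."""
    return HEALTH_STATES.get(status_code, "unknown")


def probe(address, timeout=2.0):
    """Return the sys/health status code for one node, or None if unreachable."""
    url = address.rstrip("/") + "/v1/sys/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # Non-2xx codes (429, 503, ...) arrive as HTTPError but still carry a state.
        return err.code
    except (urllib.error.URLError, OSError):
        return None


def find_active(addresses, probe_fn=probe):
    """Return the first address whose node reports itself active, else None."""
    for address in addresses:
        if classify(probe_fn(address)) == "active":
            return address
    return None
```

A loop calling `find_active` during a failover would be expected to return `None` for the window in which no node holds leadership, and then the address of the newly elected node. The `probe_fn` parameter exists only so the logic can be exercised without a live cluster.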

Data center failure

Now, what happens if we zoom out from one data center and talk multi–data center? What if we lose the whole data center that is running Vault? The link goes down, there's a power loss, whatever might occur.

To handle this, Vault has a built-in notion of multi–data center replication. You run one Vault cluster in DC1, with three instances of Vault to provide HA within that data center, and you run another three instances of Vault in a second data center, and the two clusters replicate to one another. In this configuration, if we lose a whole data center, say data center one goes down, data center two is able to operate independently. In this situation, there's no service disruption to anything running in DC2.

Data loss and disaster recovery

Now, independent from that, what happens if we experience catastrophic data loss? Say we're running a Vault cluster, and operator error, disk corruption, or some other unforeseen event leads to total data loss in that cluster. What do we do?

Vault has a third mode built in, which is known as disaster recovery replication. In this mode, you're running your primary Vault cluster, again with multiple nodes for high availability, and you have a second DR site that is acting as a real-time backup. The moment any change is made against the primary site, that change is replicated in real time to our DR site, which is a full mirror.

Now, if we have a catastrophic loss of our primary site, what we can do is either promote the DR site and it takes over active operation, or we can use this as a backup and restore that back into our primary site and bring it back online.
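
The promotion path described above comes down to a single HTTP call. As a minimal sketch, assuming a Vault Enterprise DR secondary reachable at a hypothetical address and a DR operation token generated beforehand, the request could be built as follows. The endpoint path `sys/replication/dr/secondary/promote` and the `dr_operation_token` field come from Vault's replication API. The function names and the example address are illustrative only.

```python
import json
import urllib.request


def build_promote_request(dr_addr, dr_operation_token):
    """Build (but do not send) the request that promotes a DR secondary."""
    url = dr_addr.rstrip("/") + "/v1/sys/replication/dr/secondary/promote"
    body = json.dumps({"dr_operation_token": dr_operation_token}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def promote(dr_addr, dr_operation_token, timeout=10.0):
    """Send the promotion request and return the decoded JSON response."""
    req = build_promote_request(dr_addr, dr_operation_token)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        payload = resp.read()
    return json.loads(payload) if payload else {}


# Hypothetical usage; the address and token are placeholders:
# promote("https://vault-dr.example.com:8200", "<dr-operation-token>")
```

Building the request separately from sending it keeps the one-call nature of promotion visible and lets the request be inspected before anything touches a real DR cluster.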

Vault failure scenarios in a nutshell

So there are multiple failure scenarios we're talking about:

  • A failure within a single data center, where the system is designed to do an automatic failover within 10 to 20 seconds
  • A multi–data center failure, where you might lose a whole data center, and Vault is designed to have no service interruption if that takes place
  • Catastrophic data loss, where the design is around being able to promote a DR site. That promotion is only an API call, and within a few seconds it brings the DR site into active operation.

There are multiple layers of defense built in, all designed to operate in a production environment with mission-critical SLAs and to recover quickly.
