Co-founder Armon Dadgar describes, in a nutshell, the three layers of resiliency that HashiCorp Vault provides against failure.
When we talk about Vault, it's designed to operate in a data center, in an online fashion. So it's important to us from the very beginning to build in multiple different layers of resiliency.
Where this starts is, when we talk about a single Vault cluster, we're not talking about one node running Vault. Typically, you're running three or five instances of Vault, and internally they're doing a leader election, so one of those instances is active and servicing traffic. If that node dies, the other nodes will detect that failure within 10 to 20 seconds and automatically take over operation. So within one data center, if we lose the instance that's running, we're back up and servicing traffic again within 10 to 20 seconds.
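One way to see which instance is currently active is to poll each node's `/v1/sys/health` endpoint, which answers with a different HTTP status code depending on the node's role. The following is a minimal sketch, assuming Vault's default health status codes; the helper names (`classify_health`, `probe`, `find_active`) and the node addresses in the usage note are hypothetical and only there for illustration.

```python
import urllib.error
import urllib.request

# Default status codes returned by Vault's /v1/sys/health endpoint.
# (Assumption: the defaults have not been overridden with query parameters.)
HEALTH_ROLES = {
    200: "active",
    429: "standby",
    472: "dr-secondary",
    473: "performance-standby",
    501: "uninitialized",
    503: "sealed",
}


def classify_health(status_code):
    """Map a /sys/health status code to a role name, or "unknown"."""
    return HEALTH_ROLES.get(status_code, "unknown")


def probe(addr, timeout=2.0):
    """Return the role of the node at addr, or "unreachable"."""
    url = addr.rstrip("/") + "/v1/sys/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return classify_health(resp.status)
    except urllib.error.HTTPError as err:
        # urllib raises on non-2xx codes, which is how standby nodes answer.
        return classify_health(err.code)
    except (urllib.error.URLError, OSError):
        return "unreachable"


def find_active(roles_by_addr):
    """Return the address of the first node reported as active, or None."""
    for addr, role in roles_by_addr.items():
        if role == "active":
            return addr
    return None
```

In use, something like `find_active({addr: probe(addr) for addr in nodes})` would be run on a short interval. After the active node dies, the call returns `None` until a standby wins the election, and then returns the new active node's address.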
Now, what happens if we zoom out from one data center and talk multi–data center? What if we lose the whole data center that is running Vault? The link goes down, there's a power loss, whatever might occur.
To handle this, Vault has a built-in notion of multi–data center replication. What you do is run one cluster of Vault in DC1 and it has three different instances of Vault to provide HA within that data center, and you run another three instances of Vault in a second data center and they're replicating to one another. In this configuration, if we lose a whole data center—data center one goes down—data center two is able to operate independently. In this situation, there's no service disruption to anything running in DC2.
Now, independent from that, what happens if we experience catastrophic data loss? We're running a Vault cluster, and operator error, disk corruption, or some other unforeseen event leads to total data loss of that cluster. What do we do?
Vault has a third mode built in, which is known as disaster recovery replication. In this mode, you're running your primary Vault cluster, again with multiple nodes for high availability, and you have a second DR site that is acting as a real-time backup. The moment any change is made against the primary site, that change is replicated in real time to our DR site, which is a full mirror.
Now, if we have a catastrophic loss of our primary site, what we can do is either promote the DR site and it takes over active operation, or we can use this as a backup and restore that back into our primary site and bring it back online.
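Promotion of the DR site is an API call made against the DR cluster. The following is a minimal sketch, assuming the `sys/replication/dr/secondary/promote` endpoint and its `dr_operation_token` parameter, and assuming a DR operation token has already been generated. The function name and the address in the usage note are hypothetical. The function only builds the request; nothing is sent until it is passed to `urllib.request.urlopen`.

```python
import json
import urllib.request


def build_dr_promote_request(dr_addr, dr_operation_token):
    """Build (but do not send) the request that promotes a DR secondary.

    dr_addr is the address of the DR cluster; dr_operation_token is a
    DR operation token generated beforehand (assumed to exist already).
    """
    url = dr_addr.rstrip("/") + "/v1/sys/replication/dr/secondary/promote"
    body = json.dumps({"dr_operation_token": dr_operation_token}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )
```

An operator would call `build_dr_promote_request("https://vault-dr.example:8200", token)` and send the result once the primary site is confirmed lost. Keeping the build step separate from the send step makes it easy to review exactly what will be submitted before a change this significant is made.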
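There are these multiple different failure scenarios, where we're talking about either the loss of a single Vault node, the loss of a whole data center, or catastrophic data loss across a cluster.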
There are multiple layers of defense built in, all designed to operate in a production environment with mission-critical SLAs and to recover quickly.