Operationalizing HashiCorp Vault
Aug 04, 2020
This talk is intended to be the foundation on which to build your HashiCorp Vault runbook.
- Nicolas Corrarello, Regional Director, Solutions Engineering, HashiCorp
If you are reading this, you are probably using Vault in one way or another, and looking into how the capability should be run either at scale, or at least according to best practices. This talk is intended to be the foundation on which to build your Vault runbook.
Hello, everyone. It's a very hot afternoon, and I'm locked in my home office. But what better way to cool down than talking about security and Vault?
Let's assume for a second that you are in a stage where either you have 1 application that is currently being secured by Vault and you are looking at supporting other applications, or you have to support multiple teams in operating Vault.
Long story short, if you're anywhere in this line, you might get some value from this talk.
My name is Nico Corrarello, and I'm HashiCorp's regional director for solutions engineering. I manage a team of about 25 brilliant individuals. So if you're using the Consul or the Nomad secrets engines, you're very welcome. If I broke your entropy workflow, I'm not really sorry, because you shouldn't be getting much entropy done at once.
During my career at HashiCorp, I have helped a number of organizations in operationalizing Vault and running these platforms at scale. I've been with HashiCorp for about 3.5 years. I've been using these tools for a long while. I absolutely love interacting with them. You can find me at email@example.com.
Today we're going to be talking about the best practices that we have available for operationalizing really large-scale Vault deployments. Most importantly, this is the "no wire-boarding" approach. This is not the custom integration. This is not how to pull the other tool from this thing. This is how Vault works and was designed to work.
This is fully supported. If you are a customer, you can reach support. If you're not a customer, you can reach support, and some people will know what you're talking about.
It's an approach that tries to not introduce friction. Although we're going to talk about where potentially the friction points are when using Vault at scale.
I'm talking about using Vault as a capability, not as a tool.
If you're using Vault, let's say as part of your pipeline to get a signed SSH key for a system, and you're using it just as a tool, some of these patterns will be helpful, but they are not the kind of patterns that you're going to be looking into.
It's going to be pragmatic in every sense of the word. Where you can do things with open source, I'm going to tell you, "This is the kind of thing you can do with open source," and I'm going to tell you when you're going to need a commercial feature.
I am going to talk about commercial features. Everything that I put in this presentation is documented in our resource page and our documentation, or our learn guides. And it's ultimately HashiCorp-sanctioned.
Secrets: The Basics
Let's start on Day 0. Some of you may have been using Vault for a while, and some of you may have no idea what Vault is. Either way, I always like to agree with you about this: What is your definition of a secret?
The dictionary says: "A secret is a piece of information that is only known by 1 person or very few people and should not be told to others." That definition is behind how we design Vault. The first idea was, before I give you a secret, I need to know who you are. Traditionally, we had other people do that introduction of a system in an environment of complete trust. We wanted to take that out.
The first thing we looked at was, Who is the primary audience for Vault? Well, it's an application, so the interface must be programmatic. It's also the lowest common denominator when it comes to access. Because on top of that API, we can build a web UI. We can build a CLI. But ultimately, applications can only talk through that API.
Some security products have bolted an API onto an existing product and said, "Now it has programmatic access." But the workflows have never really felt right for programmatic access. That is not how Vault evolved. If you have the chance to look at the source code, you will see that there's no way of exposing things in Vault outside that API, either the HTTP REST API or the KMIP API.
The second idea was that before I give you a secret, I need to know who you are.
A Matter of Trust
We know how to assert a human's identity in a complete-trust environment. They're going to have an AD account. They're going to use some sort of federation system like Okta or Ping Identity or Auth0. They're going to use multi-factor.
That's fine. We know how to handle that.
But machines have been a challenge for a lot of organizations.
We looked into what we actually trust. If you're in AWS or GCP, you trust a system called IAM. Azure has its alternative and Pivotal has its alternative and Kubernetes has its alternative in namespaces and service accounts.
But ultimately, we wanted to get to a point where an application that has just been deployed, or a system (it could be an EC2 instance, an Azure virtual machine, or a pod), can go from zero trust to identifying itself with Vault and, ultimately, getting into that trust environment. No human interaction required.
If you're doing these kinds of authentications in Vault, you are using it right. If you're not, you probably should be looking into a couple of things.
In Vault, if you are an EC2 instance, you're going to get a short-lived token. That short-lived token is going to basically have a number of policies associated that normally dictate what you can do, i.e., consume the secret or store the secret.
But also, and this is more on the commercial side, policies look into under what conditions you can do it. For example, you're a user. You can get this secret if 3 other people from that group allowed you to get the secret. Or if you're trying to get this secret during a change window.
The Vault Auditing System
This is a system called Vault. Everything that comes out of Vault needs a very long paper trail.
We do have an audit subsystem in Vault. It will send information to whatever your auditing or SIEM system is.
There is one thing I do want to make very clear: There are certain endpoints in Vault that your audit system should alert on whenever they are accessed. Most of this access will probably be legitimate. But, for example, if someone initiates a key rotation, you want to hear alarm bells ring.
Vault stores secrets. That probably comes as no surprise to any of you. You can store static secrets in the same way you can put a piece of paper with a secret written on it in a physical vault.
But we want to take the human out of "Identify yourself." We also want to take the human out of "Go and generate the secret." You can put a secret in Vault and have Vault rotate that service principal or that database user account. This is very common.
But also Vault can reach out to third-party systems and generate access to a set of accounts based on least privilege. And as we can make these secrets very easy to access, we can also make them very short-lived.
We do this across multiple platforms: across SSH and Active Directory and databases and cloud partners. I'm not going to go into a lot of detail on that.
We also provide a set of cryptographic capabilities where Vault holds your cryptographic material and just exposes an encrypt or decrypt endpoint. Or does format-preserving encryption. Or holds cryptographic material for other systems that, generally through KMIP, go and do transparent data encryption.
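To make the "encrypt endpoint" idea concrete, here is a minimal sketch of how a client might build a request to Vault's transit secrets engine. It only constructs the HTTP request and never sends it. The mount path `transit`, the key name, the address, and the token value are all assumptions for illustration; check the current API documentation before relying on any of it.

```python
import base64
import json
import urllib.request


def build_transit_encrypt_request(vault_addr, token, key_name, plaintext):
    """Build (but do not send) an encrypt request for a transit-style endpoint.

    Assumes the transit engine is mounted at "transit". The plaintext is
    base64-encoded because the request body is JSON.
    """
    body = {"plaintext": base64.b64encode(plaintext).decode("ascii")}
    return urllib.request.Request(
        url=f"{vault_addr}/v1/transit/encrypt/{key_name}",
        data=json.dumps(body).encode("utf-8"),
        headers={"X-Vault-Token": token, "Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical address, token, and key name, for illustration only.
request = build_transit_encrypt_request(
    "https://vault.example.com:8200", "s.example-token", "orders", b"4111-1111"
)
```

The application never sees the key itself. It hands over plaintext and gets ciphertext back, which is the point of keeping the cryptographic material inside Vault.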
Vault Is a Broker for Access
Now that we have this good idea of the functionality of Vault, a key point that I want you to take from this is, Vault is a broker for access. If I present an identity to Vault, based on my identity, Vault is going to give me access to a system. And you can do it for humans, or you can do it programmatically for systems. Most of our users and customers use Vault mostly for programmatic access.
In order to keep these secrets, there is a whole lot of encryption going on. Whatever secret we store or whatever information we store is protected by a storage key. Everything that leaves Vault. And when I'm talking about "leaves Vault," if you're using the Consul backend, it is encrypted by Vault before it goes into the Consul backend, or the integrated storage backend, or whatever backend you may be using.
We have a storage key that protects that information. But, of course, we cannot store that storage key in plaintext. We need to wrap it with something. We wrap it with a master key. That master key is protected using Shamir's Secret Sharing Algorithm.
But in the last couple of years, we saw a lot of operational issues with both customers and users around the management of those keys. So we started introducing this concept of a seal key, a third-party cryptography system that can protect that master key. All of these keys need to be rotated on a cadence, and they require different actors to be involved.
For example, the storage key that protects secrets at rest may just need an operator. But some other keys are going to need more than that.
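The Shamir split mentioned above is easier to picture with a toy version. The sketch below splits an integer into 5 shares and rebuilds it from any 3 of them, using polynomial interpolation over a prime field. It is not Vault's implementation (which, as far as I know, works byte-wise over a different field); the prime, the function names, and the 5-of-3 numbers are assumptions chosen for illustration.

```python
import secrets

# A Mersenne prime used as the field for this toy example (an assumption,
# not what Vault uses).
PRIME = 2**127 - 1


def split_secret(secret, shares, threshold):
    """Split an integer secret into `shares` points; any `threshold` rebuild it."""
    if not 1 < threshold <= shares:
        raise ValueError("threshold must be between 2 and the number of shares")
    if not 0 <= secret < PRIME:
        raise ValueError("secret must fit in the field")
    # Random polynomial of degree threshold-1 whose constant term is the secret.
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]
    points = []
    for x in range(1, shares + 1):
        y = 0
        for c in reversed(coeffs):  # Horner's rule
            y = (y * x + c) % PRIME
        points.append((x, y))
    return points


def combine_shares(points):
    """Recover the secret by Lagrange interpolation evaluated at x = 0."""
    secret = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret


master_key = 123456789
key_shares = split_secret(master_key, shares=5, threshold=3)
recovered = combine_shares(key_shares[:3])
```

Any three key holders can rebuild the key together, while fewer than three learn nothing useful, which is why a quorum of key holders is what guards the master key.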
Vault was designed with the hindsight of cloud adoption. So it's designed to run in a way where it doesn't expect any resiliency to be available from your infrastructure.
For example, you could be running 5 nodes of Vault using the integrated storage backend. And the only thing you need is a load balancer and your external seal. Or you could be running 3 nodes of Vault backed by a Consul cluster of 5 nodes that is doing the storage.
Everything you're going to need here is compute, a load balancer, and your seal key. If 1 node goes down, another takes its place.
I'm going to talk about all these resiliency patterns in a second.
This is how you scale your single site. But at global scale, we can have replicating Vault clusters where effectively certain data is replicated in what looks like an active manner, but Vault is doing the heavy lifting to an RGO.
We're going to talk about that in a second, probably more when we talk about the resiliency patterns.
Vault on Day 1
What does Vault look like in your Day 1? What things do you need to take into consideration?
The first one, when deploying Vault, we prefer that you would deploy Vault immutably. Why? First, because of auditability.
Every time that Vault is deployed, it's back to that original footprint, maybe built using Packer, that you used to deploy it. When you make changes, you can be forced to make changes that go through your whole pipeline and execute all your security tests and so on.
We're trying to avoid people SSHing into the system and trying to poke memory in the system, because Vault keeps some sensitive stuff in memory, and you need to be aware.
Now, of course, a number of organizations are just not ready for this. And this is where configuration management really helps.
If you have a good pattern or a good set of practices in regards to how to manage your systems, that really helps when it comes to keeping Vault in a manner that is compliant, and also avoiding runtime changes in the sense that no one is going to SSH into the system, and potentially poke things out.
SELinux in this case is not just for disabling. We strongly suggest you use it, because if someone gains access to that box, you want to have that extra mandatory enforcement layer on top of it.
Be very aware that Vault is released quarterly, at least major releases, with minors potentially any given month.
These procedures for install and upgrade must be really ironed out. Be aware there are a considerable number of people that have this problem. And then they realize they are 2-3 versions behind. So be very aware of this.
Consul can help with load balancing, in case you don't want to use a physical load balancer or an ALB or whatever. We have guides around that.
If you're running Vault as a central capability, telemetry is one thing you want to look into to give you insights in terms of how your service is operating. And auditability is important.
I mentioned some endpoints, but /sys/policies, /sys/rekey, those are the endpoints that you really want to have alarm bells going off if someone tries to access them, because they are potentially changing the data structure or your security profile in Vault.
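As a rough sketch of that kind of alerting, the function below scans JSON audit lines and flags requests to the two endpoints just mentioned. The field names (`type`, `request.path`, `request.operation`) and the prefix list are assumptions about how your audit device writes entries; in practice this logic would live in your SIEM rather than a script.

```python
import json

# Path prefixes that should set off alarm bells (taken from the talk).
SENSITIVE_PREFIXES = ("sys/policies", "sys/rekey")


def find_sensitive_requests(audit_lines, prefixes=SENSITIVE_PREFIXES):
    """Return (path, operation) for each request entry hitting a sensitive prefix."""
    hits = []
    for line in audit_lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON audit entry
        if not isinstance(entry, dict) or entry.get("type") != "request":
            continue
        request = entry.get("request") or {}
        path = str(request.get("path", "")).lstrip("/")
        if path.startswith(prefixes):
            hits.append((path, request.get("operation")))
    return hits


# Hypothetical audit lines, for illustration only.
sample_lines = [
    '{"type": "request", "request": {"path": "secret/data/app", "operation": "read"}}',
    '{"type": "request", "request": {"path": "sys/rekey/init", "operation": "update"}}',
    '{"type": "response", "request": {"path": "sys/rekey/init", "operation": "update"}}',
    '{"type": "request", "request": {"path": "sys/policies/acl/admin", "operation": "update"}}',
    "not json at all",
]
alerts = find_sensitive_requests(sample_lines)
```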
Vault is a very runtime-y thing. The objective is always to try to keep it running, because imagine for a second you're using dynamic secrets and you have all these credentials for these instances of an application that you know are expired, or will expire in 10 minutes. In 20 minutes, your Vault crashes, for example, and you go and do a restore with information from an hour ago.
What do you think Vault is trying to do? It's going to just try to do revoke operations on secrets that it has already revoked.
So, backup and restore? Yes, absolutely. Do it. It's probably part of your organizational best practices. From an operational perspective, you will hear that we sometimes talk about a DR cluster. Traditionally, what you would do is you would fail over to DR. It's much easier. And you will already have 1 cluster ready to go.
Now, that 1 cluster is going to come into consideration when it comes to certain recovery scenarios. Because Vault is designed to not need any resiliency from the platform. No shared storage. Just a load balancer to direct the traffic.
If an individual node fails, guess what's going to happen. If you go and stop a node, another cluster member will just take its place. No extra configuration required.
If a node actually fails, that fail-over may take exactly 45 seconds, because it's guided by some timeouts. But the recovery is happening automatically. You don't need any operational overhead with that.
If your whole cluster fails, and of course, you're using a secondary storage backend, you can just provision another cluster, or you can just restore your backup. It's a resiliency strategy.
If you're using Consul and a Consul node fails, pretty much the same as Vault. Nothing will really happen. Consul has this concept of a leader node. Even if you lose that leader, things will keep operating just fine, and a new leader will be assigned.
There is a reason why I was using odd numbers when talking about nodes and clusters, and it's because you always need to maintain a consensus, a quorum of systems, for the system to operate.
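The arithmetic behind the odd numbers is short enough to write down. This is a minimal sketch of majority quorum as used by consensus protocols in general; the function names are made up for illustration.

```python
def quorum_size(nodes):
    """Smallest majority of a cluster with `nodes` voting members."""
    if nodes < 1:
        raise ValueError("a cluster needs at least one node")
    return nodes // 2 + 1


def failure_tolerance(nodes):
    """How many nodes can be lost while a majority still remains."""
    return nodes - quorum_size(nodes)


# Compare a few cluster sizes: (nodes, quorum, failures tolerated).
table = [(n, quorum_size(n), failure_tolerance(n)) for n in (3, 4, 5, 6)]
```

Going from 3 to 4 nodes raises the quorum from 2 to 3 but still tolerates only 1 failure, so the extra even node buys no resiliency. That is why the examples use 3 and 5.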
If you lose that quorum, you may be tempted on the Consul side to attempt a recovery. Truth is, from an RTO/RPO perspective, to get your Vault up and running, it might be much faster to restore from a backup. Just take that into consideration.
If you're using an external seal, which most of you should be using, be very aware that if a seal key is compromised or fails or is held ransom, it is game over, because this is part of the promises we make to users when they are using Vault.
If I enabled you to recover from that seal key failure, that means that my cryptographic structure is not as strong as it should be. The reason this is running in banks and health companies and heavily regulated industries is because of this.
These are the trade-offs, where you look at Vault and you say, "Wow, this is absolutely a thing. From an operational perspective, this is really smooth." People forget this is a security product. Sometimes there are certain things where we need to introduce friction, and the seal key failure is one of them.
The recovery scenario for this would traditionally be to fail over to a DR cluster that has a different seal key. If you're using open source, you would have to find a way to encrypt that key, pulling it into a system through envelope encryption or something like that, in order to be able to recover from that. But this is one scenario that you need to be really aware of when it comes to recovery.
Those keys I was talking about (the storage key, the master key, the seal key) traditionally don't exist on Day 0. They are provisioned by Vault, and this should be carried out in a ceremony.
We always talk about that vault init command, and you may have read the documentation. It might be slightly misleading, because it goes through the steps, but it doesn't really go through the ceremony.
When you start the ceremony, you have this Vault cluster that is installed, running, but uninitialized. You're going to have a couple of personas involved here. You're going to have a number of key holders, by default it's 5, which are designated "operators."
And remember, these key holders are going to guarantee that cryptographic trust of Vault. Why? Because a quorum or a consensus of them holds the keys to the kingdom.
These key holders are going to have for this ceremony a set of GPG keys. You're also going to have an operator that is also going to have a GPG key.
The way you should start this initialization process is by the key holders giving their public GPG keys, using envelope encryption, to the operator. And the operator sending those public GPG keys, along with its own, to Vault.
Vault will then start the initialization process and return to that operator a set of GPG-encrypted recovery keys, or unseal keys if you're using Shamir, which are going to be passed back to the key holders, and each one can only be decrypted by its key holder. The operator is also going to get a GPG-encrypted root token.
This is very important. This root token should only be used for a small subset of things. It should be used to set up basic authentication, basic policy, and basic audit in Vault. Once that is set up, that operator needs to revoke that token.
Because from there on, you can log in using LDAP or GitHub or whatever, and use the policy system. That is the only time where the key holders can leave the room, when they're sure that there is nothing that is overriding the policy system.
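A sketch of what the operator ends up sending in that ceremony may help. The function below assembles the initialization payload: one public key per key holder, plus the operator's own key for the root token. The field names follow Vault's init API as I understand it, and the threshold of 3 and the placeholder key strings are assumptions; verify against the current documentation before using this in a real ceremony.

```python
def build_init_payload(key_holder_pgp_keys, operator_pgp_key, threshold, auto_seal=False):
    """Assemble an initialization payload for the ceremony described above.

    key_holder_pgp_keys: one base64-encoded public key per key holder.
    operator_pgp_key: the operator's public key, used to encrypt the root token.
    auto_seal: when an external seal is used, the shares are recovery keys.
    """
    shares = len(key_holder_pgp_keys)
    if shares == 0:
        raise ValueError("at least one key holder is required")
    if not 1 <= threshold <= shares:
        raise ValueError("threshold must be between 1 and the number of key holders")
    prefix = "recovery" if auto_seal else "secret"
    keys_field = "recovery_pgp_keys" if auto_seal else "pgp_keys"
    return {
        f"{prefix}_shares": shares,
        f"{prefix}_threshold": threshold,
        keys_field: list(key_holder_pgp_keys),
        "root_token_pgp_key": operator_pgp_key,
    }


# Placeholder key material, for illustration only.
holders = ["holder-1-pubkey", "holder-2-pubkey", "holder-3-pubkey",
           "holder-4-pubkey", "holder-5-pubkey"]
shamir_payload = build_init_payload(holders, "operator-pubkey", threshold=3)
auto_seal_payload = build_init_payload(holders, "operator-pubkey", threshold=3, auto_seal=True)
```

Because every returned share is encrypted to a specific key holder, the operator who runs the command never sees a usable unseal or recovery key.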
I talked about a number of operational roles involved here. But let's talk a little bit about roles in the organization.
There is generally a team that runs a Vault capability, kind of an operations team, and that is generally people running services, like if you have an AWS landing zone in your shared services account, or SREs, or people running things like your version control system, or your CI, part of the development tooling.
These are generally the profile of the individuals that run Vault, although some security teams run it for others. There are consumers in the organization, which are ultimately everyone that needs a secret, with different levels of skill.
There are infosec or cryptography people that generally handle that key rotation, for example. They review the architecture and they give guidance.
And there is always an audit team. That is the one that is exposed. They're looking at the audit logs from Vault. And potentially, these are the ones that are going to come back to you, if you initiate a rekey, and ask, "Why is it being done?" or probably, "Why is this being done out of cadence?"
Vault has a pluggable interface. So in terms of the authentication or secrets, you can just build your own to integrate your own system or a third-party system that we have not taken into consideration. Every member of the cluster must have the same binaries for plugins.
Those 3 elements live outside Vault, so you can develop them on their own. They communicate through a gRPC interface. These are standalone. If a plugin crashes, it doesn't affect your existing Vault. This is a very lean layer of software. My talk from last year covers how to write secrets engines.
Vault on Day 2
Let's assume you have followed all these practices on Vault setup. The way you consume secrets, you authenticate using an identity that you trust. It could be a GitHub token used for deployments. It could be AppRole. It could be AWS. But ultimately, you let that workflow authenticate and establish your identity before they consume secrets.
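As one example of that "authenticate first" step, here is a minimal sketch of an AppRole login. It builds the login request and pulls the client token out of a response body, without making any network call. The mount path `approle`, the address, and the credential values are assumptions for illustration.

```python
import json
import urllib.request


def build_approle_login_request(vault_addr, role_id, secret_id):
    """Build (but do not send) an AppRole login request.

    Assumes the AppRole auth method is mounted at "approle".
    """
    body = {"role_id": role_id, "secret_id": secret_id}
    return urllib.request.Request(
        url=f"{vault_addr}/v1/auth/approle/login",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_client_token(response_body):
    """Pull the short-lived token out of a login response body."""
    return json.loads(response_body)["auth"]["client_token"]


# Hypothetical values and a trimmed-down sample response, for illustration only.
login = build_approle_login_request("https://vault.example.com:8200", "role-123", "secret-456")
sample_response = '{"auth": {"client_token": "s.example", "lease_duration": 1200}}'
token = extract_client_token(sample_response)
```

The token that comes back is what the workload presents on every later request, and the policies attached to it decide which secrets it can consume.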
And there are a number of ways to consume it. You can just talk to the Vault API and retrieve a secret at runtime in the application. Or you can, without touching your application, have a Vault agent that templates that information. You can pass it through environment variables, or you can do some configuration management or a third-party passthrough if you just can't change this application and you need to get a secret to it.
You have a system request it from Vault and pass it along. But if you do that, it needs to be wrapped using response wrapping. In that way, when it gets to the workload, that workload will try to unwrap it, and that unwrapping can only happen once. So if the workload can't unwrap the secret, you know that the secret was intercepted.
If you're using Kubernetes, one of the most popular methods I've seen in the past 2 months or so is using our agent injector as a sidecar. It uses a mutating webhook in Kubernetes to create a file mounted to all other containers in the pod.
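The single-use property is the whole trick, so here is a toy model of it. This is not how Vault implements response wrapping (the real feature also has a TTL and goes through the wrapping endpoints of the API); the class and method names are made up for illustration.

```python
import secrets


class SingleUseWrapper:
    """Toy model of response wrapping: each wrapping token unwraps exactly once."""

    def __init__(self):
        self._wrapped = {}

    def wrap(self, secret):
        """Store the secret and hand back an opaque wrapping token."""
        token = secrets.token_urlsafe(16)
        self._wrapped[token] = secret
        return token

    def unwrap(self, token):
        """Return the secret and destroy the token; a second attempt fails."""
        try:
            return self._wrapped.pop(token)
        except KeyError:
            raise PermissionError("wrapping token is invalid or already used") from None


wrapper = SingleUseWrapper()
wrapping_token = wrapper.wrap("db-password")       # the intermediary only sees this token
first_read = wrapper.unwrap(wrapping_token)         # the workload unwraps it
try:
    wrapper.unwrap(wrapping_token)                   # anyone arriving second is refused
    second_read_failed = False
except PermissionError:
    second_read_failed = True
```

If the legitimate workload is the one that gets refused, someone else got there first, and that failure is your signal to investigate and rotate the secret.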
And as I said, mutant. This has been super-impressive. I've worked with a couple of organizations that have loved it.
In terms of how to operate this at scale, what traditionally happens when you have 1 Vault running is you have a central team with a ticket queue that is just getting requests from people. Like, "I need to mount this." "I need to authenticate with that." "I have to coordinate clusters that I authenticate." "I need this custom policy."
When it comes to system coordination, it's both slowing down developers and driving that central team crazy. This is where we really build commercial features to try to help with that.
The way you should be running this is through a namespace capability, where you have a team that handles certain operational aspects. They install and operate key rotation, the backup and restore, and so on. And from there on, you can simply shift those capabilities to the left.
How to Keep Vault Running?
You need to rotate all those keys, maybe each on a separate cadence. The storage key is easy; you just run vault operator rotate. It can be done by a privileged individual. Maybe you want to do this every quarter or so.
That master key rotation procedure is similar to the initialization. The operator will start by sending the new GPG public keys. It could be the same. It could be new.
And then that operator will receive a parameter from Vault, a nonce, that will be passed on so the key holders can authenticate that this is a valid operation. Those key holders will send their shard and the nonce to Vault, and a new set of shards will be passed to the key holders. They're generally retrieved by an operator with a backup token.
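To show why the key holders check that parameter, here is a toy model of a rekey attempt. In Vault's rekey API, as far as I know, the parameter is a nonce returned when the rekey is started. The class below only models the bookkeeping: it rejects submissions carrying the wrong nonce and reports when enough distinct shards have arrived. It does not verify shards or generate new ones, and every name in it is hypothetical.

```python
import secrets


class RekeyAttempt:
    """Toy model of a rekey in progress, identified by a nonce."""

    def __init__(self, threshold):
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self.threshold = threshold
        self.nonce = secrets.token_hex(16)  # handed to the operator at the start
        self._shards = set()

    def submit(self, shard, nonce):
        """Accept one key holder's shard; return True once the threshold is met."""
        if not secrets.compare_digest(nonce, self.nonce):
            raise ValueError("nonce does not match the rekey attempt in progress")
        self._shards.add(shard)
        return len(self._shards) >= self.threshold


attempt = RekeyAttempt(threshold=3)
progress = [attempt.submit(shard, attempt.nonce) for shard in ("shard-a", "shard-b", "shard-c")]
```

A key holder who is handed a nonce they do not recognize knows they are being asked to contribute to an operation nobody announced, which is exactly the situation the ceremony is meant to catch.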
If you're promoting a DR, you need to run a similar procedure where you create a DR operation token.
And as usual, that operator is going to send a GPG key in order to have that DR operation token encrypted, and the key holders are going to run a similar procedure where they send that GPG key. They're going to get a nonce. And they're once again going to send their shards of the key in order to get that DR operation token that is GPG-encrypted.
Remember, when updating Vault, immutable is always preferred. Try to keep it running, and try to always redeploy it from scratch. It helps with patching. It helps with everything.
I don't think I need to preach to the choir here when it comes to doing immutable infrastructure. We generally recommend doing a blue/green deployment pattern, like provision a new Vault and destroy your old one once everything's settled.
Remember that new major releases are done quarterly. Check the changelog. We have upgrade guides on the documentation site.
Remember that minor ones come out monthly. And the plugin updates are handled independently. And every member of your clusters needs to have the same version of a plugin.
Vault is a really safe product. We put a lot of care and consideration into how we design this. And we try to strike the perfect balance between practicality and security, though no one is perfect.
As such, remember, Vault is a security product. So there are going to be things that are going to seem like they introduce friction. They are most likely not designed to be automated.
If you're in doubt, please reach out to us. We are happy to educate you on that.
Some of these procedures that I showed you, make them part of your runbook. If you're running Vault, start looking into, How do I rotate keys? How do I initialize Vault correctly? And so on.
Vault on its own is fantastic, but it's not magic. It's people and processes that make Vault secure. And in this, you do not want to compromise, because Vault potentially is holding the keys to your castle.
And when it comes to security, whether you're paying for the software or not, believe me, we are going to want to help you on this. So ask for help if you have doubts.
This presentation is current as of the 24th of June. So please read the docs and ensure that what you're doing is current.
With that, I would like to thank you all.