End-to-End Automation for Vault on Kubernetes Using the Operator Pattern
Nov 23, 2020
Hear about ASAPP's method for running Vault on Kubernetes using the Operator pattern so SREs don't even have to set things up for service owners.
ASAPP runs Vault on Kubernetes using the Operator pattern to enable service owners to integrate with Vault and manage their own secrets without requiring any SRE intervention. In this talk, we'll talk about our journey from running Vault on VMs to running on Kubernetes and discuss the advantages of running Vault on Kubernetes.
Hi, everybody. Thank you for choosing this talk. If you are listening to this — or if you're watching this — it's probably because either the words Vault, Kubernetes, or the Operator Pattern caught your eye.
Vault and Kubernetes are not easy to manage, as I'm going to show you, or at least they weren't in our use case. Hopefully, you can learn from the lessons that we learned on the journey that we took to automate Vault on Kubernetes end-to-end.
Where We Were
First, let me tell you where we started, where we were, and the challenges we were facing. We started using Kubernetes about three years ago and Vault shortly after that. Because secret management in Kubernetes wasn't that well-crafted, both in the platform and in our adoption itself, Vault was a no-brainer there.
Back then, Kubernetes was 1.9, and Vault was 0.8 or 0.9, I believe. The recommendation was to run Vault as isolated as possible. In fact, the tooling for running it in Kubernetes was pretty much non-existent. Following the recommendation from back then, we were running it on EC2 VMs.
We're an AWS shop. You can see there's a VPC, there's the network boundary, and we have our Kubernetes cluster there talking to Vault. The production workloads are running with ServiceAccount identities. The Kubernetes auth backend had just been released for Vault. It was perfect timing because then we could automatically use that authentication scheme.
You see that there's a square there with SRE. That means the SRE team was maintaining both the Vault configuration as well as the static secrets. That might sound scary, but I'll touch on that in a minute.
The Vault configuration was split into two, and we were maintaining both with Terraform, by the way. One part was the Vault cluster itself: the cloud resources that form the Vault cluster and the backend configuration like roles, policies, options, etc. The other was the static secrets, which were managed separately. I'll explain why in a minute.
Per-Service Configuration and Abstraction
We had abstracted out the per-service configuration enough that, for any new service that needed to be authenticated with Vault, all the service owners had to do was add a single line to this list. This would be passed to a Terraform module that would create everything the service required: policy permissions, roles, etc.
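A minimal sketch of that abstraction, written in Python rather than Terraform. The secret path layout, policy text, and every name below are assumptions for illustration, not the actual module:

```python
from dataclasses import dataclass


@dataclass
class ServiceVaultConfig:
    """Everything generated for one service (hypothetical shape)."""
    role_name: str
    policy_name: str
    policy_hcl: str
    bound_service_account: str


def expand_service(name: str, namespace: str = "default") -> ServiceVaultConfig:
    # Assumed convention: each service may read secrets under secret/<name>/*.
    policy = f'path "secret/{name}/*" {{\n  capabilities = ["read"]\n}}\n'
    return ServiceVaultConfig(
        role_name=name,
        policy_name=f"{name}-read",
        policy_hcl=policy,
        bound_service_account=f"{namespace}/{name}",
    )


# The "single line in this list": onboarding a service is one new entry.
SERVICES = ["billing", "chat-router"]

configs = {svc: expand_service(svc) for svc in SERVICES}
```

The point of the design is that service owners only touch the list; the role, policy, and binding are derived from the name.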
The static secrets were a special use case. The specific situation was that back then, the Vault UI was Enterprise only. We were using open source, and the KV backend for static secrets was only at version 1.
The problem was that modifying secrets, either appending keys or modifying existing secrets, was a hassle. First, you had to do it on the command line: export the secret to a JSON file, modify it, and re-inject it. It's error-prone and very manual. We ended up with this pattern of encrypting plaintext secrets offline with KMS, then storing that KMS-encrypted string in source control.
In the sample you're seeing there, the string is not the actual secret (the actual key) but rather the KMS-encrypted version of it. We put this in Terraform, where we had a data source object that would decrypt that on the fly and then a vault_generic_secret resource to inject that secret into Vault. Obviously, there are risks in doing this, but this was mitigated by using the Terraform in-memory backend.
It's not that well documented, but you can use this in-memory backend with an empty block. The backend fully exists, but only in memory and only for the duration of the run. There is no remote state where secrets are persisted in plaintext.
What Are We Trying to Solve?
What I explained might sound like we had everything under control. So, what were we trying to solve? It may not be clear.
Multiple Microservices and Single-Tenant Deployments Create High Cardinality
First, the picture that I showed earlier is oversimplified. It was showing a single Kubernetes cluster, a single Vault cluster, and a single network boundary. But the reality is we had a deployment model of multiple single-tenant Kubernetes clusters, each with its corresponding Vault cluster, and the configuration of all of them was pretty much identical.
On top of that, we were having an explosion of microservices. Modifying things didn't mean modifying a single microservice in a single environment; it meant doing it in multiple places, multiple times, for both configuration and secrets. And it was getting out of control.
That usually wouldn't be a problem if we had some form of CI/CD for these changes, but we had none. Everything needed to be applied manually by a human, and the SRE team was a bottleneck. Only a few of us could do this, and it was getting to the point where it couldn't really scale.
The Code Was Collectively Owned, but the Resources Weren’t
There was also a false sense of self-service-ness. Service owners could add that line to indicate that they wanted the service to be authenticated and configured for Vault. The reality is that's where the interface stopped for service owners. The PR approvals, the merges, and the applies had to be done by the SRE team. That team had permissions for the Terraform backend, permissions to Vault, and permissions to decrypt with KMS. It was too coupled with the SRE team to be made into actual self-service.
Terraform Only Provides Point-in-Time Configuration
It's accurate at the time you apply it, when you do the terraform apply run. But we didn't have good control between apply runs. Even in the presence of CI/CD, that still didn't give us the continuous and automated control that we wanted to have.
Who Am I?
Before I go on, let me introduce myself briefly. My name is Pato. I'm a lead SRE engineer for a company called ASAPP. We're based in New York City in the World Trade Center — or at least we were up until six months ago.
We build customer interaction platforms. Imagine when you're trying to call your bank or your ISP or an airline or something: we want to transform that call into a digital interaction and move it from one-on-one calls to a multichannel digital platform. Then we throw in our special AI sauce. It means Augmented Intelligence. We're going to augment the human behind the call and not replace them with a bot.
Also, I was known around the office as the unofficial HashiCorp ambassador until earlier this year, when HashiCorp launched the official ambassador program and I was nominated and selected. I can cross off those first two letters in the title.
The Kubernetes Operator Pattern
You don't need to know a lot about Kubernetes, other than the fact that it's a container orchestration platform. And certainly, we didn't know a lot about Kubernetes when we started doing this. But there are two key aspects of Kubernetes that we discovered and explored, and that guided the path that we took.
One of them is the Operator Pattern. Kubernetes itself is like one big controller running in a loop, reconciling stated or declared objects into real objects like pods and containers and things like that. But on top of that, you can run your own layer of custom controllers that reconcile custom resource definitions. This is another Kubernetes concept. They are arbitrary data structures that define the attributes and the state of things you want to represent, whether things in your business domain, real-world objects, external APIs, etc.
It's not limited to Kubernetes API. The controller doesn't need to act only on Kubernetes objects. It certainly talks to the Kubernetes API, but it can integrate with external APIs. In our case, for example, we integrated with the AWS API, as well as a Vault backend.
One of the major advantages here over the Terraform setup that we had is that this reduces drift. It continuously ensures that the state is running as defined. An operator is more than a glorified cron job running terraform apply: it ties those APIs together and runs in a loop that acts at a regular interval, or on every event, to ensure that things are as you defined them.
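The reconcile idea can be sketched in a few lines of Python. This is a toy model under stated assumptions (plain dictionaries stand in for declared objects and for the real backend), not how any particular operator is implemented:

```python
from typing import Dict, List, Tuple


def reconcile(desired: Dict[str, dict], actual: Dict[str, dict]) -> List[Tuple[str, str]]:
    """Bring `actual` in line with `desired` and return the actions taken."""
    actions = []
    for name, spec in desired.items():
        if name not in actual:
            actual[name] = dict(spec)
            actions.append(("create", name))
        elif actual[name] != spec:
            actual[name] = dict(spec)  # someone changed it out of band: fix the drift
            actions.append(("update", name))
    for name in list(actual):
        if name not in desired:
            del actual[name]  # exists in the backend but is no longer declared
            actions.append(("delete", name))
    return actions


desired = {"role/billing": {"ttl": "1h"}}
actual = {"role/billing": {"ttl": "24h"}, "role/stale": {"ttl": "1h"}}

first = reconcile(desired, actual)   # corrects the drift
second = reconcile(desired, actual)  # already converged, nothing to do
```

A controller calls something like `reconcile` on a timer and on every change event, which is what closes the gap between apply runs.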
The other aspect of Kubernetes is the admission controllers. They are webhooks that intercept calls to the Kubernetes API, whether from humans running kubectl apply, from machines such as Jenkins or Spinnaker, or from other controllers talking to the Kubernetes API. These webhooks intercept the requests.
There are two types. One is the validating webhook, which inspects the payload and either rejects the request or passes it on. The other is the mutating webhook, which inspects the payload and can modify it before it is passed on and persisted in the Kubernetes storage, etcd or whatever you have.
The main use case for the validating webhooks is to enforce rules or restrictions that go beyond RBAC — whether a user or role has permission to do this or not. For the mutating webhooks, you can sanitize input or modify objects with things known at runtime — but not previously known.
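As an illustration of the validating side, here is a minimal Python sketch of the decision a validating webhook makes. The response follows the `admission.k8s.io/v1` AdmissionReview shape; the rule itself (requiring an `owner` label) is a hypothetical example, not one from the talk:

```python
def validate(review: dict) -> dict:
    """Build an AdmissionReview response for a validating webhook."""
    request = review["request"]
    labels = request["object"].get("metadata", {}).get("labels", {})
    allowed = "owner" in labels  # hypothetical rule, for illustration only
    response = {"uid": request["uid"], "allowed": allowed}
    if not allowed:
        # The message is shown to whoever made the API call.
        response["status"] = {"message": "metadata.labels.owner is required"}
    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": response,
    }


review = {"request": {"uid": "abc-123", "object": {"metadata": {"labels": {}}}}}
rejected = validate(review)
```

In a real webhook this function would sit behind an HTTPS endpoint registered with the API server; only the decision logic is shown here.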
Having said what I said about Vault, Kubernetes, and our usage of it — the struggles that we were having and these key aspects of Kubernetes — it’s obvious the solution was to bring it together and have them work in harmony.
Vault in Kubernetes
First of all, we wanted to leverage Kubernetes itself as the automation platform. We still have the pipelines for deploying the things that we want to define. But then we want to have Kubernetes modify the things as late as possible so they don't have to be previously known or defined, and they can be injected by Kubernetes at runtime or at scheduling time.
Our services were already using Kubernetes authentication, so pulling Vault into the cluster didn't add a lot of benefit there. But it removed a lot of the risk and operational complexity of having to maintain an additional set of EC2 instances. More importantly, we wanted to push ownership of configuration and secret management to service teams. We wanted to remove humans, and the SRE team in particular, completely from the picture so service teams can do this by themselves and not have to depend on an external team.
So, if I paint a picture of what we were looking at and what we ended up with: First of all, you see that square represents Kubernetes. Now everything runs within Kubernetes — that looks the same. We still have the workloads, the pods running with ServiceAccount identities — using that to authenticate to Vault.
This is where things started to change. We now have a Vault Operator. That is the one that is in charge of creating the Vault resources internally: the pods and services and all that. We have another operator that is in charge of discovering identities that need to be configured for Vault authentication and reconfiguring Vault for that. We have the mutating webhooks that modify workloads so that they can discover Vault, or know what Vault to talk to. At that point, we have almost a full circle: something to create Vault, something to configure it, something to discover it. But now we need the meat of the pie: the secrets that the services are going to consume.
We're going to explore them one at a time. We have the Vault creation, authentication configuration, Vault discovery, as well as the secrets to consume. We start with the upper left.
For the Vault Operator, we were using an open source operator — maintained by a company called Banzai Cloud. It's called Bank-Vaults. This essentially installs a custom resource in your cluster that's called Vault — it’s a new kind. You can define the state and the configuration of your Vault cluster in that object.
That single object replaces the Terraform that we had for the cloud resources — and the configuration — with a single object that will have everything. On top of that, it has a programmatic interface that can be modified by other things.
Why Not use the Official HashiCorp Helm Chart?
We evaluated it, and it was good for day zero, and maybe day one with some additional care. It only provided the baseline for creating the base infrastructure. And we weren't really looking just to have that; we were looking to add automation on top of that. We already had a solution for creating Vault clusters. Granted, it was outside of Kubernetes, and that wasn't what we were looking for, but we were looking to not just do that but programmatically augment it.
In our case, it certainly wasn't good for day two-plus, because the Helm chart doesn't really help you in defining roles, policies, auth options, mounts, etc. It wasn't what we wanted to do.
Vault Dynamic Configuration Operator
Our production workloads use ServiceAccount objects: they use that identity to authenticate with Vault and exchange the JWT for a Vault token.
The operator discovers ServiceAccounts that need to be configured and adds that configuration in. The key here is that the operator is not modifying Vault directly but is modifying the Vault object that was defined earlier. That's the central operation point.
Then the Vault Operator takes those changes and applies them to the backend. The behavior at runtime doesn't change because, from the point of view of the service, it's still using the same identity, still talking to the same Vault, and using the same flow to authenticate.
If you were declaring your ServiceAccount with the first four lines, you'd need to add this one annotation that says, "I want you to auto-configure me." Now the Vault backend will have the roles, the policies, and the configuration for any service using this ServiceAccount to be able to talk to it. That's it; no human needs to be involved because everything happens programmatically behind the scenes.
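A rough Python sketch of that flow: find the annotated ServiceAccounts and add a role for each to the Vault object, rather than calling Vault directly. The annotation key is hypothetical, and the Vault object is reduced to a simplified shape assumed for illustration:

```python
import copy

AUTO_CONFIGURE = "vault.example.com/auto-configure"  # hypothetical annotation key


def wants_vault(service_account: dict) -> bool:
    annotations = service_account["metadata"].get("annotations", {})
    return annotations.get(AUTO_CONFIGURE) == "true"


def add_roles(vault_object: dict, service_accounts: list) -> dict:
    """Return a copy of the Vault object with one role per annotated ServiceAccount."""
    updated = copy.deepcopy(vault_object)
    roles = updated["spec"]["externalConfig"]["auth"][0]["roles"]
    existing = {role["name"] for role in roles}
    for sa in filter(wants_vault, service_accounts):
        name = sa["metadata"]["name"]
        if name in existing:
            continue  # already configured, so reconciling again changes nothing
        roles.append({
            "name": name,
            "bound_service_account_names": [name],
            "bound_service_account_namespaces": [sa["metadata"].get("namespace", "default")],
            "policies": [name],
        })
        existing.add(name)
    return updated


vault = {"kind": "Vault", "spec": {"externalConfig": {"auth": [{"type": "kubernetes", "roles": []}]}}}
accounts = [
    {"metadata": {"name": "billing", "namespace": "prod", "annotations": {AUTO_CONFIGURE: "true"}}},
    {"metadata": {"name": "internal-tool", "namespace": "prod"}},  # not annotated: ignored
]
updated = add_roles(vault, accounts)
```

Writing to the declared object instead of the backend keeps a single source of truth: the Vault Operator remains the only component that talks to Vault.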
The idea of the webhooks was to reduce complexity for service owners by reducing the surface area of things they needed to know or modify. We wanted to avoid having to copy-paste things. Maybe they copied from previous integrations that are not really compatible anymore, or needed to copy a sidecar definition. We wanted to remove that.
On top of that, the sidecar itself abstracts out what I call the authentication dance. Instead of a service having to know the flow for using the Kubernetes-specific authentication endpoint, it can go straight to requesting secrets; it doesn't need to know about the rest.
Because we know sidecars are controversial, and we can talk about that separately, we made this optional. In the long run, we hope it will converge so that the easy thing is also the right thing to do. But we also wanted to give teams that want more control, and want to be more explicit about what they're doing, the option to do that. You can own your own Vault-related configuration and do that copy-paste and sidecar definition. Or you can declare your intention with annotations, and the webhook will take over.
Vault Agent Auto-Inject
It looks like this in real life — for example — I have my deployment definition, and I want to inject a sidecar there. I don't want to define a sidecar. I want the webhook to do it for me.
I add that annotation that says, "I want you to inject a sidecar into this deployment definition." When that pod gets scheduled, the webhook is going to intercept that call and inject that Vault agent sidecar. That sidecar definition is not declared explicitly in the deployment manifest. In fact, if you do something like kubectl describe deployment, you won't see it. But you'll see it if you do kubectl describe pod, because it was injected just in time, right before the pod got scheduled.
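The injection step can be sketched in Python as a mutating webhook response carrying a JSON Patch. The response fields (`patchType`, base64-encoded `patch`) follow the `admission.k8s.io/v1` AdmissionReview format; the annotation key and the sidecar image are placeholders, not the ones used by the actual webhook:

```python
import base64
import json

INJECT = "vault.example.com/inject-agent"  # hypothetical annotation key


def mutate(review: dict) -> dict:
    """Append a sidecar container to pods that ask for it via annotation."""
    request = review["request"]
    pod = request["object"]
    annotations = pod.get("metadata", {}).get("annotations", {})
    response = {"uid": request["uid"], "allowed": True}
    if annotations.get(INJECT) == "true":
        sidecar = {"name": "vault-agent", "image": "vault-agent:example"}  # placeholder image
        # "/-" appends to the end of the existing containers array.
        patch = [{"op": "add", "path": "/spec/containers/-", "value": sidecar}]
        response["patchType"] = "JSONPatch"
        response["patch"] = base64.b64encode(json.dumps(patch).encode()).decode()
    return {"apiVersion": "admission.k8s.io/v1", "kind": "AdmissionReview", "response": response}


pod_review = {"request": {"uid": "u1", "object": {
    "metadata": {"annotations": {INJECT: "true"}},
    "spec": {"containers": [{"name": "app", "image": "app:1.0"}]},
}}}
result = mutate(pod_review)
```

Because the patch is applied to the pod at admission time, the Deployment object is never modified, which is why the sidecar shows up on the pod but not on the deployment.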
Services that are doing direct integration with Vault, and are not going through the sidecar, can also request to have the environment variables explicitly defined. There's another annotation there that says, "I want you to configure this container specifically to inject the values that will allow it to talk to Vault directly."
I need to define the variables that I want filled in. Then at runtime again, when that pod gets scheduled, the mutating webhook will inject the Vault address and the Vault CA cert; these are just examples. It won't just set the value; it will also mount the volume and discover the CA certificate that it needs. All these values are known by the webhook beforehand because it's pre-configured, and that's where we abstract all the knowledge. It is concentrated there; it doesn't need to be distributed to humans. Again, almost full circle. But now we need the thing that services will consume.
KMS Vault Operator
We have this KMS Vault Operator. It's the missing link in the flow that I was describing earlier. It follows a similar pattern to what we had before. Now there's a new object kind called KMSVaultSecret. It's a Kubernetes custom resource, like the ones described previously.
It follows the same flow: we encrypt secrets offline with KMS and store the result in source control. It becomes sort of secrets as code because now it's a Kubernetes object in YAML. You can embed it in your Helm charts or your pipelines, however you do it. The operator discovers those objects, reads the KMS-encrypted string, decrypts it in memory, and injects it into Vault.
In practice, it looks very similar to the Terraform pattern we had before. You have the new object kind called KMSVaultSecret. You have the path where your secret is going to be created. You have the KV version (v1, for example) and then your keys and your encrypted secrets. Again, nothing leaves memory in plaintext. We come back to this picture: creating Vault, configuring Vault, mutating workloads, and injecting secrets. None of this requires a human, other than, obviously, the initial seed and creation.
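A minimal Python sketch of the decrypt-and-inject step. The field names inside `spec` are assumptions based on the description above, and the decrypt and write functions are stand-ins so the example runs without AWS or Vault; a real operator would call KMS and the Vault API there:

```python
import base64
from typing import Callable, Dict


def reconcile_secret(obj: dict,
                     decrypt: Callable[[str], str],
                     write: Callable[[str, Dict[str, str]], None]) -> None:
    """Decrypt every key of a KMSVaultSecret-like object and write it to one path."""
    spec = obj["spec"]
    # Plaintext lives only in this local dict, never on disk or in state files.
    plaintext = {item["key"]: decrypt(item["encryptedSecret"]) for item in spec["secrets"]}
    write(spec["path"], plaintext)


def fake_decrypt(ciphertext: str) -> str:
    # Stand-in for a KMS decrypt call. Base64 is NOT encryption.
    return base64.b64decode(ciphertext).decode()


fake_vault: Dict[str, Dict[str, str]] = {}


def fake_write(path: str, data: Dict[str, str]) -> None:
    # Stand-in for a write to the Vault KV backend.
    fake_vault[path] = data


example = {
    "kind": "KMSVaultSecret",
    "spec": {
        "path": "secret/billing/api",
        "kvVersion": "v1",
        "secrets": [{"key": "token", "encryptedSecret": base64.b64encode(b"s3cr3t").decode()}],
    },
}
reconcile_secret(example, fake_decrypt, fake_write)
```

Passing `decrypt` and `write` in as parameters keeps the flow testable and makes it explicit that the operator is just the glue between KMS and Vault.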
A Piece of Advice
Storing secrets in source control, statically, even if they are encrypted, should be an exception. In our case, we had a very specific situation and a reason why we did it. But in reality, Vault can create dynamic secrets for you. If you're generating secrets or credentials for data sources like databases or Consul, or RabbitMQ, etc., Vault can do that for you. If you're generating dynamic identities or role secrets for AWS IAM, GCP IAM, Azure roles, etc., that can be done for you as well. If you’re using TLS certificates, Vault has a backend for that.
There are cases, though, where you could rotate your secrets or generate them dynamically, but it's a private API or a non-standard operation that Vault obviously doesn't support. You can write a plugin for that. Vault has a plugin system, and plugins can be deployed in this same containerized Vault-in-Kubernetes way. There's the balance of maintaining that code versus using something else. There's no rule of thumb; it's just managing risk.
Also, this wasn't really the goal of the project, but we realized that as we were integrating it into the Kubernetes ecosystem, there were many more things we could use to enrich the experience, or the system of Vault-in-Kubernetes. For example, my current favorite CNCF project is Open Policy Agent. If you're not familiar with it, go check it out; it is amazing. We're using it for enforcing rules on secrets or configuration to prevent people from shooting themselves in the foot, and to make sure we don't have an incorrect configuration that could break the system.
Monitoring and alerting with Prometheus is way easier now. We always had that option, but having Vault running on EC2 made auto-discovery very complicated. Now we just create one object, and things appear for us.
If, for example, you're running Consul Connect or Istio as your service mesh, they have direct support for Vault. You can plug that into the Vault that's running within the same environment, and it makes things smoother. If you're using Raft integrated storage, and you're maintaining the external volumes like EBS volumes, you can use one of the cloud-native storage systems, such as Rook, OpenEBS, or Longhorn, to have your redundancy or your backups in a more cloud-native way.
In general, check out cncf.io periodically for new projects or new things to add to your system to enrich it. For example, the framework used for creating the operators that I mentioned earlier is the Operator SDK framework, which was recently added as a CNCF project. It's always fun to check it out.
Lastly, there are some links. You don't need to write them down right now; you can get this presentation. I wanted to point out that the first three are the two operators (Vault Operator and Configuration Operator) and the one webhook that are open sourced for this.
By the way, they're also built for ARM architectures, so you can run them on Raspberry Pi Kubernetes clusters, such as this one that I have right here. I also have a demo there that you can go check out to see this in action and in real life. Then there are Bank-Vaults and some other projects (Open Policy Agent, Prometheus, Rook, OpenEBS, and Longhorn) that I mentioned in the talk.
That's it. Thank you very much. This is where you can find me; feel free to tweet at me. I'm also on several Slack workspaces, including the CNCF one and the Kubernetes one. Get in touch, and thank you very much.