The United Kingdom’s Department for Work and Pensions serves over 20 million customers a month. Learn how they use HashiCorp Vault to manage not only secrets, but the entire product lifecycle.
Learn about the United Kingdom’s Department for Work and Pensions' use case for implementing HashiCorp Vault on a government delivery team, not only to store secrets but to manage the whole product lifecycle. See how they designed and deployed Vault and also how they involve the whole product team in its use — safeguarding the product, but also each other.
Chris Storey: Hi, my name's Chris Storey. I'm a tech lead with the Department for Work and Pensions, and this is Arun.
Chris Storey: And today we're going to talk to you about a project we've been involved in. But first, we'd like to talk about what DWP is. DWP is the largest government department providing public services. We provide services for the delivery of welfare, pensions, and child maintenance. We're now delivering those services 'digital by default' to over 20 million people, who get their money every month or at around that frequency. We're paying about 20 million customers every month.
Our project, Universal Credit, is taking six of those working-age benefits and combining them into one monthly payment to make things simpler for everyone. In April this year, there were over 4 million people claiming Universal Credit—that has increased because of COVID. But ultimately, you can see the size and the scale we're working with.
Today we're going to talk to you about Vault, and the first question is why did our business need Vault? In the beginning, we were a small team. The team started out managing their secrets manually, but as we got bigger, things became complicated. We'll talk about that later. However, some of those secrets that we had were actually dangerous; they paid money, they managed and protected large amounts of personal data—we needed to do something about it.
We knew at the time that our team was starting to get bigger. The team was originally one team—it was going in the direction of about four teams. We had a small DevOps team that was shared across those teams, and there was reliance on a small number of those people. We had one occasion where—due to people being on leave—we had only one person who would be responsible for doing the delivery of secrets management, but also all of the delivery of actual releases.
It was becoming a little bit messy. But the key for us was that the product teams themselves didn't have that ownership and ability to manage their own products. There was an awful lot of transference of risk onto our central DevOps team.
We wanted to be secure by design. We realized at the beginning that we had a lot of procedural controls, and those procedural controls were being done on trust. We wanted to change that and make sure we could manage that properly throughout all of our processes.
We wanted to give the teams complete control of their products, and they could make decisions about it. But also they would have ownership of their secrets. That was important to us because—at the time—there were lots of processes going on that we didn't particularly like.
We wanted to make things easy. We wanted to make a release process that would allow people to do what they wanted within a framework. And we wanted that to be repeatable and built completely in the cloud.
I think one thing we could see was the ability—because of the processes that we were doing—for team members to be accused of things that possibly weren't them. But there was a gap—and we hoped that whatever we did, we would be able to fill that and remove that problem.
Let's think a little more about it: in our DevOps team, we were managing those secrets using common tools. We were using scripts, and we were relying on an awful lot of memory: how did we do it last time? When do we need to do it again?
The things we were doing were not out of the ordinary. But they did require people to understand what they were doing—and to keep track of that thing we did back—whenever it was—maybe 12 months ago. Do we remember all of the commands we ran? The corporate memory—particularly when your group of DevOps engineers changes—was part of this.
Our biggest worry was the way we were managing secrets. It involved using Git, and every one of our DevOps engineers had a personal copy of those secrets. We couldn't partition what individual people had access to.
Going on to deployment, we had quite a good method of doing things. But even then, it was taking us a week to design and build a new environment. It was also taking around six hours to do an individual application deployment. The cadence, therefore, was slowed, and we wanted to speed that up.
So, ultimately, what was it that we wanted to end up with? We wanted—wherever we went—to store those secrets securely, of course. But we wanted to change the way we managed access, and we wanted to place ownership with the people it needed to be with—and we believe that should be the teams. The many secrets that are individual to an application belong, we believe, with the product team and, therefore, the product owner.
Our organization would like everything to be cloud-agnostic. At the moment, AWS is used an awful lot, but we also do have some things in Azure and GCP. We also have some major parts of our infrastructure still on-premise. We wanted to keep some form of cloud agnosticism. We’d like to make everything immutable. We’d like to be able to quickly update things—and when we do update them, we would like to do that with zero downtime.
Then—ultimately—we wanted services a bit like AWS, where we could manage everything using APIs. We wanted to centralize PKI behind an API.
We went out, did some work, and looked at the market to see what the options were. We looked at a number of products—some of them very quickly, some of them in detail—and we did that piece of suitability checking. Obviously, a government department doesn't make decisions lightly, and this was quite a serious piece of work.
Ultimately, we selected Vault. There were things in there that we particularly liked. One of those is that our team in London already has Vault—but equally, they are using HSMs for their backend. We aspire to that at some point—and Vault also had the ability to work across multiple providers.
We now want to pass on to Arun, who is going to tell us more about what it was that we actually did. Arun.
Arun Jayanth: Thanks, Chris. We set up Vault using Terraform, and Vault is spread across multiple availability zones with a blue-green deployment so that we can do rolling updates, which was one of our requirements.
We have open sourced this Terraform code, and the link is at the end of this slide. When we were building Vault, we realized that—for immutable infrastructure—we had to make small changes at the AMI level and also at the Docker level. We also did some standardization of application development, which is coming up in the next slides. With the help of AWS, Vault, and some scripts, we have achieved zero-downtime deployment for everything.
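Their exact Terraform is in the open-sourced repository linked later; as a rough illustration only, a Multi-AZ, blue-green Vault cluster of this shape might be expressed with an auto scaling group per colour (all names, variables, and values here are hypothetical placeholders, not taken from their code):

```hcl
# Hypothetical sketch: Vault nodes spread across AZs behind an ASG,
# one ASG per blue/green colour, so nodes can be replaced for rolling updates.
resource "aws_launch_template" "vault" {
  name_prefix   = "vault-"
  image_id      = var.hardened_centos_ami      # placeholder: hardened CentOS 7 AMI
  instance_type = "t3.medium"
  user_data     = filebase64("init-vault.sh")  # one-time setup runs here
}

resource "aws_autoscaling_group" "vault" {
  name                = "vault-${var.colour}"  # "blue" or "green"
  min_size            = 3
  max_size            = 3
  desired_capacity    = 3
  vpc_zone_identifier = var.private_subnet_ids # one subnet per availability zone

  launch_template {
    id      = aws_launch_template.vault.id
    version = "$Latest"
  }
}
```

Switching traffic between the blue and green groups is what allows updates with zero downtime.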
To start with, the AMI: we built on a hardened CentOS 7 AMI. It is compliant on boot, with very high security compliance. We use OpenSCAP as our measure, and it scores 90.5, which is a little higher than the official CIS hardened image. In the AMI, we have removed all SSH access, so everything is secure by default.
For Docker—we made everything read-only. It is very lightweight. We removed all the admin commands and login commands. We have baked Vault's Auto-Auth agent into the Docker container so that it can automatically authenticate to Vault using AWS IAM. We have introduced a process to inject applications. We also made the logs ship automatically to CloudWatch—it's completely immutable.
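A minimal Vault Agent auto-auth configuration of the kind described—authenticating with the container's AWS IAM identity and writing the token to a file sink—could look like this (the role name and paths are illustrative assumptions, not their actual config):

```hcl
# Hypothetical Vault Agent config baked into the container image.
# The agent logs in using the container's AWS IAM credentials, so no
# long-lived secrets ever ship inside the image.
auto_auth {
  method "aws" {
    mount_path = "auth/aws"
    config = {
      type = "iam"
      role = "my-app"                       # placeholder Vault role bound to the task's IAM role
    }
  }

  sink "file" {
    config = {
      path = "/home/vault/.vault-token"     # token handed on to the templating step
    }
  }
}
```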
All of our communications with Vault are encrypted except the gossip protocol, and everything follows the HashiCorp recommendations. We use LDAP as our access control to log into Vault and the various nodes—and we use AWS IAM roles for the Docker containers. As per HashiCorp's recommendation, we do some customization at the beginning, and we destroy the root token—using the EC2 user data script. And as I said earlier, SSH access is completely removed from our Vault.
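Encrypting client traffic to Vault comes down to a TLS listener in the server configuration; a minimal sketch (certificate paths are placeholders) looks like:

```hcl
# Hypothetical vault.hcl fragment: TLS on the client-facing listener.
# (Storage and cluster stanzas omitted; gossip between nodes is the
# one channel noted above as unencrypted.)
listener "tcp" {
  address       = "0.0.0.0:8200"
  tls_cert_file = "/etc/vault/tls/vault.crt"
  tls_key_file  = "/etc/vault/tls/vault.key"
}
```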
Layered access was designed with the principle that the product team owns the entire product and DevOps is responsible only for managing Vault. DevOps doesn't have access to any of the feature teams' or product teams' secrets or configurations.
As you can see, our DevOps team is primarily managing Vault operations. They can do limited project operations—if permission is given by senior members using a privileged Vault token. In terms of the product operations in pre-prod and prod, you can see that the entire control is with the product team. In pre-prod, it's with Dev and QA, and in prod, it's completely with the product owner.
For the standard product setup, we had to standardize the configuration and the secrets management for the project. We have written a Lambda function, with additional security, to create those mount points and apply those additional configurations. But the Lambda does not have access to any of the feature teams' secrets.
We ended up with, for each environment: a PKI backend, which acts as an intermediate CA for the product; a KV backend, where the configuration and the secrets are stored; and ACL policies, which provide the layered access. Those layered accesses are mapped to LDAP groups, which map to the respective product owners, Dev, and QA—and there's also an AWS auth backend for the Docker containers.
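As a sketch of what one of those layered ACL policies might look like (the product name and path layout are invented for illustration; their real policies are generated by the Lambda):

```hcl
# Hypothetical Vault ACL policy attached, via an LDAP group, to a
# product owner: full control of that product's prod secrets.
path "myproduct/prod/secrets/*" {
  capabilities = ["create", "read", "update", "delete", "list"]
}

# No policy grants the central DevOps team any path under the product,
# which is how "DevOps manage Vault but can't see product secrets" is enforced.
```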
We used Consul Template to standardize the application. Each team is given a standard template depending on the application type, like Java, Node.js, or Nginx. This Consul template is rendered at startup. We have different paths in the KV backend to store the secrets and the configuration. Only the product owners own the secrets—be it non-prod or prod. The configuration in non-prod is owned by Dev, QA, and some of the team members. But in prod, it's completely owned by the product team.
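Rendering at startup is standard consul-template behaviour; a minimal configuration of the kind described (addresses and file paths are assumed placeholders) might be:

```hcl
# Hypothetical consul-template config run at container startup. The same
# template file is reused everywhere; only the Vault path it reads from
# differs per environment, so the template need only be tested once.
vault {
  address = "https://vault.internal:8200"   # placeholder address
}

template {
  source      = "/templates/application.properties.ctmpl"
  destination = "/app/config/application.properties"
}
```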
As the code goes through the pipeline, its environment-specific config is pulled from the respective Vault path, and the template only needs to be tested once. That saves a lot of time and enforces standardization.
We also gave the product team control to place their microservices on a different NTL architecture. It's a simple JSON file. We have written some scripts that convert the JSON file into Terraform format, and then convert that into AWS ECS templates.
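One way a simple JSON placement file can drive Terraform directly is via `jsondecode`; the sketch below is a hypothetical illustration of that pattern (field names and resources are assumptions, not their scripts):

```hcl
# Hypothetical sketch: the team's JSON placement file feeds ECS services.
# placement.json might contain:
#   { "services": [ { "name": "claims-api", "cluster": "prod",
#                     "count": 2, "task_definition": "claims-api:7" } ] }
locals {
  placement = jsondecode(file("${path.module}/placement.json"))
}

resource "aws_ecs_service" "microservice" {
  for_each        = { for svc in local.placement.services : svc.name => svc }
  name            = each.value.name
  cluster         = each.value.cluster
  desired_count   = each.value.count
  task_definition = each.value.task_definition
}
```

The appeal of this approach is that the product team edits only a small JSON file, never the Terraform itself.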
The non-prod environment is in the control of the Dev and QA members of the product team. There are three stages of deployment: the configuration stage, the pre-deployment stage, and the actual deployment. In the configuration stage, we ask the product owners (or the product team) to enter their configuration and secrets in the respective predefined places—and access control is applied on those paths.
Once the configuration is done, there is a pre-deployment stage, which I'll come to shortly. Once pre-deployment is completed, it's simply a matter of a Jenkins job running Terraform, and the deployment happens. As you can see here, there is a separation between non-prod and prod—and the product owner has to approve before moving to prod.
Pre-deployment—again, we are using the Lambda. When Jenkins kicks off that Lambda, it does everything needed for Terraform to deploy. For example, it creates the ACL policies for the microservice to access its configuration—read-only.
It also creates certificates for the load balancer, if needed. And it gathers all the information required for the AWS ECS task definition—like the application version, application name, or the log level. It keeps this kind of information ready, and this is carried forward into the next stage of deployment.
So once Terraform has run, it hands over to AWS ECS, and we use the AWS ECS rolling-update functionality to achieve zero-downtime deployment. During the startup of the container, Consul Template authenticates to Vault, gets the configuration, and renders it. It also creates a single-use certificate for the application.
That certificate lives only as long as the container is alive. Once the container is destroyed, it's gone—the certificates are gone. We also do some customizations—for example, for Java. For Java, we need to create a keystore and a truststore, and for that, we have to generate some random passwords.
We also baked those customizations into the container. Depending on the application type, it performs the customization—and that's also single-use. It's alive as long as the container is alive, and it goes away once it's done. As I said, all the certificates—everything—is destroyed once it's done. I will hand over to Chris Storey to talk more about whether we achieved our goals.
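The per-container certificate described above is the sort of thing consul-template can request from a Vault PKI backend inline; a hypothetical template stanza (the mount path, role, and common name are invented for illustration) might be:

```hcl
# Hypothetical stanza: ask the environment's intermediate CA for a
# certificate each time a container starts. The key material exists
# only inside that container and dies with it.
template {
  contents    = "{{ with secret \"myproduct/pki/issue/app\" \"common_name=app.internal\" }}{{ .Data.certificate }}{{ end }}"
  destination = "/app/tls/app.crt"
}
```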
Chris Storey: We now can release when we want—and quickly. We let—as you've seen—the teams manage their own secrets and the configuration. There is strong access control with all of that.
We have taken the load off our central team. And—I think it's quite important to say—we've retained some staff, and we've given them the space to develop further. When I say develop—I mean the actual product. It was becoming quite a mechanistic process for people, and here, we've given them the flexibility to have some more creativity again.
We've had some quite good feedback. Here, Ed, one of our product owners, has said that the current setup is good. If you look at the second point, I think it's quite interesting—he talks about the workflow, and—from a non-technical user perspective—it isn't great.
We've spoken to HashiCorp about that. I think it's been acknowledged that Vault wasn't seen as an end-user product, and we're using it differently. There's the potential there for a different user interface for some users. We're having a look at that, and we hope HashiCorp will also have a look at it.
Hayley is another one of our product owners. She says that she "enjoys doing releases." When we suggested that we might automate further and make some other changes, she said no—she said she quite likes getting involved and getting her hands dirty.
Buster, one of our delivery managers, says, "We're masters of our own destiny." And they are. They can do things when they want to: "We know we do releases. We do releases during the day, and we do them when we want."
Like with any product, we're always trying to move further forward. We are currently moving to HashiCorp Vault Enterprise. We are looking at how do we get that user experience right. We have technical users; we have non-technical users. Where's the balance, and which bits need to be done for each?
We want to further reduce the time it takes to deploy. We've already gotten over 50 times faster. We're down to a few hours compared to what was days.
We've done that. We believe there is further to be done. We believe we should hopefully be able to get down to half an hour or even less.
We are now looking at how we can deliver self-service environments, so that people can take their own environment, spin it up, and then destroy it—making more of these environments completely immutable, but also on-demand, which is, I think, one of the real positives of using cloud infrastructure.
We are also moving to HCL Version 2; that's currently ongoing—and we want to upgrade our base operating system to CentOS 8.
Thank you for listening. If anybody has got questions, Arun and I will be happy to answer that.
Arun Jayanth: Thank you.