See how Sky Betting & Gaming uses pull requests, ChatOps, and Terraform to manage Vault clusters.
When Sky Betting & Gaming first started with Vault for secret management, the admins were doing everything manually. This was an audit nightmare, and with increasingly more users, the workload became unmanageable.
In this talk, Senior Automation Engineer for Sky Betting & Gaming Lucy Davinhart will outline how her team started chipping away at the manual processes, keeping track of changes over time, and automating parts of their configuration. Today, you'll see how their current system works with pull requests, ChatOps, and Terraform to manage Vault clusters.
Good morning, HashiConf. Thank you all for coming. Say you are part of a small team, maybe a couple of people, and you're in charge of managing your company's Vault cluster. You've got hundreds of people across the company, all wanting to make use of Vault across thousands of systems, many of which are production. So you expect to get around a dozen requests a day for changes within Vault. But you're a small team, and you manage several things, so you don't have the time to dedicate to just administering Vault. What are you going to do about that?
You don't want to hand out too many admin permissions within Vault because then you end up with the too many admins problem—and the gods know we've been there.
Let's say for the sake of argument that it's 2017, so Vault Namespaces hasn't been invented yet. Not that you've convinced your finance department to pay for Vault Enterprise yet anyway. How are you going to manage this problem? Well, I'm going to tell you what we did.
My name is Lucy Davinhart. I'm a senior automation engineer at Sky Betting & Gaming. I've been there for about three and a half years, about as long as we've been using Vault. My team looks after several things—but pertinent to this talk—we look after our Vault clusters.
We maintain integrations with Vault, we handle some tooling around Vault, and we support all our internal customers using the tool. We are also the AWS gatekeepers because we manage access to that via Vault as well.
So people are using Vault. We've got thousands of VMs, hundreds of humans, various scheduled jobs making use of AppRoles and such—and several other things that are difficult to count using AWS and Kubernetes Auth.
We started using Vault in late 2016. We had a couple of problems back then that we needed to deal with before we could get it into production. I'll tell you what those problems were, how we initially addressed them—and how things improved for us as a result of doing it with Terraform. As this is a journey that never ends, I'm going to talk about a couple of things that we haven't done yet.
Before getting Vault into production—a couple of issues we needed to deal with. First of these, we were doing everything manually. I like to think we're good at config management at SBG—so installing and running Vault was reasonably well automated. But then configuring it once it was live—we did all that manually.
Doing stuff manually—especially when we're brand new to the product—was naturally going to take a while. But even as we gained experience with the product, certain things just took a long time. I'm not just talking about the time that it takes to configure Vault, I'm also talking about how long it takes to debug things. When someone comes to you and says, “I don't have access to this secret. Why not?” That sort of thing took a while to debug.
We’ve also got making changes to Vault. When we're storing our configuration in Vault and nowhere else, it was often time-consuming to properly compare to what already existed. So we did things differently depending on who was making changes—and things were a bit inconsistent. We were averse to doing things manually. We have the word automation in our job titles, after all.
Too much power
As a side effect of doing this all manually, we had phenomenal cosmic power. There was nothing we couldn't do within Vault. Our admin policy back in the day was, let's call it generous and leave it at that. It was scary, and not something we wanted once we were in production.
Lack of an audit trail
There were also audits to consider. We used Vault audit logs, and we're shipping them off to Elastic—and have been since very early on. If you're using Vault in production, you absolutely should be using the audit logs. They're great, but you can only answer so many questions with them.
For example, you can reasonably easily see when someone has written to a secret. You can see who they are, and you can see when they did it. But you can't see what changes they've made, which is a good thing. You don't want secrets appearing in your audit logs, and Vault isn't psychic. You have no idea what changes they've made.
We don't have infinite storage space, so we don't have infinite retention on our audit logs. Anything past the retention period is gone—it’s lost to history. We needed to put something in place to deal with this issue. It didn't have to be perfect, it had to vaguely work and be good enough for us to get to production.
The first thing we tackled is keeping track of what was changing over time.
Vault Config RubyGem
We use Chef, so we write a lot of Ruby. We wrote a RubyGem for this. It would iterate over several paths within Vault, such as policies, LDAP groups, AppRoles—and it would save them to a Git repository. We’d run that as part of a scheduled Jenkins job.
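The gem itself was written in Ruby and is not shown in the talk. As a rough illustration of the idea only, here is a minimal Python sketch that walks a list of configuration paths and writes each one to a file that could then be committed to Git. The `snapshot` function, the `read` callback, and the in-memory `fake_vault` dictionary are all hypothetical stand-ins; a real version would read from the Vault API.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Dict, Iterable, List


def snapshot(paths: Iterable[str], read: Callable[[str], Dict], out_dir: Path) -> List[Path]:
    """Write the config found at each path to out_dir/<path>.json.

    Keys are sorted so that repeated runs produce stable, diff-friendly files.
    """
    written = []
    for path in sorted(paths):
        target = out_dir / f"{path}.json"
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(json.dumps(read(path), indent=2, sort_keys=True) + "\n")
        written.append(target)
    return written


# Example run against an in-memory stand-in for Vault (hypothetical data).
fake_vault = {
    "sys/policy/dev": {"rules": 'path "secret/dev/*" { capabilities = ["read"] }'},
    "auth/ldap/groups/dev": {"policies": ["dev"]},
}
with tempfile.TemporaryDirectory() as tmp:
    files = snapshot(fake_vault, fake_vault.__getitem__, Path(tmp))
    relative = [str(f.relative_to(tmp)) for f in files]
    first_saved = json.loads(files[0].read_text())
```

Only non-secret configuration paths are passed in, matching the point above that secrets themselves should never land in the repository.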
Specifically, we're not using this to keep track of secrets over time, because if we're saving those in a Git repository, they're not secrets anymore, are they? But we put this together fairly quickly. What it's doing is not particularly complicated. As a result, we could see what was changing over time. This didn't give us any more visibility over who was making changes or why, and we were still having to make changes manually, but it was an improvement.
This tool we've made was originally intended to be used to make changes back to Vault, but it wasn't very good at that. I'm allowed to say that because I wrote the thing. It was clunky at that part, and we didn't have the confidence in it to give it the write access to Vault. That never ended up happening.
Goldfish Vault UI
Instead, we deployed a tool called Goldfish to tackle the problem of making changes to Vault. Goldfish was a Vault user interface that existed before the open source version of Vault had one. That on its own was useful to us because our users love user interfaces, but that wasn't the reason we deployed it. The reason was that it had a policy request feature. Our users could go into this tool and edit policies or add new ones. Obviously, they don't have write access to these things, so those requests would come to us, and, assuming they hadn't asked for something silly, we would approve them.
With those two things in place, we had enough to get Vault into production—enough to satisfy our users that we were not going to do anything silly. It gave us enough time to figure out how to do things in a better way, which is where Terraform comes in.
Now, you've come to a talk that is primarily a Vault talk, but it's also a Terraform talk. In case there's someone in the audience who doesn't know what Terraform is, it's a tool that allows you to define your resources in a declarative way. That's typically stuff like your cloud infrastructure, but you can do it for anything with an API.
So you write some code to define how you want your resources to look. Terraform keeps track of the state of your resources, and it uses these two things to figure out how to get from where you are now to where you want to be.
We've been using it for a while to manage all sorts of things, and we discovered that it has a Vault provider, which means we can use it to configure Vault. But we didn't want to give our users a repository full of raw Terraform code. For a start, that would mean that our users would have to learn how Terraform works at the same time as learning how Vault works.
Instead, we wanted something which resembled the Vault API on disk. Not abstracting stuff like that away means our users get a chance to learn more about how Vault works. This is useful when they're requesting multiple interacting resources. And it reduces the learning curve on them, I think.
Take this example—these two files. The thing on the left is some minimal Terraform code that you need to put a policy into Vault. The thing on the right is just that policy. We don't want our users to have to learn this syntax on the left when all they care about is the thing on the right. There was also going to be a bit of overlap between us configuring things manually—or by Goldfish—and then doing it with Terraform. So we wanted it to take the output of that RubyGem we'd written as input.
We wanted people to be able to raise pull requests to make changes. We put it in Bitbucket. Then we can keep track of who's requesting changes, who's approving them, when they were merged, and what Jira tickets they're linked to, all that good stuff. While we did have dreams of Terraforming as much as possible, we were going to start small. We were only going to start with policies, initially. When we started this project back in May 2018, we had about 300 Vault policies. Today, we have over 1,000, which is quite a few.
Finally, we wanted to make sure that anything in Vault that wasn't in this repository was deleted from Vault. The idea being we shouldn't have any unexpected or unexplained changes within Vault. In theory, in production, this isn't possible anymore. But we don't want to take that risk. In test, the rules are a bit more relaxed, so it's nice to be able to reset Vault to a known state.
As far as our users are concerned, this looks a bit like this. They go into Bitbucket, they make their changes directly in the editor if it's a simple change. For more complicated changes—clone the repo, make the changes—push it back.
Either way, they end up with a pull request. We ask our users to get those pull requests approved within their team before they send them to us, which often picks up little issues before they get to us. It also gives us some assurance that people are asking for things that they're supposed to have access to. It will also give Jenkins enough time to approve the pull request as well.
Once this has happened, they come to one of our Slack channels and say, “I have this pull request for Vault. Can you approve it, please?” They interact with a Slackbot that we use throughout the company that handles this for us. It creates a Slack thread, it keeps track of these requests in that thread and in Jira tickets, so we can refer to them later, and it notifies whoever is on duty that day.
Funnily enough, in this example, it's me. I come along, I take a look at that pull request. Now, assuming it's good, I'll approve it, and I'll merge it. Then I'll tell them, “Terraform is doing its thing. It should be live soon.” Shortly after that, we get another notification from Jenkins in a Slack channel. It tells us it's come up with a plan. It wants to make some changes to Vault. It'll give me a brief summary of what those changes are, as well as a link that I can click to view the entire plan. Then I can compare that to what the pull request said, and make sure it's doing what it's supposed to.
It also gives me a Vault CLI command, so I don't have to think what commands I need to run. I run that command, and that gives Jenkins write access to Vault. I take the output of that command, I paste it back into Jenkins, and it goes away and makes those changes. That bit is really quick. I go back and tell the user, “All your stuff is live. Go and enjoy your secrets.”
Depending on how familiar they are with this process, I'll be more or less verbose. If this is their first pull request, I will explain a lot—I’ll give them helpful tips. I'll tell them if they need to re-authenticate with Vault, for example. This particular user has done it hundreds of times before, so they get emojis.
We use Jenkins for this. You could theoretically use any other CI/CD tool if you wanted. I believe Circle and Harness are here this week, so I recommend one of those. But there's not actually that much logic in the Jenkins job itself. Each of these stages corresponds to a target within a Makefile. We have Make on all our laptops, so we should be able to run this whole thing locally. If I run make help, we can see all those stages again. Being able to run this whole thing locally means that I can test new changes to it and any new features I want to add to it.
The first of these stages is init. This is where we make sure we have access to our Terraform state, where Terraform is keeping track of our resources. We keep our Terraform state remotely in Amazon S3, so naturally, we need to get some AWS credentials out of Vault for that. We do terraform init, so that'll download any other Terraform dependencies that we need, and we make use of Terraform workspaces. This means we can maintain separate Terraform states for each of our Vault clusters.
Next, we have the import phase—this is our fail-secure mechanism. This is where we make sure that there's nothing in Vault that shouldn't be there. We have a script that lists all the resources of that type within Vault. It lists all the resources of that type in the Terraform state, i.e. everything that Terraform knows about—and it'll compare those two things.
If there's something in Vault that Terraform doesn't already know about, we import it into the Terraform state—we tell Terraform that it exists. The idea being that if you tell Terraform something exists, but you haven't written the code to say that it's supposed to exist, then Terraform will want to delete it—which is the whole point of this stage of the job.
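The comparison at the heart of the import stage can be sketched in a few lines of Python. This is an illustration under assumptions, not the team's actual script: the function names are hypothetical, policies are used as the example resource type, and depending on the Terraform version, `terraform import` may also expect a matching resource block to exist in the configuration.

```python
import re
from typing import Iterable, List


def tf_label(name: str) -> str:
    """Turn a Vault object name into a safe Terraform resource label."""
    label = re.sub(r"[^A-Za-z0-9_-]", "_", name)
    return f"_{label}" if label[:1].isdigit() else label


def plan_imports(in_vault: Iterable[str], in_state: Iterable[str],
                 resource_type: str = "vault_policy") -> List[List[str]]:
    """Build one `terraform import` command per object Terraform doesn't know yet.

    Anything imported here but absent from the generated code will show up
    as a deletion in the next plan, which is the fail-secure behaviour.
    """
    unknown = sorted(set(in_vault) - set(in_state))
    return [
        ["terraform", "import", f"{resource_type}.{tf_label(name)}", name]
        for name in unknown
    ]


# Example: "rogue" exists in Vault but not in the Terraform state.
commands = plan_imports(["dev", "ops", "rogue"], ["dev", "ops"])
```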
Next, we write some Terraform code. We have the generate phase. In the case of policies, this will iterate through all our policy files. It will use the file name as the name of the policy, and it will generate the necessary Terraform code for these. All the other resources are handled similarly. We save that to a Terraform file that we can then use within Terraform later.
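As a rough sketch of what a generate step for policies might look like, the Python below reads every policy file in a directory and emits Terraform's JSON configuration syntax for `vault_policy` resources. The `.hcl` file extension, the function names, and the choice of JSON output rather than native Terraform syntax are assumptions made for this example.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict


def escape_template(text: str) -> str:
    """Escape sequences that Terraform would otherwise treat as templates."""
    return text.replace("${", "$${").replace("%{", "%%{")


def generate_policies(policy_dir: Path) -> Dict:
    """Build a Terraform JSON document with one vault_policy per policy file.

    The file name (without extension) becomes the policy name.
    """
    resources = {}
    for path in sorted(policy_dir.glob("*.hcl")):
        name = path.stem
        resources[name] = {"name": name, "policy": escape_template(path.read_text())}
    return {"resource": {"vault_policy": resources}}


# Example: one hypothetical policy file on disk.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "dev-team.hcl").write_text(
        'path "secret/dev/*" {\n  capabilities = ["read"]\n}\n'
    )
    generated = generate_policies(Path(tmp))

# The result could be written to a file such as policies.tf.json.
rendered = json.dumps(generated, indent=2, sort_keys=True)
```

Emitting JSON avoids hand-building Terraform syntax with string concatenation, at the cost of less readable generated files.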
We do a bit of validation. A lot of the validation is done during the generate phase just because it's easier to write it that way, so some resource-specific checks are done there. We also do terraform validate, which makes sure that we've not come up with gibberish and that we've actually written some Terraform code. Now, if you are a Terraform Enterprise user, you could do all sorts of fancy Sentinel (https://www.hashicorp.com/resources/sentinel-terraform-enterprise-demo/) policy stuff here.
At this point, we have a Terraform state which corresponds to everything that we know about within Vault. We have a bunch of generated Terraform code, which corresponds to everything that we want to be within Vault. If we run terraform plan, it will compare those two things and figure out if it needs to make any changes, and it will save that to a plan file. The idea being that when we tell Terraform to make those changes later, it's not going to try anything unexpected, and it doesn't have to think about it. Applying it will be super quick. If we're validating a pull request, that's all we need to do at this point. We can mark it as successful, and that will contribute to the pull request being able to be merged.
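The read-only stages described so far map onto a short sequence of Terraform CLI calls. The Python sketch below only builds the command lists, so it can be inspected without Terraform installed. The workspace name and plan file name are hypothetical, and the real pipeline drives these steps through Makefile targets rather than Python.

```python
from typing import List


def read_only_stages(workspace: str, plan_file: str = "vault.tfplan") -> List[List[str]]:
    """Commands for init, workspace selection, validation and a saved plan."""
    return [
        ["terraform", "init", "-input=false"],
        ["terraform", "workspace", "select", workspace],
        ["terraform", "validate"],
        # Saving the plan means a later apply executes exactly what was reviewed.
        ["terraform", "plan", "-input=false", f"-out={plan_file}"],
    ]


def apply_stage(plan_file: str = "vault.tfplan") -> List[str]:
    """Command that applies a previously saved plan file without re-planning."""
    return ["terraform", "apply", "-input=false", plan_file]


# Example: one workspace per Vault cluster (the name is made up).
stages = read_only_stages("vault-prod")
```

Each list could be passed to `subprocess.run(cmd, check=True)` so the job stops at the first failing stage.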
Thus far, this entire job has been running with read-only access to the bits of Vault that it looks after. We're fairly confident allowing it to do this without human supervision because the things that it manages, we don't consider to be secrets. We're talking LDAP groups and policies and AppRoles. As it's potentially been writing to a Terraform state, we can regenerate that entire thing from scratch using the steps in the import stage. We don't want to because it'll take a while, but it's theoretically doable.
At this point, it'll prompt us to grant it write access, and it gives us this Vault CLI command:
pscli vault write -f auth/approle/role/jenkins-terraform_vault-readwrite/secret-id
This is part of an AppRole that has read and write access to Vault. Running this command will give me a secret-id, which is single-use and available for 30 seconds.
It's prefixed with pscli, which is our perplexingly snazzy command line interface, which is another tool my team looks after—which is an entire talk in itself. But all you need to know is it makes sure that everyone across the company is using the same version of the Vault CLI. Whenever we're interacting with any repos full of Terraform code, we're all using the same version of Terraform for that repository—and it handles all our Vault auth automatically.
We give Jenkins that secret-id, and it applies those changes. It uses that plan file that it generated earlier. This is so quick that, in the time it's taken me to explain it, it could have done it 10 times over because it's not had to think about anything.
Commit and merge
Finally, we do a little bit of cleanup in our repo. We take all that generated Terraform code, and we commit it back to the repository. We don't really need to do this. We don't tend to refer to it all that often, but it's nice to have it around permanently in case we need to refer to it later. We also merge release, which is our branch where we keep track of things that we want to be live, into master, where we keep track of things that are live. That gives us a nice marker in the Git repository that there has been a release.
That was our minimal viable product. It was solving part of the problem for us, so we started adding more to it. We started adding additional resources—and we did something interesting for each of those resources, I think, starting with LDAP groups.
What policies do individual groups of humans have access to within Vault? This is where we have to get a little bit creative. We added these to the pipeline back in July 2018. There were about 90 of these at the time. There are about 250 now, but back then, there wasn't a dedicated resource for managing these in the Terraform provider.
Fortunately, though, there was a resource called vault_generic_secret. This allows you to manage arbitrary paths within Vault as a Terraform resource, which is very powerful; you can do all sorts with that. But it's also very dangerous if you're not being careful. You can expose your secrets in places you don't want them exposed. If you do ever use this for anything, pay attention to the warnings in the documentation.
We were being careful. But we were also using it to manage things we didn't consider secrets. So we were using LDAP groups and what policies they're mapped to. We weren't too worried. But still, later on, when they released the dedicated resource for managing these things, we switched over to using that instead.
Our users weren't writing raw Terraform code, so this was completely invisible to them. They didn't have to do anything different for this. Also, there was a change that happened recently at our place; there are now certain types of LDAP groups that aren't allowed to be mapped to permissions within systems such as Vault.
That confused the heck out of a lot of our users for a while, and we ended up getting pull requests for the wrong kind of LDAP group. But, fortunately, it's very easy to detect, and because it's very easy to detect, we now have something in our pipeline. If a user tries to request that sort of thing, we can mark the pull request as failed, and give them an explanation as to why. We don't have to explain it to them before that pull request comes to us.
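A check like this can be very small. The talk does not say how the disallowed groups are recognised, so the sketch below assumes, purely for illustration, that they share a name prefix. The prefix, the function name, and the message wording are all made up.

```python
from typing import Iterable, List

# Hypothetical rule: groups with these prefixes may not be mapped to policies.
DISALLOWED_PREFIXES = ("dl-",)


def check_ldap_groups(requested: Iterable[str]) -> List[str]:
    """Return one human-readable error per requested group that is not allowed."""
    errors = []
    for group in requested:
        if group.lower().startswith(DISALLOWED_PREFIXES):
            errors.append(
                f"{group}: this type of LDAP group cannot be mapped to Vault policies"
            )
    return errors


# Example: a non-empty result would mark the pull request build as failed.
problems = check_ldap_groups(["team-platform", "DL-everyone"])
```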
AppRoles are our second most common authentication mechanism for Vault. We had about 160 of these when we added them to the pipeline back in September 2018. Since then, that number has almost doubled.
The majority of these are used by us to give Jenkins jobs access to Vault, so they should all be configured fairly similarly. But depending on who was implementing these things—because it took a while to compare them—things looked a little bit different; usually around the IP addresses that we have for the Jenkins agents.
When we added this to the pipeline, we documented this for our users and recommended that they start using Terraform variables for that sort of thing. That means whenever the team that manages those Jenkins agents wants to add any new ones, they just need to update one file, and then that will update hundreds of AppRoles in one go. It also means that when we're looking in this repository, we don't have a list of IP addresses that we need to figure out the meaning of. We have a human-readable variable name.
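To make the variable idea concrete, here is a sketch that generates Terraform JSON for one shared variable and an AppRole that refers to it by name. The variable name, role name, and CIDR values are hypothetical; `vault_approle_auth_backend_role` and its `secret_id_bound_cidrs` argument come from the Terraform Vault provider, but whether the team's roles use that exact argument is an assumption.

```python
from typing import Dict, Iterable, List


def cidr_variables(named_cidrs: Dict[str, Iterable[str]]) -> Dict:
    """One Terraform variable per named list of CIDR blocks."""
    return {
        "variable": {
            name: {"type": "list(string)", "default": sorted(cidrs)}
            for name, cidrs in named_cidrs.items()
        }
    }


def approle(role_name: str, policies: List[str], cidr_variable: str) -> Dict:
    """An AppRole whose allowed source addresses come from a shared variable."""
    return {
        "resource": {
            "vault_approle_auth_backend_role": {
                role_name: {
                    "role_name": role_name,
                    "token_policies": sorted(policies),
                    # A string holding a single interpolation keeps the list type.
                    "secret_id_bound_cidrs": f"${{var.{cidr_variable}}}",
                }
            }
        }
    }


# Example: updating the one variable would change every role that references it.
variables = cidr_variables({"jenkins_agent_cidrs": ["10.0.2.0/24", "10.0.1.0/24"]})
role = approle("jenkins-deploy", ["deploy"], "jenkins_agent_cidrs")
```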
Kubernetes Auth Roles
Next we added Kubernetes Auth—a cool authentication mechanism for Vault. The team that manages our Kubernetes clusters came to us and said, “We want to use K8s as an Auth mechanism for Vault.” We said, “Fine, that sounds fantastic, with two problems. The first being that we don't know how K8s works, and we don't manage it.”
We told them that they were free to add it to this pipeline if they wanted to. Because of the way we had designed this, they didn't have to do that much work. They only needed to add an import script to keep track of what was already in Vault, and a generate script to write the Terraform code.
That didn't take them much work, and it took us hardly any time at all to approve. I love it when our users go away and add features like this for us—it’s great when that happens.
AWS Auth Roles
We’ve got AWS Auth as well—we added it reasonably recently—some point this year. We use Vault to manage access to AWS. This is the other way around—using AWS to manage access to Vault. I'm not going to go into the details. But for each of our 100+ AWS accounts, before we even get to mapping roles to policies, etc., there’s something that we need to configure in Vault for every single one of our 100+ accounts.
And fortunately, it's the same thing for every account—all we need to know is the account number. Now the pipeline has a step which will list all of the accounts in our organization, get the account numbers, and generate these resources. That means we don’t have to think about this when we add any new AWS accounts—it just happens automatically.
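The talk does not name the per-account resource. One plausible candidate is the provider's `vault_aws_auth_backend_sts_role`, which ties an account ID to a role that Vault assumes in that account, so the sketch below uses it as a guess. The IAM role name and account IDs are invented, and a real pipeline would obtain the account list from AWS Organizations rather than a literal.

```python
from typing import Dict, Iterable


def sts_roles(account_ids: Iterable[str], role_name: str = "vault-auth") -> Dict:
    """Generate one identical resource per AWS account, keyed by account ID."""
    resources = {}
    for account_id in sorted(account_ids):
        resources[f"account_{account_id}"] = {
            "account_id": account_id,
            # Hypothetical role name; only the account number varies.
            "sts_role": f"arn:aws:iam::{account_id}:role/{role_name}",
        }
    return {"resource": {"vault_aws_auth_backend_sts_role": resources}}


# Example with two made-up account numbers.
generated_sts = sts_roles(["222222222222", "111111111111"])
```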
Active Directory Users
This was the last thing we added to the pipeline. Prompted by the fact that we had this LDAP restructure, we now had many more of these—and people wanted to manage their passwords with Vault.
This is another case where we had to get creative because there wasn't a dedicated resource for managing these. You might think that we just did the same thing as we did for LDAP groups. But we can't do that in this case because these things are tricky. When you're writing these things in Vault, there are certain parameters that you either don't know the value of, or that change so frequently that if you were to use generic secrets, Terraform would end up in a loop trying to write the same thing to Vault over and over again.
Fortunately, there's a much better, more flexible resource called vault_generic_endpoint, and it basically does the same thing. You can use it to manage arbitrary Vault endpoints. But it has a parameter, and if you specify that, then Terraform only cares whether the keys that you have specified are correct. In this case, as long as the account name is fine, we don't care when the password was last set, and Terraform is happy. As far as Terraform is concerned, it's unchanged, so it won't get stuck in a loop.
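The parameter is not named in the talk; it is most likely the provider's `ignore_absent_fields` argument, though that is an assumption. The Python below is not provider code. It is a small model of the comparison being described, showing why ignoring fields you did not specify stops the endless rewrite loop. The field names in the example are hypothetical.

```python
from typing import Any, Dict


def needs_update(desired: Dict[str, Any], actual: Dict[str, Any],
                 ignore_absent_fields: bool) -> bool:
    """Decide whether the endpoint would be rewritten.

    With ignore_absent_fields set, only keys present in `desired` are compared,
    so server-managed values that change on their own are not seen as drift.
    """
    if ignore_absent_fields:
        actual = {key: value for key, value in actual.items() if key in desired}
    return desired != actual


# Example: the account name matches, but Vault also reports a changing timestamp.
wanted = {"service_account_name": "svc-app@example.com"}
reported = {
    "service_account_name": "svc-app@example.com",
    "last_vault_rotation": "2019-09-10T09:00:00Z",
}
loops_forever = needs_update(wanted, reported, ignore_absent_fields=False)
settles = needs_update(wanted, reported, ignore_absent_fields=True)
```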
The big obvious improvement, first, is time. Individual changes like this to Vault take less time because our users are doing the bulk of the work. That means that we're able to accept more of their requests.
We've got more visibility of what's going on within Vault. We can see all these pull requests going in, and the Jenkins job notifications. It also means that our users can debug their own access with a combination of this repo and an internal Wiki page. That, again, means we have more time. They can figure stuff out and, potentially, raise pull requests to fix their own issues.
We're also able to answer some more of the audit questions; the who, what, when, how, and why of historical Vault setup. If our auditors come to us and say, “This sneaky user here; what access did they have in Vault on this particular date?” As long as we can find out what LDAP groups they were in, that is a question that we can now answer.
We can search through this repository and look for patterns, look for things we can make more efficient, and potentially find issues before they become problems. As a result of doing all this automation, there are now certain things that individual humans just simply cannot do within Vault. This is a lot less scary than having that phenomenal cosmic power we had before.
I can't tell you exactly what comes next, because priorities are constantly shifting, and I left my crystal ball in the hotel room. I can tell you some ideas that we have, things that we could add to this.
First, we could do more with it. We could add more resources to it. There are lots of things you can do with Vault and lots of things that we are doing with Vault that we're not Terraforming yet. For example, PKI roles. We have quite a few of these, but our users haven't asked for any new ones in the best part of half a year. Our users are fairly happy reusing the ones that they already have. So we've deprioritized Terraforming that.
The Sentinel policies—we have Vault Enterprise. We can make use of these, and they are significantly more powerful than the standard ACL policies that you may be familiar with. But for the most part, our users don't need them, and we haven't got many people that are using these. However, when our users discover them and want to make use of them, then it will naturally fit into the pipeline. They are very similar to configure to regular policies.
Namespaces are a cool Enterprise feature. They give you a Vault within a Vault, a dedicated section of Vault that certain people could administrate. They need to be consistent and correct with each other, which sounds like the sort of job that Terraform would do really well.
We can do a lot more around Auto-Generation. Our 100+ AWS accounts are all configured very similarly. Just by telling me the name of the account—as far as Vault is concerned—I can theoretically tell you what policies should exist, what LDAP groups they should be mapped to, and potentially, what AppRoles they're mapped to as well, because we're trying to be very consistent. And if we're being consistent—if I can tell you that by the name alone—I can write a script to tell you that as well. In fact, I have. It's just not live yet.
The next point—it’s in the future section of my talk—but the team that manages our Kubernetes clusters, they're fantastic, and in the past few weeks, they have been experimenting with this sort of thing already. Whenever they need to grant people access to bits of K8s, they have a template pull request that they get their users to raise with us. Well, a template pull request, that sounds like the sort of thing you can automate. They're experimenting with that, and it looks to be working really well.
There's also some service discovery stuff that we could be doing. Back to AppRoles and using Jenkins as the example: If the team that manages those Jenkins agents starts adding lots of these or starts scaling up and down every couple of days, then they don't want to have to raise a pull request every time they do that. We could pull that list of IP addresses out of, say Consul. Now, if they are doing it several times a day, several times an hour perhaps, then, perhaps, we want to use a different authentication mechanism.
Reviewing security trade-offs
We’ve also got security to consider. We’re fairly happy with the trade-off we have at the moment between security and convenience. But if we ever needed to, we could lean more on the security side of things. There are some additional safeguards we can add. For example, Vault supports two-factor authentication. We could have something in place where before one of my team can grant Jenkins write access to Vault, we could require a two-factor Auth prompt—and that's easy to add in the policies for this.
There are also Vault control groups, which is a cool Enterprise feature that has been on my shiny-cool-features-to-play-with list for a while. If we wanted to require more than one person to approve these changes to Vault, it's the sort of feature we could use for that.
More validation ahead of merging a PR
We don't do that much validation at the moment. We have a couple of resource-specific checks, and we have terraform validate—which makes sure that we've actually generated some Terraform code. But there's loads more we could be doing here. The main reason we haven't is that the things people get wrong when they're raising these pull requests don't happen often enough to be worth our time to write checks for. Now, I'm not just talking the time that it would take for us to write these checks, I'm also talking about the time that it would take to run these checks.
A recent example is with LDAP groups. A couple of people were raising requests for groups, and they got the case sensitivity not quite right. A naive check for that would check against all the LDAP groups, and make sure they're correct—that took about three and a half minutes, which is not that long. But considering the entire job takes three and a half minutes, probably not worth it. If you have Terraform Enterprise, this is the sort of place where you could have a lot of Sentinel policies as well to make sure people are requesting the right things.
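The comparison itself is cheap once the list of real group names is in hand; the slow part in practice would be fetching that list from the directory. A minimal sketch of the check, with hypothetical function and group names:

```python
from typing import Dict, Iterable


def case_mismatches(requested: Iterable[str], directory_groups: Iterable[str]) -> Dict[str, str]:
    """Map each mis-cased requested group to its correctly cased directory name."""
    canonical = {group.lower(): group for group in directory_groups}
    mismatches = {}
    for name in requested:
        match = canonical.get(name.lower())
        if match is not None and match != name:
            mismatches[name] = match
    return mismatches


# Example: "platform-team" exists in the directory as "Platform-Team".
found = case_mismatches(
    ["platform-team", "Ops"],
    ["Platform-Team", "Ops", "Finance"],
)
```

Groups that do not exist at all are deliberately left out of the result, since that would be a separate check.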
Three and a half minutes isn't that long, but as we grow, that number is going to get longer—it’s going to take longer to manage things. There are a couple of inefficiencies in this pipeline that I'm aware of that we can make faster—I have ideas for how we can do that. That's something I'm going to have to tackle reasonably soon-ish.
Make it generic
Back to namespaces—we could make this job generic, which is a common pattern across the company. We have generic Jenkins jobs that run against arbitrary Git repositories. When we set up namespaces, the admins of those namespaces are going to be in a similar position to where we were a couple of years ago. They might be happy writing their own automation, but if they don't want to spend the time doing that, they could reuse our job, and have their own Git repository that runs with this job—then it'll just work for them.
Terraform is great at managing Vault, but should you go away and try to emulate what we've done here? Honestly, probably not. I'm not saying that because I think what we've done is bad. I wouldn't be on this stage if I thought that. What we've done works well for us, but it was based on initial experimentation and incremental improvements over the course of a couple of years. And it was based on some limitations that existed that don't exist now.
But hopefully, I've given you some insight into how we've tackled this problem, and maybe some inspiration to try something like this yourselves.
Thank you all for watching. You have been a fantastic audience. If you want to ask me anything later, or just get some of my stickers, you can find me on Twitter or on this HashiConf Slack.
Thank you all. You've been great.