In this talk, learn how Sky Betting & Gaming helps its developers seamlessly grab dynamic credentials from HashiCorp Vault without having to specify which credentials they need.
Speaker: Lucy Davinhart
Vault is pretty great, and one reason among many is something called dynamic secrets engines.
Unlike static key-value secrets, dynamic secrets do not necessarily exist until you request them.
They can be time-limited, and you can revoke them early through Vault if you no longer require them. Which means it's less of a panic if you discover someone accidentally committed them to a public GitHub repo months ago.
One example of this is AWS credentials. In this example, Vault has its own set of credentials that it uses to log into AWS. It will then generate an IAM user and associated credentials, and pass those back to the user along with a lease.
This lease is valid for a certain amount of time, after which Vault will delete the IAM user, and then those credentials are no longer valid. You can also renew it before it expires if you need those credentials a bit longer.
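To make the lease lifecycle concrete, here is a minimal sketch of how expiry and renewal could be modelled. It is an illustration only, not Vault's implementation: the `Lease` class, its field names, and the `max_ttl` cap are assumptions made for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Lease:
    """Hypothetical model of a lease attached to generated credentials."""
    lease_id: str
    issued_at: datetime
    ttl: timedelta      # how long the lease is currently valid for
    max_ttl: timedelta  # assumed hard limit measured from issue time

    @property
    def expires_at(self) -> datetime:
        return self.issued_at + self.ttl

    def is_expired(self, now: datetime) -> bool:
        return now >= self.expires_at

    def renew(self, now: datetime, increment: timedelta) -> "Lease":
        """Extend the lease from 'now', never beyond issued_at + max_ttl."""
        if self.is_expired(now):
            raise ValueError("cannot renew an expired lease")
        hard_limit = self.issued_at + self.max_ttl
        new_expiry = min(now + increment, hard_limit)
        return Lease(self.lease_id, self.issued_at,
                     new_expiry - self.issued_at, self.max_ttl)


if __name__ == "__main__":
    issued = datetime(2021, 1, 1, tzinfo=timezone.utc)
    lease = Lease("example-lease-id", issued,
                  timedelta(hours=1), timedelta(hours=4))
    renewed = lease.renew(issued + timedelta(minutes=30), timedelta(hours=1))
    print(renewed.expires_at)
```

Once a lease has expired it cannot be renewed in this sketch, which mirrors the point above: renewal has to happen before expiry.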
I'm Lucy Davinhart, part of the Strawb System. Pronouns, they/them. I'm a HashiCorp ambassador, and for my day job, I'm a senior automation engineer at Sky Betting & Gaming.
I've been here since 2016, and both myself and my team have changed in structure and name a couple of times since then.
But one thing has remained constant, which is that we look after our Vault clusters. This includes maintaining integrations and tooling, and supporting all of our internal users.
Today, I'm going to be giving you an overview of how we started using Vault and its dynamic secrets engines. I'll talk about 3 of those in particular, AWS, PKI, and GCP, and what we've done to make the developer experience as easy as possible.
I'll talk about a couple of ways that we can improve upon what we've already got and explore what things might look like if we started again from scratch today.
Back in the day, we were using Vault open source 0.6. AWS had only around 32 products and services instead of the 200 it has these days, and an internal dev tool named PSCLI was a mere twinkle in the eye of our principal engineer.
We had a few AWS accounts that we managed by hand. Our users would log into those with their individual IAM users, and for 1 or 2 accounts, that's an OK proof of concept. But as we were going to start scaling things up, we needed a better solution.
Around the same time, we were looking into Vault as a replacement for our legacy secrets store. Conveniently, Vault also had a potential solution for our AWS issue. As you saw earlier, Vault can generate AWS credentials.
But what do our users do with these once Vault has generated them?
There are a couple of options. You could store them in environment variables, or put them in your AWS credentials file.
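As a rough sketch of those two options, the helpers below write a profile into an AWS credentials file and build the equivalent environment variables. The function names and the example profile are made up for illustration; the key names and variable names are the standard ones the AWS CLI and SDKs read.

```python
import configparser
from pathlib import Path
from typing import Dict, Optional


def write_aws_profile(credentials_file: Path, profile: str, access_key: str,
                      secret_key: str, session_token: Optional[str] = None) -> None:
    """Add or replace one profile, leaving other profiles in the file alone."""
    config = configparser.ConfigParser()
    config.read(credentials_file)  # a missing file is simply skipped
    config[profile] = {
        "aws_access_key_id": access_key,
        "aws_secret_access_key": secret_key,
    }
    if session_token:
        config[profile]["aws_session_token"] = session_token
    credentials_file.parent.mkdir(parents=True, exist_ok=True)
    with credentials_file.open("w") as handle:
        config.write(handle)
    credentials_file.chmod(0o600)  # keep the secrets private to the user


def as_environment(access_key: str, secret_key: str,
                   session_token: Optional[str] = None) -> Dict[str, str]:
    """Return the same credentials as environment variables."""
    env = {
        "AWS_ACCESS_KEY_ID": access_key,
        "AWS_SECRET_ACCESS_KEY": secret_key,
    }
    if session_token:
        env["AWS_SESSION_TOKEN"] = session_token
    return env
```

Calling `write_aws_profile(Path.home() / ".aws" / "credentials", "example-account", ...)` would make the credentials available to any tool that honours `--profile example-account` or `AWS_PROFILE`.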
We did some initial tooling for our developers to help them out with that. But it was very proof of concept, and we could do something better.
This is where PSCLI (a Perplexingly Snazzy Command-Line Interface) comes in.
PSCLI is an internal dev tool that we manage on our team. It helps standardize and simplify developers running a bunch of DevOps tools on their laptops and in our CI/CD environment.
We've made it as easy as possible to install, to encourage people to use it, and it self-updates, so we can reliably assume that people are on the latest version.
I could do, and have done, an entire talk on this alone, but all you really need to know for now is that it launches tools like Kitchen, Terraform, the AWS CLI, etc., in Docker containers, automatically binding in whatever directories and credentials the user needs.
The aim is that you have the developers run the same thing locally as they do in Jenkins to reduce any unexpected issues.
Fast forward to today, and we now have over 150 AWS accounts. Each of these is configured very similarly, with a read-only and an admin role in it.
Admin has some restrictions on it to prevent modifying our guardrails. But beyond that, it's pretty open. Several accounts have a few more granular roles, but for the most part, our users are fine with the default.
Each account has an IAM user in it for Vault, which means that each account corresponds to a single secrets engine in Vault.
This means that the canonical name for a particular AWS account is the name of that secrets engine.
This also allows us to reuse role names across different accounts and configure leases differently per account.
It's also useful from a security perspective because it limits the scope of an individual secrets engine.
We've set things up so that each role corresponds to a single policy. And each of those policies maps to a single LDAP group with a predictable name. This way, gaining access to a specific AWS account just requires requesting that your team gets added to the development LDAP group.
We also set things up so that if you have the admin policy, you also have access to the read-only role. This is to encourage good practice and allows people to not use the admin role all the time, if they don't actually need it.
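The layout described above (one secrets engine per account, one policy per role, one LDAP group per policy) can be sketched as a naming convention. Everything concrete here is hypothetical: the `aws/<account>` mount prefix and the policy and group name patterns are invented for the example. Only the `<mount>/creds/<role>` shape comes from Vault's AWS secrets engine.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AwsAccess:
    account: str  # canonical account name, also the secrets engine name
    role: str     # e.g. "read-only" or "admin"

    @property
    def creds_path(self) -> str:
        # Hypothetical mount prefix "aws/"; "<mount>/creds/<role>" is the
        # path shape the AWS secrets engine uses for generating credentials.
        return f"aws/{self.account}/creds/{self.role}"

    @property
    def policy_name(self) -> str:
        return f"aws-{self.account}-{self.role}"  # hypothetical pattern

    @property
    def ldap_group(self) -> str:
        return f"vault-{self.policy_name}"  # hypothetical pattern


def roles_granted(role: str) -> List[str]:
    """Admin access also includes the read-only role, as described above."""
    return ["admin", "read-only"] if role == "admin" else [role]


def policy_paths(access: AwsAccess) -> List[str]:
    """Every credentials path the policy for this role should allow."""
    return [AwsAccess(access.account, granted).creds_path
            for granted in roles_granted(access.role)]
```

With predictable names like these, a tool only needs the account and role from the user; the Vault path, policy, and group all follow from them.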
The first thing we did was reimplement that AWS Vault helper tool. When a user runs PSCLI AWS keys (and because I've not used it yet today, it's going to prompt me for my LDAP credentials), it will show me a menu of all the accounts that I have access to. I'll pick one of those.
Then I'm asked to pick a role, and it will generate credentials for me from Vault and store them in my AWS credentials file.
Our users don't need to worry about the full path within Vault. They just need to know which account and which role they want. They can also specify these with flags to the commands, so that they can integrate this with any scripts that they may want to write.
That menu on the screen is fairly new. Back when there weren't too many accounts, it wasn't that big of a deal to show all of that in the menu, even if users only had access to 1 or 2 of them. But when we got to the point where we started filling up half the user's screen, we realized we probably wanted to start filtering that down a bit.
At the time, Vault's sys/capabilities endpoint only allowed you to specify 1 path at a time, which is a shame, because otherwise that would have made for a really elegant solution.
What we ended up doing instead is having PSCLI itself parse the user's policies and figure out which AWS accounts they have access to.
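One way such policy parsing might look is sketched below. This is not PSCLI's code: it assumes policies are available as HCL text, assumes the hypothetical `aws/<account>/creds/<role>` path layout, and ignores glob paths and templated policies entirely.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Set

# Matches: path "<path>" { ...body... }
POLICY_BLOCK = re.compile(r'path\s+"([^"]+)"\s*\{([^}]*)\}')
# Hypothetical mount layout: aws/<account>/creds/<role>
AWS_CREDS_PATH = re.compile(r"^aws/([^/]+)/creds/([^/]+)$")


def accounts_from_policies(policies: Iterable[str]) -> Dict[str, Set[str]]:
    """Map each AWS account name to the roles the policies allow reading."""
    accounts: Dict[str, Set[str]] = defaultdict(set)
    for document in policies:
        for path, body in POLICY_BLOCK.findall(document):
            match = AWS_CREDS_PATH.match(path)
            if not match:
                continue  # not an AWS credentials path
            capabilities = set(re.findall(r'"([a-z]+)"', body))
            if "deny" in capabilities or "read" not in capabilities:
                continue
            account, role = match.groups()
            accounts[account].add(role)
    return dict(accounts)


EXAMPLE_POLICY = '''
path "aws/example-dev/creds/read-only" {
  capabilities = ["read"]
}
path "aws/example-prod/creds/admin" {
  capabilities = ["deny"]
}
path "secret/data/team/*" {
  capabilities = ["read", "list"]
}
'''

if __name__ == "__main__":
    print(accounts_from_policies([EXAMPLE_POLICY]))
```

The result can drive the menu directly: only accounts that appear as keys are shown, and only their listed roles are offered.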
Now we have a means of generating credentials, storing them in a location, available to our users, and then they can use their preferred tooling to interact with AWS from there.
But it's still 2 commands, one to get the credentials, and then the next one to run the tools.
But if we think back to what PSCLI fundamentally is, it's a thing that runs Docker containers with any prerequisites that you need. And we have within it the means to generate these AWS credentials.
So why not just have it also run AWS CLI? And so we did, and it works really well.
At the time, AWS didn't have the same level of multi-account single sign-on as it does these days, but a lot of our users want to be able to sign in to the AWS console in their web browser.
We want people to exclusively use Vault for their AWS credentials. So we needed to find a way of supporting that.
AWS, as it happens, does provide a method to do this, along with some example code, and we have our little tool that generates AWS credentials from Vault.
Now we can also take those credentials, launch one of these magic URLs in your browser, and leave you logged into the AWS console.
It looks a bit like this, on your screen. As before, I'm given a menu of accounts and roles that I can access at this time. In addition to storing my credentials in my AWS credentials file, it generates one of these AWS federated login URLs and opens it for me in my browser.
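The federated login flow can be sketched in two steps: exchange temporary credentials for a sign-in token at the AWS federation endpoint, then build a login URL from that token. The sketch below only constructs the two URLs and makes no network calls. The function names and the `issuer` value are placeholders, and it assumes the credentials include a session token, since the federation endpoint works with temporary credentials.

```python
import json
from urllib.parse import urlencode

FEDERATION_ENDPOINT = "https://signin.aws.amazon.com/federation"


def signin_token_url(access_key: str, secret_key: str, session_token: str) -> str:
    """URL that, when fetched, returns JSON containing a 'SigninToken'."""
    session = json.dumps({
        "sessionId": access_key,
        "sessionKey": secret_key,
        "sessionToken": session_token,
    })
    query = urlencode({"Action": "getSigninToken", "Session": session})
    return f"{FEDERATION_ENDPOINT}?{query}"


def console_login_url(signin_token: str,
                      destination: str = "https://console.aws.amazon.com/",
                      issuer: str = "example-tool") -> str:
    """URL that logs the browser into the console using the sign-in token."""
    query = urlencode({
        "Action": "login",
        "Issuer": issuer,  # placeholder; identifies who issued the link
        "Destination": destination,
        "SigninToken": signin_token,
    })
    return f"{FEDERATION_ENDPOINT}?{query}"
```

A tool would fetch the first URL, read `SigninToken` from the JSON response, and pass the second URL to something like `webbrowser.open`.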
Around the same time, we were looking into how we manage things that we're putting into AWS with some sort of infrastructure as code. Our job titles are variations upon "automation engineer," after all, which is kind of the opposite of clicking buttons in the AWS console.
It shouldn't be much of a surprise that we went with Terraform for this. Again, we had the same issue of making sure that things that developers could run on their laptops would also run in Jenkins, which made it the natural fit to go into PSCLI.
We could leverage the existing Vault and AWS code to help with that. Now devs don't need to specify anything in particular in their Terraform code or set any environment variables and so on. They just need to specify the account and the role that they need.
But wait! Didn't my talk description say something about users not needing to specify? I did say that, and there are a couple of things that we do, most of which apply to Terraform, that make things easier for our users.
Firstly, a single Terraform code repository tends to only need access to a single AWS account.
Therefore PSCLI can look for a config file in that repository. If you specify the name of the AWS account in that file, PSCLI will use that automatically.
The other thing we do is, because we know how AWS works and how Terraform works and because all our accounts are configured similarly, we can make the assumption that unless the user tells us otherwise, they're probably only going to need read-only credentials, if they're doing a Terraform plan, for example.
Similarly, if they're doing an apply, they're probably going to want admin credentials. There are some use cases where this assumption wouldn't hold true, but for the majority of people, it works.
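Those two defaults, an account from a repository config file and a role inferred from the Terraform command, might be sketched as follows. The file name `.pscli.json`, its `aws_account` key, and the choice to treat `destroy` like `apply` are all assumptions for this example and not taken from PSCLI.

```python
import json
from pathlib import Path
from typing import Optional

CONFIG_NAME = ".pscli.json"             # hypothetical config file name
ADMIN_COMMANDS = {"apply", "destroy"}   # assumption: these change resources


def find_account(start: Path) -> Optional[str]:
    """Walk up from 'start' until a config file naming the account is found."""
    start = start.resolve()
    for directory in [start, *start.parents]:
        candidate = directory / CONFIG_NAME
        if candidate.is_file():
            return json.loads(candidate.read_text()).get("aws_account")
    return None


def pick_role(terraform_command: str, override: Optional[str] = None) -> str:
    """Default to read-only unless the command needs to modify things."""
    if override:
        return override  # the user told us otherwise
    return "admin" if terraform_command in ADMIN_COMMANDS else "read-only"
```

Here `pick_role("plan")` gives `"read-only"` and `pick_role("apply")` gives `"admin"`, while an explicit override always wins, which covers the use cases where the assumption does not hold.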
Next up is PKI, and more specifically, how we use it for Kubernetes authentication.
We started with K8s back in 2017-ish. Our initial implementation didn't even have any role-based access control, but we definitely needed something in place before we could use this for production.
At the time, OpenID Connect existed, but there weren't really any great implementations for it yet, or at least none that were user-friendly and could be used in Jenkins.
Our K8s admins were already using certificate authentication for node-level credentials. Using that for user authentication seemed reasonable, and Vault already exists and has this nice LDAP integration, so they could piggyback off of that.
In K8s, certificate authentication is based on X.509 public certificates and private keys. The K8s cluster trusts a particular certificate authority to generate these.
The Kubernetes certificate authentication docs use this example, specifying a username as the common name and 2 groups as the organizations. Those groups are then mapped to Kubernetes roles and associated permissions.
This sounds a lot like something that Vault PKI can do. Now, our implementation is a bit more flexible than what we've got for AWS in that it's not tied to specific paths.
We have some static secrets where we define some cluster-level metadata, which includes things like the API server for the cluster, whether it's production, and the location of the secrets engine in Vault that you can generate your certificates from.
As for the PKI side of things, each secrets engine corresponds to a particular K8s cluster and then roles within that secrets engine are set up to grant specific permissions.
This one, for example, generates certificates with 2 organizations, infra-vault-admin, and common-read-only. Again, these correspond to groups in K8s and, from there, roles.
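To illustrate how a certificate subject carries identity, the sketch below builds and parses an OpenSSL-style subject string, where the common name becomes the username and each organization becomes a group. The helper names and the example username are invented; the group names are the ones mentioned above. It handles only simple `/CN=.../O=...` strings, not full X.509 parsing.

```python
from typing import List, Tuple


def build_subject(username: str, groups: List[str]) -> str:
    """Build an OpenSSL-style subject: CN is the user, each O is a group."""
    return f"/CN={username}" + "".join(f"/O={group}" for group in groups)


def parse_subject(subject: str) -> Tuple[str, List[str]]:
    """Return (username, groups) from a simple '/CN=.../O=...' subject."""
    username = None
    groups: List[str] = []
    for part in subject.strip("/").split("/"):
        key, _, value = part.partition("=")
        if key == "CN":
            username = value
        elif key == "O":
            groups.append(value)
    if username is None:
        raise ValueError("subject has no common name")
    return username, groups


if __name__ == "__main__":
    subject = build_subject("example-user",
                            ["infra-vault-admin", "common-read-only"])
    print(subject)
    print(parse_subject(subject))
```

Because the groups travel inside the certificate, a PKI role that fixes the organizations effectively fixes which K8s groups, and so which roles, the holder gets.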
On the policy side, we've learned from AWS pain points and managed to make the onboarding process a lot smoother.
AWS was sort of the Wild West for us, and with K8s, we provided people with standard templates for what to request in terms of policies, groups, AppRoles, etc.
Later we even automated that. And like with AWS, if we grant access to admin, we also grant access to read-only, but also automatically set up a Jenkins AppRole, which corresponds to specific K8s roles.
Then we include that in our policies as well. So there's much less effort on the part of our developers and it's much more standardized.
It's also integrated with PSCLI and works similarly to what we've already seen for AWS. PSCLI runs kubectl with an auth plugin that integrates with Vault.
It has a similar menu where you specify a cluster and then a role within that cluster. Then it automatically populates your K8s configuration file with any certificates that you need. And because this was developed a lot more recently, we can filter that menu by making use of Vault's sys/capabilities to tell us what permissions we have across all of our clusters in a single Vault call.
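A sketch of that single-call filtering is shown below. It builds a multi-path request body and filters a response that maps each path to its capabilities. The mount and role names are invented, the response shape is an assumption to check against the Vault API docs, and requiring `update` reflects that issuing a certificate is a write to `<mount>/issue/<role>`.

```python
from typing import Dict, List


def issue_paths(clusters: Dict[str, List[str]]) -> List[str]:
    """Expand {pki_mount: [roles]} into '<mount>/issue/<role>' paths."""
    return [f"{mount}/issue/{role}"
            for mount, roles in clusters.items()
            for role in roles]


def capabilities_payload(paths: List[str]) -> Dict[str, List[str]]:
    """Request body asking about many paths in one call."""
    return {"paths": paths}


def usable_paths(requested: List[str],
                 response: Dict[str, List[str]]) -> List[str]:
    """Keep only the paths the token can actually issue certificates from."""
    usable = []
    for path in requested:
        capabilities = response.get(path, [])
        if "deny" in capabilities:
            continue
        if "update" in capabilities or "root" in capabilities:
            usable.append(path)
    return usable


if __name__ == "__main__":
    paths = issue_paths({"pki/example-cluster": ["admin", "read-only"]})
    fake_response = {
        "pki/example-cluster/issue/admin": ["deny"],
        "pki/example-cluster/issue/read-only": ["update"],
    }
    print(usable_paths(paths, fake_response))
```

The menu then only needs to show the clusters and roles behind the surviving paths.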
We've only been using GCP (Google Cloud) for a matter of months, so it's not as mature as our other integrations. And we've not quite found all the pain points yet.
Our GCP org is set up with various folders, each of which has sub-folders and maybe even sub-folders within those.
But eventually we'll get to projects. We have a single Vault project where we store all of our Vault service accounts. Service accounts in that project correspond to a secrets engine in Vault.
Then those are mapped to specific sub-projects within GCP, where they have permission to generate credentials.
We could technically do all of this with a single secrets engine mapped to the root of the org, but we decided not to do that, to keep those secrets engines isolated, similar to what we do with AWS accounts.
In Vault, our secrets engines are named org folder/sub-folder.
These are automatically configured using Terraform, both on the GCP side and the Vault side, including automatic 30-day key rotation.
RoleSets in GCP correspond to service accounts in particular customer projects, and then are mapped to permissions within that and/or any other customer projects in that folder.
For the PSCLI integration side, we were initially required to have PSCLI support passing in GCP credentials before we were ready for Vault to generate those.
The solution that we came up with relies on the fact that tools like gcloud and Terraform pay attention to specific environment variables to decide where to look for GCP credentials.
The simple PSCLI integration for that is, if the user has specified one of those environment variables, then we bind that directory into the running container. And that works well enough.
And so when it came to generating the GCP credentials from Vault, we were able to add that functionality relatively easily.
PSCLI will request some GCP credentials from Vault, which includes the content of a Google credentials file. It'll store that in a temporary location and then set the relevant environment variables.
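A minimal sketch of that last step follows, assuming the key material arrives base64-encoded (as in the `private_key_data` field Vault's GCP secrets engine returns for service account keys). The function name is made up, and the choice of which environment variables to set is an assumption about what the wrapped tools read.

```python
import base64
import os
import tempfile
from typing import Dict, Optional


def write_gcp_key(private_key_data_b64: str,
                  directory: Optional[str] = None) -> Dict[str, str]:
    """Decode the key, store it in a private temp file, return env vars."""
    key_json = base64.b64decode(private_key_data_b64)
    # mkstemp creates the file readable and writable only by the current user.
    fd, path = tempfile.mkstemp(prefix="gcp-key-", suffix=".json", dir=directory)
    with os.fdopen(fd, "wb") as handle:
        handle.write(key_json)
    return {
        # Read by Google client libraries and the Terraform Google provider.
        "GOOGLE_APPLICATION_CREDENTIALS": path,
        # Read by gcloud as an override for its credential file.
        "CLOUDSDK_AUTH_CREDENTIAL_FILE_OVERRIDE": path,
    }
```

The returned mapping can be merged into the environment of the container being launched, and the file removed once the tool exits.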
At the moment, we're only making use of this type of credential called "service account keys" because they work with more tools. But this does mean that we have a limitation of only being able to generate 10 of these keys per service account at any one time.
For this reason, our PSCLI integration automatically revokes those credentials as soon as they've been used. They're single-use credentials.
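The single-use pattern can be sketched with a context manager that always revokes, even if the wrapped tool fails. The `generate` and `revoke` callables are stand-ins for whatever talks to Vault; their signatures are assumptions for this example.

```python
from contextlib import contextmanager
from typing import Callable, Dict, Iterator, Tuple


@contextmanager
def single_use_credentials(
    generate: Callable[[], Tuple[str, Dict[str, str]]],
    revoke: Callable[[str], None],
) -> Iterator[Dict[str, str]]:
    """Yield fresh credentials and revoke their lease as soon as we're done."""
    lease_id, credentials = generate()
    try:
        yield credentials
    finally:
        # Runs on success and on failure, so keys never linger and count
        # against the per-service-account key limit.
        revoke(lease_id)


if __name__ == "__main__":
    revoked = []

    def fake_generate():
        return "example-lease-id", {"private_key_data": "..."}

    with single_use_credentials(fake_generate, revoked.append) as creds:
        print("running tool with", sorted(creds))
    print("revoked:", revoked)
```

Revoking in `finally` is the design choice that matters here: a crashed tool run still frees its key.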
This is not a solution I'm massively happy with, which, as I wrote it, I'm allowed to say. But I know that this is going to evolve over time and become more mature in its implementation, like what we have for AWS and K8s.
But with the limited time we had to get something working, we focused on getting the developer experience right first, because if we needed to make changes to that down the line, it may not be backwards-compatible for our users.
And that meant we could get something out that our users could use sooner, identify the problems, and then improve the technical implementation later.
I've been calling the way our tooling works "magic" because, for the most part, it's invisible to our users. As far as they're concerned, it just works.
And developers get used to things working in a certain way. But what that means is that when we hit edge cases or start adding new functionality which doesn't quite fit the assumptions the magic relies on, when stuff breaks, our developers don't know why because they don't know how the magic works.
For starters, there's a lot less magic in how our GCP integration works. It doesn't matter so much because, for the most part, Vault-generated GCP credentials are meant to be used by machines, and machines will never see the magic.
For example, like in the early days of AWS, that menu where you can select your GCP folder and RoleSet, it'll just show you everything.
Like I say, though, this is going to mature over time and we'll take the example of what we do for K8s roles and figure out what permissions users have in a similar way.
Another notable example is that in AWS, each account has an admin and a read-only role, which means that for Terraform, we can automatically pick the role for the user, depending on which Terraform command they're running.
In GCP, we have much less control over that. So there are no guarantees, and we're deliberately trying to avoid having generic editor and viewer roles.
While we can specify which GCP org folder and RoleSet to use in a config file, we can only specify a single RoleSet, which would then apply to all Terraform commands.
And we may be making some changes to AWS at some point. So that assumption of read-only and admin may not hold true there for long.
I'm not exactly sure how things are going to end up looking for us, but it feels like a problem that is solvable, if we think it's worth it. It just means that we need to take a step back, take a look at those assumptions that we've made, and establish a more generic and flexible way of doing things.
But even at Sky Betting & Gaming, this is not the be all, end all, "you must do things this way, because we say so." We don't want to do that. Even if we did, we have no way of enforcing it.
What we want to do is make it as easy as possible for people to use Vault. We want to provide and maintain golden paths, get people and especially new users saying hello to the world as quickly as possible without making lives difficult for people who do want to do something more complicated.
There are many ways of doing similar things, even without our integrations. Terraform, for example, can pull in AWS and GCP credentials from Vault without the need for any fancy AppRole tools. Vault Agent can template out an entire AWS credentials file if that's appropriate for your use case.
And if people want to do that sort of thing instead, as long as they've thought through the implications of that, it's not really our place to say no.
I'm left wondering, if we were starting again from scratch today, with everything we know from our 5 years of running Vault and all the shiny new features it has these days, what would our solutions look like now?
No doubt they'd be different, but I suspect the core principles would be the same, that being maintaining golden paths, but not making it difficult for people who want to stray from the path.
But we'd want to put a bit more thought into the assumptions that our magic relies on so that our tooling is more flexible for the situations where those assumptions do not hold true.
It may even be the case that our implementations are more lightweight. The Hashi stack does a lot of stuff itself these days.
But whatever we would end up developing, I'd prioritize making that configuration on the Vault side as flexible as possible, and then getting experimental integrations out to end users and seeing how it works, which is basically the approach we tried to take with GCP secrets engines.
I'd also want to look into what the secrets engines have in common. It may be that we can create 1 generic integration that is flexible enough for a variety of secrets engines.
We couldn't really do that before. An established user base is not something we had the luxury of when we started. Our AWS integration was built for us first, and real users from the rest of the business came quite a while later.
I think because we had some established integrations in place for quite a while, and they were presented to our users fully formed by necessity, we were worried about breaking things when we made changes to the magic.
With the number of users that relied on things working a certain way, we put off important changes possibly a bit longer than we should have done.
It turns out, though, when we did need to make changes to how things worked, and we knew what would likely break for people, we were mostly overthinking it.
If we knew what was likely to break and how people could prepare for that or repair afterwards, it wasn't really that big a deal so long as we gave people advance warning.
How about yourselves? This wouldn't be much of a talk if I didn't proclaim to drop some pearls of wisdom at the end.
Besides the obvious answer of "It depends," my general advice would be: Make your configurations flexible and intuitive.
If you do decide to add some magic around them, make sure you question the assumptions that the magic relies upon. If you can, get things in front of real users as soon as possible, and don't be afraid to break things.
Thank you all for listening. I'm Lucy Davinhart. I hope you enjoy the rest of the conference.