Learn how Bench Accounting uses HashiCorp Vault's dynamic credential rotation capabilities to drastically mitigate the potential damage of a data breach.
The longer passwords exist, the higher the chance they will be compromised. At Bench Accounting, HashiCorp Vault has allowed the team to use a role-based access control model to store and read secrets, authenticating with AWS, Kubernetes, and SAML. Vault can also generate database credentials based on configured roles, allowing engineers to provision ephemeral credentials (also called dynamic secrets) across both Postgres and MySQL databases.
This includes roles for Kubernetes services and engineer read, write, and admin access. Different roles have permissions in the form of SQL grants and TTLs depending on the requirements.
In this talk, Bench engineers address the technical choices they made and some issues they encountered, comparing the workflow of Vault and MySQL vs. Vault and Postgres, and the mechanisms they use for provisioning and renewing ephemeral database credentials. Finally, they conclude with a demo showing how this is managed with Terraform and Vault.
Phil Whelan: Welcome to your new job at Bench Accounting. To start, you're going to need a username and password to our main database. The username is admin; password is X1234QRT. If you can't remember that, you'll find it in the main application repo. It's in a config file on GitHub. It's also scattered across lots of scripts and the marketing team has the same username and password. You can always ask them. I'm not sure why they need it.
I feel there's a better way to do this—and there is, which is what we're going to be talking about.
This is Alberto Alvarez. He's the platform leader at Bench Accounting. He makes sure everything runs smoothly. I'm Phil Whelan, principal engineer at Bench Accounting and I make sure Alberto runs smoothly.
We will tell you a little bit about who we are, why we use Vault, and then get to how we manage ephemeral database credentials. Then Alberto has got a good demo using Docker Compose, Terraform, and a Postgres database. It's on GitHub and you can download it.
We're from Bench Accounting, as we mentioned. Otherwise known as Bench. We're not a clothing company, which most people think we are. We're one of Vancouver's best-funded startups. Lots of bookkeepers and lots of services. And underneath, Software as a Service—that's using a lot of HashiCorp tools such as Vault, Packer, Terraform and Vagrant.
How do you go about managing credentials? You can put them in a config file and put them in your Git repo and upload it to GitHub, but that’s generally not recommended.
Or you can take them out of your config file and put them somewhere else—maybe in a database? But then how do you manage the credentials to get the credentials? Or you could encrypt them, but then where do you put the decryption keys? This is what HashiCorp Vault does for those unfamiliar.
We've been using HashiCorp Vault for a couple of years now, for lots of reasons. We think it's secure; it encrypts all our secrets and credentials. It's got good policy-based access, and we can authenticate with single sign-on—we use Okta—or applications can authenticate with the Kubernetes or AWS backends.
It's got lots of flexibility. You can do lots of things with tokens, such as single-use wrapped tokens—which are cool—and it has lots of backends to plug in. There's a quote from a user of HashiCorp Vault; it's very popular.
When we started using HashiCorp Vault, it was for persistent credentials. We took all our special secrets and we put them in Vault, which has been great. The problem with persistent credentials is that they're very secure while they're in Vault, but at some point you've got to take them out of Vault to use them. That may be applications taking them out; that might be engineers. You can't control how they're passed around. One engineer might give a credential to another engineer because that engineer is not set up properly.
Maybe you have an outage, and someone shares them on Slack, or maybe they accidentally put them somewhere. Alberto might have put them on his personal laptop because he needed to use that. Over time they spread around. Generally, the idea is the longer credentials live, the less secure they are.
What if the credentials didn't live forever and they expired after a certain amount of time? What would that look like? This is what Vault's database secrets engine does. It allows you to set up Vault to create credentials on the fly using SQL. You can give it SQL to say: “For this particular role, provision these credentials, and then when the TTL (time-to-live) expires, revoke it with this SQL.”
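The creation and revocation SQL you hand to Vault might look something like this for Postgres—a hedged sketch, with the grants and role name purely illustrative. Vault fills in the `{{name}}`, `{{password}}`, and `{{expiration}}` templates at provisioning time:

```sql
-- Illustrative creation statements for a hypothetical "read" role.
-- Vault substitutes {{name}}, {{password}}, and {{expiration}} per request.
CREATE ROLE "{{name}}" WITH LOGIN PASSWORD '{{password}}' VALID UNTIL '{{expiration}}';
GRANT SELECT ON ALL TABLES IN SCHEMA public TO "{{name}}";

-- Illustrative revocation statements, run when the TTL expires:
REVOKE ALL PRIVILEGES ON ALL TABLES IN SCHEMA public FROM "{{name}}";
DROP ROLE IF EXISTS "{{name}}";
```

What "read", "write", or "admin" means is entirely defined by the grants you write here; Vault just runs them.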
If you're going to have Vault provisioning credentials for all kinds of purposes—admin read, write—then you need to trust Vault because you're giving it very high-level access. By default, you'll probably give it the root credentials. You can give Vault your root credentials and have Vault rotate those root credentials for you.
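Vault's database secrets engine exposes a rotate-root endpoint for this. A minimal sketch, assuming a connection configured under the illustrative name `my-postgres`:

```shell
# After this, only Vault knows the root password for this connection.
vault write -force database/rotate-root/my-postgres
```

Once rotated, no human holds the root credential anymore, which is why having an out-of-band recovery path (like the RDS password reset mentioned below) matters.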
We run everything on AWS RDS. The worst-case scenario is, if something went wrong with Vault, we could always get back into the database, because AWS has a dialog for this—if you have the right access, you can reset the root password. That's good—Jeff Bezos to the rescue.
It's very easy to set this up on the command line. It's three commands. You enable the database plugin, and you give it the access, the username, and password. Then you tell it what the role is—what SQL to run when you create the credentials, what SQL to run when you revoke the credentials, and how long you want those credentials to live. Then anyone who accesses those credentials needs to have the policies to access that role.
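Those three steps might look like this on the CLI—a sketch only, with the mount path, connection name, role name, and TTLs all illustrative:

```shell
# 1. Enable the database secrets engine.
vault secrets enable database

# 2. Configure the connection: plugin, URL, and the privileged credentials
#    Vault will use to provision users (illustrative values).
vault write database/config/my-postgres \
    plugin_name=postgresql-database-plugin \
    connection_url="postgresql://{{username}}:{{password}}@db.example.com:5432/app" \
    allowed_roles="read" \
    username="vault-admin" \
    password="example-password"

# 3. Define a role: the creation/revocation SQL and how long credentials live.
vault write database/roles/read \
    db_name=my-postgres \
    creation_statements=@create-read.sql \
    revocation_statements=@revoke-read.sql \
    default_ttl=24h \
    max_ttl=240h
```

Anyone requesting credentials from `database/creds/read` then needs a Vault policy granting read on that path.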
We don't do this manually; we manage it all through Terraform. We have about 22 databases across two environments, and we have 20 or so engineers. Managing all these credentials across all these databases—especially if we were going to have different levels of access for each engineer—that would be lots and lots of credentials to manage. But using Terraform and Vault makes it very easy. This is what Alberto is going to demo in a moment.
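In Terraform, the same setup can be expressed with the Vault provider's database-backend resources. A hedged sketch—resource and role names are illustrative, not Bench's actual configuration:

```hcl
# Mount the database secrets engine.
resource "vault_mount" "db" {
  path = "database"
  type = "database"
}

# Tell Vault how to reach the database (illustrative connection values).
resource "vault_database_secret_backend_connection" "app" {
  backend       = vault_mount.db.path
  name          = "app-db"
  allowed_roles = ["read"]

  postgresql {
    connection_url = "postgresql://{{username}}:{{password}}@db.example.com:5432/app"
  }
}

# A read role: creation SQL, plus a one-day default TTL.
resource "vault_database_secret_backend_role" "read" {
  backend     = vault_mount.db.path
  name        = "read"
  db_name     = vault_database_secret_backend_connection.app.name
  default_ttl = 86400

  creation_statements = [
    "CREATE ROLE \"{{name}}\" WITH LOGIN PASSWORD '{{password}}' VALID UNTIL '{{expiration}}';",
    "GRANT SELECT ON ALL TABLES IN SCHEMA public TO \"{{name}}\";",
  ]
}
```

Multiplying a handful of role resources across 22 databases and two environments is exactly the kind of repetition Terraform modules handle well.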
Getting onto the application. If your credentials are going to expire and your application is running, at some point your application is not going to be too happy when the credentials expire— it's going to start doing some bad things. What do you do in that situation?
Well, there are a couple of options. You can make your applications Vault-aware, you can embed some clients in there that know when the credential is going to expire—request new credentials and switch over seamlessly.
This isn't what we do—primarily because we didn't want our applications to be Vault-aware. We have lots of applications in different languages: Python, Scala, Java, NodeJS, Golang. We wanted to keep our Vault knowledge—not necessarily away from engineers—but not have them deal with it.
We created a service that runs in Kubernetes—like all our applications—and it looks for pods holding database credentials that are about to expire. It kills those pods. Kubernetes brings up a new instance of the pod, and when that new instance comes up, it requests new credentials with a new expiry. This has been working well.
The general flow is demonstrated here. All our applications are running in Docker containers in Kubernetes. Kubernetes will bring up the Docker container. When that happens, the first thing to run is an application called vaultenv—vaultenv is our interface with Vault.
You give it a config file—it's got a list of the environment variables and the secrets you want to map them to. So vaultenv will connect to Vault. Vault will then check with Kubernetes to make sure the service account that the application is using is valid and has access. It will then check the policies in Vault to make sure that whatever that service account has is mapped to those roles.
If it is—and we're requesting database credentials—it will talk to the application database, provision those new credentials by running the SQL to create them. It will then pass everything back to vaultenv, which will then map them to the environment variables. Vaultenv will exec the application process. The application then becomes PID 1 inside the Docker container, and then the application can talk to the database with those credentials. Simple.
If you do use vaultenv, you will need to know about secrets.conf. It's the file that tells vaultenv which environment variables map to which secrets. You can see down at the bottom the ephemeral credentials—under the main database—but they work in the same way.
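The exact file syntax varies across vaultenv versions, so check the vaultenv documentation for yours; conceptually, each entry names a Vault path and a key within it, and vaultenv exposes that value as an environment variable. A purely illustrative sketch (paths and key names made up):

```
secret/production/app#apiKey
database/creds/app-db-read#username
database/creds/app-db-read#password
```

The last two entries are the ephemeral database credentials: vaultenv fetching them is what actually triggers Vault to run the creation SQL and mint a new user.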
For engineers, we created a script called db-creds. It allows any engineer in any of our environments to request any level of access to any database. They run db-creds write and then the name of the database. That authenticates them with our single sign-on, which is Okta. If they have the right access—which they all do—they get those credentials, and they can get into that database.
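A wrapper like db-creds could be sketched in a few lines of shell—this is a hypothetical reconstruction, not Bench's actual script, and the secrets-engine path convention is an assumption:

```shell
#!/bin/sh
# Hypothetical db-creds sketch.
# Usage: db-creds <read|write|admin> <database>
level="$1"
db="$2"

# Authenticate via Okta SSO if we don't already hold a valid token.
vault token lookup >/dev/null 2>&1 || vault login -method=okta username="$USER"

# Ask Vault to mint a fresh ephemeral credential for this database and level.
vault read "database/creds/${db}-${level}"
```

The engineer never sees a long-lived password; whatever is printed here expires on its TTL.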
The flow is simple. I won't go into too much detail with it. The main difference is they're authenticating with Okta—our SSO in the background—instead of Kubernetes. One of the things that came up when we were doing this is that engineers didn't need to request credentials before. This was quite a change—making them repeatedly fetch credentials—and there wasn't an easy way to integrate this with their tooling. We had to find a sweet spot between having them ask too much and having a long TTL. That was something we didn't even consider when we started.
We really like Vault’s auditability. Everything that happens in Vault goes into an audit log. We were able to pump that all into Splunk and then we can monitor that audit log for certain events. For instance, people asking for admin credentials or write credentials on production databases. That's a rare situation.
It could happen in an outage, for instance. But we have a Slack channel where we see a feed of these things. If we see any strange patterns, we can ask people, “What are you doing? Why do you keep asking for write credentials on the database?” That's good. But it also allows us to let our engineers do what they need to do.
We have both MySQL and Postgres databases. MySQL was straightforward: we wrote the create and revoke SQL for provisioning credentials. Postgres was a little different, in that when an ephemeral user creates something in your schema, such as a table, you can't delete the credentials because the user still owns that table. You have to have it assume a role—so that everything in the schema is owned by that role—and then the credentials can be revoked when they expire.
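The Postgres gotcha in miniature—identifiers are illustrative, and the error comment paraphrases what Postgres reports:

```sql
-- An ephemeral user creates a table...
CREATE ROLE "v-write-abc123" WITH LOGIN PASSWORD 'x' VALID UNTIL '2030-01-01';
SET ROLE "v-write-abc123";
CREATE TABLE widgets (id int);  -- now owned by the ephemeral user
RESET ROLE;

-- ...so revocation fails, because the user still owns an object:
DROP ROLE "v-write-abc123";
-- ERROR: role "v-write-abc123" cannot be dropped because
--        some objects depend on it
```

MySQL doesn't tie object ownership to users this way, which is why its revocation SQL stayed simple.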
Generally, when you add this, it's another piece in your architecture, and it adds complexity. We think it's well worth that complexity, but it's something to keep in mind.
We did have one outage where we thought there was something wrong with Vault because it couldn't provision credentials, so the application couldn't start up. It turned out that the database was overloaded in a certain way and wasn't able to provide those credentials to Vault. It had us looking in the wrong place for a short while.
Vault has a hierarchy of TTLs. It's good to be aware of all these TTLs, because you might think you're setting a TTL of a certain length, but then another TTL that takes precedence might override it—which hit us a few times.
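For example, a mount-level maximum caps whatever a role asks for. A sketch with illustrative values:

```shell
# Mount-level cap: no lease under database/ can outlive 720h,
# regardless of what any role requests.
vault secrets tune -max-lease-ttl=720h database/

# Role-level TTLs: default_ttl and max_ttl here are still
# silently capped by the mount (and system) maximums above.
vault write database/roles/read \
    db_name=app-db \
    creation_statements=@create-read.sql \
    revocation_statements=@revoke-read.sql \
    default_ttl=24h \
    max_ttl=240h
```

If a lease comes back shorter than you expected, check the system and mount maximums before assuming the role definition is wrong.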
There are lots of things we could do with this. We have a lot of AWS services fetching data from our databases. We'd like to integrate that a lot more to be fetching ephemeral credentials.
We're not yet having Vault rotate the root password even though I insinuated that earlier. There's nothing stopping us doing that—we just haven't done it yet.
Then we'll keep reviewing how long credentials live, and try to shorten those TTLs as people get more comfortable, the tooling gets better, or we make a smoother experience for engineers. There are a lot more things we could do with ephemeral credentials—obviously, one being more databases.
We use Redshift a lot. It would be nice if we were also provisioning those credentials with Vault; that's on our to-do list. You can do this for many kinds of credentials—anything Vault has backends for. SSH access is a very interesting one that we wouldn't mind looking at.
Alberto has a demo. You can find this demo at this URL, and the slides will be available. So, over to you, Alberto.
Alberto Alvarez: I'm going to quickly show the demo here in GitHub. This is a public repo, and we made sure it's a working example—you are free to try it out and let us know. The idea is that we have a Docker Compose file here, which starts up a Vault server and a Postgres database server.
Then we have a few Terraform resources defined here: a Vault mount, a secret backend, and a couple of roles to illustrate the concepts that Phil was talking about. I've got a write role which grants write permissions to the ephemeral user—and then a read role.
We are going to try them both and understand some of the little gotchas. This is the README in here, and I've got a few terminals open. I'm going to walk through the README—it’s probably the easiest.
First, we need to have a couple of environment variables set on all of our terminals. This is to tell our Vault client where the Vault server is and what token to authenticate with. We are using this insecure Vault token, which we've also added to the server in the Docker Compose definition. Make sure that this is valid, and let's start up Docker Compose.
We are going to use Terraform to create a bunch of resources in the Vault server; then we can use our Vault client to generate credentials. One little thing here: I've got a tfvars file with the root-equivalent username and password for the database. I didn't want to add this into the code—even if it's just postgres and password—to keep some level of best practice here. Make sure that you have this file, because it's in the .gitignore—but this is in the README.
Let's initialize our backend. Then make sure that you have these environment variables, and let's run terraform plan. It's amazing when you have a remote state; it's fast, right? There are four resources in here—what I showed you earlier. We've got our backend connection, and then a couple of roles—the read role and the write role. You have the revocation and the grant statements.
Note that Vault doesn't have a predefined write or read role. It only runs the grant or revoke statement that you give it. It's completely up to you—what it means to your company, what is read, what is write, what is admin.
Then our mount. Let's go ahead and apply. You can see something happening here on the Docker Compose logs. Let's quickly use the Vault client and read our config—see if it created the resources in the Vault server.
We can see here that the database configuration—the secret backend for Postgres—has been created. You have some of the attributes there. Let's generate a pair of credentials. We're going to generate the write credentials for now.
We have the lease ID here, which we will use later. This is the lease duration—the TTL that Phil was talking about. This is going to last for 10 days. Then the username and password. To make things easier, I'm going to assign a couple of variables here, so I don't have to go copying and pasting all the time. Then I'm going to use a Postgres client to access my database running on Docker Compose.
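Capturing the lease ID, username, and password into shell variables might look like this—a sketch with an illustrative role path, assuming `jq` is available:

```shell
# Request write credentials as JSON and pull out the fields we need later.
creds=$(vault read -format=json database/creds/write)

lease_id=$(echo "$creds" | jq -r '.lease_id')        # needed to revoke later
username=$(echo "$creds" | jq -r '.data.username')
password=$(echo "$creds" | jq -r '.data.password')
```

Holding onto the lease ID is what makes the later revocation step a one-liner.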
Again, I'm using the default port on localhost—using my write user to connect to the Postgres database, and it requests a password, so I'm going to paste the password. Okay, we're in—let's display the users: postgres and my write user. There shouldn't be any tables here. Let's create a simple table—three columns—and then we are going to insert one row in there. Let's see if the cheese is in there.
We can write into the database. I'm going to exit Postgres, and I'm going to generate a set of read credentials now. It's the same process—we use our Vault key path, as you are probably familiar with. I'm also going to assign a couple of variables here—a read user and a lease. Later we can revoke the user—which I'm also going to show, in case you're wondering why I'm not using the lease yet.
Let's do the same. Connect to our Postgres database. Now we are using the read user. First of all, let's see the users. We see there are a couple of users here. Display the tables and then let's try to do the same. Select. It's able to read now.
Let's see if we can insert. I've got an insert command here. Do you think this is going to work? Let's see—we've got an error, and we can see it here in the logs. Obviously, the user doesn't have permissions. This is because in the grant statements—in the role's configuration—we didn't grant any update permissions; we only granted select. This is what we want. Let's quickly cover revocation, because this is one of the little issues we hit when using Postgres; I think it's nice to show. Let's exit—and we are going to use the Vault lease revoke command.
This is part of the Vault client. Imagine that your credentials got compromised. You could go into the database and delete the user, but you can also control this from Vault. It's a simple command—I'm going to revoke the read lease, which is going to end up revoking the read user. These revocations are actually asynchronous—the client is not going to wait for the revocation statement to run against the database. I'll show a better example of that with the write one.
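The revocation itself is one command against the lease ID captured earlier (the ID shown here is illustrative):

```shell
# Revoking the lease makes Vault run the role's revocation SQL,
# dropping the ephemeral user behind it.
vault lease revoke database/creds/read/abc123def456
```

Because the call is queued rather than synchronous, any failure of the revocation SQL shows up in the server logs, not in the client's output.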
Let's try to revoke the write user now. We see the same output on our client. The revocation operation has been queued. If you look at the logs, we see an error which is telling us that it cannot be revoked because some objects depend on it. This is what Phil covered earlier.
Keep in mind the Postgres permissions model and the idea that when an object is created by a user, then it will be owned by that user. There are a few ways to solve it. I'll go over what we do at Bench. For now, let's drop the table and then try it again and see if it works.
I'm going to connect with the write user. I'm going to grab the password. Quickly display the users—make sure I haven't tricked you. You see that we don't have the read user anymore; there are only two. Let's drop the artists table—it shouldn't be there anymore.
If we now try to revoke the user, we don't get the error. In production we don't go and drop the table—technically you could also alter the table and transfer the ownership to another user, but that's also not ideal.
There is a concept in Postgres called roles. You create a role which works like a group. Then there is a statement called SET ROLE. In our case, every write to the database is easy to handle, because we only create objects when we run a database migration—our app doesn't create objects; that's always done through migrations. We use a tool called Flyway, and Flyway supports using SET ROLE. We pre-create a role—maybe called database owner—and then the ephemeral user the migration tool gets runs SET ROLE database owner and inherits that role. Any objects created by that user will be owned by the shared role, which means the user can be revoked without that dependency. That's us—as far as the demo goes.
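The ownership workaround can be sketched in SQL—role and table names are illustrative, and the GRANT shown in the comment is the kind of statement you would add to Vault's creation statements for the migration role:

```sql
-- One-time setup: a long-lived, non-login role that owns all schema objects.
CREATE ROLE db_owner NOLOGIN;

-- In Vault's creation statements for the migration role, grant membership, e.g.:
--   GRANT db_owner TO "{{name}}";

-- At the start of each migration session, the tool runs:
SET ROLE db_owner;

-- Objects created now belong to db_owner, not the ephemeral user:
CREATE TABLE example (id int);

-- So when the ephemeral user's TTL expires, DROP ROLE succeeds cleanly.
```

This keeps ownership stable across migrations while every actual login credential stays short-lived.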
In terms of the talk, we gave you 10 extra minutes. I don't think there is any room for questions, but we will be outside if you want to come and talk to us. Thanks very much.