Securely manage hundreds of databases with HashiCorp Vault, Puppet, GPG, and LDAP
Lee Briggs shows us how Apptio uses HashiCorp Vault, Puppet, GPG and LDAP to securely manage access to hundreds of databases across 15 data centers around the world.
When you’re authorizing access by many different people to a resource such as a MySQL database, it's important to be able to individually authenticate each person. That way, you can log and audit who used which resource and when.
If you share a single password for a resource, you won't be able to distinguish between the people accessing your databases. It also becomes much harder to rotate credentials, or to revoke authorization of people if they leave the company.
Apptio has 15 data centers, with thousands of VMs, and hundreds of databases. So it’s a very real problem for the team. This is a perfect use-case for HashiCorp Vault.
Watch Lee Briggs describe and demo how Apptio:
- Uses Puppet to deploy Consul and Vault
- Uses GPG to initialize Vault securely with unseal keys
- Allows unprivileged users to securely request a Vault instance be unsealed without a helpdesk call
- Integrates with LDAP
Speaker
- Lee Briggs, Senior Infrastructure Engineer, Apptio
Transcript
First, before I talk about the Vault infrastructure, I want to talk to you a little about the Apptio infrastructure. The reason I want to do this is because when you make a decision to deploy something at Apptio, you can't just decide willy-nilly to throw Vault into production. The reason is because we operate at a reasonable scale.
Obviously we're not AWS or Google Cloud, but we're running around 6,000 VMs. We have what we would consider data centers in 15 global regions, whether that's a physical data center or an AWS VPC, and hundreds and hundreds of MySQL databases. Our raw storage footprint as of this morning was about 3.5 terabytes, and we've got about 178 terabytes of memory and 170,000 CPU cores.
When you make a decision to install something like Consul or Vault into that kind of environment, you can't just download the Go binary, run your install, and hope for the best. You really have to think about what you're going to do. We have 500 to 600 employees, most of whom are in the engineering department, so we need to think about these things a little bit.
The actual problem we were trying to solve was: how are we going to allow people to get access to these hundreds of MySQL databases, and do it in a way where we know who logged in where? If you think about some of the ways you might access databases at the moment in your current role, you might pass around a root password ... I hope you don't, but you might. How do you know which person used that root password at which time? If a new person starts, how do you share that with them? How are you gonna rotate that root password and then share it out again in a safe manner?
I watched a talk by Liz Reiss yesterday, who said the longer that a secret is valid, the more chance it has of being compromised, and this was very much on our agenda.
If you're not using the root password, something I've seen in previous companies is that you might just go and look at the application configuration file and see what the username and password for that database is. If you're guilty of that, you have the same problem. You're thinking about how you're gonna share the password, and if you can log in and look at it, anybody with the same permissions can look at that password too. You really need to think about what the implications would be if an attacker got onto one of your machines and could see that password, especially in an industry like ours, where we're taking care of companies' serious financial data.
So we looked at Vault, and as you may or may not know, if you use Vault, it provides audit logging. It will tell me who requested a credential. It has database support, and MySQL, which we use, is already built in, which is really, really useful. It can be made highly available if you configure it in a certain way, and it provides us with a secure way to store credentials. This seemed like a perfect opportunity for us to find a use case for Vault. If you're currently in a situation where you would like to use Vault and you're not sure why, human user access to databases is a perfect opportunity.
We needed to figure some things out. As I said, we've got 15 locations. How are we gonna deploy that many Vault instances? How are we gonna connect all these hundreds of databases to them? How are we gonna do all the operational stuff: high availability, sane backups, and keeping it safe? The thought process here was that if you make this harder than looking at the configuration file for the database, people will not use it. If you give an operator the ability to log in and look at the configuration file, they're not gonna use the tool you provide for them. So we had to make it even easier than that.
We went on a journey, and the first step was to deploy Vault. We had already deployed Consul and were using it in all of our data centers. We spread it across racks in the data center so we don't lose quorum when we lose a whole rack, which happens. We spread it across availability zones in AWS, where it's available, and we connect all these things using WAN federation.
In terms of how we actually deploy them, we use Puppet for configuration management. We already used Puppet to deploy Consul, and I'm sure there are Salt, Ansible, and Chef configurations out there too, but to download and install Vault we simply used the Puppet module that's there on the bottom left and deployed it that way.
If you're not familiar with Puppet, Puppet uses TLS to do authentication between the agent and the master, so we used that quite heavily to provide TLS in all of our data centers. Every VM already has a TLS certificate that's signed by an authority we trust, which is the Puppet master, so the TLS for Vault is provided by Puppet.
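As a rough sketch of what that looks like in the Vault server config (the hostname and paths here are illustrative, and assume the default Puppet 4+ agent SSL locations), the listener simply points at the certificate and key that Puppet already manages on every VM:

```shell
# Sketch only: reuse the Puppet agent certificate for Vault's TCP listener.
# Substitute your own FQDN; these paths are the Puppet 4+ defaults.
cat > /etc/vault/listener.hcl <<'EOF'
listener "tcp" {
  address       = "0.0.0.0:8200"
  tls_cert_file = "/etc/puppetlabs/puppet/ssl/certs/vault-01.example.lan.pem"
  tls_key_file  = "/etc/puppetlabs/puppet/ssl/private_keys/vault-01.example.lan.pem"
}
EOF
```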
The second step was that we needed to initialize it, and at this point you kind of think, "Okay, I've got a lot of Vault instances around, and we want to make sure that we can initialize them in a safe manner." We looked at all the different options. When you initialize Vault, the default vault init command will create plain-text unseal keys that you then need to distribute to other people. That felt really unsafe to us, because then you have to find a secure way to distribute those keys, and anybody with all of those keys, or a quorum of those keys, can then get into your Vault instance. We were already using GPG for Puppet and for a variety of other things, like checking secrets into git, so we looked at the GPG support that's available in Vault and we thought that was perfect.
What that does is create an unseal key which is encrypted by each user's public key, and only their private key can read it back. It also has Keybase support, if you're using Keybase at all; I would highly recommend it. We used the API to initialize all these Vaults. You only need to send the init call once to each cluster. You provide 9 GPG keys from different users, on different teams, in different departments, in different regions, and then we specify that we need 3 of those users to unseal every Vault.
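A minimal sketch of that init call, expressed with the CLI rather than the raw API (the key filenames are placeholders, and newer Vault versions spell this vault operator init):

```shell
# Initialize once per cluster: 9 unseal key shares, each encrypted to a different
# user's GPG public key, with any 3 of them required to unseal.
vault init \
  -key-shares=9 \
  -key-threshold=3 \
  -pgp-keys="alice.asc,bob.asc,carol.asc,dave.asc,erin.asc,frank.asc,grace.asc,heidi.asc,ivan.asc"
```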
Vault is initialized and we've got it deployed in 15 data centers, and now it's like, "Okay, now I've gotta unseal every single one of these things." That seemed really daunting to me. We tried to do this manually, and I'll show you in a minute ... Doing it manually involves echoing the key, base64 decoding it, then piping it to GPG, and people are like, "I don't know if my GPG key is correct. Every time I send an unseal command it's not working. What's happening?"
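For reference, that manual dance looks roughly like this (the variable name is purely illustrative):

```shell
# Each operator's share is stored base64-encoded and GPG-encrypted to their key.
# Decode it, decrypt it with your private key, and feed the result to vault unseal.
vault unseal "$(echo "$ENCRYPTED_UNSEAL_SHARE" | base64 --decode | gpg --decrypt)"
```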
We were trying to do it manually, and it was just painful. We got to the stage where we were waking people up in the middle of the night and saying "Can you get on the call and unseal the Vault? Please?" People don't generally tend to like that too much.
So what is the middle ground? You can't automate this thing. You can't exactly leave the unseal keys in plain text on your file system. That kind of defeats the point.
Unsealing Vault automatically is difficult, but I think I found a middle ground. I wrote a tool called Unseal. It's my first Golang binary, so if you read the code, please help me make it better. What you do is add all of the Vault servers to a configuration file along with your encrypted unseal key, and it automates the process of base64 decoding the key, decrypting it, and sending it to Vault. It will unseal all the Vaults, and it runs each one of those unseal commands in a goroutine. I can unseal every single Vault (I'm not gonna close them all down, because that would cause a problem), but I can unseal all the Vaults, about 75 of them, in around 15 seconds.
I'd like to give you a quick demo. I've provisioned some infrastructure in DigitalOcean. Let me just quickly show you what I've got here. I have multiple nodes: as you can see, I've got 3 Consul servers that are running Vault, and I've got 3 Vault servers that are also there. We're in a pretty bad spot here, because only one of the Vaults is currently active, and all of them are sealed. Maybe we offline migrated the VMs, or maybe we upgraded Vault to a new version, and I now need to unseal them.
I have a configuration file. As you can see here, I set the CA path at the top, and I'm using the Puppet certificate. I have a GPG flag that says I'm using GPG. If you're not using GPG, you could put your plain-text keys in there, but I don't like that compromise, so I'm just gonna keep them encrypted. I specify an array of hosts: each of the Consul servers and each of the Vault servers is specified with the port and then the encrypted unseal key. This is usually what I'd have to do to get my unseal key: echo the base64-encoded encrypted string, decode it, then pipe it to GPG, and that's my actual unseal key.
However, if I had to do that multiple times, even in a loop, I could get it wrong, and all that kind of stuff. So I have this command, Unseal. I run unseal status, and it tells me that 5 of them are sealed. Then I run unseal. Because I have gpg-agent running, which is caching my key, it usually prompts me, but I didn't want to risk it this time ... and I've unsealed all those Vaults immediately.
What this means is that any user who comes online in the morning ... What's become a bit of a habit, for me at least, is that rather than reading back through Slack to figure out if somebody live migrated a Vault server in one of the data centers, or upgraded something, I run unseal status to see if any of the Vault servers need unsealing. If they do, I just run unseal.
So we did the demo ... oh, what's happened here. There we go ... And now we need to configure Vault. We need to add things like authentication and policies, we need to add users, we need to add LDAP configuration, all that kind of stuff.
When we started doing this, there weren't a lot of good options around. I found a tool online called vaultctl, written by the UK Home Office, which was really useful, and that's the path we went down. Seth wrote a fantastic blog post about how to do this in a codified manner, and recently Terraform gained a Vault provider, which makes this much, much easier.
I am going to show you a little bit of vaultctl in a second, but basically what you can do is define your Vault configuration in YAML files and then run the vaultctl tool, and it syncs that configuration with each Vault in each data center.
This makes it much, much easier. It's like configuration management for Vault. Let me stress, Terraform is a much better way of doing this, but at the time we didn't really have better options.
After that, we have all these MySQL databases. We've enabled authentication, we've added a bunch of policies, we've added auditing, and we've enabled the MySQL backend. Now I've got like 600 databases that I've gotta add.
We cheated a little bit here, because we use a tool that provisions VMs internally called self-serve. When we create a new MySQL VM, basically what happens is Puppet runs and installs MySQL. It adds a Vault user with the relevant grants, and then it adds the roles to each database config, so we have a read-only role, which allows developers to go in and do select statements to their heart's content, or a full role, which allows them to drop databases to their heart's content.
What happens then is self-serve makes an API call to the Vault in that region and adds it under a mount point. We mount the databases under the path "mysql", because that's the old backend from the old version of Vault, and they've thankfully deprecated it since. Thanks, HashiCorp. Then we add each one under its hostname, so we can add multiple databases to each Vault in each region.
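What that API call amounts to, sketched as CLI commands against the legacy mysql backend (hostname, credentials, and role SQL here are all illustrative):

```shell
# Mount the legacy mysql backend under mysql/<hostname>
vault mount -path="mysql/mysql-01.example.lan" mysql

# Point it at the vault user Puppet created with the relevant grants
vault write mysql/mysql-01.example.lan/config/connection \
  connection_url="vault:SUPERSECRET@tcp(mysql-01.example.lan:3306)/"

# A read-only role for select statements, and a full role for everything else
vault write mysql/mysql-01.example.lan/roles/readonly \
  sql="CREATE USER '{{name}}'@'%' IDENTIFIED BY '{{password}}'; GRANT SELECT ON *.* TO '{{name}}'@'%';"
vault write mysql/mysql-01.example.lan/roles/full \
  sql="CREATE USER '{{name}}'@'%' IDENTIFIED BY '{{password}}'; GRANT ALL ON *.* TO '{{name}}'@'%';"
```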
What next?
So now all my databases are enabled, and they're all in Vault. How do I make it easier for each operator, as we'll call them, to log into those databases? We use Active Directory LDAP. We obviously have a bunch of role-based access groups, and what we do is map each developer or each operations person to a role-based access group and configure all of that from there.
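In CLI terms, that mapping looks something like this (the group names, policy names, and directory DNs are made up; the real ones come from our Active Directory):

```shell
# Enable and configure the LDAP auth backend against Active Directory
vault auth-enable ldap
vault write auth/ldap/config \
  url="ldaps://ad.example.lan" \
  userdn="ou=Users,dc=example,dc=lan" \
  groupdn="ou=Groups,dc=example,dc=lan" \
  binddn="cn=vault,ou=ServiceAccounts,dc=example,dc=lan" \
  bindpass="BIND_PASSWORD"

# Map a role-based access group from AD onto a Vault policy
vault write auth/ldap/groups/database-readonly policies="mysql-readonly"
```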
However, once this is done, you've given a bunch of people access to Vault because you've mapped the group to the policy, but people were starting to get really frustrated downloading the Vault binary and then adding -method=ldap, which they forget, and username=lbriggs, which they also forget, and then they're like, "Oh, what's the path again? I forgot what it is. How do I get my credentials again?" You could document this until the cows come home, but if it's 3:00 in the morning and you just want to log into a database to fix a problem, you're probably not going to read that much documentation.
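Spelled out, the flow people kept tripping over is roughly this (older CLI syntax; the mount path follows the per-host layout described earlier):

```shell
# Authenticate against LDAP, then request short-lived MySQL credentials
vault auth -method=ldap username=lbriggs
vault read mysql/mysql-01.example.lan/creds/readonly
```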
This became tedious for a variety of reasons. I busted out my terrible Golang skills again and wrote another tool called Breakglass. This is now open source; it's under the Apptio GitHub repository. Basically, what this does is automate this process away for you. It's kind of inspired by the Vault SSH command, if you've ever used it, which will automatically generate your credentials and log you into a box. It currently supports MySQL, SSH access, and also Docker access. I have plans to add IAM support and any other backend we use that seems reasonable. I will happily, again, accept pull requests. This is not good code. Please let me stress that as much as possible.
Let me show you a quick demo. I have the Breakglass tool installed. Again, because it's a Golang binary it's completely portable, and you can see here there are a bunch of available commands. This also has a configuration file. You can use the userpass authentication method or the LDAP authentication method. Those are the only authentication methods supported, so if you use GitHub, it's not supported yet, because Breakglass expects you to provide a username and password. Because I'm logged in as root, I've specified the username lbriggs (because I'm on DigitalOcean and nobody cares), and I've specified the local Vault in the data center, so vault.service.consul.
Here you can see I've got a MySQL instance, and I want to Breakglass with MySQL on the host mysql-0.briggs.lan. It will prompt me for my username and password. This would probably be my LDAP password, but in this case it's not, and it's creating me some credentials. I can then go use those. However, if I'm lazy, which, because I'm a system administrator, I am, I can provide exec and it will log me right into the box. This is really, really helpful when you're trying to log into a database in the early hours of the morning, let me assure you, because I've used it many, many times.
If I look at the audit logs, you can see there's a bunch of information here about who requested the password, when, and why. You can see here that the display name, username lbriggs, with these policies, requested a bunch of information. This is instantly better than passing around a root password. If you get some kind of breach with a shared root password, you have no idea who was responsible for it. I personally believe this is much better, and I think it's exactly what Vault was designed for.
I've absolutely bolted through this, so I'm already onto final thoughts. What I would like to do real quickly is show you the vaultctl configuration. Here you can see there's a bunch ... I've specified a file with the authentication types; in this case I've got userpass. I have a bunch of policies, and in this particular case I've called the policy "operations." Because I'm in charge of Vault, I have a bunch of policies that I probably shouldn't have access to, like sudo on sys, and then users ... I've created myself a user with a super secure password, and then I assigned myself some policies.
To run that, all I need to do is run the vaultctl command from this directory. You can see I'm specifying vault.service.consul on port 8200. That's my root token; I'm gonna tear this down so you can't use it. Then I sync this directory, and it's updated all the configuration of Vault for me.
Obviously you probably wouldn't use the root token if you're using vaultctl; usually you'd run it locally from your laptop or something like that, but in this case, because I'm doing a demonstration, I didn't want to deal with all the policy stuff.
You can see here it's added the backend, which was already created. The policies were already there, the users were already there, and I don't have any backends or secrets to sync, so it hasn't done any of that stuff.
So that was the journey, and at this stage we have a bunch of operators who are able to log into databases really easily. We have distributed this over multiple data centers, and we now have a Vault setup that we feel comfortable with. The nice thing about this as the first thing you do with Vault is that, at the end of the day, if you somehow manage to screw up Vault in some way, you're not losing access to a bunch of secrets; all you're losing is the ability for humans to log into databases. The way I was thinking about this was: we want to be comfortable running Vault in production before we start throwing our TLS certificates in there, and our root passwords, and all that kind of stuff. We ran like this for quite a period of time before we felt comfortable adding other things into Vault.
So, some of the final thoughts we came up with after going through this journey. Auditing is an interesting thing. You can see there that's an example audit log. We ship all of our logs off to Splunk, and as you can see at the bottom there, it's kind of small, but the data is obfuscated. The actual username or password that gets generated by Vault is not put into the logs. What this means is that when you actually want to find the person who logged into a database, you need to provide the string, either the username or the password, and Vault will tell you which audit log entry it came from. You do that using the sys/audit-hash endpoint.
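A sketch of that lookup, assuming the audit device is the file backend mounted under the name "file":

```shell
# Vault HMACs sensitive values before writing them to the audit log. Hash the
# string you know, then search Splunk for the returned hmac-sha256:... value.
vault write sys/audit-hash/file input="the-username-or-password-you-are-tracing"
```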
You can do this manually, but I am currently working on a tool that makes this much, much easier, so watch this space. The second thing we thought of is ... So we have this Vault endpoint, and we now have ACLs turned on. This is how Vault looks in the key-value store.
No, iCloud, go away.
We've initialized Vault under a path of the data center name in DigitalOcean. You can see here that if I have access to port 8500 on Consul, I can delete everything immediately with no recourse, so you have to turn on ACLs if you're gonna do this and use Consul as your backend.
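To make that concrete, without ACLs anyone who can reach the Consul HTTP port can do something like this (the prefix matches how we initialized Vault), which is exactly why you want default-deny ACLs covering the Vault key prefix:

```shell
# With no ACLs, this recursively deletes every key Vault has stored in Consul.
# Turning on ACLs with acl_default_policy = "deny" and scoping Vault's token to
# its own prefix closes this off.
curl -X DELETE "http://consul.example.lan:8500/v1/kv/vault?recurse"
```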
There is a great write-up here about somebody doing a bug bounty where the target was using Consul without ACLs turned on. There are other things you need to consider here. Like, by default up until recently, consul exec was turned on, or you could just add a new health check, and people could register health checks that ran scripts. So please, if you take nothing away from this presentation, please turn ACLs on. If you're storing these secrets in Consul, don't give people the access to delete all of them. It's a little bit of an investment, but if you haven't deployed Consul ACLs up until now, I would highly recommend that you think about it.
Backups. Once we've deployed all these Vaults, we need to think about how we restore all this stuff to another data center during a disaster recovery event. As you can see, earlier we initialized Vault under a prefix; by default it's "vault," but we added the data center underneath that as well. We never replicate secrets between data centers. We have customers who don't want their information replicated out of region, and customers who don't want their information to come into the United States, for example, and for that reason we made the decision not to replicate things into other regions.
What that means is you might need to think about how you deal with backups. Consul now has a snapshot command, which it didn't used to have, and it's really, really useful. We take a snapshot once an hour, then we copy it to another data center that is reasonable for that region: if it's in Europe it goes to a European data center, and if it's in the U.S. it goes to a U.S. data center.
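The backup side is essentially just the snapshot command on a timer; roughly (the destination host is illustrative):

```shell
# Take an hourly snapshot of Consul (and therefore Vault's storage), then ship it
# to a data center in the same region.
consul snapshot save "vault-$(date +%Y%m%d-%H%M).snap"
scp vault-*.snap backup-host.other-dc.example.lan:/backups/consul/
```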
Then we test the restores regularly. All you have to do is start the Vault binary on a different port and connect it to an existing Consul, since you already have Consul in that data center, but you can initialize it under a new prefix. For example, here, if I have a DigitalOcean region like San Francisco, I could initialize it under a different prefix, and I've got two Vaults connected to one Consul.
We do that with Ansible: it starts a Vault on a new port, updates the configuration, starts it up, and then we get everybody to run Unseal with the new configuration file. Then you can verify the integrity of those secrets, shut it down, and come back to it later. Having an Ansible playbook to do that means your disaster recovery story looks a lot better when you're doing those wonderful DR tests you're supposed to be doing.
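As a sketch of the restore-side configuration (the prefix and port are illustrative, and older Vault versions spell the storage stanza backend rather than storage), the second Vault just binds a different port and a different Consul prefix:

```shell
# Second Vault instance on a different port, pointed at the same Consul but
# storing under a fresh prefix so it doesn't touch the live Vault's data.
cat > /tmp/vault-dr.hcl <<'EOF'
listener "tcp" {
  address     = "0.0.0.0:8210"
  tls_disable = 1
}
storage "consul" {
  address = "127.0.0.1:8500"
  path    = "vault-dr/sfo1/"
}
EOF
vault server -config=/tmp/vault-dr.hcl
```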
Final things. We learned a bunch of lessons here; I learned a bunch of lessons here. I think the first thing I'll take away from this is: if you want to use Vault, don't try to treat it as the be-all and end-all of your secret storage. Pick one thing and go and Vault it. In my case it was human access to databases, but if you have a specific thing you think would be much, much safer, I would recommend just picking one thing and iterating. Think about how you're going to do configuration, backups, and unsealing. Hopefully the tools that I've written will help you, but you do need to think about how you're going to unseal Vault in a time of emergency, especially if you have a distributed team and you need three users to unseal it.
Vault has a bunch of backends. If you're already using Consul, I would recommend using that as your backend because it gives you HA immediately. If you're in Amazon and you want to use one of the Amazon backends, then of course that makes sense, but Consul gives you service discovery and all that kind of stuff.
Any kind of automated secret management has these kinds of trade-offs. As long as you're aware of them, and you abstract them and make them safer where possible, I think it makes sense to automate as much as possible. You will always have the secret zero problem. You will always have that question of "How are we going to make this safe and distribute this kind of stuff without potentially jeopardizing the situation?"
At the end of the day, as an operator, you are the biggest attack vector. It gets said all the time, and I'm not a security engineer, but at the end of the day, I'm fairly sure that if there are any infosec people in here, they're probably going to sprinkle USB sticks in the car park before they start thinking about how they're going to get through your firewall. So think about that. Unseal is useful, but it's only as useful as how safe you keep your personal laptop.
The final thing I'd like to say as a takeaway ... From what I've seen, engineers love Vault for its HTTP API. I don't believe there's anything out there that does secret management the same way Vault does. I've seen hardware KMSs, I've seen a bunch of other open-source tools, but having an HTTP API for your secrets that you can feel comfortable and secure with is a really compelling story. You might be in Amazon and using the Amazon KMS, and that's fantastic too, but if you're in the physical data center game like we are, you need to come up with other solutions.
Unfortunately, that's all I have; I have about 11 minutes left. Apparently I talk too fast. It was fantastic to share this with everybody, and I'd like to thank Nick for giving me the opportunity to do this.