Hear the story of Hippo's journey adopting HashiCorp Vault and how it fit into their larger DevSecOps transformation.
Hippo Technologies uses the Vault provider for Terraform. Every month, new features and capabilities are added to the provider, allowing them to continuously improve their Vault configuration management. In this talk, Andrey will share Hippo's journey, from the first basic steps of Terraforming Vault to where they are now. He'll delve into what went well, what didn't work, and what you should consider before you embark on this journey, like incorporating DevSecOps.
Hello to you people from around the world. Thank you for joining today.
Today we're going to be talking about HashiCorp Vault and its configuration via HashiCorp Terraform, based on my experience working for the last 3 years with Vault and implementing that for Hippo Technologies.
Before we start, though, I wanted to bring your attention to this: there is a classic early-development exercise where kids spot which items in a picture are different from the rest. With kids, it's fun, but it's not so fun when it's happening on your production servers.
You don't want to guess why two servers are always different, especially if there is an outage going on or, what is worse, a security intrusion. It is particularly important to know why things look a particular way, and that they are as you intended them to be.
This is what you can achieve with infrastructure as code and configuring Vault via Terraform.
I'm Andrey Devyatkin, an independent consultant in the industry for more than 10 years. I have been doing conferences, speaking at meetups, also organizing them. I am the sole host on the DevSecOps Talks Podcast.
If you want to hear me, then you can tune in there. I'm helping organizations to build projects and take the time and fix those projects in organizations. Also, I am training people in Terraform, Vault, and Kubernetes.
About this presentation and what to expect: I'm not pretending to be an expert. It's all based on my experience in recent projects, as well as me talking to other people, seeing what they do.
There are going to be a lot of details and technical references, but you don't need to screenshot them. Some of the slides will be available online.
Terraform, one of the tools that we're going to be covering today, is an infrastructure-as-code tool from HashiCorp that comes with a lot of great capabilities. The great thing is that Terraform nowadays has providers for everything, so you can configure not only the public cloud providers, but also, I don't know, Pingdom, Cisco routers, a lot of providers.
It's growing in popularity as the go-to tool for infrastructure as code.
Vault is a secrets manager that gives you quite a lot of capabilities. It reduces secrets sprawl: you put all your crown jewels into Vault, and they stay safe in there. It allows you to shift from using static secrets to dynamic secrets. We're going to be talking more about that.
You're also getting better audit capabilities, since everyone gets a temporary secret and you can trace who got that secret. With static secrets, you don't really know who exactly is using them. For instance, if everyone has the root database password, you don't really know who is logging into your database and who is doing what.
Also, HashiCorp tools are focused on building multi-cloud capabilities. Vault allows you to set up a unified identity layer and abstract away from your cloud providers and your private datacenter, so you have all the identities and all the policies managed in one place.
Plus you get a break-glass procedure. Like, if you have an issue somewhere, if you have a security incident, then there is a possibility for you to revoke some of those temporary credentials that you generated using Vault. That wouldn't be possible with a shared static credential that is used everywhere: revoking that one would just break everything.
With Vault, every workload gets its own credential, and it's safe to revoke one of them. You can revoke all of them, of course, but usually a leak happens in one place, so you revoke that one credential, and you can also spot the leak more easily.
There are many more good things you can say about Vault that I'm not going to do today since the time is limited. If you're interested, just search for this talk on YouTube and you will get all of it.
Assuming that I convinced you that you have to start with Vault, where do you start? It's good to clarify the context and collect your requirements. And there are all types of questions that you can ask. How are you going to deploy? Is it going to run on bare metal, a VM, a container? Will you patch or scratch?
With that I mean: will you do immutable infrastructure, where you bake a golden image that already has everything inside and just redeploy the new image, or will you constantly update your installation by logging in via some automation tool and installing updates and so on?
How do you access? Will it be available on VPN, or will it be available on the internet?
How will you unseal? How do you get the initial secrets in place, like the TLS certificates for TLS termination, for instance? And then you also want to lock down this box as much as you can, which means that everything operators might need from it, like logs, telemetry, and audit files, you want to stream somewhere else.
Also consider your disaster recovery requirements. That will also affect the way you set up Vault.
That might be quite a lot to take in at once. So seek out available best practices, like Terraform modules and helm charts.
Also, HashiCorp has an amazing resource, learn.hashicorp.com. On that resource you can find Vault reference architectures that can give you an idea of what you can do.
So there is quite a lot out there which is fine to use. It might not exactly meet your requirements, but it still is a good start, better than you implementing everything from scratch. You can just borrow some ideas from those templates or Terraform specs.
Also, you want to harden your Vault installation, and there are quite a lot of things that you want to do, and that's like a minimal readiness checklist, in my opinion. There is more on the link at learn.hashicorp.com.
A few things to highlight. One of them is somewhat obvious, but not always done: TLS termination. You want Vault to terminate your TLS connections. If you terminate the TLS connection on the load balancer, then you'll be sending the traffic onward to the EC2 box where Vault runs, and that traffic should also be encrypted, because if someone in the middle is sniffing the network, they're going to see everything that goes there.
So you want Vault to terminate TLS for you, and the load balancer will just pass the whole traffic through to Vault.
Also, you need to enable the audit log. Last time I checked, Vault was writing those files to disk, and then you will need to find a way to ship them off the box where you're running Vault.
And for AWS S3, enable a setting so people cannot delete those files. Like, if there is an intruder lurking in your infrastructure trying to cover their traces, they may attempt to delete the audit files. But if deletion is blocked, that's much harder.
You could even have cross-account replication for the S3 bucket, so you're sending everything to a black hole that the users in this account don't have access to. That will prevent you from losing the audit files.
This assumes that you do both parts with Terraform. If you deploy to Kubernetes, that doesn't fully apply. Well, actually it does: you're splitting your Vault deployment and your Vault configuration into two separate specs.
With Kubernetes you already have both: the official Helm chart, which deploys your Vault instance, and then the Terraform specs that configure it.
If you deploy to EC2, for instance, then that's what I mean here: you want to separate. One spec will be for the Vault deployment. There you will have your EC2 instances, Elastic Load Balancing (ELB), security groups, auto scaling groups, what have you.
The second one is where you engage the Vault provider for Terraform and configure Vault itself. The reason for that is that if you do a rolling update, for instance, your Vault connection might blink for a fraction of a second, because one EC2 instance went away, another one is coming up, and perhaps something odd is going on in the ELB.
If in that second Terraform tries to access the Vault API and it is unavailable, you will be in no man's land: an unfinished deployment. You changed some resources but haven't finished, and then you will most probably have to manually roll back or run it again. So you might get yourself into trouble and even get your Vault completely locked up.
That's why it makes sense to run those separately. And obviously you want to run the configuration spec often, like weekly, to make sure that there is no configuration drift and no malicious changes buried in the configuration, so your Vault configuration is exactly as you intend it to be.
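As a minimal sketch, the second, configuration-only spec might engage the Vault provider like this. The address is a placeholder, not Hippo's real endpoint:

```hcl
# Configuration-only spec: only the Vault provider, no EC2/ELB
# resources. The address below is a hypothetical example.
provider "vault" {
  address = "https://vault.example.internal:8200"
  # The token is normally taken from the VAULT_TOKEN environment
  # variable, so no credentials are hardcoded in the spec.
}
```

Keeping this provider block in its own root module is what lets you apply deployment and configuration changes independently.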
Now you're up and running. What do you do next?
To start, you need a few settings on the CLI. In particular, you will need the environment variable VAULT_ADDR, which points to the Vault API, and you will need a token.
Actually, when you first start with Vault, you need to initialize it. When you're initializing Vault, you will get a couple of things. You're going to get 5 unseal keys, and you will need 3 of those to unseal Vault every time it starts.
If that seems tedious, and it is, then you can use auto-unseal with a cloud KMS, or with an HSM device if you have one, something like that. You also get a root token when initializing Vault, and that's god mode: with this token you can do whatever you want.
The last thing you want is for that token to be stolen from you. To prevent that, once you have created the policies that allow you to manage Vault using a regular token, just revoke your root token: `vault token revoke <root-token-id>`. It's impossible to steal from you something that you don't have.
You can always generate a new one using those 5 keys I mentioned before.
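The auto-unseal fallback mentioned above lives in the Vault server configuration itself (which is also HCL), not in Terraform. A sketch for AWS KMS, with the region and key alias as placeholders:

```hcl
# Vault server configuration fragment: auto-unseal via AWS KMS.
# Region and key ID are hypothetical placeholders.
seal "awskms" {
  region     = "eu-west-1"
  kms_key_id = "alias/vault-unseal-key"
}
```

With this stanza in place, Vault unseals itself on startup using the KMS key instead of requiring operators to enter 3 of the 5 unseal keys by hand.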
Assuming that you did everything on the CLI and you have an idea of what you're after, you can start. But first you need to understand what Vault is and how it works; that will give you an idea of where to start.
I usually use this slide to quickly explain to people what Vault is about.
On the left, you have auth methods that allow you to come into Vault and get a token. Based on the auth method you use and the metadata provided, your token gets a set of policies attached. And those policies allow you to access secrets engines.
Secrets engines allow you to generate those temporary credentials, or retrieve static credentials that you store in the key-value store. There are also engines with PKI capabilities, public key infrastructure, which can generate certificates for your infrastructure using Vault.
That's an example of how the token looks. In this example, I look up a token on the CLI, and in the middle you can see there is an `id`, which is important. Then you have `issue_time`, when you entered your Vault. You have `ttl`, and that's important, because every token has a limited lifetime, which means that if this token is leaked, the attacker has a limited window of opportunity to exploit it.
Also, you can see that it was issued using the LDAP auth method, and my username is attached to it. I can see that, yeah, this token is me. It's basically me acting in Vault using this token, and since my LDAP identity is attached, it can be traced back.
In the end, second from the bottom, you can see how long this token is going to be valid, and I can renew it if needed.
It's kind of boring, but it's important, since you want to get rid of your root token. That's why you need auth methods where you get a token with the policies that allow you to manage Vault.
There are many auth methods available in Vault nowadays, and you most probably will need more than one. You will need something for the humans, developers and operators, and you will need something for the machines, like your applications, your CI/CD system, maybe some bots.
It will be different for each of them, and you're going to see why. As an example, I'm going to use LDAP here. LDAP is quite a good example, since with LDAP you're leveraging an existing directory of identities that you already use in other places.
Vault just falls back to that directory to validate the credentials you're sending it, and if they're valid, it gives you back a token. It's convenient for humans because it is interactive.
This slide shows how you configure the auth backend in Terraform. If you have ever configured LDAP, everything you see here will be very familiar to you: your LDAP server URL, your organization structure, and so on.
You will need to provide the bind user that Vault will use to verify the credentials you are sending. And you see there is a bind password here. It's good to pass it as a variable, or fetch it from Vault, but then you have a chicken-and-egg problem.
Keep in mind that all secrets that you share with Terraform will stay in Terraform state, and you need to protect your Terraform state.
This is the LDAP role, which is actually a backend group. Its purpose is to map an LDAP group to a Vault policy. So, for instance, in this example, if a person is in the DBA LDAP group, they'll get the DBA policy attached to their token when they log in.
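A hedged sketch of what that looks like with the Vault provider for Terraform. The URL, DNs, and group name below are examples, not real values:

```hcl
# Hypothetical values throughout; adapt to your directory layout.
variable "ldap_bind_password" {
  type      = string
  sensitive = true
}

resource "vault_ldap_auth_backend" "ldap" {
  path     = "ldap"
  url      = "ldaps://ldap.example.com"
  binddn   = "cn=vault-bind,ou=services,dc=example,dc=com"
  bindpass = var.ldap_bind_password # note: ends up in Terraform state
  userdn   = "ou=users,dc=example,dc=com"
  userattr = "uid"
  groupdn  = "ou=groups,dc=example,dc=com"
}

# Map the "dba" LDAP group to the "dba" Vault policy.
resource "vault_ldap_auth_backend_group" "dba" {
  backend   = vault_ldap_auth_backend.ldap.path
  groupname = "dba"
  policies  = ["dba"]
}
```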
The policies are complex, and I'm not going to go into details here because it could be a topic for a whole talk.
Also, it's specific to every organization how they do things. But what you need to know is, as I said before, you need a policy to manage policies, and you need to consider creating those first.
Policies are deny-by-default, so you have to explicitly allow something; if it's not explicitly allowed, it's denied. And the LDAP group names don't have to match the policy names, but it makes sense if they do, because then things are predictable.
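As an illustration of deny-by-default, here is a minimal hypothetical "dba" policy managed via Terraform; the path it grants is an example:

```hcl
# With deny-by-default, a token holding only this policy can read
# database credentials at this one path and do nothing else.
resource "vault_policy" "dba" {
  name   = "dba"
  policy = <<-EOT
    path "database/creds/dba" {
      capabilities = ["read"]
    }
  EOT
}
```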
Say I'm in the DBA group; it most probably maps to the DBA policy. And if you are a member of multiple groups, you will get multiple policies attached to your token.
When you are setting this up, you can specify the TTL that people get on their tokens. For instance, if a developer gets a token at the beginning of the workday, you might want to set it to 6 or 8 hours, the length of a productive day.
Also, where it can be used from: you can bind the token to a network, and you can also set how long it can be renewed. Those are all good things to consider, and they will vary from organization to organization.
Use AppRole only if you have to, because it is basically a login and password and comes with the same disadvantages as any static credential. It's mostly used for CI that is hosted outside of your infrastructure, for instance the SaaS offerings.
This slide shows how you set it up. You have the auth backend configuration in this example, and then you define the backend role that basically maps the `role_id` to the policies you get. Then you read the `secret_id` and `role_id` using the CLI and log in, with the `secret_id` being the password. You can do the same through Terraform: you use the backend role `secret_id` resource, then use a local to prepare the map and store it in the KV store, so it can be retrieved from the KV later on.
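A hedged Terraform sketch of that AppRole setup for a hypothetical external CI system; the role name, policy, and TTL are assumptions:

```hcl
# Example AppRole for an externally hosted CI system.
resource "vault_auth_backend" "approle" {
  type = "approle"
}

# Map the role_id to the policies the CI token will get.
resource "vault_approle_auth_backend_role" "ci" {
  backend        = vault_auth_backend.approle.path
  role_name      = "ci"
  token_policies = ["ci"]
  token_ttl      = 1800 # 30-minute tokens for CI jobs
}

# Generate a secret_id (the "password"). Note: it is stored in
# Terraform state, so the state must be protected.
resource "vault_approle_auth_backend_role_secret_id" "ci" {
  backend   = vault_auth_backend.approle.path
  role_name = vault_approle_auth_backend_role.ci.role_name
}
```

The CI system then logs in with the `role_id` and `secret_id` pair, which is exactly why this method shares the weaknesses of static credentials.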
Cloud IAM and Kubernetes auth methods are more useful for non-interactive authentication, like your applications, for instance. I'm not going to go into a lot of detail about Kubernetes auth; I have a separate video on that which you can check out on YouTube.
So you got the token. What's next? The secrets engines. Those are really nice because they generate dynamic secrets with limited lifetimes. That prevents sharing of credentials, because every workload gets its own.
And you have a break-glass procedure and improved audit.
Regarding the KV store for static secrets, I suggest you have two. One is for humans, where they can have a messy structure. Then you read those secrets using Terraform and store them in a second KV, where you have a predictable structure for your applications to read from. That way you can make sure it's the same across all environments.
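A sketch of that two-KV pattern with hypothetical mount names and secret paths:

```hcl
# Two KV v2 mounts: a messy one for humans, a predictable one
# for applications. All paths here are examples.
resource "vault_mount" "humans" {
  path    = "secret-humans"
  type    = "kv"
  options = { version = "2" }
}

resource "vault_mount" "apps" {
  path    = "secret-apps"
  type    = "kv"
  options = { version = "2" }
}

# Read a secret that a human wrote, then republish it under a
# predictable per-environment path for applications.
data "vault_kv_secret_v2" "db" {
  mount = vault_mount.humans.path
  name  = "team-x/db"
}

resource "vault_kv_secret_v2" "db" {
  mount     = vault_mount.apps.path
  name      = "prod/myapp/db"
  data_json = data.vault_kv_secret_v2.db.data_json
}
```

Since the copy runs through Terraform, a weekly apply also verifies the application-facing KV still matches what the humans intended.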
You can issue temporary AWS credentials. This is how you do it.
You first create an AWS user, because that user will be used by Vault to issue the credentials, and you attach a policy to it. That's the policy you need to attach; you can learn more on the HashiCorp Learn site.
Then you have the backend, where you specify the keys that you just generated and the TTLs. You describe the roles that will be provided to people, and then there are two different credential types to get: assumed_role, where you get Security Token Service (STS) credentials, or a user type, where Vault will create an IAM user and give you that user's credentials.
In this particular example, I'm using assumed_role. When I jump over to the CLI and read those credentials, I get three pieces: the access key ID, the secret key, and the security token, which is part of the STS session.
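A hedged Terraform sketch of the AWS secrets engine with an assumed_role-type role; the keys, account ID, and role ARN are placeholders:

```hcl
# Keys for the IAM user that Vault uses to issue credentials.
variable "vault_aws_access_key" {
  type      = string
  sensitive = true
}
variable "vault_aws_secret_key" {
  type      = string
  sensitive = true
}

resource "vault_aws_secret_backend" "aws" {
  access_key                = var.vault_aws_access_key
  secret_key                = var.vault_aws_secret_key
  default_lease_ttl_seconds = 900  # 15-minute credentials by default
  max_lease_ttl_seconds     = 3600
}

# Role issuing STS credentials for a hypothetical "dev" IAM role.
resource "vault_aws_secret_backend_role" "dev" {
  backend         = vault_aws_secret_backend.aws.path
  name            = "dev"
  credential_type = "assumed_role"
  role_arns       = ["arn:aws:iam::123456789012:role/dev"]
}
```

Reading `aws/creds/dev` (or `aws/sts/dev` on older Vault versions) then returns the three pieces mentioned above: access key ID, secret key, and session token.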
The cool thing is that you can use those temporary credentials to generate sign-in URLs for the AWS console, so you don't need any SSO. You just get the temporary credentials, run a script to generate the login link, paste it into your browser, and you're automatically logged in. No need to have IAM users, manage passwords, make sure they rotate their keys, and so on.
On this link, you can see how I do it.
For inspiration, I have more stuff.
This is a talk from the conference in 2019, and I find it very insightful and very inspirational about what you can do in Vault. So check that out.
The database credentials are more complicated, I would say, because the creation and the revocation statements are hard.
When you create temporary credentials, and an application uses those credentials to create tables, update tables, and so on, and then you revoke them, the revocation statement needs to be really clever in order to transfer ownership of the objects that were created to some other entity.
It is quite complex, but when you nail it, it works like a charm.
But it takes time, so be ready for that.
Another option you have in AWS is RDS IAM authentication, but you get no audit out of it, because RDS IAM, at this moment in time, doesn't surface in CloudTrail. So from an audit perspective, it's more or less the same as a shared username and password.
You can also rotate the root secrets, for example for the database. This slide shows some snippets you could use to change the root password that you gave Vault for generating temporary credentials.
This way only Vault knows the root password, no one else. But you still have a way back: going into the AWS console and resetting the root password if you really have to, if something happens to Vault.
Same with the AWS credentials: I would recommend tainting them using Terraform, and then Terraform will just regenerate those keys.
There is an API to rotate them directly, but then Terraform gets confused because it has no idea that you created a new key. It will try to recreate the keys that were deleted, and then you might get a conflict, because at some point you simply have too many keys attached to the same user.
As far as I recall, the limit is two access keys per user. Again, you have to encrypt and protect your Terraform state to make sure that no one gets in there.
The Vault introduction is a journey and it takes quite a long time. I would recommend starting small, identifying some use case that would improve the security for your organization.
For instance, the temporary AWS credential, so you don't have to have static users and manage those. When you're done with that, move on to databases or whatever you hold there.
And when you have Vault in place before you start to build something, that creates the possibility for developers to build security in from the start rather than gluing it on afterwards.
And since you're doing it all as code, it's all repeatable. You have the possibility to check that it's what you want, to see the plan, to see the diff. So if there is some issue or some incident going on, you'll have a better way of understanding what's happening.
And since you do it as code, it's possible to reuse and share what you do, like the modules I showed you before. And I think it's important that we do share and reuse, because by doing so we are getting more secure and getting our respective organizations into a better place, meaning fewer security incidents and fewer data breaches.
So let's help each other get more secure and make the internet a better place.
With that, I'm closing. If you want to ask something, my contact details are:
Thank you for tuning in, and enjoy the conference.