How Fleetsmith deploys Vault on Google Cloud Platform
This talk from HashiConf 2017 explains a typical deployment of HashiCorp Vault on Google Cloud Platform for secrets management. It focuses on how Fleetsmith deployed Vault, including lessons learned. It also details Google's new GCP auth backend.
Fleetsmith uses HashiCorp Vault on Google Cloud Platform to manage a few dozen critical secrets, including API keys, OAuth tokens, Postgres credentials and signing keys.
Join Google's Emily Ye and Fleetsmith's Jesse Endahl in their talk from HashiConf 2017 to learn:
- What's new in GCP to support Vault
- How Fleetsmith deployed Vault on GCP
- Lessons learned
Speakers
Emily Ye, Software Engineer, Google Cloud
Jesse Endahl, Co-founder, CPO, and CSO, Fleetsmith
Transcript
Emily Ye: I'm Emily. I work at Google. I work on open-source integration for Google Cloud.
Jesse Endahl: Hi, everyone. I'm Jesse Endahl. I am the CPO and CSO at Fleetsmith.
Emily: And our talk is about authenticating to Vault with Google Cloud Platform and Fleetsmith. So, just before we begin, how many of you know what Vault is and have used it? Cool. Awesome. Like, a lot of you.
If you haven't heard, Vault is a tool for managing secrets. How are Google Cloud users using Vault?
Two primary use cases:
- We have secrets management, and that's the most common case. These come from customer stories: a web hosting company wants to store its database credentials in a much more secure way, because they're currently kept as plain text on its servers, or a media company would like an easy way to manage its API keys, which are being stored as environment variables. There are a lot of other secrets you can store, but developers want something simpler to manage than a mix of environment variables, source code, deployment managers, etc. 
- A less common use-case we see is an internal public key infrastructure or PKI for service-to-service auth. We're seeing this a little more as more companies enter into the microservices world and start to think about security and their infrastructure. Google Cloud is doing a lot in this space right now, and we most recently announced the Istio project, which is a service mesh, allowing for authentication and encrypted communication between services. We'll be focusing more on the secrets management use case in this talk, but this setup would also work for PKI. 
We'll talk a little bit about how to use HashiCorp Vault on Google Cloud…
What does a typical deployment look like?
Before, you would usually start up a HashiCorp Vault server in Compute Engine, or sometimes GKE, [but] it's more commonly in Compute. You have an authentication backend, which is how people obtain tokens to talk to Vault and how secrets in Vault are ACLed. Most commonly we saw people using a username and password that they generated in Vault, so that's a separate identity from anything in GCP.
Once you authenticate, you get access to secrets. Secret backends are essentially any sort of third-party way to generate or store secrets. The most common one is the generic one, where you can store arbitrary secrets, and that's what most people are using. Another feature in Vault is a storage-configuration backend: this essentially allows you to use any sort of storage for long-term, encrypted storage of your secrets. There's a Google Cloud Storage backend, but it's community contributed and it doesn't currently support high availability, so we're not actually seeing a lot of people use it. There are a lot of HA storage options available, though, like etcd, ZooKeeper or Consul.
And finally, we have audit backends. One of the benefits of Vault is that everything is centralized, so you have a very good audit story where you can track all of the accesses to secrets in Vault. Most Google Cloud users are simply using the syslog audit backend and then storing their logs in Google Cloud Storage, but they're not really integrating it with the rest of their logging pipeline.
Another non-backend feature: a lot of people have to worry about their unseal keys. When you start up a server, you have to unseal the Vault before it can be used by anyone. This tends to be an issue because you need a set of keys. The default is three out of five unseal keys, I think, to unseal a Vault. This is usually done manually, and someone has the keys set up in a separate password manager, which leads us to our next point, which is…
Pain points
There are a couple of pain points that we've run into:
- The first one is authentication options. Users end up having to create new identities within Vault for their GCP entities, like the username and password we mentioned. We got a lot of questions like, "Can we just use our GCP credentials to authenticate to Vault?" There was no native way to use Cloud identities to authenticate in Vault up until a couple of weeks ago, which is what I've been working on. We released the new GCP auth backend with two brand new methods for authenticating GCP entities in Vault. So, that pain point is gone, and I'll be talking about it more in depth later on. 
- Another pain point is unseal key management. To unseal Vault, as we mentioned, you need a certain set of keys, and that's used to bootstrap the entire system. Where do you manage those? A lot of people end up using separate password managers, which completely defeats the purpose of having Vault. Do you really want to have more than one way to store your secrets? 
- Another pain point is it's all self-managed. Vault is not a natively hosted app in Cloud, and you still need to run the service by yourself. You still need to manage it, and if it goes down you need to bring it back up yourself. 
- Finally, we have issues with permission models. In a GCP setup, you would have both Cloud IAM permissions and Vault permissions to manage, so how do you really do that? Source of truth is not really an easy issue to fix, but a tentative recommendation is to use Vault roles to ACL everything and then use service accounts as identities. The reason is that within Vault you have access to a bunch of other secrets, for example multi-cloud credentials or things like that. By doing that, you give out permissions as you need them, rather than saying, "Oh, this IAM service account has this permission and now it has this one, but we forgot the other one, so now it has access to more permissions than it needs." 
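To make the "three out of five" unseal threshold mentioned above concrete: Vault's unseal keys are shares of a master key produced with Shamir's Secret Sharing. The following is a toy sketch of that technique over a prime field, not Vault's implementation (Vault works byte-wise over GF(2^8)); the function names and the choice of prime are assumptions for illustration.

```python
import secrets

# Assumption: a prime larger than any secret we split (2**127 - 1 is prime).
PRIME = 2**127 - 1


def split_secret(secret: int, shares: int = 5, threshold: int = 3):
    """Split `secret` into `shares` points; any `threshold` of them recover it."""
    if not 0 <= secret < PRIME:
        raise ValueError("secret must be in [0, PRIME)")
    if not 1 <= threshold <= shares:
        raise ValueError("need 1 <= threshold <= shares")
    # Random polynomial of degree threshold-1 whose constant term is the secret.
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]

    def evaluate(x: int) -> int:
        y = 0
        for c in reversed(coeffs):  # Horner's rule
            y = (y * x + c) % PRIME
        return y

    return [(x, evaluate(x)) for x in range(1, shares + 1)]


def recover_secret(points) -> int:
    """Lagrange-interpolate the polynomial at x = 0 to get the constant term."""
    secret = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret
```

Any three of the five shares returned by `split_secret(master_key)` can be passed to `recover_secret`, which is why no single key holder can unseal on their own.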
So, other things to consider when you're running Vault in GCP:
- Are you doing multi-cloud or hybrid?
- Where do you have sources of truth, and how do you do failovers?
- Are you using the same Vault for secrets management and PKI?
- Are you having different services for each? [i.e., does] each service have its own Vault server and its own secrets?
Some people actually are doing multiple Vault instances, which is maybe inadvisable but it's less common. But we have seen people use it. Overall, this sort of gets at where do you want the root of trust in your secrets to be? We'll focus on the use case where you're all in GCP and you're running a single Vault shared by all of your applications.
So we've been working a lot to make this better. And now we can talk about a…
Best-practice deployment for GCP
As of this week we've published a solutions guide. It's actually by Dan Isla, who's in the audience. He's a solution architect at Google and so shout out to him for doing this great blog post.
As a brief overview, we'll go over how that works, but we'll also link to that at the end so you can go look at it. At a high level, you spin up your Vault servers on GCE. You authenticate to Vault using the new GCP auth backend, which is what I've been working on, so I'm very pleased with that. You have a syslog audit backend, you store the logs in Cloud Storage, and you integrate that with Stackdriver along with your other logs. You can use Google Cloud Storage as a storage backend along with other HA storage backends. And finally, you can wrap your unseal key with a key from Cloud KMS for safekeeping.
We'll jump around a bit. I'm gonna save the auth backend for last, because I worked on it and therefore I know the most about it and I wanna talk about it a lot more.
- First things first, as we mentioned, we start the server in GCE or possibly GKE. The unseal keys are kind of a hard problem, since they're the bootstrapping secret. Customers choose a lot of different options, but we recommend that you store them in GCS (Cloud Storage), encrypted with Cloud KMS.
- For storage backends, we still haven't quite solved this problem. The downside of the GCS backend is that it doesn't currently support HA, but you might wanna consider using it in conjunction with other HA storage backends, because you can always use more than one.
- For logging, the syslog logs are actually very easily passed into Stackdriver. You can use Google Stackdriver logging agents pretty much out of the box, and so that's what we would recommend.
- And finally the auth backend. We've released this backend. It has two authentication methods: The first one came out a couple of weeks ago, and it allows you to authenticate to Vault using a self-signed or system-signed IAM service account JWT (JSON Web Token). The second one allows you to authenticate Google Compute Engine instances to Vault using what's called an instance identity metadata JWT. That feature was in beta but it's actually being launched today. They were originally supposed to do it at the end of the month, but they were like, "Hey, it's out." So, go play. In both cases, you're having GCP sign something to prove identity to Vault. As a reminder, we've written about this solution in the solutions doc, which is available on Google Cloud Platform, and we'll share the link at the end.
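As a rough illustration of the unseal key recommendation in the list above, the sketch below builds (but does not send) a Cloud KMS `encrypt` REST call for one unseal key. The helper name, the example key path, and the way the OAuth access token is obtained are assumptions; only the `:encrypt` endpoint and the base64 `plaintext` field come from the Cloud KMS REST API.

```python
import base64
import json
import urllib.request


def build_kms_encrypt_request(key_name: str, unseal_key: bytes, access_token: str):
    """Build a Cloud KMS encrypt request that wraps one unseal key.

    key_name looks like (hypothetical values):
    projects/my-project/locations/global/keyRings/vault/cryptoKeys/unseal
    """
    url = f"https://cloudkms.googleapis.com/v1/{key_name}:encrypt"
    # Cloud KMS expects the plaintext as a base64-encoded string.
    body = {"plaintext": base64.b64encode(unseal_key).decode("ascii")}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Sending the request with `urllib.request.urlopen` would return JSON whose `ciphertext` field is what you would store in Cloud Storage in place of the raw key.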
But now we're gonna dig deeper into the…
Auth backend
A quick overview of the auth backend, if you haven't really worked with auth backends in Vault: We use the concept of a Vault role, which binds a policy (whatever secrets you can access when you have an auth token) to authorization restrictions. In the case of IAM, this would just be that an IAM role has a certain number of service accounts that can log in as that role. For GCE roles, we restrict them by what instance group the instance is in, its zone or region, GCP labels and other things like that. Users then log in using a signed token, as we mentioned, specific to their entity. IAM service accounts will have a JWT signed with a service account key, and GCE instances will have an instance identity metadata token.
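The role concept can be modeled in a few lines. This is a hypothetical data model written only to illustrate "policies bound to restrictions"; the class names, field names, and the shape of the `instance` dict are assumptions, not Vault's internal types.

```python
from dataclasses import dataclass, field


@dataclass
class IamRole:
    """Hypothetical model: policies granted to a list of service accounts."""
    policies: list
    bound_service_accounts: list


@dataclass
class GceRole:
    """Hypothetical model: policies granted to instances matching every bound."""
    policies: list
    bound_zone: str = ""            # empty string means "no restriction"
    bound_instance_group: str = ""
    bound_labels: dict = field(default_factory=dict)


def authorize_iam(role: IamRole, service_account: str) -> list:
    """Return the role's policies if the service account may log in."""
    if service_account not in role.bound_service_accounts:
        raise PermissionError(f"{service_account} is not bound to this role")
    return list(role.policies)


def authorize_gce(role: GceRole, instance: dict) -> list:
    """`instance` is a dict with 'zone', 'instance_group' and 'labels' keys."""
    if role.bound_zone and instance.get("zone") != role.bound_zone:
        raise PermissionError("instance is in the wrong zone")
    if role.bound_instance_group and (
        instance.get("instance_group") != role.bound_instance_group
    ):
        raise PermissionError("instance is in the wrong instance group")
    labels = instance.get("labels", {})
    for key, value in role.bound_labels.items():
        if labels.get(key) != value:
            raise PermissionError(f"label {key!r} does not match")
    return list(role.policies)
```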
We'll talk a little bit about each of these in a little more depth, so how it works.
The IAM method was the first one, and it came out a couple of weeks ago, so I don't know if any of you have had a chance to look at it. Essentially, you create a service account to act as your identity. Then you either give the service account the ability to sign JWTs for itself, which is maybe not recommended depending on what you use the JWTs for, or you give another, higher-up admin account access to act as the service account and sign that JWT.
Following the principle of least privilege, you can create a new custom IAM role that only has the iam.serviceAccounts.signJwt permission and then assign it to the actor. Depending on what credentials you want to use to sign the JWT, that's the actor you would use. So then you say, "Google, sign my JWT." You send some claims along, and Google will use a system-managed private key to sign the JWT. The reason why this is nice is that the key is managed by Google, so it's rotated by Google, it never gets revealed to the public, and in general it's a lot safer than just using your own credentials to assert identity.
Once Google has returned the signed JWT to you, you can just say to Vault, "Here's my token. I am the service account." Vault will take that JWT, go back to Google and say, "Give me the service account key that you used to sign it," get the public cert and verify that token. And if everything works out, then you've logged into Vault.
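The final hand-off to Vault is a single HTTP call. The sketch below builds that request without sending it; it assumes the auth backend is mounted at the default `gcp` path, and the helper name and example address are hypothetical.

```python
import json
import urllib.request


def build_vault_login_request(vault_addr: str, role: str, signed_jwt: str):
    """Build a login request for Vault's GCP auth backend.

    Assumes the backend is mounted at auth/gcp (the default path).
    """
    url = f"{vault_addr.rstrip('/')}/v1/auth/gcp/login"
    body = {"role": role, "jwt": signed_jwt}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

A successful response carries the Vault token under `auth.client_token`; a failed verification comes back as an HTTP error instead.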
And we have…
A small demo.
I have a couple scripts that I wrote. I'm just going to run them in debug mode and then explain what I'm doing.
- The first step you would take, before any of your users actually try to log into Vault, is to set up the server. I've already started a server, but I've done nothing else with it. So, I'm going to run a setup script, which will just print out the steps I'm running. 
- You start up a Vault server and you tell it where it should store your secrets. This script will go through the steps that you need to do to set up a Vault server with the GCP auth backend. Initially I'm just setting some variables: this is a service account I'm going to create and an IAM role I'm going to write to Vault. 
- First things first, you just need to enable the auth backend, so there you go. 
- The second step is you write a role. So, this one is just writing an IAM role. You can see the type, IAM, right here. You've got the project ID, which is my GCP project. You've got the bound service account, which is a service account I'm going to create for this purpose, and then that's all you really need. And so it's written the role. And I'm also going to write a role for Jesse to use later for a GCE demonstration. And so that's another type of role, as we mentioned. 
- Finally I'm going to write config. You give Vault a set of credentials that it uses to make calls to GCP to verify the JWTs. This only requires read-only access, because Vault just needs to determine: Is the service account still running? Can I get this service account's public key? And things like that. 
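The three setup steps in the list above map onto three Vault HTTP API writes. The sketch below only assembles them as data. The field names (`type`, `project_id`, `bound_service_accounts`, `policies`) follow the role description in the talk and the backend as it was released; they are assumptions that may not match every Vault version, and the helper name is hypothetical.

```python
def gcp_auth_setup_calls(role_name, project_id, service_account, policies,
                         credentials_json):
    """Return (method, path, body) tuples for the three setup steps, in order."""
    return [
        # 1. Enable the GCP auth backend at the default mount path.
        ("POST", "/v1/sys/auth/gcp", {"type": "gcp"}),
        # 2. Write an IAM-type role bound to one service account.
        (
            "POST",
            f"/v1/auth/gcp/role/{role_name}",
            {
                "type": "iam",
                "project_id": project_id,
                "bound_service_accounts": [service_account],
                "policies": list(policies),
            },
        ),
        # 3. Give Vault read-only GCP credentials for verifying JWTs.
        ("POST", "/v1/auth/gcp/config", {"credentials": credentials_json}),
    ]
```

Each tuple would be sent to the Vault server with an `X-Vault-Token` header belonging to an operator allowed to configure auth backends.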
So, once you've got it set up, then you would essentially run this script. I'm gonna just run through it here and then quickly go through it in the same way. You can see I set up the variables. I create the service account. I then created a custom role and added an IAM policy binding, which essentially just binds that role to the actor. In this case, the actor is me, my emilyye[at]google.com account. This is probably not advisable, but since it's a demo I don't really care.
So, we then generate the JWT claims. There are three claims that you, at minimum, need:
- The first is the audience, which is just the JWT saying, this is where the token is supposed to go. Usually this is a URI, but Vault doesn't really understand what that is, so we just have it be Vault and then the role you're trying to log in as. It's not a required claim, so we just kind of YOLO'ed it.
- The next one is expiration. Because this is technically a self-signed JWT, we want to enforce a shorter expiration so that the JWT can be treated like it's only being used once and not really being stolen. The maximum right now is about 15 minutes, but you can configure that on the role.
- And then finally, the subject is the service account that you're trying to log in as.
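The three claims can be sketched as a small builder. The `vault/<role>` audience format and the 15-minute ceiling follow the description above; the function name, the default TTL, and the injectable `now` parameter are assumptions for illustration.

```python
import time

# The talk cites roughly 15 minutes as the default maximum lifetime.
MAX_TTL_SECONDS = 15 * 60


def build_login_claims(service_account_email: str, vault_role: str,
                       ttl_seconds: int = 600, now=None) -> dict:
    """Build the minimum claim set: audience, expiration and subject."""
    if not 0 < ttl_seconds <= MAX_TTL_SECONDS:
        raise ValueError("ttl_seconds must be between 1 and 900")
    issued_at = int(time.time()) if now is None else int(now)
    return {
        "aud": f"vault/{vault_role}",     # where the token is meant to go
        "exp": issued_at + ttl_seconds,   # short-lived, treated as single use
        "sub": service_account_email,     # who is trying to log in
    }
```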
And so then you would call out to signJwt in the IAM API, and then once you have that JWT, you would just log in. So, I'm gonna run that and kind of quickly run through it again. So, writing all the variables, I created a service account, I create a role and then I assign it and then … This is generating the expiration date for the JWT. And then this is creating my OAuth token. And finally I sign the JWT and it will return the signed one right here.
And then I extract that—unless of course my demo doesn't work. But at the very end there, it usually would just log in. So, it might just be a permission thing. Usually it works.
At the very end, you would just see auth login as that. Yeah. So, I think right now it's that I haven't logged in as the actor and so it doesn't actually have the token, but the error message is a little off. I'm just gonna skip over that, but it should work.
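For reference, the signing call and the extraction step from the demo can be sketched as follows. This assumes the current IAM Credentials API endpoint for `signJwt` (the demo itself called the IAM API of the time), and the helper names are hypothetical. The request is built but not sent.

```python
import json
import urllib.request

# Assumption: the IAM Credentials API endpoint; "-" is the project wildcard.
SIGN_JWT_URL = (
    "https://iamcredentials.googleapis.com/v1/"
    "projects/-/serviceAccounts/{email}:signJwt"
)


def build_sign_jwt_request(service_account_email: str, claims: dict,
                           oauth_token: str):
    """Ask Google to sign `claims` with the service account's managed key."""
    # The API takes the claim set as a JSON-encoded string in "payload".
    body = {"payload": json.dumps(claims)}
    return urllib.request.Request(
        SIGN_JWT_URL.format(email=service_account_email),
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {oauth_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def extract_signed_jwt(response_body: str) -> str:
    """Pull the signed token out of the signJwt JSON response."""
    return json.loads(response_body)["signedJwt"]
```

The OAuth token must belong to the actor that holds the signing permission, which is the piece that was missing when the live demo failed.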
Going back to the presentation. We're just gonna move on, I guess, to…
Compute instances
The second method would be… It just got added today. This allows Compute instances to authenticate to Vault as well, and this is how it works: The GCE instance has access to a metadata server without needing further permissions. This can be accessed during startup. The instance requests an identity token from the metadata server, the metadata server returns the signed JWT, and then the instance passes this to Vault, which returns an auth response.
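The metadata request itself is small. The sketch below builds it without sending it (it only resolves from inside a GCE instance). The `vault/<role>` audience mirrors the IAM example and is an assumption, as is the use of the instance's `default` service account.

```python
import urllib.parse
import urllib.request

METADATA_IDENTITY_URL = (
    "http://metadata.google.internal/computeMetadata/v1/"
    "instance/service-accounts/default/identity"
)


def build_identity_request(vault_role: str):
    """Build the metadata-server request for a signed instance identity JWT."""
    query = urllib.parse.urlencode({
        "audience": f"vault/{vault_role}",
        # "full" asks for instance details (zone, instance ID and so on)
        # to be included in the token's claims.
        "format": "full",
    })
    return urllib.request.Request(
        f"{METADATA_IDENTITY_URL}?{query}",
        # The metadata server rejects requests that lack this header.
        headers={"Metadata-Flavor": "Google"},
    )
```

The response body is the JWT itself as plain text, ready to be sent to Vault's login endpoint.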
I'm gonna let Jesse demo this, so he can talk about how this works within his production infrastructure, but I'll talk a little more about why you would wanna use one over the other first.
IAM or GCE: IAM can be used by multiple resources. You don't need a VM to run it within, whereas GCE only works within the instance. On the other hand, you don't need further permissions to allow a GCE Instance to authenticate, where you would need either the service account or a separate actor to have permissions to sign a JWT, or you would need to give it the credential file and have it sign the JWT using that private key.
I wanna make it clear this is one auth backend. So, why is that good? We had the concept of the two roles. So, in this sense, a GCE Instance can actually log in to an IAM role because it has access. The token is associated automatically with a service account, so you can just say, "I want this to log in as if it were a service account," but it would still not require the signed JWT.
Now I'll hand it off to Jesse to talk a little more about GCE and how they're using it at Fleetsmith.
Jesse: Thank you. Just a quick re-introduction: My name is Jesse Endahl. I'm at Fleetsmith as the Chief Security Officer and Chief Product Officer, talking today about how we use the new GCE auth method to authenticate to Vault.
Quick background on what Fleetsmith is if you haven't heard of us: We do secure Mac management from the Cloud, so things like patching, enforcement of things like disk encryption, as well as inventory. And as you might imagine for something that's got control over laptops, we take security very seriously. So when we initially set out to build our infrastructure, we had a high bar for basically infrastructure security and there were…
3 things we were looking for:
- The first is just a general secret store. This is something that Vault obviously does very well. 
- The second was we wanted to be able to secure our services' communication and authentication, including Vault itself, with TLS certificates, as well as our Kubernetes cluster that we run on GCE. 
- And then, last but not least, for any type of customer sensitive data, we wanted to be able to basically have crypto as a service so that we could encrypt it at rest. 
Vault met all our needs here, so we went ahead and moved forward with that.
This is what our infrastructure looks like today:
- We use Consul as our storage backend, in an HA, multi-region configuration.
- We snapshot all of our Consul state periodically and back that up to Google Cloud Storage for disaster recovery.
- We use three of the secrets backends: the generic one, of course; PKI for issuing certs; and the transit backend for crypto as a service.
- We deploy all this, like I said, on GCE. We currently configure everything with Salt and Google Deployment Manager. We're looking at moving to Terraform, we just haven't had the time.
- For the audit backend, we grab all of the syslog output, collect that with Filebeat and send it into our Elastic Stack cluster, formerly known as ELK.
- We store all of our unseal keys encrypted at rest as well.
So that is a quick overview of our infrastructure.
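As an illustration of "crypto as a service", this is roughly what an encrypt call to Vault's transit backend looks like. The sketch builds the request without sending it; it assumes the backend is mounted at the default `transit` path, and the helper and key names are hypothetical rather than Fleetsmith's.

```python
import base64
import json
import urllib.request


def build_transit_encrypt_request(vault_addr: str, key_name: str,
                                  plaintext: bytes, vault_token: str):
    """Build a transit encrypt request; Vault holds the key, not the caller."""
    url = f"{vault_addr.rstrip('/')}/v1/transit/encrypt/{key_name}"
    # The transit API expects base64-encoded plaintext.
    body = {"plaintext": base64.b64encode(plaintext).decode("ascii")}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"X-Vault-Token": vault_token},
        method="POST",
    )
```

The response contains `data.ciphertext`, which the application stores at rest; decrypting later is the mirror-image call to `transit/decrypt/<key>`.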
Demo
Quick overview of the demo I'm about to give: It's a simple 4-step process that you're gonna see reflected in the script I'm gonna run.
- The first thing that happens is you have to make a request to the Google metadata server, and you receive back some signed metadata.
- You extract a JWT from that.
- You pass that to Vault.
- What you get in return from Vault is the Vault authentication token.
So, pretty straightforward. So, let's go ahead and jump in.
We're just gonna see the output of the script here. We're gonna set a bunch of variables that we use throughout the script:
- the secrets directory,
- where we want the output, the Vault token, to actually be stored (I suggest you make that a temporary memdisk of some sort so you're not writing that to real storage),
- your Vault server address (since we're running in dev mode, that's why you see just HTTP localhost there),
- your Vault server URL,
- your Vault role.
It's worth pointing out here with the Vault role that we've hard coded this for the demo, but in a real world deployment, you would want this to be some type of dynamic information, so if you're using Salt you could have your Vault roles mapped to Salt roles or like GCE Instance groups could be mapped to Vault roles. But that way you have some automated mechanism for your new Instances that are coming up to know which secrets they should try to be fetching from Vault.
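A minimal sketch of such a mapping, assuming instance group names are the lookup key. The group and role names below are made up for illustration.

```python
# Hypothetical mapping from GCE instance group names to Vault roles.
INSTANCE_GROUP_TO_VAULT_ROLE = {
    "web-frontend": "web",
    "api-workers": "api",
}


def vault_role_for(instance_group: str) -> str:
    """Resolve the Vault role a new instance should log in as.

    Fails closed: an unknown group gets no role rather than a default one.
    """
    try:
        return INSTANCE_GROUP_TO_VAULT_ROLE[instance_group]
    except KeyError:
        raise LookupError(
            f"no Vault role mapped for instance group {instance_group!r}"
        ) from None
```

Raising on an unknown group, instead of falling back to a shared role, keeps a misconfigured instance from fetching secrets it was never meant to see.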
Creating a temp file here. So, this is really the beginning of the process that I just described:
- We're going and requesting a signed JWT from the metadata server.
- We're going ahead and populating that into a variable that we're calling gce_token.
- Now we're gonna go ahead and use the signed JWT to request an auth token from the Vault server. That's what you see there.
- Now we're extracting that token using jq. Checking that it's a valid UUID there. Setting some aggressive permissions so that no one can read this token except for root, in this case.
And there we go. We've got the token. Remove that temp file, and now we're gonna go ahead and authenticate to Vault using that token.
And there we go. This was the result back from Vault showing what that token is actually able to access.
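The extract, validate and lock-down steps from the script can be sketched in Python as well. This is not the demo script itself (which used jq and shell). It assumes the login response carries the token at `auth.client_token` and that tokens are UUIDs, as they were at the time of the talk; newer Vault versions use a different token format.

```python
import json
import os
import uuid


def extract_client_token(login_response: str) -> str:
    """Pull the token out of Vault's login response and check its shape."""
    token = json.loads(login_response)["auth"]["client_token"]
    # uuid.UUID() raises ValueError on garbage; the comparison additionally
    # rejects alternate spellings such as braces or missing hyphens.
    if str(uuid.UUID(token)) != token.lower():
        raise ValueError("token is not a canonical UUID")
    return token


def store_token(token: str, path: str) -> None:
    """Write the token so that only the owning user can read it."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(token)
    # Re-apply the mode in case the file already existed with looser bits.
    os.chmod(path, 0o600)
```

Pointing `path` at a memory-backed filesystem, as suggested earlier in the demo, keeps the token off persistent disk.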
So, that is the demo. Going back to the slides here, we had a few…
Lessons learned
…from our use of Vault over roughly the last year, and there are really two learnings:
- One of our goals, like I said at the start, was to protect Vault's communications, Consul communications and Kubernetes cluster communications, all with TLS certificates issued from the Vault PKI. Until today, it was pretty cumbersome to actually bootstrap all that yourself. It's a lot easier now: one of the things that the Google Cloud team actually released today is a Terraform module that automates all of that for you, so that's a really nice improvement.
- The second thing is just this problem that we're actually talking about today. How do you actually bootstrap that initial trust? You really need your cloud vendor to help solve that problem for you if you want it to be done in a secure manner. And so we're really excited and happy about the work that the Google team and Emily have done on this. It's pretty awesome.
Emily: So that's our talk. You can always learn more: We have a bunch of blog posts about everything and Dan's solution is the second link there.
Jesse: Oh yeah, and we did a blog post as well on this, and specifically going more in depth into what I just described. We also included a little demo script, basically what you saw on the screen there, just as a kind of example starting point for people trying to do this in a real world scenario, so you can find that on our blog post.
Emily: I wanna thank everyone who worked with me on this. Thanks to the HashiCorp people who reviewed all my FAILs or PRs—sorry. And thanks to the Graphite team who are here, so you should come bother them. It's Vincent, Dana, Joe, and me. Thanks to Maya, my PM, and thanks to Dan for writing the solution. And I think that's it. There are probably more people to thank, but yeah.



