Case Study

How HashiCorp Vault Solves The Top 3 Cloud Security Challenges for Atlassian

HashiCorp's Armon Dadgar and Atlassian's Derek Chamorro discuss today's top 3 cloud security challenges—and how HashiCorp Vault helps organizations address the challenges of secret sprawl, data breaches, and identity and access control.

Speakers

  • Armon Dadgar
    Co-founder & CTO, HashiCorp
  • Derek Chamorro
    Virtualization Architect, Atlassian

Transcript

Armon Dadgar Welcome, everybody, to our webinar on Vault, and how we can use it to solve our top security challenges. Briefly, a little bit of context. My name is Armon Dadgar. You'll find me all around the internet as just @Armon, and I am one of the co-founders and CTO of HashiCorp. For those of you who are new to HashiCorp, or just learning about Vault, HashiCorp as a company is focused on the entire application delivery process. The way we like to think about it is there are three distinct layers in terms of how we manage and deliver our applications. There's the provisioning challenge of providing the core underlying compute, whether that's public cloud, whether that's private cloud, or some mix. There's the security challenge of how do we provide access to secret material and secure our applications and data on that infrastructure? At the highest level, there's the runtime: how do we run our applications, our services, and our appliances that we need for our infrastructure to work together.

The way this comes together is we have six different open-source tools. Many, many developers are familiar with or use Vagrant on a daily basis. That's where our journey starts. Then, we have tools like Packer, Terraform, Vault, Nomad, and Consul. Today, we're going to spend some time talking about Vault specifically. For several of these products, there is an enterprise equivalent that's designed to be used by organizations and teams that are leveraging the open-source version. With that, let's talk about today's agenda.

We're going to start off by talking about what the top-level use cases for Vault are. Under what scenarios would you even consider Vault as a possible solution? Then, I'll do a brief introduction to what Vault is and how it works. Then, we're going to spend some time on the new features of Vault 0.7 and a few of them from Vault 0.6.5, which was the last major release. Then, we're going to do a quick demo and run-through of what replication actually looks like with this service, and how you'd leverage something like encryption as a service.

The top three use cases we see that drive people to adopt Vault are secrets management, encryption as a service, and privileged access management. To talk briefly on each of them, secrets management is the challenge of how do we distribute to our end machines and applications the secret material they need to function. This could be things like database credentials, or API tokens such as your AWS access key and secret key. It could be TLS certificates to serve secure traffic. Anything that you might use to authorize or authenticate an application falls into this category.

The next major use case is encryption as a service. If we have data at rest or data in transit that we'd like to protect, how do we do that? This has challenges of key management, so, if we have various encryption keys, how do we ensure that only a subset of people have access to those keys? How do we rotate them? How do we ensure the key lifecycle is properly handled? Then, in addition to just the keys, there is the cryptography itself, where Vault can help us by doing cryptographic offload. Instead of implementing cryptography in our end applications and making sure the producers and consumers all implement it the same way, we can offload the challenge to Vault and use its APIs to do encryption for us, as a service.

The last one is privileged access management. If we have a central system that's storing our encryption keys, our database password, so on and so forth, how do human operators get access to it as well? If I need to go and perform some sort of maintenance operation against the database, I don't want to maintain that secret in a separate location and have one system for my apps, and a separate system for my operators. I'd like to centrally manage all of that. The more human oriented aspect of it is the privileged access management.

These are the top challenges in the space. What I often find useful is to give a brief primer on how we should think about the problem space itself. When we talk about secrets management, what are the challenges we're talking about? What this really starts with is defining what a secret is. The way we think about it is it's anything that's used for authentication or authorization. I can use a username and password to authenticate myself. I could use API tokens, or TLS certificates being used to verify my identity. These are very sensitive pieces of information. I'm sorry, these are very secret pieces of information.

Sensitive, on the other hand, is anything that we'd like to hold confidential, but can't really be used directly for authentication or authorization. For example, Social Security numbers, credit cards, emails. These are sensitive pieces of information, but they're not secret. There's a delineation there, because the volume we have of secret material where we maybe have a thousand, or ten thousand, or on the outside, a hundred thousand pieces of secret material. We might easily have millions, tens of millions, or billions of pieces of sensitive information.

As we're talking about secret management, there are a number of important questions that we have to be able to answer, right? How do our applications get these secrets? How do our humans, our operators, our DBAs get access to these secrets? How are these secrets updated, if I change the database's password, for example, or my Amazon token expires? How do I revoke a secret, in the case that an employee who had access to it leaves, or we just have a rotation policy, or something is compromised? Lastly, what do we do in the event of a compromise? If we find that our database password has found its way to a forum somewhere, what do we do now?

The state of the world that we generally see in answering these questions is what we refer to as Secret Sprawl, is that the keys, the secrets, the tokens, the certificates are distributed all over the place. They live hard-coded in applications, they live in configuration files, they're in Chef and Puppet, they're in GitHub, they're in Dropbox, they're on Wikis, so, sort of all over the place. The challenge is there's this extreme decentralization of where this material actually lives. There's very limited visibility into where they live, how they're used, if they're updated, who has access to them. They're a challenge in terms of, from a security perspective, how we reason about that secret material. Lastly, there's a very poorly defined break-glass procedure. In the event of a compromise, what we actually do, the procedure that we follow is very hard to define, because we really don't have much visibility, or even central control, because of the secret sprawl problem.

This is where Vault comes in. This is where we look to improve the state of the world. Vault's goals from the outset were to be a single source for secrets, sort of merging the privileged access management of how our operators get access with the centralized secret management of how our applications get access, to give us that centrality. We want to have programmatic access for applications that are doing it in an automated way. We want to be able to have a friendly access path for operators who are doing it manually or on an as-needed basis. For this to work, we really wanted to focus on practical security. What we mean by that is, there's always a trade-off between theoretical security (you know, if I have access to physical hardware, I can freeze the memory, I can pull out a key in transit) versus how difficult the system is to use, or how expensive it is to operate. Our goal is to find a practical medium there.

Lastly, when we say "Modern data center friendly," we really mean thinking about cloud, so, how do we do this in a pure software way that doesn't have dependencies on special hardware that we probably don't have access to in cloud environments. So, the key feature of Vault, at its most basic core, is secure storage of secrets. This is making sure that in transit and at rest, everything is encrypted. Then, as we get past the basic level of storage, there's Dynamic Secrets, which we'll talk about. It's a mechanism for generating credentials on the fly. There's a leasing, renewal, and revocation lifecycle, which allows us to have auditability, visibility, and the key rolling and revocation story. Auditing is a critical part of this whole story, so we want to make sure that we have that visibility and get away from that low-visibility state of the world.

Rich ACLs, so, we want to have very, very fine-grained access controls over who can do what. Lastly, multiple authentication methods. This is important, because we're going to have humans logging in with things like username and password versus applications, which won't be logging in that way. The most basic level is how do we just make sure Vault can operate like a secure storage site. This promise starts by saying all data is encrypted in transit, on the way from the client to Vault, as well as at rest: when Vault is actually storing data on disk, everything is always encrypted. Everything is done with 256-bit AES in GCM mode. TLS 1.2 is mandated between Vault and its clients. All of this is done in pure software, so there's no hardware requirement to make this work.

The goal here is to really encourage using the state of the art, the highest recommendation in terms of things like TLS and cipher modes. What does this actually look like? A very simplistic example is here we're just going to write a key to the path secret/foo and we're just going to write the value bar is equal to bacon. The takeaway here is that Vault lets us store relatively arbitrary things, so bar is equal to bacon doesn't actually mean anything to Vault, but maybe it means something to our application in terms of it being a username or password. Vault lays out its data like a hierarchical file system. Here, secret/ is the directory and foo is an entry within that directory.

Every secret has a lease associated with it. This lets us automatically revoke the secret at the end of a lease, unless it's renewed. What this gives us is twofold. One is it gives us certainty that the client is checking in every once in a while to renew its lease, so we have an audit trail of when a secret was accessed. In this case, if the refresh interval is every 30 days, we can see the client is still making use of it, but we could also change that value and time how long it's going to take clients to upgrade to the newest version. We can say we know within 30 days all of our clients are going to move to the newest version of the secret.
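The lease lifecycle described above (issue, periodic renewal that leaves an audit trail, expiry, and revocation) can be sketched in a few lines. This is a toy model only, not Vault's internal design: the `Lease`, `LeaseManager`, and `FakeClock` names are hypothetical, and time is simulated so the example is deterministic.

```python
from dataclasses import dataclass


@dataclass
class Lease:
    """Hypothetical lease record (not Vault's internal representation)."""
    secret_path: str
    ttl: float
    expires_at: float
    revoked: bool = False


class FakeClock:
    """Simulated clock so the example does not depend on wall time."""
    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now


class LeaseManager:
    """Toy lease tracker: every issue, renew, and revoke is recorded."""
    def __init__(self, clock):
        self._clock = clock
        self._leases = {}
        self.audit_log = []  # (event, lease_id, time) tuples

    def _record(self, event, lease_id):
        self.audit_log.append((event, lease_id, self._clock()))

    def issue(self, secret_path, ttl):
        lease_id = f"{secret_path}/lease-{len(self._leases) + 1}"
        self._leases[lease_id] = Lease(secret_path, ttl, self._clock() + ttl)
        self._record("issue", lease_id)
        return lease_id

    def is_valid(self, lease_id):
        lease = self._leases[lease_id]
        return not lease.revoked and self._clock() < lease.expires_at

    def renew(self, lease_id):
        # A renewal is the client "checking in"; it only works on a live lease.
        if not self.is_valid(lease_id):
            raise PermissionError(f"{lease_id} is expired or revoked")
        lease = self._leases[lease_id]
        lease.expires_at = self._clock() + lease.ttl
        self._record("renew", lease_id)

    def revoke(self, lease_id):
        # Break-glass path: invalidate immediately, regardless of TTL.
        self._leases[lease_id].revoked = True
        self._record("revoke", lease_id)
```

With a 30-day TTL, a client that renews on day 20 stays valid until day 50, and each renewal shows up in `audit_log`; a client that never checks in simply loses access when the lease runs out.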

Lastly, if something comes up, we either believe the secret's compromised or something is an issue, this gives us a break-glass procedure. We can say we know my secret credential was leaked, we're now going to revoke that, and use that as our break-glass procedure. Every client that has access to that credential gets revoked, and they can no longer make use of it. The question is how do we make this enforceable? How does Vault actually have the ability to revoke it? This is where the idea behind Dynamic Secrets comes in. If we actually gave the end client the true credentials, if we gave out the root credential password, there's no way to revoke it once the client knows it, since we can't wipe its memory remotely.

What we'd like to be able to do is give it a different set of credentials that's specific to that client and that can be revoked. What Vault does instead is generate, on demand, a set of credentials that have limited access, following the principle of least privilege. We'll give you the least-privilege capability into the system. Then, that token, that username and password, becomes enforceable via revocation. Every client has a unique credential that is generated on demand and that is tracked. If we decide to revoke it, we can delete their particular credential without affecting all of the clients at once.

What this lets us do is have an audit trail that can pinpoint a point of compromise, 'cause there's only a single client, for example, that has that username and password. Under the hood, what that flow looks like is, suppose we have an operator. They request access to a database, that request goes to Vault, and Vault verifies that the user has the right privileges to actually be able to do this. Then, Vault connects to the database and issues a create command, so we can create a dynamic new username and password that has just the appropriate permissions the user needs. This is all flowing through the audit broker, so, we get audit visibility: "The user requested that we generate such and such credential," and then we provide this dynamic credential back to our user. Vault sits in the pre-auth flow, then the user goes and connects to the database, and uses that credential. Critically, the data isn't flowing through Vault; Vault is just in the pre-authentication flow.
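The dynamic-credential flow above can be illustrated with a small simulation. Everything here is a stand-in: `FakeDatabase` and `DynamicSecretBroker` are hypothetical names, the "database" is an in-memory dictionary, and the username format is invented for the example. The point is only that each client gets its own credential, so one can be revoked without touching the others.

```python
import secrets


class FakeDatabase:
    """In-memory stand-in for a real database's user management."""
    def __init__(self):
        self.users = {}  # username -> (password, grants)

    def create_user(self, username, password, grants):
        self.users[username] = (password, grants)

    def drop_user(self, username):
        self.users.pop(username, None)

    def login(self, username, password):
        record = self.users.get(username)
        return record is not None and secrets.compare_digest(record[0], password)


class DynamicSecretBroker:
    """Toy broker: checks privileges, creates a unique credential, audits it."""
    def __init__(self, database, allowed_grants):
        self._db = database
        self._allowed_grants = allowed_grants  # client name -> grants it may receive
        self._issued = {}                      # username -> client name
        self.audit_log = []

    def request_credentials(self, client):
        if client not in self._allowed_grants:
            self.audit_log.append(("denied", client, None))
            raise PermissionError(f"{client} may not request database access")
        # Unique per request, so a leaked credential points at exactly one client.
        username = f"v-{client}-{secrets.token_hex(4)}"
        password = secrets.token_urlsafe(24)
        self._db.create_user(username, password, self._allowed_grants[client])
        self._issued[username] = client
        self.audit_log.append(("issued", client, username))
        return username, password

    def revoke(self, username):
        client = self._issued.pop(username)
        self._db.drop_user(username)
        self.audit_log.append(("revoked", client, username))
```

Note that the broker is only involved when credentials are issued or revoked; the client's later `login` goes straight to the database, which mirrors the pre-authentication role described in the talk.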

The way we support this is via this notion of pluggable backends within Vault, so this way there's that last-mile integration glue where Vault needs to be able to understand the API of the particular system. By making this very pluggable, it's easy for us to grow support over time. Today, there's almost a dozen providers, and it's growing all the time. Everything from cloud providers like AWS to RDBMS systems like MySQL and PostgreSQL, and message queues like Rabbit, so on, and so forth. There's a broad range of support for generating the sub-credential.

The key with all of this is really that what we're trying to provide, as we're centralizing our secret access, is a way of providing authentication, authorization, and auditing in a uniform way across all of these different systems. To that end, we have a few challenges. On the authentication side, we have two different real classes of entities that need to authenticate. There's machines, which need things like mutual TLS and tokens. We have a mechanism called AppRole, which integrates with configuration management. These are automated workflows for authenticating machines and applications. Then, there's user-oriented methods: username and password, LDAP, GitHub, which are far more username, password, MFA, traditional authentication methods.

Then, the system exposes a single rich ACL system for doing authorization. Everything in the system is default deny, so access is provided on a need-to-know basis by a whitelist. In our experience, this tends to be much, much more scalable as your organization grows, just because it's easier to reason about what you need access to versus what you don't need access to. Lastly, you've got auditing across everything. You have request-response auditing and the system is designed to fail closed. If it can't audit to any of the configured auditing backends, it will reject the client request, preferring to reject an operation as opposed to doing an operation that can't be audited, that has no visibility.
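A minimal sketch of the two ideas in this paragraph, default-deny authorization and fail-closed auditing, is shown below. The policy shape (path prefix mapped to a set of capabilities), the function names, and the choice to treat an audit sink failure as an `OSError` are all assumptions made for illustration. The sketch also assumes that a request is rejected only when none of the configured sinks could record it, which is one reading of the behaviour described.

```python
class AuditUnavailable(Exception):
    """Raised when no configured audit sink could record the request."""


def is_allowed(policy, path, capability):
    """Default deny: access exists only if a whitelist entry grants it."""
    for prefix, capabilities in policy.items():
        if path.startswith(prefix) and capability in capabilities:
            return True
    return False


def handle_request(policy, audit_sinks, path, capability, operation):
    """Authorize, audit, and only then run `operation` (a zero-arg callable)."""
    allowed = is_allowed(policy, path, capability)
    entry = {"path": path, "capability": capability, "allowed": allowed}

    logged = False
    for sink in audit_sinks:
        try:
            sink(entry)
            logged = True
        except OSError:
            continue  # try the remaining sinks

    # Fail closed: refuse to do work that cannot be audited.
    if audit_sinks and not logged:
        raise AuditUnavailable("no audit sink accepted the entry")
    if not allowed:
        raise PermissionError(f"{capability} on {path} is not whitelisted")
    return operation()
```

For example, with `policy = {"secret/app/": {"read"}}`, a read of `secret/app/db` succeeds, a write to the same path is denied, and any request is rejected outright if every audit sink raises.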

As you might imagine, a system like Vault, which is in this pre-authorization request flow, is highly availability sensitive. If you need access to your database, Vault is down, and you can't get your credential, that's a problem. From the very get-go, the system was designed with an active/standby model. It can do a leader election with Consul and elect a primary instance. If that primary fails, it will fail over to any number of standbys, so you can have a high level of availability of the service.

Going one step further, 0.7, the latest release, adds multi-data center replication. Before, you'd have one data center with an active instance and many standbys. Now, you actually have many clusters and those operate in a primary/secondary model. One cluster is the source of truth, the authority on what the records should be. Then, it mirrors to any number of secondaries so that you can lose entire data centers, and Vault can continue to function.

One of the interesting challenges with Vault is if your data at rest is encrypted, Vault itself needs a decryption key. That's the chicken and egg of how does Vault access its own data, if it's encrypted. What Vault forces is that this key must be provided online to the system. The reason we do that is to avoid that key itself ending up in a config file that's handed to Vault, which itself is then managed in plain text, eventually ending up in something like GitHub, where your root encryption key is there. That key that needs to be provided to the system, the master key, is in a sense the key to the kingdom. If you have that key, you can ultimately decrypt all of the data at rest. How do we protect against insider threats where we have an operator who has this key and decides they want to bypass the ACLs and access the data at rest?

The way we do this is using what's known as the two-man rule, or in our case, it's sort of an N-person rule, but you can imagine the Red October scene of turning two keys to launch the nuke. How this actually works is there's the encryption key, which protects data at rest. There's the master key, which itself protects the encryption key. Then, this master key gets split into different pieces. So, we have N different shares, T of which are required to recombine into that master key. The default is that we'll generate five shares, three of which are required to recombine into the master key, so we need a majority of our key holders to be present to do this.
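The five-shares, three-required split can be illustrated with Shamir's secret sharing, the classic technique for this kind of threshold scheme. The sketch below works over a prime field for brevity; that field choice, the `split` and `combine` names, and treating the master key as a single integer are assumptions for illustration and may differ from how Vault implements it.

```python
import secrets

# A Mersenne prime large enough for a 126-bit secret; chosen for brevity.
PRIME = 2**127 - 1


def split(secret, shares=5, threshold=3):
    """Split `secret` into `shares` points; any `threshold` of them recover it."""
    if not 0 <= secret < PRIME:
        raise ValueError("secret must be in [0, PRIME)")
    if not 1 < threshold <= shares:
        raise ValueError("need 1 < threshold <= shares")
    # Random polynomial of degree threshold-1 whose constant term is the secret.
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]

    def evaluate(x):
        acc = 0
        for c in reversed(coeffs):  # Horner's rule
            acc = (acc * x + c) % PRIME
        return acc

    return [(x, evaluate(x)) for x in range(1, shares + 1)]


def combine(points):
    """Lagrange-interpolate the polynomial at x = 0 to recover the secret."""
    secret = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret
```

Any three of the five points returned by `split(master_key)` reproduce `master_key` through `combine`, while a single share on its own reveals nothing useful about it, which is why one operator's share is not particularly valuable.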

What we're really distributing to our operators is this key share that on its own isn't particularly valuable. You need a majority of these keys to really be able to reconstruct access to the system. This lets us avoid being concerned with a single malicious operator attacking data at rest. In brief summary, the challenge that Vault is looking to solve is the secret sprawl problem. It really looks at two classes of threat, protecting against insider threats largely through the ACL system and the secret sharing mechanism, and protecting us against external threats using a modern and sophisticated cryptosystem.

Briefly looking at encryption as a service, this is a bit different than both privileged access and secret management. There's a different set of challenges here, which is we have a large volume of sensitive data that needs to be protected. Unfortunately, cryptography is hard. It's very easy to get the subtle nuances wrong. Securely storing keys and managing their lifecycle is also a challenge. What Vault does is have a backend called Transit, where it allows us to name these encryption keys, you know, foo, bar, or name it after an application, maybe API. Then, there's APIs that will allow you to do operations using these named keys, such as encryption and decryption.

What this lets us do is prevent our applications from ever having access to the underlying encryption keys so we don't have to worry about them exposing it. By leveraging the APIs, we don't have to worry about the applications implementing encryption and decryption, and these other operations, correctly. Data can now be encrypted and the keys managed by Vault, but then the ultimate data that it has generated, the sensitive data, can be decentralized from Vault and stored in our traditional databases, or Hadoop, or other mechanisms for storing potentially a very large number of records.

This lets our applications outsource some of the heavy-duty challenges of secret management and encryption to Vault. What it looks like from a high-level perspective is potentially we have a user that submits a request to our web server with sensitive data. The application sends the plain text to Vault and says please encrypt using, let's say, the key "web server". Vault audits that request, but then ultimately generates the ciphertext, and returns that back to the web server. The web server is then free to store it in its database.

The decryption route is to fetch the ciphertext, flow that to Vault, ask it to do the decryption, and the web server receives the plain text. In this way, Vault is only seeing sensitive data in transit, it's not actually storing it at rest, so Vault doesn't necessarily have to be able to scale to store billions of pieces of sensitive data. You can continue to use your existing scale-out storage that leverages Vault to do the encryption.
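To make the encrypt and decrypt round trips concrete, the sketch below only builds the HTTP requests an application would send; it does not contact a server. It assumes the transit backend is mounted at its default `transit/` path and follows the transit HTTP API as I understand it (base64-encoded `plaintext` in, `ciphertext` out, token in the `X-Vault-Token` header). The address, token, and key name are placeholders, and the helper names are hypothetical.

```python
import base64
import json


def build_encrypt_request(vault_addr, token, key_name, plaintext):
    """Describe the request that asks Vault to encrypt `plaintext` (bytes)."""
    return {
        "method": "POST",
        "url": f"{vault_addr}/v1/transit/encrypt/{key_name}",
        "headers": {"X-Vault-Token": token},
        # The API carries plaintext as base64 so arbitrary bytes are safe in JSON.
        "body": json.dumps(
            {"plaintext": base64.b64encode(plaintext).decode("ascii")}
        ),
    }


def build_decrypt_request(vault_addr, token, key_name, ciphertext):
    """Describe the request that asks Vault to decrypt a stored ciphertext."""
    return {
        "method": "POST",
        "url": f"{vault_addr}/v1/transit/decrypt/{key_name}",
        "headers": {"X-Vault-Token": token},
        "body": json.dumps({"ciphertext": ciphertext}),
    }


def parse_decrypt_response(response_json):
    """Extract the original bytes from a decoded decrypt response."""
    return base64.b64decode(response_json["data"]["plaintext"])
```

The application stores only the returned ciphertext string in its own database; the named key never leaves Vault, and the request dictionaries can be handed to whatever HTTP client the application already uses.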

Cool, so, that was a high-level intro to the use cases and the architecture of Vault. Now, I want to spend a little bit of time just highlighting some of the new features that have just landed. The biggest feature that came in 0.7, as I said, is multi-datacenter replication. Its model is one of primary/secondary, so, one cluster is the source of truth that gets mirrored into all the other ones. What this is really focusing on is the availability story. It allows us to lose the primary data center, or lose connectivity to it, or lose any of our secondaries, and have all of the other sites continue to function.

One of the big challenges, especially if you're using things like encryption as a service, you might be doing thousands or tens of thousands of requests per second, so, this replication model allows us to scale the requests by load sharing across multiple clusters. The design of the system is fundamentally asynchronous, so the replication is not synchronous. That's because availability is a top priority for us. If we're unable to replicate to a secondary, 'cause it's offline or there's a network connectivity challenge, we want the system to continue to function.

Replication is transparent to the clients, so, clients of Vault continue to just speak the same protocol unmodified. Their reads can be serviced locally by a primary or secondary. Any write they do gets forwarded to the primary so the source of truth gets updated. The core of the implementation uses a mechanism called Write-Ahead Logging. This is very familiar for folks in the database world. Transactions get written into this write-only log, and we ship that log down to our secondary sites.

If our two sites become too far out of sync, either 'cause it's a brand new secondary, or maybe a data center has been offline for hours or days, we may have to resort to reconciling the underlying data, 'cause there's too many logs to ship. The system makes use of a hash index to recover. Then, we make use of the ARIES recovery algorithm from the database world to be able to deal with things like power loss in the middle of transactions.

The general model is we have our primary and secondary clusters. Within each cluster, we have our active and standby instances, and they're sharing access to their storage backend. In this case, for example, Consul, and we're simply shipping our logs. As new things are changing on the primary, we ship the log to all of our secondaries. If things get overly out of date instead of the very lightweight log ship, they'll switch into a more active index reconcile to figure out exactly what keys are utilized and use this process to bring the secondary back up to speed, so they can switch back to log-shipping.
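The primary/secondary model described over the last few paragraphs (local reads, writes forwarded to the primary, log shipping, and a fallback to reconciliation when a secondary is too far behind) can be sketched as a toy simulation. The class names and the `max_log_entries` threshold are hypothetical, and the reconcile step here simply copies the primary's state, whereas the talk describes a hash index being used to work out which keys differ.

```python
class Primary:
    """Source of truth: applies writes and appends them to a log."""
    def __init__(self):
        self.state = {}
        self.log = []  # ordered (index, key, value) entries

    def write(self, key, value):
        index = len(self.log) + 1
        self.log.append((index, key, value))  # log the change, then apply it
        self.state[key] = value
        return index


class Secondary:
    """Serves reads locally, forwards writes, catches up asynchronously."""
    def __init__(self, primary):
        self.primary = primary
        self.state = {}
        self.applied_index = 0

    def read(self, key):
        return self.state.get(key)  # may be stale until the next sync

    def write(self, key, value):
        return self.primary.write(key, value)  # forwarded, not applied locally

    def sync(self, max_log_entries=100):
        pending = self.primary.log[self.applied_index:]
        if len(pending) > max_log_entries:
            # Too far behind for log shipping: rebuild from the primary's data.
            self.state = dict(self.primary.state)
            mode = "reconcile"
        else:
            for _index, key, value in pending:
                self.state[key] = value
            mode = "log-shipping"
        self.applied_index = len(self.primary.log)
        return mode
```

Because `sync` is a separate step, a write through a secondary is visible on the primary immediately but only appears in the secondary's local reads after the next sync, which is the asynchronous behaviour the talk describes.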

The next big feature is an overhaul of the Enterprise UI. I'll show that briefly when I demonstrate replication. Specifically, some of the new changes are around being able to do the encryption as a service in the UI, as well as management of replication itself. Some other cool tweaks that have come through enhance the ACL system. It's already an incredibly fine-grained system in terms of what you can control: can I create a key versus use the key versus delete the key. Now, we can go even more fine-grained. Particularly, we can allow and disallow, so whitelist and blacklist, different parameters to the API endpoints of Vault. A brief example of that is with the transit backend that I mentioned. We can use that to create named keys, so, we're using those keys to do encrypt and decrypt operations.

We might actually specify our policy to say the only types of keys you're allowed to create with the system, in this sense we're going to restrict the allowed parameters to only AES-typed keys, though Vault supports a number of other typed keys, as well. What we're going to deny is you being able to specify that the key can ever be exported from the system. In rare cases, you may want to support exporting a key, if you need to be able to share that with third parties, but if you don't, why even allow an application to potentially export? Here, we can use the ACL system to enforce that a key doesn't get exported.
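The parameter-level check described here can be approximated in a few lines. This is a simplified reading of how allowed and denied parameter lists behave, written as a toy function rather than Vault's policy engine: the function name is hypothetical, an empty value list is taken to mean "any value", and denied entries are checked before allowed ones. The key type strings are illustrative values.

```python
def parameters_permitted(request_params, allowed, denied):
    """Return True if every request parameter passes the allow/deny lists.

    `allowed` and `denied` map a parameter name to a list of values;
    an empty list is treated as matching any value (an assumption).
    """
    for name, value in request_params.items():
        # Denied parameters win over allowed ones.
        if name in denied and (not denied[name] or value in denied[name]):
            return False
        # If a whitelist exists, anything not on it is rejected.
        if allowed:
            if name not in allowed:
                return False
            if allowed[name] and value not in allowed[name]:
                return False
    return True


# Policy from the example: only AES keys may be created, and never exportable.
ALLOWED = {"type": ["aes256-gcm96"]}
DENIED = {"exportable": []}
```

Under this policy a request with `{"type": "aes256-gcm96"}` passes, while `{"type": "ecdsa-p256"}` or anything that sets `exportable` is rejected before a key is ever created.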

Another option is forced wrapping. One mode Vault supports is any time we do a read against it, we can have Vault wrap the response. Instead of actually giving us the response directly, you can think about it as putting the answer in this one-time-unwrap shell. This lets us do things like have an audit trail to ensure that only the app that was supposed to unwrap it and look at what the database password is, for example, was the one that did it. Multiple people along the chain weren't exposed to the database password. Largely, this has been a mode that you could request when you do the read, and Vault would do it for you, but now you can force it through the ACL system, as well.

Here's an example where maybe we're generating a certificate authority. What we would like to say is at minimum, you have to wrap the certificate authority for 15 seconds, and at maximum, 300 seconds. We're going to time-box the availability of this thing and we're going to force you to make sure you're wrapping it, and there's only a single person who got exposed to it. The SSH Certificate Authority is a very neat extension. The SSH backend lets Vault broker access to SSH-ing into machines. This is instead of giving every developer access to SSH everywhere by distributing the one config file, or putting every developer's certificate on every machine.

Instead, we give Vault the ability to SSH into these machines and it brokers access. It can do it in three different ways. One is dynamic generation of RSA keys, it can just generate SSH keys on the fly. One is a one-time password-based mechanism and the newest mode is a Certificate Authority. In this mode, what happens is a developer sends their SSH key to Vault and says, "I'd like to SSH into this target machine." Vault verifies they have the permission to do that and then it signs that key. It signs it using a well-known public certificate authority key. The user then SSH-es into that machine normally, but now using their signed key as opposed to the plain key.

Then, the dashed [inaudible 00:29:10] is representing the server, it's doing a verification on that signature. It's not actually communicating with Vault, it's just making sure the cryptographic signature checks out. As long as that looks good, the user is allowed to SSH in. Now, what's nice about this approach is a few things. It's that the client only communicates with Vault pre-flight, Vault doesn't have to be in the SSH path. There's very minimal computational overhead of doing this. Vault's basically just doing a signature on top of the existing public key. There's no operating system-specific integration and it's a very simple, and secure mechanism. It uses a lot of well-known and well-studied cryptographic primitives.

Batched encryption and decryption. If we're making heavy use of the transit backend to do our cryptographic offload, it can often be more efficient to encrypt or decrypt many pieces of data at the same time, as opposed to one request per item. This is a relatively new enhancement. We can basically batch many different inputs to be encrypted or decrypted. Improved auditing of limited-use tokens. We can generate access to Vault that says this user is allowed to perform one operation, or five operations, and it was hard to audit exactly how many operations were used before. Now, this just shows up in the audit log in a very obvious way. You can see the number of remaining uses for every request.
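As a sketch of what batching looks like from the client side, the helper below builds the JSON body for a single batched encrypt call instead of one request per item. It follows the transit API's `batch_input` parameter as I understand it, with each item carrying base64-encoded `plaintext`; the helper names are hypothetical, and the response shape assumed in `parse_batch_ciphertexts` (a `batch_results` list under `data`) should be checked against the API documentation for the Vault version in use.

```python
import base64
import json


def build_batch_encrypt_body(plaintexts):
    """Build one request body that encrypts many byte strings at once."""
    return json.dumps({
        "batch_input": [
            {"plaintext": base64.b64encode(item).decode("ascii")}
            for item in plaintexts
        ]
    })


def parse_batch_ciphertexts(response_json):
    """Pull the ciphertexts, in request order, out of a decoded response."""
    return [result["ciphertext"] for result in response_json["data"]["batch_results"]]
```

The body would be posted to the same encrypt endpoint used for single items, so an application encrypting a thousand records pays for one round trip rather than a thousand.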

Finally, a number of new backends were added. The Okta backend has been added to allow us to do authentication against Okta, RADIUS similarly, and etcd v3 has been added to support storing data at rest.

I'm just going to briefly do a very, very quick demo of this. I'm going to start by just bringing up two instances of Vault. These two instances are just fresh, locally running instances. Nothing special about them. I'm going to fire up the web UI for both of them. In tab one, we have Vault one, and in tab two, we have Vault two. What I'm going to do is configure Vault one to be my primary instance, so I'm going to go to replication. Currently, replication is not enabled, so we've got to turn that on. Then, I'm going to generate a secondary token. This is going to allow us to authenticate a secondary. Test, generate that, and then you copy this activation token.

Then, I'm going to come back over to our other instance of Vault that's currently running as its own independent instance, go to replication, enable this as a secondary, hit enable, and great. Vault one and Vault two are now connected to each other. This one is configured as a secondary, this one is configured as a primary. Now, click in. We should see that they both have the same set of backends mounted. We're going to add a new backend. Here, we're just going to mount the transit backend on our primary and put in a little description.

Now, if we go to our secondary and refresh, we should see the new backend has replicated, and it has showed up. Over here, I'm going to create a new encryption key, call it foo, just put in my name. Go ahead and encrypt that. I'm going to copy the ciphertext. We put in plain text and got our ciphertext back. Now I'm going to actually just decrypt it on the secondary cluster. I'm going to flip back over here, we'll go into our transit backend. We can see the encryption key "foo" has replicated. We'll go into that, go to decryption. I'm just going to paste in the ciphertext that Vault one encrypted, decrypt. We can see we got the same plain text back.

In zooming out, we can think about Vault one and Vault two as having been two different data centers. We replicated from data center one to data center two, and we are using Vault for encryption as a service through the transit backend. We defined an encryption key named "foo" here, that got replicated across, and now we can see how an application could simply span multiple data centers using Vault to encrypt and decrypt data in a way that's going to ensure there's consistency, and both sides can access it.

That's all I had for my demo. I'm going to flip back and I'm going to hand it over to Derek now, who is going to tell us about how Atlassian is using Vault.

Derek Chamorro I am the senior security architect at Atlassian, and I'm based out of Austin, Texas. Like Armon, you can find me pretty much everywhere as theredinthesky. I'm talking today about how we at Atlassian manage secrets at scale.

For those of you who are unfamiliar with who we are, we make software development and collaboration tools for both on-premise and the cloud. Here is a bit of our portfolio. We make Bitbucket, which is the Git solution for the enterprise; Confluence, which is team collaboration software; HipChat, which is instant messaging and chat for teams; and JIRA, issue and project tracking for teams. Through some recent acquisitions we acquired a company called StatusPage, which monitors your page status, as well as Trello for creating project boards.

Where are we? We are over 1700 employees now and although we started in Sydney, we now have seven offices across three other countries. We have many physical data centers, but we are rapidly increasing our cloud presence in AWS, with services running in multiple regions. This is a quote from Roald Dahl's last book, but it also reminds us that we all have secrets and in some ways, we hide them in the most unlikely places, as well.

That being said, how did we get here? Meaning, how did we realize that we had a problem with secrets management? Well, we grew up, meaning we grew as a company. When you go from 10 employees to over 1700 employees, the tools you develop should grow and scale, as well. Our development teams grew considerably and unfortunately, we realized that secrets management had become a bit of an oversight. When we looked into what our dev teams were using for secrets management, we found that we lacked a solution that the entire company could use. Some teams were using ad hoc or homegrown solutions for secrets management, but others were not.

Without a global solution, it also leaves one vulnerable to recovery procedures. Imagine having to update autonomous secret stores for multiple services, and that whole process being extremely time consuming. It was at that point we realized that building a cloud-ready secrets management solution would become a requirement. Without a global solution, we couldn't enforce consistent policies. Concepts like key rolling, revocation, and auditing were not consistent with homegrown solutions developed internally. At that point, we realized that we could do a lot better.

First, we laid out some use cases for our solution. First and foremost, we needed a secure storage solution for existing keys. Second, we wanted the ability to create keys dynamically for encryption use. This would allow us to encrypt data from applications and then store the encrypted data securely. Third, we wanted to extend our existing PKI program to build out trusted service-level CAs. This would allow our services to get X.509 certificates dynamically based on some kind of role.

Finally, we wanted the ability to generate and store AWS access keys and temporary credentials. These credentials could be generated when needed and then revoked when no longer in use. From our use cases, we started defining what our feature criteria would be. We wanted a solution that had fine-grained access control policies that our service teams could administer for their own specific use cases. We wanted the ability to bring your own key and this is a requirement for our existing services, as well as a lot of services that still provision their own API keys.
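The dynamic-certificate and AWS-credential use cases above map onto two Vault backends, and the requests involved can be sketched in Python as follows. This is an illustration under assumptions, not Atlassian's actual setup: it assumes the PKI and AWS backends are mounted at their default paths (`pki` and `aws`), and the Vault address, token, role names, and common name are hypothetical. The functions only build the HTTP requests without sending them.

```python
import json
import urllib.request

VAULT_ADDR = "https://vault.example.com:8200"  # hypothetical address


def issue_certificate_request(token, role, common_name, ttl="24h"):
    """POST pki/issue/<role>: ask Vault for a short-lived X.509 certificate."""
    body = json.dumps({"common_name": common_name, "ttl": ttl}).encode("utf-8")
    return urllib.request.Request(
        f"{VAULT_ADDR}/v1/pki/issue/{role}",
        data=body,
        headers={"X-Vault-Token": token, "Content-Type": "application/json"},
        method="POST",
    )


def aws_credentials_request(token, role):
    """GET aws/creds/<role>: ask Vault to generate AWS access keys for a role."""
    return urllib.request.Request(
        f"{VAULT_ADDR}/v1/aws/creds/{role}",
        headers={"X-Vault-Token": token},
        method="GET",
    )
```

Both responses carry lease information alongside the secret, which is what lets the credentials be revoked when they are no longer in use.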

We needed the ability to rotate and roll keys to comply with our internal password rotation policies. That's something that we share with our secrets management. From that, the ability to audit their uses, so we can track things like expired key usage. Creating dynamic keys and wrapping them with a separate key was important, as well. As Armon had mentioned, it's bringing encryption as a service. At the same time, we also wanted a solution that could be accessed by services that lived either in our physical data centers or within a cloud environment.

We needed that solution to be highly available and resistant to service disruption. HSM integration was important to us, because we wanted to ensure that master keys were getting securely backed up. The last feature itself is important, because it's related to our service mirroring model, which I'll show you in the next slide. As a form of disaster recovery, we mirror services across multiple regions as a way of recovering from a total region collapse, should it happen. We didn't think it could happen previously, but as you're aware, AWS suffered a massive outage a few weeks ago, so this type of design is meant to mitigate that type of collapse.

We first provision a set of service clusters within US-West-2, as an example here, along with their relevant API keys. Then, what we do is mirror the same service clusters in a separate region. Since these are the same services, we want them technically to share these exact same keys in a secure fashion. Unfortunately, at this point, we couldn't find a solution that could do this for us.

We started the vendor selection process based on our existing feature criteria and really came down to two potential vendors. First, we started with KMS, which is short for Key Management Service. For those who aren't familiar, it's an Amazon Web Services managed service for creating and controlling encryption keys. Taking from our list of feature criteria, we found that AWS did support per-service policies in the form of IAM policies. While they can be fine-grained, you really need to understand the full KMS object model to ensure that you are limiting specific key access.

Unfortunately, you cannot bring your own key, so this wouldn't address existing keys in use, or existing practices of being able to create your own API keys for your service. KMS does support rotation as well as auditing. It does support the ability to create dynamic keys and then wrap them with a customer master key per its customer master key attributes. But, it does not address keys or services hosted at our physical data centers. KMS is only meant for AWS resources and will only create keys and manage keys for you just for its own resources.

KMS is highly available, and because it's integrated with an HSM, it does provide that additional layer of protection for its master keys. But the one thing it did not support from our service mirroring model was key replication across multiple regions. KMS is backed by that HSM and does not allow access outside of its respective region. After spending weeks with their product teams, we realized this wasn't something they could do.

After KMS, we looked at Vault, in particular Vault Enterprise, to see how it matched our criteria. Off the bat, the access control policies involved are very fine-grained, mapping authentication backends and sets of policies to the user or service. In the next slide, I'll show you our design, in which we use Consul for a storage backend. This allows our existing keys to be stored securely while we still stick to our model of services bringing in their own keys. Keys are leased and secured, as well as revoked, all of which are actions that are auditable.
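To make the idea of fine-grained, per-service policies concrete, here is a small Python sketch that renders a Vault ACL policy from a mapping of paths to capabilities and builds the request that would register it. The service name and paths are hypothetical examples rather than Atlassian's real policies, and `render_policy` is a helper invented for this illustration. The `sys/policies/acl/<name>` endpoint is Vault's policy API as found in current releases.

```python
import json
import urllib.request

VAULT_ADDR = "https://vault.example.com:8200"  # hypothetical address


def render_policy(rules):
    """Render {path: [capabilities]} as a Vault ACL policy document."""
    blocks = []
    for path, capabilities in sorted(rules.items()):
        caps = ", ".join(f'"{c}"' for c in capabilities)
        blocks.append(f'path "{path}" {{\n  capabilities = [{caps}]\n}}')
    return "\n\n".join(blocks) + "\n"


def write_policy_request(token, name, rules):
    """Build (but do not send) the PUT that creates or updates a policy."""
    body = json.dumps({"policy": render_policy(rules)}).encode("utf-8")
    return urllib.request.Request(
        f"{VAULT_ADDR}/v1/sys/policies/acl/{name}",
        data=body,
        headers={"X-Vault-Token": token, "Content-Type": "application/json"},
        method="PUT",
    )


# Hypothetical service: may encrypt and decrypt with one transit key, and
# read its own secrets, but nothing else.
billing_rules = {
    "transit/encrypt/billing": ["update"],
    "transit/decrypt/billing": ["update"],
    "secret/billing/*": ["read"],
}
```

A service team could own a mapping like `billing_rules` for its own paths, which matches the requirement that teams administer policies for their specific use cases.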

The transit backend, as seen in the demo, allows Vault to handle data in transit, enabling envelope encryption. Since we present Vault as an internally reachable service, both our on-prem and cloud services are capable of reaching it. We've designed it to be highly available and have integrated Vault with our HSM service, and that's a key feature of Vault Enterprise. As Armon had demonstrated, keys are capable of being replicated to different Vault clusters.

In the next slide, I'll show you our design. There's a lot of stuff going on here, so I'll break it down for you in a bit. It should show you the flow. First, we launch the Vault server and the Consul server in the same availability zone. A Consul agent sits on the Vault server, then receives and forwards requests to its respective Consul cluster. Partitions are created on the CloudHSM, and we use CloudHSM as our HSM. It's an AWS service and we use it because it was something that was easily available for us.

We create partitions on CloudHSM for the ranges that serve our Vault clusters, and redundancy is built in within the HSM client binary, as we point the client to go through PKCS #11 interfaces for each HSM instance. This interface is used for master key generation and escrow, as well as to securely auto-unseal Vault during a service restart. If there's a failure with one HSM, since the key is technically written to both partitions, we always have redundancy built in.

The normal flow is as follows. Our internal clients at the top will make an API call through our shared core infrastructure, and this shared core infrastructure is connected to Amazon Web Services through AWS Direct Connect, and these are redundant links. If there's a failure on one link, we still have other links that connect to this per region. [inaudible 00:43:03] four to two in [inaudible 00:43:04] instance within their respective region. With the new replication feature, if a client is writing a secret, then it will replicate to the Vault clusters.

In conclusion, by providing a single enterprise-wide solution, we avoid snowflake management models and provide a solution that is available to both cloud and physical environments. This allows our development teams to focus on building amazing services and not worry about where to store keys, thus avoiding secret sprawl. With a centralized solution, we can build policies that match our security [inaudible 00:43:42] tier service level access, key rolling and rotation, as well as provide a full audit trail.

Finally, as we expand to more regions, so will our platform thanks to key features like replication, so that leaves us as a company feeling pretty darn good. That's all I had. Thank you very much.
