Case Study

Secrets at Scale With HashiCorp Vault at Datadog

Datadog shares the technical architecture, business requirements, and design decisions that led to them choosing Vault for their secrets management.

From cloud providers to CDNs, when Datadog authenticates with third parties using customer-supplied credentials, those credentials must be kept safe and secure. This talk explores why and how a cross-functional team at Datadog chose to use HashiCorp Vault to securely store these third-party client credentials. It covers the technical architecture, business requirements, and design decisions that led to the decision to rely on Vault for these critical workloads.

Speakers

  • Andrew Glen-Young
    SRE, Datadog

Transcript

Hi. I want to get a feel for the room. Who here is using Vault currently? Wow, quite a bit. Who's using it in production? Because it's a big difference. Who’s been using it for 6 months or more in production? OK. Less than half. A year or more? Two years or more? Wow, OK. That's not too bad. That's us.

Even if you work in a distributed systems environment, managing secrets can be a challenge. When your business involves using customer secrets, the challenge and stakes increase.

Seeking a better secrets engine

Previously, we were using a homegrown secrets engine to secure our secrets, but it's better not spoken of. We wanted to replace it with something supported that could handle our future growth. We looked at the landscape and saw that Vault came closest to what we were looking for.

One of Datadog's product strengths is the breadth of integrations with third-party services. This allows your systems and application metrics to be displayed alongside your infrastructure and integration metrics, to provide a global view for your teams. We have over 350 integrations, most of which are turnkey.

But in order to make these integrations work, many of them require credentials. This is the crux of Datadog's problem: how to safely manage but easily access customer secrets. If you're only concerned with your own secrets, this talk might not necessarily apply to you, but hopefully you'll still learn something. The system I'm going to describe has been in production for over 2 years.

For those of you who don't know, Datadog is a software-as-a-service-based solution for all your observability needs. We provide metrics, logs, APM (application performance monitoring), and so much more to ensure that you can easily observe the services, applications, and platforms that you're responsible for.

And, of course, we're hiring.

My name is Andrew Glen-Young. I'm a member of the SRE team at Datadog. Before Datadog, I worked for a number of small startups, large corporations, and open-source vendors.

Our secrets, and customers' too

Datadog is not your normal secrets use case. We're storing our secrets, but also yours, and yours, and an untold number of clients' secrets for many of the integrations and services. An example might be credentials for a cloud provider: some require account IDs and API keys, and others require a certificate. As you might imagine, we have a great responsibility to maintain the trust of our customers.

Integrations work using a fairly standard distributed worker pattern. On an interval, a job is scheduled. One worker from a pool of workers accepts the job. The worker then queries the third-party API, and then the results are persisted back to Datadog.

Clearly, I'm skipping over a whole bunch of detail here. However, the basic concept holds. It's actually pretty simple.

In order for the worker to query that third-party API, we need credentials. These credentials are provided to us by our customers. We need to ensure that we can store these secrets securely.
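To make that concrete, here is a minimal sketch of the pattern in Python. The helper names and the in-memory credential store are hypothetical stand-ins for illustration, not Datadog's actual code:

```python
import time
from queue import Queue
from threading import Thread

jobs = Queue()

def query_third_party_api(integration_id, credentials):
    # Placeholder for the real call to a cloud provider, CDN, etc.
    return {"integration": integration_id, "points": []}

def persist_results(results):
    # Placeholder for writing the collected metrics back to Datadog.
    print("persisted", results)

def worker(credentials_store):
    # One worker from the pool accepts a job, looks up the customer-supplied
    # credentials, queries the third-party API, and persists the results.
    while True:
        job = jobs.get()
        creds = credentials_store[job["integration_id"]]
        persist_results(query_third_party_api(job["integration_id"], creds))
        jobs.task_done()

# The customer-supplied credentials are exactly the secrets this talk is about.
credentials_store = {"customer1-aws": {"api_key": "..."}}

for _ in range(4):  # pool of workers
    Thread(target=worker, args=(credentials_store,), daemon=True).start()

# On an interval, a job is scheduled for each integration.
while True:
    jobs.put({"integration_id": "customer1-aws"})
    time.sleep(60)
```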

As you can imagine, we use a lot of secrets, which leads to the question: Given how we use secrets and how our usage grows, how do you secure these secrets? In one word, carefully.

Problems with the old model

Let's look at the old model. While we were using all due caution and good security practices to secure these secrets, there were a few things we were dissatisfied with:

  • Scale

  • Intermingled secrets

  • Coarse-grained access control

  • Auditing pipeline

Scale. While the old system was performing adequately, we knew that with our projected growth it wasn't going to meet our needs. The old system also intermingled third-party secrets with the secrets that we controlled. This made access control less granular than we would have liked, which also made auditing a lot trickier.

So we had these problems. How did we go about solving them? We decided to look at the problem from a slightly different angle.

We realized that third-party secrets have distinctly different properties from the other types of secrets that we store: their format, lifecycle, and revocation. We do not control their format. And by format I refer to the fact that some secrets may be a simple string, others may be a binary file, others are time-based tokens, etc. We need to be flexible in how we store them.

Since third parties dictate the authentication mechanism, we do not control their lifecycle. Customers provide us with the secrets, and they can update them at any time. Additionally, policies or mechanisms external to us may dictate the lifecycle of any given secret.

And we cannot revoke these secrets. Of course we can deny access to them, but only our customers can revoke them and make them invalid with that third party.

These differences in properties, access patterns, and threat models led us to believe that we needed a new, separate service to handle these secrets.

The criteria for a new model

Datadog operates in a mostly bottom-up manner, allowing engineers to help define and drive the product. This is a huge benefit that you get when you dog-food your own product. We formed a team consisting of engineers from integrations, security, and SRE in order to work out what we needed.

After plenty of discussion, this is basically what we came up with:

  • Fine-grained access control

  • Comprehensive auditing

  • Secure but easy to operate

  • Per-customer secrets revocation

  • Secrets versioning/rollback

  • Scales to 10x the current workload

Since we cannot control the secret's format, we needed a flexible storage model. We want to ensure that only the services that require access are the ones that are allowed access. Ideally, we'd like to limit read and write access, too.

Access must be audited. We need to know with as much detail as possible what services access the secrets, when, and what operations were performed.

It's easy to create a secure system, but it also needs to be easy to operate. Humans need to interact with the system, so this needs to be accounted for. Never skip the socio-technical aspects of a system.

We wanted the flexibility to be able to revoke secrets at a per-customer and even at a per-secret level.

And we needed to accept that errors happen. We wanted to be able to retrieve older versions of secrets and roll back if necessary.

And thinking about scale, we wanted to ensure that the system could handle at least 10 times our current workload.

As engineers, we often jump into using brand-new technologies and suffer from "not invented here" syndrome. These impulses are actually good, as they mean you're excited about what you're doing, but we need to temper them with our pragmatic side.

At Datadog, we operate in a manner where engineers own the full lifecycle of the services that they are responsible for, from development to deployment, to getting paged for issues. This neatly balances the incentive of running a stable, secure service with features and customer demand. It also means that I'm the one who's going to get paged at 3 AM if there's an issue, so I need to get this right.

Some of the secrets management contenders

We looked at various systems that were either open source or provided as managed services by cloud providers, and while many were great, they were either too limited or used software stacks that we had little experience with.

Pretty much from the get-go, we reviewed the code that we knew was accessing secrets and made sure that we had fully instrumented it. We made sure we had metrics and application performance monitors measuring that secrets access. This allowed us to create a baseline of the current usage of the older system before working on a replacement.

We considered some tools alongside Vault and, while they are great, they didn't quite meet our needs, so I'm just going to talk about a few of them.

AWS KMS is probably the closest to what we implemented using Vault, but the cost of KMS Keys and the API rate limits made it a non-starter.

Confidant by Lyft was very interesting, but ultimately the complexity of implementing authentication, the lack of per-customer revocation, and our concerns about KMS API rate limits steered us away.

Cerberus by Nike was also very interesting, but the stack wasn't something we were comfortable running in production.

Again, I'll remind you, our use case is not necessarily typical, for all the reasons I've mentioned.

Vault came closest to meeting our needs

When it came to Vault, it neatly met most of our requirements right out of the box, with a few bonus features we hadn't even considered.

Formats aren't a problem, since Vault has a flexible storage mechanism that we could use. Vault's ACLs provide us with a mechanism to limit which workers have access to which secrets. We could even limit services to read-only, write-only, or read-write access.

Audit logging was great, particularly since, if the audit log cannot be written to, the operation fails. This ensures that your audit log is always complete.

From the documentation and our experience operating Consul at scale, we were convinced that Vault would provide a similar ease of operation.

The storage layer allowed us to use Postgres, which we use extensively at Datadog. This reduced the operational burden.

Vault has a really flexible authentication model, which we didn't fully appreciate until later, when we needed to extend authentication from one cloud provider to multiple providers and even Kubernetes.

And of course, we made sure that we could monitor Vault using Datadog. For example, on this slide we can see that we're meeting our SLO for response times, that our Vault cluster has a leader, and how the individual Vault handlers are performing.

Where Vault didn’t measure up (at the time)

If we revisit our original constraints, you can see that Vault didn't quite meet all our needs, but it gave us a lot. We worked out that we could achieve per-customer secrets revocation if we used the transit backend, which I'll cover a little later.

Secrets versioning was still an issue, and remember, this was 2 years ago. Vault didn't have the v2 KV store that it has today. And since Vault at the time didn't provide the great plugin API it has now, we briefly thought about forking Vault and adding the functionality directly. But we knew this would make upgrades and maintenance difficult over time, so we rejected this option.

And then scale. We couldn't really answer this without testing it with real workloads. This led us to the conclusion that we needed to create our own secrets service, but one that would leverage Vault as much as possible. Then we could expose a simpler API to clients.

This was an easy choice for us since we controlled all the clients that would use the service. Remember, this system is in the critical path for all our customers' third-party integrations. It was essential that we got this right.

The power of Vault's transit secrets engine

At this point, it's worth discussing Vault's transit secrets engine. For those unfamiliar, it allows Vault to act as encryption as a service. The transit engine is primarily used to encrypt data from applications while still storing the encrypted data in some primary data store.

This means developers do not have to worry about the intricacies of proper encryption or decryption, as Vault handles this for you.

Enabling the transit secrets engine is really simple: you run vault secrets enable transit, and that's it. The backend is mounted at the path transit by default.

Next, we need to create an encryption key. In this example on the slide, I'm creating the key for Customer 1. Again, it's pretty simple: we write to the transit keys path and name the key "customer1." Now, if I want to encrypt a secret using Customer 1's key, the process is pretty straightforward.

As I previously mentioned, we do not control the format of the secrets that we handle, but the transit backend expects the plaintext it receives to be Base64-encoded. So we first Base64-encode the secret before encrypting it. After writing the secret data, Vault passes back the encrypted value, and then we can store this encrypted value.

But we need to access the secret at a later date, and all we have is that encrypted data. How do we decrypt the secret? By writing the encrypted data to Vault and specifying which key to use to decrypt it. This returns the secret that was encrypted, in plaintext form.

But remember, we Base64-encoded the secret first before we encrypted it. So the last step is to Base64-decode this value to retrieve the original secret. And now, we have the original value that we encrypted.
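Putting those steps together, here is a minimal sketch using the hvac Python client. The Vault address, token, key name, and secret are all placeholders:

```python
import base64
import hvac

client = hvac.Client(url="http://127.0.0.1:8200", token="root")

# Equivalent of `vault secrets enable transit`, then creating Customer 1's key.
client.sys.enable_secrets_engine(backend_type="transit", path="transit")
client.secrets.transit.create_key(name="customer1")

secret = b"customer-supplied credential (any format, even a binary file)"

# Transit expects Base64-encoded plaintext, so encode first, then encrypt.
encoded = base64.b64encode(secret).decode("ascii")
encrypted = client.secrets.transit.encrypt_data(name="customer1", plaintext=encoded)
ciphertext = encrypted["data"]["ciphertext"]  # safe to persist; Vault does not store it

# Later: decrypt with the same customer's key, then Base64-decode the result.
decrypted = client.secrets.transit.decrypt_data(name="customer1", ciphertext=ciphertext)
original = base64.b64decode(decrypted["data"]["plaintext"])
assert original == secret
```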

Using Vault ACLs, we're able to then define which service is allowed to create keys, encrypt, and decrypt data. For example, we want the service that is accepting new secrets from customers to be able to write secrets but not read them.
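For example, a write-only intake path can be expressed as a policy that grants transit encryption but no decryption. A sketch, with hypothetical policy and mount names:

```python
import hvac

client = hvac.Client(url="http://127.0.0.1:8200", token="root")

# This service may encrypt with any customer key ("write" a secret) but has no
# grant on transit/decrypt/*, so it can never read a secret back.
intake_policy = """
path "transit/encrypt/*" {
  capabilities = ["update"]
}
"""
client.sys.create_or_update_policy(name="secrets-intake", policy=intake_policy)
```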

Building on what Vault provides

Building upon the transit engine, we're able to implement the features that we needed for our secrets service. When creating the secrets service, we needed to ensure that we could identify the clients in a secure manner. We also needed to allow clients to read and write secrets.

So how does the secrets service identify valid clients? We just use Vault's built-in authentication methods. Authentication works pretty much the same way for all cloud providers and Kubernetes. The worker requests a Vault token from the secrets server and includes its identity credentials. On AWS, this is the node's PKCS7-signed instance identity document, and on GCP and Kubernetes, it's the service account JWT.

The secrets server forwards the request to Vault. Vault then validates the credentials with the cloud provider or Kubernetes, and then the Vault token is issued back to the worker.

In fact, the worker isn't even aware that Vault is the one that issued the token. It just knows that it received a token from the secrets server with a TTL. As you can see, the secrets server acts much like a small shim in front of the Vault API. And while it sounds simple, it took a lot of thought and work to keep it simple—simple being relative.
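Here is a rough sketch of what the secrets server does on the Vault side for a Kubernetes worker, using the hvac client; the Vault address and role name are hypothetical:

```python
import hvac

def issue_token_for_worker(service_account_jwt: str) -> dict:
    # The secrets server forwards the worker's identity credential to Vault.
    # Vault validates it with Kubernetes (or the cloud provider) and issues a token.
    client = hvac.Client(url="http://127.0.0.1:8200")
    login = client.auth.kubernetes.login(role="integrations-worker", jwt=service_account_jwt)
    auth = login["auth"]
    # The worker only ever sees the token and its TTL, not Vault itself.
    return {"token": auth["client_token"], "ttl": auth["lease_duration"]}
```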

How are secrets written to the secrets service? We just use the transit backend. An authenticated worker sends a secret to be encrypted to the secrets server. The secrets server then forwards the secret to Vault, which returns the secret in an encrypted format.

The secrets server then persists the encrypted secret in its database, creates a unique ID for the secret, and passes that ID back to the client.

Using this simple method for storing secrets allows us to build per-customer secrets revocation and a mechanism for rollback.

Reading secrets works in a similar manner. An authenticated worker requests access to a client secret, passing in the secret's unique ID. The secrets server then retrieves an encrypted secret from its database using this unique ID. It then requests that Vault decrypt the secret using that client's key. If successful, the secret is then passed back to the worker. Remember, all of this uses Vault's ACLs and auditing system to gate access.
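Putting the write and read paths together, here is a minimal sketch of the secrets-service core: hvac for the transit calls, with sqlite3 standing in for the real Postgres store. All names are hypothetical:

```python
import base64
import sqlite3
import uuid

import hvac

client = hvac.Client(url="http://127.0.0.1:8200", token="root")
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE secrets (id TEXT PRIMARY KEY, customer TEXT, ciphertext TEXT)")

def write_secret(customer: str, secret: bytes) -> str:
    # Encrypt with the customer's transit key, persist only the ciphertext,
    # and hand a unique ID back to the client.
    encoded = base64.b64encode(secret).decode("ascii")
    resp = client.secrets.transit.encrypt_data(name=customer, plaintext=encoded)
    secret_id = str(uuid.uuid4())
    db.execute("INSERT INTO secrets VALUES (?, ?, ?)",
               (secret_id, customer, resp["data"]["ciphertext"]))
    return secret_id

def read_secret(secret_id: str) -> bytes:
    # Look up the ciphertext by ID and ask Vault to decrypt it with that customer's key.
    customer, ciphertext = db.execute(
        "SELECT customer, ciphertext FROM secrets WHERE id = ?", (secret_id,)).fetchone()
    resp = client.secrets.transit.decrypt_data(name=customer, ciphertext=ciphertext)
    return base64.b64decode(resp["data"]["plaintext"])
```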

If we look at our design constraints again, we could see that the only thing left to work out was whether this would scale for us. Artificial benchmarks and back-of-the-envelope calculations suggested that it would, but we've all been burnt by this kind of assumption.

The only way to truly know is to deploy this to production.

The rollout

It's pretty easy to create greenfield services and roll them out. But in our case, we were replacing an existing system powering thousands of customers. How did we go about doing this in a safe and controlled manner? Two methods: feature flags and gamedays.

At Datadog, we've implemented a feature flag system allowing developers to roll out features in a safe manner. These flags are used at runtime and allow code to be turned on and off based on various criteria.

We also implement gamedays for all of our services. This comes from the philosophy of resilience engineering, but really, it's just a set of experiments that we create where we fail parts of the system and make sure that the failures match our expectations. To steal a little from Mitchell Hashimoto's keynote this morning, these methods offer progressive delivery.

Using feature flags, we were able to limit the blast radius at each step to a limited set of customers and a limited set of integrations. Since we strongly believe in dog-fooding at Datadog, we say that Datadog is our own first customer. By deploying only to ourselves, we could experiment with confidence.

Remember, this is in production. First, we tested writing secrets in parallel to both the current and the new secrets service. Then we measured the throughput and the errors that we got. After we were confident with writes, we did the same for reads.

Additionally, we could validate that secrets that were returned from the old system and the new system matched. And, of course, we used Datadog to measure and alert us to problems in case we got something wrong. We even created a burndown chart showing the progress of which integrations were migrated to the new system.
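In spirit, the flag-gated dual write and comparison looked something like this sketch. The flag helper, stores, and names are hypothetical stand-ins, not Datadog's code:

```python
import logging

log = logging.getLogger("secrets-migration")

def flag_enabled(flag: str, customer: str, integration: str) -> bool:
    # Stand-in for the feature-flag system: start with a small allowlist of
    # customers and integrations, and widen it as confidence grows.
    allowlist = {("datadog", "aws")}  # Datadog as its own first customer
    return (customer, integration) in allowlist

def store_secret(customer, integration, secret, old_store, new_store):
    # The old system stays authoritative; the new path runs dark behind the flag.
    secret_id = old_store.write(customer, integration, secret)
    if flag_enabled("dual-write-secrets", customer, integration):
        try:
            new_id = new_store.write(customer, integration, secret)
            # Read back from both systems and verify they return the same secret.
            if old_store.read(secret_id) != new_store.read(new_id):
                log.error("secret mismatch for %s/%s", customer, integration)
        except Exception:
            log.exception("new secrets service write failed (non-fatal while dark)")
    return secret_id
```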

We gameday all services that we use at Datadog. These experiments help us understand how our services handle failure. This builds confidence in how our services operate. Examples of a gameday might be AZ loss, a loss of connectivity to the database, or DNS, since it's always a DNS issue.

These experiments are really easy to do. My advice is to start with the simple failure scenarios and gradually test more complex ones later. After running an experiment, you should validate that your results match what you expect. Failures are actually good, because this is a learning opportunity.

Of course, you cannot predict every failure scenario, and that's not the goal. What this does is allow you to gain confidence in your services in that you've dealt with the most common issues. We can't run this critical service and blindly trust that it's working. This is where Datadog comes in.

We track the general latency of reads, and we look at the login times. We can track the overall reads, writes, and deletes in the system. We added a caching layer where it made sense, so we needed to track that too. The latency of reads and writes is important, as are the reads and writes of encrypted data. And we found tracking the CPU usage of the server useful as well.

Then for Vault, we like to check the latency of requests from Vault and the Postgres backend, as well as the generated throughput of requests and network traffic. And since Vault's doing a lot of encryption, we keep an eye on the CPU usage as well. And for Postgres, we like to check on the number of queries, the connection limits, and the CPU usage to ensure everything looks healthy.

So does the secrets service scale? It seems to. This graph shows an initial baseline of requests to 1x and an increase to 7.6x, and we've not yet hit our limit.

Issues

Clearly, everything went to plan and we had no issues at all. Not quite. We had at least 2 issues that I'd like to share with you.

First is a performance issue. We were diligent in deploying the service to a test environment first before moving into production. However, as most people know, your test environment is not the same as production.

While running dark in production, we noticed that the authentication mechanism was sometimes taking as much as 13 times as long to complete, from 1.5 seconds in the test environment compared with a maximum of 20 seconds in production.

This was clearly a huge problem, but this is where the benefits of using feature flags and having telemetry for all our services paid off. If we had rolled this out live, it could have been disastrous.

After some debugging, we realized it was an issue with Vault and how it was using the AWS API. Vault was requesting a list of all the EC2 instances in an account, and then checking to see if the instance existed in the returned list. What it should have done is request the EC2 instance directly.
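In API terms, the difference is roughly the following. This is a simplified Python/boto3 illustration of the access pattern, not Vault's actual Go code, and the instance ID is made up:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
instance_id = "i-0123456789abcdef0"  # hypothetical instance being authenticated

# What Vault was effectively doing: list every instance in the account (paginated),
# then scan the results for the one instance, which is slow in a large account.
paginator = ec2.get_paginator("describe_instances")
found = any(
    inst["InstanceId"] == instance_id
    for page in paginator.paginate()
    for reservation in page["Reservations"]
    for inst in reservation["Instances"]
)

# What it should do (and does after the fix): ask for that one instance directly.
response = ec2.describe_instances(InstanceIds=[instance_id])
```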

Of course, our production environment has far more instances than our development environment, so we only saw this issue in production. I was able to create a simple test case and file an issue in GitHub, including a PR. HashiCorp accepted and merged the PR within 1 day. Kudos to them.

After this fix, authentication times were reduced from 20 seconds to 0.3 seconds.

For the second issue, we noticed that at peak periods, the secrets service had a delayed response. This resulted in slower reads and writes from clients using the secrets service. Digging through our metrics, we couldn't find any issues with our application, so clearly it was a Vault issue. Those are famous last words.

Looking at the Vault metrics, we could see that the service seemed to be performing well. This was a bit of a head-scratcher. What was going on?

After looking at the network connections between the 2 services, we could see that we had many TCP connections open between them, much more than we expected.

It turned out that, due to the number of requests the secrets server was forwarding, it was eventually exhausting the Vault server: we were creating and destroying a new TLS connection for every Vault request.

Once we knew what the problem was, the fix was relatively simple: maintain a pool of connections to Vault and reuse these connections from this pool. Again, the use of feature flags and instrumenting our code allowed us to find and fix these issues before making the system live. And we would have found the issue much quicker if we had the Datadog network monitoring that we have today.
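The shape of the fix, sketched with requests and hvac, assuming the hvac client is handed a pre-built session; the pool sizes and Vault address are illustrative:

```python
import hvac
import requests
from requests.adapters import HTTPAdapter

# One long-lived session with a connection pool: TLS connections to Vault are
# reused across requests instead of being set up and torn down every time.
session = requests.Session()
session.mount("https://", HTTPAdapter(pool_connections=10, pool_maxsize=50))

client = hvac.Client(url="https://vault.example.internal:8200", session=session)
```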

Some of the good lessons learned

  • Collaborate early and often with your affected teams. We did this at every stage, from design and development to review and rollout. This ensures that everyone is invested in the same outcomes.

  • Build services with telemetry in mind. Clearly, I work for a vendor, and I have biases about how and what you should use to do it, but however you do it, try to do it as early as possible, as it saves a lot of time later. For example, we have great integrations with Vault and all of the HashiCorp products.

  • Document not only the end result, but also the decisions that you made along the way. This makes onboarding much easier later, and helps inform someone about the constraints in the system which are not necessarily obvious. It also helps a future you give a talk about this.

  • This was our first deployment of Vault. The experience and confidence that we gained was invaluable when it came time to decide to use Vault elsewhere in our infrastructure.

In our time together, I've described the architecture, business requirements, and design choices that led to the decision to rely on Vault for these critical workloads.

This system has been in place for over 2 years at Datadog. It's performed incredibly well for us, allowing us to safely and confidently store secrets with a high degree of trust and scale.

Thanks so much for your time.
