Case Study

Vault Fastly secret engine design and integration at The New York Times

In this talk, software engineer Ling Zhang will show how she uses Vault for securing the Fastly CDN layer at The New York Times.

The NYT has many services, each with many tokens. Managing a large number of static tokens had become a burden. To address this, they found a way to generate dynamic, short-lived tokens using HashiCorp Vault. Vault provides this functionality for GCP, AWS, and other cloud services, so they created a plugin that would do the same for Fastly.

This talk walks through how Fastly tokens are stored and used. Learn how the NYT migrated to dynamic secrets, Vault's most secure method for secrets management. It also walks through how they developed the Vault plugin to do this, with a short demo.

Transcript

Today's topic is the Vault Fastly Secret Engine. This is an open-source project that we did at the New York Times during Open Week. Open Week is a yearly event that the New York Times holds for its engineers. During this week, our engineers still need to do engineering stuff, but they can do personal projects too.

They can talk to their colleagues and other teams to do some brainstorming. They can watch online videos to learn some technology they haven't had time to look at. The good part is your manager can't put any work-relevant stuff on your plate during that week, unless it is urgent. This turned out to be a successful project during last year's Open Week, and we've been grateful for that because we finally had time to put into this project and work on it.

I forgot to introduce myself. My name is Ling Zhang and I'm a software engineer from the New York Times. I work for the delivery engineering team, and, as you can see, there is another name there, Shawn Bower. He is my colleague. He's the lead engineer from the same team I work on.

He was supposed to be here presenting with me, but why is he not here? He's like, "Ling, I'm sorry. My passport expired." I was like, "Dude, you're a lead engineer on the Vault team. You work with Vault and secrets every day. Aren't you supposed to be mindful of TTL and expiration?"

I think it's the same idea with secrets. It's also probably the reason that people want to start using a lot of dynamic secrets. They don't want to keep paying attention to expiration dates, whether the TTL is set appropriately, how many tokens they're creating, where those tokens end up, how people are using them, and where they're putting them. If you're using dynamic secrets, then you don't need to worry about any of this. You create them when you need them, and you destroy them immediately after you're done with them.

Today's topic will be a specific use case: how we're using Vault as a platform, and how we use it to talk to the Fastly API to create dynamic tokens.

What is Fastly?

This is today's agenda. We're going to first talk about the current Fastly situation at the New York Times. We're going to talk about the first try of secret management improvements that we did. We're going to talk about the Vault plugin we created, which is the Vault Fastly Secret Engine. We're going to talk about the design of it, and the integration of it. The integration we did to our CI/CD pipeline. And last but not least, we're going to talk about the future plans for it.

The first thing we're going to talk about is the current Fastly situation at The New York Times. First things first: What is Fastly? Fastly is a CDN, which stands for Content Delivery Network. The New York Times started using it during the election of 2016. As you guys know, the election of 2016 was a big change for the US. Same thing with the New York Times: starting to use Fastly was a big change for our infrastructure and architecture. We've been really happy with it.

What is a CDN? It's a network we put in between the end user and the backend. It protects the backend and relieves the pressure on it by serving the cacheable content. It improves the user experience by serving the cacheable content from a POP (a "point of presence") in a closer location instead of serving directly from the backend.

Fastly provides more than 50 POPs globally and we've been happy with its behavior. It also provides a lot of security features, like DDoS protection and web application firewalls. The other important feature we've been using from Fastly is the purge service. Whenever we want to update the cached content, we're able to purge it from the POPs within milliseconds. We either mark the TTL as invalid or delete the cached content directly from the POPs. The POP can then talk directly to the backend to get the most up-to-date content. We've been very happy with Fastly.

Managing tokens with Fastly

There are two different kinds of tokens we're managing for the Fastly service at the New York Times. There are Fastly global tokens, and Fastly purge tokens. The global tokens are the ones we're using for the daily deployment of the Fastly services. There are currently 32 apps sitting in our repositories.

Each app has three environments, known as dev, staging, and production. Each environment also has its own designated Fastly service. If we create one token for each service—32 multiplied by three—there are already 96 tokens we're managing as global tokens.

There are also purge tokens. There may be one or more purge tokens per service, if the team requires it. Some teams may not want a purge token at all. But the more collaborative services would probably ask for more than one purge token. Let's say there are 10, though there are definitely more than 10.

We're already managing more than 100 tokens. The delivery engineering team—the cache infrastructure team—is managing all the Fastly services. We have to manage all these tokens ourselves too.

Today we're going to be mostly talking about the Fastly global tokens, which are the ones we use for daily deployment. This is the CI/CD pipeline we use for Fastly services.

As I mentioned before, the apps are sitting in the GitHub repos. Each one has its own designated repository. We have all the configuration for dev, staging, and production in one repository and we're using Drone as the CI/CD deployment tool.

It's the same idea as Jenkins or Travis CI. We're defining all the CI/CD pipelines in a YAML file; for Drone, it's called .drone.yml. The only difference is that Drone is a container-based CI/CD tool, so each step in the Drone YAML is a separate Docker container.

We're using Terraform—as most people do—to generate the statefile. We're using Amazon S3 as the backend to store the remote statefile, which includes the current Fastly service configuration.

We compare the new changes with the current Fastly configuration, and we deploy to the Fastly service. We're putting all the secrets we use during the pipeline into the Drone secrets section, which is stored in Amazon RDS. All these tokens were sitting in there as plaintext, and they depended entirely on the access control of RDS to protect them. That's not ideal.

Addressing some of the key problems

The Fastly team is managing all these tokens. We want to consolidate all the tokens, and have one account managing all of them. But there's a limit on how many tokens you can have in a Fastly account—you can have 100. Apparently, we're way over the limit already. We've constantly been asking the Fastly support team to increase the limit for us. It's not ideal.

We needed a better place to store the tokens, with an easier way to manage them. Drone secrets work perfectly with the pipeline, but we weren't satisfied with the plaintext part, and we didn't want to depend entirely on the access control of RDS to protect them. We were looking for something like Vault.

We wanted to automate the process of retrieving tokens from where they're stored during deployment, and to avoid human operation. That works fine if we're using the Drone secrets section. But if we want to use Vault, we need to find a nice way to integrate it with our CI/CD pipeline.

We also wanted to automate the process of rotating secrets without manual updates everywhere. That is a problem for us if we use the Drone secrets section. Whenever you want to rotate your secrets, you have to update them manually in the Drone section. That's inconvenient, and human operation always means mistakes.

Last year, the first improvement we tried was moving the storage location from Drone secrets to Vault. That way, we solved two bullet points from the last slide. First, we found a more secure location for all the Fastly secrets by using Vault instead. Second, we found a nice way to integrate Vault into our CI/CD pipeline: we use the Vault image in our Drone YAML, and we log the app into Vault using AppRole. It can retrieve the tokens during the pipeline when they're needed.

But there were still problems. We were still constantly hitting the token limit in the Fastly account, and we still needed to update the tokens manually when we rotated them. We kept brainstorming, and we finally found a solution: the Vault Fastly secret engine. We made a few small changes based on our initial solution. We were thinking: what if we used dynamic tokens instead? We create tokens using Vault, which talks to the Fastly API, during the pipeline when we need them. Then we dump them immediately after we're done with them. We're no longer hitting the token limit in the Fastly account, and we don't have to manually rotate and update them anymore. That's what we did with the secret engine.
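The create-use-destroy pattern described here can be sketched in a few lines. This is only an illustration in Python, not the plugin's code: FakeTokenAPI is a hypothetical in-memory stand-in for a token-issuing service, and short_lived_token is an assumed helper name.

```python
from contextlib import contextmanager


class FakeTokenAPI:
    """Hypothetical in-memory stand-in for a token-issuing API."""

    def __init__(self):
        self.active = set()
        self._counter = 0

    def create(self, service_id):
        # Issue a new token scoped to one service and remember it as active.
        self._counter += 1
        token = f"tok-{service_id}-{self._counter}"
        self.active.add(token)
        return token

    def revoke(self, token):
        # Remove the token so it no longer counts against the account.
        self.active.discard(token)


@contextmanager
def short_lived_token(api, service_id):
    """Create a token for the duration of a block, then revoke it."""
    token = api.create(service_id)
    try:
        yield token
    finally:
        # Revoke even if the deployment step raised an error.
        api.revoke(token)


api = FakeTokenAPI()
with short_lived_token(api, "svc-dev") as token:
    print("deploying with", token)
print("tokens left in account:", len(api.active))
```

Because each token exists only while a deployment is running, the number of tokens in the account at any moment stays small, which is why the account limit stops being a problem.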

Dealing with design problems

Several design problems came to light when we were first brainstorming:

How to programmatically provide the MFA token to Fastly

Like all the other platforms or tools you guys are using, Fastly lets you enable MFA for user logins. I think most companies will require their engineers to enable MFA for security. That's a problem for automation if you don't have a way to supply the MFA code programmatically. We don't want to bypass it; we still want MFA.

Luckily, Vault provides a new TOTP functionality that can create TOTP tokens for you. That really benefits us. We can create the TOTP tokens within the plugin and talk to the Fastly API.

Ease of use

We did find a nice way to integrate Vault into the CI/CD pipeline. But it will be a bit different if we're not using static tokens in Vault, but using Vault as a platform to create a dynamic token.

We combine the Vault image with the Terraform image into a vault_terraform image, and we use this image in the Drone pipeline. This means that you can not only run the Terraform command, but you can also use the Vault API to create the token and export it as an environment variable. Then you do the terraform plan and the terraform apply later.

Creating dedicated access control tokens

This is important: with dynamic tokens, you will be less concerned about the security of each token, but you still don't want to let any team create any tokens for any service.

When we were designing this, we handled it at two layers. The first layer is the Fastly level. In the Fastly API call we're using, we specify which service we're creating the token for. When you pass in the service ID for a token, the token can only be used for that service. The second layer is the Vault level. We use this to specify the service field that the plugin passes when calling the Fastly API to create tokens.

A closer look at the plugin design

This is a diagram we pulled directly from the documentation that HashiCorp Vault provided online. That should be useful for you guys looking to create any Vault plugins. Let's work through this.

It's different from the plugins you create for other tools. You're not writing code directly into Vault’s codebase, you're writing a separate app. And after you complete the app, you're packing the app together with the Vault base image. You need to register your plugin with Vault so that you can use it.

In this diagram, the first step after you finish the code is to register the plugin with Vault, passing in its checksum. You generate the checksum and write it to the right path under Vault's plugin catalog to register it. After you register it, every time you use it, Vault will look up the plugin to see if it's been registered, and it will verify the checksum of the plugin.

After the plugin has been verified, Vault will send a wrapped token to the plugin you're trying to use. After the plugin has the wrapped token, it can use it to set up the RPC server with TLS and communicate with the Vault core via RPC over TLS.
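The checksum Vault's plugin catalog expects is the SHA-256 digest of the plugin binary. As a minimal sketch of that one step (in Python, with a temporary file standing in for the real binary; plugin_sha256 is an assumed helper name, not something from the plugin):

```python
import hashlib
import os
import tempfile


def plugin_sha256(path, chunk_size=65536):
    """Return the hex SHA-256 digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as binary:
        # Read in chunks so a large binary is never loaded into memory at once.
        for chunk in iter(lambda: binary.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Stand-in for a compiled plugin binary (hypothetical contents).
with tempfile.NamedTemporaryFile(delete=False) as fake_binary:
    fake_binary.write(b"stand-in plugin bytes")
try:
    print(plugin_sha256(fake_binary.name))
finally:
    os.remove(fake_binary.name)
```

If the binary is rebuilt, the digest changes, so the plugin has to be registered again with the new value before Vault will run it.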

That's the generic workflow of how the Vault plugin works. Here's a snippet of the code showing how we get the TOTP token. This is an important piece for our plugin.

How to generate a TOTP token

In this function called generateTOTPCode we're passing in one string called key. This key is the shared key. Every time you set up multi-factor authentication, whatever platform you're using will give you this shared key to set it up. You need to pass it in here to generate a TOTP token.

We're calling this function provided by HashiCorp called GenerateCodeCustom in the TOTP library. We're using the key we pass in, and the current time. There are three different parameters you can customize here. We set the TTL (the validity period) at 30 seconds for the TOTP token. I think in most common cases, we're using 6-digit TOTP tokens.

In general, people use the SHA1 algorithm. If you have a special use case, you can customize any of these to your needs, but it depends on what platform you're using and what it requires. It's straightforward and convenient for us.
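The slide with generateTOTPCode is not reproduced in this transcript. As a rough illustration of what a TOTP generator computes from the three parameters mentioned (period, digits, algorithm), here is a minimal standard-library sketch following RFC 6238. It is written in Python for brevity and is not the plugin's Go code; the name generate_totp_code and its parameter names are assumptions.

```python
import base64
import hashlib
import hmac
import struct
import time


def generate_totp_code(key_b32, at_time=None, period=30, digits=6,
                       algorithm=hashlib.sha1):
    """Compute an RFC 6238 TOTP code from a base32-encoded shared key."""
    if at_time is None:
        at_time = time.time()
    # Shared keys are usually shown in base32, sometimes with spaces
    # and without padding, so normalize before decoding.
    cleaned = key_b32.replace(" ", "").upper()
    secret = base64.b32decode(cleaned + "=" * (-len(cleaned) % 8))
    # The moving factor is the number of whole periods since the Unix epoch.
    counter = int(at_time) // period
    mac = hmac.new(secret, struct.pack(">Q", counter), algorithm).digest()
    # Dynamic truncation (RFC 4226): take 31 bits at an offset chosen
    # by the low nibble of the last byte.
    offset = mac[-1] & 0x0F
    value = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(value % 10 ** digits).zfill(digits)


# RFC 6238 test key: the ASCII string "12345678901234567890".
demo_key = base64.b32encode(b"12345678901234567890").decode()
print(generate_totp_code(demo_key, at_time=59, digits=8))  # RFC 6238 vector: 94287082
print(generate_totp_code(demo_key))  # 6-digit code for the current 30-second window
```

The code is only valid for the current period, so it has to be generated right before the API call that needs it.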

The Fastly API

Another important piece for our plugin is the Fastly API. I know this is a specific use case, but Fastly provides a way for us to create the tokens so we can make this happen. We pass this API the TOTP token we created on the last slide, and we provide the username and password so that we can create the tokens. If you don't specify any other field, it will create a global token for you, which means the token can be used for any service and in a global scope.

That's not what we want, but it's okay as we can restrict it with a service ID. This can be one single service ID, or it can be an array of services. If you want your dev, staging, and production to all share one token, you can do that. The scope can be global, which is the one we usually use for deployment. It can also be a purge scope (purge select or purge all), depending on whether you want to purge one single URL or purge everything for your service.

The TTL is optional. We have a default 5-minute TTL for the tokens we create. 5 minutes is usually enough for all the deployments we do for the Fastly services. If you need a longer one, you can customize it. That's the Fastly API.
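To make the shape of that call concrete, here is a hedged sketch that only builds the request and never sends it. The field names (username, password, name, scope, services[], expires_at), the Fastly-OTP header, and the endpoint reflect my understanding of Fastly's token API and should be checked against its reference; the token name and the helper build_token_request are assumptions, and the plugin's real Go code may differ.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode


def build_token_request(username, password, otp_code, service_ids,
                        scope="global", ttl_seconds=300, now=None):
    """Describe a token-creation request as a plain dict (nothing is sent)."""
    now = now or datetime.now(timezone.utc)
    # The 5-minute default TTL from the talk becomes an absolute expiry time.
    expires_at = now + timedelta(seconds=ttl_seconds)
    fields = [
        ("username", username),
        ("password", password),
        ("name", "vault-fastly-secret-engine"),  # hypothetical hardcoded name
        ("scope", scope),
        ("expires_at", expires_at.strftime("%Y-%m-%dT%H:%M:%SZ")),
    ]
    # One entry per service restricts the token to those services only.
    fields += [("services[]", service_id) for service_id in service_ids]
    return {
        "method": "POST",
        "url": "https://api.fastly.com/tokens",
        "headers": {
            "Fastly-OTP": otp_code,  # the TOTP code satisfies the MFA check
            "Content-Type": "application/x-www-form-urlencoded",
        },
        "body": urlencode(fields),
    }


request = build_token_request("deploy-user", "not-a-real-password", "123456",
                              ["SERVICE_ID_DEV"])
print(request["method"], request["url"])
print(request["body"])
```

Leaving service_ids empty would produce the unrestricted global token the talk warns about, so a real implementation would likely reject that case.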

Creating API tokens

Before I start the magic—like any other magic you've seen—I have to show I have an empty hand. This is an account I created for this demo. I’ll refresh it to show that there are no tokens in this account yet. Okay, there's only one token. This is the one Fastly created for this browser session.

Let's start the magic. I have a terminal window here, and a cheat sheet here. I'm going to run this Docker command to use the binary that I've already built. All I did before this step was use go build and Docker to create this binary, which we call vault-plugin. That's it. I did that ahead of time so we don't need to build it here.

We compiled the base Vault image with the plugin code we created into vault-plugin, so it has the Vault base image and also the code of the plugin. Let's run this command to spin up a local Vault. This command is one I pulled directly from the HashiCorp Vault website. It runs a dev mode Vault for us, so we don't need to unseal it. As you can see here, dev mode is enabled. In this mode, Vault runs entirely in memory and starts unsealed with a single unseal key.

We're naming the token used to log in to this Vault myroot. As you can see, it's a local Vault, and we're using port 1234 for it. And we're using the vault-plugin image we compiled.

We have a Vault now. Let's try to use it. The first thing to do is specify which Vault we're using. We're telling the terminal we're using the Vault on port 1234. We want to log into it using the token we specified.

There you go. We're in Vault. The first time we use it, we want to configure the plugin in this binary with the Vault we're using. You need to register this plugin. First you need to create a shasum for your plugin with this command. And let's verify if there's a shasum there.

Looks like there is. And then we're going to register the plugin by writing this shasum into sys/plugins/catalog/vault-fastly-secret-engine, so the Vault we're using will know this plugin is there. Success. Great.

Then we're going to enable this path for the plugin. This command is telling Vault that whatever is sent to the fastly/ path goes directly to the plugin that we registered. And as you will see in the following step, there are subpaths defined in this plugin. The config path is the one that maps to a function in the plugin that collects all the credentials for the Fastly API calls we're going to make.

In this step, I'm going to configure the plugin with the Fastly credentials. I don't want you guys to know my password, username, or shared key, so I wrapped them up in a shell script. I'm going to run it. Here we go. We write everything into fastly/config. Now the plugin knows which username and password we're using for all the API calls. Looks like the plugin's been configured. Let's see if it's working. Let's try to make a token with it.

For this demo, I created a fake service called test, and it's inactive because I haven't set up any backend for it. But it's fine, we're going to create a token for it. It looks like the service ID's already there. In this command, we're using vault write on fastly/generate, and we specify the scope to be global and the service ID to be the one for this fake test service. There you go. We got a token. Let's verify that it's really there. There it is.

I'm going to give a little bit more information, because as you can see, it's saying the token's been created July 10, and it's expiring July 10. We don't know what exactly the TTL is. There is a Fastly API we can use to verify it. I'm going to pass in the token we created here.

This token was created at this time; I'm pretty sure it's not in the same time zone as us. And it expires 10 minutes after that. We have a 5-minute default TTL. As you can see, the name matches the one we see in the UI called Vault Fastly secret engine. We hardcode it in the plugin. And the scope is global, as we defined. You cannot see the service ID because it's a fake service (it's inactive), so it's not showing here. Okay. Great. It's working. Let's go back to the slides. We finished the demo.

Reviewing our Drone YAML code

After we finished the plugin, we wanted to pack it into a binary and send it to GCS so we can use it in other deployment pipelines. We're using Drone for all the deployments at the New York Times. This is a snippet of the Drone YAML we're using to build the binary and deploy it into the GCS bucket. As you can see, the image we're using is golang:1.9-alpine. We want to keep it as light as possible so the pipeline runs as fast as it can.

We do the go build and define the Go environment for it. In the deployment step, we provide the Google credentials, which have the right access to push the binary into the GCS bucket.

Now we're going to talk about integration. How do we really integrate this plugin into the Drone pipeline we're using? This is a snippet of how we create Vault tokens to log into Vault, so we can use Vault in all the steps in the Drone YAML. At the beginning of the Drone YAML for any service that wants to use Vault, we have to log into Vault. We have to create a token that the following steps can use to log into Vault.

We're using AppRole to generate a Vault token. With AppRole, you have to provide the role ID and secret ID. We're providing the role ID in the environment section, and the secret ID is provided through the Drone secrets section. We keep them separate to be more secure.

We're specifying the Vault address. This Vault address should be the one where you have the plugin already registered and configured. You're running this command to create a Vault token that will allow you to log into Vault. And you're putting it into the root folder so it can be shared between the different steps of the pipeline. After you do this step, you should be able to use Vault.

The image we're using is vault_terraform. It lets you run both the Terraform command and the Vault command. This is the same thing we did in the demo. We're using this vault write -field=token fastly/generate command to generate the global token for this dev service. We export the token we generated as the Fastly API key so we can use it in the following commands, to do the terraform plan and terraform apply.
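The Drone YAML snippet itself is not in this transcript. As a rough sketch of what an AppRole login involves at the HTTP level, the following Python builds the login request and extracts the client token from a response. The endpoint path and the auth.client_token field are Vault's documented AppRole API; the helper names, the address, and the sample response are assumptions for illustration, and nothing is actually sent.

```python
import json


def build_approle_login(vault_addr, role_id, secret_id):
    """Describe the AppRole login request as a plain dict (nothing is sent)."""
    return {
        "method": "POST",
        "url": vault_addr.rstrip("/") + "/v1/auth/approle/login",
        "body": json.dumps({"role_id": role_id, "secret_id": secret_id}),
    }


def client_token_from_response(response_text):
    """Pull the Vault token out of a successful login response."""
    return json.loads(response_text)["auth"]["client_token"]


login = build_approle_login("https://vault.example.com/", "my-role-id",
                            "my-secret-id")
print(login["method"], login["url"])

# Hypothetical, trimmed-down response body.
sample_response = '{"auth": {"client_token": "s.example", "lease_duration": 1200}}'
print(client_token_from_response(sample_response))
```

The returned client token is what later steps present to Vault, which is why the pipeline writes it somewhere the following steps can read.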

Future plans

Last time I talked about this, we had not yet been approved by Infosec in our company to release this as open source. But now it's officially an open-source project, yay! Soon we're going to post a blog about this open-source project at open.newyorktimes.com. I recommend you guys take a look at this website because there's tons of interesting stuff that the engineers at The New York Times have done.

We want to create a Drone plugin. As you can see in the Drone YAML I showed you guys, we're still running a lot of command lines. In that sense, the Drone YAML can be tedious to read. We don't want that. We want the Drone YAML to be more readable and cleaner. We were thinking that we should pack everything together so that, in the future, the user can pass all the parameters as fields in the plugin.

We'd like to integrate the TOTP functionality in Vault into something other than Fastly. Fastly is a specific use case of how you're using Vault as a platform to talk to the API of another platform and create dynamic tokens for your pipeline. But we really want to use this as a starting point, and start to use more dynamic tokens in other use cases at The New York Times.

That would mean we don't have to deal with the secrets, expiration dates, TTLs, stuff like that. We won't have the same problem my colleague Shawn had with his passport, I guess.
