Case Study

How Capital One runs a large-scale private cloud infrastructure with Terraform

Published 8:00 AM UTC Jan 21, 2019

See how Capital One manages hundreds of VPCs in multiple regions with HashiCorp Terraform.

In this talk, Jeff Storey, the director of cloud tools development at Capital One, discusses how his company uses Terraform at "a very large scale."

At Capital One, Storey and his team deliver what they call "VPC Bones," the private cloud infrastructure that lets application teams easily build their services on a self-service pool of infrastructure resources.

Capital One has millions of customers and thousands of developers, and has recently migrated thousands of apps from its data center to the cloud. Its cloud comprises hundreds of VPCs across multiple regions.

The migration itself and the processes afterward presented many challenges. The company needed operations in their new cloud infrastructure to be fast, consistent, and secure.

Terraform was helpful in getting Capital One to this point. It's purpose-built for creating the self-service infrastructure that Capital One needed. The ease of use with HCL, reusable code, modularity, and integrations make it the ideal tool for managing infrastructure as code.

Watch the talk to learn how Terraform fits into Capital One's tool suite for managing infrastructure as code and providing secure approvals for operations processes.

For more stories on how Capital One uses Terraform, read their Terraform-tagged engineering blog posts.

Speakers

Jeff Storey, Director of Cloud Tools Development, Capital One

Transcript

I am going to be talking about running Terraform at a very large scale. While a lot of the things we're gonna talk about today are AWS specific, the tooling and methodology isn't - so it can apply to any cloud provider that you might be using. At Capital One we use Terraform for a lot of different things. We provision instances and do deployments. But I'm going to talk about what we call our VPC bones. That is all the things we build to let the application teams deliver applications on top of it - VPCs, subnets, NACLs, Route 53 domains and all the things we need to do to get it ready. Now, many of you that have used Terraform probably say, “Ok, I can spin up a VPC. It's not that hard.” Well, I will explain why it is harder at Capital One.

Specifically, we are going to talk about what challenges we had at this scale, why we chose Terraform and how it helps us out. We will also cover some of the tooling that we built around Terraform - some of which is open source, some of which is not. But the methodology can apply even if you don't have our tools. We will also cover what we are doing, going forward. At Capital One we’re still early on in our journey to the cloud, so there is still a lot of work to be done.

What “at scale” means at Capital One

When we talk about managing at scale, I want to talk about some of the different challenges that we have. But before we get into that, everyone has a different definition of what “scale” actually is. We can talk about it in terms of requests coming in, but I want to talk about what our organizational scale looks like. We have millions of customers and billions of dollars coming through our accounts. But internally we have a huge development community. We're on the order of 6,000 developers across Capital One running across hundreds of VPCs, hundreds of accounts, and multiple regions. It really is a 24x7 deployment operation.

And this isn't just for applications that are externally facing, it is all of our data analytics and internal processing. My role, as part of a centralized operations team that builds out these network stacks, is to ask: how do we serve these internal customers the same way we serve our external customers? Because any slowdowns are problematic for everyone at Capital One.

If anyone has worked on an ops team like this, then you know it is like people throwing darts at you all the time. So how do we get this right? The first thing is we need to move fast. Our whole ops team is under a hundred people. While that may sound large, this is for running all of our internal infrastructure - whether it is things like GitHub and Artifactory - as well as running a lot of our data center.

Many of the things we did manually in the data center don't scale well. When we first started out moving into AWS, probably about two or three years ago, it would take upwards of four or five weeks to get a new environment provisioned. Developers tend to do what they need to do to make things work, which meant we had all sorts of things running in different environments where they did not belong. And at the same time, with new apps spinning up in the cloud, we were also going through this massive migration - moving literally thousands of apps from the data center.

The scale of what we're trying to do is very large, and training up people who are not necessarily cloud-native at the same time poses a lot of challenges. In addition to being fast, we need to be consistent. This is particularly important when we are talking across all these accounts. Debuggability becomes very hard when things drift over time. It is easy to log into the AWS console and make a quick change. But why does your east region now look a little bit different than your west region? When you have to fail over, this becomes a hard thing to do. Consistency also makes it easier to move workloads - for example, when a team gets started in a dev environment but then needs to promote to a QA environment. All in all, we want everything to be consistent and look the same.

Security, DR, and failover

We are a bank, so security is important to us. We have a lot of people's data and we want to keep it that way. We need to have granular permissions models and understand who makes changes. For example, the people who can change the Capital One edge firewalls are very different from the people who can change dev environment code. But we don't necessarily want a central group that is a bottleneck for everybody. So how do we distribute this but remain responsible? How do we make sure that all of our changes are auditable? This helps our security, architecture, and network teams understand what is going on.

We also need to be able to recover from disasters. Disasters can happen for a variety of reasons. Sometimes they are human-caused disasters: somebody logs in and does something that they shouldn’t. But it is also when data centers go out. There was an S3 outage. Not that that is what we should always be planning for, but we need to be able to figure out how to quickly spin up these new environments. And that all goes back to consistency, security, and speed. We want to be prepared for these disasters before they happen. That way it is easier to fail over when they do. Those are the problems at a high level.

Infrastructure as code

Now, everybody has heard this buzzword before - infrastructure as code. But what does it mean? For us it means treating our infrastructure code the same way as application code.

That is not just about building scripts that do your automation because what you end up with is a collection of scripts and instructions. We call that “manual automation.” A lot of things have to be done in the right order. It doesn't work well.

Infrastructure as code means testing, using code reviews, pull requests, modularizing and all the things you would do in application code. Not copying and pasting everywhere. Terraform was extremely helpful in getting us to this point.

Why Terraform

There are lots of options, whether you're in AWS using CloudFormation, in Google Cloud using their templating language, or using tools like Troposphere and other generator languages. But there were a few key elements of Terraform that made us choose it. The first is that it was purpose-built for doing this work. JSON quickly gets out of hand. YAML is a little bit better. As our Terraform codebase starts to grow, being able to read what we wrote, with things like comments, is much nicer. Even if you've never seen Terraform code before, it is pretty obvious what is going on. This example (in the slides) is using AWS: we're creating a security group called sample and it has some tags on it. Super easy to read and write.

Now, the modularity is big and it goes back again to consistency.

It is all about creating central collections of modules that can be reused. This is a simple example of how we might deploy the same code to east and west regions, but we will also do this across east, west, multiple regions and multiple environments. This will show you how the tooling we built can promote those changes across the environments. It means you end up with a very consistent set of environments.

The Terraform plan

The terraform plan is super important to everything we do from a review perspective. You can't really do blue-green network deployments well, so if you want to change a port or something coming inbound from the edge, you better get that right the first time. Not only do we use this for our developers to figure out what they're going to change, we integrate the plan into our review process so that our architecture and security teams can understand what is about to happen too.

We've gone ahead and enhanced some of the plan capability by parsing it out and building something a little bit more human-readable that maybe people who aren't as familiar with reading the code can actually read. It makes our security team much happier and makes things move much smoother.
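The talk does not show how the plan is parsed, so the following is only a minimal sketch of the idea. It assumes the machine-readable plan that newer Terraform versions print with `terraform show -json <planfile>`, which may not be the format Capital One's parser consumed. The function name and the wording of the labels are hypothetical.

```python
import json

# Map Terraform's action lists to plain-language phrases for reviewers.
ACTION_LABELS = {
    ("create",): "will be created",
    ("update",): "will be changed in place",
    ("delete",): "will be destroyed",
    ("delete", "create"): "will be replaced (destroyed, then created)",
    ("create", "delete"): "will be replaced (created, then destroyed)",
}


def summarize_plan(plan_json):
    """Return one readable line per resource that the plan would change."""
    plan = json.loads(plan_json)
    lines = []
    for rc in plan.get("resource_changes", []):
        actions = tuple(rc["change"]["actions"])
        if actions in (("no-op",), ("read",)):
            continue  # nothing here for a reviewer to approve
        label = ACTION_LABELS.get(actions, "has actions: " + ", ".join(actions))
        lines.append(f"{rc['address']} ({rc['type']}) {label}")
    return lines
```

Feeding the output of `terraform show -json` into `summarize_plan` yields short sentences that could be posted on a pull request for reviewers who do not read Terraform code.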

Multi-cloud and open source

In terms of multiple cloud providers, I would say we are predominantly working in AWS, but as we build the tooling, we do not want to lock into a specific cloud provider. The nice part about Terraform is that you may have to have different constructs for the different environments but it makes it easier to use the same processes across different clouds.

It is interesting when you hear of a bank being open source. A lot of what we are doing at Capital One is consuming and producing open source. Open source makes it easier for us to commit things back. If for some reason Terraform does not do what we want it to do, it makes it easier for our security group to review what is going on. Embracing an open source-first culture has made life easier for the developers at Capital One.

Introducing Nimbus

Nimbus stands for the Network Infrastructure Management Bus and it is a cloud. We built it internally. It is a suite of tools and a workflow for secure approvals. We use it now to deploy a lot of our network infrastructure, including security groups for our application teams, as well as our VPC infrastructure across all of Capital One.

There are a few fundamentals before we get into the actual code here. The first one I've talked about - we want to treat our infrastructure code like application code. But the next is about creating GitHub repositories; what we call units of infrastructure. This means we deploy an entire repository at a time.

The result is we end up with hundreds if not thousands of repositories, but it is a group of infrastructures that gets deployed together. That might be a VPC for an environment or it might be a team's security group. But it always goes out together. This way we don't have any questions - like was this module deployed or that one? That is how we've done this from a repository perspective.

In the ops world and even in the development world there has been an anti-pattern. We call it “copy, paste, and mutate.” We have some scripts, we copy them, we paste them, we change them a little bit. Now we have hundreds of files that are similar but different. That is something we wanted to avoid. This all ties back to modularity and consistency and how Terraform fits into this.

The workflow piece is very important. It is not just about building tools that can deploy code. It is a question of how do we integrate it into this process because, as you can imagine, working at a bank, we rightly have a lot of security processes. But, how do we take some of that older-world mindset of everything's got to go through a central team, horizontally scale it out, but still make sure it is secure and make sure that our customers’ data is safe?

At Capital One we are API-driven. When we originally started building the tool suite, it was a bunch of command line tools. We now have the command line tools talk to our APIs. But the teams were finding interesting ways to integrate with the products we were building - whether that was through Slackbots or Jenkins Pipelines. Whatever it might be, we wanted to make sure that we were providing APIs to make our developer experience as good as we would like it to be.

What does Nimbus look like?

We wanted to build a tool that made it easy to create new repositories. So we aptly named it Repo Setup and we borrowed a concept from Maven called “archetypes”. The idea is that when you run this repository setup tool, it prompts you through by asking what type of repository you want to build. In this case we have options such as one for security groups and one for VPCs.

I'm going to walk through the VPC example and it is extensible so we can continue to add new archetypes. The idea is that it lays it out with a unique folder structure and seed files.

It also hooks it up to our CI servers so it is immediately ready for deployment. This means when a team is ready to do a deployment, it is easy.

The first thing Nimbus does as part of the repository setup is generate a bunch of modules. We have a collection of modules that we organize in different ways - that we call flavors. From a VPC perspective, we have different types of workloads that have different types of VPC layouts. For example, your web and customer facing applications might look different than an internal data analytics VPC.

We have a couple of classes of these. We have a central collection of modules. Like here you would see the VPC, the Internet Gateway and more things like subnets and NACLs. As part of this archetype process you specify what type of flavor you want to use. What that ends up doing is building the Terraform files with these modules already combined for you. This way teams could hit the ground running.

From an architectural perspective, we set up our folders such that we put those shared modules in a single folder called VPC module. The VPC config Terraform file is where those modules would get loaded. And then we have a folder for each environment - the non-prod folder and the prod folder. This allows us to call that shared module, and now our non-prod and prod look the same.
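The folder layout described above can be sketched as a small scaffolding function. This is an illustration only, not the Repo Setup tool itself: the file names, the `environment` module variable, and the function names are all hypothetical, chosen to mirror the shared module folder and the per-environment folders mentioned in the talk.

```python
from pathlib import Path


def environment_main_tf(environment):
    """Seed file for one environment; it calls the shared module."""
    return (
        'module "vpc" {\n'
        '  source      = "../vpc_module"\n'
        f'  environment = "{environment}"\n'  # hypothetical module variable
        '}\n'
    )


def vpc_archetype(environments=("non-prod", "prod")):
    """Map of relative file path -> seed content for a VPC repository."""
    files = {
        "vpc_module/vpc_config.tf": "# modules for the chosen flavor are loaded here\n",
    }
    for env in environments:
        files[f"{env}/main.tf"] = environment_main_tf(env)
    return files


def lay_out_repo(root, archetype):
    """Write the seed files under root; return the paths that were created."""
    created = []
    for rel, content in sorted(archetype.items()):
        path = Path(root) / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        if not path.exists():  # never overwrite a file a team has edited
            path.write_text(content)
            created.append(rel)
    return created
```

Because every environment folder calls the same shared module, non-prod and prod differ only in the values they pass in.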

This does require that the shared modules are sufficiently parameterizable for the different environments, so we've spent a lot of time cultivating those. But anytime we need a new VPC and new security groups, it lays out in this folder. You are free to change it once you lay it out but I don't know if I would advise that because these are the processes that we set and they work well. This leads to the environment promotion. When you want to deploy something you can easily deploy to non-prod and then to prod.

Checks and balances

We have laid out a repository but changes need to get made. This is where the whole workflow comes in. Typically the developers and the people who approve the changes are different groups of people. This allows for that set of checks and balances. From a VPC perspective, we might have a security team who does a review where we might have one or more approvers - someone from architecture or someone from security. But anyone who wants to make a change is free to do it through a pull request model. They would fork it like they would any other open source product - fork the repository, make a change, submit a pull request back.

But a couple of interesting things happen during the pull request. We do a Terraform validate to make sure we are syntactically correct. That solves a lot of problems. We generate a Terraform plan for the reviewers to look at, also for the developers. The reviewers can then see exactly what is going to change.

This last part is an open source product called Cloud Custodian. It is a rules engine that lets you run arbitrary rules against cloud resources.

A specific example might be if we are deploying a security group and want to make sure that we don't have a CIDR block of /0. Depending on the type of VPC or security group we're deploying, we will have these different rules and these all show up as GitHub status checks. This means when the reviewer goes to review it, they can see if it passed all the automated checks. They can have conversations in this asynchronous way in GitHub, letting developers continue to work in an ecosystem that they're familiar with.

It is all about keeping the developers in GitHub and building tools that support the review process. And then the approval is simply a GitHub merge. If the pull request gets accepted, they merge it in. If not, they reject it. So that is how the change gets into our master branch. And the master branch then represents all that has been approved.
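To make the kind of rule concrete, here is a minimal sketch of such a check in plain Python. It is not Cloud Custodian's policy language or API; the input shape (a list of dictionaries with `name`, `ingress`, `from_port`, and `cidr_blocks`) is an assumption made for illustration.

```python
import ipaddress


def open_to_world(cidr):
    """True when the block has a /0 prefix, such as 0.0.0.0/0 or ::/0."""
    return ipaddress.ip_network(cidr, strict=False).prefixlen == 0


def find_violations(security_groups):
    """List every ingress rule that allows traffic from a /0 block."""
    violations = []
    for sg in security_groups:
        for rule in sg.get("ingress", []):
            for cidr in rule.get("cidr_blocks", []):
                if open_to_world(cidr):
                    violations.append(
                        f"{sg['name']}: ingress port {rule['from_port']} open to {cidr}"
                    )
    return violations
```

An empty result would map to a passing status check on the pull request, and any entry to a failing one.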

Keeping it simple

Once it is in here we still need to be able to deploy that change. That is where the next part of Nimbus comes in, which is the CLI or an API to do the deployment. This is where we tried to make it super simple for developers and our ops teams. We want to make it easy because you are taking some people from an ops perspective who have worked in a traditional operations world and writing scripts. It is a standalone Python application that you run - you just deploy the GitHub repo.

This all ties back into why each repository is our unit of deployment. You can deploy any change on the approved branch. This way, if you need to roll back to a specific SHA, you can do that. You don't need to go through another pull request if something bad happens. You get prompted and then magically this goes out and deploys.

As you can see, there are a lot of things that happen behind the scenes. This diagram shows some of that. The CLI calls our API, and the API first checks if you are a collaborator on the repository. This is how we manage our permissions model. If you want to deploy a repository, you need to be a collaborator on it. Our central group of approvers decide who can be collaborators. This makes sure that you can only deploy what you have access to. Once we validate that you are a collaborator, we will then elevate you to a privileged user on your behalf so you have the ability to do these deployments.
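The collaborator gate can be sketched against GitHub's REST API, whose "check if a user is a repository collaborator" endpoint answers 204 for a collaborator and 404 otherwise. This is a sketch under assumptions, not the Nimbus implementation: the function names, the token handling, and the public `api.github.com` base URL are placeholders (an internal GitHub installation would use a different base URL).

```python
import urllib.error
import urllib.request

GITHUB_API = "https://api.github.com"  # assumption: public GitHub


def collaborator_status(owner, repo, user, token):
    """Return the HTTP status of GitHub's collaborator check for this user."""
    req = urllib.request.Request(
        f"{GITHUB_API}/repos/{owner}/{repo}/collaborators/{user}",
        headers={
            "Authorization": f"token {token}",
            "Accept": "application/vnd.github+json",
        },
    )
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # 404 means "not a collaborator"


def may_deploy(owner, repo, user, token, fetch_status=collaborator_status):
    """Allow a deployment only when GitHub confirms the user is a collaborator."""
    return fetch_status(owner, repo, user, token) == 204
```

Passing `fetch_status` as a parameter keeps the permission decision testable without a network call.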

This means we can decide not to give every developer engineering access to change security groups in VPCs. It also ensures we still have an audited change. We use GitHub's Deployments API here, which records an audit trail of all our changes. That triggers a webhook back to our CI server, which then goes ahead and makes the changes.

To illustrate how easy it is: at about 10:00 this morning, I noticed one of our Capital One developers making a change to one of our environments during one of the talks. It is super easy to do and very distributed - the approving team was somewhere else. We also publish the changes to Slack. A lot of what we're doing here is about making security more transparent so anybody can see what we have deployed. You can't slip in backdoor changes. Permissions, visibility, auditability - all the things that go with making a security team happy.
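As a rough sketch of that step, the function below builds (but does not send) a "create a deployment" request for GitHub's Deployments API. The endpoint and the `ref`, `environment`, `auto_merge`, and `description` fields belong to GitHub's REST API; the function name, the default base URL, and the description text are assumptions, and the talk does not show how Nimbus actually fills in the request.

```python
import json
import urllib.request


def build_deployment_request(owner, repo, ref, environment, token,
                             api="https://api.github.com"):
    """Build a POST request asking GitHub to record a deployment of ref."""
    body = {
        "ref": ref,                  # branch, tag, or commit SHA to deploy
        "environment": environment,  # for example "non-prod" or "prod"
        "auto_merge": False,         # deploy exactly this ref, nothing merged in
        "description": f"Deploy {ref} to {environment}",
    }
    return urllib.request.Request(
        f"{api}/repos/{owner}/{repo}/deployments",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"token {token}",
            "Accept": "application/vnd.github+json",
            "Content-Type": "application/json",
        },
    )
```

Sending the request with `urllib.request.urlopen` would create the deployment record, and GitHub would then deliver a deployment webhook event to any subscribed CI server. Because `ref` can be a commit SHA, the same call covers a rollback to an earlier approved commit.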

Looking to the future

That is what we have done in terms of tooling. As our VPC footprint grows to many more hundreds, we start running into issues because we still have a lot of moving out of our data centers to do. How do we do canary deployments of changes? How do we deploy it to all of our dev VPCs at once or a slice of our QA ones? Or a slice of our production ones? A bad change to even a dev environment can knock out a few hundred if not more developers causing some serious productivity issues. For this reason, we want to figure out how to canary that a bit and do it in a more parallel way. Not something we've done quite yet.
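One way to picture a canary rollout is to split the target VPCs into waves and stop at the first wave that reports a failure. The talk says this has not been built yet, so everything below is hypothetical, including the names and wave sizes. The sketch also deploys within a wave sequentially; a real version would likely run each wave in parallel.

```python
def plan_waves(vpcs, canary_size=1, wave_size=10):
    """Split targets into a small canary wave followed by larger waves."""
    vpcs = list(vpcs)
    canary, rest = vpcs[:canary_size], vpcs[canary_size:]
    waves = [canary] if canary else []
    waves += [rest[i:i + wave_size] for i in range(0, len(rest), wave_size)]
    return waves


def roll_out(vpcs, deploy, canary_size=1, wave_size=10):
    """Deploy wave by wave; return (deployed targets, whether all succeeded)."""
    deployed = []
    for wave in plan_waves(vpcs, canary_size, wave_size):
        results = [(vpc, deploy(vpc)) for vpc in wave]  # deploy returns True/False
        deployed.extend(vpc for vpc, ok in results if ok)
        if not all(ok for _, ok in results):
            return deployed, False  # stop before touching later waves
    return deployed, True
```

With this shape, a bad change is caught on the canary VPC before it reaches the rest of the dev environments.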

From a developer perspective the Terraform plan is easy to use.

We've talked about how we've hooked into that for our security groups, but what we would love to do is hook into this for other resource types. So if you make a change, whether it is to a VPC or a subnet, it generates a nice human-readable document, whether it is a PDF, a spreadsheet, something that somebody can read without going into the code. And many teams like to read that.

All the archetypes that we've talked about are very much specific to AWS. But as I mentioned, there is nothing in this process that is AWS-specific. We are also getting more into the user interfaces around this product from a visibility perspective. When some of our team leads or our executives want to see what people have done, there is an easy-to-see view of who has deployed what and where, rather than running GitHub commands to see. That is the next step of this product.

Finally, to recap what we've done - we aimed to build a workflow that treats infrastructure as code. I think we've done that, and it is very auditable and very secure. We're pretty happy with the progress and we're continuing to move forward.
