Comcast uses Terraform to consistently deploy environments across pipelines, local development machines and cloud. This talk will dive into the patterns and tools used to create a consistent experience using build containers, secrets and environment management patterns, bootstrapping, and modules.
Lew Goettner: Hello and welcome to Terraform Consistent Development and Deployment. Before we get started, I'd like to take a moment to introduce ourselves. I know our titles and departments don't mean much outside of Comcast, but I'm Lew Goettner. I've been using computers since second grade when my dad brought home a TRS-80, and — fun fact — I once wrote a blackjack game on an HP 48G calculator in high school. And my co-presenter:
Peter Shrom: Hi, I'm Pete Shrom, and my first computer was a Commodore 64 that I got for Christmas. A fun fact about me is I'm a Kansas City Barbecue Society-certified barbecue judge. If you want to talk about brisket standards, let me know.
Lew Goettner: We are a tools and automation group within Comcast, which means we build and support internal services that product teams leverage. A lot of these revolve around standard DevOps tooling, like logging, metrics, and alerting. In addition to that, we run an experimentation service and service discovery.
In addition to running new services, we also provide automation patterns to our larger org. We do this mostly through documentation as well as reusable modules, but more importantly than that, we are avid Terraform users. Our infrastructure and software projects all use Terraform to configure and deploy, and that's what this talk is about.
Over years of running Terraform as an individual, as a team, and across teams, we've developed a number of patterns we feel make it easier to do so consistently. We hope to share those with you, but we need a goal before we can start sharing.
Peter Shrom: We want to run Terraform consistently and accurately. We want to run Terraform across multiple environments, and we want to be able to run Terraform regardless of our user type or location.
We always want to run the same version of Terraform. We also want to safely store our remote state so that everyone has access to it. We want to prevent resource conflicts and divergence; we don't want people making changes that are overriding each other. We want to have that in one state. We want to be able to have everyone reference the same variables and secrets — no matter where they're running from.
Dev, stage, prod. We want to have one Terraform project that we can use everywhere — in all of our environments. We can also do one-off environments for load and stress testing, or an integration environment. We want to use the same code everywhere.
You might have different user types with different access levels. Your regular developers have a certain access level; some employees have elevated permissions — they might be able to do network-related things that not everyone can. You also have the service users for your CI/CD.
Finally, locations. We want to be able to run it from employee laptops, but also in our CI/CD systems. Why do we want to do this? The main thing is to avoid context switching. We want to make sure we're packaging everything we need to run it locally. We also want to make sure we have everything we need to run it in CI/CD. Finally, we want to make sure we can do emergency or non-routine actions seamlessly.
Lew Goettner: The first pattern we want to talk about is the build container — the container we create to run all our Terraform in. What problems does this address? One is that Terraform is very strict about its version. Once you run a newer version against a state file, everyone has to run a version greater than or equal to that one from then on.
This can make coordination very challenging across multiple projects if developers each have their own version of Terraform. No one wants to get to a deploy and find out that they can't do it because someone else has touched the remote state with a newer version of Terraform. In addition to that, we want to speed up developer ramp-up time. Terraform is not the only tool we use, and there are usually additional tools required to work in our environments — so we want to bundle all of those together.
Finally, CI/CD. Most modern CI/CD systems require containers with tooling built into them for you to run your jobs in.
First and foremost, we put in a specific version of Terraform. In addition to that, we install a shell with some standard debug tooling, the AWS CLI tools, or whatever other cloud tooling we may need. We put our secrets management tools in there, which include Vault and buildenv — which we'll get to later. Then finally, some internal tooling — for instance, federated authentication tools — and any additional tools for your CI/CD system, like Fly. Why do we do this? It gives us just two places to change settings when we need to.
First is when you're running from your laptop. To support that, every one of our projects has a docker-compose file in it. Inside that compose file, we reference a specific build container version and set environment variables — for, say, Vault or AWS. We map in the folders required to do local development, and we set up host networking. At a very high level, your docker-compose files look something like this. On top of that — depending on your CI/CD system — we add configuration there as well.
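A minimal sketch of what such a compose file might contain — the registry, image name, tag, and paths here are all hypothetical:

```yaml
# Hypothetical docker-compose.yml pinning the build container
version: "3"
services:
  terraform:
    image: registry.example.com/devops/build-container:1.2.3  # the one version to update
    network_mode: host                # host networking
    environment:
      - VAULT_ADDR                    # passed through from the host
      - AWS_PROFILE
    volumes:
      - .:/workspace                  # map in the project folder
      - ~/.aws:/root/.aws             # local cloud credentials for development
    working_dir: /workspace
```

From there, something like `docker-compose run terraform` drops a developer into the container with everything mounted.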
If you're using Jenkins, which is one of the systems we use, we tend to set an environment variable containing a specific version of the build container at the top of the build. We can then reference that variable in all of our stages — wherever we need to set up a build agent.
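As a sketch — assuming declarative pipeline syntax, with a hypothetical registry and variable name — that looks roughly like:

```groovy
// Hypothetical Jenkinsfile: pin the build container once at the top
pipeline {
  agent none
  environment {
    BUILD_CONTAINER = 'registry.example.com/devops/build-container:1.2.3'
  }
  stages {
    stage('plan') {
      agent { docker { image "${BUILD_CONTAINER}" } }   // reuse the variable per stage
      steps { sh 'make plan env=dev cicd' }
    }
  }
}
```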
If you're not a Jenkins user — say you're on Concourse, which is another of the systems we use at Comcast — you can do this in your task definition. You can have an image resource that looks very similar. You specify your build container and then a specific version of it.
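A hypothetical Concourse task definition along those lines:

```yaml
# Hypothetical Concourse task pinning the same build container
platform: linux
image_resource:
  type: registry-image
  source:
    repository: registry.example.com/devops/build-container
    tag: "1.2.3"          # keep in sync with the docker-compose file
run:
  path: make
  args: [plan, env=dev, cicd]
```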
The result of all of this is that when you have to upgrade the build container or Terraform version — things like that — you only have two places to make the change: one in the docker-compose file, and the matching version in the config for whatever CI/CD system you're using.
On to bootstrapping with Pete.
Peter Shrom: Thanks, Lew. One of the patterns that we use in our projects is bootstrapping. This is the process of getting together everything you need to run Terraform. One of the best practices with Terraform is using remote state and locking. We need one state to reference and modify.
For us, we're using an S3 backend. We also want to prevent conflicts from concurrent modifications — so we're using state locking, backed by DynamoDB. We add some resiliency to all of this by turning on S3 versioning and using multi-region replication.
Finally, we make a global DynamoDB lock table. We need all of these things to run Terraform. How do we manage the resources that Terraform itself needs? It'd be nice to manage them in code. We accomplish this with a bootstrap project we use to set up our remote state and locking. This is just for the state and locking components. We keep it with the main project source code.
Where do we keep the state from these Terraform runs? We commit it directly to the Git repository. We run the bootstrap project once per environment, and there are minimal updates — we don't have to run it over and over again. Also, there are no secrets in this project, so it's fine to commit it to the repo.
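As a sketch, the bootstrap project boils down to something like this — bucket and table names are made up, the inline versioning block assumes an older AWS provider, and replication is elided:

```hcl
# Hypothetical bootstrap project: the resources Terraform itself depends on.
# Its own state is committed to the Git repo rather than stored remotely.
resource "aws_s3_bucket" "tf_state" {
  bucket = "myteam-terraform-state-dev"

  versioning {
    enabled = true          # resiliency: keep prior versions of the state
  }
  # multi-region replication configuration elided
}

resource "aws_dynamodb_table" "tf_lock" {
  name         = "terraform-lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"   # the key name the S3 backend's locking expects

  attribute {
    name = "LockID"
    type = "S"
  }
}
```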
The next pattern that we use is the makefile. We like to store all of the logic for our deployments in our makefiles. The first problem this helps to solve is how to manage all the environments and variables. We've got our Terraform project, our remote state and lock tables, and variables — one for each environment. How do we manage that easily?
This is what a Terraform init command and config looks like with our remote state and locking. There's a lot going on there, and it's not easy to remember.
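Spelled out by hand — with hypothetical bucket, key, and table names — it looks something like:

```shell
terraform init \
  -backend-config="bucket=myteam-terraform-state-dev" \
  -backend-config="key=myproject/terraform.tfstate" \
  -backend-config="region=us-east-1" \
  -backend-config="dynamodb_table=terraform-lock" \
  -backend-config="encrypt=true"
```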
Another problem this solves is overwriting your state with the wrong one. Moving from environment to environment, you can run into issues where that might happen. You also need separate steps for planning and applying.
Finally, we need to make sure we're accessing the right variables, secrets, and credentials when we're running our projects. We manage all of these things with our makefiles. We put all of the logic in the makefiles to initialize Terraform, clean up after it runs, plan and apply.
This way, we're doing everything consistently. This is what it looks like when you run locally and what it looks like running in CI/CD. All you've got to remember is the operation you want and to pass in the environment:

make plan env=dev
make plan env=dev cicd
This is what it looks like in our makefile. First, we're going to check the environment. If no environment is defined, we're not even going to do anything — you don't want to operate in the wrong environment. Another make target we have is the clean.
We want to remove everything left over from previous runs — the local state; we also want to delete all plans before running. Maybe there are some extra reports or something from your testing — we want to remove those too.
Finally, we want to do our terraform init. First, we're going to run our clean target and check our environment. We're going to make sure you have the Vault secret you need to access things — then we're going to run terraform init, configuring everything you need.
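A condensed sketch of those targets — GNU make syntax, with hypothetical names, and the Vault check left out:

```make
# Hypothetical makefile fragment
check-env:
ifndef env
	$(error env is not set; run e.g. `make plan env=dev`)
endif

clean: check-env
	rm -rf .terraform *.tfplan        # drop prior local state and plans

init: clean
	terraform init \
	  -backend-config="bucket=myteam-terraform-state-$(env)" \
	  -backend-config="key=myproject/terraform.tfstate" \
	  -backend-config="dynamodb_table=terraform-lock"

plan: init
	terraform plan -out=$(env).tfplan

apply: check-env
	terraform apply $(env).tfplan
```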
Lew Goettner: You may have seen in that make command that we're calling a tool called buildenv. Over years of running Terraform in many projects, we found that sometimes it feels like half your job is getting your environment variables correct — and that can be cumbersome.
Your secrets are hopefully in Vault, and you end up with layers and layers of configuration — usually breaking down roughly into global configuration, environment configuration, and datacenter configuration. After frustrating attempts at this with shell scripts — or even in the makefile — we decided to author a simple tool to manage it. It's written in Go; it's a CLI tool, available as a binary for Linux, Mac, and Windows. It's open source, so you can find it in the Comcast space on GitHub.
We hope you'll go and download that if you think this would be useful for you.
In the most basic scenario — setting up environment variables — you define a YAML file, and within there, you have your layers of configuration. In the one in front of us, you can see that we have configuration at a global level, including some we're passing to Terraform via the TF_VAR_ pattern. In addition to that, we have production variables, and then even a datacenter-one variable within production. You can continue that on for all of your environments and datacenters.
In addition, buildenv has support to read secrets directly out of Vault. It follows the same pattern, where you define global, then environment — and even datacenter-specific — values. What we tend to do as a team is that all of our projects have one variables YAML file containing the full set of environment variables required to run the project. When you run it, it's actually quite simple. You run buildenv, you select your environment, and optionally, you select the datacenter. It spits out the exports to set all those environment variables for you.
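A rough sketch of such a layered file — the key layout here is illustrative and may not match buildenv's exact schema (check its README), and the Vault paths are made up:

```yaml
# Hypothetical variables.yml for buildenv
vars:
  TF_VAR_project: myproject            # global, passed straight to Terraform
secrets:
  GLOBAL_API_KEY: secret/global/api    # resolved from Vault (assumed path)
environments:
  prod:
    vars:
      TF_VAR_instance_count: "4"
    secrets:
      DB_PASSWORD: secret/prod/db
    dcs:
      dc1:
        vars:
          TF_VAR_subnet: subnet-0abc1234
```

Running something like `eval "$(buildenv -e prod -d dc1)"` (flag names assumed) then exports the merged global, environment, and datacenter values into your shell.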
In addition to the one file we always have in there setting up the project, we also have been including a separate variables file just for CI/CD use. We do this mostly to set up access. In the example here, you can see we have AWS access keys and secrets we're pulling out per environment. We trigger that in the makefile with a simple flag that indicates you're running inside CI/CD. So, the full chain of events is: Set up authentication if necessary, set up all the environment variables for the project, and then run your command — whatever it may be.
You have one secret required to do your build. If you're running locally, you get your Vault token however you are set up to do so. If you're running in CI/CD, that's the one secret you need to store — and from that, you can run your project.
Peter Shrom: What does that mean? We use multiple layers for our Terraform projects. We have our AWS accounts; then we have a project for our base infrastructure. On top of that, we have multiple application projects that we can run in there. Each of these applications can live in its own codebase. These applications don't need to know about each other, and they can be deployed independently. This gives us a lot of flexibility in how we develop things.
For our base project, we're going to set up our VPCs, our subnets and routing, our global security groups, and global roles. All of these things are going to be output into the Terraform state. We'll put out the subnet IDs, availability zones, and security group IDs so we can consume these in our other projects.
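In HCL terms, the base project's outputs might look like this — resource and output names are hypothetical:

```hcl
# Hypothetical outputs exposed by the base infrastructure project
output "vpc_id" {
  value = aws_vpc.main.id
}

output "private_subnet_ids" {
  value = aws_subnet.private[*].id
}

output "app_security_group_id" {
  value = aws_security_group.app.id
}
```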
Our Terraform software projects can be deployed independently of this base infrastructure; they can read the exposed outputs when necessary, and tolerate minor version mismatches. Our base infrastructure isn't going to change very often — that might be on an older version of Terraform. Our application that's in active development can be newer and can be kept updated — it doesn't matter. It can inherit that state and use those resources that are referenced.
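An application project can then read those outputs with a terraform_remote_state data source — bucket, key, and names here are hypothetical:

```hcl
data "terraform_remote_state" "base" {
  backend = "s3"
  config = {
    bucket = "myteam-terraform-state-dev"
    key    = "base/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_instance" "app" {
  ami           = "ami-0abc12345"       # placeholder
  instance_type = "t3.micro"
  subnet_id     = data.terraform_remote_state.base.outputs.private_subnet_ids[0]
}
```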
Terraform isn't the only tool out there. Sometimes another tool is better for the job. In addition to adding outputs into the state that we're inheriting, we're also writing things to the SSM Parameter Store.
In this example, we're writing out a list of subnets to the SSM Parameter Store. We're also outputting them into the remote state. Why do we do this? It makes it very easy to look things up. You don't need to initialize a Terraform project and pull the remote state. You can just authenticate to AWS and look it up from the CLI. We use this a lot with CloudFormation and SAM to deploy Lambdas.
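A minimal sketch of publishing the same value both ways — the parameter path and names are assumptions:

```hcl
# Expose the subnet list in the remote state...
output "private_subnet_ids" {
  value = aws_subnet.private[*].id
}

# ...and in the SSM Parameter Store for non-Terraform consumers
resource "aws_ssm_parameter" "private_subnets" {
  name  = "/base/dev/private-subnets"   # assumed path convention
  type  = "StringList"
  value = join(",", aws_subnet.private[*].id)
}
```

Anything with AWS credentials can then read it with `aws ssm get-parameter --name /base/dev/private-subnets --query Parameter.Value --output text`.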
In this example, we're doing a SAM deploy — and we're passing in the path for the VPC ID and the subnet list. This is the CloudFormation that defines the parameters as SSM Parameter Store types — and this is where we're referencing those in our CloudFormation.
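A sketch of that CloudFormation fragment — the SSM parameter types are real CloudFormation types, but the parameter paths and resource names are hypothetical:

```yaml
Parameters:
  VpcId:
    Type: AWS::SSM::Parameter::Value<AWS::EC2::VPC::Id>
    Default: /base/dev/vpc-id            # resolved from SSM at deploy time
  SubnetIds:
    Type: AWS::SSM::Parameter::Value<List<AWS::EC2::Subnet::Id>>
    Default: /base/dev/private-subnets

Resources:
  AppFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: app.handler
      Runtime: python3.12
      CodeUri: ./src
      VpcConfig:
        SecurityGroupIds: [sg-0abc12345]  # placeholder
        SubnetIds: !Ref SubnetIds         # the SSM-resolved subnet list
```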
Lew Goettner: This is the next pattern we want to talk about. You don't just have one environment. How do we keep things straight across all of them? This is infrastructure as code, so ideally, we're going to keep using that same code with different variables to configure our environments consistently.
You've already been through how we use the makefile and buildenv to handle secrets and set up remote state. What's left after you do that? What's left is Terraform variables, which tend to be grouped into two categories: global variables — your defaults — and then environment-specific variables.
Over time, we've settled on two patterns for this because they work well in different situations. The first is to use Terraform variable files — .tfvars files — where we keep one per environment. The second is a single variables.tf file, which defines all the variables and values for the entire project and all environments. Let's go through those.
Our variables and defaults are defined with the resources — usually right above the line where each is first referenced. Then we have separate variable files for each environment. The pros to this approach: it's easy to set defaults and override only where necessary. We get a distinct separation of variables across environments, and it's easy to add a new environment. You can either start a new empty variables file and add your overrides — or you can copy one of the existing environments and adjust as necessary.
The cons to this approach are that sometimes it's difficult to compare environments. You have to look at as many files as you have environments to see what the differences are.
Also, checking and changing defaults is a little more tedious because the defaults don't live in one place. You have to go find the variable in the file, change the default, and then check all of your environment files to see if they should still be overriding that value when you're finished.
What does this look like? In this case, we have a simple variable defined for our VPC ID, and then immediately below that, we reference it. To set that variable per environment, we'll have three files — dev.tfvars, staging, and production — each of which will have its own set of overrides.
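Sketched out, with hypothetical names — the variable lives next to its first use, and each environment file overrides it:

```hcl
# main.tf -- variable defined right above the resource that uses it
variable "vpc_id" {
  default = "vpc-0default111"    # default, overridden per environment
}

resource "aws_security_group" "app" {
  name   = "app"
  vpc_id = var.vpc_id
}

# dev.tfvars (one such file per environment, passed via -var-file):
#   vpc_id = "vpc-0dev2222"
```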
On to the second approach, which we use sometimes: one variables.tf file. This is a single file containing all of the project variables. In this case, we use maps wherever a variable changes per environment, and we use singular values for the defaults.
The pros for this are that it's very easy to scan for environment differences because they're all defined right next to each other. Additionally, the default values are very clear. They're the only singular values in there — all of your environment-specific variables are referenced by map.
The cons are that variables aren't defined alongside the resources that use them, so that involves a little bit of context switching. Also, changing a default into a per-environment value requires changes in multiple places: you have to switch the variable to a map, then find where you reference it and reference it by map as well.
So what does this look like? In this case, we have a simple example of setting access log retention. You can very easily see that dev, stage, and prod all have different values. When you reference that variable — or any of the variables you set this way — you have to make sure you always reference it as a map, using the environment as the key.
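As a sketch — the values and resource names are hypothetical:

```hcl
variable "env" {}                   # dev | stage | prod, set per run

# per-environment values live in one map; true defaults stay singular
variable "access_log_retention" {
  type = map(number)
  default = {
    dev   = 7
    stage = 30
    prod  = 365
  }
}

resource "aws_cloudwatch_log_group" "access" {
  name              = "access-logs-${var.env}"
  retention_in_days = var.access_log_retention[var.env]   # keyed on environment
}
```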
Having run through some of those patterns, there are a set of common questions we tend to get about them and some examples we'd like to give of scenarios where these patterns come in handy — we'll have Pete start that off.
Peter Shrom: So one of the first questions we get is, why do we want to run locally? We should have everything automated; it should be in CI/CD. Why do you want to run it locally? There are a few reasons for that.
First, it speeds onboarding for new developers. They can easily get access to everything they need from the repo — they have everything in the container. This allows them to run the code locally, add their changes, and plan. They can keep planning to check their code as they go.
It also allows for work-in-progress development in lower environments. Maybe you're working on a huge feature, and you want to plan and deploy — and plan and deploy again — to your lower environment from your laptop. You can do that.
It also allows us to target individual resources that aren't dependent on each other. If Lew and I want to work on something at the same time, but there are completely different resources, we can do it on our laptops. We can deploy or plan and apply them separately by just targeting those resources. We're not constantly creating and destroying each other's work. That allows us to do parallel development in the same environment.
Lew Goettner: Another question we get a lot is, why bake all of your deployment logic into your makefile and your tools in this container? Why not use CI/CD-specific resources versus the CI/CD agnostic approach we have?
The biggest reason is that it makes switching CI/CD systems very simple. In about the past five years, I think we've used four different CI/CD systems — so maybe we've been burned by that in the past. When you take this approach, the requirements for running your job are very simple: the ability to run in containers of your choosing, a single secret so you can access all of your other secrets — and a place to store your plan between your plan and apply stages.
Luckily, pretty much every modern CI/CD system has these bases covered. This isn't to say we should never use CI/CD resources. Outside of core deployment, we do use CI/CD resources for things like messaging — Slack and email — and to do PR verification in GitHub. So it's not all or nothing.
Peter Shrom: Let's say your CI/CD is down, and you need to make a deploy. Maybe it's down for a few days — this happened to us. We can be comfortable deploying in that situation from our laptops — we have everything we need.
There's also the issue of one-off or infrequent events. Let's say a process dies, and your state is locked. You have everything you need on your laptop to initialize your Terraform and unlock the remote state.
You can also use this to import existing resources. Something is already running, and you want to import it. You can initialize Terraform, import that resource, and then define the code that sets it up. You can also inspect remote state from local. You want to see what's in your state file, check a few things — it’s easy to do with this setup.
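The one-off commands those scenarios call for look something like this — the lock ID and resource addresses are placeholders:

```shell
terraform force-unlock 1a2b3c4d-...        # clear a stale lock after a dead run
terraform import aws_s3_bucket.logs my-existing-bucket   # adopt a resource already running
terraform state list                       # inspect what's tracked in remote state
terraform state show aws_instance.app
```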
Finally, you can do occasional role elevation. For example, you have an elevated user that can only do certain things in your accounts. You can run that from your laptop with elevated access. Once that's done, you can hand it over to your normal user, and they can run it in your CI/CD.
Lew Goettner: We want to take this moment to thank everyone for listening. We hope these patterns either give you some validation for patterns you've developed over your years of working with Terraform, or have given you some ideas for how you can improve your processes working across employees, teams, and environments.
Peter Shrom: Thanks for listening, and I hope you guys can pick up some patterns that you can reuse in your Terraform.
Lew Goettner: And thank you very much.