Terraform workflow best practices at scale

What is the optimal Terraform workflow as you get more teams within your organization to adopt it?



Hi, my name is Armon, and today we’re going to talk about using Terraform and scaling the workflow up to tens, hundreds, even thousands of users.

I think it’s useful to first look at Terraform at small scale. If we have an individual using Terraform, the workflow is very simple. We write some Terraform code locally, then we run terraform plan to see what this is going to change and whether it does what we expect it to do. If so, great. We apply our change and then we go back into our loop of making changes.

And as a result of the Apply, a state file will get generated for Terraform to track all the resources it’s created. This is Terraforming at a very small scale, 1 or 2 people using it.
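As a rough illustration of that loop, here is a minimal configuration one person might write locally. It assumes Terraform 1.4 or later and uses the built-in terraform_data resource so no cloud credentials are needed. The resource and output names are made up for this sketch.

```hcl
# main.tf: a deliberately tiny configuration for trying the loop.
# Assumes Terraform >= 1.4, where the built-in terraform_data resource exists.
terraform {
  required_version = ">= 1.4.0"
}

# Hypothetical resource; it only stores a value in state.
resource "terraform_data" "greeting" {
  input = "hello from a one-person workflow"
}

output "greeting" {
  value = terraform_data.greeting.output
}

# The loop described above, run from this directory:
#   terraform init    # one-time setup
#   terraform plan    # preview what would change
#   terraform apply   # make the change; with the default local backend,
#                     # state is recorded in terraform.tfstate
```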

Scaling to thousands

How do we scale this up to dozens, hundreds, maybe thousands of people using Terraform? The workflow has to change pretty significantly to accommodate many more people. The first thing you’ll see is, much like you do with an application, you’re not going to have 1 super-app that represents the whole company; you’re going to break it down into many smaller services and applications that compose it.

We’ll do the same thing with Terraform. We call these units workspaces, and we’ll hierarchically decompose them. We might have a workspace that defines our core network, and then 1 that defines our shared logging service, 1 for our monitoring service, maybe 1 that has our shared databases. Maybe we have a Kubernetes cluster that gets shared between our different applications. And so we’ll start to decompose this from things like core network to shared middleware. And then our application teams live at the edge. So maybe I have App 1 that makes use of shared logging and shared monitoring, and I have App 2 that, let’s say, uses our database, and it runs on top of Kubernetes.
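One way an edge workspace can consume what a shared layer produces is to read the outputs of another workspace. The talk doesn't say which mechanism is used, so treat this as one possible sketch. It assumes the core network workspace lives in Terraform Cloud or Terraform Enterprise and publishes an output called private_subnet_ids. The organization name, workspace name, output name, and AMI ID are all hypothetical.

```hcl
# In the App 1 configuration: read outputs published by the network workspace.
data "terraform_remote_state" "network" {
  backend = "remote"

  config = {
    organization = "example-org" # hypothetical organization
    workspaces = {
      name = "core-network-prod" # hypothetical workspace
    }
  }
}

# The app places its instance in a subnet it does not manage itself.
resource "aws_instance" "app1" {
  ami           = "ami-00000000000000000" # placeholder AMI ID
  instance_type = "t3.micro"
  subnet_id     = data.terraform_remote_state.network.outputs.private_subnet_ids[0]
}
```

The app team only reads the network's outputs. Changing the network itself still happens in the network workspace.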

Decomposing infrastructure into Terraform workspaces

And so we’ll start to hierarchically decompose the infrastructure into these smaller chunks. We would usually map each chunk 1-to-1 to a smaller bit of Terraform code. So here we might have a version control repository in GitHub (let’s just call it “Terraform-net”), and we use that to manage what the network looks like. That repo is going to manage one or more of these networks. And when I say “one or more,” you might have an N:1 relationship.

Meaning there’s one set of code that defines the network, but I’m using that across development, staging, and production. So I might have 3 different networks that are being managed from the same code. Similarly, with all of these services, maybe I have a repo called “Terraform-logs” and that’s being used to define my shared logging service. And again, maybe I have one definition, but one instance for staging and one for prod. So there are multiple N:1 relationships here.
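The N:1 idea, one definition applied to several environments, might look like this in a “Terraform-net” style repo. AWS, the variable names, and the address range are assumptions for illustration only.

```hcl
# One definition of the network, parameterized per environment.
variable "environment" {
  type        = string
  description = "Which copy of the network this run manages."

  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "The environment must be dev, staging, or prod."
  }
}

variable "cidr_block" {
  type        = string
  description = "Address range for this environment's network."
}

resource "aws_vpc" "core" {
  cidr_block = var.cidr_block

  tags = {
    Name        = "core-${var.environment}"
    Environment = var.environment
  }
}

output "vpc_id" {
  value = aws_vpc.core.id
}

# Each environment supplies its own values, for example through
# per-workspace variables or a file such as prod.tfvars:
#   environment = "prod"
#   cidr_block  = "10.30.0.0/16"
```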

The first-level thing we try to do is decompose it into these smaller, bite-sized chunks so that we can have, maybe, a few thousand lines of Terraform manage one of these services, and a few thousand lines of Terraform manage our network, rather than having to have tens of thousands or hundreds of thousands of lines in one huge repository. So that becomes piece No. 1.

Align workspaces to org structure with RBAC

What this also, conveniently, lets us do is line up the management to different organizational teams. So we might say, “Only our networking team is allowed to modify the core network,” versus, “Only our core database administrators are allowed to modify the shared databases.” So this lets us align the organization structure and role-based access control (RBAC) around who should be able to modify the database service to these smaller chunks of management, instead of trying to constrain which files in 1 super-repo someone could manage.
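If the workspaces live in Terraform Cloud or Terraform Enterprise, this kind of team-to-workspace mapping can itself be written as Terraform using the hashicorp/tfe provider. The talk doesn't prescribe this, so the sketch below is an assumption about one way to do it. It also assumes the provider is already authenticated, and the organization, team, and workspace names are hypothetical.

```hcl
terraform {
  required_providers {
    tfe = {
      source = "hashicorp/tfe"
    }
  }
}

# Hypothetical teams that mirror the org structure.
resource "tfe_team" "networking" {
  name         = "networking"
  organization = "example-org"
}

resource "tfe_team" "app_developers" {
  name         = "app-developers"
  organization = "example-org"
}

# Hypothetical workspace holding the core network.
resource "tfe_workspace" "core_network" {
  name         = "core-network-prod"
  organization = "example-org"
}

# Only the networking team may modify the core network.
resource "tfe_team_access" "network_write" {
  access       = "write"
  team_id      = tfe_team.networking.id
  workspace_id = tfe_workspace.core_network.id
}

# Everyone else can just read.
resource "tfe_team_access" "developers_read" {
  access       = "read"
  team_id      = tfe_team.app_developers.id
  workspace_id = tfe_workspace.core_network.id
}
```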

Provide self-service Terraform modules

This becomes one key aspect: Decompose into smaller workspaces, and line that up with your organization’s structure in terms of different teams and role-based access control.

The other problem becomes, How do we make it easy for new app teams to onboard? So App Team 3 wants to onboard, but these are application developers, not necessarily infrastructure engineers, or DevOps people. They may not be as comfortable with cloud infrastructure, or infrastructure as code, or Terraform. So the approach to this is this notion of a module registry.

What we’ll do with our module registry is have a much smaller number of producers. These are the people who are going to be more operationally savvy. They might be our DevOps team, they might be people that are more familiar with cloud infrastructure. And what they’re going to do is publish to this registry a set of modules. They might say, “Here’s how we do a Java app, and here’s how we do a C# app, and here’s how a Mongo database or a Redis cluster gets deployed,” etc.

For each of these modules, the producer defines how that should be stood up in our environment. It might be that, within our organization, we have a special way we want to deploy Java applications, or a special way that we want to manage Redis. So we can package all of that up as a module, and then expose this to a much larger set of our consumers internally. The consumers don’t really need to know how this thing works. They treat it like a black box.

As our producer defines this module—let’s just say the Java module—we might only give 3 variable inputs. We might say, “Tell me the number of instances you want, tell me the region you want to be deployed in, and tell me the JAR name of your application.” And then inside of this might be whatever we want. It might be an autoscaling group with a set of VMs, and then we’re going to define a load balancer and a DNS record inside this black box. And then the only variable we provide out is the DNS name.
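A producer-side sketch of such a Java module might look like the following. It assumes AWS, and every concrete value in it (AMI ID, VPC and subnet IDs, hosted zone, domain, port, install path) is a placeholder that a real producer would replace. Only the three inputs and the single output come from the description above. How the region input is wired to a provider configuration is left out.

```hcl
# modules/java-app/main.tf (hypothetical layout)

# The only three inputs a consumer has to supply.
variable "instance_count" {
  type        = number
  description = "Number of instances to run."
}

variable "region" {
  type        = string
  description = "Region to deploy in. Wiring to a provider is omitted in this sketch."
}

variable "jar_name" {
  type        = string
  description = "File name of the application JAR."
}

# Producer-owned details hidden inside the black box. Placeholder values.
locals {
  app_name   = trimsuffix(var.jar_name, ".jar")
  ami_id     = "ami-00000000000000000"
  vpc_id     = "vpc-00000000"
  subnet_ids = ["subnet-00000000", "subnet-11111111"]
  zone_id    = "Z0000000000000"
  domain     = "apps.example.internal"
}

resource "aws_launch_template" "app" {
  name_prefix   = "${local.app_name}-"
  image_id      = local.ami_id
  instance_type = "t3.small"
  user_data     = base64encode("#!/bin/bash\njava -jar /opt/app/${var.jar_name}\n")
}

resource "aws_lb" "app" {
  name               = local.app_name
  load_balancer_type = "application"
  subnets            = local.subnet_ids
}

resource "aws_lb_target_group" "app" {
  name     = local.app_name
  port     = 8080
  protocol = "HTTP"
  vpc_id   = local.vpc_id
}

resource "aws_lb_listener" "http" {
  load_balancer_arn = aws_lb.app.arn
  port              = 80
  protocol          = "HTTP"

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.app.arn
  }
}

# Fixed-size autoscaling group sized by the consumer's instance_count.
resource "aws_autoscaling_group" "app" {
  min_size            = var.instance_count
  max_size            = var.instance_count
  desired_capacity    = var.instance_count
  vpc_zone_identifier = local.subnet_ids
  target_group_arns   = [aws_lb_target_group.app.arn]

  launch_template {
    id      = aws_launch_template.app.id
    version = "$Latest"
  }
}

resource "aws_route53_record" "app" {
  zone_id = local.zone_id
  name    = "${local.app_name}.${local.domain}"
  type    = "CNAME"
  ttl     = 300
  records = [aws_lb.app.dns_name]
}

# The only value handed back to the consumer.
output "dns_name" {
  value = aws_route53_record.app.fqdn
}
```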

As a consumer of this, I don’t need to be an expert in, How should I design my autoscaling group and my load balancers and my DNS? All I do is say, “Here are my inputs. I want 5 instances running in US East. Here’s my JAR file.” And what I get is the DNS name to route my traffic to.
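From the consumer's side, that could be as short as the block below. It assumes a module named java-app has been published to a private registry under a hypothetical organization, with inputs instance_count, region, and jar_name and an output dns_name. The registry address, version constraint, and JAR name are made up for illustration.

```hcl
# Consumer configuration: fill in three inputs, get a DNS name back.
module "orders_service" {
  source  = "app.terraform.io/example-org/java-app/aws" # hypothetical registry address
  version = "~> 1.0"

  instance_count = 5
  region         = "us-east-1"
  jar_name       = "orders-service.jar"
}

output "orders_service_dns_name" {
  value = module.orders_service.dns_name
}
```

Pinning a version range lets the producer ship changes to the module without every consumer picking them up unannounced.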

So I don’t have to be an expert in the underlying infrastructure. I just need to fill in the right variables for the things I care about. As a producer, I get to have control over, How’s this thing actually defined? What’s the sort of best practice of defining a Java application? And we can version that and maintain different templates for different platforms and applications and things like that.

Now as a consumer I can come in and WYSIWYG or point-and-click my way through this and say, “Great, I’ve defined my new Java app. I’m going to run it on top of my Kubernetes cluster and consume the shared database.” So it starts to move toward this model where, at this layer, from here down, we don’t need to have as much expertise in the operations of the system.

Implement policy as code to review new infrastructure code

Now, the other challenge is, when we’re operating in this mode at very small scale, we have high trust. If there’s one user, we really trust that user with those cloud credentials and what they’re doing, because there’s ultimately one user, and they’re defining the initial blueprint.

As we get to this scale with hundreds or thousands of users provisioning and managing infrastructure, our trust starts to diminish. People are less expert at it, there could be operational mistakes or security mistakes, or someone could be outright malicious, so we want to have more controls in place.

The first-level solution to this was doing this initial decomposition to these smaller units, and then having role-based access control—saying, “Only the networking team can modify the network; everyone else can just read,” and, “Only the database team can modify the database; everyone else can just read.”

That gives us 1 level of segmenting access and minimizing risk to the whole infrastructure. But the other part of this becomes, As app teams can define arbitrary templates, how do we do this safely? What you often end up seeing is an organization create a review funnel, where a developer is allowed to write some infrastructure as code. But then they submit it for a central review.

In the central review, we have some team that’s doing pull request reviews, or it’s looking at all the code and saying, “Great, we have a Word doc or a wiki that says, ‘Are you allowed to do this change?’” And they’ll say yes or no. And this process might take days or weeks to do the review before you get feedback that says, “Oops, you opened the S3 bucket; you’re not allowed to do this.”

Our approach to solving this is really almost what Terraform did, which was take that process that lives in someone’s head, or lives in a wiki, and treat it as code so now you can automate it. Do the same thing for policy. Instead of a Word doc, turn this into policy as code. And we use a framework called Sentinel to do this. And now what we can do is split the definition of the policy from the enforcement. Because once we have defined it as code, we can install it and automatically enforce this policy.

So the moment the user submits their change, it gets reviewed by the policy engine to either say yes or no. And it can come back and say, “Nope, sorry, you opened your S3 bucket to the public, to the whole world, you can’t do that; try again.” And as a user, I can make my change, submit it, and great, now I’m within compliance, I can flow through and make my change automatically, without going through a manual review process.
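Sentinel policies are written in Sentinel's own language and evaluated centrally against each plan, which isn't shown here. As a small stand-in that stays within Terraform, the sketch below uses Terraform's built-in variable validation to express the “no public S3 bucket” rule as code. This is a module-level guardrail rather than a central policy engine, and the variable and resource names are hypothetical. The point is only that a rule written as code gets checked automatically at plan time, with no human review in the loop.

```hcl
# Stand-in for a policy rule; this is not Sentinel.
variable "bucket_acl" {
  type        = string
  default     = "private"
  description = "Canned ACL for the application's S3 bucket."

  validation {
    condition     = !contains(["public-read", "public-read-write"], var.bucket_acl)
    error_message = "Public S3 bucket ACLs are not allowed. Use a private ACL."
  }
}

resource "aws_s3_bucket" "app" {
  bucket_prefix = "app-data-"
}

resource "aws_s3_bucket_acl" "app" {
  bucket = aws_s3_bucket.app.id
  acl    = var.bucket_acl
}

# Running terraform plan with bucket_acl = "public-read" fails with the
# error message above before anything is created.
```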

Review: The steps to running Terraform at scale

These are some of the pieces that are required as we try and scale the Terraform workflow. Part of it is, How do we decompose into smaller chunks? Part of it is, How do we enable users who are less expert? And part of this is also, We get these patterns that we can reuse. Even if they’re experts, we don’t want every group to reinvent the wheel in, “How do we deploy Redis?” or “How do we deploy Java?” We can have a best-practice pattern and stamp that out and have both consistency and reusability.

And then, ultimately, how do we govern some of the risks associated with this? You can either do it through a manual review, or we can start to automate some of that review through policy as code, and we get a much higher level of assurance that every line of code is being checked before it goes out the door.

These become some of the key workflow steps, such that as a development group in this late stage, I can come in and quickly self-service. I can either write my own Terraform as I need to, or pull it out of the registry if there’s an applicable pattern, and I don’t need to go through manual central review teams to get my change out into production. I can just write it. If I pass my automated tests, I can self-service and go make these changes.

I hope you liked this video on scaling Terraform. If you’re interested in learning more about Terraform the product, as well as how to use it at scale, I’d recommend going to hashicorp.com and going to the Terraform product page as well as checking out our resources online around Terraform best practices and tips and tricks. All of those are on hashicorp.com.
