HashiCorp Terraform Adoption: A Typical Journey

As individuals start to use HashiCorp Terraform at an organization, how do they manage the tool as more people, more teams, and eventually the whole IT department want to use it? Armon Dadgar shares the four stages of growth that Terraform use cases typically go through.

In this video, HashiCorp co-founder and CTO Armon Dadgar gives a brief, whiteboard overview of a typical Terraform adoption journey.

  • A HashiCorp Terraform deployment usually starts with an individual practitioner who writes a Terraform configuration file ("infrastructure as code"), iterates until the plan is correct, then applies the plan. As needs change, they modify the configuration file and repeat the plan-and-apply process.

  • If a team, rather than just one individual, is collaborating on Terraform, the process is basically the same, but they should use a version control system to provide a single source of truth.

  • If there are several teams, each responsible for a different part of the infrastructure, Terraform's config files can be decomposed into separate Workspaces, each of which can have role-based access control.

  • In some organizations, there are many users, most of whom are not trained on Terraform, and it wouldn't be practical to train them all. One common pattern is to have a few publishers and a large number of consumers all working against a central registry. Organizations can also use HashiCorp Sentinel to define and maintain a sandbox, which polices what consumers can and cannot do ("policy as code").

Transcript

Hi. My name is Armon Dadgar, and today I wanted to spend some time talking about the stages of Terraform adoption. Oftentimes we're asked by new users of Terraform, "How do I take it from an individual using it to a team, to many teams, to an organization?" And so I thought it would be helpful to share some of the patterns we see, among both our large-scale users and our customers, of how they go through the journey and the multiple stages of Terraform adoption.

When we talk about Terraform, where it usually starts is with an individual practitioner. So we have an individual contributor and they have a very tight loop where what they do is they write infrastructure as code in the form of Terraform configuration. This is a Terraform config, and then the workflow is very prescriptive with Terraform. You write your configuration and then you run a Terraform plan locally that shows you:

  • What Terraform is expecting to do,
  • How it actually needs to modify your infrastructure, in terms of creating, modifying, or destroying things, and
  • What changes you should validate before Terraform makes them.

Then once you're happy, you've inspected and validated it, you actually make those changes. You apply them and make the change to your infrastructure—create, modify, destroy various infrastructure—and then much like an application, your definition of your infrastructure is a living, breathing thing.

Today you deploy some set of infrastructure, tomorrow you decide to extend it or scale it up or scale it down. And so you go back to the beginning and modify its configuration again. So we go back to the origin, change our configuration and again flow through our plan-and-apply cycle.
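
The plan-and-apply loop described above can be illustrated with a small Python sketch. This is not how Terraform itself is implemented; it is a toy model that assumes infrastructure can be represented as a dictionary mapping resource names to attributes. The function names `plan` and `apply` and the resource names are hypothetical, chosen only to mirror the workflow.

```python
from typing import Dict, List, Tuple

# Toy model: infrastructure is a mapping of resource name -> attributes.
State = Dict[str, dict]
Action = Tuple[str, str]  # (verb, resource name)


def plan(desired: State, current: State) -> List[Action]:
    """Compare the desired configuration with the current state and list the changes needed."""
    actions: List[Action] = []
    for name in sorted(desired):
        if name not in current:
            actions.append(("create", name))
        elif desired[name] != current[name]:
            actions.append(("modify", name))
    for name in sorted(current):
        if name not in desired:
            actions.append(("destroy", name))
    return actions


def apply(desired: State, current: State) -> State:
    """Carry out the planned actions and return the new state."""
    new_state = dict(current)
    for verb, name in plan(desired, current):
        if verb == "destroy":
            del new_state[name]
        else:  # create or modify
            new_state[name] = dict(desired[name])
    return new_state


current = {"web": {"count": 2}, "old_queue": {"size": 1}}
desired = {"web": {"count": 3}, "db": {"engine": "postgres"}}

print(plan(desired, current))   # inspect the proposed changes first
current = apply(desired, current)
print(plan(desired, current))   # nothing left to change after apply
```

Editing `desired` and running `plan` again corresponds to going back to the start of the loop: the configuration changes, the plan shows the difference, and apply brings the state in line.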

This is what it looks like as an individual. Now what happens as we start to go to a team?

The moment we go to a team, we introduce a set of collaboration challenges. Specifically, we're now modifying a single definition of what this infrastructure should look like, but we have multiple people doing it. So how do we actually ensure that we have a consistent definition of our infrastructure, and that we're making these changes safely—that we're not stepping on each other's toes? And so the flow changes a little bit.

It still starts with practitioners who are locally writing or modifying this configuration. They might even be running a plan locally to determine are these changes safe? But now, instead of just applying it ourselves, we add an additional step: We now modify and commit to a version control system. This could be GitHub or Bitbucket or GitLab or our favorite version control system.

The key is that the version control system gives us a single source of truth. Although we may have multiple people editing this configuration at the same time, there can only be a single copy of the master, which is what we use to drive our configuration. Now that we have this central source of truth, we can trigger off of it to apply Terraform.

This way, we get a central definition of what our configuration is. There's only one change being made at a time, rather than multiple changes in parallel that might step on each other.

Once we've applied our change, we come back and restart our loop. The things we need to do at this phase are to use a version control system, so that the configuration stays consistent, and then to drive the application of any Terraform changes off of that.
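
As a rough illustration of "only one change being made at a time", the following Python sketch serializes changes through a lock. It is an assumption-laden toy: the `MainBranch` class and its method are hypothetical and stand in for a version control system plus whatever process applies Terraform from it.

```python
import threading
from typing import List


class MainBranch:
    """Toy stand-in for the version-controlled source of truth."""

    def __init__(self) -> None:
        self._lock = threading.Lock()   # only one apply may run at a time
        self.commits: List[str] = []    # history of accepted changes
        self.applied: List[str] = []    # changes applied to infrastructure, in order

    def commit_and_apply(self, change: str) -> None:
        # Holding the lock makes commit + apply a single sequential step.
        with self._lock:
            self.commits.append(change)
            self.applied.append(change)


main = MainBranch()
threads = [
    threading.Thread(target=main.commit_and_apply, args=(f"change-{i}",))
    for i in range(5)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

# The order applied to infrastructure matches the commit history exactly.
print(main.applied == main.commits)
```

Several people submit changes concurrently, but because every change passes through the same lock, the infrastructure sees them in one well-defined order.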

As we go from a single team to multiple teams, what changes is now we have a decomposition we need to do. Over here what we really had was a single set of Terraform configurations that defined all of our infrastructure.

But as we go to many teams this starts to become impractical. There's too much coordination required and the configuration becomes overly complex.

Instead, what we'd like to do is hierarchically decompose our infrastructure. We might say we have one team that focuses on the underlying network topology and cloud configuration.

Then we might have a series of middleware—a middleware tier. This might look at different things:

  • We might have a central solution for logging,
  • We might have a central monitoring solution,
  • Maybe we have security appliances that we share between applications.

This is the underpinning shared infrastructure that all of our applications are going to consume. You don't want every application team to reinvent how logging is done.

Then at the edge, you have your actual application teams. We have app1 and app2 as an example. These applications are not defining any of these other components: They're not defining the networks, they're not defining our monitoring, they're simply consumers of it.

As an application I might consume some subset of this middleware. What we've done is, on the whole, we're composing these pieces and building our application and our infrastructure. Our overall infrastructure together is all of these pieces, but we're managing it in these smaller units. Terraform calls each of these a workspace. So what we want to do is decompose this larger infrastructure into a series of workspaces and then compose them together into a larger infrastructure.

Now as we get to multiple teams, we might not want to be in a situation where the application team could come in and change the definition of our network topology and just deploy a change. On top of just separating it into independent workspaces, we want to tie this back to role-based access control. So, say, the networking team is allowed to actually modify how the networking topology works, and maybe our logging team is allowed to modify what our logging middleware looks like, but these other teams, they're all consumers of it.

My application has to be deployed into a network, so I need to know what the network looks like: I need to know what's my Amazon VPC, what's my subnet. I'm allowed to give read access to these workspaces to other parts of my organization, but I want to restrict write access or the ability to make changes to the groups that should be owning and managing these individual pieces.
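
The read-versus-write split described here can be sketched in Python. This is a minimal model, not Terraform's actual access-control API: the `Workspace` class, the team names, and the `vpc-example` value are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class Workspace:
    name: str
    outputs: Dict[str, str] = field(default_factory=dict)
    readers: Set[str] = field(default_factory=set)
    writers: Set[str] = field(default_factory=set)

    def read_output(self, team: str, key: str) -> str:
        # Writers can also read; everyone else needs explicit read access.
        if team not in self.readers and team not in self.writers:
            raise PermissionError(f"{team} may not read {self.name}")
        return self.outputs[key]

    def write_output(self, team: str, key: str, value: str) -> None:
        # Only the owning team(s) may change the workspace.
        if team not in self.writers:
            raise PermissionError(f"{team} may not modify {self.name}")
        self.outputs[key] = value


network = Workspace("network", readers={"app1-team"}, writers={"network-team"})
network.write_output("network-team", "vpc_id", "vpc-example")

# The application team can look up the VPC it deploys into...
print(network.read_output("app1-team", "vpc_id"))

# ...but cannot change the network definition.
try:
    network.write_output("app1-team", "vpc_id", "vpc-other")
except PermissionError as err:
    print(err)
```

The application team gets the information it needs to deploy, while ownership of the network stays with the networking team.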

Then when we get to the independent teams—as an application team I can consume all of these pieces without having to talk to them. I can just consume the network, consume logging, consume monitoring, build and deploy and manage my application.

Again I might say, "My team maintains both the ability to read and write," but I have a downstream application, app3, that might want to consume me and again build out the infrastructure this way.

As we go to multiple teams it's really about how do we allow the teams to work independently of each other, but doing that safely and without exposing ourselves to everyone being able to make any sort of change.

What happens when we start to go even bigger, from multiple teams using it to an organizational-level deployment? At an org level, there's a different set of governance challenges. Along with that, there's also the challenge of how we let more people be productive. At the multi-team phase, most of the people consuming this are probably still familiar with Terraform, whereas across a full organization it's less likely that everyone is Terraform enabled, or that you want to make the investment to train everyone.

There are two answers here.

The first answer is this common pattern around publishers and consumers. So what we'll start with is a limited number of publishers. Now what the publishers do is push into a central registry modules that basically describe how to deploy different types of infrastructure. We might have a module that says:

  • Here's how we deploy a Java app,
  • Here's how we deploy a C# app,
  • Here's how we deploy a database.

These publishers are pushing into this registry a definition of how we actually manage this stuff, where each module is really packaging up an opinionated set of Terraform configuration. Now a much larger set of consumers can pull these modules. And these consumers don't have to be intimately aware of Terraform or what our pattern is. They come in and say, "I have a Java application. Here's my .jar file. I want three of them and deploy this to Amazon."

We might give them a few knobs that people are allowed to tweak, but otherwise abstract all of the other complexity related to it. This lets us scale up to a much larger number of consumers without really having to train and enable them how to write infrastructure as code, or become cloud experts.
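
The publisher and consumer pattern can be sketched in Python as a registry of functions that expose only a few inputs. This is an analogy rather than Terraform's module syntax; the `registry`, `publish`, and `java_app` names, and the resource fields, are hypothetical.

```python
from typing import Callable, Dict, List

# A "module" here is a function that expands a few inputs into full resource definitions.
Module = Callable[..., List[dict]]
registry: Dict[str, Module] = {}


def publish(name: str, module: Module) -> None:
    """Publishers push an opinionated definition into the central registry."""
    registry[name] = module


def java_app(jar: str, instances: int = 1, cloud: str = "aws") -> List[dict]:
    """Consumers only choose the jar, the instance count, and the cloud."""
    return [
        {
            "type": "vm",
            "name": f"java-app-{i}",
            "cloud": cloud,
            "artifact": jar,
            # Details the consumer never sees or sets (illustrative values):
            "monitoring": True,
            "public_ip": False,
        }
        for i in range(instances)
    ]


publish("java-app", java_app)

# A consumer: "Here's my .jar file. I want three of them, deployed to Amazon."
resources = registry["java-app"](jar="my-service.jar", instances=3, cloud="aws")
print(len(resources))
```

The three parameters are the "knobs"; everything else is decided once by the publisher and reused by every consumer.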

The other challenge becomes how do we allow this many consumers—this many people interacting—to do so safely.

We don't want people to open up the firewall and allow all traffic to come in, or configure their S3 bucket so that anyone in the world can read from it and expose our data. What we'd like to be able to do is define a sandbox and say, "You're allowed to do any type of infrastructure change you want inside of a sandbox."

Within the sandbox, maybe you're using our pre-approved Java module, or maybe you know how to use Terraform and you've written your own custom thing. Either way, as long as you stay within the sandbox, you're allowed to do so. You don't have to file a ticket, or have a security team check and make sure this infrastructure is valid. You can make your change, submit it, and go on with your day.

But what we'd like to have happen is, if you try and deploy something that steps outside of the sandbox—deploying to the wrong region or bypassing security controls—we'd like the system to prevent this.

Our project to focus on this is something we call Sentinel. The idea behind Sentinel is, how do we capture policy as code?

When we talk about policy, this could be things like:

  • Staging always deploys to the east region,
  • Production always deploys to the west region,
  • Our firewall rules must never allow traffic from the entire internet.

We can capture that as a different set of Sentinel policies—which themselves are code—to basically describe what our policy looks like.

This policy now defines the sandbox. Changes that do not violate the policy are on the inside, whereas changes that hit the sandbox limit (that violate the policy) get rejected.

In a system like Terraform Enterprise, we would insert this Sentinel policy, and the system automatically enforces it on any change that comes through.
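
To make "policy as code" concrete, here is a small Python sketch of the three example policies mentioned above. It is written in Python rather than the Sentinel language, and it does not reflect Sentinel's syntax or how Terraform Enterprise evaluates policies. The field names (`environment`, `region`, `allowed_cidrs`) and the region labels are assumptions made for illustration.

```python
from typing import Any, Callable, Dict, List

Change = Dict[str, Any]
Policy = Callable[[Change], bool]  # returns True when the change is allowed


def staging_in_east(change: Change) -> bool:
    return change.get("environment") != "staging" or change.get("region") == "east"


def production_in_west(change: Change) -> bool:
    return change.get("environment") != "production" or change.get("region") == "west"


def no_open_firewall(change: Change) -> bool:
    # 0.0.0.0/0 means "the entire internet".
    return "0.0.0.0/0" not in change.get("allowed_cidrs", [])


POLICIES: Dict[str, Policy] = {
    "staging_in_east": staging_in_east,
    "production_in_west": production_in_west,
    "no_open_firewall": no_open_firewall,
}


def violations(change: Change) -> List[str]:
    """Return the names of every policy the change violates (empty means inside the sandbox)."""
    return [name for name, policy in POLICIES.items() if not policy(change)]


inside = {"environment": "staging", "region": "east", "allowed_cidrs": ["10.0.0.0/8"]}
outside = {"environment": "production", "region": "east", "allowed_cidrs": ["0.0.0.0/0"]}

print(violations(inside))
print(violations(outside))
```

A change with no violations goes through without a ticket or manual review; a change with any violation is rejected automatically.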

Taking a step back, what we really see is this gradual adoption curve:

  1. It starts with an individual who has a tight feedback loop and doesn't have a collaboration problem. They're able to do their own plan-and-apply cycle locally, without worrying about coordinating.

  2. The moment we start bringing multiple people on board, we need to make sure that there's a single source of truth for Terraform's configuration itself, and that changes are applied sequentially so that people aren't stepping on each other. This requires some slight changes: using version control, and tying it back into a system that applies changes sequentially and manages state for us.

  3. Then as we go even bigger, it's really about decomposition from a single monolithic configuration into many smaller configurations that we compose together—and then tying that to role-based access control, to do it safely.

  4. The final piece at an org level is making it easy for many different consumers, by introducing the notion of a registry, of pre-approved modules—a service catalog—as well as governing what's actually acceptable and restricting it through policy. Because if we don't do it through policy, oftentimes we create a ticketing queue where all the changes are reviewed manually and we lose the efficiency that we've gained in the agile self-service infrastructure.

Hopefully this was helpful in understanding the adoption journey as you use Terraform and go through these stages. You'll find more resources online—in terms of using Terraform and getting deeper with it.

If you have some of these challenges of collaboration and governance I'd encourage you to check out Terraform Enterprise. Thank you so much.
