Terraform open source is great for individuals or small teams, but once you get larger teams, or many teams, things get a lot harder.
If you're looking for a guide on how to transition from Terraform OSS to Terraform Enterprise, watch this demo.
The question we’re often asked is, “If I’m using Terraform, what’s the value of using Terraform Enterprise? Why should I consider it?” And I think what’s helpful is really looking at: What is the adoption journey as we go from an individual to many people using Terraform?
When we talk about an individual using Terraform, their pattern is: I locally write some Terraform, then I do a plan operation, where I can see what this is going to change and validate that my change makes sense. Then I do a local apply to make those changes. Then I continue on this loop. Much like writing software, it’s that iterative process of write, test, apply.
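To make that loop concrete, here is a minimal sketch of what an individual might write locally. The resource name and input value are made up for illustration, and it assumes Terraform 1.4 or later, where the built-in terraform_data resource is available, so no cloud provider is needed to try it.

```hcl
# main.tf: a hypothetical, provider-free configuration for trying the loop locally.
terraform {
  required_version = ">= 1.4"
}

# terraform_data is built in to Terraform 1.4+; the name "example" is illustrative.
resource "terraform_data" "example" {
  input = "hello"
}

output "example_value" {
  value = terraform_data.example.output
}

# The loop described above, run from the directory holding this file:
#   terraform init    # one-time setup of the working directory
#   terraform plan    # preview what would change
#   terraform apply   # make the change, then edit the file and repeat
```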
Now, as I go from one individual to multiple people, a team of people now trying to use Terraform, I have a few new challenges. I still locally write Terraform. I run plan to validate my change, but now what I don’t want to have happen is multiple people applying at the same time on top of each other. If multiple people are running at the same time, we get divergence. They’ll step on each other. It can cause corruption of the state file. We need to make sure there’s only a linear application of Terraform, one at a time.
The common approach to solving that is that I then push my config into a version-control system (Git, Bitbucket, something else). What that buys me is that now there’s one source of truth for what this configuration is. It doesn’t diverge, where Alice has one version, Bob has a second, and Charlie has a third. They have a consistent view of “What is the current definition of infrastructure?”
Once I have that, I can use it to drive application of Terraform one at a time, so they don’t step on each other, and then locally manage the state files.
This quadrant is where we’re introducing challenges around “How do we collaborate?” As a team of people, multiple users of Terraform, what we need to do is each manage parts of the infrastructure but without stepping on each other’s toes, because if we do step on each other’s toes, we’re going to cause issues for ourselves. This is the core of the collaboration challenges: How do we have a consistent definition? How do we apply one at a time and not step on each other, and how do we make sure we’re managing the state file in a consistent way so that it doesn’t get corrupted?
This is the first point at which Terraform Enterprise really comes in, and it’s looking at the challenge of: How do I make a team of people productive without forcing them to first come up with a workflow and solve how they do collaboration?
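As a rough sketch of what that looks like in a configuration, a team might point its Terraform at a Terraform Enterprise workspace with the remote backend, so that state lives in one place and runs for that workspace are queued one at a time instead of happening on each person’s laptop. The hostname, organization, and workspace name here are placeholders, not anything from a real installation.

```hcl
# backend.tf: hypothetical values throughout; substitute your own installation's.
terraform {
  backend "remote" {
    hostname     = "tfe.example.com" # assumed Terraform Enterprise hostname
    organization = "example-org"     # assumed organization name

    workspaces {
      name = "network-prod" # assumed workspace name
    }
  }
}
```

With something like this committed next to the rest of the configuration, everyone on the team shares the same state rather than each keeping a local copy.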
What happens next is I go from a small team, two to eight people, to teams of teams. It’s the next scale of usage for Terraform. Now I have multiple teams that are all trying to write Terraform.
Just like we would do for an application, we don’t want to have one mega Git repo, with every application loaded into one huge repository. We’ll create one repository per application or per project or per service. We’ll do a similar thing with Terraform, which is we’ll start to decompose the infrastructure into multiple pieces. Say I have a core networking team, then shared middleware, logs, monitoring, database. And then our end application teams, App 1 and App 2, are consumers of these; they consume logs, consume the database, consume monitoring.
What we’ve done is decompose the infrastructure so that many different teams can all work in parallel, just like we would do for an application—have many applications so that they’re all being developed in parallel.
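One way an application team’s configuration might consume what another team manages, without being able to redefine it, is to read the other workspace’s outputs. This is only a sketch: the workspace name and the private_subnet_id output are assumptions about what a networking workspace might expose.

```hcl
# Read-only view of the outputs published by a hypothetical networking workspace.
data "terraform_remote_state" "network" {
  backend = "remote"

  config = {
    hostname     = "tfe.example.com" # assumed Terraform Enterprise hostname
    organization = "example-org"     # assumed organization name
    workspaces = {
      name = "network-prod" # assumed workspace owned by the networking team
    }
  }
}

locals {
  # "private_subnet_id" is a hypothetical output; the networking team decides
  # what it actually exposes to consumers.
  app_subnet_id = data.terraform_remote_state.network.outputs.private_subnet_id
}
```

The app configuration can then use local.app_subnet_id wherever it needs a subnet, while the definition of the network itself stays in the networking team’s workspace.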
The challenge is: How do we do this safely? How do we manage the risk? In particular, what I’d like to be able to say is that my networking team, they’re the ones that are allowed to change the network stuff initially. Everyone else, they’re allowed to view it, consume it, interact with the network, but they can’t redefine it. Similarly, my database team, they should be able to manage the database, consume the network, and expose it to the app team, but as an app team, I shouldn’t be able to modify the database.
What we want to be able to do is have this notion of role-based access control. This is classic RBAC. We’re creating different teams, tying them to what Terraform calls a workspace. Then we’re composing multiple workspaces together to build a larger application.
What Terraform Enterprise gives us is this ability to define different teams, define multiple workspaces, have the permissioning of which teams are allowed to do what, tie all of that back to our single sign-on experience, so we don’t have to re-create users just for Terraform Enterprise. This starts to let us say, “How do we get multiple teams productive in this environment, but do it in a way that we’re managing our risks?” We’re not letting everyone modify everything in that environment.
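These teams, workspaces, and permissions can themselves be described in Terraform using the hashicorp/tfe provider. The following is a sketch under assumptions: the hostname, organization, team names, and workspace name are all invented, and the API token is expected to come from the TFE_TOKEN environment variable.

```hcl
terraform {
  required_providers {
    tfe = {
      source = "hashicorp/tfe"
    }
  }
}

provider "tfe" {
  hostname = "tfe.example.com" # assumed hostname; token is read from TFE_TOKEN
}

# Hypothetical teams.
resource "tfe_team" "networking" {
  name         = "networking"
  organization = "example-org"
}

resource "tfe_team" "app1" {
  name         = "app1"
  organization = "example-org"
}

# Hypothetical workspace holding the network definition.
resource "tfe_workspace" "network" {
  name         = "network-prod"
  organization = "example-org"
}

# The networking team can change the network workspace...
resource "tfe_team_access" "network_write" {
  access       = "write"
  team_id      = tfe_team.networking.id
  workspace_id = tfe_workspace.network.id
}

# ...while the app team can only view it.
resource "tfe_team_access" "app1_read" {
  access       = "read"
  team_id      = tfe_team.app1.id
  workspace_id = tfe_workspace.network.id
}
```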
As we go even larger, we start to say, “We want end application teams consuming it, or we want the whole organization consuming it.” Then we end up with a different set of challenges, which is: Not everyone is a Terraform expert.
How we deal with this is by using what we call a registry and a producer-consumer model. I have a small set of producers. These tend to be more the Terraform experts, and what they’re going to do is publish a set of modules: a Java module, a C# module, a database module, so on and so forth. And these are going to get consumed by our many different consumers.
In this example, what we might have is: The producers are classic IT operations DevOps folks. They’re familiar with the cloud and how all of our infrastructure works. Whereas consumers might be more of our application teams. They’re less familiar with the nuances, with how the infrastructure works. They just want to come in and say, “It’s a Java app. I don’t really care how it’s running.”
What this lets them do is come in and point and click and say, “I need a Java app, and here are the three variables I have.” I specify what’s my JAR, how many do I want, and what region to deploy it to. Then the rest of it is a black box for these consumers. They just point and click, but under the hood what’s happening is we’re templating Terraform, which is then going to execute to bring up and deploy their Java-based application.
What this lets us do is expose this to a lot of consumers in a way that we don’t have to train all of them, in a way that they don’t have to be experts in how it works. They can just point and click and get to what they’re trying to achieve. This really helps us gain agility for a broader section of the organization who are not experts in cloud or infrastructure.
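The templated Terraform might look something like the following sketch. The module address follows the private registry form of hostname/organization/name/provider, but the java-app module itself and its three input names are hypothetical, standing in for whatever the producers actually publish.

```hcl
# A consumer's entire configuration: one module call with three inputs.
module "java_app" {
  # Hypothetical module published by the producers to a private registry.
  source  = "tfe.example.com/example-org/java-app/aws"
  version = "~> 1.0"

  # The three variables from the example: which JAR, how many, and where.
  # These input names are assumptions; the real module defines its own.
  jar_url        = "https://artifacts.example.com/my-app.jar"
  instance_count = 3
  region         = "us-east-1"
}
```

Everything else, such as load balancers, networking, and instance configuration, stays inside the module, which is what keeps it a black box for the consumer.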
The final challenge is, How do we do all of this safely? You know, I trust the first user, I trust the second user, I trust the first 50 users a little bit less, and I certainly don’t trust the next 500 users. What we often see is we create this ticketing pipeline. As an end user, you can write Terraform, but then you’re going to submit it to a review queue. What we’re basically going to do is bottleneck all of the changes in our organization through a single ticketing queue where we’re manually reviewing all of this.
Traditionally, this is a central group of teams that are looking at a Word doc of a policy, let’s just say, of: Did you set your S3 bucket to public? Did you open the firewall to the whole internet? Did you ask for a thousand instances? All that good stuff. Then if you’re allowed, great, you can go through. Otherwise, try again. This often ends up breaking a lot of the agility we gained as we went through this, because now we have this manual review process.
Part of our goal is: How do we automate this with policy as code? With policy as code we capture the same set of policy checks but in an automated, codified way, and within Terraform Enterprise we can install these rules so this becomes an automated checkpoint. As an application team or a middleware team, I can submit my changes, and as long as I’m within policy, the system automatically approves it and I’m allowed to go through. If I’m outside of policy, it’ll reject it, and I’ll either get manual review or I need to change my configuration, try again, and then I’m allowed to go through.
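In Terraform Enterprise the policy framework is Sentinel, and a set of policies is typically described by a sentinel.hcl file. The sketch below assumes two policy files that are not shown here; their names and the choice of enforcement levels are illustrative only.

```hcl
# sentinel.hcl: a hypothetical policy set configuration.

# hard-mandatory: a failing run is rejected and cannot be overridden,
# so the configuration has to change.
policy "no-public-s3-buckets" {
  source            = "./no-public-s3-buckets.sentinel"
  enforcement_level = "hard-mandatory"
}

# soft-mandatory: a failing run stops until someone with the right
# permission reviews it and overrides, which is the manual review path.
policy "limit-instance-count" {
  source            = "./limit-instance-count.sentinel"
  enforcement_level = "soft-mandatory"
}
```

The split between the two levels is one way to keep the manual queue for the exceptions only, while everything within policy goes straight through.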
This is really looking at: As we scale up Terraform, how do we do it in a way that we’re controlling risk? We don’t let people do whatever they want and get the keys to the kingdom, where they can ask for a thousand VMs. But at the same time, how do we do it in a way that’s still productive? We’re not putting someone in a queue and making them wait four weeks. We’re doing it in an automated way to maintain the agility and self-service of Terraform.
As we talk about Terraform Enterprise, that’s really the goal. How do we solve the collaboration problem at a smaller scale? How do we start to think about role-based access control and decomposition of the problem at a larger scale? How do we onboard users who don’t know much about cloud, infrastructure, or Terraform, but make them productive? How do we do it in a way that we’re managing risk without putting a ticketing queue in front of everything?
That’s really the value of Terraform Enterprise, as we grow our usage. It’s enabling us to solve these problems so that we still get those benefits of Terraform as IaC (infrastructure as code), but we can do it at an organizational scale.