Terraform has gained a popular foothold among the Global 2000 for managing cloud infrastructure. But why is it so popular? And how does it scale from a few users to a large-scale enterprise?
In this talk, Clint Shryock, the lead Engineer for Terraform enablement at HashiCorp, introduces Terraform and explores the four common stages organizations normally go through when scaling their Terraform usage: from a few grassroots users to a company or enterprise-wide adoption.
In the process, you'll see why Terraform is so great for reducing operational complexity in distributed infrastructure environments.
Hi, my name is Clint. Says so right there. I'm going to talk to you today about scaling with Terraform, and the idea there is the journey from a startup to a large enterprise—not exclusively that order—but it's really all about how you adopt Terraform, how you bring it into your organization, whether you have infrastructure or not.
So, I got four parts: the introduction, what Terraform is, the four stages of adoption, and what comes next.
So, well, introduction was done. That was easy. The other ones will be a bit longer.
So what is Terraform? Are we using Terraform, everyone? Everyone know what Terraform is? People are still eating lunch. Okay.
So to understand what Terraform is, we need to understand HashiCorp's mission. We want cloud infrastructure automation, consistent workflows to provision, secure, connect, and run any infrastructure for any application. Says so right there.
To do that, we develop a suite of tools:
And they're roughly divided into kind of two groups, loosely categorized as provisioning tools on the left and runtime tools on the right. So Vagrant, Packer, Terraform are in the provisioning land. That's where Terraform lives.
Terraform's mission statement is to write, plan, and create infrastructure as code. With Terraform, we write declarative configuration with a configuration language called HCL—stands for HashiCorp Configuration Language—and we use that configuration language to generate plans to apply and modify infrastructure, and then we can safely apply those plans.
To do that, HashiCorp engineers and a very, very generous open-source community have created numerous providers. This is a very small subset of them, but we have infrastructure providers, software as a service providers, platform providers, bare metal providers, all sorts of providers. We have over 70 official ones that we shepherd and are automatically usable by Terraform.
Many of them are maintained by contributors. There's a couple maintainers here that I've had the privilege of meeting. Thank you very much.
All of these are distinct and uniquely implemented. We don't make generic resources like instances that apply to multiple clouds because that gives us a lowest common denominator. Instead, we have custom resources for every cloud to fully utilize them, and as I mentioned, we can easily compose these things across infrastructure and platforms.
So with Terraform, you get a single consistent workflow to manage multiple clouds and services.
All right, what is Terraform, that's done. Moving right along here.
So we're going to move onto the next part, the four stages of adoption. Now, there are four of them, says right there. Again, some of this is whether you're a new startup and you have no infrastructure, or you could be an enterprise and you have infrastructure, but you're looking to adopt Terraform to bring some sanity or just consistency in managing it, so some of my examples might be catered just to startups, or middle, or enterprise areas.
So we're going to look at the first stage here, the manual stage. At the manual stage, we're using a lot of web consoles. We're doing everything by hand.
If we're using two services and we're connecting them, say a CDN fronting instances behind a load balancer, that's two different web consoles, maybe even multiple service consoles inside one of them, and we're configuring everything manually. We might have bash scripts or use some CLIs, but there's not a lot of consistency.
We have single environments. You probably just have a production environment, or maybe you have multiple environments, but they're probably not what you think they are. They might be staging and development and production, but you probably don't have parity there. They're more like environmental siblings. On the outside they appear to be the same and they have a common ancestry, but like any siblings, if you get to know them, they're going to be very, very different on the inside.
You have mutable infrastructure, probably. That's pet servers, things where we created a virtual machine or we have a physical machine, and we install Nginx or some service on it, and before we're done we give it a name and we become emotionally attached to it. We want to keep that thing running. We care about it. You can't change that server. It's very important.
And then lastly, we have infrastructure ... No, no, that's not right. We have ops.txt—it's a text file that has all of your information in it, IP addresses, configurations, things you've installed, how to install them.
(It's only funny because I've done this, and if you saw anybody around you laugh or chuckle, they've done it, too. The only thing worse than ops.txt is just having everything in your head, where it's not actually written down.)
So how are we using Terraform at this stage? We're not. There is no Terraform in the manual stage. It is not compatible with this. But why is this even a stage? Why do I even bring this up? Because not using any Terraform is the first step to using a lot of Terraform.
So what are the challenges that we see in this stage? Well, we see similar challenges across many stages: technical challenges like reproducibility, change management, architecture. Then we have organizational things like auditing, consistency, and knowledge sharing.
Reproducibility: can we remake our environment? If something catastrophic happens, can we get back to where we are?
Architecture: what do we even have? Like, what components actually make up our application? I don't know if anybody else is like me and they've tried to duplicate an environment only to discover the performance is terrible because you actually had a Redis cache in production, and in staging you forgot you had no Redis. I've done that.
Change management: how are these changes happening? How are we growing our infrastructure? Is one operator using the web console while another one uses a CLI?
Auditing: who did what? When? Why did they do these things? We have no idea.
Consistency: environments and things might not be named the same. We have 'prod' something but 'production' other thing. We have pet server names. We don't know where these things belong.
Knowledge sharing: if you're big enough to have multiple teams, how are other teams doing things? How do you find out what other teams are doing? If you're by yourself, you're a startup or you're starting to grow and you bring somebody on, how do you transfer all that information? Is it all in your head, or are you giving them that amazing ops.txt file?
So if I had to rate all these, I would put all of these in the red. None of these challenges at this stage are being met, or if they are, they're probably not being met in an ideal way.
Now, in theory, you could set up an infrastructure, it'd be perfect, never changes, and you're done. You and your ops.txt file are great. But in reality, we're growing, we're changing, things are evolving, and at this stage all of that becomes very painful. So that's when you know you need to scale up there.
So that's the first manual stage. We are done with that. We all agree, I'm sure, that it's not very ideal. So we're going to move on to...
In the semi-automated stage, we introduce infrastructure as code, which is just a plain definition that says, "process of managing and provisioning infrastructure through machine readable files".
We're going to introduce machine images likely. That's Docker containers, Lambda functions, machine images, all the great things that Packer—another provisioning tool—makes. We're going to stop manually installing system files and configuring things. We're going to configure it once, in an automated way, and then we're going to duplicate that through automation. We probably still have web consoles, though.
At this stage, if you're a startup, you're starting new, you're learning Terraform, not everything is going to be put in Terraform right away. Same thing with an enterprise setup. If you're adopting Terraform, you're not just going to throw your entire infrastructure in there quite yet. You got to take it slow, do a greenfield project, import some things. There's going to be other things out there.
But most importantly at this stage, we introduce Terraform. We start using configuration, we have automation, and we have iteration. All of these things are very exciting! (You can tell by the exclamation points.)
So at this stage, the two main features of Terraform we're using are:
So for modeling, here's an example configuration file, if you have not seen HCL or Terraform code, this is an example configuration. We've got three things here: a network, a node, a firewall. To model these things, we use HCL, which is a human-friendly language, but it can also be machine friendly too. You can use JSON back and forth. And we declare our resources, and we set the values for their properties. We tell them what they are. Where am I? Okay.
It has very powerful language constructs. Two very simple ones explained here are the 'count', which is just to say I don't want to define this three times. I just want three of these things. That's like the most simple of all the powerful things HCL can do. And then the next common thing you'll see are interpolations, so we take the output of one of our values and use it as the input to another one. So we say these things are related, and we do that through interpolation. In this example, we say our network ID is the ID of the resource at the top, and I want all of the machine IDs from the nodes to be in my machine ID list. So all of this configuration is read by Terraform and used to make a graph.
Now, I've talked several times about Terraform, and I've spent a lot of time talking about the (dependency) graph. I'm not going to talk about the graph here. The graph isn't really the point of this talk. The point here is that we're declaring infrastructure in files, and Terraform can read that and reason about what your infrastructure looks like. In doing that, it knows dependencies, order of operations, and it can parallelize things.
So those three nodes can be created at the same time as that network because they're not related. It's the firewall that needs to wait for those other things to be created before it can use them. Terraform understands the topology and understands how things are related, and that ties into your re-usability because it knows how to get to that point.
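The ordering described here can be sketched outside Terraform. This is only an illustration in Python, not how Terraform is implemented: it assumes the firewall depends on the network and on all three nodes, and it uses the standard library's `graphlib` to group resources into batches that could be created in parallel. The resource names and the `creation_batches` function are made up for the example.

```python
from graphlib import TopologicalSorter

# Hypothetical resources from the example: each key maps to the set of
# resources it depends on. Only the firewall depends on anything.
DEPENDENCIES = {
    "network": set(),
    "node.0": set(),
    "node.1": set(),
    "node.2": set(),
    "firewall": {"network", "node.0", "node.1", "node.2"},
}


def creation_batches(dependencies):
    """Group resources into batches; items in one batch could run in parallel."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        # get_ready() returns every resource whose dependencies are all done.
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


if __name__ == "__main__":
    for number, batch in enumerate(creation_batches(DEPENDENCIES), start=1):
        print(f"batch {number}: {', '.join(batch)}")
```

The network and the three nodes land in the first batch, and the firewall waits for the second, which matches the order described in the talk.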
So knowing the graph, knowing what our current state is, we can then automate plans. This is a quick little video I made. Here we're taking that configuration file we had and we say "terraform plan," and we gave it an output. We said "save this plan to a file." So Terraform says, "okay, you have absolutely nothing, so I'm going to make all of these things. Here's your three instances I'm going to make, and here are the other two things I'm going to make."
Now, this is not reflective of the ordering that we saw on the graph, but you can see by machine IDs and network IDs up top, it says they're ... Well, it gives you where I'm going to use the value there, but it says machine IDs are computed. That's because the ID of those machines is something I find out later. I need to wait for those things. I don't know what they are yet. Terraform can read that graph, it knows what to do and when to do them.
Now because I saved that to a file, I can tell Terraform to just make the changes I already approved. Don't recompute things in case something changed. Just make the changes that we just talked about. That's it. So here Terraform does those things. This is more reflective of the order. It makes a node and it makes a network. You see those happening at the same time because the network comes in between the nodes. Terraform can make them in parallel because it knows that they don't depend on each other, and at the very end it makes the firewall. Again, Terraform handles the structure, Terraform can get you to where you want to be, and, God forbid, if something catastrophic happened, it can rebuild all these things.
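The plan-then-apply idea can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not Terraform's actual algorithm: configuration and state are plain dictionaries, and `plan` and `apply_plan` are hypothetical helper names. A plan is the list of actions needed to move the recorded state to the desired configuration, and applying a saved plan carries out exactly those actions.

```python
def plan(desired, current):
    """Compare desired configuration with recorded state; list needed actions."""
    actions = []
    for name in sorted(set(desired) | set(current)):
        if name not in current:
            actions.append(("create", name))
        elif name not in desired:
            actions.append(("destroy", name))
        elif desired[name] != current[name]:
            actions.append(("update", name))
    return actions


def apply_plan(saved_plan, desired, current):
    """Carry out exactly the saved actions and return the resulting state."""
    new_state = dict(current)
    for action, name in saved_plan:
        if action == "destroy":
            new_state.pop(name, None)
        else:  # "create" or "update"
            new_state[name] = desired[name]
    return new_state


if __name__ == "__main__":
    # Hypothetical configuration; the attribute values are placeholders.
    desired = {
        "network": {"name": "main"},
        "node.0": {"size": "small"},
        "firewall": {"allow": ["node.0"]},
    }
    saved = plan(desired, current={})  # nothing exists yet: everything is created
    state = apply_plan(saved, desired, current={})
    print(saved)
    print(plan(desired, state))  # nothing left to do after the apply
```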
These are just things we simply don't have in the manual stage. In the manual stage, who knows what order you would do these things in? Maybe the firewall can be created before the machines are created, and that's what you want, and you went back and added them, but maybe your colleague wouldn't do it that way. Terraform presents a very consistent way of doing these things.
Now that we defined our infrastructure and we're automating this creation, we're going to look at those challenges again. This is where we were in the manual stage and I feel like we've done a lot better here. We're not perfect, we haven't improved everything, but we have upgraded.
Reproducibility—we have a config file. We know what we want, if something catastrophic happened we could redo it. Or we could apply it somewhere else. We can build it again.
Change management—we're now doing things consistently. Terraform is automating that, we're not relying on someone's specific skills with a CLI or a console. Terraform is handling that for us in a consistent, reliable way.
Architecture—I left as red. We'll get into why. The way we have architected our application, it's not very shareable. We can't really consume it easily with other projects yet, so we'll get to that.
For auditing, we still don't really know who did what and why, other than by talking to your colleague.
Consistency—we can now name things more consistently because we can define them in variables and then just interpolate that across several resources, so our environments can start to at least look similar in names.
In knowledge sharing you have files. People can actually look and see—oh, it has all these things and oh, Clint, did you remember that we had Redis and you need that somewhere else? I can now remember these things because they're written down. So things are much better at this stage.
But what are our next challenges? As Armon said yesterday—there's no free lunch, so now that Terraform's generating these graphs and it's applying these plans, it's saving the state of our infrastructure. It's saving all of this stuff in a JSON file on your disk right there.
But if you have a colleague, having two state files that are supposed to be the same is not ideal. The benefit of having a state file comes with the problem of managing it: it needs to be consistent. Then we have an operations problem: who is doing these things? And how and when? There's no organizational oversight into this if everyone's just using their laptops. But it's a start.
That was the semi-automated stage. You're done with that. We still have 20 minutes for two more stages, so I'm curious as to how this is going to play out.
We're scaling up. We're going to move on to the next stage. We've introduced a little Terraform.
Startups, you're starting to build your infrastructure and have easy control of it. Enterprises are probably either doing greenfield projects or experimenting with importing existing infrastructure as a way to grow and get adoption throughout the organization.
Now we enter the infrastructure as code stage. Yes, we introduced infrastructure as code in the last stage, but now we're changing our ideology. It's no longer this is a thing we're starting to do, or we're looking at, or we're playing with. This is now, we're going to get on board, this is how we manage our infrastructure.
We're starting to get organizational adoption. Either the startup says everybody's doing this or the enterprise starts announcing, look, this is where we're going with the future.
We start having multiple environments and these environments are more closely related, more identical, so think dev, staging, and prod, and we're going to start doing version control. We're going to keep track of these things.
Organizational adoption too. We're going to stop using Web consoles so much. Some people may go as far as to actually block a lot of the Web console access except for administrators. You start seeing that at this stage sometimes, but it's generally recommended at a later stage. With Terraform, at this stage we introduce modules, workspaces, and we start doing managed state.
Modules are packaged configurations. They're a set of Terraform configuration files like you would normally use, but they are packaged such that they're in a folder that you can easily share with other people. The idea is to pre-configure components. Components can mean a lot of things. It could be just a very basic VPC setup. It could be, this is how we make a relational database with a follower for reads. It could do all sorts of things, it can be all sorts of scales depending on what your organizational needs are.
You set it up one time. You give it proper inputs and outputs, things that you know other teams or people are going to want to configure. Names, instance sizes, how many instances you want in an auto scaling group, and you just put that in a very simple folder structure. In fact, it really only needs one Terraform file, but a common structure we'll see here is a readme that explains what this is, what it makes, and documents what the inputs are.
The main Terraform file will then contain all of those instructions, the normal Terraform configuration you would see. It accepts variables that are defined in the variables.tf file, and then it defines outputs, because a module is just a set of configuration files. When you use a module, you give it inputs, as we see here, and you say where the source is. In this instance it's on GitHub.
We said the version we want, and here we have an output. So just because it's a bunch of Terraform configuration files, it too has outputs. You could output IP addresses, address spaces, number of servers, anything that you would output in another Terraform configuration file.
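One way to think about a module is as a function: declared inputs go in, resources get described, and declared outputs come back. The Python sketch below is only that analogy. The `two_tier_app` name, its parameters, and the resource fields are hypothetical and do not correspond to any real module.

```python
def two_tier_app(name, instance_count=1, instance_size="small"):
    """Hypothetical 'module': inputs in, resource descriptions and outputs out."""
    load_balancer = f"{name}-lb"
    instances = [f"{name}-web-{index}" for index in range(instance_count)]

    # The resources this packaged component would describe.
    resources = {load_balancer: {"type": "load_balancer", "targets": instances}}
    for instance in instances:
        resources[instance] = {"type": "instance", "size": instance_size}

    # Only the values the module author chose to expose to callers.
    outputs = {"load_balancer_name": load_balancer, "instance_names": instances}
    return resources, outputs


if __name__ == "__main__":
    # Two teams reuse the same component with different inputs.
    _, shop_outputs = two_tier_app("shop", instance_count=3, instance_size="large")
    _, blog_outputs = two_tier_app("blog")
    print(shop_outputs["instance_names"])
    print(blog_outputs["load_balancer_name"])
```

The point of the analogy is that callers only see inputs and outputs, so an operations team can change what happens inside without every consumer rewriting their configuration.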
Now we can package up common architectures and share them. A larger organization with an operations team can say this is how we do a two-tier application. This is how we do it. If your development team wants that, use this module, plug these things in, there you go. Once you have it booted up then you give it your instance or you give it your AMI or your disk image, and these things are handled for you. The operations team can do that and supply it to enable the other teams. As I mentioned, these are easily shareable, they're on file, you can put them on S3 buckets, anywhere in public, really.
Workspaces in Terraform are defined as the pairing of a configuration and an environment. Workspace is a command in the Terraform command line tool. It easily allows you to basically manage several state files for the same configuration file. So locally you could create a new workspace and say the default one is going to be my dev, and then you could create staging, and then you could create production. And it's similar to using Git as a version control system in that you can change workspaces, run a new plan, and it will use the state file for that workspace and say this is what I'm going to do.
So you can easily test out changes to your infrastructure on a development environment, and once it checks out, it's fine, you can change workspaces and run the same plan or run a plan again to see how it would take effect on the next environment. Verify it looks fine, and you can gradually promote these things all from the same directory, using the same configuration files. We're just using different state files and different interpolations—variable file things to control the size of that environment. Same configuration, separate state files.
And as I was mentioning, you can use the workspace in your interpolations in the Terraform files themselves. So back to our example here, we had this count three. You can change that and say what's my Terraform workspace? If it's production, I want three. If it's not production, it's stage or dev, I only need one. You might need an auto scaling group with a max of 50, but in staging you need a max of two. You don't want to completely duplicate your production environment in the number of instances, but you do want the same structure.
So workspaces allow you to have interpolations and control those things. And it pairs nicely with a feature-branch style of development that we see with Git, and GitHub, and pull requests, and changes, and things like that. The state is still tightly coupled to the configuration files, but you can manage that.
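The workspace idea, one configuration with a separate state per environment and a count that depends on the workspace name, can be modeled in a short Python sketch. This is an illustration only: the `Workspaces` class and `node_count` function are hypothetical and are not part of Terraform.

```python
class Workspaces:
    """Hypothetical model: one shared configuration, one state per workspace."""

    def __init__(self):
        self.states = {"default": {}}
        self.current = "default"

    def new(self, name):
        """Create a workspace with its own empty state and switch to it."""
        self.states.setdefault(name, {})
        self.current = name

    def select(self, name):
        """Switch to an existing workspace."""
        if name not in self.states:
            raise KeyError(f"workspace {name!r} does not exist")
        self.current = name

    @property
    def state(self):
        return self.states[self.current]


def node_count(workspace):
    """Same configuration everywhere; only the size differs per workspace."""
    return 3 if workspace == "production" else 1


if __name__ == "__main__":
    workspaces = Workspaces()
    for name in ("staging", "production"):
        workspaces.new(name)
        workspaces.state["nodes"] = node_count(workspaces.current)
    print(workspaces.states)
```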
So about state. The last part of this section is managed state. I mentioned earlier that Terraform applies these changes, it saves these things to state in a JSON file format. That's great, except you and your colleague both have that state file that's supposed to be the same, but maybe you made an apply and they didn't know it, and their state's now out of date and maybe something terrible could happen. We don't want that. We need to manage our state.
At the very least Terraform generates the state file in a JSON format. It's machine readable and more importantly easily diff-able, so if you put it in something like Git, you're able to track the differences over time. You apply a change, your state changes, you commit the new change, you can push it up to source code management like GitHub, BitBucket, whichever one you're using. That's better than nothing, right? You still have a weird situation where maybe your colleague hasn't updated from VCS and you can run into odd states. But at least we're recording things, and we get a little bit of a history.
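To show why a JSON state file is easy to diff, here is a small Python sketch using only the standard library. The state dictionaries are invented placeholders and do not follow the real Terraform state format; `state_diff` is a hypothetical helper.

```python
import difflib
import json


def state_diff(old_state, new_state):
    """Return a unified diff between two state snapshots serialized as JSON."""
    # Sorting keys keeps the serialization stable, so only real changes show up.
    old_lines = json.dumps(old_state, indent=2, sort_keys=True).splitlines()
    new_lines = json.dumps(new_state, indent=2, sort_keys=True).splitlines()
    return list(
        difflib.unified_diff(old_lines, new_lines, "before", "after", lineterm="")
    )


if __name__ == "__main__":
    before = {"network": "main", "nodes": 1}
    after = {"network": "main", "nodes": 3}
    print("\n".join(state_diff(before, after)))
```

This is roughly what version control gives you for free when the state file is committed after every apply.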
A more advanced thing to do would be remote state, where instead of having the state file generated locally, you configure a backend in a normal Terraform file, and you use this backend location for all of your state operations. You no longer get a state file locally. Every action—like a read, a plan, an apply—that would affect state, that would edit the state in any way, happens remotely. It reads first, does its thing and then when it has updated information it saves that at the remote location.
That provides better consistency and collaboration with your teammates because you now have a common point that's not version control. All Terraform actions, all the commands behave like normal but they talk to a remote state to get their most current values. There are lots of different backends. These come built into Terraform. There's a Terraform Enterprise backend, Consul is a backend, and then AzureRM storage, Google Cloud Storage, S3.
A lot of these, but not all, also offer state locking, which will prevent you and a colleague from clobbering each other by both doing writes at the same time. It will lock the state file when it knows that you're making that kind of change.
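State locking can be pictured with a small in-memory model. The Python sketch below is a single-process illustration under stated assumptions, not a real backend: `RemoteState`, `StateLockedError`, and the operator names are all hypothetical, and a real backend would enforce the lock on the remote side.

```python
class StateLockedError(Exception):
    """Raised when someone else holds the lock, or a write is attempted without it."""


class RemoteState:
    """Hypothetical shared state with a single-holder lock."""

    def __init__(self):
        self._state = {}
        self._holder = None

    def lock(self, operator):
        if self._holder is not None and self._holder != operator:
            raise StateLockedError(f"state is locked by {self._holder}")
        self._holder = operator

    def unlock(self, operator):
        # Only the current holder can release the lock.
        if self._holder == operator:
            self._holder = None

    def read(self):
        return dict(self._state)

    def write(self, operator, new_state):
        if self._holder != operator:
            raise StateLockedError("acquire the lock before writing")
        self._state = dict(new_state)


if __name__ == "__main__":
    remote = RemoteState()
    remote.lock("alice")
    try:
        remote.lock("bob")  # bob has to wait until alice is done
    except StateLockedError as error:
        print(error)
    remote.write("alice", {"nodes": 3})
    remote.unlock("alice")
    remote.lock("bob")
    print(remote.read())
```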
Okay. So that's the infrastructure as code stage. We've got workspaces to help us manage multiple environments, and we're managing our state. We can now bring more things in there, but we need to review our challenges. This is where we left off, after we complete this stage, things are a lot better. We've got green now.
Reproducibility—we have common architecture stored in modules that are easily shared and read. Easily sharing them helps with consistency and knowledge sharing, we know where these things are. We now have things tracked in version control so you can see the history.
This is where we start seeing people integrate with GitHub or other providers where you start seeing pull requests against common infrastructure. Teams can suggest changes and they can be evaluated individually and then combined in a peer review manner.
Architecture is better. We can now define these common modules that are reusable. Teams can now use them. State management is improved because, ideally, we're using remote state, but if nothing else we're version tracking it in version control.
Auditing has improved. If you have tracked versions of your state you can see where you've been and how you got there, but there's still some assembly required.
Consistency—at this point we have common modules. We're creating things in consistent ways. All right, so we still don't get free lunches. We've introduced new problems here. When you're in this stage you can probably be quite happy for some time, but when you start to feel operational pains or you start really worrying about secrets and governance, that's when you know you need to advance to the next stage.
By secrets, I mean obviously providers require credentials. There are tokens, there are SSH keys, there are all sorts of things. You don't want to save those in state. As I mentioned before, those are in version control. Those are published in clear text. We don't want that. I mean, they could be. It depends on what they are and the provider that's using them.
And then governance. As an organization you're growing and you need some kind of oversight into who's making these changes, when, how are these changes happening, do I have 10 operators who are just running this on their laptop? Maybe they're at a coffee shop but they're modifying my production infrastructure. We kind of want to know when these things are happening, who are making these changes, and where are they happening.
So that's the infrastructure as code stage. We're gonna go on to the next stage, the final form as I've conveniently laid it out.
It's like the infrastructure as code stage, but it's collaborative. At this point we start seeing a shift in how Terraform is used. Terraform is still the point where changes are happening. It's still automating the creation, the modification, and management of your infrastructure but, because your infrastructure has grown, because you're using it on so many things, because you have a larger ops team, at this point we can't just have people running this on their laptops. We need a centralized system of where this is going to happen.
We need to take a moment then and be happy that we're now automating these changes, but we need to start automating the automator. We're not gonna be using Terraform directly anymore. At the collaborative infrastructure as code stage we need coordination, we're gonna need some enhanced workspaces and we're gonna start using registries. Some of these are features of Terraform, some of these are features of what we're gonna get into next because, at this stage, we need something like Terraform Enterprise.
This is not a sales pitch. I'm not a salesperson, but Terraform Enterprise is the result of working with practitioners and companies who are using Terraform at this stage. Coordination, governance, enablement, that is what Terraform Enterprise does. You can use other tools to do this thing, but this is where Terraform Enterprise lives. This is its purpose. Back to our idea of Unix tools, Terraform Enterprise does those things. That's what it does, so I use that as an example.
Full disclosure: I work at HashiCorp. But also, Terraform Enterprise is like a canonical example of how these things are done. At this stage we're using version control, but in our centralized system we need VCS integration, we need to pull from there. We need team-based permissions and we need something to actually run Terraform for us. These are all things that Terraform Enterprise offers for us. It's going to run automatically, and automate our automator. What I meant to say there was you're using something like GitHub and you're able to trigger runs and other plan changes in Terraform Enterprise.
Looking at upgraded workspaces, in Terraform Enterprise workspaces are a little bit different. You still have the concept of configuration mapped to a state file, but they're no longer tightly coupled. You can create new workspaces and say—well, the repo here is that PR demo and we have two of those and you can see we have two workspace names. That doesn't mean the state file's gonna be shared side by side and you're no longer gonna use interpolation of the workspace to set up your variables and to create your differences. And each of these workspaces are available as a data source to other workspaces so you can still share things across workspaces.
If you look at an individual workspace you start seeing a run history. Here we see three of them. Two of them were queued manually because someone went in there and changed something by hand. The middle one is what you'll see a lot more of: when somebody submits a pull request to GitHub, it can automatically trigger a run in Terraform Enterprise.
Now, if it's not on the master branch, if it's just a pull request, that run will only be a plan. You cannot apply that run because it's not in the master branch. You can review it, say, "Yes, that plan looks great." You can go back to the pull request and say, "Yes. This pull request checks out." Then they merge it, which triggers another run. That run, then, will generate its own plan. You can verify that that plan lines up with the last plan that you looked at, make sure nothing's changed, and then you can apply it.
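The rule described here, pull requests only produce a plan and only a run on the master branch can be applied, fits in a few lines. This Python sketch is a hypothetical restatement of that rule, not Terraform Enterprise's implementation, and `run_for` is a made-up name.

```python
def run_for(branch, is_pull_request, default_branch="master"):
    """Decide what a version-control event may trigger (hypothetical rule)."""
    if is_pull_request or branch != default_branch:
        # Reviewable plan only; nothing can be applied from here.
        return {"plan": True, "apply_allowed": False}
    # A merge to the default branch produces its own plan, which can be applied.
    return {"plan": True, "apply_allowed": True}


if __name__ == "__main__":
    print(run_for("add-redis", is_pull_request=True))
    print(run_for("master", is_pull_request=False))
```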
The key point there is that you're now working with Terraform from a distance, in a sense. You're no longer using the command line interface. You've automated the application and changing of configuration files. With workspaces you get independent variables per workspace. I was mentioning in my previous example that we use interpolation to determine how many instances we wanted based on our workspace. In Terraform Enterprise you get your own set of variables that you can input and they're stored independently by workspace so you don't have to completely modify your configuration files to always reference the workspace. You can just say, "Use this variable." And then that variable is declared in all of those workspaces.
You have team-based permissions, so you can have a dev, staging, and production environment. With the dev environment your development team can write to the state there, which means they can approve their own runs there. But for, say, staging or production, they can only read. So you can have governance over who can promote changes to production environments.
Then you get state and run history. When you use remote state with Terraform Enterprise we keep versioned states of every change, so you can go and look at a historical log and you can look at all the runs that have taken place: who approved them, where the source was, back to the pull request, things like that. Sensitive information in the variables is encrypted with Vault. It has its own Vault backend and it's one way. If you add a sensitive credential in the user interface you can't get that back. All you can do is replace it or delete it. In the environment, Terraform will have access to that but users won't.
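The permission model in this example can be written down as a small lookup table. The Python sketch below is illustrative only. The "developers" entries follow the talk, while the "operations" team and the `can_approve_run` helper are assumptions added for the example.

```python
# Hypothetical permission table: workspace -> team -> access level.
# The "operations" team is an assumption; the talk only describes developers.
PERMISSIONS = {
    "dev": {"developers": "write", "operations": "write"},
    "staging": {"developers": "read", "operations": "write"},
    "production": {"developers": "read", "operations": "write"},
}


def can_approve_run(team, workspace):
    """Write access to a workspace's state is what lets a team approve runs."""
    return PERMISSIONS.get(workspace, {}).get(team) == "write"


if __name__ == "__main__":
    for workspace in PERMISSIONS:
        print(workspace, can_approve_run("developers", workspace))
```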
Again, I'm not trying to make this like a sales pitch, but at this scale of adoption, these features are the bare minimum. These are required at a large scale of adoption in Terraform, you have to have these things.
I am running low on time so the other thing to mention there is the registry. Yes, another HashiCorp product that I'm selling. But this one's okay because it's free and it's public. You can use the Terraform registry to publish those modules you've worked on and use modules from other people. The registry has several modules published by Amazon, by Azure, where they're sharing what they believe are best practices. You can consume that and then that serves to be a self-serve product. Dev teams can then use these modules off the shelf and get going without operations being a bottleneck. A limitation there: search and discovery. The Terraform registry is all public and all open. You can look at the source. If you have private modules you need to use something different, you need to use a private registry.
We've published the API protocol for the private registry. You can write your own if you want, or you can use the one built into Terraform Enterprise. Still not a sales pitch, promise. It says right there. I have 14 seconds. Reviewing these challenges, this will go really good because now that we've entered the collaborative stage everything is green. Because you can't end a talk with red or orange things, everything has to be perfect.
We have a centralized system. We have consistent and safe iteration and management of our infrastructure. We have governance and histories and multiple environments, things like that.
So, stage Next. What happens there? Where do we go from here?
Well, open source and Terraform Enterprise have some exciting things coming up. These are not supposed to be incremental so I'm just gonna fast forward. For Terraform open source, in the near future we're getting an updated plug-in interface. Right now we use a plug-in architecture that we're upgrading to have better type safety and enable better features, such as HCL2, the next version of our configuration language. It's gonna have more powerful features for you to model your infrastructure and we're always gonna continue to grow and enable our providers.
New features, making things better and easier, adding more documentation to make it easier for you to write providers, write your own. Terraform Enterprise. Sentinel: Policy as Code. That actually exists already where, instead of strict governance of who can do what, you can set policies that say within these bounds people can do certain things. Like, you can only create an instance of this size, you can only create this many of them and it's run against all of your configuration plans so a dev team can create a new thing but an administrative team can say, "Well, it must fit within these parameters." That exists now.
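The idea of policy as code, checking every plan against bounds set by an administrative team, can be sketched in Python. This is not the Sentinel language and not how Sentinel evaluates plans. The limits, the plan shape (instance name mapped to size), and the `policy_violations` function are all hypothetical.

```python
# Hypothetical bounds an administrative team might set.
ALLOWED_SIZES = {"small", "medium"}
MAX_INSTANCES = 5


def policy_violations(planned_instances):
    """Check a plan (instance name -> size) against the bounds above."""
    violations = []
    if len(planned_instances) > MAX_INSTANCES:
        violations.append(
            f"{len(planned_instances)} instances planned, limit is {MAX_INSTANCES}"
        )
    for name, size in sorted(planned_instances.items()):
        if size not in ALLOWED_SIZES:
            violations.append(f"{name}: size {size!r} is not allowed")
    return violations


if __name__ == "__main__":
    planned = {"web-0": "small", "web-1": "xlarge"}
    problems = policy_violations(planned)
    print("plan passes policy" if not problems else "\n".join(problems))
```

A dev team stays free to create new things, and the check only rejects plans that fall outside the agreed parameters.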
Enhanced remote backends are something that's gonna be in Terraform Enterprise where an operator can run Terraform commands on their machine but the actual runs are happening in Terraform Enterprise. They're not actually happening on your machine, it's more like a remote console in that sense. It's not something you would do often, but it's still using the centralized part. It's still gonna be logged, all those things.
Configuration designer: I don't know how available that is but I was told in the email I can talk about it. It's a user interface in Terraform Enterprise where people can design infrastructure without having to know HCL. You can enable frontend teams or other teams that just might not have experience with HCL to create what they need. They can reference modules, add new things and it's all done from a user interface. That's coming.
And promotions workflow, so instead of having to change workspaces manually to apply these different changes that you've pushed, you can approve a change to one environment. Once that's applied and it checks out, you can automatically promote that to the next environment in the chain. You don't have to open a new pull request or change the configuration files.
Okay. That's it. I'm done. Thank you.