Crawl, Walk, Run With Terraform
Oct 07, 2019
Learn how to do Terraform right from the start—hear about Anaplan's Terraform adoption journey, which took them through the crawl, walk, run stages of using Terraform with individuals, teams, and teams of teams.
This talk walks through using Terraform to manage various stages of managed and unmanaged infrastructure components, starting small, making life more enjoyable for one person, and then growing to several people and exploring opportunities for extending it to many.
Terraform has many uses. With its declarative nature come some restrictions and behaviors that are shared in this talk. The session also explores where changes have occurred in some example code and provides insights into early, easy choices that will make later decisions easier and more maintainable.
Principal Cloud Engineer, Anaplan
It's awesome doing a topic on Terraform when, in prior years, it wasn't such a big topic, and now we hear it being mentioned almost every session or every other session.
I'm going to do a "crawl, walk, run" talk. It's not particularly tech-heavy; it's more about what you can expect as you journey from first looking at Terraform, from a technology and a process perspective, to growing it into your teams, and then eventually growing it at scale, when it's not just your team or your neighbors running it.
I'm Brian Menges. I've been doing Terraform since 0.3. It was really interesting back then. State files were definitely collateral damage back in those times, but things are much better these days.
I've contributed to the project. I contribute to a lot of other open-source projects.
I work for Anaplan, a planning company. I'm always looking for people, for my team in particular, because we have interesting cloud challenges as we move around. If you're interested, check out our jobs.
» Infrastructure as code
You don't just deploy your infrastructure; it takes a lot of loving care.
When you start embarking on this journey, you've usually got a lot of manual, rote processes, your ServiceNow tickets. Now you're moving into APIs and you want some consolidation. This is where Terraform comes in. Then you move on further.
This is going to be our journey.
Crawling is when you're looking at the very beginning of this. Let's look at Terraform's structure. In Terraform, .tf files are your HCL, where your code comes in. The .tfvars files are usually going to hold some sensitive information, the values that you want to be able to supply from your own workstation.
Your API credentials to AWS, your JSON credentials to Google, your Azure key, just to name a couple of starters: you don't want these committed most of the time. But you're going to have a problem when you start moving to other teams. They need credentials too, and they need to add their own files. We'll get into that.
Terraform is a state engine. You need that state. It's very important. You don't want that committed. That's going to contain those sensitive things that we just referred to.
And of course, if you have plugins or any kind of context for your configuration, that's going to be in your local .terraform/ directory. If you look at repos online, you'll see that the committed .gitignore excludes things like .tfstate, .terraform, and some other, often CI-related, files.
You're going to run Init to download your plugins and do a cursory syntax check. Plan is going to tell you what it thinks it's going to do. Apply is going to try and make it so. It doesn't always succeed. You can try and apply, and apply, and apply again.
My favorite phase is Destroy. I believe in a lifecycle of things, especially a cattle/pet sort of mentality, in that I want a final state of things. I want it gone.
In a lot of the processes that we exercise, we want an environment to exist for as little time as possible. That includes the full lifecycle. We don't want it lingering and growing to the point where it becomes so uber-critical that a team needs it. I don't want that. I want you to be able to move on.
Some of my favorite commands:
fmt, for free, will do a good job at linting your code, aligning it, making it a lot more readable. The HCL syntax is great for that.
Refresh allows you to cover all those points at which somebody went into the console and changed information on you, but you need to acquire that. You don't want to fail when you decide to run any of your code or automation.
Console is where I spend probably a good 20% of my development time. I'm testing interpolation. I'm figuring out what math I can use or what information I can reuse from what I've already gotten. How can I prompt the user, when we're creating things, for no more information than I necessarily need?
Then I can cover a multitude of places by parsing strings, filtering lists, doing things like that. In the console, you can test that.
And of course state. We move state, we inspect state, we look at all sorts of things. Instead of telling people, "Run the plan again so you can see the outputs," no, just terraform state show the resource name. You can get a whole set of information about it.
Let's do something simple. Let's look at Google, create a network. This is uber-simple, right? You can go into a UI. But logging in is at least 2 clicks, and if you have 2FA, maybe 3. You have to pull your phone out of your pocket, click confirm or grab the code and put it into the UI.
Now you're just in your project.
You're going to go onto the network tab, you're going to add a network, you're going to name it. You're finally done, 2 1/2 minutes later.
I have 12 lines that just do that: just run it. And when you no longer need that, again, my favorite command, terraform destroy, so then you can just clean that thing up.
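The 12 lines from the slide are not part of this transcript. As a minimal sketch of what creating a Google network in about that much code might look like, assuming a hypothetical project ID and network name, and assuming credentials come from the environment:

```hcl
# Hypothetical project ID; credentials are assumed to be supplied outside
# this file (for example via GOOGLE_APPLICATION_CREDENTIALS).
provider "google" {
  project = "my-example-project"
  region  = "us-west1"
}

# A custom-mode VPC network with a hypothetical name.
resource "google_compute_network" "example" {
  name                    = "example-network"
  auto_create_subnetworks = false
}
```

Running terraform init, terraform plan, and terraform apply in that directory would create the network, and terraform destroy would remove it again.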
But when you're crawling with Terraform, this is probably what it's going to look like. You're going to start with a small set of items. Choose some small unit that traditionally is a repeat or rote practice.
Creating things that may live a little bit longer than normal, networks in Google, VPCs on AWS—you have a litany of options of things that traditionally will stay around. I suggest starting with those, make some small time savings for yourself. Get rid of the really rote tasks to just be comfortable with the language and inspect the providers. See what you can get away with.
You have maybe some variables; you're supplying some information. But now what's more important is that this code is going to live and maybe not necessarily be quite as modified.
I'm going to give a lot of people some options. Grab my code. If you have credentials for the environment, that means we allow you to create stuff. Update your information, otherwise you're just simply going to get the default stuff.
Now, mind you, if you're collaborating with somebody else and some of this has already been done, you're probably going to get conflicts. But needless to say, this is at least telling not just me, but other members of my team, "When you get this request and I'm on vacation, you can pull this code, add your credentials, and create some networks." And this will save our time.
Going back to the Terraform console and talking about interpolation, you can see down in the provider that we're doing some interpolations. We're joining on hyphen. We're slicing and splitting on hyphen from var.zone.
One of the nice things about this is zones and regions—well, zone is only a component of the region. If we drop that -a from us-west1-a, we now have our region; that's what the provider wants.
But something else might want the more specific information like zone. So we'll provide that as an item, but we're going to chop it up. We're going to take that item and split it on the hyphens into a list of 3 elements. In slice, we're going to do 0 to 2, 0 inclusive to 2 exclusive. We do not want the last element, so we'll drop it.
But in order to provide region—region is a string—we have to join that back up with hyphens again.
So we've got a list. We're going to join it with hyphens, which puts us-west1 back together, and keep going. This is a little bit more portable. You can take your teams, PRs, and indexes off of it.
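The slide's code is not reproduced in this transcript. A minimal sketch of the interpolation described above, assuming a variable named zone and a hypothetical project ID, might look like this. A local value is used here for readability; the talk does the same thing inline in the provider block.

```hcl
variable "zone" {
  type    = string
  default = "us-west1-a"
}

locals {
  # split("-", "us-west1-a") produces ["us", "west1", "a"].
  # slice(list, 0, 2) keeps indexes 0 and 1: ["us", "west1"].
  # join("-", ...) puts the pieces back together: "us-west1".
  region = join("-", slice(split("-", var.zone), 0, 2))
}

provider "google" {
  project = "my-example-project" # hypothetical project ID
  region  = local.region
  zone    = var.zone
}
```

The same expression can be pasted into terraform console to check the result before committing it.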
Some other interesting options: If you are providing defaults, you won't prompt your user for them. If you want to force them to give you that information, say, in the networks, because we're no longer going to create a Terraform network for every user who pulls this stuff, simply delete the default.
Now on the CLI, this will prompt you. It's like, "Give me a list. I want to know the networks you want me to create." But these days, we want to give them at least an easy ability and an expectation of success.
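As a rough illustration of a variable with no default, here is a sketch assuming a hypothetical variable named networks and a Google provider that is configured elsewhere in the same directory:

```hcl
variable "networks" {
  type        = list(string)
  description = "Names of the networks to create"
  # No default: terraform plan and terraform apply prompt for a value unless
  # one is supplied with -var, a .tfvars file, or TF_VAR_networks.
}

# One network per name in the list (resource label is hypothetical).
resource "google_compute_network" "requested" {
  for_each                = toset(var.networks)
  name                    = each.value
  auto_create_subnetworks = false
}
```

At the prompt, a list is entered in HCL syntax, for example ["blue", "green"].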
Again, this code is reusable. We can start to do things like introduce CI/CD. We can get PRs and feedback. We have that expectation of getting a loop.
And, as I pointed out, we already know the zone, therefore we already know the region, because that zone is not going to be outside the region. We can reuse this information through crafting interpolation on Terraform, make efficient use of our time, and also get something better out of it.
Now you're going to start running into problems. It's not just you running the code; somebody else is too. So somebody else's credentials might accidentally get committed. We hope that never happens. If you're using the same file types and you're excluding the right files, this won't be a problem.
We've definitely committed credentials every now and then. We solved that by destroying a repo and then recommitting it.
We depend on local files, but we're only walking at the moment, so we're probably still running this from our own machine. CI is probably an interesting thing. We might want to look at it.
However, this introduces the question, What do we do with that state file? Because, if in our CI engine, we keep the CI up and the state file is there, that's a security problem. It has to be incorporated where our sensitive information is.
But if we externalize that state, we have to make sure that that's secured properly, that only the right people have access to that state. Naturally, though, you have to read the state in order to run the plan. So you can't just willy-nilly give it to a developer, because if they decide to get inquisitive, they can take a look at the state because their credentials are needed in order to update.
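As one possible way to externalize state, here is a minimal sketch of a remote backend. The talk does not name a backend; the Google Cloud Storage backend, the bucket name, and the prefix below are all assumptions for illustration.

```hcl
terraform {
  backend "gcs" {
    bucket = "example-terraform-state" # hypothetical bucket name
    prefix = "network/dev"             # hypothetical path within the bucket
  }
}
```

This only moves the state. Who can read the bucket still has to be restricted separately, since anyone able to run a plan against this backend can also read what the state contains.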
I've definitely been guilty of boiling the ocean. We've had some interesting runtimes where I've had Terraform plans running in excess of 30 minutes successfully, across thousands of lines of code.
This is not a really good way to do things. You think, "Great, I'll just deploy an environment." Well, what does an environment consist of? My 45 apps, my 16 load balancers, a couple of S3 buckets, 2 databases, and the list goes on. And that's just 1 plan.
As you begin getting from walk to trot and start to approach run, units of scale become an issue, and your blast radius becomes a concern.
Terraform does a pretty decent job by default of keeping your order of operations. However, sometimes you need a depends_on mechanism in order to make sure that the last resource did run. So I implore you, look at all of these syntax options and items that you can make use of.
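A small sketch of an explicit dependency, using resources chosen for illustration rather than taken from the talk, and assuming a Google provider configured elsewhere:

```hcl
# Enable the Compute Engine API in the provider's project.
resource "google_project_service" "compute" {
  service = "compute.googleapis.com"
}

resource "google_compute_network" "example" {
  name                    = "example-network" # hypothetical name
  auto_create_subnetworks = false

  # No argument above references the service resource, so Terraform cannot
  # infer the ordering on its own; depends_on states it explicitly.
  depends_on = [google_project_service.compute]
}
```

When one resource already references another's attributes, Terraform orders them automatically and depends_on is not needed.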
One thing that I don't cover here is [modules](https://learn.hashicorp.com/terraform/getting-started/modules.html "Introduction to Terraform Modules"). That's kind of important. When walking with Terraform, I really recommend that you know what's going on underneath. Maybe not somebody else's opinionated view of the resources and inputs. Experiment with modules and play with them. But you need to know what goes on underneath.
Some of the most fantastic discoveries that I've made in terms of code that I can reuse is by going into a module and seeing what somebody else has done, especially with interpolation and stuff like that. Understanding at the base root levels what it's doing that somebody else said it needed to do in x order, that's really meaningful.
So I do encourage the exploration of modules, even within and amongst our teams. Like, "Why don't you use the AWS VPC module?" "Well, I don't need to give 14 inputs; I only need to give 3. It's much simpler and I don't need to have another thing download just to run my VPC."
So, units of scale and blast radius.
» Infrastructure as code testing
At the very end of walking and trotting, I suggest that you start looking at testing. There are various opinions on infrastructure as code testing. Like most developers, I'm in the camp of "you need to test and verify your code."
These are the 2 that I recommend. If you love Go, Terratest is going to be your tool. It's wonderful. It's beautiful. You can get fairly deep into your state and your intention.
If you're more familiar with things like RSpec and Kitchen as a CI framework, Kitchen-Terraform has been doing this for quite a while. They have a very good product.
What I do implore is that you choose 1 and try to stick with it as much as possible. Terratest is probably more appropriate for developing your own internal modules, and Kitchen-Terraform is more appropriate for testing your plan, what you plan to do, the infrastructure that you would like to create.
And, of course, it destroys. I love destroying.
» Organizing the code
Now our repo structure comes into question. How do we want to organize this code?
HashiCorp has recommendations on their site when you're loading and onboarding into Terraform. Multiple workspaces per repo is fantastic. I use 1 codebase. I supply the variables, dev, stage, prod. They all look the same except for what we put into it.
Regions are variables. Quantities, counts are variables. The S3 buckets are all inputs and variables.
These are all committed as code, and we import those variables, so as they're committed and changed for those environments, each workspace gets that information when it runs.
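As a sketch of the one-codebase, per-environment-variables layout described above, assuming hypothetical variable names, file names, and values:

```hcl
# variables.tf: one codebase shared by dev, stage, and prod.
variable "region" {
  type = string
}

variable "instance_count" {
  type = number
}

variable "bucket_names" {
  type = list(string)
}

# dev.tfvars (hypothetical file, committed alongside the code):
#   region         = "us-west1"
#   instance_count = 1
#   bucket_names   = ["example-dev-assets"]
#
# prod.tfvars (hypothetical):
#   region         = "us-east1"
#   instance_count = 6
#   bucket_names   = ["example-prod-assets", "example-prod-logs"]
```

On the CLI, an environment's values could be selected with terraform plan -var-file=dev.tfvars; in a hosted workspace, the equivalent values would be set on the workspace itself.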
If you have your CI integrated, when you make that PR change, you can run plan against everything that uses this plan and then you could see directly in your CIs like, "I expect prod to have no change. I expect stage to have 2 changes. I expect dev to have no changes because we've already done it and we're promoting it to stage." So you can integrate that.
Branches ultimately result in a bit less code, but you have longer-lived branches. Anybody familiar with GitFlow should probably use it as their framework; it's a little bit more customizable. Also, there's a real direct promotion. However, you start fighting drift, and I'm sure that nobody wants to do a cherry-pick merge.
They're very explicit. This is what the prod framework is, because prod may have, say, an Elasticsearch cluster that spans a much greater area than the single-instance Elasticsearch cluster in 1 zone.
Those differences sometimes cannot be covered with variables. You have lots of options. Naturally, there are a few others. For repeatability, my personal preference is to try and reuse the same code as much as possible and then question, "Why would one environment be different than another that would necessitate me changing my model?"
» Running with Terraform
Running is not so much about the code as it is about the people that you're interacting with, the teams that you're growing it to. Because you don't run alone, especially in a company of any size. You yourself cannot do the work, and you can't automate every facet of it.
We all love to, and try. I'm constantly looking to automate myself into my next position so that I no longer have to do what I've discovered is now remedial stuff.
Now we want to go from my team, my group, or maybe even my neighbor, to teams spanning geographies. I don't necessarily want to know when you want to create a Google network; just do it.
Collaboration becomes critical in that, if we're going to change a paradigm, we need to communicate. We don't always do that through code. We do that offline.
But when we submit the PRs to the code that's constantly reused around teams in the world, we communicate those changes to them and say, "I have this PR up and running, and of the 9 workspaces that use my network code, 4 of you are going to be adversely affected. We need to make this change. Please tell me how."
Governance becomes super-important. The transports of those security keys, the sensitive information in pieces. Things like your CI engine and secrets engine, implementing Vault, doing Terraform Enterprise, CircleCI, GitLab, etc, they become the place where your policy has to be crafted, ruled, modeled, and enforced.
Policy can cover things like running Terraform from someplace other than your own machine. Maybe you don’t have access to it, but you're in a high-compliance mode, and we only allow these changes to happen from the corporate office.
» Test-driven development
We want to validate before we commit. I want to see what the testing does. I want to see what Terratest outputs. I want to assert my changes and make sure that they work across all the workspaces that I'm going to. And when I commit a change, Sentinel will tell me, across all the workspaces that I'm going to, "Policy says you can't do $4,000 in infrastructure." Things like that.
The feedback needs to be immediate, because if multiple teams are going to be using your plans, your modules, or any sort of information, you need to make sure that you have that rapid feedback loop.
» What running looks like
It's quite large. You want to build, especially against your VCS, you want to send that to your workspace, make sure that your workspaces are receiving that change.
We implement this through Terraform so that we get that full lifecycle. Terraform is responsible for telling me what provider, what inputs, what region, what zone, what geography, etc. We do this with more than just 1 cloud, and we execute this with more than 1 geographically distributed team.
We do development in the UK, in the US, in a couple of different sites. When I have teams in the UK time zone trying to deploy stuff, I don't necessarily need to be bothered.
So we've covered the full gamut. With crawling, you want to investigate it. Then take a look at your units of scale as you walk. And then as you run, it's more about who you're collaborating with and how you're collaborating with them: making sure that your feedback loop comes in and that you have the appropriate guardrails that don't get in the way of other people.