Hear about the lessons from Indeed's Terraform-driven move to AWS in the aftermath of the 2021 Texas power outages.
Hi all. My name is Andrew Noonan. I'm from Indeed. Welcome to Everything, Everywhere, All at Once: Terraform, and Indeed's move to AWS. We'll dig into that title a little bit later.
First, for those of you that haven't seen any of our ever-present ads on all forms of media, Indeed is the world's top online resource for job seekers and employers to connect with one another. We're a leader in the HR tech space and a great place to work. In short, we help people get jobs. It's a great motivating principle to work towards, and like you probably heard from many other presenters here, we are hiring.
Who am I? Thank you for asking. That's so nice. I'm a long-time Indeedian focused on security and infrastructure and the co-chair of the Terraform Guild at Indeed. I've touched just about every piece of Indeed's infrastructure over those last 15 years — so, I like to say a lot of what our platform team is doing is fixing my mistakes.
To understand Indeed's infrastructure, time travel back to 2020, which feels like an eternity ago, but, believe it or not, was fairly recent. Indeed has been using AWS alongside managed service providers like Rackspace to provide the services on Indeed.com. These remote datacenters provide the horsepower for job seekers searching for a new employer, and the data is hosted in a co-location facility in Austin, Texas. It's a hub.
A while earlier, we had migrated one of these datacenters in Chicago to AWS's US-East-2 region in Ohio. And one of our business units, responsible mostly for the employer sites, was working on moving their services entirely to US-East-2 — away from Austin.
There was a growing interest in using Terraform to perform infrastructure operations. A few groups, especially our SRE organization, were already using Terraform to do some tasks during this migration. But this Terraform use was largely unorganized and focused on those employer sites.
In the meantime, some engineering management was proposing a broader move to AWS for all datacenters, including the central Austin DC. But the timeline was long, around five years for migration.
By the end of 2020, Indeed had brought in Terraform Enterprise to help centralize and organize the Terraform code floating around. But usage was mostly experimental, and the entire platform was managed by a single person — me. Then everything changed when the winter storm attacked.
That's right. Snowmageddon 2021. The winter storm that proved everything was bigger in Texas — including rolling blackouts. Faster than you can change an Avatar reference to Game of Thrones, Indeed leadership backed a plan to migrate critical services from the Austin datacenter to AWS. We called it “Winter is Coming.”
The idea was to move all of Indeed's critical services to AWS by the end of January 2022. Given the winter holidays and the fact that it was currently March of 2021, this gave a more realistic six to eight months to deliver what had been scoped as a three-to-five-year effort.
In addition, migrations of the remote datacenters to AWS were already starting, and Indeed's microservices approach meant many moves needed to happen in parallel. In other words, we had to move everything from everywhere all at once. You didn't think I'd get back to the title, did you? That's a callback — advanced writing.
I'll spoil the ending. We did meet the goal, though, about a month or so after our initial deadline. Scores of teams, hundreds of software engineers, and SREs who had barely even heard of Terraform were being asked to use it to deploy infrastructure into AWS, and many of them were not even familiar with AWS services, much less Terraform — so there was a lot of chaos and pressure.
Sometimes, we rallied brilliantly. Other times, we got the job done with duct tape and API calls. The rest of this presentation is a synthesis of the duct tape and diamonds — the what you should and shouldn't do. Your mileage may vary, but we put a lot of distance on Terraform and Terraform Enterprise along the way — so these are at least worth considering.
First off, create a guild. It doesn't have to be called a guild. But it should be a group of people that are responsible for the development of the language used and centralizing best practice and usage guides.
If you're running TFE or using TFC, it shouldn't be the exact same people as an operations team, as the guild represents the customers of Terraform to a certain extent. So, having those two teams completely overlapping isn't great.
Guild members should be deeply present in the formation of documentation, training, and the ongoing support channels. And there should definitely be support channels for the language (HCL).
This is something TFE forced us to do somewhat early on, but we should have gone further sooner. We needed answers to questions like: who is going to add or remove workspaces from TFE or TFC, or create access permissions?
We went with a self-service model where we used the TFE provider and some YAML to allow engineers to create their own workspaces and assign teams. This keeps the TFE management team from being a blocker but also has a git log to keep everyone honest and aware.
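A minimal sketch of that self-service pattern, assuming the `hashicorp/tfe` provider and a reviewed YAML file in git; the file layout, organization name, and attribute names here are illustrative, not Indeed's actual schema:

```hcl
# Engineers add an entry to workspaces.yaml in a reviewed repo; this config
# (run by the TFE management workspace) turns each entry into a workspace
# plus team access. All names here are hypothetical.
locals {
  workspaces = {
    for ws in yamldecode(file("${path.module}/workspaces.yaml")) :
    ws.name => ws
  }
}

resource "tfe_workspace" "self_service" {
  for_each     = local.workspaces
  name         = each.key
  organization = "example-org"          # assumed single organization
  description  = each.value.description
}

resource "tfe_team_access" "owner" {
  for_each     = local.workspaces
  team_id      = each.value.team_id     # owning team comes from the YAML entry
  workspace_id = tfe_workspace.self_service[each.key].id
  access       = "write"
}
```

Because the YAML lives in git, every workspace and permission change arrives as a merge request, which is the "git log to keep everyone honest" part.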
Will you create multiple organizations to segment your administrative space or keep it simple? For non-TFE users, how will you collect and structure your Terraform code repositories to make them discoverable? We went with a single organization in TFE for all mainstream uses and added automation to grant read-with-no-state access by default to most users.
Combined with the self-service option mentioned previously, this helped keep visibility high and workspace administration low, but at the cost of additional documentation and training for our users.
If you're not using TFE, what will be your state storage, and how will people get access to it? With TFE, this is straightforward. But if you're not using Terraform Enterprise or Cloud, or if you have other state stores, it's super important to keep track of where they are and consolidate them if possible.
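For the non-TFE case, consolidating on one well-known backend looks something like this minimal S3 sketch; the bucket, key layout, and lock table names are illustrative:

```hcl
# One shared, well-known state bucket with a predictable key layout makes
# state discoverable and auditable. Names below are hypothetical.
terraform {
  backend "s3" {
    bucket         = "example-terraform-state"     # single company-wide bucket
    key            = "teams/payments/app.tfstate"  # convention: teams/<team>/<app>
    region         = "us-east-2"
    dynamodb_table = "terraform-locks"             # state locking
    encrypt        = true                          # state can contain secrets
  }
}
```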
I'm sure you all know that there are security concerns with state files. But they're also extremely useful for observability of your predicted state — which can be used to help find resources that are not managed by Terraform. We push using TFE pretty hard, but even then, there are still exceptions that we need to keep track of.
This one is tough, and there's never one answer. But you have to be willing to give an answer, even if it is a difficult answer and a different one to the next person who asks. Using organizational lines or application dependencies can sometimes be an okay way to go, but that's rarely a complete answer.
I like to ask the question: if a workspace deleted all its resources, what would be the impact? If that impact is large, is there a subset of resources that, if you moved it to a separate workspace, would result in a smaller impact? Will moving those resources result in a meaningfully larger management burden? If I can reduce the blast radius without a larger management burden, it's time to split the workspace.
If you're using TFE, will you charge teams in some way for the workspaces that you create or for usage in TFC? This cuts both ways. You'd think that without some chargeback model, people would spend like crazy. But the opposite is just as common: I can't tell you the number of times I've had perfectly rational, highly capable engineers go down a path for weeks to save a few hundred dollars a year. Regardless of how you do your cost model, it's a good idea to be able to at least define the cost.
Usually, those engineers went down those paths because they didn't know it was only a few hundred dollars. Our self-service workspace creation makes it easy for engineers to not have to think too hard about cost, but it also lends itself to wasted workspaces.
So we have to clean up unutilized workspaces as we get closer to license renewal to make sure we're not wasting money. Someone, somewhere has to think about cost. It might as well be transparent.
We built automation that adds credentials into Vault and TFE for AWS accounts to make it easy for users to add credentials to their workspace. That's saved the team hours of work copying sensitive information. With some new Vault features, we're very soon going to be able to simplify that process even more.
Once we have a strong Vault integration, it will be very easy for us to move credential information from the infrastructure to the application configs. You can also look at doing something similar with AWS Secrets Manager, but we're already using Vault.
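One hedged sketch of that direction, using the Vault provider's AWS secrets engine data source to hand Terraform short-lived credentials instead of copied static keys; the mount path and role name are assumptions:

```hcl
# Pull short-lived AWS credentials from Vault's AWS secrets engine at plan
# time, so no static keys are copied into workspace variables.
data "vault_aws_access_credentials" "deploy" {
  backend = "aws"        # Vault AWS secrets engine mount (assumed path)
  role    = "tfe-deploy" # hypothetical Vault role with deploy permissions
}

provider "aws" {
  region     = "us-east-2"
  access_key = data.vault_aws_access_credentials.deploy.access_key
  secret_key = data.vault_aws_access_credentials.deploy.secret_key
}
```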
One superpower of infrastructure as code is broader access to infrastructure by more people. One weakness of infrastructure as code is broader access to infrastructure by more people.
Who's writing your Terraform code? For us, it's a little bit of everyone, and they all care about different things. The developers like to know quickly if a message applies to them. The ops folks like to know details about the changes, and the SREs often want to know why the change was required.
These are all generalizations, of course, but the point is you may need to craft your message format to address all these different people. To do that, you really need to know your audience and what they care about.
Is something like Slack good enough for a critical message? If you send an email, does anyone read it? Whatever you choose, you need to set expectations with your audience that critical messages will come from that path.
Probably your company has some mechanism that you use to communicate library and software updates. You need to try and use those same channels if they already exist. Same thing with governance. If your security team typically sends critical messages out, make sure they have a seat at the guild table, and use those same processes to inform people about Terraform things.
Operations teams typically use their own channels to inform, like a maintenance calendar or an email. This is over-generalizing a bit again, but because these channels are purely informative — we're doing this maintenance, we're reporting this outage, etc. — there's not a lot you're expecting the audience to do, so they tend to be more passively structured.
Much of Terraform communication, especially with a cloud provider, will be more active: we're deprecating this version, please update your code; we're announcing this feature, please use it; and so on.
You should treat Terraform communications as you would code or library changes and integrate them into development channels, which tend to have more active requirements of a listener instead of operations ones — or duplicate those channels if necessary. That’s not to say that you won't have a need for passive messages, but it's better to treat Terraform changes like code.
Also, it's a two-way street. How do users ask questions? At Indeed, we have two major Slack channels:
The Terraform Guild channel, with guild communiqués and questions about process or standards
The Terraform channel, where people ask operational questions: how to do things, errors they're hitting, etc.
Having that two-way communications channel has been great, and the guild members are encouraged to help out and have open discussions. It's a lively channel with many new threads every single day.
Next is my favorite section. You're not supposed to have favorites, but in my heart, I know it's true. Make standards to do things. Create repos with standard code that uses your modules. This is a great starting place for new users who otherwise might stare at a blank screen for an hour getting up the courage to start.
Write those standard modules. These are the key to so much more. Create a standard for how you want modules laid out, how variables are named, and what the workspace naming conventions are.
Having standards for standards — while boring sounding — is how you get other people to do work for you that is still high quality and interoperable. If everyone invented their own network protocols, we wouldn't have the internet today, and then where would you get your cat pics?
In Terraform Enterprise, the module registry is your secret sauce. You should do your best to work with teams to create modules for just about everything basic. This could be as simple as an S3 bucket module that literally just takes the same inputs as the S3 resource and the AWS provider — it's like a sidecar app in a service mesh.
Once people are using that interface, you can expand what that module does. If you need to promote standardized monitoring of all your S3 buckets, just modify the module to add the monitoring. If you need to make sure that the naming convention for the bucket matches the standard, it sounds like you just need input validation on the variable for the module.
All your users need to do is to update the module, and they get free improvements. Otherwise, all these things would have required co-changes directly in their workspaces. This way, the guild only needs to monitor the versions of the modules used and encourage updates, but more on that later.
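A sketch of that thin-wrapper idea for S3, with a naming-convention check baked in; the `acme-` prefix rule and variable names are illustrative, not Indeed's actual standard:

```hcl
# A thin S3 wrapper module. It starts as little more than a pass-through,
# so teams adopt the interface cheaply; standards accrete here later.
variable "bucket_name" {
  type        = string
  description = "Bucket name; must follow the company convention."

  validation {
    condition     = can(regex("^acme-[a-z0-9-]+$", var.bucket_name))
    error_message = "Bucket names must look like acme-<team>-<purpose>."
  }
}

resource "aws_s3_bucket" "this" {
  bucket = var.bucket_name
}

# Later, the guild can add standard monitoring or tagging right here, and
# every consumer picks it up with a simple module version bump.
```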
You also need to standardize your code deployment process. This doesn't have to be a single path, but it shouldn't be just any path. And as much as possible, it should be a path the engineers are already familiar with.
Maybe you have some Terraform Cloud and some GitLab CI, depending on the use case. While most Indeed workspaces were executed from TFE, we have a growing number that execute from GitLab CI using TFE primarily as a state store and access control.
We're pushing more people towards that pattern to make it easy to add steps like linting and static analysis as a requirement for code deploy. The point is visibility and auditability, which brings us to governance.
Many of the new users to Terraform may be software engineers who never had to take part in audits or deal with security concerns that are unique to infrastructure. Things that a systems engineer — or a network engineer — consider common sense may not be something a software engineer writing Terraform knows about.
Working with a security team early to create those requirements means you can bake them into your modules — or in a Sentinel policy — before things get out of hand. Figure out how to create differentiators that divide things into compliant and non-compliant with your requirements easily.
Naming conventions, a certain tag, use of a certain container, etc. These may not add specific value to the resource itself, but they do make it super easy to figure out who is doing the right thing and who needs help. Back in the standards section, I encouraged modules as wrappers for common resources. Those wrappers are a great place to add those in.
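For the tag differentiator specifically, the AWS provider's `default_tags` block is a cheap way to stamp every resource a workspace creates; the tag keys here are illustrative:

```hcl
# default_tags applies these tags to every taggable resource the provider
# creates, so "compliant" resources become trivially queryable. Tag names
# below are hypothetical examples of a company convention.
provider "aws" {
  region = "us-east-2"

  default_tags {
    tags = {
      "acme:owner"     = "platform-team"
      "acme:workspace" = "shared-infra"
      "managed-by"     = "terraform"
    }
  }
}
```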
Also in this same space: third-party libraries — in Terraform-speak, providers and modules. You're basically letting someone else's code into your infrastructure, so you need to figure out what level of review or requirements should surround their use. But how do you know who is using what?
Boom. Metrics. It's another solid transition. You need metrics to be able to answer questions like what percentage of the environment is managed by Terraform? This lets you know what is missing and where you should spend your time improving things. Or what people or teams are making which changes — which lets you know who your major customers are.
You can ask them what needs to improve, or focus training on their departments. Metrics also tell you which versions of which providers or modules are being used — which helps you avoid version debt from old libraries or modules and confirms the standards are still being followed.
The state files are a great source of information for many things. Going back to standardization: with standard places to find state, plus the .terraform.lock.hcl file checked in with each repository's code, you have a standard way to answer questions like when a workspace last changed (from metadata about the state file itself) and which providers are in use, at which versions (from the lock file).
Last time I checked, the module versions aren't stored in a lock file. But one approach to that is to standardize the output, or at least part of the output of your official modules to include the module name and version as outputs. Then you encourage your users to pull those values into the workspace outputs, where your automation can pick them up.
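A sketch of that convention: the official module reports its own identity, and the consuming workspace republishes it where automation can scrape it. Module names and versions below are illustrative:

```hcl
# Inside the official module (e.g. modules/s3-bucket/outputs.tf):
output "module_name" {
  value = "s3-bucket"
}

output "module_version" {
  value = "2.3.1" # bumped on each module release
}

# In the consuming workspace: republish the module's identity as a
# workspace output, so metrics automation can read it from the state.
output "modules_in_use" {
  value = {
    (module.bucket.module_name) = module.bucket.module_version
  }
}
```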
It's not bulletproof, and it would be better to have that in the .terraform.lock.hcl files or natively in the state files, but until that's integrated into Terraform, something is better than nothing. At Indeed, we were late to the standardization party, and with the sudden rush of workspaces from Winter is Coming, even our great observability team building metrics from the state files is limited in what it can glean without that consistency. There's a need to go back and fix things, which is always harder once the genie is out of the bottle.
This leads us to the finger-waving part of the presentation; things to avoid. These are traps that are easy to fall into, either because of the earlier advice or because they are things that our rapidly growing platform tends to collect.
Let's start with a Terraform-specific thing: Terraform version debt. It's awesome that HashiCorp releases regular versions of Terraform, and that these versions can add up to significant improvements in speed and new functionality.
But when you begin to bring in standard modules that are written by other people, and those modules have providers or other version requirements, you can get into a situation where you have impossible version requirements — where one module requires a version, say, less than 1.0, and another requires versions greater than 1.0.
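Concretely, an impossible requirement looks like this (the provider versions are illustrative):

```hcl
# In module A's versions.tf — pins the AWS provider before 4.0:
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "< 4.0"
    }
  }
}

# In module B's versions.tf, consumed by the same root configuration —
# requires 4.x or later:
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = ">= 4.0"
    }
  }
}

# terraform init must satisfy the intersection of all constraints, and
# "< 4.0" ∩ ">= 4.0" is empty, so init fails for the root module.
```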
The basic solution to this is obvious: To have the guild mandate the supported versions for the company. But in execution and reality, it can be really tricky. If you don't have metrics on the requirements of the modules, how do you know where you get into trouble?
If you don't track module and workspace ownership, how do you know who to talk to? If you don't have a proven communication strategy, how do you know those people are listening when you ask them to update the code? What happens if people don't update those modules?
Without nailing these things down, you can very quickly get to a sprawling set of versions that represents a bad experience for Terraform users. But on the flip side, updating Terraform versions is an easy change that is fairly low risk. And when communicated well, it's a fast way to get the community engaged and feeling helpful.
TFE can help from a workspace standpoint, since you can check which versions are set on the workspaces, but that's just a first step. Establish a cadence for evaluating the version range and work with people to keep code within that range — and be prepared to get your hands dirty and help them out. And always be publicly happy and supportive when you finally remove the old version.
Another thing to avoid is getting too rigid on your infrastructure management path. You want to establish a paved path that the majority of your users will go down — your easy button, if you will.
But one size rarely fits all. As an example, Indeed has been exploring the use of cloud desktops as a replacement for development desktops, which we would issue to engineers alongside their laptop.
When people started to work from home, these instances were much easier to use compared to the desktops that were stuck in the offices — and we're still working on improving and modifying them to see if they'll keep engineers happy. But deploying and managing the systems using TFE was really clumsy.
The number of resources in a single workspace quickly made the creation of new instances take several minutes while the state was refreshed, and the memory requirements just kept growing.
We started doing targeted runs to speed things up, but what we really wanted — a single state per instance in TFE — wasn't practical for administrative and licensing reasons.
Because these and other cases were discussed with guild members, the guild has been instrumental in developing alternative paths, and in paving them — like using Terragrunt and GitLab CI to manage those more complex use cases.
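A hedged sketch of that Terragrunt path, with one small state per desktop instead of hundreds of instances in one giant workspace; the repo URL, bucket, and inputs are all hypothetical:

```hcl
# desktops/anoonan/terragrunt.hcl — one folder (and one small state file)
# per desktop, all generated from the same module. Names are illustrative.
terraform {
  source = "git::https://gitlab.example.com/platform/cloud-desktop.git//module?ref=v1.4.0"
}

remote_state {
  backend = "s3"
  config = {
    bucket = "example-terraform-state"
    key    = "cloud-desktops/anoonan/terraform.tfstate" # per-instance state
    region = "us-east-2"
  }
}

inputs = {
  owner         = "anoonan"
  instance_type = "m5.xlarge"
}
```

Because each desktop has its own tiny state, a plan only refreshes that one instance's resources instead of the whole fleet.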
You don't want to have an unlimited number of paths, of course, but if the guild wants to push back on a certain use case, you also have to be prepared to help them adapt their solution to some existing approach.
Indeed made a very large all-in-one code base that represented all the basic infrastructure needed to build a datacenter. It was glorious how, in one fell swoop, all the systems we needed for a DC were provisioned, but two icky things happened over time.
One: the architect at the center of this code moved on to another company, so we lost a lot of institutional knowledge about how everything was put together — which wouldn't have been too bad on its own. But because the blast radius of the code base was so large, new users to the workspace ended up feeling very cautious about making changes, making the workspace brittle and slow to refactor.
We should have created more workspaces that managed individual or closely related components and used outputs to expose information about those resources. Then, rather than expecting other workspaces to consume those outputs directly, we should have written modules that consume those outputs and expose the details to the rest of the workspace.
This technique gives a layer of abstraction, which allows the outputs from the other workspaces to change and update, with the modules acting as a translation layer so that the output feels stable to the module user.
We do this with a module called Subnet Sync, which helps copy AWS tags from a shared VPC account into the local AWS account subnets. The local workspace can discover and use shared subnets through tags instead of hardcoding the IDs. But the user doesn't have to dig through the output of the shared account Terraform, which is operated by a completely separate network team.
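The discovery side of a pattern like Subnet Sync can be sketched with the `aws_subnets` data source, looking up shared subnets by tag instead of hardcoded IDs; the tag keys and variables are illustrative, not the actual Subnet Sync interface:

```hcl
variable "shared_vpc_id" {
  type        = string
  description = "ID of the shared VPC owned by the network team."
}

# Find the shared VPC's private subnets by tag rather than by hardcoded ID,
# so the network team can change subnets without breaking consumers.
data "aws_subnets" "shared_private" {
  filter {
    name   = "vpc-id"
    values = [var.shared_vpc_id]
  }

  tags = {
    "network:tier" = "private" # illustrative tag copied in by the sync module
  }
}

output "shared_private_subnet_ids" {
  value = data.aws_subnets.shared_private.ids
}
```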
With all this talk of a guild and ownership, it's tempting to make declarations that everything must pass through the guild or keep all infrastructure changes entirely within existing infrastructure teams, but this is a mistake.
Tools like Terraform provide a path that connects the Terraform user more closely to the infrastructure, and that should be embraced. Use of standard modules, compliance checks, and so forth can help meet the regulatory and security requirements your company may have and guide the Terraform user toward the choices you want them to make — without a centralized team reading or writing every single line of code.
When you over-centralize, you also create points of failure with what is typically a smaller staff. And that smaller staff tends to write larger, more complex code when they believe that they are the primary user of that code.
This, in turn, leads to fewer composable modules and a larger hill to climb for co-workers who are trying to understand how things work. This group is often seen as a blocker, not an enabler. And with the cloud on the other side of the hill, people start looking for ways around the path if all roads lead through you.
Instead, use Terraform itself and other mechanisms to perform common tasks or create templates and figure out how you can make those self-service. Do these things very publicly. It will take a little longer — but use your communication channels to let people know when you are deciding on process. Even if they don't join, people feel better when they've been invited and are more likely to follow the resulting guidance.
Lastly, like with the governance of the infrastructure itself, you're going to want to create guidelines for the language and use linters and repository structure to help people write better code themselves.
There's already a preferred module for an S3 bucket. Your deploy process could let the developer know, hey, you should use this module instead of this resource. Time spent on those systems decreases the time and attention needed on code reviews and allows those reviews to be done by a broader, less specialized group — reducing the impression of centralized control.
In the end, we did complete the Winter is Coming goal, and it was an incredible goal, to begin with. Along the way, Terraform went from a cool tool to play around with to a critical component of everyday life at Indeed.
There are over 500 people in our Terraform channel asking and answering questions on a daily basis. The guild works hard to make Terraform better for everyone, and every day, thanks to Terraform, we get closer to being able to see and manage everything, everywhere, all at once.
I hope our experiences were helpful to you as well. Thank you for your time.