Learn how Wayfair built their internal developer platform around Terraform Enterprise and Google Cloud.
Good afternoon. My name is Nicholas Philbrook. I'm a staff engineer at Wayfair on our public cloud team, and my talk today is about how we've used Terraform Enterprise to transform how our engineering teams provision and manage cloud infrastructure.
First, a little about my employer, Wayfair: We're a home goods e-commerce company. We like to joke sometimes that Wayfair is "Big Couch." We do sell a lot of couches, and we sell lots of other things, too. But I wanted to draw your attention to a number on the far right of this slide: over 2,000 engineers. These numbers are accurate as of the end of calendar year 2022.
We're a medium-sized company. Our website handles between 20,000-30,000 requests per second on an average day. On big days like Cyber Monday, other promotional sales, we'll see spikes up to 50,000-60,000 requests per second. Wayfair started out in traditional data centers, but with such a bursty workload, the cloud was the natural destination for us.
We started our cloud migration back in 2017, and we completed it at the end of Q3 2022. It was definitely a lot of work. That's not what I came here to talk about today, but we did get some good press when we shut off the lights on our last datacenter. Google Cloud did a nice case study writeup on our website.
We used Terraform Community from the very start to manage our cloud infrastructure. To manage our Terraform, we had a single repository and wrote our continuous integration using GitHub's workflow with the state file stored in a Consul KV backend.
I'm sure a lot of you are familiar with Atlantis or have used Atlantis. We actually built something very similar about the same time that Atlantis was being developed as well. I wish it had been a little more mature. It could have saved us a lot of time, but we did a similar thing using Jenkins.
This system has worked well for us throughout the years. It saw a lot of use and adoption. You can see from the slide there — over 27,000 state files, 57,000 commits over six years, and only a few outages. I pulled those numbers about two months ago, and it's still growing.
Back to those 2,000 engineers: they're not all writing Terraform code. In fact, in the early days, a lot of the committers to the Terraform repository were people like you and me, platform engineers, infrastructure engineers, SRE types, working with the development teams to provision their buckets and compute to run their applications.
But as the engineering organization grew, we started to have a lot of software teams writing their own Terraform code, and this started to generate a large pull request burden.
In calendar year 2022, I reviewed over 700 Terraform PRs, and that was only through early November. I don't mention that as a badge of honor, just a data point. I'm sure some of you in the audience can beat me on that number. But we were starting to spend a lot of time reviewing PRs.
In the early days, every single piece of code in this repository was run as a single GCP service account. There was a lot of risk there, and we needed to be diligent with those reviews. Later on, we did add the capability for multiple service accounts for different root modules. We combined that with GitHub's CODEOWNERS functionality so that we reduced the PR burden a little bit, but it still wasn't scalable.
We had to provision the service accounts manually using Terraform, of course. We got this big JSON mapping of directories to Vault paths with service account keys, so it made it a little bit better, but it wasn't great.
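To make that mapping concrete, here is a minimal Python sketch of how a CI job might resolve a root module directory to the Vault path holding its service account key. The directory names, the Vault paths, and the "most specific parent directory wins" rule are all assumptions for illustration, not the actual configuration described in the talk.

```python
import json
from typing import Dict, Optional

# Hypothetical mapping of repository directories to Vault paths.
MAPPING_JSON = """
{
  "teams/storefront": "secret/gcp/storefront-terraform",
  "teams/storefront/search": "secret/gcp/search-terraform",
  "teams/logistics": "secret/gcp/logistics-terraform"
}
"""


def vault_path_for(directory: str, mapping: Dict[str, str]) -> Optional[str]:
    """Return the Vault path of the most specific mapped parent directory."""
    parts = directory.strip("/").split("/")
    # Walk from the full path up toward the repository root.
    for end in range(len(parts), 0, -1):
        candidate = "/".join(parts[:end])
        if candidate in mapping:
            return mapping[candidate]
    return None  # No service account is mapped for this directory.


mapping = json.loads(MAPPING_JSON)
print(vault_path_for("teams/storefront/search/indexer", mapping))
```

A lookup like this keeps the credentials decision in one reviewed file, but every new team still needs a manual entry, which is the scaling problem described above.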
There was also a maintenance overhead with this continuous integration. We wrote it for Terraform 0.11 initially, and we kept it upgraded for a couple of versions. But every time we upgraded the Terraform core version, more maintenance had to be done. For example, 0.13 changed how Terraform downloaded and installed providers, and that was a big change. Changes to the plan output format were another, because we were parsing the plan and posting comments back to the pull requests. Anyway, we needed something better.
We started building a Terraform module to provision a collection of Google Cloud projects, including development/production separation in a Google folder structure, along with service accounts that were highly permissioned but only within those folders.
A brief detour here because I know there's another very popular cloud provider out there. Google Cloud has a very rich hierarchical folder structure, and you can nest projects under folders, resources under projects. And you can set IAM permissions at the folder level and it inherits down. It works really well.
I haven't worked on Amazon Web Services myself in quite a while. But, I did check in with a few of my coworkers who have more recent experience. They wanted to let me know that the concept of a GCP folder is roughly analogous to an AWS organizational unit. GCP project, roughly analogous with an AWS account. That's a rough analogy. Please, don't attack me in the hub afterwards with all the differences. Hopefully, that helps paint a blurry picture if you haven't worked with Google Cloud.
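The inheritance behavior from that detour can be sketched in a few lines of Python. This is a deliberately simplified model, assuming permissions are only ever additive as you walk up the tree (real Google Cloud IAM also has deny policies and conditions). The node names and the `sa-terraform` principal are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class Node:
    """A folder or project in a simplified resource hierarchy."""
    name: str
    parent: Optional["Node"] = None
    # Roles granted directly on this node, keyed by principal.
    bindings: Dict[str, Set[str]] = field(default_factory=dict)


def effective_roles(node: Node, principal: str) -> Set[str]:
    """Union of the roles granted on the node and on every ancestor."""
    roles: Set[str] = set()
    current: Optional[Node] = node
    while current is not None:
        roles |= current.bindings.get(principal, set())
        current = current.parent
    return roles


# A team folder with a nested prod folder and project, plus an unrelated folder.
team_folder = Node("team-folder", bindings={"sa-terraform": {"roles/editor"}})
prod_folder = Node("prod", parent=team_folder)
prod_project = Node(
    "team-prod-project",
    parent=prod_folder,
    bindings={"sa-terraform": {"roles/storage.admin"}},
)
other_folder = Node("other-team-folder")

print(sorted(effective_roles(prod_project, "sa-terraform")))
```

The service account picks up the folder-level grant inside its own tree and has nothing in the other team's folder, which is the "highly permissioned but only within those folders" property the module relies on.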
The new module also created some basic foundational elements, like default view access for the team requesting the environment and default monitoring integration with our monitoring vendor, Datadog. All the Google Cloud services that we support are enabled out of the box. We also have the concept of a sandbox project, where human accounts have elevated permissions. They can click anything they want on the web console to prototype.
Of course, for development and production we enforced ‘everything goes through infrastructure as code’ using Terraform. But, we did have a rich appetite for this ability to rapidly prototype in the sandbox, so we enabled that with separation from the production and development networks and environments.
For our initial proof of concept, we used this module in our monorepo, and it worked well for proving out the concept, but provisioning was still fairly complex. We had to have someone like me or someone on my team instantiate this module. We had to set up the code paths, that JSON file, and the code owners I mentioned earlier. So it proved the concept. But, next, enter Terraform Enterprise.
Prior to this project, we had had some experience with Terraform Enterprise, but we were mostly using it for the API on that last project. We hadn't rolled it out to a wide audience of users yet. But we had seen how slick it was as an execution platform for Terraform code. It was way better than what we had developed on our own, which you would expect coming from HashiCorp, the developers of Terraform.
Since we were making a clean break in how our teams were managing and provisioning their Google Cloud projects, we took this opportunity to start using smaller, decoupled Terraform repositories — separate from the monorepo — and gave each team their own set of Terraform Enterprise workspaces integrated with those repositories.
To accomplish this, we used the Terraform Enterprise provider to provision a Terraform Enterprise organization as part of our new module, along with useful things like Terraform Enterprise teams, variable sets for dev and prod, variables to populate those variable sets, and a default global Sentinel policy set. For example, their Google credentials are pre-configured in an environment variable, so they don't have to worry about that.
From a tactical perspective, we have one admin organization on our Terraform Enterprise instance that instantiates our awesome new module. End users don't have access to this admin organization, so it's where we enforce things like that default global Sentinel policy set.
Once our module provisions the organization, we want to hand over the keys to the users. We made a strategic decision here to not provision Terraform Enterprise workspaces as part of this module because we wanted the users to have fairly broad control over the workspaces. But we still wanted to provide them with some useful automation and a working starting point.
To accomplish this at the workspace level, we also built a Buildkite plugin that will translate YAML configuration into Terraform Enterprise workspaces. Teams can manage things like the Terraform core version, whether they want to auto-apply runs, stuff like that. This YAML is stored alongside their Terraform code in their decoupled repository. These decoupled repositories are generated from a template we maintain, so users have a working Terraform repository integrated with an Enterprise workspace right from the start.
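As a rough sketch of that translation step, the Python below turns an already-parsed configuration into a request body for the Terraform Enterprise workspaces API. The configuration keys are hypothetical, not the plugin's real schema, and the attribute names (`terraform-version`, `auto-apply`, `working-directory`) reflect my understanding of the workspaces API, so check them against the API documentation before relying on them.

```python
import json

# Hypothetical result of parsing a team's YAML file; key names are illustrative.
workspace_config = {
    "name": "checkout-dev",
    "terraform_version": "1.5.7",
    "auto_apply": True,
    "working_directory": "environments/dev",
}


def to_workspace_payload(config: dict) -> dict:
    """Build a JSON:API style body for creating or updating a workspace."""
    return {
        "data": {
            "type": "workspaces",
            "attributes": {
                "name": config["name"],
                "terraform-version": config["terraform_version"],
                # Default to manual applies unless a team opts in.
                "auto-apply": config.get("auto_apply", False),
                "working-directory": config.get("working_directory", ""),
            },
        }
    }


payload = to_workspace_payload(workspace_config)
print(json.dumps(payload, indent=2))
```

Keeping the settings in a file next to the Terraform code means a workspace change goes through the same pull request review as the infrastructure it runs.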
We have one last piece of the puzzle: to make this thing truly self-service. We have this admin organization where we restrict user access, so the users can't come in and instantiate it themselves.
At this point, we were able to build on some existing tooling at Wayfair for providing self-service interfaces to infrastructure. This is built on top of Backstage, which is Spotify's open source developer portal. I'm not going to go into all of the technical details on the middle part here because we'd run out of time. But based on some excellent engineering work that had already been done at Wayfair and plugged into Backstage, we were able to expose a web form.
Users fill in some form inputs, submit, and we issue Terraform Enterprise API calls to create this new workspace in our admin organization using that module repository with variables set to the correct values based on the inputs.
And boom, they're off and running. Putting all the pieces together from the last few slides, you can see a high-level view of the interactions here. This existing custom tooling also supports CRUD operations.
So, if the user needs to go back because they forgot something, they can add a new team, add a new service. Or if they have decommissioned the project, self-service delete is in there, too. We refer to this collection of projects and folders — along with the source repository and Terraform Enterprise organization — as a decoupled infrastructure environment.
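The step where form inputs become workspace variables might look something like the sketch below. It only builds the request bodies and makes no network calls. The default values and form field names are hypothetical, and the `vars` payload shape reflects my understanding of the Terraform Enterprise workspace variables API rather than the actual tooling.

```python
from typing import Dict, List

# Hypothetical defaults applied when the form leaves a field unset.
DEFAULTS: Dict[str, str] = {"region": "us-east4", "enable_sandbox": "true"}


def build_variable_payloads(form_inputs: Dict[str, str]) -> List[dict]:
    """Merge form inputs over defaults and emit one request body per variable."""
    merged = {**DEFAULTS, **form_inputs}  # Form inputs win over defaults.
    return [
        {
            "data": {
                "type": "vars",
                "attributes": {
                    "key": key,
                    "value": value,
                    "category": "terraform",
                    "hcl": False,
                    "sensitive": False,
                },
            }
        }
        for key, value in sorted(merged.items())
    ]


payloads = build_variable_payloads(
    {"environment_name": "checkout", "region": "europe-west1"}
)
for body in payloads:
    attrs = body["data"]["attributes"]
    print(attrs["key"], "=", attrs["value"])
```

Because the merge is deterministic, resubmitting the form with one changed field produces the same variables plus that one difference, which is what makes self-service updates safe to repeat.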
First, I'm entering the environment name here. These next two fields are some ownership metadata and some geographical metadata. In this field, I'm adding access for my team and for the Buildkite pipeline that's going to manage my workspaces.
From a user's perspective, they're going to see this form until it completes, but since I'm demonstrating the backend, we're going to pop over to our admin organization and filter the workspaces by name. We can see we've got a plan running. We're going to pop over to the variables tab real quick, and we've got a bunch of variables here. It's a combination of the form inputs along with defaults.
Going back to this run, this whole thing takes about 15 minutes to plan and apply, so obviously, I edited a bunch of time out of the video. Here, we have a successful plan, 651 resources to create. I should have edited it down a little bit tighter, probably. Here we have a green apply.
We're going to look at the decoupled infrastructure repository next. We have a recommended directory structure here. If we look at the Buildkite configuration, we'll see that YAML file I mentioned earlier that configures the Terraform Enterprise workspaces.
Now we're going to look at that first Buildkite run that was triggered, and we have links to our newly created workspaces. I'm going to click on one of those, and we got a 404. We're going to come back to that in a little bit to see what happened there. But let's just try that again. Here, we have one of our user workspaces in the user organization along with, again, an automatically generated first run.
Wayfair has been a power user of HashiCorp Vault Enterprise for several years now. Over the years, we've built up some custom metadata and distribution tooling on top of Vault, and we've combined that along with Terraform Enterprise's workload identity feature to enable bidirectional Vault integration.
This means teams can access their existing Vault secrets using a Vault KV secret data resource. They can write new secrets using a Vault KV secret resource. This is a great showcase of HashiCorp's products all working together.
I previously mentioned a global Sentinel policy set. We've written a bunch of Sentinel policies — a combination of advisory, soft mandatory, hard mandatory — mostly to guide folks away from dangerous configurations we've discovered over the years. But also to enforce things like your production workspace can only run off of code that's been merged to the main branch. In the development workspace, you can prototype on feature branches.
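Real policies are written in the Sentinel language, not Python, but the way the three enforcement levels interact can be modeled in a short Python sketch. The policy name and the shape of the `run` dictionary are hypothetical. The semantics assumed here are that advisory failures only warn, soft mandatory failures block unless an authorized user overrides them, and hard mandatory failures always block.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Policy:
    name: str
    level: str  # "advisory", "soft-mandatory", or "hard-mandatory"
    check: Callable[[dict], bool]


def evaluate(policies: List[Policy], run: dict, override: bool = False) -> bool:
    """Return True if the run may proceed to apply."""
    for policy in policies:
        if policy.check(run):
            continue
        if policy.level == "advisory":
            print(f"warning: {policy.name} failed")
        elif policy.level == "soft-mandatory" and override:
            print(f"overridden: {policy.name}")
        else:
            # Hard mandatory, or soft mandatory without an override.
            return False
    return True


def prod_runs_from_main(run: dict) -> bool:
    """Production may only run code from main; development may use any branch."""
    return run["environment"] != "production" or run["branch"] == "main"


policies = [Policy("prod-runs-from-main", "hard-mandatory", prod_runs_from_main)]

print(evaluate(policies, {"environment": "production", "branch": "main"}))
print(evaluate(policies, {"environment": "development", "branch": "feature/x"}))
```

Making the branch rule hard mandatory means not even an override lets a feature branch reach production, while the same check passes trivially for development workspaces.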
We've also given teams the ability to elevate their human account permissions in case of emergency. Infrastructure as code is always the default. But in an outage situation, sometimes you need to take the most expedient route of clicking on the web console to fix something. So we built that capability into the system along with an audit processor that tracks these manual changes for SOX compliance.
So, how's it going? We piloted this product with a few teams early this year and then launched it for general availability in April, and we've seen a phenomenal rate of adoption. Some of that has been forced adoption, as we've mandated that, for certain cloud services, you have to go through this new provisioning workflow to access them. That includes a lot of new cloud-native stuff like Cloud Run, Spanner, Eventarc, many others only available in this new decoupled framework.
There have been some bumps in the road. We've definitely experienced a number of learning opportunities and hurdles along the way, which I wanted to share with you next. I'm going to start with some of the smaller, more tactical issues.
We manage Terraform Enterprise team memberships via an integration with an external identity provider. This is great because it lets teams manage their access with existing tooling we have. But if you've used this integration at all, you probably know that synchronization only happens when a user logs into Terraform Enterprise and that external identity provider sends a payload with their team memberships.
So, we have this fun interaction with our teams very frequently. They're logged into Terraform Enterprise already, they go to Backstage, they provision their new organization, they go back to Terraform Enterprise, and they can't see it because their team membership hasn't been updated yet. This is what happened on my demo earlier. We've gotten good at answering this question quickly. You just have to log out of Terraform Enterprise and log back in.
Another general challenge is this module manages a lot of resources, at least in my experience. Depending on the combinations of inputs, we usually manage between 600 and 700 different resources, most of them in the Google provider. The initial run to provision takes, like I said before, 10 or 15 minutes. A lot of times, our users get impatient. They think it's broken or hung, and it's just the normal amount of time it takes.
In the early days, we had to request several quota increases for the GCP project that hosts the service account running these operations. Some resources (for example, the Google project service resource that enables a cloud service in a project) just take a lot longer to create and read than you would expect. It's not just the create time. It's also the read time, so when it's refreshing the state on updates, it also takes a while.
Another challenge has been building a module and system that's resilient to the occasional platform outage or interruption. When you're managing hundreds of resources in this long dependency chain, a single API error right in the middle can wreak havoc. Terraform strives to be idempotent, so usually, a retry will work. But we've hit on some fun and interesting edge cases.
One incident resulted in a large number of tainted Google project resources. Of course, we all know that when a resource is tainted, on the next run, Terraform is going to try to delete and recreate that resource. If you're familiar with the Google Cloud project lifecycle at all, you might see the problem here. When you delete a Google project, it goes into a delete requested state, and then you can't use that project ID for another 30 days. So, that recreate is just never going to work unless you wait 30 days, and even then, sometimes it doesn't work.
So, we had to go through and clean up a bunch of these manually and ask our users to retry with a new project ID, which is not the best interaction. We were able to mitigate this while the incident was ongoing with a change to the parameters. If you're curious about that one, hit me up after the talk. It's very specific to Google Cloud on this resource, so I don't want to spend too much time on it.
Another Google-specific challenge was one of GCP's hard platform limits, where a project's IAM policy can only have 1,500 principals. A principal is either a service account, a user, or a group. So, despite this project's emphasis on decoupling, we still do have a few centralized projects where we store things like VM images and container images for common infrastructure components. And as part of the provisioning, we were granting access to these centralized projects to every decoupled Terraform service account. At a certain point, we hit that 1,500 principal limit, and all the new provisioning ground to a halt.
Thankfully, this one had an easy solution — we just hadn't needed to explore yet — which was to manage those Terraform service accounts as part of a Google group and then manage the permissions on that group.
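The arithmetic behind that fix is simple enough to show directly. This is only a counting sketch: the 1,500 figure is the limit quoted above, while the service account and other-principal counts are made-up numbers for illustration.

```python
PRINCIPAL_LIMIT = 1500  # Per-policy limit quoted in the talk.


def principals_with_direct_grants(service_accounts: int, other_principals: int) -> int:
    """Each service account appears as its own principal in the policy."""
    return service_accounts + other_principals


def principals_with_group(service_accounts: int, other_principals: int) -> int:
    """All service accounts sit behind one group principal."""
    group = 1 if service_accounts > 0 else 0
    return group + other_principals


# Hypothetical counts: 1,600 decoupled service accounts plus 20 other principals.
direct = principals_with_direct_grants(1600, 20)
grouped = principals_with_group(1600, 20)

print(direct, direct <= PRINCIPAL_LIMIT)
print(grouped, grouped <= PRINCIPAL_LIMIT)
```

With the group in place, adding another environment changes group membership rather than the central project's IAM policy, so the policy size stops growing with the number of environments.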
Unfortunately, this module predated the Terraform Enterprise feature of projects, so we only had organizations and workspaces to work with, in terms of hierarchy and access. In the early days of this project, we were talking with our HashiCorp account team and the topic of how many organizations we had came up. I think we had about 50 at the time, which seemed like a lot. We mentioned we're probably going to have close to 1,000 by the time this is all said and done, and you saw at the top we're between 600 and 700 now.
When we said over 1,000, we got a lot of blank stares and raised eyebrows during that conversation. But a concrete result of this was that we had some performance issues with the Terraform Enterprise interface that we were able to track down to admins like myself, my teammates.
We were members of all of these organizations, and we were hitting 10 or 15-second page load times on certain pages on Terraform Enterprise. Thankfully, we were able to work with our account team, HashiCorp support. The Terraform Enterprise team rolled out a patch fix to fix that one, so thank you for working on that. If anyone's in the audience who's on Terraform Enterprise, I know that was an edge case, but it was making life really slow for us for a while.
One issue we've discovered with making this capability fully self-service is we end up with a lot of abandoned environments lying around. A team or an individual might want a sandbox to experiment with something for a few weeks, and then they just forget about it and never come back. Self-service cleanup is available, but there is no monitoring and there are no reminders for that.
Another common pattern is that a user will receive an error on the initial provisioning. Rather than reaching out for help trying to dig into the root cause, they'll just try again with a different set of inputs and a different name. So we'll run into these half-created, half-broken environments as well.
These decoupled environments do come with some baseline costs that accumulate over time. So we have some ideas on the backlog to monitor for this and address this situation, including the recently released ephemeral workspaces feature on Terraform Cloud. We're definitely going to look into that when it comes to Terraform Enterprise.
Onto some of the bigger issues that we're still grappling with. We have this uber module that includes several other modules in this composition hierarchy. And when you need to make a change down on the bottom of this hierarchy, it's a series of PRs from the bottom all the way up to the top.
When you read HashiCorp's official documentation on module composition, there's this line that says, "In most cases, we strongly recommend keeping the module tree flat, with only one level of child modules." I used to balk at that recommendation because I'm a "don't repeat yourself" enthusiast. But I think they might've been onto something there now that I've experienced this.
We did combine several of these modules into the same source repository, which reduced the PR burden a little bit. And once you get that last PR merged, how do we update those 600+ and growing workspaces?
At the beginning, we used the default VCS integration, which meant, as soon as that commit hit the main branch, all 50 of those workspaces kicked off a run. This became problematic almost immediately.
We were overwhelming the Terraform Enterprise host and the GCP API quotas. Even in the best case, a human had to come through and click apply on all 50 of those plans. We ended up moving to a version-pinning process, so we could manage the rollout more gradually. Even after throwing some more compute at it, we wanted to be able to control it more gradually, so this required yet more custom deploy tooling using the Terraform Enterprise API.
I'm sure you can imagine how this part works. The API sets the branch on a workspace, and that will generate a run automatically. The tool will come back through, poll for those automatically generated runs, then compare the plan against some safety parameters we provide and conditionally apply the run.
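The poll, compare, and conditionally apply loop can be sketched as follows. Everything here is an assumption for illustration: the safety limits, the plan summary fields, and the workspace names are hypothetical, and the `fetch_plan` and `apply_run` callables are stand-ins for whatever Terraform Enterprise API calls a real tool would make. Concurrency is left out to keep the control flow readable.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class PlanSummary:
    to_add: int
    to_change: int
    to_destroy: int


@dataclass
class SafetyLimits:
    max_change: int = 10
    max_destroy: int = 0  # Never auto-apply a plan that destroys anything.


def is_safe(plan: PlanSummary, limits: SafetyLimits) -> bool:
    return plan.to_change <= limits.max_change and plan.to_destroy <= limits.max_destroy


def roll_out(
    workspaces: List[str],
    fetch_plan: Callable[[str], Optional[PlanSummary]],
    apply_run: Callable[[str], None],
    limits: SafetyLimits,
) -> Dict[str, str]:
    """Apply runs whose plans are within limits; flag the rest for a human."""
    results: Dict[str, str] = {}
    for workspace in workspaces:
        plan = fetch_plan(workspace)
        if plan is None:
            results[workspace] = "pending"  # Plan not finished; poll again later.
        elif is_safe(plan, limits):
            apply_run(workspace)
            results[workspace] = "applied"
        else:
            results[workspace] = "needs-review"
    return results


# Stand-ins for real API calls.
fake_plans: Dict[str, Optional[PlanSummary]] = {
    "team-a-prod": PlanSummary(to_add=2, to_change=1, to_destroy=0),
    "team-b-prod": PlanSummary(to_add=0, to_change=0, to_destroy=3),
    "team-c-prod": None,
}
applied: List[str] = []
results = roll_out(list(fake_plans), fake_plans.get, applied.append, SafetyLimits())
print(results)
```

Recording a status per workspace also gives you something to resume from if the process is interrupted partway through a rollout.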
We crank up the concurrency on this as much as we can without triggering those earlier issues, but it can still take the better part of a day for this whole thing to finish. We're still not happy with this process; improvements continue.
Currently, the engineer running this is doing it from their workstation, so we'd like to get it into a deploy pipeline, obviously. True story: one of my teammates was running this process, lost their power and internet. It left behind a whole bunch of dangling runs we had to track down and apply. And also a bunch of workspaces didn't get the update.
So, obviously, getting this into an actual deploy pipeline is a near-term priority, along with rolling out TFE worker agents and agent pools. Currently, we're running everything on the single Terraform Enterprise instance, so it's scaled pretty big even though it doesn't need to be all the time.
Finally, the biggest challenge of this whole project, bar none, has been documentation, training, and education. Until now, the Terraform development at Wayfair has been constrained inside this tiny restrictive box — the monorepo — with its limited CI. Still, for all of its limitations, it was a well-understood box.
Terraform Enterprise is the luxury starship of Terraform execution engines. But just learning how to use the tool, owning your workspaces, being in control of your state files, it's been a learning curve for a lot of engineers.
Recently, I had the fun experience of training a few people to use the terraform state rm command from the command line to remove some stubborn resources. I'm very excited about the new config-driven state remove that we saw this morning.
Under our old model, with one Consul cluster storing all the state files, we kept the keys to that cluster in the hands of a pretty small number of people. So anytime there was a state file operation, it had to go through us. Now, it's fully self-service, but it comes with an educational headwind.
Not everyone has been excited about this. Not all teams want that power or the responsibility that comes with it, so we've been attacking this problem from multiple directions. We've been running hands-on workshops where teams use the new tools to develop an example application with real-time help and instruction available. This has been one of our better-received options. But unfortunately, the throughput is low, the time commitment is high.
We have a library of documentation that is, unfortunately, large and not very approachable. This is another area for improvement for us as a platform team. We have a bot in our Slack channel that will match regexes against common questions and send out links to documentation. This has helped a little bit.
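A bot like that can be very small. The sketch below shows the matching step only, with no Slack integration. The patterns and the documentation slugs are hypothetical, chosen to echo two questions mentioned earlier in this talk.

```python
import re
from typing import List, Tuple

# Hypothetical rules: a pattern for a common question and a doc slug to send back.
RULES: List[Tuple["re.Pattern[str]", str]] = [
    (
        re.compile(r"can'?t see (my|the|our) (org|organization)", re.IGNORECASE),
        "docs/tfe-team-membership-sync",
    ),
    (
        re.compile(r"(plan|apply|run) .*(stuck|hung|taking (so|too) long)", re.IGNORECASE),
        "docs/provisioning-duration",
    ),
]


def suggest(message: str) -> List[str]:
    """Return the doc slugs for every rule whose pattern appears in the message."""
    return [doc for pattern, doc in RULES if pattern.search(message)]


print(suggest("I can't see my organization after provisioning"))
print(suggest("Is my apply stuck? It has been 12 minutes"))
print(suggest("How do I rename a bucket?"))
```

The limitation is visible right away: any phrasing the patterns did not anticipate gets no answer, which is why regex matching only helps a little.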
Of course, I can't give a talk in 2023 without mentioning large language models. We would really like a smarter bot that can train itself on our corpus of documentation and then answer questions in a conversational style, just like the human experts. We're actually prototyping this right now, so hopefully, that'll launch soon.
To continue on that theme, I wanted to review how the product has been received by our engineering community.
I've included some qualitative feedback from a recent survey. As you can see, we have a lot of work to do on the provisioning process and the documentation. We do provide a lot of example code, including an extensive demo repository with example code for almost every Google Cloud service that we support. But we can never quite predict all these combinations of services that our users are going to use, so the effort is ongoing.
We've had a lot of positive sentiment as well, with engineers enjoying the ability to use the true Cadillac of Terraform pipelines — Terraform Enterprise, with modern Terraform versions — without needing a lot of code reviews from outside their team. I have included some qualitative feedback from one of our early adopter teams here. Using the product has resulted in a huge increase in velocity for them.
Another team's experience was quite positive. Again, focusing on the ability to rapidly prototype in the sandbox, translate that to development and production, the autonomy of owning the decoupled environment completely, and the confidence that other teams won't be impacted by their changes.
At Wayfair, we like to say that we are never done.
A big one is making that workflow easier to provision. I glossed over some of the details there. You just have to fill out a few form fields. There are a lot of subtleties here. It's hard to communicate to an external audience. Suffice it to say, there are a lot of improvements that we can make on that.
Another big thing that you might've been thinking about is those 27,000 state files from the monorepo. We do want to import those into Terraform Enterprise eventually. That's going to be a big project. But thankfully, with Terraform 1.5 and the import block, this is going to be a lot easier.
Prior to the import block, importing into Terraform Enterprise was very painful. You didn't have access to your workspace variables, which is usually where your credentials were. I wrote some tooling to do it with one of our teams, but it was not fun. The import block looks great, so kudos to the Terraform core team for working on that feature.
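For a migration of that size, the import blocks themselves would likely be generated rather than hand-written. Here is a minimal Python sketch that renders Terraform 1.5 style `import` blocks from address and ID pairs. The resource address and bucket name are hypothetical, the correct ID format differs per resource type, and the sketch does no escaping of special characters in IDs.

```python
from typing import Iterable, Tuple


def import_block(address: str, resource_id: str) -> str:
    """Render one import block for a resource address and its provider ID."""
    return f'import {{\n  to = {address}\n  id = "{resource_id}"\n}}\n'


def render_imports(pairs: Iterable[Tuple[str, str]]) -> str:
    """Render one block per (address, id) pair, separated by blank lines."""
    return "\n".join(import_block(address, rid) for address, rid in pairs)


# Hypothetical resource to bring under management in a workspace.
pairs = [("google_storage_bucket.assets", "example-assets-bucket")]
print(render_imports(pairs))
```

Because the blocks live in configuration, the import runs inside the workspace with its normal variables and credentials, which avoids the pain point described above.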
Obviously, we're going to keep running those workshops and increase the variety of available content while we iterate on and improve the findability of our documentation.
Lastly, we will continue to make more Google Cloud services available. This involves a security review and writing some basic Sentinel policies based on best practices, preventing worst practices.
Before I wrap up, I wanted to acknowledge all the teams and individuals that contributed to this project. It's been a great experience collaborating with so many cross-functional teams at Wayfair to unlock this power for our users.
That includes writing the Buildkite plugin that manages the workspaces, all of that self-service tooling that sits between the web form and Terraform Enterprise, and a ton of Google Cloud-specific work that I didn't get into in this talk. Stuff like service perimeters, organization policy, IAM deny policies, VPC networking, workload identity: all kinds of fun and interesting stuff where we oftentimes push the limits of what Google Cloud offers us. It's been a true team effort.
We also have a great product management team that keeps the ship steered in the right direction, manages the signups and the scheduling for those workshops, gets feedback, tells the story to the upper management — all that great stuff that engineers aren't always good at.
Lastly, I wanted to thank our account team at HashiCorp, along with all the previous folks who have been on our account team throughout the years. We meet with them regularly to discuss how we're using and breaking Terraform Enterprise. They've been incredibly helpful at getting us the right experts on the line to assist with architectural decisions, getting our support tickets routed to the right people, all that good stuff.
That's it. I hope you learned something from our journey. Thanks for listening, and have a great afternoon.