Armon Dadgar sits down with two technical managers from Wayfair to talk about their cloud journey and why they switched to commercial HashiCorp Vault and Terraform.
In this interview, HashiCorp Co-Founder and CTO Armon Dadgar sits down with Jeff Dillon, Associate Director of Developer Platforms at Wayfair, and Travis Cosgrave, Senior Manager of Configuration Management at Wayfair, to talk about Wayfair's cloud infrastructure evolution, how they used HashiCorp Vault and Terraform on that journey, and why they eventually ended up becoming HashiCorp customers, adopting both Terraform Enterprise and Vault Enterprise.
Thank you so much for joining. I hope everyone's had a great time so far with day 1 of HashiConf. There have been a ton of great talks. Thank you to all the speakers. Thank you to our MCs. I'm super excited to end the day with a few guests from Wayfair, who are going to chat about their journey adopting the tooling, from what their initial drivers were through to what they're doing now and where they're hoping to go next. I'm excited to welcome Jeff Dillon and Travis Cosgrave from Wayfair.
Hi, everybody. My name's Jeff Dillon. I've been at Wayfair for the past decade at a bunch of different spots within infrastructure and SRE. At the moment, I'm an associate director on developer platforms and run a few developer enablement teams.
And I'm Travis. I've been at Wayfair for about 5 years. I've worked with a variety of our config management components, including Vault. Right now, I'm managing our configuration management team.
Thank you both so much for joining us and making the time. Before we jump into the adoption, let's set the scene a bit. Rewind to 2015. I know that was before you went on your cloud adoption journey. Give us a sense of what Wayfair's infrastructure looked like at that time.
It was large-scale enterprise infrastructure, but in physically owned datacenters: a large one on the East Coast, one on the West Coast, and one in Europe. Up to that point, scaling up wasn't a constraint. But later, as with any e-commerce company, we started finding the seams on big sales days such as Cyber Monday here in the States. That's when we started toying with the idea of cloud.
You mentioned e-commerce. For those who might not know what Wayfair does, give us a sense of the scale of it and Wayfair's business.
Wayfair sells all things home-related: decorative items, couches, appliances. We're a technology firm as well. We have a 3D render suite. We have an artificial reality suite.
You alluded to some of the challenges on things like the scale days, but, Travis, I know there was a journey then to move out of the private datacenters, or at least to incrementally adopt cloud. Give us a sense of what the catalyst was, the driver for you guys to say, "We need to do something different than private datacenters. We think cloud is the right way." What was the driver for that?
The first thing we dipped our toes into was trying to solve some of our image problems. Selling so many products, we have a lot of images, and that was one of the most difficult systems to scale at the time.
We had great success through our first big sale day with stuff in the cloud. But we really were sold on this ability to scale dynamically during our first Way Day event, where we didn't have a lot of our customer-facing stuff in the cloud. We were shutting down physical infrastructure. We were repurposing whole hypervisors. It was very chaotic in this tangible sense.
After that event, there was almost no question that we needed something far more dynamic. And the cloud is obviously where you do that.
I guess Way Days are similar to an Amazon Prime Day, where you guys are getting a crazy spike in traffic.
Yeah. You can think of it that way.
So elastic scaling allowed you to move some of your images workload over to cloud. That makes a ton of sense. You started by identifying specific workloads that you could move to cloud.
I'm curious where Terraform and Vault started playing a role. I know both of those were fairly early as part of your adoption journey.
With that rare opportunity to test out the cloud, we wanted to do it right from the beginning. We didn't want to forklift existing human processes. We wanted to figure out the best process for the reality that is the cloud. That's when we decided on Terraform and, in addition to that, Consul, Vault, and Packer.
We built the ecosystem that Travis just alluded to the right way for the cloud from the beginning. After we had success there and realized that what we were trying to do on-premises was almost trivially easy to set up inside Google Cloud Platform (GCP), we did a mind shift: "We can't have 2 separate teams. We can't have 2 separate entities within infrastructure where, if you needed a cloud thing, you talked to these folks and if you needed a non-cloud thing, you talked to those folks." We set out to unify that stack and try to see if we could hybridize it.
You mentioned a ton of Hashi tools, which we always love to hear. Give us a sense of what problem Vault was solving as you were thinking about cloud, and why did that need to come into the mix when it wasn't used previously?
Early on in the cloud journey, we were looking at how we manage secrets, and we had these clunky encrypted files with different keys to get things distributed. And when we went to bring that key handling and decryption handling to the cloud, it was unbelievably clunky just to make that happen.
Being able to shift that data source to an API that we could then program against in a far more dynamic fashion is what led us to Vault. Also, there are a lot of things that Vault has already solved in terms of database access or generating certificates for you and that kind of thing. The future prospects were really promising.
We started small with Vault, and it took a long time for us to start to adopt it wholesale and explore a lot more of the features, but it did get us away from that clunky workflow and led into a place where we had a lot more freedom and flexibility to get this technical aspect of delivering the secrets.
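The shift Travis describes, from encrypted files to an API you can program against, can be illustrated with a minimal sketch. It assumes Vault's key-value version 2 secrets engine, whose HTTP read path is `/v1/<mount>/data/<path>` with the token sent in the `X-Vault-Token` header; the interview does not say which engine version or client Wayfair used. The address, token, mount, secret path, and sample response below are hypothetical, and the request is built but never sent.

```python
import json
import urllib.request


def build_kv_read_request(vault_addr, token, mount, secret_path):
    """Build (but do not send) an HTTP request that reads a KV v2 secret."""
    url = f"{vault_addr.rstrip('/')}/v1/{mount}/data/{secret_path}"
    return urllib.request.Request(
        url, headers={"X-Vault-Token": token}, method="GET"
    )


def extract_kv_data(response_body):
    """Return the key-value pairs from a KV v2 read response body."""
    payload = json.loads(response_body)
    # KV v2 nests the secret under data.data, next to data.metadata.
    return payload["data"]["data"]


# Hypothetical address, token, and path, for illustration only.
request = build_kv_read_request(
    "https://vault.example.internal:8200",
    "example-token",
    "secret",
    "images/resizer",
)

# A hand-written stand-in for what Vault would return for that request.
sample_response = json.dumps(
    {
        "data": {
            "data": {"db_password": "not-a-real-secret"},
            "metadata": {"version": 1},
        }
    }
)
secrets = extract_kv_data(sample_response)
```

Because the secret is one HTTP call away, the same lookup works from a VM, a CI job, or a container without shipping decryption keys alongside the workload.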
I love the unifying theme of doing stuff in a cloud-native way, that dynamic aspect, whether that was the elastic nature of what Terraform was letting you do in terms of scale-down or on the credential side with Vault.
Jeff, you said you didn't want 2 different teams, 2 different ways of delivering things. You wanted a standard way. Did some of the thinking and practices that you adopted in cloud change how you thought about running the private datacenter? Was there an influence, or did you have just 1 team but 2 different processes?
That's a great question. We were almost dead set that it had to be a single pipeline. The pipeline had to work because we have 2,000 developers. Just the process of getting the knowledge out there with 2 separate things was way too complex.
We started building a lot of our own providers in-house just to cook up the pipeline. As a developer said, "I need infrastructure. I need compute." It's the same ingress for them, the same portal, the same HCL code that they submit to the pipeline. And it's basically the providers and the variables that we submit in the backend that decide whether it goes to the physical, on-prem datacenter or up to a Google Cloud datacenter.
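The single-pipeline idea Jeff describes, where the developer's request is identical and the backend decides where it lands, might look roughly like the sketch below. This is not Wayfair's implementation. `google` is the real Terraform provider name for GCP, but `onprem_compute`, the profile contents, and the request fields are invented for illustration.

```python
# Hypothetical backend profiles. "google" is the real Terraform provider for
# GCP; "onprem_compute" stands in for an in-house provider and is invented.
BACKEND_PROFILES = {
    "gcp": {
        "provider": "google",
        "variables": {"region": "us-east4"},
    },
    "onprem": {
        "provider": "onprem_compute",
        "variables": {"datacenter": "east-coast-1"},
    },
}


def plan_submission(developer_request, placement):
    """Combine a developer's request with backend-chosen provider settings."""
    if placement not in BACKEND_PROFILES:
        raise ValueError(f"unknown placement: {placement}")
    profile = BACKEND_PROFILES[placement]
    # Backend values are merged last, so they win if a key collides.
    variables = {**developer_request, **profile["variables"]}
    return {"provider": profile["provider"], "variables": variables}


# The developer submits the same request regardless of where it will run.
developer_request = {"service": "image-resizer", "cpu": 4, "memory_gb": 16}
cloud_plan = plan_submission(developer_request, "gcp")
onprem_plan = plan_submission(developer_request, "onprem")
```

The point of the design is that placement is a backend concern: the developer-facing input never mentions a provider.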
I know you guys were on the bleeding edge of doing some of this stuff. I remember we were together at your offices doing a meetup group several years ago. You guys were saying, "Here's our unified Terraform pipeline. We can go on-prem with our custom providers that we wrote to automate private datacenters as well as the providers that take us out to GCP."
That was really cutting-edge in terms of thinking about modernization of the private datacenter. Now, to flash forward a bit, as you're scaling up your usage of the tools, as you're scaling up the engineering teams to 2,000 people, how did you think about what you needed to do to enable that scale-up?
We started with Terraform open source. And we discovered a constraint that we hadn't realized, or at least predicted, in the beginning: as you scale up, the repetitiveness of HCL code throughout started to become a bottleneck. It wasn't as bad as what it was like back in 2015, but we were looking for the constraint in the system and always trying to produce value faster and move faster for the business.
We started eyeballing where that next constraint was. We were almost asking, "Is there a way for us to allow developers to get what they need as quickly as possible, to serve them, to serve their jobs, and to reduce the toil and need of coding on their end?"
I think this is the classic cycle we see people go through. Step 1 is, "How do we enable the teams to at least go to cloud?" Step 2 is, "We have that consistent pipeline that you described, one standard way of doing it." But step 3 is starting to think about, "How do we do this in a scalable way where we don't have 2,000 developers all reinventing the wheel?"
I know that became a bit of a journey from open source onto some of the commercial Hashi products.
Travis, could you talk about that transition for Vault? What were the reasons and the drivers?
We tried early on to manage the open source clusters. That started small in the cloud. We discovered all kinds of limitations when it came to how small we started, and it was surprising that it lasted for as long as it did.
We built out some more robust infrastructure with open source, but there was this reliability and scaling problem that we bumped into that we tried to solve on our own by replicating the data from one place to another. But that started to become operationally heavy as well, quite quickly.
The big step for us, the thing that drove us toward Enterprise for Vault, was when we did a review: What does it take to grow this? What does it take to get to that next phase? We said it's the replication piece that's an Enterprise feature, and then adding in the support and the other future features that we knew were coming along. That was the big step for us and the thing that drove us toward Vault Enterprise.
It's been funny to watch how many customers start out thinking, How hard could it be to build our own replication? How hard is a distributed system really? I totally appreciate that journey.
Jeff, I know there was a similar journey around the same time for moving to Terraform Enterprise. Could you share a bit about that?
It's an interesting story. We tried internally to figure out a way to dynamically create HCL code for Terraform. Do we do inputs? Do we have a form? Do we have a frontend? More than just 1 team was figuring out whether this was a capability.
Then we started re-discussing with HashiCorp, and the attractive part of Terraform Enterprise was the API, which was a really interesting conversation to have. It was almost like walking into a car dealership and saying, "I don't need leather seats. I don't need air conditioning. I care about the engine. The engine is what unblocks us."
Here's what we discovered in our POC that led us to purchase it. Take a Google Cloud Storage bucket, for example. With no other helpers, it takes a multitude of API calls to get it set up.
We had to get the permissions correct and have it out the door prod-ready for the developer. The Terraform provider in front of that simplifies those API calls for us, but it's still a multitude of HCL at that moment.
We swapped that out with Terraform Enterprise, and we were able to reduce it to an API call with all the parameters we needed. Then Terraform built the bucket and sent it back upstream for us. That was the selling point for us, the unblocker: I can give a developer a button or have a middle layer between that button and Terraform Enterprise. Developers can submit a YAML file and say, "I need these 40 buttons." The speed at which they get feedback and the resources they need to do their jobs is tenfold now.
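A minimal sketch of what "an API call with all the parameters we needed" could look like, assuming a middle layer that turns a developer's request into Terraform Enterprise API payloads. The payload shapes follow the documented JSON:API bodies for creating workspace variables (`POST /api/v2/workspaces/:workspace_id/vars`) and runs (`POST /api/v2/runs`). The workspace ID, bucket parameters, and function names are hypothetical, and a plain dictionary stands in for the parsed YAML file. Nothing is sent over the network.

```python
import json


def workspace_variable_payload(key, value):
    """JSON:API body for creating one Terraform variable on a workspace."""
    return {
        "data": {
            "type": "vars",
            "attributes": {
                "key": key,
                "value": str(value),
                "category": "terraform",
                "hcl": False,
                "sensitive": False,
            },
        }
    }


def run_payload(workspace_id, message):
    """JSON:API body for queueing a run in the given workspace."""
    return {
        "data": {
            "type": "runs",
            "attributes": {"message": message},
            "relationships": {
                "workspace": {
                    "data": {"type": "workspaces", "id": workspace_id}
                }
            },
        }
    }


def translate_request(bucket_request, workspace_id):
    """Turn one developer request into variable bodies plus a run body."""
    variable_bodies = [
        workspace_variable_payload(key, value)
        for key, value in sorted(bucket_request.items())
    ]
    message = f"Create bucket {bucket_request['bucket_name']}"
    return variable_bodies, run_payload(workspace_id, message)


# Hypothetical request, standing in for a parsed YAML submission.
bucket_request = {
    "bucket_name": "team-images-prod",
    "location": "US",
    "storage_class": "STANDARD",
}
variable_bodies, run_body = translate_request(bucket_request, "ws-EXAMPLE123")
print(json.dumps(run_body, indent=2))
```

In a real service each body would be sent with an `Authorization: Bearer <token>` header and the `application/vnd.api+json` content type. The developer only supplies the three request fields and never writes HCL.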
So it really started with using the open source Terraform for that core enablement, to have a consistent journey across on-prem and cloud, but then it was the commercial products that helped in terms of scaling up and having a commonality, creating an API-driven interface, having these multiple consumption methods. It also means you don't reinvent the wheel.
As you started going down this journey, did you think, "We have this initial set of use cases, but we've started finding other ways we might use products like Vault in ways that weren't what we originally anticipated"?
Travis, I know you've been doing a lot of work with it. Did you find some interesting use cases outside of that initial one?
Yeah. There was the initial replacement of this clunky YAML base to using Vault as a secure backend that we can put everything in. It's still key-value-based, and we'll go from there.
But as we started to get more into the cloud and more of this dynamic nature, with the Terraform tooling that got built up around it, we needed a way to distribute secrets beyond just the VMs that we were building. We couldn't keep the same interface that we were running anymore. We now had things that were running in Kubernetes, and we were adjusting our CI/CD pipelines. Where those were running, with the adoption of containers, developers were starting to do a lot more local development rather than in shared VMs.
This led us to say, "How can we enable Vault to distribute to all of these different places?" We set up a taxonomy of how we describe different interactions with Vault. We built out almost an entire internal algorithm to say, "Anything that is in Vault we can then get distributed in a unified fashion to any place that might need that secret, including things like approvals and checks."
That's where this journey became fascinating because now, all of a sudden, wherever you are, your secrets pathway is well defined and incredibly consistent.
It's interesting to see how it started with just, "Let's do the key-values and solve this immediate problem," but then branching out to, "What about the developer environments? What about the dynamic credentials?"
I think what we often see is that Vault is like an iceberg; people start at the very tip of it. It's like, "I just have the secrets management piece," but it's all that capability below the surface that you dig into as you get deeper in the journey.
To wrap up, Jeff, you guys have made a ton of progress and are mature now in this cloud journey. Give us a sense of where you're going; where are you now taking the platform? What's the frontier for you?
We are moving fully into Google Cloud, which is interesting, because the journey that we set out on 5 years ago was just the first toe in the pool.
I think the story flows in that we brought in Terraform and Vault and other tools originally as a way to simplify our jobs as infrastructure people. But as I've managed the past couple years, I realized it doesn't really simplify my job. It helps simplify the developers' jobs.
One of the areas that we're concentrating on for the next year is a large-scale enterprise platform-as-a-service layer within everything. To give you a sneak preview, if you, Armon, were a developer and you started within Wayfair, the dream is that you come in and we have an open source frontend UI, Backstage.
Behind that we have a middle layer that we wrote ourselves in Golang, and that is our governance layer. That talks directly to Terraform Enterprise, and Terraform Enterprise talks directly to Google Cloud. So you don't have to write HCL. You don't have to talk to other teams.
We can set up SOX compliance around it. You can deploy to prod and deploy to dev, hopefully as quickly as you can code your app to head out in that direction. We have it tied into Kubernetes. We have it tied into Kafka. We have it tied into Aerospike to cache. The goal is now to speed up the developers. In my mind, they are my customer at Wayfair.
I think that's the end goal of infrastructure. It's ultimately plumbing. We're just here to enable the developers and the application development teams to go faster. They're the customer, and the end goal is to make them super productive.
I'm excited about what you guys are building and excited to get to partner on it. Hopefully, you'll join us next year at HashiConf and share where that platform development has gone.
In the meantime, thank you both for making the time to join us here today.
I hope you enjoyed the session. Thanks again to Jeff and Travis.