Hear how Booking.com created a self-service VM provisioning platform using Terraform Enterprise, Packer, and home-grown solutions.
Creating a VM at Booking.com used to take 45 minutes before Terraform, Packer, and other systems. Now it takes 30 seconds to log into a VM, and it's usable in under 5 minutes, all through a self-service UI. They can also do mass updates from a central location. Learn about their journey on their OpenStack, VM-focused system.
Welcome, everyone. It's been a while since we did this talk. I apologize. When we started writing the talk name, we said we were going to do 2,500 users. When I logged in to do the demo, we're at over 3,000 users today.
A little bit about Booking. This is a requirement for any of us that give talks. We got to do the corporate spiel. For those that are not familiar, Booking is one of the world's largest travel websites. We use technology to enable people to travel the world.
Booking reached out to me about four years ago this month. As you can tell, I'm originally from the United States; I live here in Amsterdam. And they were like, "Hey, do you want to come here and help us be a travel website?" I was doing a bunch of startups at the time, and I looked at the different companies I was considering. I figured travel, on the scale of evil to not evil, lands more toward not evil, so I decided to join Booking, and to help people do something that I enjoy, which is travel the world.
We have 3,000+ developers, as was noted before. And we build lots of different tooling on our websites so people can book hotels, flights, and attractions such as tickets, and do all the fun stuff.
To the meat of the talk, which all of you are more interested in. We had various things we needed to cover: enable developers to control their development environment, interact with a legacy system that has existed for over 15 years and has made billions of dollars per year, not break those systems, and move forward.
So high level, how do we do this? I think a lot of you have seen various other talks where people compose various Terraform modules. They also want to use Terraform and Terraform Enterprise. One of the key things we have to work around is: how do we place things into workspaces, and how do we split them up? How do we split up those various workspaces as an organization?
There are other talks here that go into much more detail about how we solve these various problems, but we do create virtual machines via images — via Packer, via image pipelines. Not overly shocking to people that that's done. That's a very common pattern. But that was a key thing that we did to make this work.
The other thing we did to accelerate replacing these legacy systems as quickly as possible: besides Terraform, which was the code we wrote, the key piece that let us orchestrate everything, cut out a lot of mundane tasks, and get to market faster was Terraform Enterprise.
What are we working on? How is this doing? As we mentioned, we have 3,000 various users. I did name them as developers; it's roughly 2,000 developers directly, plus about 1,000 data scientists and designers who use the platform that we control, the one we use to enable people to do their day-to-day jobs.
When you have 3,000 different users who are also your coworkers, and the system breaks, and they all reach you on Slack, your life sucks when things break. That was the reality of the legacy system we had: it was constantly and consistently breaking, so we had to replace it. We had 200 different applications owned by 50 different teams. That gave us an organizational problem: what happens when the virtual machine infrastructure breaks? The application code running on those individual machines may break, and for those 200 different applications that 3,000 different people may launch, I'd have to go talk to them and their 50 different teams.
One of the ways we solved that (I'm going to hand-wave briefly over it) is we set up an image pipeline: a factory where those different applications could be built as images. And instead of the infra team, which I'm part of, fielding every failure, we pointed people to the teams who actually owned the code that would break.
Because we'd say, "We're responsible for the provisioning, the automation, and everything behind it. But if your application code breaks, we may be smart, we may be good at the infra side, but we're not writing the actual website code. Please go work with those teams." And we pointed people to the various teams. That helped us scale quite a bit.
The other thing we found out, working with different users and different pools, is what an average developer at Booking has to spin up to do their job. It's not one single machine running their application. There was a lot of legacy code involved: they would have to spin up three, four, five, six, or seven different machines and have them talk to and interact with each other to test out their different interactions.
They'd have application one, which could be the main website, and they may have an order system that they have to schedule. Maybe they're doing some Docker containers — more advanced — so they have to have a Docker VM. We would find out that people — instead of spinning up a single VM — would have to spin up five, and 10, and groups of them.
The legacy system ran on OpenNebula. It served its purpose for the time that it was up. I didn't know what OpenNebula was before I joined; it's smaller scale. Think of a precursor to, or something similar to, an OpenStack-type system but with smaller pieces. It happened to be written in Ruby. There were scaling issues and other architectural issues, so we replaced it with OpenStack, which we'll go into in a minute.
The other big issue that I referred to earlier is that there were a lot of confusing ownership issues, and this is a big piece. One thing we wanted to establish was: you own the machines that you spin up. If you spin it up, it has an IP address, it boots, you can log into it: not my problem.
We want to be able to point you back to the other teams once it turns out that it's their piece that broke. I know this sounds pushy, like maybe we're not helping our colleagues as best we can. But the reality is that the best way we can help our colleagues in this situation is to point them to the people who can actually fix the problem, not ourselves.
It'd be like, "Oh, maybe this dependency broke, or something else. Please go talk to this other team." And when we built this, we wanted to remove ourselves from that loop. Instead of them coming to us, saying something is broken, and us redirecting, we could just directly connect everybody. That's one of the key things that we did.
We have a new system. Sorry, I'm not a graphics designer; I can barely code these days. This is a happy developer, and she's super-happy. We want her to be able to work in two different ways.
I'm not going to go through the UI today, mainly because it's very Booking-specific and not that interesting. We have a backend service that creates VMs as a service. It actually generates Terraform code, which I'll go through in a second, and commits it to a Git repository. The human developer, if she chooses to, can go to the Git repository or use the UI; over 99% use the UI.
There are a few of us, me included, who use both. I use the UI a lot, and for advanced use cases I go to the Git repo, which we'll go through in a second. It commits to the Git repo, goes to Terraform Enterprise, and then it provisions the various machines on OpenStack.
I know, OpenStack. I know. I hand-waved over some other stuff; there's a lot of legacy involved. But we did all this with a very small team: about five to six people working on the code, and that's being generous; probably about three to four consistently.
We had to make all this stuff work for the 3,000 people across the 200 different teams with very limited interaction from them because their job isn't to help build development machines. Their job is to help Booking make billions of dollars doing hotel rooms and flights, etc. We had to do this in a very constrained environment.
I'll walk through a quick demo of what it looks like spinning up. I think you've seen this similar before and how it kicks off. This is the payoff, and I'll walk through how we get to this in a minute. But this is kicking off going to the UI. I clicked a button that said create a virtual machine. Then we're kicking off our plan. It's doing the exciting, fun part about building your virtual machine.
You'll notice some stuff here that may look a little unusual. We had to write some custom providers. This is the cool part as well: some of this covers the legacy pieces where we often broke code and made mistakes.
We wrote Sentinel policies, and we used Puppet to be like, "Did you set your metadata correctly?" Because we wanted people that were more advanced when they forget to set metadata (which we all do) to make it so that — before their machine spun up and they called us that it was broken — we just checked to be like, "Did you remember to do this?" Relatively simple.
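A metadata check like that can be expressed as a short Sentinel policy. This is a hedged sketch, not Booking's actual policy: the resource type and the `owner` metadata key are illustrative assumptions.

```sentinel
import "tfplan/v2" as tfplan

# Instances being created in this plan (resource type is an assumption)
new_instances = filter tfplan.resource_changes as _, rc {
    rc.type is "openstack_compute_instance_v2" and
    rc.change.actions contains "create"
}

# Every new instance must carry an "owner" metadata key (illustrative rule)
main = rule {
    all new_instances as _, rc {
        "owner" in keys(rc.change.after.metadata else {})
    }
}
```

A policy like this runs before the apply, so a forgotten metadata key fails the plan with a readable message instead of producing a half-working VM.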
The other piece: 95% of this is standard. We just used the provided HashiCorp providers. We have some custom DNS tooling internally; that's the Longinus provider. It's just some REST API calls, that's all it is, and it creates records via those custom REST calls.
That's creating a virtual machine. Maybe not super-exciting for you, but super-exciting for us, because this played in real time, going through the system by clicking buttons.
Provisioning the machine takes under 30 seconds going through Terraform, through our entire system. You can actually log in in about 30 seconds; the previous system used to take 45 minutes.
The other common thing (we bug developers about it constantly) is that you have to delete a virtual machine and do the reverse: the dependency deletes. Most people will spin up their VM, leave it up for months, and then we have to nag them by email: "Hey, security says you have to delete your stuff after 30 days." They'll get an email and go through and delete. They interact through the UI: they click a button, and it goes through and deletes the virtual machine and all of its dependencies.
We also have some custom code involved that does more advanced things, such as creating firewall rules that need to exist and interacting with the other systems we have to touch. And we just do this for the users, because with 2,500 different people, they may not know that a particular role or particular application needs firewall rules. It would be broken, and they would end up having to call various people to figure out why. So we made it happen as part of this interaction.
If a firewall rule apply fails, we get back to the user and put their virtual machine in an error state, and we allow them to recreate it on the fly so they don't get into a half-broken state, so they can self-service as much as possible.
As I alluded, we do have a UI. That's the more general use case. The more advanced use case is that many of us interact through the CLI and we modify the code on the fly. I'm going to go through and show the demo I did, where we did the create/delete.
One thing we learned when we were doing this system is that you know what you know, and you don't know what you don't know. And as we were going through this, and we were creating these different iterations, we had to constantly modify the code.
Let's take a look at the code that defines a machine. It uses a module and has a bunch of custom variables we set for that individual VM (this is the type of machine, this is the type of VM), which then fire off and go to the module.
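A generated per-VM file might look roughly like this. It is a hypothetical sketch: the module source, variable names, and values are all made up for illustration, not Booking's real interface.

```hcl
# Hypothetical generated file; one of these exists per VM.
module "dev_vm_01" {
  source = "git::https://gitlab.example.com/infra/modules/dev-vm.git"

  name   = "jdoe-dev-01"   # machine name
  flavor = "m1.large"      # VM size
  image  = "base-2021-10"  # pre-baked Packer image
  role   = "web"           # drives firewall rules and DNS records
}
```

Keeping each VM as one small module call is what makes both paths work: the UI backend can generate and delete whole files mechanically, and an advanced user can edit the same file by hand.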
This is one of the more interesting things we did: when we have automation, we want to allow users to modify their code, but we also want our code to be smart enough (because we make mistakes) to go through and auto-update the modules for the users.
You saw briefly that there's a serialization header. It says, "Our automation did this," and we compare it. If the file is still the same as the serialization header says, we will modify the file for the person. If a more advanced user went in and modified it, then uh-uh, it's on you; we're not going to touch it, we're not breaking your stuff. That's one more of the geeky, fun things we did.
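The serialization-header idea can be sketched in a few lines of Python, assuming a simple scheme where the header records a hash of the generated body. The talk doesn't show the real implementation, so the header format and function names here are illustrative.

```python
import hashlib

HEADER_PREFIX = "# generated-sha256: "  # assumed header format

def render(body: str) -> str:
    """Prepend a header recording the hash of the generated body."""
    digest = hashlib.sha256(body.encode()).hexdigest()
    return f"{HEADER_PREFIX}{digest}\n{body}"

def safe_to_rewrite(existing: str) -> bool:
    """True only if the file still matches what the automation wrote."""
    header, _, body = existing.partition("\n")
    if not header.startswith(HEADER_PREFIX):
        return False  # no header: hand-authored, hands off
    return header[len(HEADER_PREFIX):] == hashlib.sha256(body.encode()).hexdigest()

generated = render('module "vm" { flavor = "m1.small" }')
print(safe_to_rewrite(generated))                     # True: still ours, safe to regenerate
print(safe_to_rewrite(generated + "\n# hand edit"))   # False: a human touched it
```

If the hash still matches, the automation regenerates the file on module upgrades; any human edit breaks the match, and the file is left alone.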
Here we're going through and deleting via the UI. You can look at the history and see the other pieces that we did. I apologize I couldn't do this live; it's more about the Wi-Fi. I made a few typos as we went through as well.
In the backend, via the API, we can see a full history of what our service is doing in the background. It's generating files and doing code merges. This is done through typical Python Git libraries; it's not overly magical.
As I alluded to, we also very commonly create groups of machines. I created four machines and a couple of DNS entries as well; a very common thing to do. I personally don't like clicking 10, 15, 20 buttons to delete stuff. I log into my repo, which I'll do in a second, do an `rm` of the Terraform files, and walk you through it. You can see that the code did it, and this is where it goes.
The other advanced topic: there's only so much you can do through a UI. Maybe there are 20 or 30 different use cases that you can enable. Then somebody says, "I want a very specific DNS entry to be able to do this very specific thing." And it's, like, one user.
Well, this is one of the reasons we allow people to have access to the repo. We've had a couple of different high-value use cases, for regulatory reasons, where we had to have regulators come in and test things independently. It would take us six to eight weeks — or maybe longer — to do it in the UI for that use case. We were able to update and create the records via the Terraform file, commit it to the repo, and do it. And it took about five minutes.
This is where we talk about advanced users. It did take us like two hours to train the ten people that had to do it. Then they were able to do this on their own, and then we'd walk away. It does come up occasionally — trusting your users a little bit more and giving them access to the stuff under the hood does pay off.
This is walking through the different screens here. I'm deleting it. As a new user, I want to say, "I'm going to make these different changes." I want to take a quick look, and I'm going to do it from the CLI, very quickly, as I iterate through my development plan.
See what it's going to do. Big shocker: because I deleted the infra, Terraform is going to delete it. It becomes interesting, and a little scary for people, when they first start to make changes. This gives them the ability to do modifications in their own workspace. It allows them to practice doing things independently before they do things in their scarier production workspaces. You can do things safely, and you can go through and check things as well. After I've tested it and I know it's going to delete, I push and delete my actual files.
That is the payoff. I can spin up the virtual machines. That's the particular thing we're trying to enable. But how do you actually enable those 3,000+ people to do this through the UI? That's the harder part. Well, besides doing all the legacy code interaction and making that all work — that was actually really hard.
But how do we spin up these 3,000 Terraform workspaces with 3,000 GitLab repos, and interconnect all of the OpenStack projects for the users? And how do we glue that all together? That sounds hard. Not really. It's a big for loop. Couple for loops.
We go to the source of truth: who are the users? We have a staff API inside of Booking.com, so we go to the staff list and figure out who has access and what team they're part of. We grant them access and create the infra. We create the GitLab repository for that user, then go to Terraform Enterprise, and in that workspace we bootstrap their OpenStack project and give them access to the code. Inside that workspace we use the Terraform Enterprise provider, or cloud provider (big shocker there), and we bootstrap all of the dependent workspaces from there.
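The "big for loop" could be sketched like this in Python. Every function here is a local stub standing in for the real staff API, GitLab, and Terraform Enterprise calls; the names, team, and repo paths are made up.

```python
def fetch_staff():
    """Stand-in for the internal staff API: who exists, which team they're on."""
    return [
        {"login": "jdoe", "team": "dev.r08"},
        {"login": "asmith", "team": "dev.r08"},
    ]

def ensure_repo(login):
    """Pretend to create (or find) the user's GitLab repo; returns its path."""
    return f"devvms/{login}"

def ensure_workspace(login, team, repo):
    """Pretend to create the TFE workspace wired to that repo."""
    return {"name": f"dev-{login}", "team": team, "vcs_repo": repo}

# The loop itself: staff list in, one repo and one workspace per user out.
workspaces = [
    ensure_workspace(p["login"], p["team"], ensure_repo(p["login"]))
    for p in fetch_staff()
]
print(len(workspaces))  # one workspace per user
```

Because each `ensure_*` step is idempotent (create or find), the loop can run continuously: new hires get infra automatically, and reruns are harmless.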
We'll go through a workspace creation demo. As I said, this runs in a loop. And this particular thing runs all of the time. Let's do it in a little bit of an inverse order. I'm going to log in to this dev.r08. That's actually a team. There are reasons we do smaller teams, but that's the name.
As part of that, when we go into the workspace, we have all of these different variables associated with it; that's an implementation detail of how we do it. I can look at the history of all the runs and see what happened. You can see the runs are fairly constant.
We do capacity testing against our various systems.
I want to take a look at this particular merge. I want to go figure out what's going on. I take a look at the code. I see a bunch of changes were made — looks like a bunch of additions. I can look at this very quickly and understand what it was.
But I'm going to take a step back, take advantage of this, and go look at the commit that did this, or at our automation, to debug it. I want to see what happened. I click in and go to the commit. Now I'm in GitLab, looking at the repo. And we see our user: we added a user, and that triggered all of those runs and the changes that happened in them.
Let's take a step back and look at the actual code that gets triggered when this happens. I'm going to look at our repo. One thing we learned: we have a versions Terraform file with the versions of all the providers we use. We do this centrally, and we do it in every repo. We follow the best practices and go from there.
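A central versions file along those lines might look like this; the version constraints and provider set are illustrative assumptions, not Booking's actual pins.

```hcl
# Illustrative versions.tf, stamped into every generated repo.
terraform {
  required_version = "~> 1.0"

  required_providers {
    openstack = {
      source  = "terraform-provider-openstack/openstack"
      version = "~> 1.48"
    }
    tfe = {
      source  = "hashicorp/tfe"
      version = "~> 0.30"
    }
  }
}
```

Because the same file is generated everywhere, bumping a provider version is a one-line change in the template rather than 3,000 manual edits.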
Something else that we learned: we have multiple clouds on the backend. This is our provider configuration, which points to London, Amsterdam, and Frankfurt. If you look, there's a clouds.yaml (an implementation detail as well), and this is how we do more than one cloud: we point an alias at each and go through.
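Pointing aliases at entries in clouds.yaml might look roughly like this; the alias and cloud names are assumptions for illustration.

```hcl
# Illustrative multi-region setup: each "cloud" value names an
# entry in clouds.yaml.
provider "openstack" {
  alias = "ams"
  cloud = "amsterdam"
}

provider "openstack" {
  alias = "lon"
  cloud = "london"
}

provider "openstack" {
  alias = "fra"
  cloud = "frankfurt"
}

# A resource then picks its region explicitly:
# resource "openstack_compute_instance_v2" "vm" {
#   provider = openstack.ams
#   # ...
# }
```

Credentials and endpoints live in clouds.yaml, so the Terraform code itself stays free of secrets and identical across regions.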
Let's take a look at the workspace. It's just set to local. These are all the users and their usernames here. We look up some different resources. This is the OpenStack project (the same idea as an AWS account), and it applies a whole bunch of settings based on the user.
Much more interesting to me are the dev workspaces. We also have a module called bootstrap; it bootstraps a workspace that we link for every single user. We first create the GitLab repo, then the workspace, then we put in the variables that chain it all together so we can spin up the virtual machines.
We take all of the secrets that get generated here and propagate them to the workspaces. We do this all in one central location, so when we have to rotate passwords, I can do it in a single place. I don't have to log in everywhere; it propagates, and there's not a lot of work.
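Bootstrapping a per-user workspace and propagating a secret with the hashicorp/tfe provider could look roughly like this; the organization, repo identifier, and variable names are made up for illustration.

```hcl
# Hedged sketch of the per-user bootstrap with the tfe provider.
resource "tfe_workspace" "dev" {
  name         = "dev-${var.login}"
  organization = "booking"  # illustrative org name

  vcs_repo {
    identifier     = "devvms/${var.login}"  # the user's GitLab repo
    oauth_token_id = var.oauth_token_id
  }
}

# Secrets are set centrally and pushed down, so rotation happens in one place.
resource "tfe_variable" "openstack_password" {
  workspace_id = tfe_workspace.dev.id
  key          = "OS_PASSWORD"
  value        = var.openstack_password
  category     = "env"
  sensitive    = true
}
```

Because the central workspace owns these resources, rotating a password means changing one variable and letting Terraform fan the new value out to every dependent workspace.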
Let's go look at a workspace quickly. This is mine, and you can see what I kicked off earlier: I removed all the VMs. This is the thing I kicked off from the CLI before. We have various Booking-specific variables that we propagate.
We also have some secrets (our firewall user and passwords) that we use to do some orchestration so I can work on my various code. We often update a single template; it updates the master workspaces and then propagates to the 3,000 workspaces when we update Terraform, so we can change a module in one central location.
It runs in for loops and reads through our code (obviously after we test it), and then updates everything at once. This is critical to how we do compliance. How do you do it centrally? How do you scale so that we don't have to log in to 3,000 workspaces and do it all manually? This is how it all works.
Then, as you can see, I have a repo. This is what you saw me interacting with before on the CLI. This is what's possible; this is how you can scale. Very early in the days when we were working on this system, a lot of people would say, "Terraform Enterprise doesn't scale. This doesn't work." Well, we're doing it with 3,000 workspaces, and that's for one use case. We have about ten other use cases; I don't know the exact workspace count there, but it's in the tens, it's not a lot. This is what we're doing, so it's possible.
By the way, we did this all during the pandemic. Holy shit, that was hard. All remote. Datacenter stuff broken; we couldn't get in. We had to build the cloud. Trying to talk to everybody over Zoom. We hired new people. Never met them. That was hard.
But what was achieved? Via new tooling, processes, and system improvements, we went from 45 minutes to spin up a virtual machine down to under five minutes. That's not solely because of Terraform, but because of the other things we did: using Packer, pre-baking the VMs, and using the new, better-architected systems.
We got a lot of kudos from our users. It's quality-of-life-changing when something you used to wait 45 minutes for now takes five. You don't click a button, wait for an email, go get a coffee, come back, forget what you were doing, and go use it the next day. You click the button, actually log in 30 seconds later, and have it usable in under five minutes.
Via our factory approach, we also did 100 applications and 3,000 users. From start to finish, the first application took us about three months. A lot of planning, a lot of automation. We made a bunch of mistakes, but because we had these things in place where we could upgrade as we went and change and modify as we started, we were able to move incredibly quickly.
As I said, we do mass updates from a central location, upgrading Terraform code to new module versions. That's more software engineering best practice: forward compatibility. If you go to some of the other talks or look at the best practices, forward compatibility is a big thing. You have to design it in and think ahead, and then you can take these types of approaches using those techniques.