Learn how a team of around 100 developers helps an ops team of 4-5 build infrastructure modules at Hotels.com (a brand of Expedia) with the help of Terraform.
Hotels.com (a brand of Expedia Group) started its cloud migration journey a few years ago and infrastructure as code greatly accelerated and secured the process. Terraform provided their infrastructure coding platform.
In this talk, Oliver Fenton, a principal data architect, and Pradeep Bhadani, a big data engineer, walk through their use case for HashiCorp Terraform. They'll also cover some of the processes that they created for development teams to collaborate and help the operations team build infrastructure in the form of Terraform modules.
Oliver Fenton: Hello, everyone. I’m Oliver, and I’m presenting with my colleague Pradeep on how we use Terraform at hotels.com and, in particular, how we use it in a collaborative way: not just having our operations team run Terraform code, but getting our developers to write Terraform, as well.
I’m a principal data architect, so you might hear a lot of things in this talk about the data world and data infrastructure, but it applies to all Terraform infrastructure.
Pradeep is a big data engineer, so also working in the data world.
Who are hotels.com? We’re 1 of about 20 brands of the wider Expedia group. You might have heard of our other brands, such as expedia.com, Egencia, or Vrbo, our vacation rentals business. But this talk’s going to be primarily focused on hotels.com and our journey to the cloud using Terraform.
Stepping back, 3.5 years ago, we were in the data center. If a development team wanted to push an application to production in that world, here are the steps they needed to go through: They started by ordering some hard drives, so they went to our procurement team. Then that hardware had to be installed and configured, so the networking team had to put the racks into the data center and plug in the networking. Then the operations team had to install the applications and the core services such as health checks, monitoring and alerting, CPU metrics collection, disk usage, etc. And then, finally, the development team, typically a DevOps team, could go and deploy the application.
Now, if we look at some trends through this process, we can see that as we go down through the different phases, the time taken to do each task decreases, and it decreases significantly.
The procurement, if you’re in an enterprise business like Expedia, can take a few months. Physically installing the hard drive can take weeks, if not a month. As for deploying the application, we work in a DevOps world. We use continuous delivery, so we can deploy every day, every hour, or even every minute.
We can also see the number of skilled people in the different areas increases as we go down these phases, as well. We have our procurement team and the data center team; there are very few of them, and they work across the whole of Expedia Group. In the operations team, just for data within hotels.com, we have about 4 or 5 people working in that group. But we have maybe 100 development engineers, just in hotels.com data engineering. So you can see, as we go down, we have more people and we can scale better.
Now let’s jump into the cloud. In the cloud, we no longer need to procure hardware, and we no longer need to physically install it. Those tasks get taken up by our operations team, so they do the procurement and they do the physical installation, but no longer is it physical. These days, they just use infrastructure as code, or Terraform, to do that work.
It’s not a big leap to say, "Why don’t we have our development teams stepping up a level and doing some of that heavy lifting of installing the infrastructure themselves?" And that’s what we’ve aimed to do to be able to scale the business.
We’ve got 6 themes we’ve been working through, and we feel they’re important to get to this level of allowing our developers to deploy the right infrastructure:
• Security and auditability
• Documented and version-controlled infrastructure
• Configurable environments
• Code reuse
• Automated deployments
• Testing
No. 1, and always No. 1, is security and auditability. Whatever we do, even as we give our developers more power, we must ensure we have a good, strong security posture. Next, we need to document and version-control our infrastructure. We use GitHub for that.
We need configurable environments. How do we go from lab to stage to production and ensure we have consistency as we step through those different environments?
And we want code reuse. A development team might use a piece of infrastructure, but maybe there’s another development team over here in another part of the business, or maybe in another business entirely, that wants to reuse that same infrastructure. We want to provide a way of allowing those kinds of reuse in a modular way so that we can scale and allow our developers to collaborate on infrastructure.
And because we’re now treating infrastructure much closer to any other code, we need to look carefully at how we do automated deployments and testing of that infrastructure.
I’m going to walk through the journey that hotels.com have gone through over the last 3.5 years moving to the cloud, how we’ve rolled out infrastructure as code and the patterns we use there, how we do code reuse, how we’ve been doing automated deployments, and, much more recently, how we’ve been looking at testing our infrastructure.
Let’s start with infrastructure as code. We use GitHub heavily for this, and the idea is that each application has its own GitHub repository. Traditionally, the application code and maybe the documentation lived in that repository. But we want to have our infrastructure in the same repository. In theory, you could go to a new AWS account, run terraform apply on any repository, and get a complete deployment of that infrastructure, ready to go. But, in reality, it’s not as simple as that.
Because we need to ensure our security posture, we separate out our networking and our IAM. So things like VPCs, subnets, macros, direct connects, they’re all in the networking layer and they’re run by our operations team. User permissions, roles and policies, they’re in the IAM repository. But we actively encourage the dev teams to go and make pull requests on those different repositories. So, if a team is running Redshift, for example, in developing an application, and they need a specific role or policy, they can go and make a pull request on the IAM repository.
And the ops teams that own that repository just need to assess that, understand the security, and if it’s accepted, they can then merge and push that through to production. But we try and allow our developers as much power as we can give them.
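As a sketch, a pull request from a dev team to the IAM repository might add a role and policy like this. Everything here is illustrative, not Hotels.com’s actual code: the role name, bucket, and policy contents are placeholders.

```hcl
# Illustrative only: a role a dev team might request for a Redshift application.
resource "aws_iam_role" "redshift_app" {
  name = "redshift-app-role"

  # Allow the Redshift service to assume this role.
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { Service = "redshift.amazonaws.com" }
    }]
  })
}

resource "aws_iam_role_policy" "redshift_app_s3" {
  name = "redshift-app-s3-read"
  role = aws_iam_role.redshift_app.id

  # Read-only access to the team's bucket (bucket name is a placeholder).
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = ["s3:GetObject", "s3:ListBucket"]
      Resource = [
        "arn:aws:s3:::example-team-bucket",
        "arn:aws:s3:::example-team-bucket/*",
      ]
    }]
  })
}
```

The diff is small and self-contained, which is what makes it practical for the ops team to review it purely for its security implications before merging.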
Let’s take a look at one of these repositories, our data lake.
If we look at the infrastructure code within that repository—don’t worry too much about the application code—this is how we deploy to different environments. Here we have a lab, in US East, and a prod, also in US East. And we want to ensure that we have consistency between lab and prod. What we do is we have a master directory, and the master directory contains the common code that’s going to be used in both of those different environments.
In hotels.com, and in the wider Expedia Group, our data lake’s called "apiary," and here we have a file called apiary.tf. This is the infrastructure to spin up our data lake, and you’ll see more about that in the demo later.
We also have provider.tf and variables.tf. We try and separate out our Terraform files into clear, logical components to make it easier to understand. If we have a look in our lab US East or prod US East directories, we can see we have symlinks referencing those files. So we ensure consistency and the same environment will get deployed in the lab as in prod.
It’s a bit painful to set up this symlink infrastructure, but once it’s set up, it just works.
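Putting the pieces together, the layout described above might look something like this. The directory and file names are illustrative, reconstructed from the description rather than copied from the actual repository:

```
├── master/
│   ├── apiary.tf
│   ├── provider.tf
│   └── variables.tf
├── lab/
│   └── us-east-1/
│       ├── apiary.tf    -> ../../master/apiary.tf     (symlink)
│       ├── provider.tf  -> ../../master/provider.tf   (symlink)
│       ├── variables.tf -> ../../master/variables.tf  (symlink)
│       ├── backend.tf          (environment-specific)
│       └── terraform.tfvars    (environment-specific)
└── prod/
    └── us-east-1/
        └── (same symlinks, plus its own backend.tf and terraform.tfvars)
```

Because the shared files are symlinks rather than copies, a change to master is picked up by every environment on its next plan.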
There are also 2 files that are specific to those environments. We have a backend.tf where we store our states. We use remote states that we store in S3 buckets within those environments, and we encrypt the file, and we also lock the permissions down so only the people who need to see it can see it. We highly recommend everyone does that because there can be secrets stored in those statefiles.
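A per-environment backend.tf along those lines might look like this. The bucket and key names are placeholders; the encrypt flag and the S3 backend syntax are standard Terraform:

```hcl
# backend.tf for one environment (bucket and key names illustrative).
terraform {
  backend "s3" {
    bucket  = "example-terraform-state-lab"
    key     = "datalake/terraform.tfstate"
    region  = "us-east-1"
    encrypt = true # encrypt the statefile at rest
  }
}
```

Restricting who can read the bucket is done outside Terraform’s backend block, via the bucket policy, which is the "lock the permissions down" step mentioned above.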
And there’s the terraform.tfvars file. This is the configuration that’s specific to each environment. If we look at how we run this, it’s really simple. We enter one of the directories—here it’s lab/us-east-1—and we run terraform init, plan, and apply. And you can see, this is a reference to the backend file, so it’s just pointing at our remote state, and you can see, we encrypt the file, too.
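As a sketch, the environment-specific terraform.tfvars might carry values like these. The variable names are illustrative, not the actual ones used in the apiary repository:

```hcl
# terraform.tfvars for lab/us-east-1 (variable names illustrative).
aws_region    = "us-east-1"
environment   = "lab"
instance_name = "apiary-lab"
vpc_id        = "vpc-0example"
```

Because terraform init reads backend.tf and terraform plan/apply automatically pick up terraform.tfvars, the shared code in master never needs to know which environment it is running in.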
Separating out the networking and the IAM gives us a level of security and auditability and some control around that. By using GitHub, we have documented, version-controlled infrastructure, and using the symlink pattern, we get configurable environments.
With that symlink pattern, we got some code reuse between those different environments, and if there’s another team who wants to go and deploy their own instance of the data lake, for example, they can fork the code, or they can copy and paste it, so it’s probably better code reuse than we had in the past when stuff was spun out manually. But we’re not really there yet. I’ll talk a bit about that next.
When we started with Terraform about 3.5 years ago, we were using Terraform 0.7, maybe version 0.8. We were in super-early days, and we didn’t really understand how we were using it. Since then, we’ve learned more and we’ve discovered Terraform modules, and we use them heavily; we highly encourage their use.
If we go back to our GitHub repositories, we have repositories for our different components, whether it be the data lake or RStudio and Redshift, and our objective is to make them into thin layers that are deployable but source different modules that we can build, that are generic, and that can get reused.
Coming back to our data lake as an example, our data lake source is a module called "data lake." We’ve taken all of that core infrastructure code and we’ve moved it into a module. And this module can be found on public GitHub, and you can go and find it and use it yourself.
And it’s not only external users looking at this repository and this module; across Expedia, other teams use it, too. So hotels.com has a data lake which is deployed using this infrastructure, but expedia.com also has a data lake using the same module. As do Vrbo, as do Egencia, as do lots of other parts of Expedia, which is great.
But if you’re in the data world and you’re talking to your analysts, having lots of siloed data lakes is quite frustrating because the hotels.com analysts want to compare Expedia data or Vrbo data or Egencia data. So we developed a module called "Data lake Federation," which makes lots of data lakes look like one data lake. It allows us to have silos and independent units but also join those data lakes together. And our main data lake repository, our deployable artifact, now sources that "Data lake Federation" module, as well.
The other module it sources is our tag module. Within Expedia, we have a set of governance and standards around how we must tag our infrastructure, so we created a module for it. The data lake repo sources that as well, as do all of our other repos. They all source the tag module.
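A tag module along those lines might be consumed like this. The module URL, inputs, and the `tags` output name are all illustrative assumptions, not the actual Expedia module:

```hcl
# Illustrative: every repository sources a shared tag module so tags stay
# consistent with the organization's governance standards.
module "tags" {
  source = "git::https://github.com/example-org/terraform-tags.git" # placeholder URL

  application = "apiary-datalake"
  team        = "data-platform"
  environment = var.environment
}

resource "aws_s3_bucket" "data" {
  bucket = "example-datalake-bucket"
  tags   = module.tags.tags # assumes the module exposes a merged tag map output
}
```

Centralizing the tag logic means a governance change lands in one module, and every repository picks it up on its next upgrade instead of each team editing tags by hand.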
We have modules, but lots of other things, as well, like Qubole, SES, IAM, and all of the main GitHub repositories have become thin layers that are sourcing these different modules.
Let’s take a look at another example, just to show a slightly different point. Qubole sources the Qubole module and the tag module, but the Qubole module also sources the Bastion module and the NLB module. Having a hierarchy of modules, where modules source modules, is a perfectly accepted and, in fact, encouraged practice.
How do we use these modules? It’s really simple. To use a module, you can go in and look at the repository on public GitHub. And there’s an example of a couple of parameters you pass into that module. Later on, Pradeep will show you the full set, and there’s a lot more parameters than this in the demo.
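A call to the public module might look like the sketch below. The repository path is written from memory (check the Expedia Group organization on GitHub for the current location), and the variable names are placeholders; Pradeep shows the real parameter set in the demo.

```hcl
# Illustrative call to the public apiary data lake module; see the module's
# README on GitHub for the actual required inputs.
module "apiary" {
  source = "git::https://github.com/ExpediaGroup/apiary-data-lake.git"

  aws_region      = "us-east-1"
  instance_name   = "apiary-demo"
  vpc_id          = "vpc-0example"
  private_subnets = ["subnet-0a", "subnet-0b"]
}
```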
Some of our modules are on public GitHub, and you can go and find them there. Or they’re on an internal repository, but the ones on the internal repository are made public to all internal users, and we strongly encourage users to reuse, raise issues, and create pull requests on those modules.
About our collaboration: We have over 75 modules in hotels.com, and there are over 30 contributors. Earlier I said we have about 4 or 5 operations people and maybe 100 developers within the data part of the organization, so we’ve got about a third of our developers developing Terraform modules, which is great. That’s exactly the kind of thing we want to encourage.
We think we’ve done a pretty good job with code reuse.
Now I’m going to hand over to Pradeep, who’s going to talk about automated deployments.
Pradeep Bhadani: Thanks, Oliver. Hi, all.
As we move into production, we don’t want to deploy anything manually, because it’s error-prone and it’s kind of boring. We want some kind of pipeline which can take the code from GitHub or any version control, pick up the changes, and apply them to the environment we want.
We use Jenkins for our pipeline, set up in a master-slave architecture where the slaves have the permissions and capabilities to deploy to our AWS accounts. Once a job runs, it can send email or Slack notifications to the operations team or to the respective team. The pipeline takes the Terraform code from version control, runs a plan, and sends the plan out; if the person is happy, they approve it, and then it runs an apply and sends notifications.
Some application code requires confidential information, such as a password. For that, Jenkins can fetch those details from Vault or AWS Secrets Manager and pass them to the jobs, so they are not hard-coded or stored in the version-control system.
Here are some screenshots of our Jenkins, which deploys our apiary data lakes. We are deploying 4 different environments: analytics, lab, prod, and sandbox.
Using Jenkins for our Terraform code, we have checked the automated deployment box, we have confidence that the different teams can deploy their Terraform codes using Jenkins and this automated process.
With this automated process, though, we don’t have tests yet, so we don’t have the confidence that whatever code developers are building won’t harm the existing system, or that it works as expected.
We want to add some testing. So how are we doing the testing? We don’t have really extensive testing for our Terraform at the moment. We do basic testing like static code analysis using the terraform validate command, which checks whether the Terraform syntax is right or wrong. And we have developed some custom Python scripts which scan the code and check whether our governance rules are followed or not. Those are not really extensive, and we continuously encourage our users to build more and more tests for the codebase.
For post-deployment testing, we use AWSpec, which is an RSpec-based library, to check the AWS infrastructure, plus some custom RSpec tests for checks that aren’t available out of the box in AWSpec.
Here’s a long wishlist which we have for testing, which we have categorized in 4 different ways:
• Offline development
• Functional testing
• Integration testing
• Production testing
In offline development, we want a developer who’s writing Terraform code to do some level of testing on their own workstation using static code analysis, which we pretty much have with terraform validate. But the thing we don’t have is unit testing, so we would like a mocking framework which could do some testing at the workstation. We have searched a lot on the internet and haven’t come across any good tool, so we are continuously working on developing something. If you have come across anything, please share it with us as well.
For the functional testing, we deploy in the lab environment, see how the Terraform code is behaving, run some post-deployment AWSpec tests, and see if the system is working as expected.
In the integration testing, we deploy the incremental change in the pre-production environment, and this tells us the system behavior is still consistent and it’s not behaving in a weird manner.
And in the production testing, we want to know, once we deploy the code in production, that it works well, and we don’t want anyone to make changes outside Terraform. For example, if I create a security group using Terraform, I don’t want anyone else using the AWS CLI or the AWS console to change that security group. We want to manage everything through our Terraform code.
So we don’t have full testing, and we are continuously working on improving our testing strategy for the Terraform code.
With all these principles, let’s put it together and see how we deploy our Expedia data lake, which we call "apiary."
Before we look into the Terraform code, I’ll give you a quick introduction on what our data lake looks like. The data lake uses different AWS services like S3 buckets, ECS Fargate clusters, RDS, PrivateLink, CloudWatch, AWS Secrets Manager, and other components. With that many different types of resources, it’s really hard for a DataOps team to deploy this data lake consistently across environments without Terraform; if you go with scripts, it’s really hard to do in a consistent manner.
We developed a Terraform module which can deploy this whole data lake in a few minutes and in a consistent way, and we can deploy this in multiple environments. At the moment, I think, we have more than 10 different deployments in production using these modules.
The code becomes really simple. The module source is apiary-datalake, which lives on our Expedia Group GitHub. You pass in some basic AWS parameters, some data lake-related parameters, container sizes, and so on, and it gives you the whole data lake we saw in the previous architecture diagram.
I have a demo. Here I’m setting our AWS credentials, and first I ran our AWS resource command to see if there is any resource with a tag "apiary demo." There’s nothing, which means this is a clean environment, where apiary is not deployed. From the application code, I’m initializing the Terraform repository, which downloads all the dependencies, including my public GitHub modules and different AWS resources, and the providers like null, template, AWS, and whatever is required for the codebase.
It initialized successfully and set up the backend. Now I’m going to run a plan and save it to a datalake.plan file, so that I can reuse that file and auto-apply it. Once I run the plan, it creates a lot of data sources (this is to get different things from the AWS API), scans all the code, and shows me what it’s going to do. It’s going to add 54 different resources, which is really hard to deploy when you’re doing it with a Bash script or writing your own code against an AWS SDK.
Now I’m going to take the plan and apply it. It will take some time, deploying the security groups, creating RDS clusters, creating load balancers and the different AWS components which are required for a data lake, and running some local setup scripts. The apply is complete; all 54 resources are up there now.
Now I’m going to run the same AWS command to see if I see any AWS resources. I see a bunch of resources. I see security group, RDS cluster, S3 bucket, load balancer, and so on.
It’s so easy; the whole process takes maybe less than half an hour, and RDS takes the majority of that time. In that short time, you get a whole brand-new data lake, and different teams can use that module to get the same result consistently.
We have been using Terraform a long time, and there are some tips and hints we have learned that we want to share with you.
We have learned to use a separate Terraform statefile for each application and environment. In the past we tried merging 2 applications into 1 statefile, but that doesn’t work well, because a problem in one application can spill over into the other. To reduce the blast radius and contain problems, have a statefile for each application and environment.
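Concretely, that means the state key encodes both the application and the environment. The bucket and key scheme below is an illustrative sketch, not the actual Hotels.com naming:

```hcl
# Each application/environment pair gets its own statefile, e.g. (illustrative):
#   s3://example-terraform-state/datalake/lab/terraform.tfstate
#   s3://example-terraform-state/datalake/prod/terraform.tfstate
#   s3://example-terraform-state/redshift/lab/terraform.tfstate
terraform {
  backend "s3" {
    bucket  = "example-terraform-state"
    key     = "datalake/lab/terraform.tfstate" # unique per app + environment
    region  = "us-east-1"
    encrypt = true
  }
}
```

With this scheme, a corrupted or locked statefile only ever blocks one application in one environment.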
Logically separate the Terraform code instead of writing the whole thing in one 10,000-line main.tf. We suggest every developer write separate .tf files within the folder so the code is logically separated. For example, if you have an application which creates a few IAM roles and S3 buckets, have an iam.tf with the IAM-related content and an s3.tf with the S3 bucket-related content, so it looks clean to anyone else who wants to read your code.
We have learned that modules are really good in Terraform. They allow very good code reuse and make the code easy and clean. Have your developers version them, so that when you make incremental changes to a module, you don’t break application code that’s using an earlier version of it.
Sometimes, you might introduce breaking changes, and other teams may not be ready to accept them based on their sprint planning, etc. One repository per module is better, because then you can version each module separately, and it allows development to scale really nicely.
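Pinning a consumer to a tagged release is standard Terraform module-source syntax; the repository URL and tag below are illustrative:

```hcl
# Pinning a module to a tagged release lets each consumer upgrade on its own
# schedule when a breaking change lands in the module.
module "datalake" {
  source = "git::https://github.com/example-org/apiary-datalake.git?ref=v1.2.0"

  # ... module inputs ...
}
```

Teams that aren’t ready for a breaking change simply keep their `?ref=` pointing at the older tag until their sprint allows the upgrade.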
Use AWS Secrets Manager or Vault to store sensitive information, encrypt the Terraform statefile, and make sure the bucket policies are set properly, so that only the right set of users can access your statefile and no one else can make changes accidentally.
Version the S3 bucket where you’re storing the statefiles so that you can recover them if someone accidentally deletes one.
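As a sketch of that state bucket, using the inline `versioning` block from the AWS provider versions of that era (newer provider versions split this into a separate `aws_s3_bucket_versioning` resource):

```hcl
# State bucket with versioning enabled (bucket name illustrative).
resource "aws_s3_bucket" "tf_state" {
  bucket = "example-terraform-state"

  versioning {
    enabled = true # deleted or overwritten statefiles can be restored
  }
}
```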
We also replicate those S3 buckets, for a different reason: disaster recovery. And we use data sources heavily instead of the terraform_remote_state data source, because we don’t want to depend on someone else’s remote state for information we can get straight from the AWS API through a data source.
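The difference looks like this in practice. The tag value and resource names are illustrative; the point is that the lookup goes through the AWS API rather than through another team’s statefile:

```hcl
# Illustrative: look up the VPC by tag via the AWS API, rather than reading
# the networking team's state with the terraform_remote_state data source.
data "aws_vpc" "main" {
  tags = {
    Name = "example-main-vpc"
  }
}

resource "aws_security_group" "app" {
  name   = "example-app-sg"
  vpc_id = data.aws_vpc.main.id
}
```

This way, a refactor of the networking repository (renamed outputs, moved statefile) can’t break downstream plans, as long as the resource’s tags stay stable.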
To summarize, we encourage operators and developers within hotels.com and different parts of Expedia to use Terraform in a collaborative way. We use modules for code reuse, Jenkins and other CI/CD pipelines for automated deployment, and some testing. But all these things need a change of culture, which is really important in a company.
That’s it. Thanks for your time. Thanks for listening to us.