Box's journey to the cloud with HashiCorp Packer and Terraform
Nadeem Ahmad shows us how Box moved from traditional bare-metal servers to modern, elastic, cloud-based infrastructure -- in less than a year.
The Productivity Engineering team at Box has a mission to enable company engineers to be more productive. It owns the continuous integration (CI) pipeline for Box services and infrastructure.
Historically, Box ran its infrastructure on single-purpose, long-running bare-metal nodes. Multiple problems arose from this approach, but for a CI team the standout problem, felt by engineers across the company, was host divergence: test failures that appeared on one node but not another.
With the aid of HashiCorp Packer and Terraform, Box has successfully moved its entire infrastructure from bare-metal servers to a hybrid cloud, consisting of OpenStack and AWS. Box is now able to tear down and bring up infrastructure with levels of agility unimaginable in the bare-metal world.
There were many challenges faced on the way, and some valuable lessons. Nadeem discusses those challenges, the solutions, and what they learned. His talk details how important Packer and Terraform are to those solutions. The session also covers some of the tools and user-friendly abstractions Box built on top of Packer and Terraform.
Speaker
- Nadeem Ahmad, Senior Software Engineer, Box
Transcript
I'm just going to go right into it. As you can already see, the title of my talk is Journey to the Cloud with Packer and Terraform.
Just to get some PR stuff out of the way first, I work for a company called Box. If you haven't heard of it, it's a modern content management platform. What that means is it allows you to store your files in the Cloud securely, share them with other members of your company or other people in the world as [00:00:30] well. Check us out at box.com.
The team that I work on doesn't really care about the Box product directly. The team is called Productivity Engineering, and our customers are the 300 engineers that work for Box. Our mission is to make them more productive in any way we can.
One of the ways that we do that is, we own the continuous integration infrastructure and pipeline. [00:01:00] We allow developers to rapidly run their tests and get feedback on them. In light of this, we wrote a tool called ClusterRunner. This will set the context for the talk. What this tool allows you to do is run your tests in parallel across a fleet of machines. So, instead of your tests taking a really long time and running slowly, if you use ClusterRunner, they run very quickly.
This [00:01:30] is a dashboard showing ClusterRunner in action. The large squares here represent a build, a particular type of test [inaudible 00:01:38] that's running. The smaller circles represent different nodes. All these nodes in the ClusterRunner cluster are supposed to be identical, so your tests can run on any one of them.
The reason that this is important is [00:02:00] the purpose of this talk. Just to make sure that everyone's in the right room, how many of you have run, or still run bare metal data centers in your companies? Quite a few of you still.
A couple of years ago, on the Productivity Engineering team, we used to run our CI infrastructure on bare metal. You can already imagine some of the problems with this [00:02:30] for a team that is focused on getting developers to run their tests quickly. This talk will cover how we took that bare metal infrastructure, moved it to the Cloud, and how we got there.
A couple of years ago, the [inaudible 00:02:48] was in quite a bad place. We had these [inaudible 00:02:53] feeder machines, running entirely on bare metal servers, completely underutilized. [00:03:00] There was no concept of elasticity. Once a server was up, it was staying up. If you needed to change [inaudible 00:03:06], we had to use configuration management tools like Puppet to change the [inaudible 00:03:11].
In particular, for us as a CI team, we had a huge problem with host divergence. Imagine you have about 300 machines in your infrastructure, and your tests could pick any one of those machines. If one node is different from the rest and your test fails on that one node and not the others, you have a huge problem and a huge drain on [00:03:30] productivity for your developers. Finally, this problem is a little more specific to Box, but we were heavily reliant on other teams to manage our infrastructure for us.
If we had more developers join us and needed more servers now, we had to put a purchase order through, and it would take months to get the servers up and running and ready for use. If we needed to reprovision a server, we'd have to go to the operations team and have them do it for us, which often meant [00:04:00] we had competing priorities to deal with. If you've worked in an enterprise company, you know that they're always focused on production servers. As a result, we decided to go to the Cloud, like anyone in this situation would do.
At this point we had sort of two options. We had all these bare metal machines still ready for use, and then we had these public Cloud options, [00:04:30] so we had a decision to make. We had some budgetary constraints that prevented us from jumping immediately to the public Cloud, while we still obviously had this huge number of bare metal machines that we could use. We decided to use a hybrid approach, utilizing both an internal private Cloud and some public Cloud capacity in AWS. The goal was that we would utilize our current servers until they reached end of life, and [00:05:00] then any future growth would happen on the public Cloud side.
Now you're probably thinking, how do you set up your own private Cloud when you have no Cloud? It takes a lot of work to set up your own private Cloud. This is where a vendor called Platform9 comes in. It allows you to utilize your existing bare metal infrastructure and get a private Cloud going in days, literally. [00:05:30] We started with this. They do charge quite a bit of money, but it's very good to get started with them. You can build out your private Cloud automation using them until you build your own private Cloud. In our case, we got lucky, as our operations team ended up building our private Cloud for us. We started out with Platform9 and eventually migrated over to our internal private Cloud.
As a result, when we were thinking about this problem, [00:06:00] we had to solve for multiple Cloud providers. Obviously we had the internal private Cloud hosted on OpenStack, and we wanted future growth to happen on Amazon, so we needed to support both these Clouds. We also wanted the ability to switch to different Cloud providers, like GCP if needed, or to build across multiple Cloud providers, like Amazon, Google, OpenStack, any one of these. [00:06:30] Our goal was clear. We wanted to build an abstract system that would work across these multiple Cloud platforms.
What I mean by an abstract system is, we wanted to be able to deploy our infrastructure that spans these multiple Clouds. We wanted to do that in an automated fashion. This was our goal. Just to get some concepts out of the way, [00:07:00] I'm going to spend some time explaining how we went about this problem.
Traditionally, when you're running bare metal infrastructure, you use configuration management tools like Puppet or Ansible to configure your machines the way you want them to be. What that means is that you spin up some sort of base OS and then [00:07:30] you apply your Puppet manifest, your Ansible playbook, or your Chef cookbook, whatever you want to call it, to that base OS, to turn that server into something useful. If you need to make a change to that server, you go back to your configuration management tool, make the change, and then run the configuration management tool on that existing server. These servers can run indefinitely.
In [00:08:00] theory this sounds like a sound approach, until you start scaling up. You leave these machines running for a very long time. As you probably know, in almost 90 percent of cases, probably even more, you don't have the configuration management tool manage the entire machine. You have it manage maybe 80 percent of the machine. You still have portions of the machine that are unmanaged, [00:08:30] so if you make changes to those portions, the machine will drift from what it was intended to be initially. Another particular problem is, if you need to change the role of the machine, it's very difficult to do so, because once you apply one role to the machine, it leaves a whole bunch of configuration bits on the machine. To change to a different role, you have to do a whole bunch of deprovisioning work, which is very painful.
The natural consequence of this is to not use that model, and [00:09:00] instead use what is termed the Immutable Servers Model. In this scenario, instead of provisioning just a base OS server, you provision an instance that has everything your service needs to run. You bake everything that you need for a particular service into an image, and you deploy that image as an instance. Now, if you need to make changes to your instance, you don't change [00:09:30] the live running instance. Instead, you make a change to your base image and you redeploy the instance. You're always starting from some sort of known state, and it allows this versioning of your infrastructure.
This is quite a leap from having completely long-running machines, so there's an intermediate approach that you can take, which is referred to as Phoenix servers, where [00:10:00] you still continue to run your configuration management tool after the server is up; however, you redeploy your instances from scratch a lot more often. What this shows you is that your Puppet manifests, your Ansible playbooks, can indeed build a server from scratch, and that your infrastructure or your services will still run when they're torn down on a frequent basis.
Eventually, if you're doing this enough, you can set goals like, I'm going to redeploy my infrastructure weekly, and [00:10:30] then you get to the point of daily, and so on. If you're doing this enough, you just automatically get to the Immutable Infrastructure Model because you're doing it so often. Puppet, for example, generally runs once every hour; if you're redeploying your infrastructure every hour, there's no need for Puppet anymore.
This is the approach that we took. We said, okay, we're going to build these fully baked images, but we're still going to continue to run our [00:11:00] configuration management tool, like Puppet, on them, even after they come up. Eventually we want to get to the point where we can go completely to the Immutable Model. Obviously we needed a way to build images and verify those images. You don't want to just start building images without some sort of sanity checks to make sure they are what you expect. Eventually, you want to deploy instances from those verified or approved images.
[00:11:30] For logical purposes, we decided to call the building and verification of images Phase One of our project, and the deployment of instances from those images Phase Two. We also set some, I guess, arbitrary requirements. Being a productivity engineering team, we wanted the system that we would ultimately build to be used by other developers at Box to build their own sandbox [00:12:00] environments and [inaudible 00:12:01] experimentation and that sort of thing. You want such a system to be easy to use and self-service. We also wanted it to be fully automated. What I mean by fully automated is, say I have a Jenkins node that I'm bringing up using this system. I want the system to automatically attach that Jenkins node to an existing Jenkins master that I tell it about. I push a button, and this just happens. This is what our ultimate goal was. Of course, I've briefly touched upon this: we [00:12:30] wanted the system to have built-in validation. That goes without saying.
Let's get into how we went about building and verifying these images. Well, thankfully, there's an open source company called HashiCorp that specializes in this sort of thing. They have a tool called Packer, which abstracts away the OpenStack APIs and Amazon APIs and Google APIs, and allows you to build images very easily. [00:13:00] I'm sure you'll hear more about Packer, but the thing that I really liked about Packer is that it supports multiple Clouds. It's very easy to start today with OpenStack and Amazon, and then later on, if you decide that Google is offering you a better deal, you can switch gears and try Google.
Another really important thing was the fact that Packer allows you to use your existing Puppet manifests, Ansible playbooks, and Chef cookbooks and recipes [00:13:30] without having to change them. This is very important. If you've been doing something a certain way for years, you don't want to rewrite your entire configuration management system into something else. So this is very powerful. It allows you to do this out of the box. You can get up and running very easily. Great, we're almost done here.
What we realized was that Packer is built for people who are working with images on a day-to-day [00:14:00] basis. This is a sample Packer JSON file. This is hard to follow if you're just getting into Packer. We wanted to abstract some of the complex bits away and make it easy for developers to use. I'm going to run through this file quickly to highlight the different sections, and then I'll talk about how we solved this.
This is the provisioner section. In this case, this is a shell provisioner. What I'm doing with this provisioner is installing Python 3.3 [00:14:30] using yum. Of course, this could be any type of provisioner. I could have the Ansible provisioner, Puppet provisioner, whatever I need to build my image. The builder section says that I ultimately want to build this image in OpenStack, as the builder name specified there. Notice how the credentials for this OpenStack instance are specified in the Packer JSON file. We didn't ultimately want developers to have to care about these sorts of details. We just wanted [00:15:00] them to have an image.
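To make that shape concrete, here is a minimal sketch of what such a Packer JSON template might look like; the endpoint, credentials, image names, and flavor are placeholders for illustration, not Box's actual values.

```json
{
  "builders": [
    {
      "type": "openstack",
      "identity_endpoint": "https://openstack.internal.example:5000/v3",
      "username": "ci-builder",
      "password": "REPLACE_ME",
      "tenant_name": "productivity-eng",
      "source_image_name": "centos7-base",
      "flavor": "m1.medium",
      "ssh_username": "centos",
      "image_name": "jenkins-node-python33-{{timestamp}}"
    }
  ],
  "provisioners": [
    {
      "type": "shell",
      "inline": ["sudo yum install -y python33"]
    }
  ]
}
```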
We decided to write our own tool on top of Packer. The cool thing about Packer is that you can easily automate on top of it. It's very easy; there are JSON files everywhere. The way that we decided to solve this problem is we thought about how a developer would think about it. I have a source image. This could be a base OS image, or it could be some other image that someone else put relevant packages in. Then I [00:15:30] want to run a set of provisioners on that image, and I want to bake that into a complete image that I can then use to deploy instances from. Then I want to run some verification steps on that image. I will ultimately end up with a verified image that I can actually deploy, that I feel confident in.
The piece that Packer didn't supply out of the box was the verification step. With Packer, you can use provisioners to [inaudible 00:15:59] [00:16:00] and do some of that verification right when you're building the image. However, the verification we wanted was to actually bring up the instance separately from Packer and run the validation checks then, because we ran some post-processing steps on the image, compressing the image, those sorts of things. The image was fundamentally different from when Packer was building it. This verification happened after the image was already complete.
Instead of [00:16:30] the Packer JSON file on the left, which is obviously a simplified version of the Packer JSON file, there's a lot more going on there, with our internal tool, which is built on top of Packer, we simply specify a few lines of YAML. Notice how you have a source image section and a provisioners section. This is a list of provisioners, and these individual provisioners are just what you saw earlier: separate files with the [00:17:00] provisioner section that Packer expects. Now you have the power to reuse these provisioners for different images as well. They're in the same repo, so you can just reference them in your image definition. That's what we call this YAML file.
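As a rough illustration of the idea (the field names here are assumptions about the internal format, not Box's real schema), an image definition might look something like this:

```yaml
# Hypothetical image definition for the internal image pipeline tool;
# field names are illustrative, the real schema may differ.
source_image: centos7-base            # a base OS image or another built image
provisioners:
  - install-python33                  # reusable provisioner files from the same repo
  - configure-jenkins-agent
verifiers:
  - python33-installed                # run against an instance booted from the finished image
  - jenkins-agent-running
```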
Of course, you also have a verifiers section now, which is separate from what Packer provides. You can reuse these verifiers as well. The verifiers follow a similar format to the provisioners, and they run separately. [00:17:30] So, instead of invoking Packer using the standard packer build packer.json, we simply invoke image pipeline, specify the name of the image, and then pass in the builder. [inaudible 00:17:44] able to use the same image definition and build across whatever Clouds we support, currently OpenStack and AWS. If you want to build an image, you take the same image definition, pass it in to image pipeline, it uses Packer underneath, and you get [00:18:00] your image.
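As a purely hypothetical example of that workflow (the real command name and flags for the internal tool are not shown in the talk), the invocation might look something like:

```sh
# Hypothetical CLI; the real command name and flags were not shown in the talk.
image-pipeline build --image jenkins-node --builder openstack
image-pipeline build --image jenkins-node --builder aws
```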
We store all these image configurations in a repository, which we refer to as packer-image-config. This allows us to [inaudible 00:18:08] version control the definitions, add CI to this stuff, and code review this stuff, so it makes it very easy to manage. I'm just going to go over the directory structure of this repository quickly to give you an idea.
Of course, you have the images directory. This is all your image definitions. [00:18:30] You have the provisioners directory; these are all the different provisioners that you can use. You have the verifiers directory; these are all the verifiers you can use. It's pretty straightforward.
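Based on that description, the repository layout is presumably along these lines (names are illustrative):

```
packer-image-config/      # name and layout assumed from the description
├── images/               # image definitions (the YAML files above)
├── provisioners/         # reusable Packer provisioner snippets
└── verifiers/            # reusable post-build verification checks
```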
The way that we call image pipeline is with these Jenkins jobs, which simply invoke image pipeline for the different Clouds, and what they do is periodically [00:19:00] build these images. We say that this job runs every hour: give me a new image, and you get these verified images. If an image doesn't pass verification, it doesn't get through, so ultimately we only have verified images in our Cloud.
Just to summarize, these are some of the benefits that we saw from abstracting on top of Packer. We have, obviously, abstracted credentials. It's a little more user friendly. We have some validation now, and we have the power to reuse [00:19:30] our verifiers and provisioners.
Earlier, I showed you this slide where we divided things up into Phase One and Phase Two. Thanks to image pipeline and Packer, we can call Phase One complete. We are continuously building these images. This part is very important. The symbol you see here, the circle symbol, implies that we are continuously doing this. This is very important: you want to be continuously building and verifying images. [00:20:00] This is only successful if you're doing it on a continuous basis. You don't want to just do it once and be done with it.
Let's turn to how we take these verified images and turn them into actual infrastructure. Right now, I have done nothing useful. I just have this bunch of images; there's no real infrastructure. Again, HashiCorp comes to our rescue. Terraform is a lot more powerful than just allowing you to create instances, but one of the uses that we found for [00:20:30] it was that it's able to create instances across different Cloud providers. I'm sure other talks will teach you about Terraform and all the power that Terraform has, but this time I'm only going to focus on the part that allows you to create instances.
What we do is, now I have this verified image in, let's say, OpenStack, [inaudible 00:20:50] image. I feed it to Terraform. Terraform allows me to create an instance. Not only that, Terraform also allows me to do some [00:21:00] initialization for this instance. What this means is, if I need to add the instance that I bring up to some sort of pool, I can do that with Terraform. If I need to set a host name on the instance, anything that can't happen as part of the image itself and has to happen only when the instance is coming up, I can do with Terraform provisioners. I can do this for any number of instances that I need. Now, that one OpenStack verified image can turn into [00:21:30] hundreds of actual, real instances. You can already start to see the power of this Immutable Infrastructure sort of model. Not only that, if I decide to go to Amazon, I can use the same process, and now I can have [inaudible 00:21:45] instances running in Amazon.
Similar to what we saw earlier with the Packer JSON file, we also found the Terraform configuration format to be very specific to people who are using Terraform regularly. [00:22:00] I'm just going to point out some of the sections here. You have a provider section. This shows that I'm using Amazon as my provider. If you had OpenStack, then you'd use OpenStack as your provider, and you would pass in some credentials. In Amazon's case, of course, you have access keys and a region. Then you have a resource section, which defines that I want an AWS instance of a particular type. In this particular example, I want a hundred of these Jenkins node type instances.
Pay [00:22:30] particular attention to this provisioner section. I am calling a shell script to add a Jenkins node. The Jenkins node is the instance that Terraform is creating; the script attaches it to jenkins-master.box.com and applies the label python33 to it. This provisioner will be the same whether I'm deploying the instance in Amazon, Google, or OpenStack. I just want a Jenkins [00:23:00] node attached to a Jenkins master. There's a chance for reusability here: with Terraform, I'd have to use the module structure or I would have to duplicate this code in the OpenStack configuration.
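A simplified sketch of the kind of configuration being described might look like the following; the credentials, AMI ID, instance type, and script path are placeholders, not Box's actual file.

```hcl
# Simplified sketch; credentials, AMI ID, and script path are placeholders.
provider "aws" {
  access_key = "REPLACE_ME"
  secret_key = "REPLACE_ME"
  region     = "us-east-1"
}

resource "aws_instance" "jenkins_node" {
  count         = 100
  ami           = "ami-12345678"   # a verified image built by image pipeline
  instance_type = "m4.large"

  # Attach the newly created instance to an existing Jenkins master.
  provisioner "local-exec" {
    command = "./add_jenkins_node.sh ${self.private_ip} jenkins-master.box.com python33"
  }
}
```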
To simplify things a bit and to fit our model, we decided to automate again on top of this and abstract some of these bits away. We [inaudible 00:23:21] a Terraform configuration generator, so again, we can express our definition in simple YAML. [00:23:30] You have a provisioners section where you say, I want a Jenkins node. This takes templated provisioners and passes in the variables that I'm specifying here: my master is jenkins-master.box.com, my label is python33. Any provider-specific configuration goes in the providers section: I want a hundred of these things in Amazon and I want 50 of these things in OpenStack. I just specify the AMI I need for [00:24:00] Amazon, I specify the image I need for OpenStack, and I'm good to go. This tool simply generates the Terraform configuration I need for both Clouds, which is ready to be fed into Terraform [inaudible 00:24:11]. This speeds that process up.
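A hypothetical input to that generator (the schema is an assumption based on the description, not the real internal format) might look like:

```yaml
# Hypothetical generator input; field names are illustrative.
provisioners:
  - template: jenkins-node
    vars:
      master: jenkins-master.box.com
      label: python33

providers:
  aws:
    count: 100
    ami: ami-12345678                  # verified image built by image pipeline
  openstack:
    count: 50
    image: jenkins-node-python33       # verified OpenStack image
```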
Again, we store these configurations in a repository. Again, the same spiel about version control, CI, code reviews, all that good stuff. [00:24:30] I'm going to once again point out some of the interesting elements of the directory structure. We store all of our definitions, the YAML definitions like the one you just saw, in the definitions directory. This allows other people to take a look at what definitions look like and reuse them if they need to. Any actual Terraform configuration files go in the generated directory. Finally, the templates directory houses any reusable provisioners. So, when the configuration generator tool runs, it will take the templates, parse them, resolve the [inaudible 00:25:01] variables, and then [00:25:00] it will generate the necessary files.
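Going by that description, the repository presumably looks roughly like this (names are illustrative):

```
terraform-config/        # name and layout assumed from the description
├── definitions/         # YAML definitions like the one above
├── generated/           # Terraform configuration emitted by the generator
└── templates/           # reusable, parameterized provisioner templates
```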
Here's that infamous slide once again. For Phase One, we had image pipeline. For Phase Two, we had Terraform. And again, to harp on the important stuff: do this over and over again. You don't want to just spin up a hundred Jenkins [inaudible 00:25:24] once and leave them running. That means you've gotten nowhere. You want to be able to [00:25:30] recycle and refresh infrastructure on a regular basis.
We did this with our happily named Recycle project in Jenkins. What this does is you pass in the Terraform definition file that I just showed you, and it will figure out that, hey, I've got a hundred instances of this type in Amazon and 50 in OpenStack. It's going to take them all down and bring them back up with the latest image. That's the purpose of this project.
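The internal Recycle job itself isn't shown, but with stock Terraform commands a staged replacement can be sketched roughly like this; the resource names are placeholders, and the real job presumably wraps equivalent logic.

```sh
# Rough sketch of staged recycling with stock Terraform commands;
# the internal Recycle job is not public, so this is only illustrative.

# Mark a batch of instances for replacement...
terraform taint 'aws_instance.jenkins_node[0]'
terraform taint 'aws_instance.jenkins_node[1]'

# ...then apply: tainted instances are destroyed and recreated
# from the latest verified image referenced in the configuration.
terraform apply
```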
[00:26:00] Just to illustrate one use case where this is useful: I have a Jenkins cluster that's very happy. Everything is running, all my tests are running, all the developers are getting consistent test feedback. All of a sudden there's a rogue test that, say, corrupts a Git repo on one of the machines, a repo that is shared. All of a sudden, my entire infrastructure has gone bad due to this one flaky test. I figure out the test. I fix the test, but my infrastructure has still gone bad. Now I have to figure out how to uncorrupt the repo [00:26:30] using a [inaudible 00:26:32] or something.
If these were long-running servers, I'd have to actually go fix the problem even though I already know what happened. I already know that this was a bad test; we already fixed the test, but my infrastructure is still in a bad state. Well, thankfully, now I just push a button. I say, hey, just rebuild this thing for me. Grab the latest image from a known state and let me know when you're done. What you start [00:27:00] to see is that this Recycle project will smartly take down a certain percentage of your infrastructure. It doesn't want to take the whole thing down at once; it takes it down in stages. So over time, you're literally just waiting for this to happen. You're not actively doing anything. You're just watching things restore themselves. Over time, my entire fleet returns to normal.
We do this on a nightly basis. We recycle our entire infrastructure every night, to always start from a known state. We want to get to a point [00:27:30] where we don't have to worry about nightly; we just do this all the time. As you can see, we've come a long way. We started from a completely bare metal infrastructure, long-running, single-purpose machines, and now we're at a point where, with one button, we can recycle our entire infrastructure.
Just to illustrate some of the lessons that we learned with this ... [00:28:00] one of the important things, and HashiCorp also mentions this a lot, is that you don't want to treat your infrastructure any differently than you treat your other software projects. You want to interact with it via an API; you want to version control your infrastructure definitions. This is critical to getting to this level of automation. You also want to make it very easy to build images and verify images. [00:28:30] That has to be a very easy process, because you want to be able to repeat it as many times as possible. I already mentioned this: you can use the power of these infrastructure-focused tools and automate on top of them. You can take those tools, use them as the backbone of your projects, and then build your own customizations on top of them.
[00:29:00] We're in a good state now, but we want to take this further. We want to get to another level of automation. In particular, ours is a CI-focused team, where we have a lot of infrastructure that is only used at certain hours. During the day, our infrastructure is heavily utilized. At night, it's almost idle; it's barely used at all.
Auto scaling is a perfect use case for us. We've already started investigating this: how we can [00:29:30] leverage auto scaling and only use infrastructure when we actually need it. We can double the [inaudible 00:29:37] during the day, go almost completely to zero during the nighttime, and save a lot of money for our company. Along those lines, we can also start looking into containers to solve some of these problems. One of the primary differences from virtual machines is that a container's boot-up time is very low. [00:30:00] Even if you're spinning up just one virtual machine, you still have to wait for it to boot, and that can take anywhere from five minutes to probably even longer. You can probably spin up a hundred instances in those five minutes, but you still need those five minutes.
With containers, you don't have this problem. You can instantly spin up a container. For a team that runs tests, there's a use case here where you can just spin up a container, run your tests, [00:30:30] and delete that container. So you're only utilizing your infrastructure when you need it. There's no greater efficiency than that: you spin it up, you run your test, you tear it down. That's a level that is completely feasible in today's world.
That's all from me today. Thank you all for listening.
Q&A Highlights
So, for our use case, we tried the modules approach. What we realized with modules was that we were still duplicating a lot of the provider-specific fields. You probably noticed that in the Terraform definition, there is no area where you specify, I want [00:32:00] this username or this password, or use this variable. With modules, we realized you still have to do that. That was one of the reasons. Also, we thought that the way the Terraform configuration generator does it is very easy to follow for someone: they can say, "Okay, I have a Jenkins node provisioner that I'm using and it's right there." With modules, there's a lot of backtracking with [inaudible 00:32:23], and those are the reasons that we decided to switch to a generator. [00:32:30]
The way that we solve some of these issues of caches [00:33:00] or large packages on the machines is we actually just prebake them into the image. The Ivy cache is a perfect example. We have a job that runs every night [inaudible 00:33:10] Ivy cache. We pull that Ivy cache into the image itself. This happens asynchronously; it's not happening when you're building images, it's happening on the side. You take that Ivy cache, you put it into your image, and now all of a sudden your Ivy cache is up to date, and you don't have things slowing you down.
The way that we actually verify images is: you know how, ultimately, when we're deploying instances, we use Terraform to deploy the instance? The verification step actually invokes Terraform to bring up an instance from the image itself, and then we just use Terraform local-exec provisioners to do the verification steps, sorry, remote-exec provisioners, where it actually logs into the machine and asks, "Is [00:35:00] this service running? Is this file present on this machine? Is this package present on this machine?" Those sorts of checks.
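A hedged sketch of that verification step, assuming an OpenStack instance and remote-exec checks (the image name, connection details, and the specific checks are placeholders, not Box's real verifiers):

```hcl
# Illustrative only; image, flavor, user, and checks are placeholders.
resource "openstack_compute_instance_v2" "verify" {
  name        = "verify-jenkins-node-image"
  image_name  = "jenkins-node-python33-candidate"
  flavor_name = "m1.medium"

  connection {
    type = "ssh"
    user = "centos"
    host = self.access_ip_v4
  }

  provisioner "remote-exec" {
    inline = [
      "systemctl is-active jenkins-agent",  # is the service running?
      "test -f /etc/ci/agent.conf",         # is the expected file present?
      "rpm -q python33"                     # is the package installed?
    ]
  }
}
```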