Case Study

How Reddit Conducted a Large-Scale Infrastructure Migration Using Terraform

Hear the story of how Reddit adopted and mastered Terraform, and used it to migrate large swaths of their infrastructure.

In this talk, we outline how our use of Terraform has evolved over time, and how we utilized Terraform's strengths to conduct a large-scale migration across AWS regions.

Prior to 2017, Reddit's data engineering team was using manual processes to administer all of their AWS resources. Once they adopted Terraform, things got a lot more automated.

In this talk, Krishnan Chandra, a senior software engineer at Reddit, shares how his team adopted Terraform, improved their practices, and recently used Terraform in a large-scale migration across multiple AWS regions.

Speakers

  • Krishnan Chandra
Senior software engineer, Reddit

Transcript

My name's Krishnan and today we're going to be talking about how Reddit migrated its entire data infrastructure between two different AWS regions and how Terraform made that process easier.

Before we get started, a small show of hands: how many of you use Reddit? It's like almost everybody. But for the small number of you who don't, Reddit is a network of communities where individuals can find experiences and communities built around their interests and passions. I know DevOps and sysadmin are very popular on Reddit and I'm sure many of you look at those as well. Since so many of you use Reddit, you probably have an idea of what the scale is like, but here's some numbers depicting exactly what the scale is actually like. Right now, we have over 330 million monthly active users, 12 million posts per month, and 2 billion votes per month. You guys are all very active and we appreciate that.

Today's talk is mostly about three things.

  1. What is this migration and why did we do it?

  2. How did Terraform help us execute this migration?

  3. What are the lasting benefits that we've gained by managing our infrastructure using Terraform?

Let's start with the why. Why did we do this?

The old infra

To give you some background on what our data infrastructure looked like before: essentially, we had all Reddit services in a single AWS account, but in different regions, and the data infra was in one region while all of the rest of Reddit's infra was in another region. Specifically, the data infra was in West and the Reddit infra was in East. And if you asked me why that is, I actually don't know, because a lot of this happened before I joined the company.

But the crucial problems that we ran into with that were that the data infra was sort of oriented around data scientists, most of whom are based here in San Francisco. That turned out to be pretty useful for them, having the compute infrastructure near them. It was less so for app developers who were working off the Reddit infra in us-east. The other problem is that data transfer is really, really expensive, and if any of you have tried to transfer data, even between AZs in AWS, let alone across AWS regions, you've probably run into this issue before. Doing so created a really high barrier to entry and meant that it was really hard for other teams to actually utilize our data. The problem statement was: what if we moved our primary data infrastructure to another account in us-east-1? That way, we get a little bit of isolation, we get the data infrastructure on its own, and it's now accessible to everybody else who's running in us-east-1 as well.

Organizational context

To give you some background on the organizational context behind why this was necessary: when I joined Reddit, we were about 25 engineers, and even fewer of those were doing DevOps or infrastructure-type work. There were only a handful of us doing that.

Since then, we've grown to well over 100 engineers, a lot of whom are trying to use the data infra. They're in this weird position where we all share AWS resources, which leads to a lot of contention, and they want to build site features based on real-time data, like search indexing. If you post on Reddit, it should be indexable in search almost immediately. But before, the way we were doing that was a daily ETL that would index all the data into search.

If you made a post, you couldn't search for it until the day after that, which wasn't so great.

Operational problems

An overarching theme of the bad stuff in our old setup is manual maintenance. A lot of stuff was done manually that really should never have been done manually. Specifically, we had all of our AWS assets created directly through the AWS console, including things like launch configurations, IAM roles, policies, and Route 53 records. That's a lot of overhead, and it's very, very easy to get things wrong.

Where this came to a head was with Kafka, one of the biggest services within our data infrastructure. Any time you had to replace a Kafka host, you had a 14-step guide to follow, covering all the manual steps needed to replace that host. It's very, very easy to mess something up if you have 14 steps.

Security problems

Aside from that, not only were there a lot of problems with manual steps, there were also a lot of problems with managing things like IAM, security groups, and routing manually. This results in a phenomenon that I call zombie policies: you have some IAM policy or some security group rule applied, you don't know when it was applied, you don't know why it was applied, but you're very, very scared to get rid of it because you don't know what will happen.

This happened to us a fair amount and as a result, we ended up with these like really bloated security groups that would have hundreds of rules and we weren't really sure why those rules were there or how to get rid of them.

Another problem was that we had many different kinds of data co-located within S3 buckets, and we relied on IAM permissioning by path to make sure that access controls were done correctly. While that does work, it is very scary because it's also very easy to mess up, and if you do it repeatedly, you end up with very convoluted directory structures within a single bucket, which I definitely would not recommend.

Finally, permissioning is really hard when everybody shares the same AWS account, because you want to give some people access to some resources but not others, and it's really hard to get that level of granularity. The easiest thing to do is just to split those out by account.

Goals

To recap, what were the goals of our migration? Really it was:

  • Save money—don't spend so much money doing data transfer anymore,

  • Get our stuff into a different AWS account,

  • Get rid of a lot of our bad practices and a lot of our manual maintenance,

  • Have some sort of inventory of what we're running and what policies we have, and

  • Make it easy for other teams who want to create their own accounts in future to do that.

Why Terraform?

Before I go on to how we did this, it's important to take a moment to stop and ask: why did we use Terraform to do this?

  1. Terraform already had some adoption at Reddit before we—the data team—decided to do this migration.

  2. I personally like writing HCL a lot more than I like writing JSON or YAML files. It's a lot easier to read, and it reads a lot like code, which makes it much easier to context switch if you're doing dev tasks and then have to switch over to doing ops tasks.

  3. The AWS provider is pretty robust and it had good integration with all the components we were already using.

  4. Modules turned out to be super helpful, as you'll see later in this presentation.

  5. Finally, we could use Terraform to manage infra in different cloud providers. This migration mostly talks about AWS, but we do have some stuff in GCP, which I'll mention briefly.

Application

Now, let's talk about application, which is: how did we actually do this thing? Our original migration plan looked something like this, where we were just like the Underpants Gnomes from South Park: "We're going to do it. Everything is going to be great. We're going to save money and it's going to be awesome." As it turns out, it wasn't nearly that easy.

The first thing we had to do was basically take inventory, much like the gentleman in the stock photo, and figure out: what exactly do we have, how much of it do we need, and what do we want to migrate over? Thankfully, a colleague of mine made this diagram, which I then stole, and it gives you a sense of what our data infrastructure is actually like. At a high level, you basically have clients (if you're on Reddit on the web or on the mobile apps) that fire events. Sometimes those go through a CDN for client-side things. For server-side things, they just talk directly to our ELB.

From there, it gets picked up by a set of servers, which then write it out to Kafka. We do a bunch of stream processing on Kafka and then archive it out to our analytics stores, which are Hive tables in S3 and BigQuery, which we also use for data warehousing.

Let's break down what kinds of services exist in that setup. You've got stateless services, which were our event collectors (just a web service) and Kafka consumers. One interesting detail about those Kafka consumers is that we had some old ones that ran containerized inside Mesos, and newer ones that ran solely on EC2 instances, configured using Puppet. Then you've got stateful services, which are a little bit harder to spin up; those involved Kafka and Zookeeper, which are very interconnected, as well as Mesos and Marathon, which we were using for container orchestration.

Before you can do any of that, there's a bunch of stuff that's not even mentioned in that diagram, which is a lot of the base AWS components you need before you can start spinning up any of this. That is, you need VPCs, route tables, S3 buckets, and security groups all set up before you can even start spinning up infrastructure, because you need to know how to manage it. We also had a lot of what I would call foundational infrastructure components: monitoring, where we use Graphite; log aggregation; Vault for secrets management; and a Puppet master so that we can run Puppet. There are a bunch more things that I'm not mentioning here, but we'll just skip over those for simplicity.

The obvious choice was to start with the foundations, because without those, you can't do any of the other stuff. Basically, we started out by creating a base module. The idea behind the base module was that it would be reusable; other teams could pick it up and spin up their own VPCs with it if they wanted to. You would set up a VPC, specify which AZs you wanted it to span, specify a certain set of subnets and routing configs, and it would just go in and create those route tables for you. In addition to that, we also wanted to make sure we did the right thing with respect to S3 buckets, creating those separately for separate use cases and establishing good isolation.
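
As a rough illustration of how a team would consume that kind of base module (the module name, source path, and variable names here are hypothetical, not our exact interface):

    # Hypothetical usage of a reusable "base" networking module.
    module "data_base" {
      source = "../modules/base"

      vpc_cidr           = "10.20.0.0/16"
      availability_zones = ["us-east-1a", "us-east-1b", "us-east-1c"]

      # One subnet CIDR per AZ; the module creates the subnets and the
      # corresponding route tables and gateway wiring.
      private_subnet_cidrs = ["10.20.0.0/20", "10.20.16.0/20", "10.20.32.0/20"]
      public_subnet_cidrs  = ["10.20.48.0/24", "10.20.49.0/24", "10.20.50.0/24"]
    }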

Now, one of the interesting problems we ran into while doing this was that some services had additional dependencies which were not really documented and which we didn't learn about until we tried to migrate them. This, by the way, will be a very common theme of this talk.

RDS databases: For example, Graphite needed an RDS database to manage its metadata, and it turns out that database had just existed for a long time without us knowing about it. Then we tried to migrate it and we were like, "Right."

The good news is all of that is also manageable through Terraform and the AWS provider. It was pretty smooth for us to get all of that rolled into the migration as well.
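
For example, once we knew about that dependency, an RDS instance like Graphite's metadata database can be declared alongside everything else. Here's a minimal sketch with made-up names and sizes; the password variable and security group it references are assumed to exist elsewhere in the config:

    # Illustrative RDS instance for Graphite's metadata (not the real config).
    resource "aws_db_instance" "graphite_metadata" {
      identifier        = "graphite-metadata"
      engine            = "postgres"
      instance_class    = "db.t3.medium"
      allocated_storage = 20

      db_name  = "graphite"
      username = "graphite"
      password = var.graphite_db_password  # assumed variable, supplied out of band

      vpc_security_group_ids = [aws_security_group.graphite_db.id]  # assumed SG
      skip_final_snapshot    = true
    }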

Once we finished up with the foundational components, we moved on to stateful services, which were probably the second hardest problem after the foundations. We started out by Terraforming Kafka and Zookeeper. I should say that we had Puppet configs for most of this already, so it wasn't like we were completely starting from scratch.

But there were aspects that needed to be improved. In particular, one thing Puppet would try to do is figure out what type of instance you were running and then, based on that, try to mount certain disks and do certain other types of configuration, all just by inference. That was not really a great setup, because trying to detect what type of instance is underneath is very fragile.

Instead of that, it's much easier to start with Terraform, where you know what kind of instance you're spinning up, and have disk mounting and related setup done through Cloud-init, which runs on startup of the instance. What we did was replace a lot of pieces of Puppet code with Cloud-init instead. Now, Terraform just runs this Cloud-init code whenever an instance starts up.
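
As a sketch of that pattern (the resource, template path, device names, and variables are invented for illustration): the instance type is known in Terraform, and a Cloud-init template passed as user data formats and mounts the data disk on first boot.

    # Sketch: disk setup moves from Puppet inference into Cloud-init that
    # Terraform hands to the instance as user data.
    resource "aws_instance" "kafka_broker" {
      ami           = var.kafka_ami_id
      instance_type = var.kafka_instance_type
      subnet_id     = var.subnet_id

      # Cloud-init template that formats and mounts the data volume and then
      # runs the Kafka bootstrap (path and variables are illustrative).
      user_data = templatefile("${path.module}/cloud-init/kafka.yaml.tpl", {
        data_device = "/dev/nvme1n1"
        mount_point = "/var/lib/kafka"
      })
    }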

The good news is, we turned both Kafka and Zookeeper into Terraform modules and shared them within the organization. If you want to spin up a Kafka and Zookeeper cluster at Reddit, it looks like this. While this is not exactly accurate, it's about 90% accurate; there are a few more parameters that are not super relevant and wouldn't fit on the slide, so I just took those out.
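
Reconstructing that slide roughly, the module calls look something like this (the sources, parameter names, and the Zookeeper connect-string output are approximations, not the exact interface):

    module "zookeeper" {
      source = "../modules/zookeeper"

      cluster_name  = "data-zk"
      ensemble_size = 5
      instance_type = "m5.large"
    }

    module "kafka" {
      source = "../modules/kafka"

      cluster_name      = "data-kafka"
      cluster_size      = 9
      instance_type     = "i3.2xlarge"
      zookeeper_connect = module.zookeeper.connect_string  # assumed module output
    }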

But all you have to do is specify that you want Zookeeper, how large your ensemble is, and what your instance types are. Then the Terraform module will just go and spin up a cluster of that size and make sure all the nodes can find each other. Same deal with Kafka: you just specify what cluster size you want and the instance type. Again, there are a lot of Cloud-init scripts in there that will spin up your Kafka instances, register them to Zookeeper, and make sure the cluster is in order. It's much, much lower friction now for teams to spin up this type of infrastructure than it was before.

Immediate benefits

It was now much easier to manage everything around Kafka, because previously we'd end up with things like really fat security groups, like the one on Kafka, where everything has to talk to it and it would have dozens of rules. Now, the Kafka module just exports a security group, and we use that in other Terraform code when we're spinning up new services to set the security group rules.
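
For example, assuming the module exposes the group ID as an output (the output and resource names here are hypothetical), a new consumer service just attaches its own rule instead of growing one giant shared group:

    # Allow a new consumer service to reach the Kafka brokers on port 9092.
    # "module.kafka.security_group_id" is an assumed output of the Kafka module.
    resource "aws_security_group_rule" "consumer_to_kafka" {
      type                     = "ingress"
      from_port                = 9092
      to_port                  = 9092
      protocol                 = "tcp"
      security_group_id        = module.kafka.security_group_id
      source_security_group_id = aws_security_group.consumer.id
    }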

Our infamous 14-step replacement process is now down to three steps, and all three of them involve Terraform. If you need to replace a Kafka broker, you can just run terraform taint to mark the broker you want to replace as tainted, do a plan, and then do an apply, and it will go in and replace that broker. The new one will come up, register itself to Zookeeper, and do all the DNS shenanigans it needs to do. Just like that, a huge chunk of manual management is gone for us as well.
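
In terms of commands, that replacement flow is roughly the following (the resource address is hypothetical and depends on how the broker instances are named inside the module):

    # Mark the broken broker as tainted, review the plan, then apply.
    terraform taint 'module.kafka.aws_instance.broker[2]'
    terraform plan
    terraform apply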

This has gained very quick adoption within Reddit as we already have other teams who are using these Kafka and Zookeeper modules.

Moving containers

That's Kafka. The next part of the talk is about moving our container infrastructure. This was a very hard challenge for us, as we had a few consumers running on an old version of Mesos. I cannot stress how old "old" really is, because this was before DC/OS and before any of the nice new stuff that Mesos has—like really, really old.

We had no real inventory of what apps were running on there. The only way we could figure that out was going into Marathon and grabbing all the JSON files for the app configs. We didn't really have much in the way of Terraform configs for spinning up Mesos, and we didn't have a good provisioning or configuration story there in general.

That cluster had been created in 2015 and was virtually untouched since then. Essentially, I went into it looking like this where I was just like, "All right. We're going to try to figure this thing out and let's hope for the best."

Terraform to the rescue!

Now an interesting thing happened in that a lot of other teams at Reddit were like, "Mesos is cool and all, but there's this new Kubernetes thing that we like a lot more and we're going to use that instead."

As it turns out, one of the other teams had a Terraform module for spinning up a Kubernetes cluster. Essentially at that point, the tradeoff became, "Do I spend several weeks trying to move Mesos or do I spend like 10 minutes and get a new Kubernetes cluster?" Of course, I chose the latter option because that was a lot faster.

At that point, all I had to worry about was moving over the app configs: instead of running them on Marathon, I now had to run them on Kubernetes, and this turned out to be not as hard as Terraforming Mesos.

This only took me about a week or two to do and it shortened our expected migration time by several weeks. This is my teammates saving me from having to Terraform Mesos.

Stateless services

The last type of service, after the foundations and the stateful services, is stateless services. Now, these were actually pretty easy, so I won't spend too much time talking about them.

They were mostly Kafka consumers, and they were mostly already standardized using Puppet. Essentially, the main benefit we gained from using Terraform here was ASG management, plus making sure that our AMIs were up to date. We have a separate build process that builds base AMIs for us using Packer, and we just reference those directly in Terraform. If the AMI ever updates, we can quickly cycle it out in the launch configuration and do that very smoothly without impacting running services. Any new instances that get spun up after those changes are applied automatically get the new security updates, which is pretty nice.
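
A sketch of that wiring, with the AMI name pattern and instance details made up: a data source looks up the latest Packer-built image, and the launch configuration picks it up on the next plan.

    # Look up the most recent Packer-built base AMI (name pattern is illustrative).
    data "aws_ami" "base" {
      most_recent = true
      owners      = ["self"]

      filter {
        name   = "name"
        values = ["reddit-base-*"]
      }
    }

    # Launch configuration for the consumer ASG; when the AMI changes, a new
    # launch configuration is created and swapped in without touching running instances.
    resource "aws_launch_configuration" "consumer" {
      name_prefix   = "kafka-consumer-"
      image_id      = data.aws_ami.base.id
      instance_type = "c5.xlarge"

      lifecycle {
        create_before_destroy = true
      }
    }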

Flipping the switch

Then, once we figured out how to Terraform all these components, we had to figure out how to actually flip the switch and execute the migration. This, as it turns out, was a slightly more involved process than the three-step plan shown earlier, but it turned out to be pretty manageable. What we did was:

  1. We spun up a parallel data pipeline in us-east-1 and we replicated over a sample of data from us-west-2, which was our old region.

  2. We left both pipelines running because we wanted to make sure that all the data was making it through and that we weren't dropping anything along the way.

  3. Once that was done, we would replicate all the data over, and…

  4. Flip production traffic from us-west-2 to us-east-1 by changing routing at the CDN layer.

  5. Then once that was done, we could cull all the old stuff and actually profit.

Rough edges

It's important to note that there were some rough edges around all of this, and things that we haven't really solved yet.

Part of that is that S3 data is very hard to move between accounts, and if you do want to do that, it's a very long and expensive process, so we didn't end up doing it. For some things, we ended up having to make cross-account policies so that our new stuff in the new account could still reference the old stuff in the old account. If you've ever had to make cross-account IAM policies, then you probably know that you end up having to make policies in both accounts, and they have to actually work with each other and grant the right access permissions. This turned out to be a pain, but that's not really a Terraform problem; it's more of an AWS problem.
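
To give a flavor of what that looks like on the bucket side (the account ID and bucket name are placeholders), the old account grants the new account access in the bucket policy, and the new account still needs a matching IAM policy on whichever roles actually do the reading:

    # In the OLD account: bucket policy granting the new account read access.
    resource "aws_s3_bucket_policy" "legacy_data_cross_account" {
      bucket = "legacy-data-bucket"  # placeholder bucket name

      policy = jsonencode({
        Version = "2012-10-17"
        Statement = [{
          Effect    = "Allow"
          Principal = { AWS = "arn:aws:iam::222222222222:root" }  # placeholder new-account ID
          Action    = ["s3:GetObject", "s3:ListBucket"]
          Resource = [
            "arn:aws:s3:::legacy-data-bucket",
            "arn:aws:s3:::legacy-data-bucket/*",
          ]
        }]
      })
    }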

The second problem was that we had some issues sharing state across accounts and blueprints. The way things have evolved for us, we now have multiple teams running multiple accounts, but they don't all live in silos; they still need to talk to each other from time to time. To do that, we have to set up things like VPC peering and security group rules, and try to manage remote state between different blueprints. Actually referencing items within that remote state is something we don't have a great solution for. If any of you do have great solutions for that, I would love to talk to you after this.
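
For context, the basic wiring for a cross-blueprint reference is Terraform's terraform_remote_state data source. Here's a sketch with placeholder bucket, key, and output names; this works, but managing lots of these references across teams and accounts is the part that stays awkward:

    # Read another blueprint's state from its S3 backend (values are placeholders).
    data "terraform_remote_state" "core_network" {
      backend = "s3"

      config = {
        bucket = "example-terraform-state"
        key    = "core-network/terraform.tfstate"
        region = "us-east-1"
      }
    }

    # Use an output exposed by that blueprint, e.g. its VPC ID, for peering.
    resource "aws_vpc_peering_connection" "data_to_core" {
      vpc_id      = aws_vpc.data.id  # assumed local VPC resource
      peer_vpc_id = data.terraform_remote_state.core_network.outputs.vpc_id
    }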

The third problem we ran into is that we had these modules and they were really awesome, but sometimes it's hard to extend modules to add more functionality without breaking them for the original users. This is particularly true when modules are shared between different teams, because you can add something for one team that breaks it for another. Our current workflow for that is not ideal: usually, another team will make a v2 version of the module, copying most of the original and adding whatever new functionality they need. This causes a few problems, because you end up proliferating a lot of code and a lot of copypasta that you don't necessarily want.

In addition, sometimes we have to do manual state file editing, and this is probably one of the scariest parts of working with Terraform, because if you mess that up, you can do a lot of damage. Always take backups—always. We occasionally run into this when Terraform tries to destroy an instance which is already terminated. Sometimes it can't figure that out, so you'll end up having to go into the state file, remove that instance manually, and then Terraform is happy.
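
For the already-terminated-instance case specifically, one way to avoid hand-editing the JSON is to pull a backup of the state and then drop just that one resource address (the address shown is hypothetical):

    # Always take a backup of the state first.
    terraform state pull > backup.tfstate

    # Remove the dead instance from state so Terraform stops trying to destroy it.
    terraform state rm 'module.kafka.aws_instance.broker[2]'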

Finally, this is also not really a Terraform problem, but a quirk of our particular setup at Reddit. Both Terraform and Puppet changes frequently need to go out in tandem because some application level configs live within Puppet and a lot of the infrastructure level configs live in Terraform. Oftentimes, you'll need to do the Puppet changes first before spinning up the infrastructure. Otherwise, it won't exactly work.

Key wins

In the last bit of this talk, I'll mention all the benefits that we've gained from using Terraform and the lasting impact of doing this migration. Let's talk about our key wins.

  • By doing this migration, we did actually save a lot of money in data transfer costs, and I'll show a couple of slides later that show exactly how much.

  • We have a much better inventory of our AWS assets now. We actually know, for the instances we have, what they are, what we're using them for, and whether we still need them. Shortly after we did this migration, since we now had an inventory, we made a bunch of reserved instance purchases, and that saved us even more money.

  • There was a lot of good knowledge sharing with other teams. Internally, we've given Kafka and Zookeeper modules to the rest of the org. They've given Kubernetes modules to us and so it results in a lot of good collaboration and a lot of saved work that might be duplicated otherwise.

  • We reduced a lot of our operational overhead for stateful services, no more 14 step processes and hopefully never again.

  • We also enabled other teams to start utilizing this pipeline. This is still kind of a work in progress, but search indexing, the example I referenced at the beginning of the talk, is now actually done in real time; they consume off our Kafka stream to index posts as they are made.

Architecture

The architecture today looks not that different from the original architecture we had. The clearest difference is that at the top of the diagram it says us-east-1 instead of us-west-2.

One of the neat things is we've got different S3 buckets used for different things as they should be.

We've got more and more services being used in GCP, which is something that happened mostly in parallel to the events of this migration, but also provides us a good opportunity to use Terraform there.

We managed to do this without any super major outages. I don't have any great war stories from this.

The bottom line

I promised you some numbers, so I will give you some numbers. This graph is for the data transfer cost, and it's down about 50% from the original amount, so that's saving us tens of thousands of dollars per month.

Likewise, EC2 instance costs are down 40%. This graph reflects the reserved instance purchases: the green line at the bottom is our utilization in instance hours, and the blue line on top is essentially how much we paid. We're paying a lot less for about the same utilization.

Well, thank you everybody for attending. I'm Krishnan; you can come talk to me afterwards. You've still got a little bit of time remaining, so enjoy the rest of your HashiConf.
