Case Study

Enabling Infrastructure as Code at LinkedIn

Learn about LinkedIn's Terraform as a Service, built to spread infrastructure as code uniformly across engineering teams at scale.

Transcript

I'm going to give a talk on enabling infrastructure as code (IaC) at LinkedIn, our journey from imperative to declarative workflows. I'm hoping most of you have heard about LinkedIn and use it often enough. If not, please log on there. Daily active users always help. 

But to go through the motions, a little bit about LinkedIn. We are the world's largest professional network with about 850 million members in more than 200 countries worldwide. Our vision is to create economic opportunity for every member of the global workforce. And we've set out with the mission to connect the world's professionals to make them more productive and successful.

A little bit about me. I have been at LinkedIn for about three and a half years. I'm a senior staff engineer, and a founding member and tech lead of the Terraform as a Service (TFaaS) team. The current focus is on IaC, policy, and workflow platforms.

Agenda

In this talk, I'm going to cover the need for IaC, Terraform as a Service, our adoption strategy, and progress. The main chunk of the talk is focused on the adoption strategy: how we tried to move from imperative to declarative workflows, and what we had to do to get it going at LinkedIn.

The Need for IaC 

I'm not going to talk about the benefits of IaC and Terraform in general because I'm assuming most of the folks in the room are already well-versed with that. But we are going to take a top-down approach on this as to why we needed IaC at LinkedIn.

First off, LinkedIn is growing. As I stated earlier, we have about 850 million members with a 26% increase in revenue year-on-year. Increasing revenue generally means more infrastructure, more services, and more features being added in the backend.

To sustain this growth, we are constantly busy getting the site up and running and scaling up. Towards that, as I mentioned about the top-down approach, we have something called fabrics, which is nothing but a site, or a stack, or a datacenter, if you will. 

These fabric builds take time. I think, originally, we built one every few years, so the time it took was acceptable: if we are doing this every four years and it takes six to nine months, that's fine, let's pay the cost. But as we have grown rapidly, we have come to a place where we want to build multiple fabrics in a year.

Towards that, we wanted to be more agile and nimble. To give you an idea on the complexity of the fabrics, our control plane just for the fabric requires working with 59 systems across 15 different orgs. It's always challenging to coordinate with different teams, get them working, map out the dependency, figure out how it's brought up.

Complete bring-up requires working with 1,500+ services, including data, infra, stateful systems, and offline and online applications. So, it's a fairly involved task — going from zero all the way to a live fabric that is serving LinkedIn.com.

We encountered over the course of various fabric builds that tribal knowledge is everywhere. There was no uniformity in terms of automation, bootstrapping, or provisioning itself. Some had Bash scripts, some had Python scripts, some had a wiki doc which someone would run at the time of provisioning. And some teams had that, “one person who's been here for a long time who knows how to do it, and he'll do it again,” approach.

We needed something better that would standardize across LinkedIn. To give you an example, our previous fabric build times were around six to nine months each. It was a fairly engineering-intensive toil that we had to take on every time a new fabric had to be built.

TFaaS

To solve this problem, we built TFaaS and adopted Terraform. We set out with the following goal around TFaaS, that is, provide a standardized platform for IaC developers. In a nutshell, we were Terraform remote execution as a service. 

We provided all the bells and whistles and made sure that whatever was needed to productionize IaC, we would do inside of TFaaS, so the user could focus just on writing IaC. No worrying about encryption at rest, high availability, backups, state file management, or upgrades. If they're going from Terraform version x to y and there's a breaking change, which we have seen a couple of times, we manage that process for them.

We wanted to allow the user purely to focus on authoring IaC and nothing else. We also wanted to unify our workflows across different organizations enabling CI/CD for IaC. So, no more custom Terraform implementation by org x versus org y.

Beyond providing a standardized platform, we also wanted to support both on-prem and cloud deployments, because that's where we are. And as I mentioned at the beginning of the talk, we are taking a top-down approach: building fabrics is one of the main use cases.

One of the biggest requirements was the ability to bootstrap a fabric. That meant TFaaS would be one of the first services to come up in a region and be operational without any dependencies on LinkedIn primitives.

It should be operational without LinkedIn auth, Git, and other services being there. It would bring up LinkedIn services one by one, and then switch over to start using LinkedIn primitives when it's completely operational.

Towards this, we had to implement something called managed mode and unmanaged mode. Unmanaged mode was pretty much pure vanilla Terraform with a bring-your-own org story, where you had to work with Azure primitives and get it up and running. It was a very secure environment, and only a few people were allowed to use unmanaged mode. But once the seed was there in the fabric, we could switch over to managed mode, and folks could use Terraform with LinkedIn auth and other providers.

One big goal for the platform was to make the adoption of declarative workflows viable. One of the challenges we faced, and we knew this from the get-go, was that there's a lot of inertia associated with using CLIs and UIs to get things rolling, in the way users interact with different provisioning services. So, we set out from day zero to add features that would make this journey from CLI and UI to IaC easier for our folks.

TFaaS Architecture

I'll briefly go over the TFaaS architecture. It's pretty straightforward, nothing too complicated. What you see here are the core components. We have other services and dependencies which help run our particular workflows. But at its core, we have an ALB router which routes requests to the respective TFaaS API server.

The TFaaS API server fetches the relevant metadata (where the state file is stored, what the Git repository is), performs authorization checks, and applies any policies that exist at the global level. It then creates a job and queues it onto a service bus.

On the other side of the queue, we have TFaaS workers, which are listening on this queue. They pick up a job, run it, and update the state file and database when needed. This allows us to scale horizontally and vertically. We could handle any amount of load that was being thrown at us by just adding more servers on both the API side and the worker side.
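As a rough illustration of the flow just described (not the actual TFaaS code), the sketch below uses an in-process queue in place of the service bus. The job fields, function names, and worker count are all assumptions made for the example; the real API server would also do the metadata lookup and authorization checks that are omitted here.

```python
import queue
import threading


def api_server_submit(job_queue, workspace, action):
    """Stand-in for the API server: build a job and enqueue it.

    Metadata lookup, auth checks, and policy checks are omitted.
    The job shape is hypothetical.
    """
    job = {"workspace": workspace, "action": action}
    job_queue.put(job)
    return job


def worker(job_queue, results, lock):
    """Stand-in for a worker: pull jobs until a None sentinel arrives."""
    while True:
        job = job_queue.get()
        if job is None:
            break
        # A real worker would run Terraform here and update the state file.
        with lock:
            results.append((job["workspace"], job["action"], "done"))


def run_demo(num_workers=3, num_jobs=10):
    """Submit jobs and drain them with a pool of workers."""
    job_queue = queue.Queue()
    results, lock = [], threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(job_queue, results, lock))
        for _ in range(num_workers)
    ]
    for t in threads:
        t.start()
    for i in range(num_jobs):
        api_server_submit(job_queue, f"ws-{i}", "plan")
    for _ in threads:
        job_queue.put(None)  # one shutdown sentinel per worker
    for t in threads:
        t.join()
    return results
```

Because the submitting side and the workers only share the queue, either side can be scaled by adding instances, which is the property the talk attributes to this design.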

This was our MVP and beta, which had already been released to a lot of customers. From there we wanted to grow our usage, going from those who are excited about Terraform to getting those on board whose reaction is, "This is new, do I really have to do it?"

Adoption Strategy  

Part of our adoption strategy was to make it easy to move to IaC. We have this fancy system, and we are trying to sell it. We don't want pain points to come up for our users when they're starting this journey. One of the first things we added was called one-click import and it auto-creates IaC review pull requests by pointing TFaaS to existing resources.

We were well aware that TFaaS came in midstream, when tons of resources were already provisioned and managed. So, it would be hard to ask folks to leave those resources and start fresh. Having them do the import and manually craft the IaC was also not a pleasant experience. We genuinely wanted to spare folks that toil.

So, we created a provider-agnostic solution. As long as a provider has the import implementation there, you could point to a resource. And provided that provider was part of our ecosystem, it would facilitate importing existing resources into TFaaS. We would automatically create the IaC for the user, fill in all the fields, and create a GitHub pull request. The user can just accept the PR or RB, it will automatically merge, and the resource will be managed by TFaaS going forward.

This is the example command that you would have to run: `ulps tfaas import apply`. We have some metadata information in the CLI, but the most important parts are the resource type (a Kafka topic in this example) and the ID of the resource you want to point it to. In the results section, you can see a link to the RB that is created, as well as what will be imported.

Apologies if this is too small and a little bit blurry. But here is the diff that's created automatically by our tool, which has the resource block and all the fields filled out.

You can see that if you have a fairly complicated resource block, it's easier for users to just do this rather than try to fill out the IaC themselves. That's one of the workflows we saw emerge with this tool: people would use their existing tooling of choice (UI or CLI, click, click, click, click), then point TFaaS to that resource and get the IaC created automatically rather than having to craft it manually.

It was an interesting approach — a side effect of it — but it works out well. In the end, our goal is to make sure folks start using IaC to capture their intent. So, if they do it this way, that's fine because going forward all the updates will still happen through IaC.
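To make the "fill in all the fields" step of the import workflow concrete, here is a minimal sketch of turning a set of already-read attributes into a resource block. It is an illustration only: the function names, the `example_kafka_topic` resource type, and its attributes are hypothetical, and only flat string, boolean, and numeric attributes are handled.

```python
def to_hcl_literal(value):
    """Render a Python scalar as an HCL literal (strings, bools, numbers only)."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def render_resource_block(resource_type, name, attributes):
    """Build the text of a flat resource block from imported attributes."""
    width = max((len(key) for key in attributes), default=0)
    lines = [f'resource "{resource_type}" "{name}" {{']
    for key in sorted(attributes):
        lines.append(f"  {key.ljust(width)} = {to_hcl_literal(attributes[key])}")
    lines.append("}")
    return "\n".join(lines)


# Hypothetical resource type and attributes, for illustration only.
block = render_resource_block(
    "example_kafka_topic",
    "orders",
    {"name": "orders", "partitions": 8, "use_for_health": True},
)
print(block)
```

In the workflow described in the talk, text like this would be committed to a branch and offered to the user as a review rather than printed.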

Sandbox Workspaces 

We’ve talked about import workflows. The next one is sandbox workspaces. One might ask why we need this. Some of our providers — the in-house ones — were auto-generated based on the REST schema that we have internally.

A lot of them provided only server side validation. You could select a name — and then only after an apply would the server side come back and say do not put upper case letters or dashes in them.

Of course, folks could put this in the provider validation check itself, but because the providers were auto-generated, this was starting to be really cumbersome for our developers. The review and merge requirements for TF apply lowered developer productivity considerably, especially when they were experimenting with new things: the first time they're trying to bring up a resource, they want to play around with it to figure out what's happening.

In some cases, the metrics we saw were more than 30 commits to get it right. That was never a pleasant experience, and it left a bad taste in the mouth. Towards this, our solution was to create something called a sandbox workspace. Each workspace would get assigned to a user — specifically the sandbox workspace. This would allow applies to happen without committing to master, but only in a test environment. We had dedicated environments where folks could write IaC, commit to the local branch and then run apply on it, and see what comes up.

Of course, we did not want this to become the de facto way of creating IaC, where folks say, "Oh, I can get sandbox workspaces up and running, and I don't need code reviews. Here, my infrastructure's up, my service is running, people will not bother us anymore."

Towards that, we added some special features to the sandbox workspace. One of them is that a workspace expires after x number of days, and it will automatically clean up and destroy itself. A certain amount of time before the cleanup, it would stop accepting any requests. You couldn't do a plan, you couldn't do an apply, you couldn't read the output. The only thing you could do was destroy it yourself if you wanted to.
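The expiry rules above can be sketched as a small function. This is an assumption-laden illustration: the talk does not give the actual lifetime or the length of the read-only window, so `ttl_days` and `freeze_days` are made-up values, and the action names are hypothetical.

```python
from datetime import datetime, timedelta

ALL_ACTIONS = {"plan", "apply", "output", "destroy"}


def allowed_actions(created_at, now, ttl_days=14, freeze_days=3):
    """Return the actions a sandbox workspace accepts at time `now`.

    ttl_days and freeze_days are illustrative defaults, not LinkedIn's values.
    """
    expires_at = created_at + timedelta(days=ttl_days)
    freeze_at = expires_at - timedelta(days=freeze_days)
    if now >= expires_at:
        return set()  # expired: the platform cleans up, nothing is accepted
    if now >= freeze_at:
        return {"destroy"}  # frozen: only a user-initiated destroy is allowed
    return set(ALL_ACTIONS)


created = datetime(2022, 1, 1)
print(allowed_actions(created, datetime(2022, 1, 5)))
print(allowed_actions(created, datetime(2022, 1, 13)))
```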

Developer Experience

We added many features to support this, and most of these you would find in other implementations as well, such as Terraform Cloud and Atlantis. Some of them were integrations with the review systems. So, whenever a PR or any IaC change is made, the testing done section gets automatically updated. We also added failure classification and next step messages.

We were a small team while we were growing this, and with many engineers trying to use Terraform for the first time, we wanted to keep the operational load low. Towards that, one of the first things we added was the ability to parse failure messages and assign next steps to them.

So, hypothetically, if your resource failed, the provider owner could add a parser and say, "If this failed, go to this Stack Overflow location and look at what happened there." This was done not at the provider level, but at the UX level, in TFaaS itself, so that it could be highly customizable to the current scenario.
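One simple way to picture failure classification with next-step messages is a list of patterns checked in order. The rules, categories, and messages below are invented for the example; the talk does not describe how TFaaS stores or matches its parsers.

```python
import re

# Hypothetical rules: (pattern, category, next step shown to the user).
RULES = [
    (re.compile(r"invalid name", re.IGNORECASE),
     "validation",
     "Use lowercase letters only, without dashes, then re-run the plan."),
    (re.compile(r"unauthorized|forbidden", re.IGNORECASE),
     "auth",
     "Check that your workspace has access to this resource type."),
    (re.compile(r"timed? ?out", re.IGNORECASE),
     "transient",
     "Retry the job; if it keeps failing, contact the provider owner."),
]

DEFAULT = ("unknown", "Contact the provider owner with the full job log.")


def classify_failure(message, rules=RULES):
    """Return (category, next_step) for the first rule matching the message."""
    for pattern, category, next_step in rules:
        if pattern.search(message):
            return category, next_step
    return DEFAULT


print(classify_failure("Error: request timed out after 30s"))
```

Keeping the rules outside the provider, as the talk describes, means they can be changed without releasing a new provider version.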

We also added resource destruction protection. This is built in, and we added it after an outage that we suffered. When folks are new to Terraform, it's sometimes easy to misread the plan: a destroy and then create rather than a change in place, or a whole destroy happened and they didn't realize why.

Towards that, we added a two-factor step to any destroy action. If the plan detects that any resource is being destroyed, we stop you in your tracks and warn you with a big red sign: resources are being destroyed, are you okay with this? If you are, this is the one-time code that you need to provide for us to execute this job further.

Some Terraform-friendly teams found this annoying, but others found it useful. Globally, this was enabled, but we allowed certain teams to disable this feature if they were totally confident in what they were doing.
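A destroy check of this kind can be sketched against Terraform's JSON plan output, where each entry in `resource_changes` carries a `change.actions` list and a replacement shows up as both `delete` and `create`. The confirmation-code handling below is an assumption made for the example; the talk does not say how TFaaS issues or verifies its one-time code.

```python
import hmac
import secrets


def destroyed_addresses(plan):
    """List resource addresses whose planned actions include a delete.

    `plan` is a parsed JSON plan (a dict with a "resource_changes" list).
    """
    return [
        rc["address"]
        for rc in plan.get("resource_changes", [])
        if "delete" in rc.get("change", {}).get("actions", [])
    ]


def issue_confirmation_code():
    """Generate a short one-time code (hypothetical format)."""
    return secrets.token_hex(4)


def may_execute(plan, expected_code=None, supplied_code=None,
                protection_enabled=True):
    """Allow the job unless it destroys something without a matching code."""
    if not protection_enabled or not destroyed_addresses(plan):
        return True
    if expected_code is None or supplied_code is None:
        return False
    return hmac.compare_digest(expected_code.encode(), supplied_code.encode())


plan = {
    "resource_changes": [
        {"address": "example_kafka_topic.orders",
         "change": {"actions": ["delete", "create"]}},
        {"address": "example_kafka_topic.audit",
         "change": {"actions": ["update"]}},
    ]
}
print(destroyed_addresses(plan))
```

The `protection_enabled` flag mirrors the per-team opt-out mentioned above.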

The other thing, which I've already mentioned: the platform also took care of automatic version upgrades, even if there were breaking changes and even if code changes were required on the IaC side. I think we started with 0.12.2 and went all the way to 1.2. Usage had gravitated around several different versions. To consolidate everyone on one version, the only way to make it less painful for the user was to do the migration for them automatically, whether that meant updating the state file internally or updating the IaC files to capture the change. We took care of that as well.

Adoption Strategy  

Going forward, again, in the adoption strategy: helping folks who say, "You can take my CLI from my cold dead hands." We've run into this many times: "Why can't I use CLI? Where can I use CLI? Why can't I keep using my CLI?" And they had good reasons as well.

There could be a break-glass scenario where something's on fire: "I don't have time for this, let me run this one command, and it'll be fixed." Or just force of habit. We found that within a team itself, some embraced the IaC culture and went through the review process, but others still wanted to use a CLI. So, you would keep running into situations where there was a mixed bag, and the outcome was never nice.

Drift Detection

Towards this, we added something called drift detection. I'm sure most of you have heard about this coming out natively in Terraform Cloud as well. We've had this for about a year and a half: we run a plan of your IaC against your deployed resources, check whether drift has actually happened, and warn the user that drift has happened and should be resolved before moving ahead.

In this example here, you can see the use for health value has changed from true to false, sorry, the other way around — probably by a CLI. This would help alert the user if something has changed and IaC is no longer up to date. 
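The comparison behind drift detection can be pictured as a diff between what the IaC declares and what is actually deployed. Terraform does this itself when it refreshes during a plan; the function below is only a simplified stand-in, and the `use_for_health` spelling and topic attributes are hypothetical.

```python
def detect_drift(declared, actual):
    """Return {attribute: (declared_value, actual_value)} for every mismatch."""
    drift = {}
    for key in sorted(set(declared) | set(actual)):
        before, after = declared.get(key), actual.get(key)
        if before != after:
            drift[key] = (before, after)
    return drift


# Hypothetical attributes: the IaC says false, but someone flipped it via CLI.
declared = {"name": "orders", "use_for_health": False}
actual = {"name": "orders", "use_for_health": True}
print(detect_drift(declared, actual))
```

When driving the real CLI, `terraform plan -detailed-exitcode` gives a coarse version of the same signal: exit code 2 means the plan succeeded with a non-empty diff.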

Drift Mitigation

To take it a step further, we had something called drift mitigation. There are two ways to resolve drift. One is to enforce your current IaC — drift happened. I know that was a mistake. My IaC is king. I love using that.

Or you can update the IaC to reflect the changes that were made outside of IaC. This is specifically the CLI example that I used where someone in the team has used CLI even though there's IaC because they wanted to keep using CLI.

To bridge that gap, we built a drift remediation recommendation engine. Whenever drift was detected, TFaaS would try to create a review or pull request that captures the changes made outside of TFaaS in the IaC file. We would update the IaC file to reflect whatever was changed outside, and run a plan to ensure the drift would be resolved by the IaC changes we had made.

Here's a sample RB that gets created — and I apologize for the small text again. But the main portion is we have a drift remediation recommendation for x-y-z. In the testing done section, we run our updated IaC and see there are no changes. So this is good to go to help you reconcile it.

This is a sample RB. Going by the same example that we saw, "use for health": if you look at line 20, you can see "use for health" has been updated to reflect the value changed by the CLI. The interesting part here is that this is all done using the Terraform state files and IaC files that we have internally.

We do not snoop on CLIs. We do not query resources or event history to see what changed and try to reconcile it. It's all natively in Terraform. As long as the resource is managed by TFaaS, we can help do this.

Limitations 

One limitation of our engine right now is — you see that big green block? That is because by default we also put the default values in there if they were not captured in the IaC itself. We're working on improving it, but that's where it is right now. 

This is how we solve the drift remediation problem. On a very high level, internally, we are building a DAG of resources based on the state file representation. At the same time, we do static analysis on the IaC files to link resources that are presented in a state file to actual IaC file blocks.

Once we have those nodes built, we know what the inputs and outputs are, that is, the variables coming into a particular resource and the variables going out. Then once a drift detection or plan is run, we know exactly which resource changed. Based on that, we propagate the changed value up or down and see where it stops.
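The propagation step can be sketched with a toy model: follow one-to-one references from the drifted attribute until a literal value is found, and update it there. This is not Terraform's evaluation graph or the TFaaS engine. References are modeled as plain strings prefixed with `var.`, the data structures are hypothetical, and interpolations are deliberately out of scope.

```python
def is_reference(value):
    """Toy convention: a string starting with 'var.' is a variable reference."""
    return isinstance(value, str) and value.startswith("var.")


def find_update_site(resources, variables, address, attribute):
    """Follow one-to-one references to the place holding the literal value.

    Returns ("resource", address, attribute) or ("variable", name).
    """
    value = resources[address][attribute]
    if not is_reference(value):
        return ("resource", address, attribute)
    seen = set()
    name = value[len("var."):]
    while True:
        if name in seen:
            raise ValueError(f"reference cycle at variable {name!r}")
        seen.add(name)
        value = variables[name]
        if not is_reference(value):
            return ("variable", name)
        name = value[len("var."):]


def remediate(resources, variables, address, attribute, observed):
    """Write the observed (drifted) value at the site that defines it."""
    site = find_update_site(resources, variables, address, attribute)
    if site[0] == "resource":
        resources[address][attribute] = observed
    else:
        variables[site[1]] = observed
    return site


resources = {
    "example_kafka_topic.orders": {"name": "orders",
                                   "use_for_health": "var.health"},
}
variables = {"health": False}
print(remediate(resources, variables,
                "example_kafka_topic.orders", "use_for_health", True))
```

In the workflow from the talk, the updated files would then be planned again to confirm that no changes remain before the review is offered to the user.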

Of course, this had some limitations. So far, we have only solved this for flat IaC. As we are observing, modules are getting more and more popular in our usage, and it does not work for most modules. It's not a fundamental limitation; we just haven't focused on implementing that yet, and we know how to implement it for modules as well.

Similarly, I think the other limitation is that this does not work with local variables and interpolations. If the value that changed is joined with other variables to form another value, which then gets assigned to a resource, we don't go that deep. We just work on one-to-one mappings. In our analysis of the interpolation problem, though, we actually saw very little usage where this would happen. So, we haven't focused on that, and we don't plan to in the near future.

Progress With Fabric Builds

We generated IaC for migrations. We did some centrally; for others, we asked the teams to write it themselves and gave them the format and the providers to do so. Based on that, we created six functional fabrics within two years: three production and three test, plus 15+ ephemeral fabrics for various levels of testing.

I would like to point out that fabric build is not the only use case, but it is the one with the most measurable impact, that is, we could time it. Earlier it took six to nine months; now it's taking a month or so.

We have more traditional use case adoption as well. Some networking service backends are using TFaaS to implement ACL rules and network rules. Some teams are managing their resources in Azure using Terraform directly, or resources in LinkedIn itself through TFaaS.

High-Level Metrics 

As of FY22 Q3, and this is still with fabrics not being completely live, that is, bootstrapped but not ramped with user traffic: we had about 1,950 workspaces, and on average we were executing more than 8,000 workflows per month. For peak executions, I think we maxed out at about 36,000 in one case. After that, we said we can keep scaling out, but we don't want to; just throttle your requests a little bit.

The number of users we've seen is, I think, more than 500, and the number of services more than 200. And around 7,000 resources were being managed through Terraform with TFaaS.

Why Not Terraform Enterprise? 

I don't like to get into uncomfortable situations, being at a HashiCorp conference, but I still want to talk a little bit about why we did not go with that. First and foremost, the biggest challenge for us was integration with the LinkedIn ecosystem. We are huge. We already have a load of tooling investment on the backend, with tons of gravity around it.

At the time of evaluation, our engineering org system would not have worked with Terraform Enterprise from the get-go. That was a big problem. Then we had issues with CD system integration, where our CD system would have had to be re-architected to accommodate this use case.

LinkedIn internally also uses something called Rest.li, which is an open source standard around REST for schema management. To support this in Terraform Enterprise, we would've effectively had to build a proxy in front of TFE, which seemed like overkill, to be honest. And getting the core functionality working without TFE did not actually take us too long.

I already went over the custom workflows that we absolutely needed to make IaC a success at LinkedIn. Creating a smoother transition from imperative to declarative workflows was really important for us: the import workflow, the drift detection, and drift remediation. From day zero, we thought these were the things we wanted to solve for going forward.

Of course, we have a massive scale. So, cost does play a factor in deciding a few things: 1,500+ applications across 6,000+ engineers could be massive. And we didn't want to be in a situation where folks are structuring their ops or IaC code just to save on cost.

That concludes our talk. I would like to thank the audience for taking the time. Here on the screen you see some TFaaS team members, both past and present, who made this happen. I think we started with two or three engineers for about three or four quarters before more folks joined in and helped us take it to GA.
