Learn how State Farm created a GitOps workflow using GitLab and Terraform Enterprise.
Mae Large: Hello, and welcome to our presentation on GitOps and TFE. My name is Mae Large. I'm an Architecture Manager here at State Farm, primarily responsible for the Delivery Engineering topic. That means my team and I partner heavily with different teams at State Farm: Change Management and the different platform owners enabling our cloud environments, both within our walls and out in the public cloud.
We partner with these teams to deliver a compliant developer-friendly way to get code changes to the hands of our customers. I am joined today by Pinky, and I will let her introduce herself.
Priyanka Ravi: Hi, I'm Priyanka Ravi. As Mae mentioned, I also go by Pinky. I am a software developer on Mae's team. For the last year and a half, I have been primarily involved in helping bring GitOps into the organization.
Mae Large: Thank you, Priyanka. We both work for State Farm. In fact, I've been here for about 14 years.
State Farm is a known name in the insurance business. We've been number one in auto insurance since 1942. Number one in homeowners since 1964. We are 36 on the Fortune 500 list. As far as our manpower goes, we have 19,200 agents, 59,000 employees, and 1,800 software developers.
These software developers are our customers for the topic that Pinky and I drive here at State Farm. And just bonus info, that building they're showing is our Phoenix Hub, where both Pinky and I are located. To wrap up introductions, let's get the disclaimer out of the way — Pinky and I are both here expressing our individual opinions and experiences. We are not speaking on behalf of State Farm.
These are the things gaining adoption and momentum around State Farm as it relates to the topic that we drive.
About four years ago now, as we were venturing out to the public cloud, we came to this realization. On one side, the developers got to understand that they better be aware — better be understanding — of the infrastructure that powers the software that they wrote. On the other hand, our infrastructure folks came to a realization that we need to start managing and operating on these infrastructure resources just as we do with code.
These two realizations naturally translated over to Terraform. Terraform is State Farm's strategic infrastructure as code solution. About a year and a half into using Terraform open source, State Farm — with my team's involvement — did a proof of value on bringing in TFE. Fast forward to today; we have been using TFE for almost a year now.
You saw in our stats, we've been number one for several years. We're serious about keeping that competitive advantage. One way we're doing that is by significantly reducing lead time for changes. How quickly do we get innovation from idea and into the hands of our customers? That's where GitOps comes in.
GitOps is the continuous deployment methodology for cloud-native applications. It thrives in the fundamental principle that Git is the source of truth for your entire declarative system. And what I mean by a declarative system — it starts with your declarative infrastructure, the config that gets applied on it, the application — or applications — that then get applied to those infrastructure resources. Even the policies that are enforced on every single one of these layers or resources — those are all done via Git. GitOps manages it. Because it's all managed as code, it naturally resides in Git.
Second, every code change. The validation that takes place (making sure it's of quality, running the different levels of tests, and making sure that it's free from security vulnerabilities) all happens in Git too, in a pipeline. Last, and the key piece: the collaboration that takes place on every code change, as committers and collaborators weigh in on a code review, is also within Git. All the way to the approval of that change via a merge request, so that it gets realized in production, everything is done in Git. That's what GitOps is. It's operations by pull request. Now let's talk about the whys, the benefits of these solutions that we brought in.
You can Google this, and you're going to get a ton of results on why I should bring GitOps to my organization. But these are the top three reasons why State Farm adopted GitOps.
We know that our ability to stay competitive is in the hands of our developers — innovating nonstop, hands on the keyboard. What better way to promote that, but to provide a way to deploy their changes — get their changes realized in the hands of our customers right away, remaining developer-friendly.
And it is because our developers are familiar with Git. They use it day in and day out. They don't have to understand yet another tool, yet another solution, just to get their changes out the door. They stay in the ecosystem where they thrive, which is Git, and they collaborate with fellow committers.
I know I may be sounding repetitive here because everything's Git, Git, Git. The rich history is already in Git. That is the history — the audit trail — we can harvest and present to our auditors. How did this change make it? What were the validations that took place? Who approved it? That audit trail is all in Git. And we're just harvesting that in the case of an audit.
Last and certainly not least on the 'Why GitOps?' We are now comfortable with component-level changes getting realized straight to production. By component, you can think of a single microservice with a change. And the developer who made that change — if they had gone through the proper pipeline rigor and the approval — it will get realized in production right away.
Key DevOps principles, small miniature changes, less context switching. Because it's such a minute change, we're able to properly observe — monitor — how this change is performing in production. And even if it fails, since it's such a small change, we can definitively make a decision on whether to fix it forward, which is easier — not much context that has changed — or even roll it back, if it comes down to it. That's why GitOps.
Two main reasons why we brought in TFE.
Number one, as you saw, we cater to 1,800 software developers, and not all of them are operating in the public cloud. We brought in TFE to eliminate the responsibility from our customers, from having to figure out where to stash the state files behind the infrastructure that they're now managing. With TFE, you don't have to worry about state management. TFE does it for you.
In an organization like ours, we have compliance requirements that we need to abide by. Sentinel gives us that. And even better, we can describe these policies as code. That gives transparency into which policies we're running nonstop in every one of these workspaces.
This is a match made in heaven, after all. In retrospect, we could have released GitOps much earlier with Terraform Open Source. At the time at State Farm, we were still actively standing up TFE. And yes, we've allowed early adopters to use our alpha and beta versions of GitOps. But it was the right decision to make GitOps generally available with TFE. And here's why.
The product that goes into the hands of our customers — the product that's constantly changing — that translates into a TFE workspace; that same product translates into the TFE teams that we provisioned for our customers that are now tied to our identity provider. That convention carries throughout our entire solution stack. That achieves faster onboarding because the convention is just the product name.
Faster troubleshooting also. When things break at the different integration points, all we have to ask our customers is, “What is the product name?” And we can take it from there. In fact, even they can troubleshoot these problems by themselves. Again, because of the convention.
First off, we love the secured variables. Why? Because, as you know, in every automation there are secrets involved. We’re able to set those secrets in our customers’ TFE workspaces without giving them the ability to see the values of these secrets. Such a key feature, such a huge lift for us.
Second, RBAC. The permissions model in TFE is so well thought out. From the organization permissions — which we so use because we’re the owners of GitOps — down to the individual workspaces. The permissions that product teams and product team members can add themselves to — following, yes, the convention again — to fulfill their specific responsibilities. It naturally isolates these individual workspaces as well.
Last, VCS integration. In hindsight, TFE with VCS integration is very similar to when GitOps was first released by Weaveworks using Flux. I'm just going to leave it at that — hold that thought, please — because Pinky will get into the specifics in a little bit.
A bonus on the 'Why GitOps and TFE': 1,800 software developers. Not all of them are operating in a public cloud. Yet we're able to give them that experience of using Terraform, of operating in an infrastructure as code ecosystem. They're now taking on the responsibility of managing infrastructure as code on resources that they're very familiar with: GitLab Groups and GitLab repos, which we also call config repos.
Now the benefits — the whys — are out of the way, I'm going to turn it over to Pinky, and she will walk us through the different GitOps workflows that we've enabled here at State Farm.
Priyanka Ravi: Thank you, Mae. I'm going to be walking through a few workflows for the platforms that we've enabled.
I'm going to be starting with our Kubernetes Flow. As you can see, it starts pretty much the same way with the developer pushing to a source code repo. You probably already have a CI/CD pipeline set up in your flow that currently does your tests and your deployments. What’s different is this concept of the config repo. That is not something that we've come up with here at State Farm. It's an industry term associated with GitOps. But we've set it up differently.
The purpose is to provide a full audit trail from the code change to the progression in the delivery pipeline. It logs who approved the change, and it goes all the way to the deployment of the change. The creation of the config repo is automated in our process, and the integrity of the repo is enforced on a recurring schedule. Mae is going to go into that in a little bit — in further detail.
The first thing to note is our Master branch in our config repo is our default branch — and it's also protected. This enforces that there is no possibility of a rewrite to the history in that branch. We also have settings in place that make sure nobody can do a direct push to Master.
This merge request has some permissions that are locked down as well. One of those permissions is that it requires an approver. But it also requires that the approver has to be from a set list of approvers that are the product’s owning managers — so one of them has to approve it.
Also, merge request committers cannot approve their own change. Let's say even though Mae is the manager of a product and she initiates the change; she cannot go in and approve her own change. Lastly, the config repo has settings so that it cannot be deleted — and that's for auditing purposes — to keep that audit trail.
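To make the approval rules above concrete, here is a minimal Python sketch of the check they describe: a merge needs at least one approval from the product's owning managers, and the person who initiated the change cannot count as their own approver. The class, function, and user names are hypothetical and are not State Farm's implementation; in practice these rules are enforced through the repository's settings rather than custom code.

```python
from dataclasses import dataclass, field


@dataclass
class MergeRequest:
    """Hypothetical, simplified view of a config repo merge request."""
    author: str
    approved_by: set = field(default_factory=set)


def can_merge(mr: MergeRequest, owning_managers: set) -> bool:
    """Return True only if an owning manager other than the author approved."""
    # Approvals only count when they come from the set list of owning managers,
    # and the author's own approval is always discarded.
    eligible = (mr.approved_by & owning_managers) - {mr.author}
    return len(eligible) > 0
```

With this rule, a change authored by a manager and approved only by that same manager is rejected, which matches the example of Mae not being able to approve her own change.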
Between the CI/CD pipeline to the config repo, you might see this line that says GitOps CLI. This is something that we've developed internally. We have the GitOps CLI deployed as a docker image, and we call that in our CI/CD pipeline. It does a few things to get to the config repo. Firstly, it creates a branch in the config repo, and it uses a naming convention of the commit SHA. The commit SHA from the source code commit is how it's named to the config repo.
Secondly, it creates a commit on that branch. The commit that it creates has the files that are required — the instructions — to perform a deployment. Lastly, it creates a merge request using that branch. That merge request contains details such as our evidence of tests link, and also it links back to the source code commit.
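The three steps the GitOps CLI performs can be sketched as a small planning function in Python. This is only an illustration under stated assumptions: the internal CLI is not public, so every name here (`MergeRequestPlan`, `plan_config_change`, the message and title formats) is hypothetical. What comes from the talk is the shape of the flow: a branch named after the source commit SHA, a commit carrying the deployment files, and a merge request that links to the evidence of tests and back to the source commit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MergeRequestPlan:
    """Hypothetical description of what would be pushed to the config repo."""
    branch: str
    commit_message: str
    files: dict
    title: str
    description: str


def plan_config_change(commit_sha, manifests, evidence_url, source_commit_url):
    """Build the branch, commit, and merge request details for one change."""
    description = (
        f"Evidence of tests: {evidence_url}\n"
        f"Source commit: {source_commit_url}"
    )
    return MergeRequestPlan(
        branch=commit_sha,  # step 1: branch named after the source commit SHA
        commit_message=f"Deploy {commit_sha}",  # step 2: commit with deployment files
        files=dict(manifests),
        title=f"Deploy {commit_sha[:8]}",  # step 3: merge request for that branch
        description=description,
    )
```

Keeping the planning step separate from any Git or GitLab calls makes the naming convention easy to test on its own.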
There's a merge request in our config repo. When Mae goes in, she can check that the tests passed, and she can see the change that was in our source code repo. And she can decide whether she thinks that's a good change that should be going into prod. If she thinks that the change should go into prod, then she will approve it and merge it into Master.
Once that change is in Master in our config repo, it will do two things. First, you see on the right side that it says the Kubernetes Cluster, and it has the Namespace. Then it has an operator in there — and that operator that we use is Flux.
Flux is going to continuously pull the Master branch in the config repo on a set interval that we've set. If it sees a change in those deployment files in Master, it will automatically deploy that change to production. It also has a git-sync interval that you can set. That interval will check for anything that's out of sync. It'll automatically make sure that everything's back in sync. It'll force apply those changes.
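The sync behavior described here, comparing what Master declares against what is running and re-applying anything that differs, can be illustrated with a short Python sketch. This is a conceptual model only, not how Flux is implemented: the dictionaries stand in for the deployment files in Git and for the live cluster state, and the function names are hypothetical.

```python
def out_of_sync(desired: dict, live: dict) -> list:
    """Names of resources whose live state differs from what Git declares."""
    return sorted(name for name, spec in desired.items() if live.get(name) != spec)


def sync(desired: dict, live: dict) -> list:
    """Force the live state back to the declared state; return what was re-applied."""
    drifted = out_of_sync(desired, live)
    for name in drifted:
        live[name] = desired[name]  # Git wins: overwrite whatever is running
    return drifted
```

Running `sync` a second time with no new changes returns an empty list, which is the steady state an operator settles into between merges.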
The bottom thing in this diagram is our webhook. We have also created our webhook internally. This webhook is listening for merge request events. It's continuously checking for merge request events. If it sees that the status is merged, it will kick off a change record to our asset inventory. That change record is also going to use that commit SHA from the source code commit. In that way, the commit SHA gets carried all the way to production — and to even being recorded in the asset inventory. That's how that's tracked.
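The webhook's decision logic can be sketched in Python as follows. The event field names (`object_kind`, `object_attributes.state`, `source_branch`, `url`, `project.path_with_namespace`) follow GitLab's merge request webhook payload as commonly documented, but treat them as assumptions to verify against your GitLab version. The change-record fields are hypothetical, since the talk does not describe the asset inventory's format.

```python
def change_record_for(event: dict):
    """Return a change record for a merged merge request, else None."""
    if event.get("object_kind") != "merge_request":
        return None  # ignore pushes, comments, and other event types
    attrs = event.get("object_attributes", {})
    if attrs.get("state") != "merged":
        return None  # opened, updated, or closed requests create no record
    return {
        # The branch is named after the source commit SHA, so that SHA
        # becomes the identifier carried into the asset inventory.
        "commit_sha": attrs.get("source_branch"),
        "config_repo": event.get("project", {}).get("path_with_namespace"),
        "merge_request_url": attrs.get("url"),
    }
```

A real handler would then send the returned record to the asset inventory; that call is left out because its interface is not described in the talk.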
This workflow is exactly the same up to the config repo; nothing has changed there. The only thing that changes is that it uses continuous deployment pipelines to get to production. There's no Flux in this process. In these continuous deployment pipelines, we've added lots of reusable pipeline stages to ease the adoption of GitOps by our consumers.
This one is exactly the same up to the config repo; nothing has changed. But you see the integration of Terraform Enterprise here. Instead of Flux or the pipelines, we are using Terraform Enterprise.
If your deployment is as simple as a terraform plan and apply, then you can make use of the VCS integration in TFE. When something is put into Master, it kicks off an external pipeline that runs that terraform plan and apply in TFE. In that way, it acts like Flux, which is what Mae was alluding to.
You can also take advantage of that CLI and API if you don’t want to use the VCS integration. And you can still do an external pipeline as well. It will still kick off a pipeline in TFE. I'm going to kick it back over to Mae, who's going to go more in detail on how we set up that config repo and how we maintain it.
Mae Large: Thank you, Pinky. Now you have a better understanding of these workflows. What is the config repo? Let's talk about what powers that.
These modules, developed in-house, are what we use to create locked down GitLab Groups and GitLab repositories. If I have product A, that translates into a config group, a locked down GitLab Group to use GitOps. This product A is made up of a UI, an API, a database, and maybe a couple of Lambda functions: five components that make up this product A. Each of these translates to a config repo nested underneath that config group. It's also locked down.
How did we do these modules? Well, we leveraged the GitLab Terraform provider. And truth be told, we do have null resources that delegate down to some custom scripts to achieve the true locked down requirement that we have to comply with.
Next, you may be wondering, where are these modules hosted? They're in TFE. We are a tenant; GitOps is a tenant in our TFE instance. These two modules that we provide for our customers are hosted in TFE using the Private Module Registry, which makes it easier for our customers to see: what do these modules do? How do I use these inputs and outputs? The ReadMe documentation is all in that TFE UI, and even the versions that we released to our customers.
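The product-to-repository mapping can be pictured with a tiny Python sketch. The talk only says that a product becomes a config group and each component becomes a config repo nested under it, so the exact path format below (group named after the product, repos at `product/component`) and the component names are assumptions for illustration.

```python
def config_layout(product: str, components: list) -> dict:
    """Map a product and its components to a config group and nested config repos."""
    return {
        "group": product,  # assumed: the group path is just the product name
        "repos": [f"{product}/{component}" for component in components],
    }


# Product A from the example: five components, hence five config repos.
layout = config_layout(
    "product-a", ["ui", "api", "database", "lambda-one", "lambda-two"]
)
```

In the workflow described in the talk, this layout is created by Terraform modules rather than by a script; the sketch only shows the naming relationship.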
I've also touched on the workspaces. That's how we achieve a lot of that convention. Even with these TFE workspaces, we use Terraform to stand them up. It's pretty obvious in every one of our automation pieces; if we can use Terraform, we use Terraform — embracing our infrastructure as code solution.
We call it The Enforcer. Every night, we have a scheduled pipeline. Obviously the recurrence can be adjusted — the beauty of it being a scheduled pipeline. Every night, it goes out to every single one of those TFE workspaces under the GitOps organization — and it does a terraform apply. It's our way of, again, embracing infrastructure as code. The code is your source of truth. That is what gets realized as far as how your resources are configured and stood up. Not just the first time you went through it, but we reinforce it on a recurring basis.
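The nightly loop can be sketched in Python as below. This is a minimal illustration, not the actual Enforcer: the `apply` callable is a hypothetical stand-in for whatever triggers a terraform apply on a TFE workspace, and the workspace names are assumed to come from listing the organization's workspaces.

```python
from typing import Callable, Iterable


def enforce(workspaces: Iterable[str], apply: Callable[[str], bool]) -> dict:
    """Re-apply every workspace and record the outcome of each one."""
    results = {}
    for workspace in workspaces:
        try:
            results[workspace] = "applied" if apply(workspace) else "failed"
        except Exception as exc:
            # One broken workspace should not stop enforcement of the rest.
            results[workspace] = f"error: {exc}"
    return results
```

Collecting a result per workspace, instead of stopping at the first failure, lets a scheduled run report which products drifted or failed while still re-applying all the others.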
I'm going to turn it over again to Priyanka to wrap us up.
Priyanka Ravi: Thanks, Mae. We wanted to leave you guys with a couple of thoughts. One, which is near and dear to both Mae's heart and mine, is that Terraform is not just for public cloud. We hear this sometimes even in State Farm, where people come to us when they're setting up their config repo, and they say, "Oh, we're going to Kubernetes. Why are we using Terraform?"
We want to reiterate that there are so many more uses to Terraform. In our case, we're using the GitLab Provider to stand up our config repos, and we're using the TFE provider to stand up our workspaces.
It was fun to get to see different uses for it — and we hope that we've inspired you guys: One, to look into GitOps for your own company — your own use. And also to try to see how you can use Terraform — and use TFE as well. We thank you guys for this opportunity to share our experiences. Thank you all.