What causes our infrastructure's configuration to drift over time away from our original intended state? And how does Terraform help?
For a deep-dive tutorial on how to use Terraform for mitigating configuration drift, read our HashiCorp Learn tutorial: Managing Resource Drift, and our blog guide: Detecting and Managing Drift with Terraform.
We often describe Terraform as a configuration-as-code tool, meaning we're taking our configurations, capturing them, and defining them as code. A common question that comes up is: how do you deal with "drift" that takes place in the real world? Drift meaning our infrastructure starts to vary from what we've defined it to be. How do we actually think about detecting that, managing it, and correcting it?
I think what's helpful is if we step back quickly and think about how Terraform works. What we're doing is capturing in sort of an infrastructure-as-code way, what our infrastructure should look like. So we might say we want a VM, the VM consumes a database, we're going to put a load balancer in front of that, and I want a DNS record that points to my load balancer. This might be a simple web application, a few different tiers of infrastructure. We capture all that as code and then we call Terraform and tell it, "please apply, go make this."
In a day one scenario, meaning none of this exists yet, what Terraform is going to do is go create all of it. So Terraform will connect to our cloud providers and infrastructure providers and go create all this infrastructure, from nuts and bolts all the way up. As part of that, what Terraform does is spit out a state file. And a state file is a way of tracking: what is all of the infrastructure related to this project?
This might be the ID of the virtual machine, the ID of the load balancer, the name of the DNS record, all of those kinds of things. Terraform is going to track all of that, along with what's associated with it.
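To make the idea of a state file a bit more concrete, here is a toy Python model. It is only a sketch, not Terraform's actual state format: the resource names, the fabricated IDs, and the `day_one_apply` helper are all assumptions made up for illustration.

```python
import copy
import itertools

# Desired infrastructure, as captured in code (all names are hypothetical).
CONFIG = {
    "vm.web": {"size": "small"},
    "db.main": {"engine": "postgres"},
    "lb.front": {"targets": ["vm.web"]},
    "dns.app": {"ttl": 30, "points_to": "lb.front"},
}


def day_one_apply(config):
    """Pretend to create every resource and return the resulting state."""
    counter = itertools.count(1)
    state = {}
    for name, attrs in config.items():
        kind = name.split(".")[0]
        # A real provider would hand back its own ID; we fabricate one here.
        state[name] = {
            "id": f"{kind}-{next(counter):04d}",
            "attrs": copy.deepcopy(attrs),
        }
    return state


state = day_one_apply(CONFIG)
for name, record in state.items():
    print(name, "->", record["id"])
```

The point of the sketch is the mapping itself: each name in the code definition is tied to the ID of a real object, which is what lets a later run ask the provider about exactly those objects.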
As we get to a day two setting, now this infrastructure already exists. We created it day one but now we're going to modify it. So we might modify it in two different ways. One way is we actually modify the code definition. We realize, actually I need an additional VM and I forgot I should put a CDN in front of my DNS record. So I'm going to make some changes, day two, to my definition.
In the meantime, a user has come along, logged into, let's say, our cloud portal, and accidentally killed this VM. This is how drift starts to take place in the real world. They've killed this VM, and they've also accidentally changed the TTL (time to live) on our DNS record from, call it, 30 seconds to 3 seconds.
Some changes have taken place, right? So when we're talking about this, there are sort of two different types of drift. There's drift between how Terraform thought the world looked and how it actually looks: we thought there was a VM, but someone's killed the VM. And there's drift in the configuration, meaning the user is actually changing the configuration. They've asked for a new VM, they've asked for a CDN, so this is a purposeful sort of change in the config.
The way Terraform works is, when we invoke Terraform in a day two setting, there's existing infrastructure, so what Terraform is going to try and do is run what we call a refresh. The idea behind a refresh is Terraform queries the world and says, "Tell me what's changed." This is how Terraform will detect that this VM is gone, because in the meantime somebody deleted it, and also that somebody modified this TTL. So Terraform will pick up on the fact that the real world doesn't look like the world we think it should be.
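The refresh step can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how Terraform is implemented: `STATE`, `REAL_WORLD`, the IDs, and the `refresh` function are hypothetical stand-ins for the state file and the provider's API.

```python
import copy

# What was recorded after the last apply (toy state; IDs are made up).
STATE = {
    "vm.web": {"id": "vm-0001", "attrs": {"size": "small"}},
    "dns.app": {"id": "dns-0004", "attrs": {"ttl": 30}},
}

# What the provider reports today: the VM was deleted, the TTL was edited.
REAL_WORLD = {
    "dns-0004": {"ttl": 3},
}


def refresh(state, real_world):
    """Re-read every tracked resource; return (refreshed_state, findings)."""
    refreshed, findings = {}, []
    for name, record in state.items():
        actual = real_world.get(record["id"])
        if actual is None:
            findings.append(f"{name}: no longer exists")
            continue  # dropped from state, so a later plan would recreate it
        if actual != record["attrs"]:
            findings.append(f"{name}: changed {record['attrs']} -> {actual}")
        refreshed[name] = {"id": record["id"], "attrs": copy.deepcopy(actual)}
    return refreshed, findings


refreshed, findings = refresh(STATE, REAL_WORLD)
for line in findings:
    print(line)
```

Note that the refresh only updates our picture of the world. Nothing is fixed yet; that is left to the next step.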
The second part of this is Terraform figuring out: how do we fix this? What needs to change to correct the drift? That's what Terraform calls a Plan. Plan is a comparison between the world as we want it to look, which is our code definition, and the world as it is.
Here's where Terraform can start to correct that drift. What Terraform will do is inform us what it is going to do, what it is going to change. That's part of the Plan. It'll tell us these two VMs, they need to be created. This one needs to be created because somebody accidentally deleted it. This one needs to be created because you just asked for it. Our load balancer needs to be modified because it needs to know about both of the VMs. Our DNS record, we gotta correct the TTL, so we need to snap it from 3 back to 30, which is what we asked for. And we gotta go create the CDN.
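The comparison behind a Plan can be sketched as a simple diff. Again, this is a toy Python model with hypothetical names (`CONFIG`, `REFRESHED_STATE`, `plan`), assuming state and config are plain dictionaries; Terraform's real planner also handles dependencies and ordering, which this sketch ignores.

```python
# Desired definition after the day-two code changes (hypothetical names).
CONFIG = {
    "vm.web": {"size": "small"},
    "vm.web2": {"size": "small"},
    "db.main": {"engine": "postgres"},
    "lb.front": {"targets": ["vm.web", "vm.web2"]},
    "dns.app": {"ttl": 30},
    "cdn.edge": {"origin": "dns.app"},
}

# State after a refresh: vm.web is gone and the TTL reads 3.
REFRESHED_STATE = {
    "db.main": {"engine": "postgres"},
    "lb.front": {"targets": ["vm.web"]},
    "dns.app": {"ttl": 3},
}


def plan(config, state):
    """Diff desired config against refreshed state into a list of actions."""
    actions = []
    for name, want in config.items():
        if name not in state:
            actions.append(("create", name, want))
        elif state[name] != want:
            actions.append(("update", name, want))
    for name in state:
        if name not in config:
            actions.append(("destroy", name, None))
    return actions


actions = plan(CONFIG, REFRESHED_STATE)
for verb, name, want in actions:
    print(f"{verb:8}{name}  {want or ''}")
```

Both kinds of drift end up in the same list: the deleted VM and the new VM are both creates, the TTL and the load balancer are updates, and the untouched database produces no action at all.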
The goal at planning time is: how do we give the operator confidence that they understand what's going to change if they push go? This Plan is not a best-effort guess; this is exactly Terraform's execution plan. This is what Terraform will do, without deviation, if you push go.
This is a very trivial example, and the value of the Plan isn't as clear because it's such simple infrastructure; you could just guess what the tool is going to do. But a real infrastructure is hundreds or thousands of resources interacting in a complex way, and so the value of the Plan is: how do we give operators confidence that they won't cause data loss, they won't cause downtime, they won't cause user-facing impact? They can look at this to understand what the change will be. If that looks good, then you push Apply and now Terraform actually makes those changes: create the VMs, update the load balancer, add the CDN, fix the DNS record.
Terraform will then spit out an updated version of the state file. What this gives us is a version history of how our infrastructure has evolved over time. Then we can use this every subsequent time we invoke Terraform to go through this same exact motion. So the next time we run Terraform, we'll do a refresh, we'll do a Plan, and we can do an Apply against it. This way we can detect and manage both types of drift: drift that happens in-environment, external to Terraform, and drift that's initiated by using the Terraform code itself.
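The whole motion of refresh, Plan, and Apply, with a new state snapshot after every run, can be sketched as a loop. This is a toy Python model, not Terraform itself: the `refresh`, `plan`, `apply`, and `run` functions, the resource names, and the in-memory `world` dictionary are all assumptions for illustration.

```python
import copy


def refresh(state, world):
    """Keep only resources that still exist, with their current attributes."""
    return {n: copy.deepcopy(world[n]) for n in state if n in world}


def plan(config, state):
    """List the (verb, name) pairs needed to move state to config."""
    actions = [("create" if n not in state else "update", n)
               for n, want in config.items() if state.get(n) != want]
    actions += [("destroy", n) for n in state if n not in config]
    return actions


def apply(config, world, actions):
    """Carry out the plan against the toy world and return the new state."""
    for verb, name in actions:
        if verb == "destroy":
            world.pop(name, None)
        else:
            world[name] = copy.deepcopy(config[name])
    return {n: copy.deepcopy(world[n]) for n in config}


def run(config, world, history):
    """One invocation: refresh, plan, apply, then record a state snapshot."""
    state = refresh(history[-1] if history else {}, world)
    actions = plan(config, state)
    history.append(apply(config, world, actions))
    return actions


config = {"vm.web": {"size": "small"}, "dns.app": {"ttl": 30}}
world, history = {}, []

day_one = run(config, world, history)       # nothing exists, so all creates
world.pop("vm.web")                         # someone deletes the VM by hand
world["dns.app"]["ttl"] = 3                 # and someone edits the TTL
config["cdn.edge"] = {"origin": "dns.app"}  # meanwhile the code gains a CDN
day_two = run(config, world, history)       # both kinds of drift show up
day_three = run(config, world, history)     # world matches code again
print(day_one, day_two, day_three, sep="\n")
```

Each entry in `history` plays the role of one version of the state file, which is what makes it possible to look back at how the infrastructure evolved from run to run.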
Hopefully that gives you a better understanding of how we detect and prevent configuration drift with Terraform.