HashiCorp co-founder and CTO Armon Dadgar explains the differences and trade-offs between mutable and immutable infrastructure.
One of the key aspects of The Tao of HashiCorp is the notion of immutability, the idea that once we create a thing, we don't change it after creation. One thing we get asked often is, "What's the difference between a mutable approach and an immutable approach? What are the advantages or trade-offs between them?"
When we talk about a mutable approach, what we're really talking about is, let's say I'm creating a server. This is maybe a VM, maybe it's bare metal, it doesn't really matter. Let's call it a web server.
I'm going to deploy something like Apache 2.4 as my web server and then I'm deploying my app as well. Let's call it web server version 1. I have traffic coming in, users making requests to this thing.
But over time, I wanna make changes. I wanna upgrade the version of the web server, maybe I wanna update to a more modern version of Apache, or maybe I wanna switch to a different web server like NGINX. The way to think about what we're doing is, over here we've defined version 1 of our web server.
What we're gonna do is go back and define what we want version 2 to look like. We're gonna update our web server, let's just say to version 2, and instead of Apache, we're gonna use NGINX.
We have NGINX, and this is our version 2 of the world, and this could be a server or VM or what-not. In a mutable world, what we're gonna do is try to upgrade this existing server to this new version 2 configuration. We're gonna mutate it, modify it in place, to get into this new configuration.
Typically this is gonna be done with something like configuration management. We have a configuration-management tool, this could be Chef, Puppet, Ansible, something like that. We're gonna run the config management the first time around to make the world look like this, and then we'll rerun it once we've updated our definition to go from version 1 to version 2.
What's nice about that is, we already have this existing server. Maybe we have data that we've written locally and that our web server is consuming. When we update in place, we don't have to worry about moving the data around to other machines, creating a new machine, all of the infrastructure already exists. All we're gonna do is perform this upgrade.
The challenge of mutability, and the trade-off that comes with it, is: what happens if this doesn't upgrade perfectly? In the real world, things go wrong. Maybe when we trigger this upgrade, the first thing we're gonna do is say, "We need to install NGINX, because we don't use NGINX over here."
We'll try to run an apt-get install of NGINX, and we want that installed, but this could fail. It could be that, at the moment we ran the tool, our network was flaky, maybe DNS was down, maybe our APT repos weren't responsive. Who knows, there are a million reasons it could fail.
We end up in this funky state where NGINX didn't install, but we did manage to deploy version 2 of our web server. Now we're in this interesting situation where over here we tested what version 2 looked like. We understood that version 2 of our app, with NGINX, works. We've tested it, we've validated it. And version 1 with Apache and our web server, we understood, validated, tested it.
But now we're in neither of these things. We're in a weird version 1.56. It's not well understood. We have Apache still running, we don't have NGINX, plus we have a new version of our web server.
You'd call this a partial upgrade, right? If we think in terms of database-land, it's a partially committed transaction. Part of the changes took place, part of the changes did not take place.
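To make the partial upgrade concrete, here is a minimal sketch in Python. It is a toy model, not how Chef, Puppet, or Ansible work internally: the server is a plain dictionary, and every function name here is hypothetical. The assumption is that each step is applied independently and a failed step does not undo the ones that succeeded.

```python
# Hypothetical model of an in-place (mutable) upgrade; none of these names
# come from a real configuration-management tool.

class StepFailed(Exception):
    """Raised when a single upgrade step cannot complete."""


def install_nginx(server, network_ok):
    # Stands in for something like `apt-get install nginx`.
    if not network_ok:
        raise StepFailed("could not reach package repository")
    server["packages"].add("nginx")


def deploy_app_v2(server):
    server["app"] = "v2"


def remove_apache(server):
    server["packages"].discard("apache")


def upgrade_in_place(server, network_ok=True):
    """Apply each step independently and collect any errors.

    A failed step is recorded but does not roll back the steps that
    succeeded, so the server can end up between version 1 and version 2.
    """
    errors = []
    steps = [
        ("install nginx", lambda: install_nginx(server, network_ok)),
        ("deploy app v2", lambda: deploy_app_v2(server)),
    ]
    for name, step in steps:
        try:
            step()
        except StepFailed as exc:
            errors.append(f"{name}: {exc}")
    # Only retire Apache once NGINX is actually present.
    if "nginx" in server["packages"]:
        remove_apache(server)
    return errors


server = {"packages": {"apache"}, "app": "v1"}
errors = upgrade_in_place(server, network_ok=False)
print(server)
print(errors)
```

With the simulated network failure, the server ends up with Apache still installed, no NGINX, and version 2 of the app: the untested in-between state, neither version 1 nor version 2.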
What does this introduce for us? This upgrade process has the downside of introducing risk. The risk is, now we're in a half-failure scenario and the other side of it is, it adds complexity. If I was doing QA, I understood what version 1 looked like. I tested it, I validated it. Same with version 2. I tested it, I validated it.
I never tested or validated 1.56. I don't understand what version 1.56 is, because I never anticipated being in this state. But here I am, and I have the complexity of figuring out, is there an upgrade path possible from version 1.56 to version 2 and what is the experience of the users as traffic is now hitting version 1.56?
Does the website work, are they getting errors, what's happening? So this is complex even with a single machine, but now imagine I have a fleet of many hundreds or many thousands of machines and they all fail in slightly different ways. Maybe one of them failed to install NGINX, another one installed NGINX but failed to install the web server, and maybe a third machine failed to uninstall Apache. You end up with a proliferation of "Well, this one's 1.56, but that one's 1.54, and that one's 1.78."
You have to start thinking about your versioning not as a discrete version 1 and version 2, but as a continuous spectrum where everything in the middle is also possible. This becomes a complex scenario to be in.
You might say, this seems impractical. In practice, 99% of the time, this thing just works. That's true. The problem is that something working 99% of the time, multiplied across a thousand machines, means that a fair bit of the time it's not working. You end up with these complex-to-debug problems: One in ten requests gets an error, or one in ten requests is slightly slower than it should be.
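The arithmetic behind "99% times a thousand machines" can be worked through in a few lines of Python. The figures are the ones used in the talk; the assumption added here is that each machine succeeds or fails independently of the others, which is a simplification.

```python
# Assumptions for illustration: each machine upgrades cleanly with
# probability 0.99, independently of the others, across 1,000 machines.
p_success = 0.99
machines = 1000

# Average number of machines left in a partially upgraded state.
expected_failures = machines * (1 - p_success)

# Probability that the whole fleet upgrades without a single failure.
p_all_succeed = p_success ** machines
p_at_least_one_failure = 1 - p_all_succeed

print(f"expected failed upgrades: {expected_failures:.0f}")
print(f"chance every machine upgrades cleanly: {p_all_succeed:.6f}")
print(f"chance of at least one failure: {p_at_least_one_failure:.6f}")
```

Under these assumptions you should expect around ten machines in an unknown state after every fleet-wide upgrade, and a fully clean rollout is the rare case rather than the normal one.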
And these become incredibly hard to debug because your system is in a poorly understood state.
That brings us to the alternate way of thinking about this, which is, if this is mutable, then how do we think in terms of an immutable world?
The difference is, when we go immutable, we don't want to ever upgrade in place. Once the server exists, we never try to upgrade it to V2. We'll create our server, call it version 1 again, we'll install Apache, we'll install our web server, and we'll take a snapshot of this image. We'll call this version 1 of our server.
Then we'll boot this, we'll create, let's say a VM, and then we'll allow user traffic to start coming into this. Great, we've deployed version 1 of our VM, just like we did in our mutable configuration. But when we go to version 2, what we're going to do is create a brand new server. This one will have web server V2, plus it'll have NGINX, and this is V2. This is on a new VM. If this was VM1, this is VM2, so it's a distinct machine. We're not trying to upgrade the existing infrastructure.
If there's any error, we'll abort this, throw this thing away and try it again. But if we successfully created V2, there were no errors, everything installed, then what we'll do is switch traffic over. Our user now starts hitting V2 instead of hitting V1. Then we'll just decommission version 1. We'll take that VM out of production or destroy it or recycle it for some other purpose, and so on and so forth. But the goal is, we never try to in-place modify this system.
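The create, validate, switch, decommission flow described above can be sketched roughly as follows. This is a toy model under stated assumptions: images, VMs, and the router are plain dictionaries, and `build_image`, `boot_vm`, and `roll_out` are hypothetical names, not calls from any real cloud or HashiCorp API.

```python
# Hypothetical sketch of an immutable rollout; the image, VM, and router
# objects are plain dictionaries, not a real cloud API.

class BuildFailed(Exception):
    """Raised when a new image cannot be built completely."""


def build_image(version, packages, install_ok=True):
    """Build a complete image, or fail without producing anything."""
    if not install_ok:
        raise BuildFailed(f"image {version} failed to build")
    return {"version": version, "packages": frozenset(packages)}


def boot_vm(name, image):
    return {"name": name, "image": image}


def roll_out(router, new_vm_name, version, packages, install_ok=True):
    """Create a new VM, switch traffic to it, and return the old VM.

    If the build fails, the router is left untouched and keeps pointing
    at the VM that was already serving traffic.
    """
    try:
        image = build_image(version, packages, install_ok)
    except BuildFailed:
        return None  # abort: nothing was modified in place
    new_vm = boot_vm(new_vm_name, image)
    old_vm = router["target"]
    router["target"] = new_vm  # switch traffic over
    return old_vm              # caller decommissions or recycles it


vm1 = boot_vm("vm1", build_image("v1", ["apache", "app-v1"]))
router = {"target": vm1}

# A failed build leaves version 1 serving traffic.
roll_out(router, "vm2", "v2", ["nginx", "app-v2"], install_ok=False)
print(router["target"]["image"]["version"])

# A successful build moves traffic to version 2 in one step.
retired = roll_out(router, "vm2", "v2", ["nginx", "app-v2"])
print(router["target"]["image"]["version"], retired["name"])
```

The point of the design is that the only thing that changes on the serving path is which VM the router points at, so traffic only ever sees a fully built version 1 or a fully built version 2.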
What this lets us do is have a notion of discrete versioning. There was either version 1 running and that's where traffic went, or there was version 2 running and that's where traffic went. This middle zone doesn't exist. There was no version 1.5 in between these things. The advantage of this becomes, as we think about risk and complexity, there's much lower risk, because we don't have these undefined states that aren't validated, but we also reduce the complexity of our infrastructure.
Now I can talk in terms of histograms. I can say I have 50 machines in version 1, and 20 machines in version 2, as opposed to having some distribution of machines and different versions. So it's much lower complexity as I reason about what this infrastructure looks like.
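As a small illustration of the histogram idea, the sketch below counts machines per version for two made-up fleets. The immutable fleet uses the 50 and 20 figures from the talk; the mutable fleet's in-between version labels and counts are invented purely for illustration.

```python
from collections import Counter

# Made-up fleet inventories for illustration only.
immutable_fleet = ["v1"] * 50 + ["v2"] * 20
mutable_fleet = (
    ["v1"] * 45
    + ["v2"] * 18
    + ["v1.54"] * 2
    + ["v1.56"] * 4
    + ["v1.78"] * 1
)

immutable_counts = Counter(immutable_fleet)
mutable_counts = Counter(mutable_fleet)

print(dict(immutable_counts))
print(dict(mutable_counts))
```

The immutable fleet collapses into exactly two buckets, while the mutable one needs a bucket for every partial state that happened to occur.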
It's not without trade-offs. What if this application had state? What if this app was writing to its local disk and it had data that mattered to the application. Here what we said is, we create a new machine, delete that machine, including its data, including its disk. That clearly doesn't work.
To make this effective, what you generally need to do is externalize the data. Instead of it being on the same box as the application, maybe I use an external database that's shared. My first VM is writing to the database, and it's the same one that my second VM is using.
As I make this transition, I don't have to worry that I'm destroying the data on this box, because I've externalized it. This becomes key. The externalization of data allows the immutable pattern to be applied here. In general, database-like systems tend to be updated much less often than things like our applications, so we might say, "You know what, we're gonna use a mutable approach to managing databases because it's so infrequent and we don't have to bother with data migration."
Or if you're in a cloud environment and you have things like Elastic Block Store or externalized software-defined storage, maybe the underlying disk is mutable, but even the machine running our database is still immutable.
What we might do is shut down the VM that's running, who knows, MySQL 5.7. We'll shut that down, we'll bring up a new one running MySQL 8, and reattach it to the same disk. In this way, the data itself isn't being lost. It's just the machine, the compute, that's being moved from one version to the other.
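Here is a rough sketch of that detach-and-reattach pattern. The `Volume` and `VM` classes are hypothetical stand-ins for something like a block-storage volume and a database VM; they do not mirror any cloud provider's API, and the release names are placeholders.

```python
# Hypothetical model: a Volume outlives the VMs it is attached to.
# These classes are illustrative and do not mirror any provider's API.

class Volume:
    def __init__(self, name):
        self.name = name
        self.data = {}
        self.attached_to = None


class VM:
    def __init__(self, name, software):
        self.name = name
        self.software = software
        self.volume = None
        self.running = True

    def attach(self, volume):
        if volume.attached_to is not None:
            raise RuntimeError(
                f"{volume.name} is still attached to {volume.attached_to}"
            )
        volume.attached_to = self.name
        self.volume = volume

    def shut_down(self):
        # Detach the disk before the machine goes away;
        # the data stays on the volume.
        if self.volume is not None:
            self.volume.attached_to = None
            self.volume = None
        self.running = False


disk = Volume("db-data")

old_db = VM("db-vm1", "database, old release")
old_db.attach(disk)
old_db.volume.data["users"] = ["alice", "bob"]

# Replace the compute: retire the old VM, boot a new one, reuse the disk.
old_db.shut_down()
new_db = VM("db-vm2", "database, new release")
new_db.attach(disk)

print(new_db.software, new_db.volume.data)
```

The machine is treated as immutable and disposable, while the only mutable piece, the data, lives on a volume with its own lifecycle.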
There are different approaches in terms of how we would make immutability work. This is fundamentally the distinction: Do we take existing infrastructure and try to upgrade it in place, or do we create new infrastructure alongside it and then destroy the existing thing?
That's the core distinction between mutable and immutable infrastructure.