Engineering reliability by expecting failure: SRE with HashiCorp Nomad
Site reliability engineering (SRE) doesn't just mean preventing system failures. Increasingly, it focuses on expecting failure and having systems in place to automatically heal your infrastructure when that inevitably happens. HashiCorp Nomad is a lightweight scheduler that can help you build self-healing systems.
IT organizations always want their production environment to be up. Downtime costs money. But software systems are inherently unpredictable and will always fail at some point.
How do you build a system that adapts to that reality and produces high availability through intelligent redundancy and failover in your infrastructure? Alex Dadgar, the Nomad team lead at HashiCorp, gives an overview of how HashiCorp's Nomad scheduler provides this intelligent adaptability with failure modes— whether you use virtual machines or containers.
Software Developer, HashiCorp
Hey, good morning everyone. I'm happy to be up here talking about failure modes and how we deal with them in Nomad. For those of you who don't know me, I'm Alex Dadgar. I'm the team leader of Nomad.
As an industry, we're really obsessed with reliability. We talk about it all the time. Is your service HA? What's its failover strategy? Is it active-active? We do it because it's really important. It affects our users who rely on our services. It affects our businesses both from cost and reputation. It affects us as operators too because we don't want to get paged in the middle of the night.
If we want to build reliable services, why do we over-complicate things? Should we just keep them all as simple and as static as possible? Run them on good machines, and don't touch them, maybe it will stay up.
» High availability and its complexity costs
Well, it's not that simple, and I want to motivate it with a really trivial example. Let's say we're creating an API for our internal services, and it has a really low request rate. A simple machine can really serve all the traffic we need. We just boot up one machine in AWS. Our users start using it. Everything goes great. But then after a few months, maybe Amazon migrates that instance, and our service goes down, and now, our users are unhappy.
You go back to the drawing board and you say, "Let me add just the minimal amount of complexity to deal with this." What you do is you boot two instances. You throw up a load balancer in front, and you statically assign backends to the load balancers. Many more months go on, and things are going great. You solved the problem. Well, now another machine goes down, and you still have work to do, because now you only have one more instance, and if that fails, your service goes down again.
You need to boot up another machine, add the route to the load balancer, and you still have more work to do. The real solution in this case is you put service into ASG (Auto Scaling Group), and you make your load balancer dynamic. Even with the most simple service, 10 requests per day, even if you won't operate that, you'll end up with a complicated architecture.
Why does that happen? It's really because of this fundamental law, the second law of thermodynamics states that the entropy of an isolated system will never decrease over time. What that means is when you first start something, that is the lowest entropy state it'll ever be in. It will only increase in entropy, and the reality of that is, that things will fail regardless of how simple you keep it.
» Intro to schedulers and Nomad
In production, you really need dynamic systems to react to these failures. If you want to build reliable systems, you need to expect these failures. That's a core design tenet of Nomad. When we're designing any feature of Nomad, we really think about all the failure modes that can happen, and how it will affect the jobs we're running on your behalf.
Nomad is a cluster scheduler. There are a lot of words for this—cluster orchestrator, cluster manager, container manager. Really what all of these do is they sit on top of your nodes that are your infrastructure. This can be in private data centers, public clouds, and they provide a unified API where users can then submit jobs to the cluster scheduler.
The cluster scheduler's job is to take those jobs and nodes, and map them together, and run them in the optimal way. If you're running an API server, if you ask this cluster scheduler to run an API server, it might pick a good node for you. But if that node dies later on, the cluster scheduler will detect it, and move that job to another up node.
In Nomad, the job the user submits looks something like this. In this example job, we're defining a Redis container. So we say, driver equals Docker. We're telling Nomad, please run this with the Docker driver. We specify some resources and tell it what image to run. Then when we submit this to Nomad, Nomad will go find a place on the cluster, and run this for you.
This is a very simple job. We'll see a [few] more job files. This isn't too important. What makes Nomad a little different from other cluster schedulers is we try to design things based on workflows, and not technologies in particular. The effect of this is you can use Nomad to run both containerized applications, and legacy applications, either on Unix platforms, or on Windows.
These drivers, which are what actually run user workloads are pluggable. If you have a bespoke user case, you can model that with Nomad. We also want Nomad to be incredibly easy to use. Easy to use both for operators, which are bringing these systems up, and making sure they stay up, and another user—the developer who's defining these jobs and running them. To do this, we made Nomad have a built-in state source, so you don't need to run etcd, Zookeeper, or any other data store out of band.
Nomad's a single binary that you can deploy in both server and client mode. Another important thing for Nomad is that it scales beyond the needs of your business. We have use cases of over 10,000-node clusters. We also made Nomad very fast. Even if you did schedule jobs in real-time, Nomad can support that. A while ago, we benchmarked Nomad and we were able to place over 5,500 containers a second, on a 5,000-node cluster.
What that ended up being is a million containers scheduled in just over 18 seconds. Nomad's also publicly used by many companies we all know. It's been used in production for a while now, so we also try to make Nomad incredibly stable. That's what Nomad is, and it's motivating the problem of needing dynamic systems to react to failures.
These are the various failure modes I want to talk about today: There's the control plane, which is actually the things that are making decisions on your behalf, based on the state of the cluster. There are the nodes that are actually running workloads, and then the applications users submit.
Let's get started with the control plane. I'm going to keep this section brief, because I more want to talk about things that affect the jobs you all submit.
» The control plane
What is the responsibility of the control plane? You can almost boil it down to just the top bullet, it reconciles user intent with cluster state. If you ask it to run a hundred jobs, its job is to always make sure those are up, as you've defined it. What that means is it schedules new jobs when you submit them. If nodes fail or come up, it should then react to them and place new jobs, or migrate old ones, and much, much more.
To model this failure, we decided Nomad should be a CP system. When we're thinking about distributed systems, you have two options. You can be AP or CP. Nomad is consistent under partitions, because you cannot reconcile if there isn't a source of truth. Nomad's job is to take what you've told it to do, and make sure that's happening in the cluster. If the servers don't have a consistent view of the world, they can't do that effectively.
We've built Nomad on top of Raft. We reused a lot of work and hardening done by Consul. What this means is the control plane can handle an over minus one over two failures. What this looks like in practice since you deploy systems using consentaneous protocols and odd numbers, usually three, five or seven. You can handle one failure in the three mode, or two server failures in the five case.
If a server fails and comes back up, the Raft protocol will quickly send snapshots to the servers, to catch them up with the most state possible. In the deltas, it will send logs of every change that has happened. When a server does fail, when a new one comes up, it can quickly become part of the quorum. That's all I want to talk about there.
There's a lot more we've done to handle various failure modes, through sets of enhancements which we term autopilot. The most interesting ones are read replicas and redundancy zones. If you're interested in handling more failure domains in your control plane, please take a look at autopilot.
» The nodes
The next thing I want to talk about is the failures that can happen on the node. A node, when I say that for Nomad, is what's actually running user workloads. The user workload can be fairly complicated. You might be asking to run an API server using Docker, but so that anyone can use your service, you'll also register it with Consul.
Nomad can register and reconcile with Consul. To get secrets we leverage the Vault. The responsibility of the node is to manage all of this information for you, so your application doesn't have to. This is what a typical node might look like [10:54]. In the middle, we have the Nomad agent, and then we also usually have a Consul agent on the same host, so you can do service discovery. Then you'll have a set of drivers.
I picked Docker here because it's one of the most popular runtimes. Then off of the machine, you'll have Vault, which does secret management. We'll also be connecting to the Nomad server, so that we can see what jobs we should be running. What are the various failure modes we can handle on this type of machine? The answer is actually all of them. We'll go through and see how we handle each thing failing on this node.
Single driver failures
The first one might be the Docker engine. Docker engine has gotten a lot more reliable, but it can still crash. What does Nomad do? Well, when Nomad first starts up, what it does is fingerprint the machine. We will try to detect everything that's running on the machine, such as how much CPU do you have, how much memory do you have, can I connect to a Docker daemon? What it does is it gathers all of this information, and it sends it up to the servers. So when you schedule a job, it'll only place it on a machine that can run your workload. So if you have two classes of machines, one that's running Docker and the other that isn't and you ask for a Docker container, it should only send that workload to the right machine.
When a Docker daemon fails, or deadlocks, or you're going into maintenance for it, what we do is during runtime we're constantly health checking the Docker engine as well. So if we detect that it's not functioning properly, we'll actually set its state to being unhealthy. And when it's unhealthy, we'll stop sending workloads to that machine. So the operator can quarantine it, fix the problem, and reintroduce the node. So your application shouldn't get bitten by a single Docker engine being down.
That's how we handle single driver failures, by both fingerprinting to make sure that they're there, and health checking to remedy. And you can easily hook into this API and then automate recovery. So if you detect Docker engine's down, you can restart the service.
With Consul, when it goes down ... we'll see what we do. In the job file, you can define a service. This service says, I want you, when you run this, to register it with Consul so that other people can discover and then use my service. So here we say, I want you to name this Redis cache, and I want you to health check that it's alive every 10 seconds using the check we've defined and advertise the port. What's interesting here is the port isn't a number. It just says "port = redis." Well, that's because Nomad is dynamically picking a port for you. So you don't have to think about, "Oh, I need to be on port 6200." Nomad will just pick it for you and pass it to you as environment variables, and make it available to Consul so you can register it, so you don't have a static port selection problem. In the rare cases where you need that, you can do it, though.
If a Consul agent fails, Nomad starts actively reconciling with Consul. We will detect when the Consul agent comes up again and healthy, and we will detect any drift from the desired state and what Consul is saying, and we'll de-register stale services and register the desired ones for you. So if Consul goes down, as soon as it comes back up, we'll reconcile state immediately and that'll get updated in the catalog, so you don't have stale information in your cluster.
Now if Vault goes down, what's the impact to your job? Well, in the Nomad job file, we try to make it very easy for users to consume secrets without having to think about the overhead of renewing your lease with Vault. So those of you who aren't familiar with Vault, Vault is a secret management engine. And the way it works is you define policies that describe what data you can access in the system. Here in this stanza [15:23], this job is saying as the API, I need access to talk to the payment API and our user database. And what Nomad will do on your behalf is go get you a Vault token that has those permissions, and Vault tokens generally have a time to live. You need to renew that token, and after it expires, you can no longer use it. So we handle the renewal for you, so your application can just pretend it can access that forever.
But if Vault goes down, you might not be able to renew the token. What we do is, generally tokens have leases on the order of magnitude of a day to three days. If Vault goes down, hopefully your operators have enough time to bring it up within that period. But what we do is we start renewing the token at half its duration. So if you have 24 hours, we renew at 12-hour mark, 6-hour mark, 3-hour mark, so on, until we can successfully renew it for you. So you're isolated, in the meantime, from Vault being down.
But if Vault doesn't come up in time for us to renew that token for you, what we will do is in the background, we'll generate a new token as soon as Vault is available, and we can then send you a signal if your application supports live reloading, or we'll restart your task, setting the correct Vault token for you, so your application is totally isolated from Vault failures.
Nomad agent failures
Then we also model the Nomad agent itself failing. It's an interesting thing to model, but we do. What we do is we have a distributed heartbeat mechanism. What that means is Nomad agents will send a heartbeat to the Nomad servers periodically and say, "I'm alive, I'm alive." And if the servers ever detect that heartbeat not being made in time, it assumes the server has failed or there's a network partition between the two. As soon as the servers detect that, it'll scan all the workload that is on that Nomad agent and immediately reschedule them onto other nodes that are healthy in your cluster, so your application doesn't have extended downtime.
We also model—from the client node perspective—failures of Nomad servers. That's the control plane we talked about, usually three or five servers. Nomad constantly, as I mentioned, does heartbeats to the servers, and it uses the same RPC mechanism to see what jobs it should be running or stopping. So if one of those RPC connections fail, we can actually fail over to another healthy server. And the way we do that is, on the response of these heartbeats, Nomad will periodically send down the set of healthy servers. So the original set you configured the Nomad clients to talk to might change over time as you update Nomad. But you'll always get the addresses back of the current set of healthy servers. So if the one Nomad fails, it can immediately fail over to a healthy server and resume communication.
That works in the single server or a smaller set of servers failing. But if your entire control plane goes down, either through operator error or something like that, Nomad can't do much because there are no servers to talk to. But what we'll do is continue to run the workload of your applications. We don't ever stop them because Nomad as a control plane went down, so you don't get impacted. But we'll start initiating the discovery loop so that we can re-detect the Nomad servers no matter what their addresses are. And the way we do that is we scan the Consul catalog to detect any Nomad servers in our region. So as soon as the operator fixes the problem, Nomad will automatically register itself on Consul, and the clients can rediscover and pick up the workload that was running. That's the various failure modes on the node.
» Application failure modes
Application failure modes, what I mean by this is the failure of the things you've submitted to Nomad. And there's really two categories of these failures. There's setup failures and runtime failures. When you submit a job, you might need to download some config from repos. You might need to pull a Docker image. You might need to talk to Vault to get a token. And runtime, once we've done all that, we start your application and it's running for some period of time, and then it might exit with a non-zero code. And what's interesting about this is, unlike the other failure modes, is Nomad really doesn't have any information to make smart decisions with. Either it succeeded or it failed. We don't know whether, if we try again, will it work or will it fail? There's no information for us. Maybe your artifact registry was down momentarily and if we try again in 10 seconds, it'll work again. Or maybe what you've specified is never going to work.
What we can do is we can provide you—as an operator—tools to model how aggressively you want to retry or give up. We do this in two ways. You have local restarts, which happen on the node that has been running this work. And then we have rescheduling, which will fail it locally, tell the scheduler about it, and the scheduler can then replace the workload on a different machine.
Local restarts look something like this [21:23]. Here we define a restart stanza, and we say, "I want you to try to restart this if it fails at least twice. And between each restart, delay 15 seconds. But if you've failed twice in five minutes, give up, because I don't expect it to work." And so when you give up, this marks the allocation as failed, and then it kicks into the rescheduling stanza. So the server gets notified that it failed locally, and then we consider rescheduling to place it onto another node.
The rescheduling stanza looks fairly similar to the restart stanza. You can say, "I'm setting an interval of 30 minutes. And between each failure of allocations, I want you to wait 30 seconds, as a start, to replace it onto a different node. If it keeps failing, I want you to do an exponential back-off. And keep exponentially backing off until you hit maybe a max delay, because I don't want you to be waiting 24 hours and then 48 hours between restarts. Unlimited means, for services, I want this running, so keep restarting because it might be an upstream of mine that has failed. So as soon as that's up, I want Nomad to restart me and keep going.
The various delay functions the rescheduling supports is constant, exponential, and Fibonacci. Constant is very clear. We put a constant delay between every restart. Exponential, many of you probably know. You just double every time. Fibonacci is a bit unique, and it's actually very nice for services. It has the property of adding up the two most recent, and that's the score you take next. And so the property that's nice about this for services is in the beginning you get very fast retrying. By the time exponential is doing a 32-minute back-off, you're only doing six-minute back-off. But then, as the failures keep compounding, the delay starts growing much faster. So in the beginning, you try to retry very fast to get your service up. But the more retries you do, the less your expectation is that the next one will work, so then you start growing your back-offs. So that's Fibonacci.
When we do these reschedulings, what we do is we put an automatic anti-affinity to avoid placing you on the same node that your application just failed on. Because it might actually be a node problem, maybe the disk is failing, or some other attribute of the node is problematic. So we try to push you around and we also have this max delay so you can always have a bounded understanding of how long it will take for your application to retry.
The last point is a little nuanced but imagine my application failed because the database was down, so it kept retrying, retrying, and eventually the delay was 30 minutes. But then I stay up for a day or two before I fail. I don't want the next restart to take one hour, because the last one was 30 minutes. So if Nomad detects the last placement has stayed up successfully for a duration longer than it was failed, it will reset you back to the original delay.
So that's the application failure mode. So we covered control plane, node, and application. But there's actually one more failure mode I'd like to talk about, and that's you as the developer.
» The developer
Developers introduce failures through the lifecycle of the applications, right? We first set up and download the artifacts, then we start running them, we saw how Nomad deals with that, but eventually you've been coding in the background and you have a new version of your API that you want to submit to the job. Well, all sorts of things can break when you submit this job, so we try to build tools to protect you from potentially doing an update that takes down your entire service.
To do that we have this concept of deployments in Nomad. They allow the operator to choose the safety level they want. You can either do a rolling upgrade where we will take down a set of instances and place new ones at the same time and only continue as they're becoming healthy. You can also do canaries, so if I think it's a high-risk upgrade, I can have my tenant web servers and I'll just boot up a few extras alongside those, and I can maybe route traffic differently to them. So I only send 0.1% of my traffic to my canaries, and I can check telemetry, logs, and make sure it's up. Only once I'm very sure it's gonna work, I can promote the canaries and it'll start doing a rolling upgrade.
If you want to avoid all of that and you want to do traditional style, you can also set your canaries to be equal to the old count. So you can bring ten of the old servers up and ten of the new ones at the same time, and only when you're certain, you can then flop them. So you don't have any downtime as you're ramping up the new set.
We also do something like Terraform plan for operators. This is fairly unique to Nomad. So with Nomad you can do a Nomad plan, and I'll show you the output of what the scheduler would do if you ran this. I'm actually going to do a live demo to show you these features, so wish me luck.
» Demo: The Nomad plan command
This is always a good idea. So in this tab, I have Nomad running up top and Consul on the bottom. That doesn't matter, and what the set-up is, I have a single job running on the cluster already. It's Fabio, which is a load balancer that natively integrates with Consul. And so I will be submitting a web application to Nomad, and we're automatically registering Consul and then we'll be updating the web application and see the canaries as a feature.
So, I'll open up the demo job I have here. So what's important is I have count three, can everyone see that by the way, or should I make it bigger? A little bigger? Alright. So here we have count three, I've just compiled a very simple web service. It almost looks identical to what James Nugent showed, but it just shows some static content. I'm telling Nomad to give me a single static, dynamic port, and to register that in Consul. So Consul will be health checking to make sure my web service is actually up.
I'm going to close this and run the demo. What this output shows is we've created three and they've started running. I can run Nomad UI and give the job, and this'll actually pop up our UI directly to the job and we can come here and see Nomad has placed these three allocations. I have the Fabio routing table up, and we can see I have three backends now and they all have dynamic ports. But it doesn't really matter because Consul has registered them all in its catalog so Fabio can pick them up and route traffic to the correct address. This is the port of the load balancer, so if I refresh it we can see if live demos are terrifying. Which is truth.
Next I'm wanna actually update this job because this demo's going fairly well so far, so maybe they aren't terrifying. I want to change my demo from V1 to V2. Which has some slightly different content. But now let's actually take a look at this update stanza. So this is what configures deployments and allows the operator to choose how they want that to get rolled out to the cluster. So we'll talk about what each field does.
Max parallel of one says when you're updating this job via a rolling update, I only ever want you to take one down, and replace it with the new version, one at a time, and make sure the new version is up before you move on. Up and healthy, which is important. The healthy time should be at least five seconds. So I want you to keep checking if it's healthy and if it's been healthy for five seconds, then you can move on to the next one. This value is really low for demo purposes, you wouldn't set five seconds normally.
I want you to do a canary whenever you detect a change to the job file that would cause things to go down. So before we run this now, let's actually do a Nomad plan. So the Nomad plan here says "I've detected that you've changed the binary from HashiDays V1 to HashiDays V2." And that will force Nomad to take down the old version and create a new one. But because you've set the canary field, it's going to only watch one canary first.
So if I do a Nomad run now, we can see it only launched one new deployment and it hasn't touched the old instances. If we come here again, we see the UI shows that we now have an active deployment happening. It's launched one canary and it's already detected the canary as healthy. But because of the fact that it's a canary, it's waiting for us to promote it. So if we come to our load balancer now we can refresh, and luckily the new text showed up the first time, "HashiDays, Welcome to HashiDays EU." But as we keep refreshing we see most of the time I get the old data up, because our route table now has four instances and only one of these is the canary.
So most of the time, three out of four, I should get the old data. So what you would be doing at this point, if it was a real application, is you would be checking your logs and telemetry and making sure that canary is behaving as desired. But once you detect that's true, you can do a Nomad promote, you essentially promote the canary and so when we promote it, if we come back here and we start refreshing, we'll see that slowly we've quickly converged to Ollie saying "Welcome to HashiDays EU." And that's because the deployment has finished, we've placed three healthy versions, and now if we come look at it, all of them have been rescheduled to be running version one.
We also have that notion of job versions and Nomad tracks what has changed between them. So Nomad knows the first job you submitted was V1 and the second job you submitted was V2. And so things looked good when we canaried, but maybe as an operator once we roll out we, think it's bad. Nomad also supports reverting back to old instances. So we could quickly say "Nomad revert to version zero."
So that's the set of failure modes I wanted to talk to about. So thank you very much. I hope you enjoyed.