Learn how multi-cluster deployments with Nomad Enterprise 0.12 enable organizations to deploy applications seamlessly to federated Nomad clusters with configurable rollout and rollback strategies.
Hi, folks. My name is Tim Gross. I'm an engineer on the Nomad team.
Today I'm going to talk to you about Nomad's multi-cluster deployment. It's a new feature that we shipped in Nomad Enterprise 0.12.
But to set the scene, I want to talk quite a bit about the scalability and fault tolerance in Nomad and some strategies for scaling up your Nomad deployments. So if you're not a Nomad Enterprise customer, there should be plenty here for you, too.
Say you want to deploy your app around the world. Maybe your users are geographically distributed, and you want to provide an equally fast user experience for all of them. Or maybe you have EU customers who enjoy strong privacy protections, and you need data sovereignty.
Or maybe you're a large enterprise with both cloud and on-prem datacenters. Or you've just acquired another organization, and now you find yourself in a multi-cloud situation.
No matter what the reason, if you're going to scale Nomad around the world, we're going to need to dive a little bit into Nomad's architecture.
Every Nomad cluster is divided into servers and clients. Clients are where you run your workload. Those might be Docker or Podman containers, JVM applications,
raw_exec processes, KVM virtual machines, whatever Nomad task driver you want to run.
A cluster can have many clients. There are clusters with thousands of clients.
The servers are the control plane, though, the brains of the cluster, and this is what most of this talk is going to be about.
A production cluster of Nomad always has 3, 5, or 7 voting servers, and servers have to agree what jobs they're going to run on clients. They come to an agreement on what that is with an algorithm called Raft.
It's also used by products like etcd, which is the data store used in Kubernetes.
Why no more than 7 voting servers? Why not just scale out the servers as much as we want? It's important to note that in Raft, all consensus has to be serialized through the leader.
Nomad does a bunch of clever optimistic concurrency tricks to make scheduling decisions really fast despite that. But at the end of the day, in order to persist a decision in the Nomad state store, the leader has to replicate the Raft log to a quorum of followers and receive their acknowledgments before it commits the entry.
Raft has some implications for scaling out Nomad in other ways.
Every follower receives every log entry. The total storage and bandwidth requirements are going to grow linearly with the number of servers in your cluster.
When you're tuning the timeouts for Raft, the time it takes to commit to a quorum of followers, what we call the "broadcast time," needs to be well under the election timeout, which in turn needs to be well under the expected mean time between failures, and those failures can be either the failure of a server node or a network partition.
Otherwise, Raft is going to lose quorum, and it won't be able to make forward progress on decisions. But as you add more server nodes, the likelihood that any is going to experience an outlier latency and break that equation grows linearly.
In practice, what this means is that Nomad servers in a Raft cluster need to be pretty close together, within about 10ms. To get that kind of latency, all your cluster is going to have to be in the same physical datacenter.
Larger cloud providers, AWS, Azure, have built groups of datacenters close together geographically, and they've networked them together in such a way that you can treat them almost as if they were all in one big datacenter.
The cloud providers refer to those as regions, and so a Nomad cluster is referred to as a region. You'll hear me use "cluster" and "region" interchangeably throughout the rest of the talk.
If you're going to deploy to multiple regions, there are a bunch of different strategies you can have, and each strategy has different trade-offs between scalability and overhead.
If you're a small organization—you're just getting started, you're a startup—and you're planning on growing a lot in the future, start simple. Start with some of the first ones I talk about, and you can always transition to a more complex deployment later. Nomad is pretty good about not making you start over from scratch.
The first approach you can take is to centralize your servers together in 1 region and then geographically distribute the clients. Unlike the Nomad servers, the clients don't participate in the Raft algorithm, so they don't need to be that close to the servers. If they're a half-second away, no big deal.
One use case we've seen for this is in IoT applications, so distributed across farms or industrial sites. Nomad clients at each site can run applications that communicate with the sensors in their local network, but the deployment decisions about those applications are all controlled from the central region of servers.
A variant on that strategy is to take advantage of Nomad Enterprise's read scalability feature and to distribute non-voting servers. So long as you've got those non-voting servers so that they're not running schedulers, they can act as a local proxy to clients in the network. Clients register with the non-voting server local, and then those servers will forward their RPCs to the centralized leader.
The problem, of course, with a central region of servers is it becomes a single point of failure. If that region loses power or it loses connectivity, you've lost control over all the clients.
The way to avoid a single point of failure is to run isolated regions: fully independent Nomad clusters with 3 to 7 voting servers in each one. The regions aren't in the same Raft cluster. They're too far apart.
The speed of light is too slow to get between Bangalore and Amsterdam in less than 10ms. But that's OK, because what it does is it provides fault isolation. If something goes wrong in 1 region, you don't need to worry about it cascading to your other regions.
For isolated regions, though, you start to run into some administrative overhead. You need to deploy a secure connection for each region, MTLS or a VPN. You're going to need to deploy that for your CI/CD systems as well. You're going to need to have separate access control policies and ACL tokens.
If you're running a lot of clusters, it can start to feel like you're fumbling with a ring of keys. So we let you loosely join regions together with Federation.
In Federation, you connect to one region, but it forwards your commands to another region, and we don't join the regions together in Raft.
So you say to your Amsterdam cluster, "I have this job to run in Tokyo," and Amsterdam will pass that message along to Tokyo.
You only need to have connectivity to 1 region at a time, and you configure your command line with only one set of ACL tokens.
You get the convenience of communication between regions. If the region fails, that fault is still isolated to that region.
But if your goal of federating the cluster is not just to reduce administrative overhead, but to run the same application across regions, you might start finding that it's a little hard to keep your regions in sync.
Suppose you have 3 regions running Version 1 of your application, and now you want to run Version 2. You'll have to deploy to each region individually or have your CI/CD system do it.
But what if your regions aren't all the same size? Maybe they expect different levels of load, or one is a staging region. Now you need to deploy your job with a different count in each region.
Or what if you want to deploy to 1 canary region first and wait for it to be healthy, and only then deploy to the remaining regions?
While you're in the middle of that deployment, what happens if you're deploying to the remaining regions and one of them fails to deploy? Do you just roll back that region? Do you roll back all those regions?
There are a lot of decisions to make here.
Those are the kinds of problems that organizations with large multi-region Nomad deployments were bringing to us, and that's why we built the new multi-cluster deployments feature for Nomad Enterprise 0.12.
With the new multi-region stanza, you run a single job, and you deploy it across multiple regions. It gives you configurable rollout and rollback strategies, and you can even template the jobs so that each region gets a different number of instances of the job.
We knew when we designed this feature that Nomad already had the right design for Federation.
We wanted to keep clusters fault-isolated. We wanted to be able to have very high-performing scheduling decisions. We definitely did not want to impact how Nomad worked for your single-region use cases.
We wanted to keep coordination low. We didn't want to make some kind of cluster of clusters that would make reasoning about correctness difficult. Raft, again, doesn't scale to this kind of a thing.
We wanted to do so without introducing any new operational overhead for you as the operator. We didn't want to have an external coordinator, unlike Kubernetes.
Multi-region deployments leverage mostly 2 existing features of Nomad. The first is RPC forwarding. Any message sent within a Nomad cluster might be forwarded to 1 or more other servers.
The most common case of this is what we talked about, forwarding to leader, because that's where we originate all our Raft entries. When you connect to one of the followers, it's always going to forward that RPC to the leader.
Then there are some commands like
nomad alloc exec or
nomad alloc logs that forward from the server to a specific client agent where the workload is running.
In federated clusters, we have a region flag that tells the server to forward the request to the servers in a different region. For multi-region deployments, we just reuse the same mechanisms so that we have region, region, RPC, without having to create some kind of new cross-region discovery process.
This also helps reduce the overhead of managing access control. In a federated cluster, one region acts as the authoritative region for ACLs (access control lists). ACLs that are written to that authoritative region are replicated asynchronously to all the other regions.
That lets you create Nomad API tokens that are scoped only to 1 region or scoped globally, and we'll need those global tokens for multi-region deployments.
The second component we need for a multi-region deployment is the deployment watcher. This is a routine that runs on every Nomad leader that monitors deployment states.
In a single-region deployment, we have this state machine. Single-region deployments start in a running state, and they can transition to either successful, failed, or canceled. While the deployment is running, clients will update the servers with the health of their allocations: "This one failed. This one is complete."
That's what triggers the leader's deployment watcher to schedule additional allocations, to continue to roll out, or to fail the deployment.
This is an example of a single-region deployment update stanza. We have max parallels set to 1, so we're deploying 1 allocation at a time. We have a healthy deadline of 1 minute and auto-reverse set to true. If any allocation takes more than 1 minute to become healthy, we revert the entire deployment.
In our multi-region or multi-cluster deployments, we combine RPC forwarding and the deployment watcher to come to decisions without having a full Raft consensus between regions.
Here's how that works.
In a multi-region deployment, we extend that state machine of the deployment watcher so that every deployment, instead of starting at a running state, starts in a pending state. Nothing happens until it's kicked off.
We're going to come back to what that means in a minute, but the important thing to note in this diagram on the screen is that at this point, no allocations have started.
You as the operator send your Nomad job run to 1 region, and that region is going to be responsible for registering the job with all the other regions.
It's going to do all the usual permissions checking, job spec validation locally before we kick things off, just because we don't want to send a bunch of messages around if we don't have to.
Each region is going to get a slightly different version of the job. Fields like region or datacenter get interpolated with the values from the multi-region stanza.
That receiving region is going to query all the other regions for their existing version of the job, and it's going to get the version and the current check index. We're going to use the highest version of the job that all the regions know about, and we're going to forward the interpolated copy of the job to each region using that region's check index.
What that does is it makes sure that when we make an update, there wasn't a concurrent update from, say, an autoscaler or another user in that region.
We're going to wait until we've submitted the job to every region, and then we're going to go around and check that every region has a deployment in the pending state. It's kind of like a 2-phase commit. We've registered, and then we've checked.
That tells us that at that point, every region has validated the job—it's valid in that particular region—and it's written that job to its own Raft.
What happens next depends on the multi-region configuration.
Let's look at a minimal configuration, and we'll look at more complex ones later. In this one, we have max parallel 1, so 1 region at a time, and 4 regions in this order: Tokyo, Bangalore, Amsterdam, Sao Paulo.
Our receiving region is going to kick off the deployment with a deployment runner RPC to max parallel regions in the order listed. In our case, we're going to have 1 RPC to Tokyo. And once all those RPCs have been sent, we're going to return the success message to you as the operator.
At this point, we've made 2 round trips to every region. But because every region is doing the scheduling work, which is kind of the expensive computational work, in parallel, you can expect that a successful multi-region submission is going to return almost as quickly as a single-region job, just basically a little bit of added time for each extra region.
We've submitted our job. Now what happens?
At this point, the deployment watcher in each region that's been kicked off is going to be in a running state.
The deployment watcher is going to go through the exact same process it would during a single-region deployment. It's getting allocation health updates from the clients. The difference is what happens when the deployment updates its total status.
Once the deployment has placed all its allocations and they're all healthy, it's going to query the status for the other regions. We want to make sure that in the meantime, while we're running that deployment, the other regions haven't failed. It's basically a safety check.
Then the running region is going to find its place within that order list, and it's going to descend a deployment run RPC to 1 or more regions that come after it, such that, at most, max parallel regions are now in the running state.
In this case, we're going to say the next one in the list is Bangalore, and we're going to set Bangalore to running.
Once our region has sent that RPC, it's going to change itself into the blocked state. That means that region has completed its version of the deployment, but it's not going to consider itself a successful deployment until it's been unblocked. That needs to happen before the deployment progress deadline, which by default is 10 minutes.
We'll come back to what that means in a sec.
The next region to run is Amsterdam, and then onto the last region, Sao Paulo.
The last region in our list of regions has a special role, which is to identify that our multi-region deployment is now complete. Instead of blocking, it's going to enter the unblocking state.
Again, there's no coordination that just happened for that. It just found its place in the list and said, "Oh, I'm last, so I'm the one that needs to unblock."
But it wasn't like a leader election between the regions that had to happen. The unblocking region will send a deployment unblock RPC to all the regions. It's going to transition them into the successful state and then will transition itself into the successful state.
That's it. At that point, our multi-region deployment is complete.
Now let's look at what happens if we didn't succeed.
If at any point our region's deployment fails, if it queries the status of another region and finds out that that failed, or if it sends a deployment run or a deployment unblock RPC to a region and that fails, the deployment watcher will check its on-failure configuration.
That's going to tell it whether it should send deployment fail RPCs to the other regions and which ones and whether it should roll itself back.
In this way, the failure state can be made to cascade throughout the other regions, without having to guarantee that all the regions are reachable.
We're going to walk through a few more examples of failure modes in a minute.
Likewise, if a deployment is ongoing and an operator submits a different version of the job, we want to cancel the current deployment. It's just like a single-region job.
Whenever one of the deployment RPCs finds another region is on a newer version of the deployment now, meaning the deployment status we expect to see there for that deployment ID is canceled, it's going to cancel its own deployment.
Just like in a failure, the region is going to send a deployment cancel RPC to all the other regions and ensure that all those regions have gotten the message.
That was a bit of a look into the distributed systems logic for multi-region deployments, but keep in mind that for you as the operator, the experience is still a Nomad job run. "I have a job. Please run it everywhere," just as it is when you run in a single region.### When Things Are More Complicated
Right about now, you might be thinking, "But how does that algorithm work if max parallel isn't 1?"
Instead of kicking off 1 region, we're going to kick off 2 at a time.
We'll kick off by sending the deployment run to Tokyo and Bangalore. Whichever region finishes first is going to check the status of all the other regions, and it's going to know that it can send RPCs to, at most, 1 more region.
Here, Bangalore finished first. It sees Tokyo is still running. That means it can hand off to Amsterdam.
But if Tokyo were to finish concurrently with Bangalore, it would come to the same conclusion and try to hand it off to Amsterdam, too, and that would be a race condition.
It could result as being down to only 1 running region, when our max parallel was 2.
Except that Raft in the Amsterdam region will have a total order of the state changes.
Whichever region loses that race is going to be told, "You know what? I'm already running." So it can go and kick off the next region instead. That way, regardless of concurrent deployment, at any point in time, we will be very close to, at most, max parallel regions running their deployments.
We've seen the workflow. Now let's look a little bit more at a slightly more complex configuration.
In this version of the job, we have 2 groups. We have a web group, which has a count of 0, and a load balancer, which has a count of 1.
Groups with a count of 0 are going to be interpolated with a count for each region. In this case, Amsterdam has 5 web allocations, and Tokyo will have a count of 10. But the load balancer has a specific count of 1, a signal to us that we're not going to interpolate it.
We will end up with 1 load balancer in each region. This is also how we prevent multi-region deployments from breaking the Nomad Autoscaler. The Autoscaler is setting a specific count, so updating the multi-region job won't interfere with it.
We can also see that we have different datacenters set for each region and a meta stanza. Job specs have a meta stanza task group and job level. This is already existing.
For multi-region, we add the regional-level meta, and meta stanzas are applied in order. The task meta overlays the group meta. The group meta overlays the per-region meta, and the per-region meta overlays the job-level meta.
Failure is always an option. Let's look in a bit more detail at rollbacks.
Earlier, we saw that when a region detects failure, it's going to send the deployment fail RPC to the other regions. But how does it know which ones to send it to?
The multi-region strategy block has an on-failure field. The options here are:
Fail all, which is "send deployment fail RPC to all regions.” That's kind of what we saw previously.
Fail local, which is "don't send the deployment fail RPC at all. Just fail that 1 region."
The default behavior, which is to send the deployment fail RPCs to all the remaining regions, meaning ones that fall after it in the list. This is probably the most complex behavior, but it also gives you as the operator the most flexibility.
Say 1 region is having problems, but you want to keep the already-deployed regions serving traffic. Maybe the database in the region is having a problem. We're head of resources in the region and we need to scale out. Or we forgot to update the config for that new feature we're rolling out again.
A lot of times, if you have 1 region that's in a good state and a region fails after it, that failed state is transient. So we want to make sure that we're not having to roll back the entire thing automatically, in the default case.
In this example, Tokyo has already run, and it's in a blocked state. It's waiting for that deployment unblock RPC.
Amsterdam and Sao Palo are pending, so they're waiting for the deployment run RPC.
But Bangalore is unfortunately having an outage. In the default behavior, Bangalore is going to send a deployment fail to Amsterdam and Sao Paulo, while Tokyo remains blocked.
It's going to stay that way until the progress deadline expires, which is 10 minutes by default, at which point that is going to fail.
Now you might be saying to yourself, "What good is that?" A blocked region can be manually unblocked by you via the
nomad deployment unblock command. That marks that blocked region as successful, so it can continue to run the workloads.
This gives you a lever to operate your application in a degraded state until you've recovered.
What would happen if the failed region couldn't send a deployment fail RPC, if Bangalore were to lose communication, its network was completely cut, or it lost power completely?
If it loses communications or can't make RPCs to those other regions, those other regions will eventually time out. Their deployment progress deadline, by default 10 minutes, will expire, and then they'll mark themselves as failed.
What actually happens when a region gets marked as failed? The behavior within a region at this point is entirely controlled by the jobs update stanza, just like a single-region deployment.
In our previous example, we saw that each region will eventually get marked failed. In this case, the update stanza has auto-revert set to true. When each region gets marked as failed, it's going to begin to roll back the deployment, completely independently of the other regions.
This lets us coordinate a successful rollout between regions, but we're going to fail safe and assume that we can't coordinate in the case where a single region's deployment has failed.
We can use this interaction with the update stanza to build more complex failure behaviors.
The blocked region in this case is running Version 2 of our job, and the pending regions are still running Version 1 of our job. The failed region could be in a lot of different states.
Maybe no Version 2 allocs were ever placed, so we're all running on Version 1 still. Or maybe every Version 1 alloc has been replaced already, but then the database fell over. Or anywhere in between.
With auto-revert, the failed regions will all make sure they're on the previous version of the job. For regions that were pending and were marked failed, nothing's going to change. They're already running Version 1.
But the region running the failed deployment is going to try to roll back to the previous version of the job. Notice that that means that each region handles its own rollback, so there's no coordination required during the case of failure.
That was multi-cluster deployments in Nomad. A lot of what I talked about today came from internal design discussions for this feature that I had with my colleagues, Mahmooda Ali and Chris Baker on the Nomad team. So thanks to them for their contributions to the design.
Thank you for your time today, and enjoy the rest of the conference.