HashiConf 2018 Keynote: Nomad 0.9 - Affinities, Spreading, and Pluggability
Nov 14, 2018
HashiCorp co-founder and CTO discusses the upcoming version of Nomad, which includes affinities, spreading, and a new plugin architecture for task drivers.
Founder & Co-CTO, HashiCorp
I want to spend a little bit of time also talking about Nomad 0.9 and highlight a few of the new features that are coming as well. One of the big focuses for us has been looking at: How do we make the scheduler smarter? What we’ve seen over time, as we get increasingly sophisticated use cases, like CircleCI’s, is people want more control over how Nomad places and schedules and manages their application. So a common ask has been: affinities and spreading.
» Affinities and anti-affinities
The idea behind affinities is: How do you hint to the scheduler that you would prefer to be placed in a certain location? Nomad has always supported the notion of what we call a constraint. A constraint is something that must be satisfied. So if you tell Nomad, “My job must run on Linux,” or, “My job must run on Windows,” there’s no way for Nomad to fudge that. If your app is designed to run on Linux, it’s not going to work on Windows, so the constraint must be satisfied.
An affinity is more like, “It’s a thing I would like.” And a perfect example of this might be: Maybe you’re able to use certain kernel features that you know make your application faster on certain platforms, or there are certain hardware instructions that your app can potentially leverage that make it faster on certain classes of hardware. In these cases, it’s not make or break if you’re not running on that environment, but you’d like to provide the hint such that if that hardware is available or if that OS is available, then all the better, your app can leverage it. What that looks like now with the latest version of Nomad is you can add this affinity block. What you’re hinting to the system is, Here’s a set of things I would prefer to optimize for.
You can optimize across anything you want. Could be operating system, it could be version of application installed, it may be a Docker version, could be hardware properties—really anything that Nomad knows about you can use as something to optimize your job against. In this simple example, we are just optimizing against the kernel version. So we’re keying on the kernel version and saying, “We would prefer to be later than 3.19.” Maybe there are some new features that we’re leveraging in that, and so we’re going to say, “Great, if that’s available, put us there. If not, find some other Linux machine, and still place us there.” It makes it super easy for us to customize where we’re placing the app. And you have this weight property that lets you express a preference of, How much does this matter to you?
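As a sketch, that kernel-version affinity might look like this in a job file (the job and group names here are illustrative):

```hcl
job "web" {
  datacenters = ["dc1"]

  group "app" {
    # Prefer nodes with a kernel newer than 3.19. Unlike a constraint,
    # this is a soft preference: if no node matches, the job still runs.
    affinity {
      attribute = "${attr.kernel.version}"
      operator  = "version"
      value     = "> 3.19"
      weight    = 50 # how strongly to favor matching nodes
    }

    task "server" {
      driver = "docker"
      config {
        image = "example/server:latest"
      }
    }
  }
}
```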
If it’s super important, then great, we’ll try harder and maybe move workloads around to make that possible. If it’s not a huge deal, you can provide a lower weight to it. At the same time, there’s the reverse of that, which is, What if we don’t prefer a certain class of hardware? This is an anti-affinity, and there might be reasons for that. Maybe we know that our app doesn’t do well on spinning disk, and we need an SSD, so we have a strong anti-affinity to spinning disks. Or we know it doesn’t perform well on a particular version, things like that. We can do almost the same thing as affinity, but are just assigning now a negative weight and just saying, “We negatively prefer the same attribute as we would have for an affinity.”
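An anti-affinity is expressed the same way, just with a negative weight. A hypothetical sketch, assuming the cluster operator has tagged nodes with a `disk_type` client meta attribute:

```hcl
group "app" {
  # Steer placements away from nodes tagged with spinning disks.
  # "${meta.disk_type}" is an assumed operator-defined client attribute.
  affinity {
    attribute = "${meta.disk_type}"
    value     = "hdd"
    weight    = -50 # negative weight = anti-affinity
  }
}
```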
The other one that’s super interesting is “spreading.” Typically, what Nomad will do is try to bin-pack applications as densely as possible. So we’re trying to fill up the existing hardware in the fleet and maximize our utilization. We want to get all of our existing hardware to be fully utilized before we start utilizing new hardware. This lets us scale down and get rid of machines that are effectively idle by packing things onto other hardware. This is Nomad’s default behavior, this bin packing. But there are good reasons why we might not want to do that. We might want to specifically spread out our applications on purpose. Typically, it’s for things like availability or modeling our fault-tolerance zones. A perfect example of this is: Suppose we have 2 data centers, Data Center East 1 and East 2. We might not want all of our applications to end up being bin-packed in East 1 so that we have East 1 at 100% utilization and East 2 at 0% utilization.
Instead, we’d like to specify to Nomad, “We want a spread between these.” You can pick any property, much like with affinity, and say, “On this axis, on this dimension, I’d like to spread workload.” We might like to spread our workload 50% East 1, 50% East 2. Then within those regions, Nomad will still bin-pack, so we’ll still make sure we’re packing Data Center 1 and packing Data Center 2, but spreading across that dimension. So this lets you get into either modeling things like availability zones and data centers for fault tolerance, but also if you’re worried about bugs and kernel versions—you might have a spread across different versions of the kernel that you’re running against, so if there’s a bug in one runtime, it doesn’t affect all instances of your application—you can pick different axes to spread your workload across.
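A sketch of a spread block for the two-data-center example (the data center names are illustrative):

```hcl
job "web" {
  datacenters = ["east-1", "east-2"]

  # Spread allocations evenly across the datacenter dimension.
  # Within each target, Nomad still bin-packs as usual.
  spread {
    attribute = "${node.datacenter}"
    weight    = 100

    target "east-1" {
      percent = 50
    }
    target "east-2" {
      percent = 50
    }
  }
}
```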
Another interesting property of Nomad scheduling is that it’s priority-based. Any job you submit to the system, you tell us what’s the priority, zero to 100, and Nomad tries to give preferential access to resources. So higher-priority jobs get scheduled first. If there’s contention on resources, they win out. We try to use this priority as the tiebreaker so that you can give us your business logic of, “Hey, my frontend-facing user traffic is more important to me than my nightly batch processing is.” You can encode that, and the scheduler will respect it. Now, the challenge is, What happens when the cluster is totally busy? You have a cluster that’s 100% full, and now new work shows up. What Nomad does is basically park it in a queue. There’s no capacity available, so put it in a queue, wait until room frees up. Then when there’s room, we’ll go schedule that application for you. This is nice from the perspective of, We can treat Nomad as a job-queuing server; we don’t have to build reconcile logic of if it’s busy, come back, and try again. Nomad will deal with it, but you can end up with a priority inversion. If I have a cluster that’s totally full with low-priority work and a high-priority job shows up, that high-priority job is now sitting in a queue waiting for low-priority work to finish. This is a classic problem known as priority inversion. You can’t have high-priority work wait on low-priority work.
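Priority is set at the top level of the job spec. A minimal sketch (the job names are illustrative; each job would live in its own file):

```hcl
# Priority ranges from 0 to 100; the default is 50.
# Under resource contention, higher-priority jobs win out.
job "frontend" {
  datacenters = ["dc1"]
  priority    = 80
  # ... groups and tasks ...
}
```

A nightly batch job could declare, say, `priority = 20`, encoding the business rule that user-facing traffic matters more.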
The way we’re solving this is by adding preemption in 0.9. When the scheduler detects that, in this configuration, it’s totally full, there’s no additional capacity, but now we have high-priority work that’s arrived, how do we evict low-priority work to make space for the high-priority work? The low-priority work can sit in the queue. The high-priority work is the one that should go execute right away. We’re adding this across the board. The system scheduler in Nomad, which is responsible for running all of the agents that run monitoring, logging, security agents, things like that—the system scheduler that is responsible for that is gaining preemption in 0.9, in the open source, and then the service and batch schedulers are gaining this ability in Enterprise, subsequently.
» Plugin system
Another common piece of feedback we’ve gotten with Nomad is, “How do we make this thing easier to integrate with my ecosystem? Nomad is already relatively unopinionated about how things should work, but how do we make it even more flexible to bring in different ways of thinking about running jobs or different devices, things like that?” So with 0.9, there’s been a big investment in the pluggability of the system. The first 2 things that are showing up are task drivers. Task drivers are how Nomad executes. Docker and LXC and QEMU are examples of task drivers. When you provide Nomad a Docker job, the Docker task driver is the one responsible for executing it. But what if we want to customize that? We want to build our own task driver specific to maybe Node.js, or we want to take the Docker one and modify the way it behaves to fit into our environment. Now, what we can do is support that being a pluggable thing. It no longer has to be compiled into Nomad. You can have a custom driver and do specialized logic within it.
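As a sketch, an external task driver is a binary dropped into the client agent’s plugin directory and referenced from its configuration (the directory path and plugin name here are hypothetical):

```hcl
# Nomad client agent configuration.
# External plugins are loaded from plugin_dir rather than
# being compiled into the Nomad binary.
plugin_dir = "/opt/nomad/plugins"

plugin "my-nodejs-driver" {
  config {
    # driver-specific options go here
  }
}
```

Jobs would then reference the custom driver by name in their task stanza, e.g. `driver = "my-nodejs-driver"`.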
The other common challenge is, how do we manage non-standard devices? So FPGAs, TPUs, GPUs, other specialized hardware. How do we make Nomad aware of it so that it can manage it and schedule against it? Historically, it’s been: You have to make the core product aware. Now we’re adding a whole device plugin interface so that you can plug in additional devices, and Nomad will become aware of that hardware and be able to do scheduling, management, and resource isolation against it. All of this is part of laying the foundation to have even more plugins in future releases. So looking at how do we make this system super pluggable and super extensible. One plugin that we’re super excited about is the NVIDIA GPU plugin. A lot of our use cases, as I mentioned, are around this large-scale batch processing. Increasingly, we have users and customers who want to leverage GPUs to accelerate workloads like machine learning.
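Once a device plugin like the NVIDIA one is installed, a task can request the hardware through its resources. A sketch (the Docker image is illustrative):

```hcl
task "train" {
  driver = "docker"

  config {
    image = "tensorflow/tensorflow:latest-gpu"
  }

  resources {
    # Ask the scheduler for one GPU exposed by the nvidia device plugin.
    device "nvidia/gpu" {
      count = 1
    }
  }
}
```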
If you have a super-parallel scheduler attached to super-parallel hardware like GPUs, this is a perfect match for building things like complex ML pipelines. This is something we’re super excited about that’s landing with the next version of Nomad. As we talk about 0.9, it’s this focus on how do we make the system smarter and richer as we see these different use cases emerging? So there’s a whole host of new features. The 0.9 preview will be available shortly, maybe as early as next week. The generally available release should land as early as early November. Keep your eyes peeled.