Nomad 0.9: Advanced Scheduling Features, Plugins and More
Dec 19, 2018
This talk goes into detail exploring the new features of Nomad 0.9. See how Nomad's capabilities as a simple, but powerful cluster scheduler are advancing.
HashiCorp engineers cover the powerful new controls in Nomad 0.9. The talk covers the following new features:
Affinities and anti-affinities: Tell Nomad that one application should be co-located near another application—or an anti-affinity: that an application should not be co-located with another application.
A new plugin system for task drivers: Support different execution drivers and different hardware types including GPUs, TPUs, and FPGAs.
Specifying spreads: Let Nomad know that you want to spread workloads rather than maximize density on nodes.
Native preemption capability: Evict lower priority jobs to accommodate higher priority jobs.
Software Engineer, HashiCorp, HashiCorp
Nomad engineer, HashiCorp
Michael: I'm here to talk to you about Nomad 0.9 and the new features.
» Nomad is composable
To talk about where we are going, I want to talk about where we are today, and the design principles that Nomad as always adhered to. Nomad has always been composable and what I mean by that is you can use Vault with it to manage your secrets, you can use Terraform to provision the infrastructure, you can use Consul to connect all the services. As well as, Nomad itself can use Consul to self assemble your entire Nomad cluster with zero configuration.
Not only that but Nomad can operate as just one component in your overall platform. We see a lot of people building on top of Nomad. It's like humans may not even be interacting directly with it but instead using a platform that they've built to use Nomad just as the base scheduler beneath it. And so you see that you can hook your CD/CI up to Nomad, you can hook an autoscaler up to it, a dashboard. You can also hook advance batch processing platforms up to it. There's a lot of ways in which you can use Nomad as just a single component in your overall platform.
And it's all pluggable, right? You can use whatever pieces you need. If you don't need Vault, don't use Vault. If you don't need Consul, don't use Consul. Nomad is trying to be as composable as possible and fit your workload.
» Nomad is flexible
And part of that means running on whatever operating system you're on. So we've always supported macOS, Linux and Windows.
And we've always supported a wide range of run time task drivers. Qemu for full virtualization, Docker for containerization, rkt as an alternative containerization platform. As well as directly running Java jars and whatever binaries are executable, you want to give to Nomad.
» Nomad 0.8
So Nomad as of today, the 0.8 release, is composable. It's one component of your platform. It's flexible, will try to run anything, anywhere. But it's static.
What we've shipped to you is all you can use. It's not easily extensible. And if you do want to add to Nomad, that requires you going to GitHub, forking the project, submitting a PR, opening issues, communicating with us directly and then waiting for the next release of Nomad before that feature ever ships.
» Nomad 0.9 is extensible
So we are very excited that in Nomad 0.9, a major focus is making extensible. And to make Nomad extensible, we're starting by creating task drivers and device drivers. They're half driver plugins and device plugins.
» Driver plugins
All of our existing task drivers, Qemu, Docker, rkt, Java, all of them are going to be plugins themselves. The plugins that the communities create, that you create, that you create internally, are going to have the same functionality and API's available to them that are built in plugins too.
And so this should open up a lot of opportunities for all the varieties of runtimes that exist today. Not just other containerization platforms, like Podman and other operating systems containerization like BSD jails, but also alternative platforms like .NET Core on Windows. We would love to see more people building out new workload support on Windows. I'm very excited to hear about Jet.com at this conference and they have built out almost like a plugin before we provided a real plugin architecture to support a new workload on Windows. And so it's very exciting, thank you to them for creating this, and I hope we can help them port that to the new plugin architecture. Cause it's exactly what that is designed for. As well as, Singularity which there is going to be a talk on later today. It's another containerization approach. And it's customized more for HPC workloads which we're very excited to improve support for. So this afternoon look for that Singularity talk as well, cause I believe they're going to demo the first third-party implementation of a plugin.
Along with plugins, we're going to be improving the way you can configure drivers. So, at the top of the screen, you can see the old way that we allowed you to customize some of the behaviors of the built in plugins. It was this very ugly, map from strings to strings. You had to refer to the documentation and typo could really mess things up. So in the plugin world we are going to make plugins stand as first class. plugins define their configuration schema, Nomad makes sure to verify and validate the configuration given to it. And so your plugin will only ever receive valid configurations that it defines and you get a full type system. So you'll notice the Boolean
true was as an actual Boolean. It will support numbers and more complex data types. No more having to stuff all kinds of information into these strings and parsing it back out.
And so the nice thing for plugin authors is that Nomad handles all of the parsing for you. And then when your plugin actually receives the configuration, it can receive it as a well formed object.
Jobs in the plugin world will look exactly the same. They'll be fully backwards compatible. If you don't care about plugins you can ignore everything I've said, your job that runs today will run tomorrow with zero changes. Now behind the scenes, things have changed a little bit. We took this opportunity when implementing plugins to also implement their configurations using the new HCL2 library that if you were around for the Terraform talks they're using as well. Unfortunately we are no enabling all HCL2's features in all of the job files quite yet. But we are very excited to use plugins as an opportunity to start enabling more of the features today. And too, you can expect in future releases that will be enabling a lot more extensible job file creation.
The plugin architecture communicates between the Nomad agent and plugins over gRPC. Because plugins are actually run as external processes to Nomad. And the reason we chose that architecture is, it's something we've done in other HashiCorp projects like Vault and Terraform. Because it really allows you to isolate faults in each system. Nomad can crash, a plugin can crash, everything can recover just fine and your tasks will continue to run unaffected by those faults.
gRPC also gives us a lot of really nice features out of the box, for plugins and plugin authors. It handles bi-directional streaming, cancellation, log shipping, all kinds of great things out of the box. As well as, being cross platform. So while Nomad is and always will be written in Go, if you want to write programs in another language we would love to enable that in the future. And having a gRPC API should make that a much simpler proposition.
» Device plugins
So the other plugins that we're going to enable in Nomad 0.9 are device plugins. And this is an entirely new feature unlike drivers, which we've obviously always supported. We've always supported a basic set of devices. If you've used Nomad at all, you're aware that we support memory, compute, storage and network resources.
But we are very excited to announce that Nomad 0.9 will have Nvidia GPU support, which should enable a wide number of workloads, optimized for GPUs—ML, AI—all to run across your GPU-optimized cluster. And you'll use it just like any other resource in a job file definition. And so in this example you can see it's using an Nvidia GPU, it wants a node that has two GPUs, each GPU should have at least 2GB of memory. And these attributes are definable by the device plugin. And they support a number of units and different ways of scheduling. So it can even take power consumption into account, it can take bandwidth for any network devices that might get implemented into account. It can even, you can even specify the model name of the specific device you want. If there is specific features of that device that you really require a very specific model for.
We are very excited to see what people built with this in the future. While we are shipping Nvidia today, we are hoping to see a lot of community support for other devices in the future, especially as devices, even in the cloud, became more and more common, like FPGAs, HSMs, TensorFlows, TPUs. There's a lot of exciting opportunities for Nomad and this device plugin world.
» The future
So, Nomad 0.9 is extensible. Devices and drivers being the first plugins that we're offering.
But really what I'm the most excited about, as a Nomad engineer, is the plugin architecture is going to enable a lot of the features that many of you have been asking for, for a long time, like storage volumes, networking plugins, login plugins. And we'll be able to support those first class in Nomad. As opposed to many people using Docker specific plugins and therefore having to use the Docker run time. These sorts of things we want the support in Nomad core, through our plugin architecture and we don't want to make you wait for us to release a new version of Nomad to use new capabilities of devices, drivers, storage volumes, and networking plugins.
All of these things should be able to have a life of their own, maintained by a community. But obviously we'll still be maintaining all of the drivers that we maintain today and all the devices that we're launching with the Nvidia GPUs.
So I'm very excited about the future and all of the advanced plugin implementations that people create, but that's not all that Nomad 0.9 has. And I would like to introduce Nomad's engineering manager Preetha to talk about some of the advanced scheduling features that Nomad 0.9 is bringing.
» Advanced scheduling
Preetha: Alright, so we've been working on making quite a few enhancements to the Nomad scheduler and I'm very excited to share all of that with you all today.
So just a quick overview of the Nomad scheduler itself, it's a very critical component of Nomad. It's what's responsible for assigning tasks to the client machines. It uses a bin packing algorithm, which optimizes for resource utilization and it respects constraints. So whatever constraints, you as the job operator, puts on the job, such as like these are the resources I want, or here's a machine that I want this thing to go to. It tries to respect all of that.
So if you're not familiar with bin packing, it's basically a placement algorithm that optimizes for resource utilization by minimizing the number of machines that are being used. A lot of people use Tetris as an example when they are explaining bin packing, that's fine more power to them. I happen to think that computer science concepts are actually all around us, even in the daily world. So I like to use this pantry example. Just imagine a box in your pantry and you are tying to fill it up with different items, you know breakfast, cereal boxes, rice packets, whatever you want. There's a way you can arrange all that in that box such that you are able to fit everything in. That's basically bin packing, except that instead of a pantry box, you've got a machine that you know x amount of CPU, disk and RAM, and you're trying to fit running tasks in it.
Bin packing's great. One of the nice things about bin packing is that it just works. As an operator you don't need to think about it, you just submit a job and Nomad figures out where to place it. But there are some issues with bin packing though, so the thing that I want to really get into is…
» Problem: Failure tolerance
So, lets take a simple example here. We've got a business critical web app, you want to run it on two different data centers, us-east-1 and us-west-1 and you want to run six instances of it. Now, Nomad's bin packing algorithm, prior to 0.9, is purely looking at resource utilization. So it could end up placing it in a way that shown here in this diagram where five instances end up on us-east-1 and one instance ends up on us-west-1.
There's a problem here. What happens if you have a data-center level failure. Your entire us-east-1 data centers out so that means you can't route any traffic to any of those machines now you've got a single instance that's having to handle all your traffic and it might fail over or topple over because of a thundering herd of requests.
What we ideally like to see though is a distribution that looks something like this. Imagine if Nomad placed three instances on us-east-1 and three on us-west-1. With this you've actually improved your failure tolerance, right? Because if us-east-1 goes down you still have 50% of instances left and so you're fine.
» Solution: Spread
So, the way we're solving this in 0.9 is with a new concept and a new stanza called
spread. With spread, operators can specify target percentages across any node attribute or metadata. Going back to my same example of placing that business-critical web application, all you'll need to do is make a simple change in your job spec file where you add a new spread stanza by default it's gonna do an even spread so we're not adding anything extra if you see in this example, the only thing I'm specifying is what is the attribute I want to spread on, it's the data center. So, that's going to lead to even spread so if there's 10 instances Nomad will place five in us-east-1 and five in us-west-1.
You can also get more complicated with spread so in this example, we, we're actually specifying target percentages, so, in this one I'm saying I want 70% of my allocations or instances to end up in us-east-1 and 30% in us-west-1. You might be curious why would I want to do this, like I should probably want even spread right? But the ultimate used case for something like this is a lot of times you don't have homogeneous capacity across multiple data centers. One of your data centers might be the primary you just have more compute capacity there and maybe your load balancers also set up route in a 70/30 way so in that case it doesn't make sense to do an even spread—it makes more sense to have additional application instances in the data center that's receiving more of your traffic. So this is a use-case for spread where you specify target percentages.
Another thing we could do in Nomad 0.9 is have multiple spreads attributes so here in this example, I'm trying to spread eight instances across two different attributes, data center and rack. So imagine that I had two racks in each of my data centers. Nomad is gonna spread four in each data center and then within each data center, it's gonna spread across the number of racks I have. So, it's two per rack. Again, with this, doing something like this we've just increased the amount of failure tolerance you have because within a data center, say if rack 1 goes out for whatever reason you still have capacity left in us-east-1 and the advantage there is like a rack level outage doesn't go into a data center level outage because of spread.
To summarize what we're hoping spread will provide to operators is to increase your failure tolerance across any domain and it's really limited to your imagination in terms of how you model nodes in your data center. The richer metadata you can have like let's say you start modeling power, power strip information, rack information, data center information, even like building information, the more you can put there like the more you can increase your failure tolerance.
So that was spread, that's the one big thing that's coming in 09. The next problem I want to get into is…
» Problem: Placement preference
So what do I mean by, "placement preference," here? So in general, like I mentioned with bin packing you don't have to do much, you just specify a job and resources and Nomad will figure out where to place it, but sometimes that's not really ideal. Sometimes you do want to nudge the scheduler, "Hey, maybe try this, like, maybe put it here." You know I have a certain preference and there are many use cases for that so one of them is machine learning workloads. Machine learning algorithms run faster on GPUs but there also generations of GPUs and you know in your environment you might have multiple generations and the newest generation is probably where it's gonna be the fastest. You could have old hardware lying around that's spinning disks you don't necessarily want to throw it away. You want to put workloads that don't have too much I/O, they don't need fast I/O they can run on spinning disks, whereas, things that actually need fast I/O you should still put on SSDs.
Encryption is another great example, so with encryption there are newer chips that are available where there's a hardware level instruction set for doing encryption and there are encryption libraries that work really fast when it's run on that hardware. And so again like four workloads where you know that this particular service is doing a lot of encryption, you might want to target that to hardware that's running that fast chipset.
So one way to do this in Nomad 0.8 and prior is through constraints: If you're not familiar with constraints like you can add a constraint stanza like this to your job and you say three things about it. You say what attribute like what is the thing that you're constraining on. So in this example, it's
node.class and then what's the operator? I'm looking for equality. And then what's the values?
c4.large. So I'm saying that I want this job only run on c4.large instances. So what happens when this hits the scheduler? So when the scheduler sees the constraint attached to a job it's gonna treat it as a filtering mechanism. So on the right I'm just showing some nodes in a cluster, I've got all sorts of notes, but I only have two c4s. So when it's actually coming to a point where it needs to place those 15 instances, it's going to remove all the other nodes in the cluster and only try to place it on the two c4s. And that's probably fine if you have plenty of capacity in those c4s, but lets say that you don't.
Like you don't have capacity, both of those c4s are full, then you'll see an error like this. It'll say,
waiting for additional capacity to place the remainder.
So that was fine for certain types of use cases, but in Nomad 0.9 we're introducing the notion of a softer constraint and we're calling it affinity. An affinity is similar to a constraint, in the sense that it allows you to express a placement preference but the interpretation is different that's what's important about an affinity. It's going to be a best effort rather than must match. Going back to the same example, let's say that the only thing I changed was, I changed the word
affinity. I still have everything the same, so, now the way that this is going to be interpreted by the scheduler is that it is going to use it as a scoring mechanism.
So all the nodes in my cluster will be scored according to whatever affinities are provided in the job. As you can in this example, the c4 nodes will still get the highest possible score, but some of the other nodes are also considered or they're also scored. And then, it does an approach where it's gonna try to fit as much it can on the two c4s, because they're the ones with the maximum score, but once all the c4's capacity is filled, then it will place on like a couple of other nodes here, the blue and the red with the .57 and .32.
We are also supporting multiple affinities so you don't have to start with one and affinities can have weights so in this example, this job has two different affinities, one for
rack and one for
node_class and the rack affinity has a higher weight so the way this translates, in terms of when it's making a placement decision, is that a node that satisfies the
rack affinity is going to get a higher score than a node that satisfies a
node_class affinity. But if a node satisfies both then it's additive, so if we find a node that's both rack m1 and node class c4 it's gonna have a higher score than one that only satisfies rack or only satisfies node class.
We also support anti-affinities by allowing operators to specify negative weights. So in this example, we have an anti-affinity on t3.micro, which means that Nomad will try to avoid placing t3.micro instances as much as possible but again if there's nothing else like you don't have anything else in your cluster you only have t3.micros it'll still try to place it there.
We've augmented the Nomad CLI, so we have this
nomad alloc status -verbose CLI command that gives you a little more deeper insight into what's going on with Nomad's scoring. So there's a new table added called placement metrics in the end. We did have this table in Nomad 0.8 but it didn't look nice like this it was kind of very messy with lines of output that was hard to interpret, so we've cleaned all of that up and we've provided output in this tabular format where every row is a node ID and then columns are all the different factors that go into scoring.
So in this example we can see that the first two nodes got like an affinity score of 1.0 which is the maximum possible score and then it got a different bin packing score and all of that combined together to produce a final score. So this is useful for operators who are interested—if you see a certain placement decision it's nice to be able to dig deeper and really understand why Nomad made this decision, so we've added this output for those who care to see that level of detail.
So in a nutshell, affinities allow operators to express placement preferences, which Nomad will try to match in a best effort manner. The next thing, the next big thing that I want to talk about is…
» Problem: Priority inversion
What do I mean by priority inversion? It's a situation in a cluster where a higher priority job is not being able to be placed because all of the capacity is taken up by running lower priority jobs. So, lets walk through a real situation to see how that might happen. So lets say I started with a cluster the green just means that these are nodes that have capacity so I have plenty of capacity. I started by placing my business critical web app first. I set its priority to 90 which is pretty high, and I asked for five instances, so that takes up some capacity.
Now that business critical web app depends on a backend payment service, and that also is pretty high priority, so we ask for five instances of that, and that takes up some capacity. Then our analytics team comes along, and they want to run some batch jobs like analyzing some click stream data, or whatever. And that's low priority, because it's not in the critical plan. But they ask for 25 instances of it, and we have capacity, so Nomad goes ahead and places them.
Then the marketing team comes along, and now they want run an email marketing campaign, to get people to click on ads. Whatever use case you have. And so that, again, it's a low priority task because it's not in the critical path. But we ask for 25 instances. We have capacity, so Nomad places that.
Then the data science team comes along, and they're asking for 200 instances of priority ten, some data science model. If you think that 200 is kind of an unlikely count, it's actually not. So, some of these large machine-learning algorithms where you're splitting data into multiple machines, running neural networks on them—200 is fairly common.
So at this point, the cluster is pretty close to being full. Now let's say that we get a surge of traffic and we want to scale our web app, which we only had five instances of, from five to ten. They're gonna see in prior versions of Nomad, this would basically lead to a placement error, in the sense that it would be a blocked placement.
So an operator has to intervene. Somebody gets paged, you get up, you have to either turn off either all those marketing and data science jobs so that this job can run, or you have to scramble to add new capacity to your clusters so that you can scale up your web app.
Preemption is a solution to this. Preemption, the word just comes from operating systems to our operating system also preempts tasks all the time. The basic concept of preemption is that in order to place a higher priority task, we're gonna cause a lower priority task to stop running.
There's one thing that we still need to be very careful of, which is cascades. The easiest way to understand cascades is to think about a domino effect. So let's say that we're trying to place a priority 70 job, and to do that, we preempt a priority 65 job. Now the priority 65 job gets preempted, and gets added to the placement queue. Now to place the priority 65 job, we preempt a priority 60 job, and so on and so forth. I hope you see where I am going with this, right?
So the problem with that situation is that you're going to see a lot of churn and preemption and lots of tasks being stopped in the cluster. In order to avoid that, our approach right now is to use an implicit band of delta of ten. So any job that are too close to each other in priority, so if you have a priority 70 job and a priority 65 job, we will not preempt the 65 job to place the priority 70 job. It has to be at least a delta of ten or more from the job being placed.
Preemption, like selecting things to preempt, is actually a very hard problem. It's a multi-dimensional selection problem. So I wanna walk through you a specific example, and give you some insight into how it's gonna work in Nomad.
Let's say that we're trying to place this priority 70 job. We know that it needs a gigabyte of RAM, two CPU cores, and one gigabyte of disk. And this note on the right is where we are trying to place it, but the node is pretty full. The colors on the note represent different priorities. The numbers in the squares show you what priority they are. And then the reason there are different shapes is because this is pretty realistic. In a real world cluster in Nomad, you're not gonna have homogeneous tasks, right? Each one is gonna have different amounts of CPU, RAM, and disk. So that's why the shapes are different.
So the way the selection algorithm in Nomad works is that it's an iterative approach. It starts by first examining what's the available capacity on that node. So here it realizes that there's no available capacity, and it looks at how much capacity it needs.
It starts by examining the lowest priority allocations first. It starts from lowest to highest. So it finds the two priority 10 allocations, it examines how much RAM, CPU, and disk resources each of those allocations use. And then it uses a distance function to figure out which one is closest to the requirements that we are trying to match.
So in this particular example, we find that we actually do need to preempt both the 10s, and we add all that together, and at this point in the iterative algorithm, it's actually met its disk requirements. So we've got one gigabyte of disc free, and we have some amount of RAM and CPU, but that's still less than how much we need to make this placement successful.
The next step: It's gonna try to get the next highest set of priority allocations. So it examines the priority 15 allocation, and once it adds the resources used by the priority 15 allocation, we find that we've actually preempted enough, such that our available capacity is greater than, or equal to the needed capacity.
At this point, Nomad will try to place the priority 70 job, and it'll stop the other, the two priority ten and the one priority 15 job.
Now, we realize that until now, only operators were able to mark things to stop. And with preemption, the scheduler is gonna start making these changes, and that can be very confusing to operators. So in order to provide visibility, we've made a few changes.
The first change is that any allocation that's preempted by Nomad, so when the scheduler tries to stop it as opposed to the operator trying to stop it, it's gonna have a new
DesiredStatus field. So the DesiredStatus field is gonna say,
evict. So that way, if you see something like evict, you know that a human didn't do it, the scheduler did it.
We've also added two new fields to the allocation API, as well as the CLI:
PreemptedByAllocID. So PreemptedAllocs is any higher priority allocation that caused lower priority allocations to get evicted will have this field populated with the IDs of the allocations that it preempted. And then the other field is in the opposite direction, so if a lower priority allocation got preempted by a higher priority allocation, its PreemptedByAllocID field will point there. So we're also gonna make changes in the UI, and the CLI so you can see this and click around and kind of explore and see what happened when something got preempted.
Another thing we have done is we have added it to
nomad plan—the dry run. So especially in CI/CD environments, you probably do a plan before you apply. If the plan is gonna cause preemptions, there will be a section that shows what those preemptions are.
In terms of how this is getting rolled out, Nomad OSS is gonna ship in 0.9 with preemption capabilities for system jobs. And Nomad Enterprise will have preemption for batch and service jobs in the next release, Nomad 0.9.1. In a nutshell…
Preemption keeps your business at critical apps running
We talked about a lot of stuff, so I want to summarize everything that's coming in Nomad 0.9. We've implemented task driver plugins, which is gonna make the run time driver system in Nomad extensible. We've implemented device plugins, and we've built on top of that to support GPUs. And the scheduler got a bunch of improvements spread, affinities and preemption.
And I think I have two minutes, so I'm gonna try to show you a demo. Let's try to pray to the demo goddesses.
So I've got a Dev cluster running with a priority 20 job. And it's actually taken up all the resources, and I'm gonna try to run another system job. So this demo will be a system job preemption.
This is something new in Nomad 0.9 as well. You can submit jobs through the UI. It's hidden behind an ACL, but I disabled the ACL for this demo.
So I'm copy/pasting this job definition, and first I want to show you what happens if the priority is the same. So this is still a priority 20 job, so it cannot cause preemptions, because the other job is also priority 20. So I run plan, I see this expected output, resources are exhausted, I don't have room to place this job.
Now I'm gonna change the priority to 50. Run plan again. And we see this new section here, required preemptions. And Nomad the scheduler has determined that it can preempt this other thing that I have running in order to place this higher priority allocation. So I can control click, go there, explore, okay, this is something else. This is the other allocation that's running that's gonna get preempted. So let's say I'm fine with that. I can say, run.
And at this point, if I go back to jobs, you will see that this sysredis job, the higher priority job, is running and this one is in pending state. So it's being added back to the queue, and once you give it additional capacity, it'll be able to place that.
So that's preemption.