Assumptions & Adventures in Nomad Autoscaling

This session tells the tales of uncommon Nomad use cases that deviate from standard Nomad Autoscaling assumptions, and it presents solutions for those cases.

Speaker: James Rasell


Hi, everyone. Welcome. Thank you for joining my talk — Assumptions and Adventures in Nomad Autoscaling. My name's James. You may know me online as jrasell. I work on the Nomad ecosystem team alongside Luiz Aoqui and Jasmine Dahilig. 

But that’s just the engineering team. There are a lot of other people involved. We have the marketing team, management, education, sales... Everyone is involved in helping make this product. We also have a great community. They provide us with great feedback, and they also help drive feature development. Thank you to everyone involved with that. 

The Nomad Autoscaler

One of our team’s primary responsibilities is the Nomad Autoscaler. This is, as it says, how to autoscale your workloads and your clusters. The Nomad Autoscaler supports three different and distinct types of autoscaling. 

The first is the most traditional. This is horizontal app scaling. This is where we increase or decrease the number of instances of your application to meet SLAs — to handle the traffic you have. The second is horizontal cluster scaling. When we're scaling your app in and out, you want to have either the headroom to scale, or to make sure that you're cost-effective and not paying for servers that you don't need. Horizontal cluster scaling is the addition or removal of nodes in your cluster. 
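As a concrete sketch, horizontal application scaling is driven by a scaling block inside a job's task group. The group name, bounds, and target value below are hypothetical, illustrative choices:

```hcl
job "web" {
  group "web" {
    # Starting count; the autoscaler adjusts this between min and max.
    count = 3

    scaling {
      enabled = true
      min     = 2
      max     = 10

      policy {
        # Query the built-in Nomad APM for average CPU across the
        # group, and aim to keep it around 70%.
        check "avg_cpu" {
          source = "nomad-apm"
          query  = "avg_cpu"

          strategy "target-value" {
            target = 70
          }
        }
      }
    }
  }
}
```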

The third one is our enterprise offering. This is vertical application scaling — or what we term dynamic application sizing. Picking CPU and memory resource values for your jobs is tricky, and it's generally a ballpark figure. It looks at historical data and tries to give you recommendations so you don't have to worry about it — so you're as efficient as possible. That is a very large topic. I won't be going into much of that today, but if you're interested, please reach out. 

How Do We Support These Three Types of Autoscaling in a Single Application? 

Well, like many products at HashiCorp, we use go-plugin. go-plugin is an RPC plugin system over net/rpc or gRPC. It allows us to define interfaces — and then we can develop plugins either internally or externally. It decouples the process and means we can develop fast and the community can contribute without much effort. 

This was the ecosystem one year ago. I used this slide last year at HashiConf, and at that point we only supported horizontal application scaling. It had limited flexibility, and the target plugin itself was just Nomad; it did its job well, but it wasn't very flexible.

This is where we are today. This diagram might even be out of date at this point in time, and I'll have to come up with a new way to demonstrate how it looks. But as you can see, we've had huge growth in plugins. We now support a wide variety of cloud and service providers, and a lot of this has been driven by the community — so thank you again. We also had a recent focus on our strategy plugins. Target-value was a good plugin, but it was tricky to use in some circumstances. We now have pass-through, threshold, and fixed-value, and I'll go into a couple of those in a bit. 

Autoscaling Is Hard

Let's get into some assumptions and some adventures in autoscaling. The first one is that autoscaling is easy. Cloud providers have done a good job of making it a commodity service. And as developers, our goal is to present it in a way that's easy for you to understand. But implementing an autoscaler is tricky. That also goes for implementing autoscaling in your application. Even adding autoscaling to one app is hard. Autoscaling can uncover unknown behaviors in your application. The core of implementing autoscaling is understanding the metrics that determine how your application scales. 

Traditionally, this may be CPU or memory-based, but they're not great indicators. You might look to use latency, which is a direct metric for your users. But it's a trailing metric. By the time that metric is alerting, it might be too late. Your users might go elsewhere. 

Then what happens when we do start to scale your application up and down? We might uncover bottlenecks, particularly as we scale your application out; maybe you start to exhaust the pool of connections you have to your database. Even things like logs, metrics, and ingress can increase costs. They can increase network traffic, and all of this uncovers these unknown behaviors. 

Then what happens when we're scaling your application in? Well, maybe your application needs to perform certain shutdown routines. Maybe your application needs to de-register itself from somewhere. Maybe it wants to make sure its connections are closed properly. Having your application listen to the signals it's getting and shut down properly is critical. Otherwise, it can result in a bad user experience. 

Task Stanza Helpful Parameters

Outside of your application code, Nomad can try and help you here. We have some task stanza parameters that can be helpful. The first is the kill timeout. This specifies the duration Nomad waits for your application to shut down gracefully before it kills it. 

This has a fairly short default of five seconds. That is okay for most applications. But if your application is on that threshold when shutting down, or above it, definitely increase this. 

The second is the kill signal — mildly useful, less so than the kill timeout. This specifies a configurable kill signal. If your application expects a particular signal to shut down, use that. 
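Put together, a task using both parameters might look like the following sketch — the 30-second timeout, SIGTERM, and image name are hypothetical values for an app with a slow graceful shutdown:

```hcl
task "app" {
  driver = "docker"

  config {
    # Hypothetical image name.
    image = "example/app:1.0"
  }

  # Give the application 30 seconds to shut down gracefully
  # before Nomad force-kills it (the default is 5 seconds).
  kill_timeout = "30s"

  # Signal sent to trigger the shutdown (the default is SIGINT).
  kill_signal = "SIGTERM"
}
```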

Horizontal Cluster Scaling Is Fraught With Danger

If application scaling is tricky, cluster scaling is just plain hard. Scaling out, we can lean on the provider. But scaling in, we have to make a series of calls in sequence and correctly ensure there are no errors. The first one is node selection. We need to make sure we pick appropriate nodes for you. We don't want to churn work, or cause hours or days of processing to be cut short and have to be restarted.

Then we have node draining. Node draining is the process of migrating the work off that node. Again, this links into your application listening to signals. We also need to provide feedback. This activity can last hours. Understanding which applications are waiting to be drained or have drained — and whether there are any problems — is important. The last one is node termination. Node termination is the act of talking to the provider and saying, "Please remove this, terminate this server, terminate this node." 

That has to go across unstable, unpredictable networks. There can be errors there. Also, we have the option of running additional tasks after node termination, such as purging the node from Nomad's state. This is, in itself, a number of calls that we have to make — while making sure there are no errors.

Job Migrate Stanza

The Nomad job spec has this migrate stanza, which is aimed at helping control the draining of nodes and your jobs in particular. Max parallel is the number of allocations that we can migrate at the same time. If you have five allocations of a job, maybe one is fine. If you have 1,000, one might be a bit slow — and you might want to up that. 

The minimum healthy time is the time for which a migrated allocation, once it comes back up, must be healthy before it unblocks further allocation migrations. That is useful in that, if your app takes a minute to come up, warm up, and do some work, you can just set it a little higher. The healthy deadline is the time in which Nomad will allow the allocation to become healthy. If your app takes a minute to start, then two minutes is a good value to set — beyond that, it's probably not going to come up at all.
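As a sketch, a migrate stanza tuned for the example above — an app with many allocations that takes about a minute to warm up — might look like this; the specific values are illustrative:

```hcl
migrate {
  # Migrate up to 2 allocations of this group at a time.
  max_parallel = 2

  # Judge health using service health checks rather than
  # just task state.
  health_check = "checks"

  # A migrated allocation must stay healthy for 30 seconds
  # before further migrations are unblocked.
  min_healthy_time = "30s"

  # If the allocation isn't healthy within 2 minutes,
  # consider the migration unhealthy.
  healthy_deadline = "2m"
}
```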

Cluster Target Options

Then on the Nomad Autoscaler side, this is an example from a cluster scaling policy — in particular the target plugin section, here using the Azure VMSS target. The node drain deadline specifies how long to wait for allocations to finish migrating. If all the allocations haven't finished migrating by this time — one hour — they will be forcefully killed. 

The second one, the node selector strategy, is something that we've added recently. This changes the behavior of how we pick nodes to terminate. We currently have four options available. We have least busy. This looks at the amount of allocated resources on each node and picks the ones with the least amount. It's not to be confused with utilization. The second is newest create index. This is just the order that we get nodes back from Nomad. It's the least computationally heavy because we don't have to do any sorting.

We then have empty and empty ignore system. These will only pick nodes which are not running any allocations — or not running any allocations which aren't system jobs. These are perfect for something I will talk about in a minute. 
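A cluster scaling policy's target block combining these options might look like the following sketch, assuming the Azure VMSS target mentioned above — the resource group, scale set, and node class names are hypothetical:

```hcl
target "azure-vmss" {
  # Hypothetical Azure identifiers.
  resource_group = "hashistack"
  vm_scale_set   = "clients"

  # Only consider client nodes of this class.
  node_class = "hashistack"

  # Wait up to an hour for allocations to migrate off a
  # draining node before they are forcefully killed.
  node_drain_deadline = "1h"

  # Only pick nodes that are running no allocations at all.
  node_selector_strategy = "empty"
}
```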

This brings me to the idea that autoscaling is only useful for services. It's true that it's very useful for scaling services, and that's usually the first port of call when you adopt autoscaling. But it's also useful for other types of work, in particular batch work. It's an area we've been working on a lot recently. 

Autoscaling for Batch Work

For batch work, scaling from and to zero is a great idea. Batch jobs are expensive. Sometimes they use special hardware: high CPU, high memory, or GPUs. And they generally have a running time which isn't forever. They may run for an hour or a couple of days. 

On AWS, the second-smallest instance in the G3 class costs about $1.10 per hour — that's roughly $10,000 per year for one instance. And you're probably going to be running more than one. You could spin up that infrastructure manually with Terraform. But even that's costly, it's error-prone, and it takes away human time that could be better spent elsewhere. 

Also, staying with AWS, ASGs don't cost anything when they don't have any instances running. Why take the effort to spin up that infrastructure and spin it back down yourself when the autoscaler can do that for you?

Batch Scaling Policy

This is a shortened example of what a batch scaling policy might look like. We have a source of Prometheus. In this example, we're going to assume that the query returns an absolute result. Say each allocation that we're going to run requires one GPU, and each instance we run has one GPU, so we have one allocation per instance. The number of running and pending allocations is then the number of instances that we want to have running. 

This new pass-through strategy doesn't do any further computation on that value — on the query result. It takes the value and passes it straight through to the target. Then, in the node selection, we can use empty. We don't want the autoscaler to prematurely remove nodes that are running work. Say a batch job — this encode job — is 90% of the way through and we migrate that work off; it has to start from zero again. So this helps avoid that. 
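Putting the pieces together, a shortened batch cluster-scaling policy along these lines might look like the following sketch — the Prometheus query, ASG name, and bounds are hypothetical:

```hcl
scaling "batch" {
  enabled = true
  min     = 0
  max     = 10

  policy {
    cooldown            = "2m"
    evaluation_interval = "1m"

    check "queued_work" {
      source = "prometheus"
      # Hypothetical query returning the absolute number of
      # instances needed (one GPU allocation per instance).
      query = "sum(encode_jobs_running + encode_jobs_pending)"

      # Pass the query result straight through to the target.
      strategy "pass-through" {}
    }

    target "aws-asg" {
      # Hypothetical ASG and node class names.
      aws_asg_name        = "batch"
      node_class          = "batch"
      node_drain_deadline = "1h"

      # Never terminate nodes that are still running work.
      node_selector_strategy = "empty"
    }
  }
}
```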

This is a quick example of how that might look. We have the platform ASG. That could run our long-lived services, such as our Grafana, our Prometheus, and most probably our Nomad Autoscaler. 

We then have the batch ASG. That's currently set to zero instances — so not costing anything, not running anything. And we have a work queue where the work gets submitted. So, what can happen? We have work that's submitted. We have these encode jobs that turn up, and the autoscaler will look at that, run the query, and say we need two instances running to handle this work. As those instances come online, we take that work out of the queue. And because we're using the empty node selection strategy, the autoscaler will not select any of those nodes to terminate until the work is finished. As that work finishes, the nodes gradually empty out and are removed, and we're left back where we started. 

This can use external metrics. Nomad 1.1.0 also added additional metrics regarding blocked work. There are a couple of ways that you can achieve this. I will share later a link to a demo where we run this batch work scenario in AWS. Then to the final idea.

Autoscaling Does Not Suit MultiCloud 

It's true in the traditional sense: cloud providers have their own autoscaling service, and it interacts with their own other services. But Nomad has a great multi-region federation story where multi-cloud is a viable option. It's something we talk about internally quite a lot — and something that we believe there's definitely a future in. 

One big idea is cloud-bursting from on-premises. And you might say, "But everyone runs in the cloud." Not particularly, no. There might be a number of reasons why you might have on-premises workload at the moment. 

You might have a long-running contract that you need to see through. You might have expensive hardware that you still want to utilize. You might be in the process of doing your cloud migration. These cloud migrations can take a number of years. 

On-premises can form your base of compute. It can run all the long-running services that are critical to you — your monitoring, your log aggregation, maybe some of your ingress. But what happens if your workload exceeds that capacity? It will probably take you weeks to order a new server, rack it, and configure it. By that time, it's far too late; it's wasted money. 

Why not use the elasticity of the cloud to help you when you need it? It's much like the batch demo I previously talked about, where the platform autoscaling group is your on-premises — your long-running infrastructure — and the batch autoscaling group is the cloud elasticity that you use when you need some spillover. 

MultiCloud Autoscaling (The Future)

Cloud environments are very complex, distributed systems. Outages can occur. What happens if your primary provider is having an issue? You can migrate to another region. But that's often tricky. It often falls into a thundering herd problem where everyone's doing it at the same time into the same region because it's the next closest. 

Even in cloud providers, some services can fail completely. AWS IAM — if that's having problems — might impact the entire provider. So maybe you want to scale to a different provider to weather out these availability issues. 

Most of the major providers offer a spot market where you can utilize unused compute at a fraction of the cost — sometimes 80% less than the normal price. Maybe you have work that suits this: maybe you start 10,000 allocations in your primary provider, and they run. Then suddenly the spot price of the particular instance you're interested in rises because it's becoming popular, and it becomes too expensive compared to another provider. Then you might want to run the next batch of 10,000 allocations on a different provider and get the cost-effectiveness that is critical for your business. 


If you're interested in finding out more, see the links below. If you have questions, please reach out on the Discuss forum. We have guides and demos. If you come across any issues or want to contribute, we're always happy to chat — always happy to talk and to listen. We also have a blog post which details some of the recent work we've been doing around the strategy plugins. Please head over and read that. 

Thank you for listening. If you have any questions, feedback, or just fancy a chat, please let me know.
