Presentation

Running GPU-Accelerated Applications on Nomad

See a demo of the Nomad device plugin system, which allows hardware resources, such as GPUs, to be detected and made available for scheduling user tasks.

Graphics processing units (GPUs) can provide significant benefits to numerically intensive workloads, from speedups to energy efficiency. They have become critical compute resources for workloads ranging from high-performance scientific computing to machine learning and cryptocurrency mining.

As of 0.9, HashiCorp's Nomad scheduler allows hardware resources, such as GPUs, to be detected and made available for scheduling user tasks through device plugins. In this talk, HashiCorp engineer Chris Baker presents an overview of this system, along with demonstrations using the NVIDIA GPU device plugin.

Speakers

Chris Baker

Transcript

Good afternoon, everybody. I hope you're having a great HashiConf. I'm going to be talking about running GPU-accelerated applications on Nomad. This is something that launched in the most recent version of Nomad. I did not work on it. But I have worked on GPU-accelerated applications for a number of years in previous jobs, so I'm excited about this.

I'm going to give a little context and history. Some people here may not be familiar with Nomad, some people may not be familiar with General-Purpose GPU computing. I'm going to do a little bit of a history thing and then talk about what this type of workload means for us, what it means for the people who are using it, and what implications it has towards running in a multi-cloud data center.

Why general-purpose GPUs?

General-Purpose GPU programming came around at a confluence of a couple of different scenarios.

Availability

It couldn't have happened on its own. There had to be that niche, but there also had to be the opportunity. We had the availability of these devices: GP-GPU was subsidized by the existence of commodity hardware for video games.

We'd been making the chips in these video game processors—these graphics processors—more and more programmable, and the games looked better and better. I'm sure you all know, if you put something programmable in front of a group of people, they're going to start programming it—and they're going to program it to do things that you didn't necessarily intend.

The first thing that happened, around 2000-2001 or so, was that people started taking these graphics processors and saying, "You know what, graphics are cool, but I can do matrix multiplication and compute Fast Fourier Transforms," and they gave talks on it. It was a fun little thing for a while, but it was very hard to use.

Performance

We needed performance. We weren't going to keep getting the performance gains we'd gotten for the past 10-20 years from scaling processor speeds alone. Moore's Law was going to let us down: transistor sizes were still shrinking, but that wasn't going to work anymore.

Energy efficiency

Similarly, we were going to start running into problems with energy. We can't afford a computer that uses twice as much energy; we can't do that forever. We have to put a cap on that at a certain point. High performance with energy efficiency typically comes from specialized hardware, and we had that specialized hardware sitting around already—specialized, parallel-processing chips inside our machines.

That's where it came from. Imagine you are a customer. Say you're the United States Department of Energy, and you're looking to build your next supercomputer. It's 2006. You say, “Well, I need one that's 10 times as fast as the current one because that's what I need to do the science that I need to do. But I can't afford it to be 10 times as expensive to run. I literally can't afford the electricity. I can't afford the physics of moving that much heat around." You'll say, "Well, we've got all these GPUs. People are talking about GP-GPU. That's going to give us the parallelism that we need, and maybe we should look into that." That's what we ended up doing.

OLCF Titan: GPU computing circa 2009

Back in 2009, I worked at Oak Ridge National Laboratory. That was around the time we set up the contract for Titan at the Oak Ridge Leadership Computing Facility (OLCF). Titan was the first hybrid supercomputer with CPUs plus GPUs. When it debuted, it was the fastest supercomputer in the world, and it stayed in the top 12 until this year, when it was retired on August 2nd—about a month ago.

Interestingly, we knew it was going to be hard. When we proposed building a supercomputer with GPUs, we got a lot of pushback. They said, “This is great. We know they're fast, but you expect to do real science on this machine that nobody knows how to program?”

We spent a number of years getting ready for it—before we even had the first GPUs. We were successful. We had six applications ready to do science on day one. That tells us a little bit about what it was like to do GPU computing back in 2009.

Working with GPUs circa 2009

Back then, everything was bespoke, everything was customized. You were mainly doing batch scientific workloads. You were taking your existing applications, your existing libraries, all the frameworks, and you were rewriting them to use these GPUs—doing it using new programming languages, new compiler techniques. It was interesting, it was exciting, it was super hard.

Everything was bespoke, down to the cluster-aware submission scripts that the operator had to write for a machine like Titan. When you wrote your submission script, you knew you were running it on Titan. When the scheduler was written, they knew it was going to be running on Titan. They knew how many nodes, how much memory per node. They even knew what the workloads were like, so they could balance that—hardcoded in the scheduler. The SLA for the scheduler was, "Here's my job—I already uploaded the binary—run it. If it fails, send me an email. If it succeeds, send me an email." You pushed play, you went on vacation. That's fine—that's great. That's how we got to where we are.

2019: Massive heterogeneity

Where we are now is 2019. Back then, we thought we had a problem with heterogeneous hardware: this machine is so hard, it's got CPUs and GPUs, and we have to program both of them. But what we have in 2019 is massive heterogeneity.

I have a private cloud, I have data centers in multiple public clouds and multiple regions, with nodes of different sizes. Some of them have GPUs, some of them have TPUs. Maybe I've got some ASICs—custom silicon running Bitcoin miners. Maybe I've got hardware security modules. All sorts of hardware, all sorts of node types, all sorts of software.

We are now in a situation where we have broad library support. All the work over the past 10-15 years to make software that can use this different type of hardware means that all of that hardware is available now for my developers. They have Spark, and they want to use it on GPUs. They have TensorFlow, and they want to use it with TPUs or GPUs or whatever. We have multiple architectures. I can't just pick an architecture, build one computer with it and call it a day.

Add to this the mentality that we want to be able to build our software and run it anywhere: many different types of jobs—data analytics pipelines that are running all the time, batch jobs that only happen on a schedule, and services that are exposed to the internet, serving our users, that have to be available 24/7. GPUs now—because of Moore's Law, because of the situation that we're in—are mission-critical. The only way to get the performance that we need for our applications—our businesses—is to use these devices.

What are the prereqs for modern GPU-capable scheduling?

This is going to be a scheduling talk, because Nomad is a scheduler. First, we just have to go ahead and acknowledge that we've got multi-cloud data centers. The scheduler must be able to support extreme hardware diversity, extreme heterogeneity across our entire fleet—multi-region.

We need to support developers. We need to give them the ability to schedule their jobs, to schedule complex application architectures. We're not necessarily deploying monoliths; we're doing microservices. Some of these services have different hardware requirements than others, but they still need to be able to talk to each other. All of this has to be possible.

Operator flexibility—we need to make it easy to keep these fleets running. If the fleets aren't running, then the apps aren't running, so this is an obvious prerequisite. Then reliability: we have to have our applications up and running, because this is where our money is. This is how we survive.

Managing heterogeneity

This is a talk about the complications that GPUs bring to that. This is what Nomad does—it manages the heterogeneity. The first thing that comes up with something like a GPU is that you have to find it. If you want to schedule workloads on a GPU, the scheduler has to know about the GPU.

We do this in Nomad with device plugins. Device plugins were previewed last year during the keynote at HashiConf, and they shipped this year as part of Nomad 0.9. Device plugins allow anybody—us, you—to write plugins to make hardware available—make devices available—for scheduling.

When the Nomad client—the agent that runs on a compute node—wakes up, the very first thing it does, and has always done, is fingerprint that node. It fingerprints that node to figure out what sort of hardware is available. It looks and says, "Oh, I've got four cores. I've got 16 gigs of RAM. I've got some disks. I've got some network."

Then it tells the scheduler about that and makes it available for scheduling work. Device plugins allow you to write custom plugins that do all of these things that used to be hardcoded into Nomad, so you can extend that to discover additional types of devices. They fill three roles: fingerprint the node and find any devices; inform the scheduler about those devices so that it can make scheduling decisions based on the user's stated needs; and then, later on, when an allocation happens—when it's time to run the workload on the client—help the client make the device available to the workload.

Nomad's device plugin interface

We provide an interface. It's a very small interface, as you see here; it's got three methods. The first is fingerprint—the one I mentioned. Simple: I've got a device type, and I need to know whether those devices are out there or not, so I write a method to find them. Piece of cake.

The second one is reserve. Reserve says, "I just got some work that I'm supposed to do. The scheduler told me that it's supposed to get a GPU, so I need to make sure that the GPU is available to the workload—by mounting it, by passing in environment variables, by doing whatever's necessary for that particular device."

The last one is stats. As the device plugin is running, it's going to have the ability to collect statistics and make them available because the user wants to be able to watch certain things.

These work with Nomad as a pluggable system. You have the ability to write these. Anybody can write one of these. When the Nomad client wakes up, it looks in the plugins directory, and it starts running them according to the configuration of that client. If you want to write one of these plugins, all you have to do is implement this interface, compile it into a binary, and deploy it alongside Nomad. It's super easy to do.
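As a rough sketch of that deployment model—the directory path and the "acme-fpga" plugin name here are hypothetical—a client agent configuration might point Nomad at a plugin directory and pass options through to a third-party plugin like this:

    # Nomad client agent configuration (sketch)
    plugin_dir = "/opt/nomad/plugins"

    plugin "acme-fpga" {
      config {
        # Any options in here are defined by the plugin author.
      }
    }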

We did this for a couple of different reasons. First, we don't want you to have to fork Nomad to implement support for a particular device. We want you to be able to take the existing Nomad that we release and support whatever devices you need.

You—here being the community, perhaps publishing open-source device plugins for things like TPUs. Or you—perhaps as a business—writing proprietary, closed-source plugins to support whatever weird stuff you have going on at your business that you don't necessarily want to put out in the open. That's the flexibility that we're trying to enable here. That's the ultimate ability to capture whatever heterogeneity you have in your data center, in your application stack, in your operational life.

GPU device support

The device plugin system doesn't have any particular requirements. Our motto with Nomad is to run any application on any infrastructure. GPU device support is available on all job types in Nomad—whether it's a system job that runs on every node, a service job (the long-running jobs that we're going to make sure we keep up), or a batch job. You can imagine batch is probably one of the biggest consumers, but there are use cases for all of these.

It's also available via the Nomad-Apache Spark scheduler. If you're a Spark user and you have application workflows written on Spark for machine learning, you can access GPU devices via this as well. It's available in multiple task drivers—exec, raw_exec, Java, Docker—and on multiple architectures. The rule with Nomad has always been that if you can run the application on the node, then Nomad can run the application on the node for you. Exactly the same is true in this scenario as well.

Running a GPU job on Nomad

This is the title of the talk. What are the steps in using a GPU device? First, you have to have the GPU plugin. With Nomad 0.9—to make this as easy as possible—we actually build the NVIDIA GPU plugin into the Nomad binary. If you've got the Nomad binary for 0.9 on your cluster, you have GPU support on your cluster. Whether the GPUs are present or not, it's part of the heterogeneity that we have to manage. That's step one—it's taken care of for you.

Step two: you have to tell your job that you want a GPU. Piece of cake. We have a new stanza in 0.9. It's a device stanza, and it's where you indicate the devices that you need access to. If you want a GPU, it's one line in your job file: device "gpu" {}. That's it. When your job runs now, the NVIDIA runtime is going to get the information that it needs to make that device available to your job. Done. That's it.
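As a minimal sketch of what that looks like in a job file—the job name, Docker image, and resource figures are only illustrative—the device block sits inside the task's resources:

    job "gpu-hello" {
      datacenters = ["dc1"]

      group "example" {
        task "smi" {
          driver = "docker"

          config {
            # Illustrative image; anything that can talk to the GPU will do.
            image   = "nvidia/cuda:10.0-base"
            command = "nvidia-smi"
          }

          resources {
            cpu    = 500
            memory = 256

            # The one line that requests a GPU.
            device "gpu" {}
          }
        }
      }
    }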

This is good. This is all good for a conference hello world talk on how to use a GPU. But if we're going to talk about managing heterogeneity—about dealing with a global multi-cloud fleet with devices of different types, with applications of different types—we need to go a little further than something like this. It's a lot more complicated out there, and we have many different types of applications that are maybe vying for all the same resources.

Scheduling flexibility

Scheduling flexibility is one of the first things we have to help us. The device stanza offers multiple mechanisms to help you make sure that you have the resources you need when your job runs. The first one—if you look here—is the device name itself. In the previous example, I just said device "gpu", but you have a fair bit of expressive capability inside this one name. A full specification says, "I need a GPU device": type GPU, manufacturer NVIDIA, and here's the model number. You can do that, or you could just say NVIDIA GPU, or go as far as the model number, or simply say GPU.
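For illustration, all three of the following are valid device names inside a task's resources block, from most general to most specific (the model string is whatever the plugin fingerprints, so treat it as an example):

    device "gpu" {}                    # any GPU from any vendor
    device "nvidia/gpu" {}             # any NVIDIA GPU
    device "nvidia/gpu/Tesla V100" {}  # a specific NVIDIA model (example name)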

The next one is the count. Some jobs need a GPU, some jobs need two GPUs. That's part of the device resource as well. The next one is a constraint. You can imagine that you have a job where you know you need a certain amount of GPU memory, just like you have a job where you know you need a certain amount of CPU and memory.

Here you see we have selected one of the device attributes—memory. These attributes—all this metadata—are picked up by the fingerprint. When the fingerprinter runs, it finds attributes that are specific to each device and makes them available to the scheduler, which makes them available to you—the developer or the operator—when you're trying to run your jobs. In this case, I know that I want a GPU, and I know that it has an attribute called memory, so I say, "Hey, I'd like at least four gigabytes." That's a constraint—a hard constraint. This job will not run unless it gets four gigabytes. If that means there is no GPU with four gigabytes available, it may mean that this job does not run for a little while. It's blocked.
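Put together, a count plus that hard memory constraint might look like this sketch inside the task's resources:

    resources {
      device "nvidia/gpu" {
        count = 1

        # Hard requirement: only place this task on a GPU with at least
        # 4 GiB of memory. If none is available, the job stays blocked.
        constraint {
          attribute = "${device.attr.memory}"
          operator  = ">="
          value     = "4 GiB"
        }
      }
    }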

In addition to that, though, we also have affinities. Affinity is another capability that was added in Nomad 0.9. An affinity is a placement preference. It allows the scheduler to say, "This person, this operator who submitted this job, has some preference for a particular type of node, but it's not a hard requirement." It's something that goes into the weighting, but it's not going to keep your job from running. In this case, I say I have a preference for the driver version—the NVIDIA runtime version. I'd like it to be 340.29 or later, but it doesn't have to be. This is for cases where you'd like your job to run faster if possible, but mostly you'd just like your job to run.
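An affinity has the same shape but carries a weight, so it nudges placement instead of blocking it. A sketch, assuming the plugin exposes the driver version as a device attribute named driver_version:

    device "nvidia/gpu" {
      # Soft preference: favor nodes whose NVIDIA driver is at least 340.29,
      # but still run somewhere else if that's all that's available.
      affinity {
        attribute = "${device.attr.driver_version}"
        operator  = ">="
        value     = "340.29"
        weight    = 50
      }
    }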

Heterogeneous workloads

Let's talk about a more interesting example as well. The classic interesting example is going to be a heterogeneous workload. Let's take this. I didn't really show you the task from before, but I added a GPU stanza—that's great. In general, though, we have different types of workloads that are going to have different types of requirements. The example I'm going to be working with today is a little more interesting. It has an NVIDIA TensorRT Inference Server. This is a product from NVIDIA. It's an inference server, and it runs as a backend server—in a cluster mode, actually—where it can sit there and do inference based on requests that come in through its APIs.

Here I've got the task definition for this job. You'll notice down here at the bottom we've got a couple of interesting things. We've got this artifact stanza where you can pull down your trained model. Because I'm doing inference against some model that I trained ahead of time, I need to get that into the container. That's how you do that.

The other one is the service stanza you see here. The service stanza says, "This backend job is going to be registered by Nomad as a service." Non-trivial application deployments require things that you're not going to get out of trivial batch-processing schedulers. One of those is service registration and service discovery, so you can connect the things that need to be connected.

The other thing you'll see here is Connect. This is a new feature in Nomad 0.10. This says, "I want a Consul Connect sidecar to handle the encryption and to dictate who's allowed to talk to this container and who is not." It not only takes care of the service registration, it makes using that service easy by allowing me to talk to localhost inside of my network namespace and reach the services I'm trying to talk to.
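Pulling those pieces together, the backend group might look roughly like this sketch—the image tag, model URL, and port are illustrative, and Connect relies on the group-level bridge network that comes with 0.10:

    group "backend" {
      network {
        mode = "bridge"
      }

      # Register the inference server in Consul and give it a Connect sidecar.
      service {
        name = "tensorrt-backend"
        port = "8000"

        connect {
          sidecar_service {}
        }
      }

      task "rtserver" {
        driver = "docker"

        config {
          # Illustrative image for the NVIDIA TensorRT Inference Server.
          image = "nvcr.io/nvidia/tensorrtserver:19.02-py3"
        }

        # Pull the pre-trained model into the allocation before the task starts.
        artifact {
          source      = "https://example.com/models/resnet50.tar.gz"
          destination = "local/models"
        }

        resources {
          device "nvidia/gpu" {}
        }
      }
    }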

Here's the frontend for this application deployment. It's the web frontend. It doesn't need a GPU. It is going to run this frontend application. It's going to use Consul Connect to connect upstream to that backend inference server. That's going to allow it to easily communicate with that thing using mutual TLS—and without having to worry necessarily about where it's running.
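The frontend group is then the mirror image: no device block, but a Connect upstream pointing at the backend service. Again, the names, ports, and image are illustrative:

    group "frontend" {
      network {
        mode = "bridge"

        port "http" {
          static = 8080
        }
      }

      service {
        name = "web-frontend"
        port = "http"

        connect {
          sidecar_service {
            proxy {
              # The frontend talks to localhost:8000 inside its own network
              # namespace; Connect proxies that to tensorrt-backend over mTLS.
              upstreams {
                destination_name = "tensorrt-backend"
                local_bind_port  = 8000
              }
            }
          }
        }
      }

      task "web" {
        driver = "docker"

        config {
          # Hypothetical frontend image.
          image = "example/tensorrt-frontend:latest"
        }
      }
    }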

Resource fragmentation

Let's talk about another problem. We're talking about real-world problems in running these GPU applications, and one of the problems that you always have with the cluster is resource availability. You always have the potential for something like resource fragmentation, but when you throw GPU devices into the mix, this actually gets to be a lot harder than it was before.

You have nodes which have GPUs along with CPU and memory, and then you have nodes that have just CPU and memory. Much more easily than without GPUs and other constrained devices, you can end up in a situation where you have jobs that aren't able to run because they need this special thing—this one little device that makes the job special. A GPU is not available because, in this case, this non-GPU task is sitting on node-a eating up enough CPU and memory that it prevents the GPU task that we want to run from getting to the GPUs on node-a—even though, in this trivial example, there's plenty of room on the cluster for everyone to be happy.

How do we fix this? Well, this is a scheduling problem, and Nomad is a scheduler. Nomad has—as you're about to see—a number of different approaches for preventing this from happening in general, but also for preventing some of the special cases around GPUs.

Using an anti-affinity

The first—and one of the easiest things to do—is to use an anti-affinity. This pretty much is a greedy early decision. But it says, "Don't put non-GPU jobs on GPU nodes.” It acknowledges immediately that GPUs are special, and we don't want anybody else using those nodes. We're not going to use a constraint because we want our jobs to run. We don't want some job that doesn't require a GPU to not be able to run because the only resources available are on the GPU node. But we're going to put that affinity in that says pretty much, "Let's try to keep our non-GPU workloads off our GPU nodes."

This is going to do a couple of things. The first is that it's going to let you scale down your GPU nodes when you're not using them—courtesy of Nomad's bin-packing scheduler mechanics. This is good because—I don't know if you're aware—GPU nodes are more expensive than non-GPU nodes. An anti-affinity is one way to do it. In this case, the example is that I look at a node class—there are a number of different ways to do this using metadata on the nodes—but I just said node class not equal to "gpu" for this non-GPU job. That says, "Nomad, please don't do it, but do it if you have to."
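A sketch of that anti-affinity as it would appear in the non-GPU job, assuming the operators have assigned their GPU nodes a node class of "gpu":

    # Prefer nodes whose class is not "gpu", but allow them if nothing else fits.
    affinity {
      attribute = "${node.class}"
      operator  = "!="
      value     = "gpu"
      weight    = 50
    }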

In fact, on Nomad Enterprise, you can even do this automatically. You can enforce this with a Sentinel policy. If you're in an organization and you really want it to be the case that your developers or operators aren't deploying these workloads on nodes where they don't necessarily need that capability, you can write a Sentinel policy very similar to this one.

In this case, the Sentinel policy says, "For each task in my job, it either better have an anti-affinity in the job, the group or the task—or it better be using a GPU."

Job priorities

Another way to do it is with priorities. You can give GPU jobs a higher priority than non-GPU jobs. This means that as resources become available—as jobs finish and free up resources on nodes that were previously occupied—the GPU jobs will have a higher priority. This increases the chance, on a node with resource constraints, that your GPU jobs will be able to run and your other jobs will be placed later on other nodes.
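Priority is a single job-level field (1 to 100, with a default of 50); a sketch with hypothetical job names:

    # GPU job: raise the priority above the default of 50.
    job "inference-backend" {
      priority = 80
      # ...
    }

    # Non-GPU job: leave it at, or below, the default.
    job "web-frontend" {
      priority = 40
      # ...
    }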

Again, this is not a problem if you have a cluster out there which is over-provisioned—one that has more CPU, memory, and GPUs than you need to run all the jobs that you want to run. That's fine; you don't have any of these problems, but that's an expensive way to solve the problem.

I want to talk about Nomad's ways of solving this problem using its advanced scheduling features. Priority is one way to do it. Another nice thing that we have in Nomad 0.9 is preemption. If you're using Nomad Enterprise, you can use a combination of priority and Nomad's preemption capabilities to go ahead and do this for you. This means that if I have a GPU job with a higher priority and some other non-GPU job is eating up its resources, Nomad can evict that job in favor of the higher-priority GPU job.

That evicted job might be fine. It might be able to go run somewhere else because it doesn't need a GPU. This is one of these examples where some of the advanced scheduling capabilities that we've been adding to Nomad—and will continue to add to Nomad—help this particular use case run a little smoother on large-scale clusters with large-scale applications.

The demo

This demo is, in fact, the example I was talking about earlier in the slides. We have a Nomad job—it's this TensorRT job—and it's got two different tasks in it, running in separate task groups. This means that when anybody goes to schedule it, there's no requirement that they be put on the same node. We don't want that requirement, because they have different resource requirements. The frontend is a little webserver. It needs a little bit of CPU and a little bit of memory. It has different scheduling semantics. It needs to be exposed on different ports. It needs to be available to the user's browser, but it doesn't have to be running alongside the other one.

The other part is the inference engine backend—this NVIDIA product. It does need a GPU, so we have two different tasks. They're both registered here in Consul. Nomad has Consul integration for registering services, for hooking them up via Consul Connect, for finding them later on, for making them available to the load balancers.

You see here in my browser, I've got a little box here. This is an inference server—it’s been trained up with a whole bunch of different images allowing TensorFlow to figure out what's in the image. I'm going to go ahead and get an image here from my browser. I'll pick this. What you see here is a cup, specifically a Hashi cup. It's the coffee mug, I think, from last year's HashiConf.

The frontend here is going to talk to the backend. It's going to upload this image to the backend. The backend is going to do its thing. This is our business logic. This is this GPU-enabled capability that's running on our infrastructure—all the time providing value to our customers. Once it's finished, it's going to push that image into the inference engine, which is going to use the model that it was trained on to try to figure out what is in the image, and then it's going to tell us. I'll give you a hint. The answer is it's a cup, specifically a coffee mug. I'm going to go ahead and bail from that.

Nomad is a scheduler designed from the very beginning—from Nomad's earliest inception in its very first release—for managing heterogeneity, for running the applications that people need to run. Not everybody's using Docker containers or wants to use Docker. Not everybody's running on Linux. The original design of Nomad, and its continuing mission, is to be able to run any application on any infrastructure—and, with device plugins in Nomad 0.9, to be able to manage the heterogeneity that is the reality of today's large data centers: today's federated, globally distributed, multi-cloud data centers.

One way we do this in 0.9 is device plugins, which allow you to support any arbitrary device, and we're going to continue to build on that as we go forward. The other is the advanced scheduling capabilities that we've always had in Nomad and will continue to build into Nomad to better support these devices.

All the material from this—or most of it—was part of a blog post that we collaborated on with NVIDIA. If you want more information, including the examples that I used in this talk, you can find it on either the NVIDIA Developer blog or the HashiCorp blog. Thank you.
