Managing Stateful Workloads With Nomad
Oct 07, 2019
Learn how upcoming Nomad features can help operators and developers manage access to storage and how they can run their stateful workloads safely inside Nomad.
As organizations increasingly move workloads to cluster orchestrators, they frequently run into issues when trying to manage their stateful services. In this talk, we will explore how upcoming Nomad features can help operators and developers manage access to storage and how they can run their stateful workloads safely inside Nomad.
Senior Software Engineer, HashiCorp
What are stateful workloads? Stateful workloads save data to persistent disk storage for their own use, and provide things to other services. An example of this would be a database or a key-value store, like Redis or MySQL, or basically any HashiCorp product.
Managing persistent workloads is pretty hard. But they're core to basically everything that we all do. If you work on a web service, you will have at least 1 database somewhere, probably more than you know about because someone will have spun something up at some point in time and forgotten about it.
And because a lot of these services dictate the usability of your application, they're often managed very carefully. They often run on dedicated servers separate from the rest of your fleet. They often go a while without being completely destroyed and brought back up because you're scared of turning your database off.
Which is reasonable. But it often means that they are this mess, and you don't know how this machine is configured anymore or what the hell is going on. So when it breaks at 2 am and you're on call, you're reverse engineering how this machine was spun up, like 2 years ago. It's just a really sad time. And it only gets more complicated as your infrastructure evolves.
If you wanted to introduce that to a data plane layer and it's running on some random boxes somewhere, that is a whole new set of stuff you need to set up monitoring for, to try and configure, and to understand how it works.
That's a lot of pain. And they don't benefit from our native integration because they're just running somewhere.
But if you separate your database application from the underlying persistent storage, then you can move that into your workload orchestration and get all of those benefits in your persistent workloads.
That can help you simplify a lot of your operations, because you reduce the overhead of having to manage multiple deployment paradigms, unify a lot of your monitoring, and also, because we provide a lot of operational metrics about resource usage, application crashing, and logging, and everything else, it gets pretty nice.
Nomad deployments also give you the ability to coordinate a lot of your upgrades across replicated services. You can more easily manage rolling out new versions, configuration changes. And you get visibility into that through our logging and task events and stuff. We can also handle a lot of the failure domains, from handling crashes to handling nodes dying eventually, and all of those sad things.
That's all great, but it requires us to have a way of having storage that outlives your container. Containers are very ephemeral. It's a chroot somewhere on a disk somewhere. You blow it away and everything is gone, bye-bye.
» Nomad today
So let's take a look at what you can do in Nomad today before we take a look at where Nomad is going. Everything Nomad does with system storage shipped in 0.5.0, 2 years ago.
We're going to start out by taking a quick look at the ephemeral_disk feature that Nomad has, which has a mode called "sticky volumes." That means that when a node hasn't gone away and you ship an upgrade, we try and schedule it on the same node, and then copy the allocations directory into the new allocation. You just have the data that was already around.
This is pretty useful if your application can technically recover its state, but benefits from just having it. It's an enhancement of that. We also have the ability to migrate data between different Nomad clients when we reschedule your allocation.
But this only works if that node is still healthy. If that node has gone away, so has your data. I don't know about you, but losing data keeps me up at night. I have had nightmares about losing data. It's not the best.
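As a rough sketch of what that looks like in a job file, assuming a hypothetical group called "cache" and an arbitrary size: the ephemeral_disk stanza sits at the group level, where sticky asks Nomad to prefer the previous node and migrate asks it to copy the data when the allocation lands elsewhere.

```hcl
# Hypothetical group name and size, for illustration only.
group "cache" {
  ephemeral_disk {
    sticky  = true # prefer placing the replacement allocation on the same node
    migrate = true # copy the data over if the allocation moves to another node
    size    = 300  # requested disk space, in MB
  }
}
```

Both settings are best effort. As noted above, neither helps once the original node is gone.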
Some people who use our Docker driver also use Docker volumes to mount arbitrary paths from the machines into the containers, which is pretty useful, if you're using something where being ephemeral is sort of OK. But having that data around is nice. Or if you're mounting it onto EBS volumes under the hood.
And Docker volumes also have this concept of their own type of driver. If you're using something like REX-Ray, they'll automatically mount volumes from EBS or GCP or whatever to the node. But this gets messy if your scheduler isn't aware of things.
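For reference, a host path mount with the Docker driver looks roughly like this. The task name, image tag, and paths are assumptions made up for the example. The important part is that the volumes list is opaque to the scheduler: Nomad passes it through to Docker and doesn't use it for placement.

```hcl
# Hypothetical task, image, and paths, for illustration only.
task "redis" {
  driver = "docker"

  config {
    image = "redis:5"

    # host_path:container_path, handed to Docker as-is.
    volumes = ["/srv/redis-data:/data"]
  }
}
```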
All of those are pretty useful features, but they have a few problems.
Although Nomad is aware of ephemeral_disk when scheduling future allocations, it's not aware of anything about your Docker volumes. Without configuring a lot of known vendor data and setting up a lot of constraints, we have no way of automatically making sure stuff is placed on the same node.
For example, if you're using ephemeral_disk as shown on this slide, we should theoretically prefer the green node, which is the previous allocation's node, or the yellow nodes, which are close to the data, so copying it will be cheap and fast, because we want to minimize the startup time of your tasks.
And the red nodes are running far away. It's going to be slow to copy the data, and maybe there are some random network hops in the way.
Right now, the happy path is to pick one of the green or yellow nodes. But since we have no real control over that, and no real understanding of those types of topologies with your volumes, Nomad could just as easily pick one of the slow, sad, expensive nodes, leading to longer task startups and more expensive data transfers. Which, if you're doing this a lot, is kind of sad.
Everything I just talked about has no way of doing access control. In the case of Docker volumes and the way they do host paths, if you can submit a job to the cluster, you have access to all of that data. And with ephemeral_disk, if you can submit a job to that namespace, you can just submit a new thing that replaces it, and bye-bye data. I liked you, but now you're gone.
You can just be like, "Please give me the data, run the job." And then you're like, "Hello, you have my data now." And then you're on the front page of CNN.
It's a very sad time. We also can't do anything intelligent about failure recovery if we don't understand your data. When a node running an allocation fails, Nomad will attempt to reschedule your workload, and then we'll just try and copy that data because we'll be like, "Hello, allocation. Yes, I want to start you. Where were you running before?"
And then we'll be like, "Cool, we'll try and contact you," and that node is just lost in the sadness that is commodity cloud hardware, and then we can't go and refresh that data because the node is gone, and your application now needs to recover that data or do something. I hope you had backups.
» Some improvements in Nomad 0.10
What did we do to improve this? Because, although we have some stuff that lets you get what you want done, it all has some limitations. In Nomad 0.10, which went into beta yesterday, we're shipping support for mounting volumes from Nomad clients into your tasks natively. Which solves some of the scheduling and security components, as long as there's a way to mount volumes into tasks regardless of the task driver.
What we want to do is enable mounting these volumes from your host across any task driver with native understanding of scheduling. So we can have read-only volumes, writeable volumes, and some other nice things in the scheduling layer, and also have access controls that can limit the mountability and various access types of that data.
How do we make that work? To expose a volume to your cluster, you include the new host_volume configuration stanza in your client config. Give the volume a name and a path. You can also specify whether users should only be able to use it in a read-only fashion, regardless of their ACLs.
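A minimal sketch of that client configuration, assuming a hypothetical volume named "certs" backed by a made-up directory on the node:

```hcl
# Nomad client agent configuration.
# The volume name and path are hypothetical.
client {
  enabled = true

  host_volume "certs" {
    path      = "/opt/certs" # directory on this node to expose
    read_only = true         # jobs can only mount it read-only, whatever their ACLs say
  }
}
```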
And just like with namespaces, volumes are keyed by their name, and you can specify a glob of volumes that an ACL policy should apply to, as well as whether that policy allows read-write access or read-only access to volumes, when available.
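An ACL policy along those lines might look like the following sketch. The volume names and the "prod-" prefix are invented for the example. The "read" policy permits read-only mounts, "write" permits read-write mounts, and "deny" blocks mounting entirely.

```hcl
# Hypothetical ACL policy; names and globs are illustrative.

# Read-write access to every host volume by default...
host_volume "*" {
  policy = "write"
}

# ...but only read-only mounts of anything matching prod-*.
host_volume "prod-*" {
  policy = "read"
}
```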
In your job, you specify that your job needs a volume, that it's a host volume, whether you only want read-only access, and the source volume that we should be trying to find.
Then in any of your tasks you can choose a different mount point for your volume. In this case we mount it to /etc/ssl/certs in an NGINX application. And then we go and run the task, and everything is happy, and it gets mounted in.
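Put together, a job using a host volume could look something like this sketch. The job, group, and task names, the datacenter, the image tag, and the "certs" volume name are all assumptions for the example. Only the /etc/ssl/certs mount point and the NGINX task come from the talk.

```hcl
# Hypothetical job; only the mount point and NGINX come from the talk.
job "web" {
  datacenters = ["dc1"]

  group "web" {
    # Ask the scheduler for a node that exposes this host volume.
    volume "certs" {
      type      = "host"
      source    = "certs" # name given in the client's host_volume stanza
      read_only = true
    }

    task "nginx" {
      driver = "docker"

      config {
        image = "nginx:1.17"
      }

      # Each task chooses where the group's volume appears.
      volume_mount {
        volume      = "certs"
        destination = "/etc/ssl/certs"
      }
    }
  }
}
```

Because the volume request lives at the group level, the scheduler can filter out nodes that don't expose the volume before placing anything.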
» A demo
I'm going to spin up a dev agent, and we have Consul running in the background. Then I will zoom in here. We have a couple of different jobs. In the config we expose a volume which points to /tmp/hashiconf-demo and exposes it as
Then, if we take a look at the mariadb job, we specify that we want this volume, and we mount it into the task env of mysql. We have a very secure root password that is generated by Vault pulled into the task. It's all magical.
The rest is just a regular Nomad job that exposes a service to Consul. If we run that and then take a look at our node, you see that the node exposes the mariadb host volume. And if we dump this out in verbose mode, we also see where it is and whether it's read-only or not to the cluster. Which is just useful for debugging it and seeing what's around.
Then if we call the app, we see we have this counter. If we post to it, we increment the counter, all very simple. We check the status of the mariadb job, then go and force stop and restart that allocation.
Might also need to restart my very hacky app. Actually, no; Nomad will have restarted it because it uses Consul for service discovery. If we take a look in nomad status 5482, we see Consul is restarting it because the template changed.
Cool. Demo kind of worked.
» What about network storage?
Now, I hear people saying, "But what about my precious NFS?" Because everyone will have NFS somewhere. Well, in some upcoming release of Nomad, that is hopefully 0.11, we'll be shipping support for storage plugins. A lot of what I'm about to say is very early in development, and a lot of it is hand-wavy and may change.
This is just giving you an idea of where we're going. You can come and talk to us in the Nomad booth or, right after this talk, find me outside and we can talk about what people are actually doing and help us design a roadmap for it.
What we wanted to do is to enable mounting volumes from arbitrary storage providers across all of our task drivers and also add support for mounting block devices as opposed to just file storage. And we wanted this to have native scheduling integration with Nomad. So volumes should have identity. We should have support for different scheduling modes like single-node mounting as well as multi-node mounting, where possible.
And also to have a way of handling failure in the scheduling layer in a way that is safe. Not all storage providers, for example, will guarantee that when a node goes away, nothing still has a write path to that data. Which is pretty scary and dangerous. If you're running in a cloud provider and your node says it shut down, but it hasn't and in fact can still write to a volume, while another node can also write to that volume, things get really scary.
By the way, this happens in production.
And we also wanted to have some level of access control to these volumes. The way we're probably going to do this is bind volumes to namespaces. So you can benefit from the same level of control as you do with Nomad jobs in namespaces with volumes.
» The Container Storage Interface
The way we are going to do that is by implementing support for the Container Storage Interface. "But," I hear you say, "what is the Container Storage Interface?"
It is a standard interface for exposing storage volumes to orchestrators. It was designed by a committee of people from Kubernetes, Mesosphere, Google, and a bunch of storage vendors. It lets storage vendors write their own plugins that will work across various container orchestrators without having to maintain dedicated integrations for everybody.
Which also means, if someone finds bugs in one place, they get fixed for everyone, which is pretty nice.
CSI plugins are usually containers, or sometimes binaries, that expose the gRPC interface over a UNIX socket, with upcoming support for Windows that will do some other magic. This interface mediates between the containers, the constructs that an orchestrator cares about, like nodes and a bunch of other stuff, and hides the implementation details of managing storage inside the plugin itself.
It already has support from a bunch of providers today, from things that run on-prem, like Ceph and GlusterFS and Portworx, to basically every cloud provider. Which means that when we ship this, wherever you're running, you can probably use it that day.
How is this probably going to work? When you're running a plugin, because we can't guarantee how a plugin will be shipped from a vendor, Nomad will not be required to manage that plugin for you. What we will do is introduce a stanza in the config that lets you give it a name and tells us what to listen for.
And then if you want to run the plugin in Nomad, you can use host volumes, mount in the parent directory with a bidirectional mount. And then you can run that plugin inside Nomad.
The downside of running a lot of these as Docker containers is you also have to enable running privileged containers on your nodes. Which is pretty scary. If you want to, you can also run them outside of Nomad under systemd or wherever else you want. That is up to you.
When you've told the Nomad client that a plugin exists, it will start fingerprinting that path and waiting for the socket to become available. When the socket's available, it'll talk to the plugin and communicate a bunch of information about the plugin.
Also it will get external metadata about the node from the position of the storage vendor, like its node ID and its topology within the storage system, and then make the node available for scheduling of compatible volumes in the cluster.
When you want to register a volume, because Nomad is not a universal control plane, we won't initially support provisioning dynamic volumes on the fly. What you'll need to do is register an existing volume from your provider.
This allows us to reference the external volume and allows you to provide some metadata as to how it should be accessed. We'll then validate all of that and make it available for use in the cluster. You then bind the volume to a namespace with a claim that specifies exactly how that volume will be used and gives it a name that is common across all providers as opposed to one that is unique to a provider.
Then, in your job, just as we did with host volumes, you'll specify that a volume with this given name should be used in your job, and it'll just work as a host volume would.
» Help for scheduling problems
How does this help with the scheduling problems that we see with our existing persistence options? With CSI, plugins provide all of the information we need about the scheduling topology, so we can automatically generate the constraints to schedule the volumes where they can be used.
This makes scheduling onto ineligible nodes impossible. And it also means that, eventually, as the CSI specification evolves, we can theoretically also generate implicit affinities to have better data locality, if you're running on-prem or whatever.
When it comes to security, we can also improve on the status quo of Docker volumes here too, because we know about the volumes now. Which means they're bound to a namespace. Now when acdbrn comes to take over all your data, it's like, "Give me your data," and we're, "Sorry. In the namespace you can submit jobs to, this volume doesn't exist. Sorry. Bye-bye."
Then you try and submit to your sensitive namespaces, and it's like, "You tried. Bye." Which is nicer than losing all of your data.
» Fault tolerance
There are also many cases to talk about with fault tolerance when talking about network storage. Is the provider still reachable? Is the volume healthy? Is it even still mounted?
Lots of these things are hard to manage when you're running things yourself, because usually when you're running stuff like storage, you only introduce monitoring for something when it's broken. That's something that everyone ends up relearning. It's really sad.
In some cases, volumes will just go away, and if you didn't expect that a volume would go away, you probably didn't prepare for that.
The most common case that we need to do something to bring your app back together would be if your database crashes. In these cases we can follow Nomad's existing rules for restarting and rescheduling the allocations, where the volumes can automatically be mounted into the new destination if we had to reschedule, and then you benefit from our existing monitoring of being able to see what is broken and when and why.
A bigger, more scary failure mode for data is when the node dies—whether it failed, whether it network-partitioned for too long, whatever happened. This is a complicated failure mode because it somewhat depends on your storage control plane.
Although Nomad already has a lot of controls around when and how to reschedule and restart all of your tasks, there may be some work involved here in introducing potentially some kind of manual recovery in some data providers, in the case of net splits, if we can't guarantee that a volume has been detached.
We can also cover the cases where your access to the storage provider becomes unhealthy. Because we're fingerprinting and monitoring the health of the view from that node, if the storage goes away, we then go and reschedule your storage somewhere else.
» A quick review
In 0.10 we're shipping support for host volumes. Which allows you to safely mount volumes from your Nomad clients into the tasks in a driver-agnostic way, with a useful ACL system, in a way that the scheduler knows about. That is in beta today.
Soon we'll be shipping storage plugins that will provide native integration with external storage providers, with a lot of work around making this safe, fault tolerance, handling safe upgrades, potentially some kind of stable identity, and out-of-the-box defaults for things like NFS, EBS, and Portworx. And that's coming soon.
Thank you. Please come and talk to us.