See how a volunteer team working on Void Linux uses HashiCorp Nomad to deploy and collaborate.
Good afternoon, and thank you for joining me here today as we're going to talk about Void's gradual adoption of Nomad. We'll get into what a polyglot fleet is in a minute. But I do want to emphasize that this is a gradual adoption, and that's the interesting part of what Void has done here.
I play with infrastructure Legos. I put them together in interesting and unique ways — usually not building whatever was depicted on the side of the box. In my free time, I work on Void Linux, the subject of today's talk. I also work on an authentication suite called NetAuth and a turnkey Nomad Linux distribution called the ResinStack. When I'm not busy building interesting open source things, I work for a company called Backblaze. We're a large-scale storage vendor, and we're the people with the fun red servers.
Let's talk a little bit about Void Linux, what makes it unique, why we wound up using Nomad, and what makes it an unusual setup relative to your classic Debian or your CentOS. Void is a general-purpose Linux distribution. By that, we mean it's independent. It's not part of the Debian ecosystem, it's not part of the Red Hat ecosystem. We have our own package manager, XBPS, and that really means you can use Void independent of any other software that's out there. We're based on runit rather than systemd. If you used Debian in the early 2000s, you may have heard of runit. It's a supervising init system, so when your software crashes, it gets restarted.
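A runit service is just a directory containing an executable `run` script. A minimal sketch, with a hypothetical daemon name (on Void, you enable a service by symlinking its directory into `/var/service`):

```sh
#!/bin/sh
# /etc/sv/mydaemon/run -- hypothetical runit service script.
# runsv executes this script and supervises the child process;
# if the process exits, runit restarts it automatically.
exec mydaemon --foreground 2>&1
```

The key detail is the `exec` and staying in the foreground: runit supervises the process directly rather than tracking a PID file.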
Void is suitable for both server and desktop use. We don't really specify either/or. I know of at least one smart coffee machine running Void out there. But we intend for you to be able to use it any way you want. We'll come back to this one. But Void is extremely open with our operations and our processes.
There is no corporate entity behind the project at all. We have 21 contributors at present, including one very cantankerous robot and one emeritus. These contributors span four continents and ten countries, and we're very proud of the breadth of our maintainer corps. We have people whose experience ranges from college students just learning about the world of infrastructure technology and the challenges involved all the way up to full-time engineers such as myself who eat, sleep, and breathe datacenters.
This is my NASCAR slide here with all the stickers. We run on Hetzner, a German cloud provider with a largely European footprint. We run on AWS, where we use AMIs and run on conventional EC2. We're set up to run on GCP in the Google Compute system, though we don't currently launch there.
A lot of our work happens on DigitalOcean. They're actually one of the core sponsors of our open source project. So we run a lot of small VMs on the DigitalOcean platform. We are then resident in three different co-location facilities with bare metal hardware that all have different IPMI controllers required for accessing them for install and maintenance.
Then we have two different VM systems that we get donated time on — KVM and Xen based. Finally, we have one CDN. We're very proud that Fastly is working with us to deliver our package collection using a compute-free format.
That's a lot of clouds, and that kind of thing doesn't happen overnight. Well, like many fleets, Void grew organically. As time went on, more people found Void; they wanted to use it, and that meant we needed to deliver more services, either in terms of more mirrors or more features. As we added these services and features, we needed more hardware to deliver the experience.
As I mentioned, Void is an open source project. There's no corporate backing. This means we are extremely conscious of our cost optimization efforts. While Machine shuffles can save many dollars, they cost many hours, and your time is not free. So, we spend a lot of time within Void looking at how we can get the most out of what we already have rather than standing up a new service or system.
This talk is about Nomad. But I think we need to go back in time, because we did have another attempt at managing our fleet, and that was Ansible. If you're not familiar, Ansible is a powerful Python-based orchestration utility. Originally independent, it was acquired by Red Hat and is now, with Red Hat part of IBM, sold as the Ansible Automation Platform.
Ansible has full support for both the runit init system and XBPS. Void maintains both of these modules as part of the community collection and ensures they work for all Void systems. So, we know we have good support there. But let's talk about where Ansible fell short. As I said, it has a relatively steep learning curve, especially if you're not an ops person. You're having to think about a series of actions taken against a machine that result in a service running.
It's also difficult to test that locally. So you have to run it against prod, at least in check mode, and look at the diff. You could run it locally, but that's likely going to involve full virtualization — and quite expensive virtualization in terms of memory and CPU — to get one-to-one with your production fleet.
It also requires thinking like an ops person. If you think about how you deploy a service: An ops person may think of this in terms of, well, I need to deploy my web server, and then install a package, then copy the service binary, and add a service script. This is very different from how a developer of a service thinks about it: I hit run in the IDE, and then the service runs.
Finally, Ansible requires active engagement rather than passive intent. By that, I mean Ansible isn't a daemon. It doesn't keep doing things when you're not looking. So, if something crashes or drifts config-wise, you won't know until you run Ansible again. That works great in a place where you're running Ansible regularly. But in a project such as Void — where we only change the low-level production fleet infrequently — this leads to config drift over time.
You're all in this talk, so you know where we ended up: the optimal solution is obviously Nomad. But let's define why.
With Ansible, you have to have root-equivalent authority to apply the configuration to the machine. I want to add a caveat here: the orchestrator and the solution can run as root. But we need to be able to carve up that root authority and hand it out in pieces to teams, so that no team can take over a whole machine.
If we're going to spend the time and effort to deploy a new solution, then it needs to have a really good ROI. We've got to get something back.
Maybe not the whole orchestration solution, but you need to be able to take the services that run on the orchestrator and simulate them locally.
This is something that Void thinks of in two different ways. The first way is how many ops people think of it, which is how many services do I need to have running in a passive steady state for this application to work.
That's an important consideration. But as a Linux distribution, we have a second consideration: what are the dependencies I need to compile or build before I'm able to build this solution? Nomad is part of our critical infrastructure, so we need to make sure we can recover from a loss of our packaged copies of it without needing to build transitional infrastructure just to rebuild the world again. And importantly, this slide's got enough text on it; we can't add any more bullet points, so that's everything we're asking of the solution.
A good solution feels like the solution when you see it. You read the document, look at the page, and you're like, yes, this will work. And, that's the experience that we at Void Linux had when we started looking at running containers in our fleet and looked at Nomad.
For those who may not be familiar, Nomad is a very flexible orchestrator. It does more than just containers. It works with Java. You can run raw processes on the operating system. You can use chroots if you're on an operating system that supports them. You can run jails to segment your processes.
Importantly, Nomad is easy to operate in a low-toil environment. This comic is from XKCD, and I often find myself referring to it as I look at improving infrastructure or making changes to existing systems, to work out how much time I could realistically spend on that without going net negative.
Well, obviously, we use it as part of a complete and balanced HashiStack. We use it as an abstraction over our entire fleet. It has become our single pane of glass through which we engage with machines, and we use it as a mostly unified production resource center. We'll get back to that "mostly unified" later. But the idea is that the whole fleet lives in Nomad, and if you want to deploy a service, you do so via Nomad. You don't have to think about the individual machine it's running on.
That's an interesting qualifier, and let's dig into that. Not everything runs on Nomad, and I think that's really a good thing. It means we didn't try to cram everything into a tool. We looked at what fit and moved those systems.
Some of our machines, in fact, don't even have a Nomad agent on them. A good example of this would be one of our servers that runs our core authentication and single sign-on services. Since they're in the dependency chain for Nomad, we don't orchestrate them via Nomad. And, because we don't have anything else on the machine running via the orchestrated layer, we just don't install the software.
We also have some incredibly legacy systems. They require special care and feeding. They need the love they've earned for being part of our fleet for so long, and that has so far precluded their migration to an orchestrated container.
Which Nomad features do we use? If you look at the feature matrix for Nomad, there's a lot there. We use containers. Most of our workloads are containerized today. As a Linux distribution, we build our own base containers and install our software into that.
We make use of Namespaces. This is how we carve up our fleet into smaller units. We have namespaces for things like monitoring, build infrastructure, and unsorted applications. We make use of Nomad's built-in service discovery. We use this so Nomad can actually be part of the critical chain for Consul. Mainly, we use Nomad to control our load balancers that provide us access to the Consul API and the Consul dashboard.
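As a sketch of how carving the fleet into namespaces looks (the namespace and job names here are illustrative, not our actual catalog), a namespace is created once and then jobs are pinned to it:

```hcl
# Create the namespace once, from the CLI:
#   nomad namespace apply -description "Monitoring stack" monitoring

# Then each job declares where it lives:
job "prometheus" {
  namespace   = "monitoring"
  datacenters = ["dc1"]

  group "server" {
    task "prometheus" {
      driver = "docker"
      config {
        image = "prom/prometheus:latest"
      }
    }
  }
}
```

ACL policies can then be scoped to individual namespaces, which is what makes the carve-up enforceable rather than just organizational.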
Which brings me to the Consul integration. Nomad is great on its own, and it has a lot of features, but if you're willing to take the time to deploy Consul and Vault alongside it, you get a lot of extra features. You get better service discovery, service mesh. You get dynamic credentials from Vault, you get things that Nomad can't do on its own.
Of course, as I mentioned, we don't want to have users who are using the orchestrator operating as an unrestricted root user on a remote system. So, we use access control lists to bound what a user can do.
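For example, a Nomad ACL policy file (the team and namespace names are hypothetical) might grant a team write access only to its own namespace:

```hcl
# team-apps.policy.hcl -- attach with:
#   nomad acl policy apply -description "App team" team-apps team-apps.policy.hcl

# Full control over the team's own namespace...
namespace "apps" {
  policy = "write"
}

# ...read-only visibility elsewhere, and no node-level powers.
namespace "default" {
  policy = "read"
}

node {
  policy = "read"
}
```

A token attached to this policy can deploy and stop jobs in "apps" but cannot touch other teams' workloads or reconfigure client nodes.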
Finally, because we're running on bare metal, we use a lot of host volumes. Host volumes are, I think, one of the most underrated features in Nomad. They allow you to take a path on the host server, carve it off and name it – and then present that to a task as though it were a Docker volume, bind mount or some other storage option that you just want to be able to attach later.
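Host volumes are declared in the client agent's config and then requested by the job. A sketch, with hypothetical paths and names:

```hcl
# On the client (agent config): carve off a host path and name it.
client {
  host_volume "mirror-data" {
    path      = "/srv/mirror"
    read_only = false
  }
}

# In the job: claim the named volume and mount it into the task.
group "mirror" {
  volume "mirror-data" {
    type   = "host"
    source = "mirror-data"
  }

  task "serve" {
    driver = "docker"
    config {
      image = "example/mirror:latest"
    }

    volume_mount {
      volume      = "mirror-data"
      destination = "/data"
    }
  }
}
```

The job never learns the real host path; it just asks for a named volume, which is what lets the same job spec run on differently laid-out bare metal.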
When you onboard a cluster orchestrator, obviously, there's a lot of planning. Void took, give or take, two months of planning and looking at our fleet to figure out what would need to change, what would we need to deploy, and how would we do that.
We also had to make a lot of considerations for what dependency chains we were allowing to form. Prior to using the orchestrated system, everything was capable of launching as the machine booted from cold — and it would launch using runit. Obviously, when you bring in an orchestrator, you're now making the workload dependent on that orchestrator, whose control plane is sometimes thousands of miles away. We chose to do this because we felt the benefits outweighed the costs, but it is something important to think about.
Finally, while Nomad has excellent documentation from HashiCorp directly, we did produce simplified user guides for common tasks within Void's environment that call out these specific things that we do differently.
Once we had the plan, once we had an empty Nomad cluster, it was time to onboard some services — actually fill up this new system we'd made.
First, we moved monitoring. You might wonder, why on earth would we move the thing that's going to tell us if everything is on fire? And through the magic of buying two of them, we had a monitoring system already in place. We knew we wanted to migrate from our legacy monitoring platform to a more modern open metrics-based system. And, this gave us the ideal opportunity to try out a real application that had a duplicate available in case anything went wrong.
Once we got the monitoring system up and we were happy with how it worked — you've got that mental high of the success: you've deployed a service, it's working, let's move more stuff — next up were the tchotchkes. These are the pet projects of various developers, things that serve maybe a single-endpoint API. They're not part of any production workflow. They're not necessarily user-facing. But there are a lot of them. So, if we can forklift them up into the orchestrator, it saves a lot of effort.
Once we had the tchotchkes there, it was time to move a service that mattered, something with consequences if it went down. But you don't have to do that all at once. In our case, we started with batch compute. An example of a batch compute service that we moved: when you want to find a file on a Linux operating system, you need to know which package contains that file. For Void, we have a nightly task that looks at all files across all packages and computes an index that's inverted relative to the normal package index.
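The inversion itself is conceptually simple. A toy sketch in shell (the input data is made up; the real job walks every package in the repo):

```shell
# Build a tiny "package file" listing, then invert it into a
# "file package" index.
cat > /tmp/pkgfiles.txt <<'EOF'
coreutils /usr/bin/ls
coreutils /usr/bin/cp
vim /usr/bin/vim
EOF

# Swap the columns and sort by path so ownership lookups are cheap.
awk '{ print $2, $1 }' /tmp/pkgfiles.txt | sort > /tmp/file-index.txt

# "What package owns /usr/bin/vim?"
grep '^/usr/bin/vim ' /tmp/file-index.txt
# -> /usr/bin/vim vim
```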
So, when you ask it what package owns this file, it can tell you. Because this is a batch compute job, it was easy for us to make the move. If it didn't run, so what? We'd get the alert from monitoring that it failed to dispatch, we'd kick it off manually, or we could still use the legacy version we left in place. It's a batch compute process that generates data with a very slow churn rate, so it's not the end of the world if it doesn't run for a day.
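A nightly job like this maps naturally onto Nomad's periodic batch jobs. A hedged sketch (the job name, image, and schedule are made up for illustration):

```hcl
job "file-index" {
  datacenters = ["dc1"]
  type        = "batch"

  # Dispatch nightly; don't start a new run if the last is still going.
  periodic {
    cron             = "0 2 * * *"
    prohibit_overlap = true
  }

  group "index" {
    task "build-index" {
      driver = "docker"
      config {
        image   = "example/file-indexer:latest"
        command = "build-index"
      }
    }
  }
}
```

If a dispatch fails, nothing is lost: the previous index stays in place and the next night's run catches up.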
Of course, once we were confident with Batch Compute, then it was time to move services that mattered, things that would become an incident if they failed. We started with feature systems. A feature system is something that is not necessarily your primary production service, but supports your primary production service.
We moved over one of our package search features: our website — which is fully static — reaches out to a REST API to search the package index for versions of software and which architectures that software is available for. This is a single-endpoint API, but it is on our website. So, if it broke, people would notice.
We moved the API, and this allowed us to check end-to-end our load balancer systems, all of our failover, and our monitoring — and we were happy that it worked. Of course, when you're happy something works, then it's time to go to prod.
We moved our production backends, package mirrors, and global load balancers that handle all user-facing downloads. I'd like to call your attention to this picture because I liken onboarding services to driving and merging onto the highway. Yes, there is a flow of traffic, but it's relatively straightforward.
If you've never driven in the US, you may have never encountered a double diverging diamond where you drive briefly on the wrong side of the road. But onboarding maintainers is quite a bit more complicated, and we definitely had some skeptics early on. In fact, I'd be lying if I said we didn't have skeptics even now.
But a lot of dealing with skeptics is hearing their concerns, listening to why they think the solution will not work, and trying to provide a good answer. And, when they've caught you with a problem that you haven't solved, own up to it. We also provided a lot of shell snippets. These were things that simplified long commands or took the sharp edges off of multi-step workflows.
Finally, we leveraged a lot of example jobs and pull request workflows, where you could send a completely invalid job spec through a pull request, and we as the infrastructure group would work with you, the developer, to make a change to the system. This provided a shallower on-ramp and an easier way to adopt Nomad for people who had never worked with an orchestrator before.
We learned a lot of lessons in doing this, and I'd like to share some of those with you now:
One big thing we shied away from in the early days was HCL dynamic blocks. This is a mechanism by which you program your HCL: you can iterate over something and template HCL out, but crucially, you're not pulling in another language to do the templating. Had we embraced this earlier, we would've had much cleaner service files. We have a number of cases where we had multiple groups that differ only in name and a single config flag. HCL dynamic blocks are very much worth learning. They make your service files cleaner, and they make things easier to work with.
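For instance (a made-up job for illustration), two groups that differ only in name and one flag collapse into a single dynamic block:

```hcl
job "web" {
  datacenters = ["dc1"]

  # Stamps out groups "web-us" and "web-eu", differing only in one flag.
  dynamic "group" {
    for_each = ["us", "eu"]
    labels   = ["web-${group.value}"]

    content {
      task "server" {
        driver = "docker"
        config {
          image = "example/web:latest"
          args  = ["--region", group.value]
        }
      }
    }
  }
}
```

Adding a third region becomes a one-word change to the `for_each` list instead of another copy-pasted group.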
We played with the Wild West for a while. We deployed a whole bunch of services into one namespace and then moved them later. Defining a structure early on would've saved us a lot of time and effort. You can define your structure maybe on your teams — maybe on your security boundaries — but having a structure is important. And having a structure not only in how you deploy the services but where they live in your source tree is also important. We had a handful of messy Git commits where we had to move files around once we finally figured out what looked good and how we wanted to manage things.
Void runs on multiple different clouds — some bare metal, some co-location facilities — and this means we have to paper over any feature a single cloud provider gives us, since we might not have it uniformly across all of our vendors. Standardizing on a network topology would have saved a lot of effort in writing the initial service catalog: we had some services on bridge networks, some using Consul Connect, some using the host network. If we'd given a recommendation beforehand, it would've made things easier later.
We also learned a lot of lessons working with people. One big one that never occurred to me when I started the project of rolling out Nomad was having a documented path to general availability. I figured we would have this project wrapped up in maybe 2-3 months, then Nomad would be available, and people would gradually adopt it from there.
We had two power users who were very excited to use the new tool because they could see it was going to be easier to use than Ansible. So, they held the work they were going to do on the promise that the new system would be available soon. One week became two, one month became two. And unfortunately, in that time, the users who were so excited to champion this new system became frustrated with it and the rollout. Having clear communication with your user base about when the shiny new thing will be available: I cannot stress enough how important that is.
If you've never used it, Nomad has the ability to spin up an embedded server and client on your laptop so you can play with it. Super cool, very useful. But you need to document how it's different from your production cluster. In dev mode, you're likely not running with ACLs; you're effectively the root user. You can do things that you can't do in prod. Documenting that makes it a lot easier for someone to avoid a moment of sadness when they push their job up to the production cluster and it crashes.
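Trying it takes one command. A sketch (the job filename is hypothetical):

```shell
# Start a throwaway single-node server+client; all state is in-memory
# and disappears when the process exits.
nomad agent -dev

# In another terminal. Note that in dev mode there are no ACLs, so no
# NOMAD_TOKEN is needed -- unlike a locked-down production cluster.
nomad job run ./example.nomad.hcl
```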
Finally, it's a lot easier to define the right solution than to stop people from doing the wrong one. We had a number of cases where I had to look at files and talk with maintainers and say: you could do that, but why are you doing that? Maybe you should consider this other workflow. If we had just defined the workflow upfront, it would've been a lot faster.
This last one doesn't apply if you're already using containers. But for Void, since we were adopting containers for the first time, we had to have discussions around what is mutable and what is not. Nomad's alloc exec lets you jump straight into a running task and change things in there, but those changes aren't persisted. You need to go edit the source artifact.
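Concretely (the job, task, and allocation names here are hypothetical):

```shell
# Find a running allocation for the job.
nomad job status mysvc

# Open a shell inside the running task. Anything you change in here
# lives only until the allocation is rescheduled or restarted --
# persistent fixes belong in the container image or the job spec.
nomad alloc exec -task server 4f3e2d1c /bin/sh
```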
I'd like to call out two systems specifically and why we didn't move them. The first is Popcorn. This is our opt-in stats snooper. It reports back the packages a user has installed and the versions present, so we can see whether we need to prioritize work on a particular package group. If you're curious why it's called Popcorn, it's because for 15 years I misread the very last screen of the Debian installer, where it asks whether you'd like to join popcon, the popularity contest. So Void's is Popcorn.
The second is Buildbot. Buildbot is a Python build service. It's very similar to early build automation controllers, where you have a grid of architectures you want to make a build for or a grid of targets and a set of workers that are listening to it. We're running a very legacy version that we've heavily customized, so we know we're going to need to scrap it and start over. Because of that, we haven't made the move yet. It's actually a Hacktoberfest project for us this year.
No tech talk would be complete unless I regaled you with a tale of an outage, so let's do that now. Mind your dependency chains. When Vault goes down, you get a fun condition where all your tasks keep working until their Vault tokens expire. Then Nomad goes to renew them and says: well, no Vault, guess I'll wait. This can lead to an incredibly stressful outage where your fleet just falls apart as time goes on.
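The dependency sneaks in through an innocuous-looking stanza. Any task carrying one of these (the policy name is hypothetical) quietly depends on Vault staying reachable for token renewal:

```hcl
task "app" {
  driver = "docker"
  config {
    image = "example/app:latest"
  }

  # Nomad fetches a Vault token for the task and keeps renewing it for
  # the task's lifetime. If Vault is unreachable when renewal comes due,
  # the token eventually expires and the task gets restarted or stuck.
  vault {
    policies = ["app-read"]
  }
}
```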
Likewise, Consul outages can lead to really weird failure modes if services become deregistered. Then you are ranging over a for loop with zero elements in it. You'll get errors out of your target software like: well, you needed to add something, and I don't know what you want me to do; this token takes an argument, and you didn't give me one. We've definitely run into that.
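This is the classic template-stanza failure shape. A sketch (service and file names are hypothetical): if every "web" instance is deregistered from Consul, the range renders zero elements and the load balancer is handed an empty upstream list.

```hcl
task "lb" {
  template {
    destination = "local/upstreams.conf"
    data        = <<EOT
upstream web {
{{- range service "web" }}
  server {{ .Address }}:{{ .Port }};
{{- end }}
}
EOT
  }
}
```

With zero healthy instances, the rendered file is an empty `upstream` block, which a proxy like nginx rejects at reload with exactly that "you needed to give me something" style of error.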
With Nomad specifically, I'd say know how to do Raft recovery. We've had this happen once or twice where a — I'm not going to say careless, but perhaps cavalier upgrade — led to us losing Raft quorum. One of the nice things that Nomad does is when you get into that state, the web UI is very broken. You can't click around in anything, but it gives you a big link: Click here to learn how to fix this. It takes you straight to the page that tells you how to fix it.
The overall experience with this has been pretty straightforward. We've been running Nomad now for about three years, and it's proven to be a great way to engage with the fleet and for people to manage things.
It's easy to mock up new services locally because we primarily use containers. We can do that just with Docker Compose or Podman Compose. When we want to run raw binaries, that's easy to do, too. You just run them in the shell. Perhaps the highest praise that you can give an infrastructure system is that it's infrastructure you don't have to think about. This is ahesford, one of our maintainers, who's currently supervising a large Python rebuild across the fleet.
This isn't a process I would repeat one-to-one. Today, I would start with only Nomad. When I started this project with Void, it was necessary to deploy Consul and Vault to attain the feature set I wanted. But today, Nomad is much more capable, with built-in service discovery and built-in variable interpolation. You can start with Nomad alone and add the more complex pieces later.
I would've defined what a good service looks like. When you're judging a qualitative property ("that looks good"), it does help to have some bullet points spelling out what you're looking for. Maybe encourage people to run the linter, or to run nomad validate so their config files are at least syntactically valid. Those are always things to look at.
Counterintuitively, I would've made use of more downtime. Void did our entire migration from a legacy fleet to a Nomad-orchestrated system with only about 15 minutes of user-facing downtime over the span of 2-3 months. A lot of this is personal pride. I don't like people to see an error page. But looking back, I can see the effort we spent to do this burned some people out, led to fatigue issues, and we made dumb mistakes the longer we worked on it.
Though Void is an open source project where we control our own destiny, at work I go to management and say: door number one is we do it with the system hot, and no user will see downtime, but wow can that go wrong; door number two is we take the system down for 10 minutes and do it, but I can guarantee you a rollback path. I'm always surprised how often my management chain is willing to give me the option of downtime, take door number two, and let me do things in a lower-stress environment.
With that, I'd like to thank you very much for your time. If you'd like to send me questions or comments, please feel free to use the address that's on-screen, or if you're here in San Francisco, I'll be staying in the hall for a few minutes after to answer your questions. Again, thank you.