Sustainable HashiCorp Nomad for SREs
Join us for practical recommendations for optimal use of Nomad to achieve greater practitioner sustainability and better quality of life for operators. This talk not only covers technical and operational sustainability, but also ways that Nomad can make infrastructure more sustainable environmentally with regards to climate change.
Speaker: Gale Fagan
For references, citations, and commentary that would not fit into this session, visit nomad.green.
» Transcript
I'm Gale Fagan, I am the engineering manager for the Nomad team. I'd like to talk today about sustainability and about Nomad.
I'm going to use Bill Johnson's excellent definitions of sustainability, as it relates to SRE.
There are a couple of different kinds of sustainability. First is technical sustainability. Technical sustainability covers things like language stack, CPU architecture, decisions that you might make about hardware and software and how your application is being written. If it's a service, how available is the service? How available it's supposed to be may well influence the design choices you make in building your app.
The next kind of sustainability is operational sustainability, which is largely focused on the cost of running your application, on deploying it, on the humans, and the resources involved with that. When I talk about deployment resources, I'm speaking about both human and compute resources. That includes operational cost and overhead for compute: when and where and how the application is deployed. How your on-call rotation is structured, how much human intervention is required and the nature of that work effort.
The last kind of sustainability is a consideration that becomes increasingly important as we talk about building at scale. And that is environmental sustainability, whether that's datacenter scale or planetary scale. As we grow, we see our resource usage footprint multiply, and the environmental impact of that increasingly translates to economic impact.
Sustainability applies to the stuff that we maintain, the way that we keep it running, and the place where we do it. Quality of life has always been intertwined with quality of service. In the past year in particular, its impact on operations and sustainability has become increasingly visible. That wasn't always the case, though.
» The Difficulty of Reducing Toil
Traditionally, operational costs that center around humans have been pretty far removed from the dialogue of sustainability. Ask anyone who's been on call for a production service for a slightly different take on that.
The advent of SRE, site reliability engineering, as a discipline, has brought that back to the fore, along with the notion of reduction of toil. But even once the toil has been automated away and humans intervene only when absolutely necessary, the human costs remain.
A study in 2015 found that the time you spend just being on call (not the amount of exposure to incidents, but the number of hours where something could happen) is the primary stressor. On-call rotations span the time humans are expected to relax.
But when there's the specter of a 15-minute SLA for a P1 hanging over your head, not thinking about work isn't an option. That's work/home interference. That's when you change your routine because you need to be able to respond quickly, to abruptly hand off family responsibilities for a work emergency.
It's also when you change your home life, when you relax and recover from work to accommodate work obligations. Over time, without that recovery period for folks on call, fatigue and dissatisfaction set in, and even your perception of how you perform your job is impacted.
We talked about toil. Let's define toil. The definition from the SRE handbook is that toil is work that tends to be manual, repetitive, automatable, tactical, and that doesn't have enduring value. Most significantly, it scales linearly as the service grows.
A couple of examples of toil are handling quota requests, applying schema changes, and copying and pasting commands from a playbook. The common thread in all of these is that they don't require an engineer's judgment, and they keep us from making progress on automating that toil away and making the service better.
One of the most difficult things about toil is that it's both technical and cultural, and identifying it is not always easy.
Experienced complexity isn't toil. It is frequently mistaken for toil, and it's not great. It's the feeling that you get when something should be easy, but it wasn't. Time spent figuring out how to do something contributes to experienced complexity.
The more consistent the system is, the less complex it is, which means the smaller burden on the humans that have to work with it.
According to the Cloud Native Computing Foundation (CNCF) survey in 2020, the reasons why organizations haven't adopted containers are equal parts complexity and cultural changes within the development team. The culture changes include changing the operational environment, the development environment, the mindset, the skills, growing the concept of service ownership, and the host of technical and cultural changes that need to happen for development to make its shift to a less monolithic world.
This doesn't happen overnight; it takes time. Time is a non-renewable resource that includes human time and compute time.
Humans that are overloaded, fatigued, and dealing with change take longer to solve problems and make more mistakes. That is time that you don't get back. That is opportunity cost. Operationally, it results in delays. For quality of life, trying to beat time means cutting corners, which can go on to impact operational sustainability and really create a vicious cycle.
» What's Nomad Got to Do with It?
Now I'm finally going to talk about Nomad and how it helps solve and address many of these sustainability issues.
We talked before about the collective frustrations of containerization. As a cloud architect in a past life, I've seen the pain of this firsthand, where what was supposed to be a simple lift and shift to the cloud becomes a move and improve and then takes a detour into a much more dramatic refactor.
Nomad allows you to take on this change incrementally, but to really see benefits from it right away. You can containerize at your pace.
You can move workloads at your pace, and you can abstract where things are actually running.
Many of the practitioners that we've spoken to have had initial success in containerizing some of their apps and then wanted to move the rest. Nomad facilitates that, because it allows you to run non-containerized workloads immediately and alongside, without rewrites, without refactoring, without doing a huge infrastructure migration.
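As a rough sketch of what that looks like, a non-containerized binary can be scheduled with Nomad's exec driver alongside containerized jobs. The job name, binary path, and port here are hypothetical:

```hcl
# Hypothetical job running an existing, non-containerized binary
# with the exec driver -- no container image or refactor required.
job "legacy-billing" {
  datacenters = ["dc1"]

  group "app" {
    task "billing" {
      driver = "exec"

      config {
        command = "/opt/billing/bin/server"
        args    = ["-port", "8080"]
      }

      resources {
        cpu    = 500 # MHz
        memory = 256 # MB
      }
    }
  }
}
```

The same cluster can run this next to Docker or Podman workloads, so migration can happen one job at a time.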
Even environments that just have legacy apps, that haven't begun containerizing yet, can take advantage of the deployment strategies that Nomad supports. Modern deployment strategies like rolling deployments, blue-green, canary.
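Those deployment strategies are configured declaratively in the job's `update` stanza. A minimal sketch (the job name and tuning values are illustrative):

```hcl
job "web" {
  datacenters = ["dc1"]

  update {
    max_parallel = 2        # rolling: replace two allocations at a time
    canary       = 1        # deploy one canary allocation before promoting
    auto_revert  = true     # roll back automatically if the new version is unhealthy
    health_check = "checks" # gate promotion on service health checks
  }

  # group and task definitions follow ...
}
```

Setting `canary` equal to the group's `count` gives you a blue/green deployment: a full parallel set comes up, and promotion swaps traffic over.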
Moreover, Nomad continues to keep these migrations sustainable at a compute level, by optimizing how resources are used and where jobs are placed. Compute is a non-renewable resource. It costs money, it costs time, it costs power. Being able to take advantage of spot market pricing and spot market availability to get the best value for your cloud dollar requires supervision, scheduling, and orchestration.
Using data from our friends at the CNCF again, we can see that private cloud popularity has gone up a bit in the past year, even more than public cloud. Hybrid cloud dropped a little, and for the first time this year, the CNCF also collected data on multi-cloud usage.
This is really where Nomad shines. Nomad handles scaling workloads across vendors and across datacenters. To a large degree, the complexity of this multiple-provider situation can be abstracted away by setting the right constraints within Nomad for workloads that need to run in specific environments, or specific networks, or with specific resources.
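For instance, a workload that must land on a particular platform can express that with hard constraints against node attributes Nomad fingerprints automatically. This is a sketch; the group name and region pattern are assumptions:

```hcl
group "ingest" {
  # Hard constraint: only place on Linux nodes.
  constraint {
    attribute = "${attr.kernel.name}"
    value     = "linux"
  }

  # Hard constraint: only place in us-east-1 availability zones,
  # using cloud provider metadata Nomad exposes on AWS clients.
  constraint {
    attribute = "${attr.platform.aws.placement.availability-zone}"
    operator  = "regexp"
    value     = "us-east-1.*"
  }
}
```

Jobs without such constraints remain free to land anywhere the scheduler finds capacity, across providers and datacenters.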
Nomad is able to work with all of these environments, and it's able to schedule tasks in a consistent manner across environments. That alone reduces one key part of operational complexity.
As an example, one of the advantages that Nomad has is the ability to schedule across otherwise underutilized resources at a variety of different levels. In a couple of different ways, this lets you have better efficiency and also cuts down on the expense of your infrastructure. Being able to densely schedule, as bin packing does, minimizes your compute footprint. Memory oversubscription then allows you to exploit even more of your resources than you could before.
Sizing for memory has always been an inexact science, and sizing for an upper ceiling can really leave a significant amount of memory on the table unnecessarily. Memory oversubscription addresses that directly and gives you a way to provide your job with some headroom for spikes, but also guarantees a certain reserve limit.
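In a job spec, that reserve-plus-headroom shape is expressed with `memory` and `memory_max` in the task's resources. A minimal sketch (task name and image are hypothetical, and the feature has to be enabled in the cluster's scheduler configuration first):

```hcl
task "api" {
  driver = "docker"

  config {
    image = "example/api:1.4"
  }

  resources {
    cpu        = 500
    memory     = 256 # MB reserved: the scheduler guarantees this much
    memory_max = 512 # MB ceiling: the task may burst up to this
                     # if the client node has memory to spare
  }
}
```

The scheduler bin-packs against the 256 MB reservation, so more tasks fit per node, while the 512 MB ceiling absorbs spikes without a kill.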
It also adds additional guards to protect against client-level out-of-memory errors by building on the underlying OS's primitives to help avoid host-level memory contention. Support for memory oversubscription was recently extended from the Docker driver to the exec, Java, and Podman drivers.
And we're working with the community to support the additional community task drivers that folks use.
Affinities are placement preferences that can be connected to metadata associated with your nodes, including cloud provider metadata that Nomad sees, and other node-specific attributes that you may have set.
Affinities can be specified at a couple of different levels: job level, group level, or individual task level. And they are individually weighted.
Affinities are soft constraints, so they boost the rank of nodes that match them, unlike the hard constraints that Nomad also offers (which use very similar syntax). If some combination of preferences can't be honored, the task will still run; it will run on the node that has the highest rank and matches the most of your preferences.
This allows humans a great deal of latitude to weight jobs across a variety of criteria and provides a way to ease into optimization opportunistically.
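As a sketch of that weighting, here are two affinities on a group: one toward a hypothetical operator-set node metadata key, and one negative weight steering away from (without forbidding) a node class:

```hcl
group "batch-workers" {
  # Soft preference: prefer nodes the operator has tagged as spot
  # instances. "${meta.instance_type}" is an assumed custom metadata
  # key set in the client configuration, not a built-in attribute.
  affinity {
    attribute = "${meta.instance_type}"
    value     = "spot"
    weight    = 75 # weights range from -100 to 100
  }

  # Negative weight: avoid the "reserved" node class when possible,
  # but still allow placement there if nothing else fits.
  affinity {
    attribute = "${node.class}"
    value     = "reserved"
    weight    = -50
  }
}
```

Because these are preferences rather than requirements, you can add them to running jobs and let placements drift toward cheaper or better-suited nodes over time.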
On the screen now is an example of memory oversubscription in action. The first set of nodes doesn't have it enabled, the second set does, and we were able to fit considerably more work onto the second set of nodes.
The Nomad autoscaler is one of those things that we think of pretty much immediately when we talk about making the best use of resources. I'm not going to go into detail here because my colleague James, who wrote the autoscaler, just gave a great talk on it. But the autoscaler, in brief, is an external agent which interacts with Nomad's scaling policy API. It has a few different plugin types:
APM plugins get data from various performance metrics systems.
Target plugins provide a means for performing scaling actions.
Strategy plugins implement the logic that dictates when and how to scale a particular target based on what its current scaling level is, the scaling strategy you've provided, and the metrics that it sees.
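Those three plugin types come together in a `scaling` stanza on a task group. This is a sketch under assumed names: the group, the Prometheus query, and the target value are all illustrative:

```hcl
group "web" {
  count = 3

  scaling {
    min     = 2
    max     = 10
    enabled = true

    policy {
      # Check backed by a hypothetical Prometheus APM plugin:
      # hold average CPU for this group near 70%.
      check "avg_cpu" {
        source = "prometheus"
        query  = "avg(nomad_client_allocs_cpu_total_percent{task_group=\"web\"})"

        strategy "target-value" {
          target = 70
        }
      }
    }
  }
}
```

The autoscaler reads this policy from Nomad, queries the APM plugin, runs the strategy, and asks the target plugin to adjust `count` between `min` and `max`.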
The autoscaler works with Nomad. It's developed in conjunction with Nomad, by the Nomad team. But it has a separate release cycle and it's a separate download.
We decided to do that for a few reasons. Because it is a separate binary, it decouples the release cycle of the autoscaler from Nomad, which is pretty important as the technology grows. Also if you don't need the autoscaler, you don't need to have it. Sticking with our modular approach, the autoscaler is standalone.
Please check out James' talk. It goes much more in-depth into the autoscaler and what it can do.
» Think Sustainability
At the end of the day, you are a non-renewable resource. The tools that you use, the tools that your team uses, that your company uses, need to ensure that you're not depleted.
The questions to ask about the tools, the tech, and the processes that you have in place are: Are they sustainable? Is what you're doing sustainable for your colleagues? Is it sustainable fiscally? Is it sustainable for your physical and mental well-being? And is it sustainable for the planet?
Nomad helps answer yes to all of these questions.
That has been my obligatory marketing statement. Thank you for your time.
There's a lot more here than can fit in a talk. We're putting together resources on this at nomad.green, and we'll continue to publish blogs and demos and more information on some of the things we're doing to help Nomad help you make things more sustainable.
Thank you very much, and have a lovely conference.