This talk explores the use case that led Cloudflare to start using Nomad, how it is deployed and managed it, and how it is integrated with Consul.
Welcome to HashiConf Digital 2020. My name is Tom Lefebvre. Today, I'm going to be talking about how we use Nomad and Consul at Cloudflare.
I've been an SRE at Cloudflare for a bit over four years. I'm currently part of the Edge Platform team at Cloudflare. My role as an SRE is first is to ensure the reliability of all the production software running at the edge. And also working on automation projects, such as provisioning automation, release management automation — and more recently — task scheduling, which is what I'm going to be covering today during this talk.
We've got a massive network. We have hardware deployed in 200 cities, and that list of cities keeps growing every week and every month. Our strategy when deploying new locations is to be as close as possible to our customers who are using our services — such as CDN, Cloudflare Workers, DDoS protection, and web application firewall, for instance. We're trying to be within 100 milliseconds of every website visit or internet user globally.
During this talk, I'm going to be covering our use case for deploying Nomad to the Cloudflare edge and how Nomad interacts with Consul as part of this deployment.
First, some history. We have 200 plus locations. We have physical servers in each of those locations. Some deployments are smaller than others, some are large, especially in big metros. But common between those locations is that all servers run all software. It was a design choice made early at Cloudflare, and it makes it super easy to scale horizontally.
When and as we get more traffic, we just need to keep adding more servers to existing locations to keep scaling with the increasing amount of traffic that we receive. From an architectural perspective, it makes it super easy.
But recently, we've had internal asks within the company for more on-demand workloads. For instance, as you can see in this graph, we have time during the day where CPU is underused, and we could use that available CPU during those off-peak times.
I've been covering mostly customer-facing services — Cloudflare products — which scale horizontally with the number of machines we keep adding. But in each of those locations, we have management services. For instance, monitoring, configuration management. Those services, because they're not customer-facing don't need to scale at the same speed as customer-facing services — they don't need to run on all the machines.
When deploying new services to the edge, we had two options.
The first one, as I said, is running those services on all machines. The pro is that it's super easy to scale, as I mentioned before. But the con is that it uses more resources than it potentially needs — especially for those management services.
As you keep colocating services together on the same machines — and you keep adding more and more — you're at risk of some of the services using more CPU than they should, or more memory. Their footprint keeps growing and growing, and you're at risk of having a noisy neighbor scenario — where some services are impacting others, which is not what we want.
The alternative is to pick a static list of machines in each datacenter and choose them to run the services which don't need to run everywhere. It's less wasteful of resources because you don't have that big of a footprint in CPU and memory anymore.
But the con of that is that when you have hardware failures — which happen every day when you have a few thousands of machines or more. You have to get each engineering team deploying their software to be aware of that complex architecture — which can be different from co-lo to co-lo, from deployment to deployment — and they have to engineer a reliability strategy working around those constraints.
All of that led us to believe there had to be a better option we could propose to our engineering teams at Cloudflare to deploy their software. That's where we decided to add a new layer, so teams are able to use dynamic task scheduling to schedule a service in each datacenter, without having to know the architecture of the datacenter — like how many machines there are, etc. They want their service to be running at all times in the datacenter — that’s the value we're trying to give to those teams.
It also provides us with a unique place to add resource isolation. I was talking about noisy neighbors earlier and having that layer of task scheduling. We can add constraints to avoid colocating certain services together. We can set resource limits, and we can even add a security layer, which becomes then the default of all services using that new layer of dynamic task scheduling.
It gives us improved reliability to hardware failures because of dynamic task scheduling — if there's a service running on the machine and that machine goes down for any reason — it's just going to get rescheduled to another machine and is going to stay online. Additionally, the service — because you don't need to provide extra reliability by running it in more places than you need — only uses the resources it needs at any point in time.
And finally, our last objective when designing that new solution was that the operational cost of the new system we were deploying to fit those needs had to be as minimal as possible. The goal is to save time for all the product teams deploying their services. But we don't want to be at the expense of the SRE team who has to spend all that time managing that new software.
While looking for that new solution, we looked at different software. We ended up picking HashiCorp Nomad because it satisfied our initial requirements. We had this management service, which we wanted to run in every datacenter reliably, and Nomad completely fitted that need.
And also, because it's lightweight and reasonably simple, its operational cost is not as large as some competing solutions. It also has very few dependencies — the main one being Consul — which we already had deployed in every location for a different use case. So it was really easy to pair those two HashiCorp solutions together.
I'm going to be talking about the design of our Nomad implementation at Cloudflare, how we deployed it, how we manage the deployment, etc.
As I mentioned, we have around 200 locations where we have servers. And importantly, we want those locations to be separate from each other. We want them to be part of different failure domains. We don't want a failure in — let's say — Amsterdam impacting the London location. With that in mind, we deployed a different Consul cluster in each of those locations — so the failure of a Consul cluster or anything else wouldn't impact other locations.
Therefore we also deployed a separate Nomad server cluster in each location. The way we make it reliable — because now the reliability burden is placed on that new layer, which is Nomad Server. So we pick machines in different fire domains. Such as machines connected to different switches, machines in different tracks, and even machines in different physical sites. Just to ensure that one reg going down, won't bring the Nomad server cluster down, which would — in turn — cause an outage for services trying to schedule on top of Nomad.
The Nomad client part — which is responsible for running all the jobs — we simply run it on all machines and let Nomad make scheduling decisions when placing those tasks to each of those machines.
I'm going to cover how we manage the lifecycle of the Nomad server cluster, how we bootstrap and deploy it, and also how we manage upgrades for it.
First, one of the requirements when deploying a new service; we don't want to have to make any manual task or intervention when we deploy new datacenters or expand existing datacenters. All the edge machines are supposed to be stateless, which means we can wipe them, reboot them, and they should all come up cleanly in production without requiring any manual intervention.
With that in mind, to simply bootstrap a new Nomad server cluster in each datacenter — as we add new datacenters to our fleet — we use the Consul cluster, which is bootstrapped by orchestration.
We use SaltStack when machines come up. They PXE boot a very minimal image. Then SaltStack manages the formatting of this deploying software, etc. Including deploying the Consul cluster and then also deploying the Nomad server cluster — which is relying on the Consul cluster being present to automatically cluster nodes. They advertise themselves to the Consul cluster. Once we hit the correct number of instances we expect — as set by the bootstrap expect parameter in the Nomad server configuration — the Nomad server is ready to go without any manual intervention.
Part of the lifecycle of the Nomad server clusters that we have deployed in all those locations is upgrading the Nomad version to stay on top of all the new features, which are getting released every month — or even more frequently sometimes.
We drive the Nomad deployment using our orchestration framework — which is Salt. It runs periodically on all machines. But then — because we have physical machines — it's not as if we could spin up a new VM and we have a new version of Nomad server running on top of it, and then stop the old VMs. We have to update and upgrade the Nomad server in place.
Additionally, we want to do all of these with zero downtime. We don't want to have to disable a location to upgrade a version of the software; everything has to happen live. We achieved that using — once again — Consul.
We have orchestration to avoid restarting multiple Nomad server instances at the same time, which could cause losing quorum. We have it hold a
lock in consul during the upgrade process. So if orchestration runs on another node — running Nomad server in the datacenter and tries to hold that lock — it's going to have to wait for the upgrade to be finished. In a sense, we only upgrade one instance of Nomad server at a time — ensuring continuity of service of the Nomad server.
To get the Nomad server instance to leave the cluster gracefully — when we stop it before the upgrade — we use the
leave on interrupt option. That tells the other members of the cluster that it's leaving the cluster when it gets a sync interrupt — just to save a bit of time and also to add some additional reliability to the Nomad server cluster.
I went over how we deploy the Nomad server cluster and how we upgrade it. But it's time to add some jobs and start using the actual cluster.
The way we see it, there is a shared ownership of the whole platform. Each team owns a different layer of that platform. We — as SRE — own the Nomad cluster, deploying it, making sure that it's reliable. Product teams own the software, which runs on top of the platform. But then there is a layer in between those two layers. This is how all those product teams can interact with the Nomad cluster — how they get good observability of all the software that they run on top of Nomad. This is important to give them a sense of ownership of those products and manage them in the whole lifecycle.
On provisioning new jobs and also managing those jobs in production, we use the Git-based workflow where teams commit their HCL job files to Git repo. Then a Debian package gets built by CI and then deployed to the edge and centers in stages.
We deployed those in stages to make sure that a typo or — not even a typo — but just a bad version of a particular Nomad job is not going to break the entire edge at the same time. When that Debian package is built and rolled out to the edge by orchestration running periodically, orchestration also runs
nomad job run and
nomad job plan before running
nomad job run to detect that there were some changes in the Nomad jobs — and ensure that they're always in the correct state in the Nomad cluster.
We'd like to improve the turnaround time on those job deployments in the future and give a bit more immediate control to product teams on those jobs when deploying them and rolling them back.
We're looking at having CI pushing jobs directly to the edge and to each Nomad cluster inside the edge — perhaps using Terraform. The goal would be to also keep that level of safety when rolling out Nomad jobs to stages — to small Canary deployments; to a small percentage of the fleet before going global. We'd have to work on implementing that same logic in that CI deployment workflow.
We have our jobs deployed on top of Nomad. Next, we want to have good observability on those jobs.
When jobs are being deployed to Nomad — and when the jobs write to stdout or stderr — we have a custom task driver, which also has other tasks and duties; specifically adding another isolation layer between Nomad tasks running at the edge. But it's also forwarding logs, which are being written to stdout or stderr to the local Syslog server running on each machine in every datacenter.
Then the local Syslog ships those logs to a central Syslog server in our core datacenters. They become indexed in ElasticSearch, and teams can see — in real-time in Kibana — our logs for their services running on top of Nomad. It gives them feedback on what's happening for those jobs, which is important.
What's also important — and what we get for free when using Nomad — is that, by enabling Nomad metrics, you get a lot of insight. From the job status, the summary, you can also look at C groups and see how much CPU and memory jobs are using. You get those metrics for free. But when teams want service-specific metrics, we also set up something to dynamically scrape the metrics endpoints of each service.
That works — once again — leveraging Consul. When services run on top of Nomad, they are advertised in the Consul Service Directory. When those services in their Nomad job file then are tagged — which is
enable-prometheus-scraping in our case — that instructs Prometheus to start scraping metrics for this Nomad service.
Prometheus uses the
consul_sd. It queries the Consul Service Directory in each colo and looks for those Nomad metrics endpoints matching the
Once those service-specific metrics are in Prometheus, each team can then create Grafana dashboards. They can also set up alerting on those jobs and get that alerted as soon as something goes wrong on a Nomad service without any involvement needed on our end. The goal is to give good reliability for services to product teams and also to give them good control over the lifecycle of their services — raising a new version, looking at logs and seeing metrics, alerts, etc.
We have quite a few workloads running on top of Nomad now. It's still early stage — I think it's been six months since the clusters have been deployed. But we're seeing new workloads being added onto Nomad every quarter. Some of those workloads, they're legacy workloads, I'd say — which we used to run on a fixed set of machines — and that we're now migrating to Nomad to have this extra level of reliability. But we were also happy to see it gave some ideas to product teams or other teams at Cloudflare. They're taking advantage of this new dynamic scheduling capability in every co-lo to deploy their services.
A recent example that was announced on the public blog of Cloudflare is cron triggers for Cloudflare Workers. The service — which schedules on the timer workers to run — is running on top of Nomad's. It was exciting to see.
We expect more things to be added, which gives us a nice challenge as SRE. As you deploy more things — and as it gets more critical — you encounter lots of interesting things to work on to make sure the Nomad server cluster is as reliable as ever and as well-maintained as possible.