HashiCorp scaled Consul to update ~170k service instances running on 10,000 nodes in under half a second. This talk explains the testbench and what we learned.
Speaker: Anubhav Mishra
Hello and welcome everyone to this presentation. Today, we will be talking about scale testing the network control plane. In particular, HashiCorp Consul and the Consul Global Scale Benchmark that we did in April.
My name is Mishra. I'm the technical advisor to the CTO at HashiCorp. I work in the office of the CTO. The picture that you see on the slide is me rehearsing for opening HashiConf EU in Amsterdam in 2019 — which is the last in-person event. I still remember meeting everyone in the crowd, also a lot of people in the community. That was a privilege for me to open the conference for Mitchell and Armon there. Hopefully, we can get back to that. For now, we'll try to make the most of this.
I work in the office of the CTO on strategic initiatives — internally with the product and engineering teams and also externally with our partners like AWS and Microsoft. I'm a computer engineer by training. I went to school for computer engineering. I've done software engineering and operations engineering. I empathize with both developers and operators and understand this empathy towards delivering software. I also work a lot in open source — and I love open source. Here are a few projects that I work on. You might have seen me on GitHub in the past few years here.
First, we'll talk about some of the history of HashiCorp and Benchmarks. Then, we'll talk about scale testing Consul in particular with the Consul Global Scale Benchmark that we published in April. Then, I'll give you background on service mesh. In particular, the components — especially the control plane — that was the focus for this benchmark. Then, we'll talk about the benchmark set up, the benchmark tests, and the results — and talk about some learnings from it. It's quite straightforward.
I think March 2016, we did our first public benchmark and published the results, which was on Nomad. We did The Million Container Challenge, where Nomad scheduled a million containers in five minutes across 5,000 hosts on GCP.
The Nomad team outdid themselves again. They more recently released a benchmark in December 2020 where they scheduled two million containers across — I think — 6,100 hosts across ten different AWS regions in 22 minutes, which is amazing. We wanted to bring in the same scale and concepts and focus them on HashiCorp Consul and specifically the service mesh.
That's one of the big questions that one might ask. Here's a video of the Airbus A350, and we talk about the stress testing of the wing here. The wings bend to a large extent — more than anything that a traditional flight would ever have experienced.
You'll notice here, the wings bend to about 30-35 degrees or so, which is impressive — the amount of force they put on this wing. Other aircraft might also do this test, and they might go up to maybe 15 degrees or so and then snap. But those aircraft are safe as well to fly because traditionally, a flight might never experience such a stress on the wing.
But my question to you is which aircraft would you like to be on or take flights on? Is it going to be the Airbus or the aircraft that maybe snaps at 15 degrees or so? The answer is probably the Airbus, and this is where these concepts also could be directly applied into products and software that you use every day.
I want to take a step back and talk about service mesh and some of service mesh components service mesh. Let's say you have a service A and you have a service B. Service A wants to talk to service B, and they're both in the service mesh. Service A will do a local host call to its local proxy that's running alongside it. This might be a sidecar container. This might be a system .jar on a VM or Nomad allocation.
Then the proxy is responsible for finding a healthy service B, and it will make a mutual TLS call using certificates — and make a mutual TLS connection and connect up to the service B proxy. The service B proxy will forward that request up to service B. This is how service A will get to service B. It doesn't need to know the IP address of service B. It'll make a local host call, and everything else is the responsibility of the proxy.
But here, the interesting thing is something needs to configure this proxy. This is where the control plane comes in. The control plane is responsible for providing service discovery data — maybe certificates, things like policy rules, traffic shaping rules. It is the component that's responsible for maintaining state in the system, and it's the highly available component of the service mesh. Then you have the data plane, which is the proxy itself. It's responsible for routing packets from one place to the other after being configured — getting the configuration from the control plane.
This is why we focused on the control plane itself because this is the resilient — the highly available — component of the mesh. This was the main focus with the Consul Global Scale Benchmark — the control plane scalability and understanding how Consul responds to really large scale.
Let's talk about the benchmark itself. We did this in, I think, February 2021, and we published the results in April. We did two important propagation time tests. One was the end-to-end point propagation time. The other one was intention propagation time.
We had 10,000 hosts on EC2 on AWS. We scheduled — I think — 178,000 containers. These were applications that were running on the cluster using Nomad. Then, we made all of these containers and applications depend on one common upstream — and we changed that upstream. What I mean by changed is we either deleted it or added it. In this case, I think we deleted the upstream service. That triggered a change in all of the 178,000 proxies. We saw 96.6% — which is about 172,000 proxies — updated their configuration within one second or so. This was all done using five Consul servers in the cluster.
Intentions in Consul are ways to define access permissions of how a service can talk to another service in the mesh. These might also be traffic shaping rules. For example, whether service A is allowed to talk to service B is something that you can express using an intention.
Here, we created an intention that would affect 9,996 nodes — 9,996 agents that were running on these nodes. The four nodes that we missed out of the 10,000 were running workloads that might not be related to this intention change. As soon as we changed the intention — in this case, we created an intention — the propagation time from the control plane down to all Consul agents was about 900 milliseconds, which is quite impressive for the extent that we were running it. We had about five Consul servers, again, powering the control plane for this experiment.
So those are the results. How did we get to those results? The benchmark setup was quite straightforward. There was no secret sauce that we used — or secret service that we used to optimize and get these results. A lot of the products like Consul and Nomad were running almost default settings with some minor optimizations that I can talk about in the next few slides here.
We had five Consul servers. This made up the control plane for the service mesh. The EC2 instance that we were using was C5 — I think c5d.xlarge. We had about 36 vCPUs on these boxes with 72 GB of memory, and they were spread across three availability zones. Then, we had three Nomad servers that made up our schedule of control plane. They were decently sized as well. Then we had 10,000 Nomad clients. Each Nomad client was running a Consul agent inside it — alongside it as a system D job — and these clients were about c5.xlarge.
So, four vCPUs with 8GB memory, again, spread across three AZs. We created these clients using auto-scaling groups. We created two separate pools with 5,000 nodes each, which made up the 10,000 node count on the clients here.
Then we had 100 load generators. These load generators were responsible for testing whether all of the service configurations in the Mesh were correct and service-to-service communication was okay. We routed traffic to a bunch of workloads and tested out the service mesh.
Again, I want to note here that the data plane scalability was not the focus for this experiment. We didn't care about how many requests we could put through the system. You can go through a large number of requests in this large system. Then all of the setup was in US-East-1 in the AWS region.
We wanted to mimic something in real life and bring that into our benchmark to mimic the microservices world that a lot of organizations are in.
We created a three-tier service architecture. We had service A, service B, and Service C – the three tiers. service A was dependent on service B, which was dependent on Service C. You'd call out service A, which would call out to service B, which would — in turn — call out to Service C, and you get a combined response. These applications were returning some text back, so they weren't doing much in terms of this workload. There were about a thousand unique services put here, which makes up to about 3,000 unique services in the cluster.
Each tier had a number of service instances. We had 10,000 service instances in tier one. We had 83,000 service instances in tier two. We had about 85,000 service instances in tier three — which makes up about 178,000 total service instances running in containers in the cluster.
This does not account for the sidecar proxies. If you bring in the Envoy sidecar proxies, this number goes up to 356,000 — which is 178,000 x2 in total service instances and Envoy sidecar proxies Each service instance and the Envoy sidecar proxy was registered with Consul — and there were health checks that were going on to monitor these applications and the proxies.
This is what the service layout looked like — the three tiers. You can see tier one; it goes from A1 to A1000, which is 1,000 each. And then A1 calls out to B1, and B1 calls out to C1, and the same thing for tier two and tier three.
The unique thing here is that all these service instances were dependent on one upstream. It's called Service Hey. I know you'll talk about this name. Like, how did you come up with his name? I know. This is a very unique name. I can take full responsibility for this naming here. If the Service Hey was to change, all of these 178,000 applications will need to update their sidecar proxy configuration.
How did we scale up to start running the workload? We used Terraform — a Terraform apply — to start creating all the infrastructure. We created all the VPC resources, the subnets, all the underlying infrastructure that you need to run this benchmark. Then we created the control plane for the service mesh, which were the Consul servers. Then we created the Nomad servers, which was the scheduler.
Then we scheduled the workload using Terraform. We applied the workload on Nomad using Terraform, but we queued all the workload first. Before we created any of those 10,000 nodes, we queued everything up. In Nomad, you can queue jobs. Then, when nodes are available, Nomad automatically schedules the workload on them. We were using a spread scheduling algorithm — spreading the workload across those 10,000 nodes.
We queued these jobs because we wanted to optimize for every billing second that we got. We were on a budget. We couldn't run these things multiple times and wanted to make sure everything came up okay. We wanted to optimize all the idle time of these instances as well.
In terms of creating the 10,000 nodes, we used the two auto-scaling groups with 5,000 nodes each. Those came up almost within minutes. Amazon, when you ask for 5,000 or 10,000 nodes, they just give it to you. To avoid this thundering herd problem — with all 10,000 nodes approximately that would try to join and register services at the same time — we added a random delay of up to 26 minutes on every node. When a Nomad node would boot up, it would choose a number between zero and 26 and then wait for that time, then finally join the cluster and register themselves. It gives you a very graceful scale up action.
It did delay our scale up a bit, but it gave us that responsiveness — and doing this helped us and instilled confidence in the benchmark. We wanted to optimize for success. Again, we were restricted by the budget. We couldn't rerun these things multiple times, we wanted to make sure we get it right most of the time here.
Here are some of the graphs of what the scaling up looked like. It took about 51 minutes for all of the 10,000 nodes to come online. This includes the nodes joining and all of the workloads to also be scheduled on those nodes. So both of these things — and everything registered, everything good to go.
The CPU usage graph here is interesting. During the scale-up period, you see the CPU usage goes up to about 43% or so. That's due to all of the service registrations, the node registrations, and also certificate sign-in requests from the clients to talk to the control plane. Also, the services were being scheduled that wanted to be part of the mesh, but also get unique certificates to do MPLS in the cluster. Those certificate sign-in requests were done by the servers too. As soon as you see that you hit the 10,000 nodes — and everything is scheduled — the CPU on the control plane on the servers goes down and stays at about 5-10% or so on throughout the experiment.
What were the results when we did the endpoint propagation time test? We removed the Service Hey, which was a common upstream. We deleted that from Nomad. That triggered an update. All of the 178,000 proxies needed to update the configuration at the same time. We saw 96.6% of the total workload — which is about 172,000 proxies — updated the config in under one second. On the right, you see the graph of the latency distribution here. On the left, you see the table that showcases the latency distribution and well — and you can see P96 is about 564.2 milliseconds or so.
It’s interesting to note — here in the graphs — is this agent coalesce timeout. In Consul, we had a 200-millisecond timeout that's built into Consul. We wait for changes up to 200 milliseconds and then coalesce and batch all of these changes — and send one change to Envoy to reload its configuration. This avoids Envoy getting triggered updates all the time — especially when there are proxy boot-ups — and gives you that efficiency in terms of the proxy here. This timeout is already built-in. If we were to remove this timeout, you might see better results here.
I want to talk about the long tail that you saw in the previous slide with the latency. About 3.4% of the proxies took minutes to update. In some cases, five to 10 minutes after the endpoint event had happened to update the configuration. This delay appeared to be a transient network disturbance that we saw at this scale.
If you're running 10,000 nodes in the cloud, you will see some form of network disturbances. We saw TCP connection drop-offs where the agents would go into a retry and backoff. Some agents failed repeatedly to connect to the servers due to transient network timeouts. But due to repeated failures, they were in backoff state and took a minute or more to reconnect. Some agents remained connected through the servers but saw a packet loss and TCP retransmits.
Intentions in Consul are ways to define access permissions between services — whether a service is allowed to talk to other services can be defined using an intention. We created an intention that would affect 9,996 nodes — and the four nodes missing here were not running workloads.
As soon as we created an intention, from the control plane to the propagation time down to the Consul agents, was under 900 milliseconds in this case. On the right, you see the graph. Again, the latency that it took. The x-axis here is the time in milliseconds. On the left, you see the latency distribution table, and you can see P99 is about 815.2 milliseconds.
We ran this test on Nomad, and on Kubernetes. We created a 500 node Kubernetes cluster, had about 1,990 service instances running on them. They were all dependent on this one common upstream service. Very similar to the example we saw earlier.
We changed that service. In this case, we created that service so that it registered itself in Consul. That caused a change that would lead to 1,990 service instances updating their sidecar proxies. It took just under 700 milliseconds for all of these 1,990 proxies to update themselves. You can see some of the graphs here. On the right, you see the propagation time graph. On the left, you see the total running pods in the Kubernetes cluster.
We had a lot of data with a benchmark of this scale. We used the Datadog platform for all of this. We collected StatsD data from Consul, Nomad, Envoy proxies, Kubernetes, load generators, logs from Consul, Nomad, Envoy proxies again, and the application itself.
We used a fixed service application. You can find it on GitHub. We also had APM data for the application to generate service-to-service traces to test out whether the service mesh was configured correctly and also generate latency graphs to test out the data plane here. Datadog processed about 1.8 billion log events throughout the experiment — from lower node counts to the 10,000 node count.
I would like to thank our partners. Big thank you to AWS for giving us the credits and the support to run this experiment — to Datadog for giving us the platform to store APM data logs, and StatsD data — and all the credits to run this and perform this experiment. A big thank you to both our partners here.
If you want to check out a detailed version of this report — this talk is a 20-minute talk, and only highlights a few important portions in the reports. If you want to read it in detail, you can go to hashicorp.com/cgsb, which has Consul Global Scale Benchmark.
All of the configuration for the benchmark is also open sourced on GitHub. This includes all of the Terraform configuration and the service specifications for Nomad and Kubernetes. This is all available under the HashiCorp GitHub organization. You can go to github.com/hashicorp/Consul-global-scale-benchmark.
If you don't want to read the detailed report, you can definitely check out the HashiCast episode on Consul Global Scale Benchmarks. It features Paul Banks, who is the Principal Engineer in Consul who worked with me on the experiment. This goes into some of the insights and behind-the-scenes of the experiment and some of the details. Definitely give that a listen.
Thank you so much. I appreciate you all coming to this presentation. A special thank you to the Consul product and engineering teams for building a product that can scale at this extent. Also, a special thank you to the Nomad product and engineering teams for giving us a scheduler to run this test at this scale. Thank you so much. I hope you all have a great HashiConf, and stay safe.