Learn about the optimizations Criteo has made to get the most out of Consul running at high scale.
4 operators serving the networking needs of 500 users with HashiCorp Consul, sustaining the load on 12 datacenters running more than 40k Consul agents—that's the environment that the ad platform Criteo works in. Learn about the optimizations they've made to get the most out of Consul running at high scale.
Hello, everyone. I hope you are all safe and feeling OK.
I'm Pierre Souchay, working for Criteo, where I'm leading the discovery team, which is responsible for developing SDKs for various applications, as well as operating Consul. We are also very heavy contributors to open-source Consul, with more than 120 pull requests merged into Consul.
And we are also contributing to open source through 2 projects: HAProxy Consul Connect, which is an implementation of Consul Connect with HAProxy, and Consul TemplateRB, which we authored, a tool that allows you to create fancy UIs based on Consul data. We will see a bit of this project in action later in this presentation.
Today we are going to see how we operate Consul at a large scale. (You can find past presentations on Criteo’s use of Consul here: How Criteo went from Physical Hosts/DCs to Containers with Consul and Connect and Inversion of Control With Consul at Criteo. You can also read Criteo’s formal Consul case study.)
We are using lots of machines, lots of humans, and lots of infrastructure. We are going to see how we deal with all those aspects, because dealing with humans is as important as dealing with servers or applications.
The world is changing very fast today. We are moving from machines to services in VMs, to containers, to functions as a service. We're using several clusters and several cloud providers.
We've added a huge multitude of microservices in the last few years, when we started using service mesh for the east-west traffic in our datacenter.
Infrastructure is changing everything, and it's very hard to discover where things are. Consul helps a lot with that. This is the exact reason why my company, which is a major advertiser, with servers displaying ads on the internet, started using Consul in 2015.
At that time we had around 15,000 servers. We have something like 40,000 now, and Consul helped us a lot to move from service on machine to pure service architectures.
We now use Consul everywhere. It's a real backbone for our company.
Let's start with one of the biggest issues we had while migrating to Consul, which was humans. In our company we've got more than 2,500 people, and around 500 regular humans using our services directly.
Those people are working in R&D, where I work as well. They want to know what is happening to their application. How to call this particular microservice API or other questions such as, "What happened last night?" or, "How do I create a new service?"
When we started using Consul, we had lots of those questions every day. And my team and I were spending a long time answering those questions.
Our first move was to give those humans ways to know what is happening on their systems. Our first action was to create Consul UI, based on Consul TemplateRB. It's an open-source project, and it's a UI you can plug into your organization.
It shows live information about Consul, displayed using a simple static web server. You can scale it for thousands of users if you need to. It lets you very easily plug in links to the systems in your organization. It easily shows metadata semantics, and it's open source.
You can tune it; you can change it to fit your exact needs.
Now we are going to see this in action. In this short video, what we are seeing is this UI in real time, where I'm using it to find some services just by typing a few words of the service name. I can also use tags, which is a Consul feature, to filter systems.
I'm filtering on FTP, but I can also use some global tags such as HTTP, who's talking HTTP, or who is using Swagger. Swagger is a mechanism to describe web services.
Here, on the screen, I'm going to find all the Swagger systems, and I'm going to display them. And then from this UI on the right side, I'm able to display a Swagger descriptor and directly call my APIs.
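The kind of search shown in the demo can be sketched in a few lines. This is only an illustration, not the code of the UI itself: it assumes a catalog shaped like the response of Consul's /v1/catalog/services endpoint (a map of service name to list of tags), and the function name and the sample services are hypothetical.

```python
def find_services(catalog, text="", tags=()):
    """Return the sorted names of services matching a text and a set of tags.

    `catalog` maps each service name to its list of tags, which is the shape
    returned by Consul's /v1/catalog/services endpoint.
    """
    wanted = set(tags)
    needle = text.lower()
    return sorted(
        name
        for name, service_tags in catalog.items()
        # Keep a service when its name contains the text and it has every wanted tag.
        if needle in name.lower() and wanted <= set(service_tags)
    )


# Hypothetical sample data, for illustration only.
sample_catalog = {
    "ftp-server": ["ftp"],
    "ads-api": ["http", "swagger"],
    "ads-web": ["http"],
}
swagger_services = find_services(sample_catalog, tags=["swagger"])
```

Filtering on the client side like this keeps the search fast, since the whole catalog is fetched once and then searched locally.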
All the data you are seeing here is called "metadata." It's not only data that is visible; it's data that is useful, because it's used by other systems as well to take decisions about how to provision infrastructure, for instance.
This UI reflects exactly what is provisioned with Consul and is updated live.
I'm currently showing how we can inspect the machine the service is running on, for instance, with this hardware and so on. But we are creating links, for instance, to our alerting systems, so we can see how the alerts are configured for the system.
We can see how all the load balancers are configured for the systems, or simply see the change log for this action.
It's very useful, and it's very easy to tune, meaning that you can link Consul to all the other systems in your infrastructure.
That's very important for us because it allows users without specific knowledge about Consul to reach all of our systems and to get specific information for their services.
The other very interesting feature here is everything is very fast. You can search for any kind of data very quickly, and reach them in a matter of seconds.
Another very interesting feature: Our users wanted to know what happened in the past. Here I'm showing a timeline that displays what happened to your service last night, for instance. And users are able to investigate.
It's a very important feature because in the early days, we were constantly asked, "What happened to my service last night?" This feature, bundled in the Consul UI, allows people to investigate by themselves what is happening to their services and to be completely autonomous.
Another very interesting feature is we mark owners. It's possible to identify for each of the services of Criteo the real owners or the team that is handling this service.
And it's very useful in case of an incident, but it will also be used by other applications to take decisions.
Another key point we learned with scaling with our users is to protect the users from themselves. You don't want your users to be able to perform modifications on other services.
You want everything to be standardized. All of our services start with declarative information in their Consul service definitions, and that data can be set only at startup. It cannot be changed later, so people cannot add entropy into your system.
Another very important thing is, the data registered in Consul can be changed only by the machine itself. We have lots of machines, and keeping others from modifying the service on a given machine is very important, because it ensures that nobody can change what is published into Consul.
Finally, as I said, owners are a key point for us because we've got more than 4,000 different services. Identifying the stakeholders is very important in case of emergency, but it can also leverage information on usage, such as consumption of resources and so on.
What did we learn? We learned that in order to scale with our users, being open is very helpful, because giving them a UI to investigate avoids the need for operators to answer lots of questions. We went from having dozens of questions per day to maybe 1 question per week. That's great for a small team such as ours, with only 4 people answering the needs of 500 users.
We also made a huge effort to standardize the naming of metadata. We prefix all services with a team name, for instance. It allows us to really build the experience of users, so users can help themselves as well.
Another key thing we learned is that people love the service. They want configuration put right into the system. That's why we use a feature we developed called "service metadata," which allows you to embed metadata into the service.
This metadata will allow us to configure external systems.
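As a rough sketch of what such a declarative service definition could look like, here is a small helper that builds a payload in the shape accepted by Consul's /v1/agent/service/register endpoint (Name, Port, Tags, Meta). The helper name, the team-prefix naming rule as written here, and the metadata keys "owner" and "alert_on_down" are assumptions made for illustration, not Criteo's actual conventions.

```python
import json


def service_definition(team, app, port, owner, tags=(), meta=None):
    """Build a service registration payload with team-prefixed name and metadata."""
    merged_meta = {"owner": owner, **(meta or {})}
    for key, value in merged_meta.items():
        # Consul service metadata is a map of string keys to string values.
        if not isinstance(key, str) or not isinstance(value, str):
            raise TypeError(f"metadata entry {key!r} must map a string to a string")
    return {
        "Name": f"{team}-{app}",  # hypothetical naming rule: <team>-<application>
        "Port": port,
        "Tags": sorted(tags),
        "Meta": merged_meta,
    }


# Hypothetical service, for illustration only.
definition = service_definition(
    "discovery", "my-app", 8080, owner="discovery-team",
    tags=["swagger", "http"], meta={"alert_on_down": "true"},
)
payload = json.dumps(definition, indent=2)
```

Keeping the keys about the business (who owns the service, whether it should be alerted on) rather than about a specific tool is what lets the tooling behind them change later.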
Users want predictability, so give them the ability to explore what is happening to the system. Give them the tools to investigate themselves.
And they love business semantics. Whenever you are inputting metadata, don't input metadata related to tools. Don't configure the alerting system of the moment. Put business semantics.
It's very important, because that way the infrastructure will be able to evolve in the long term.
Another interesting thing is that people want things to be magic. By putting all the configuration of other systems into the service themselves, they don't have to call yet another API to provision a network, for instance, or to add alerts. They can do it all in one place.
That's very important, because it allows us to change our infrastructure systems without users noticing, since we define business semantics.
Let's talk about the applications. An application usually wants to know where its database is. Where is its Kafka? Where is its NoSQL? Where can it send metrics?
An application could also be using other microservices. So where are my microservices? Where can I load balance the data to these services?
An application may also have the ability to say, "I'm feeling too loaded; please send me less traffic."
And finally, an application may want to describe its metrics in order to give information about itself.
In our case, we've got more than 4,000 different kinds of applications. That's around 240,000 instances. And some of these services are very large, something like 2,000 instances on a single datacenter.
The first thing you have to do, very carefully, is to control how people use your SDK. In our case, we've chosen to implement our own SDK and to implement defaults. We'll see a bit more about that later.
The first very important thing to understand is: Use ?stale in all your queries. It's very important, because whenever you are performing a request to Consul, all the requests will end up on a given server. So even if you provision 3, 5, or 7 servers, all the requests will end up on a single server.
?stale allows Consul to answer from any of those servers. It basically means that it allows horizontal scalability. If you want to scale your applications on your infrastructure, you need to use ?stale.
It's very good when you have control over your SDK, but sometimes you don't control your SDK, for instance if you are using external applications.
We added support for a feature in Consul called discovery_max_stale. This feature allows you to fall back by default on a ?stale request for all applications, meaning that if applications don't specify that they want something consistent, they will use ?stale requests. So every Consul server will be able to serve them the data instead of just 1 machine in 1 place.
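To make the ?stale advice concrete, here is a minimal sketch of how a client could build a health query that any server is allowed to answer. The /v1/health/service/<name> endpoint and its stale and passing query parameters are part of Consul's HTTP API; the function names, the agent address, and the service name are assumptions for illustration.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def health_service_url(base_url, service, stale=True, passing_only=True):
    """Build the URL listing the instances of a service.

    `stale` lets any Consul server answer instead of only one;
    `passing_only` filters out instances whose checks are not passing.
    """
    flags = []
    if stale:
        flags.append("stale")
    if passing_only:
        flags.append("passing")
    url = f"{base_url.rstrip('/')}/v1/health/service/{quote(service, safe='')}"
    return url + ("?" + "&".join(flags) if flags else "")


def fetch_instances(base_url, service):
    """Query the agent; not called here because it needs a running Consul agent."""
    with urlopen(health_service_url(base_url, service), timeout=5) as response:
        return json.load(response)


# Hypothetical local agent address and service name.
example_url = health_service_url("http://127.0.0.1:8500", "my-app")
```

Putting the stale flag in the shared helper, rather than leaving it to each caller, is one way to make the scalable behavior the default in an SDK.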
Finally, whenever you are using a large infrastructure, be sure to retry in case of errors with an exponential backoff in order to avoid overloading your Consul servers.
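A retry with exponential backoff could look like the sketch below. The delays, the number of attempts, and the use of random jitter are assumptions chosen for illustration; the talk does not say which values Criteo uses.

```python
import random
import time


def call_with_backoff(func, attempts=5, base=0.5, cap=30.0,
                      sleep=time.sleep, rng=random.random):
    """Call `func`, retrying on network errors with exponential backoff.

    The wait before retry number n is a random fraction of min(cap, base * 2**n),
    so that many clients failing together do not all retry at the same moment.
    """
    for attempt in range(attempts):
        try:
            return func()
        except OSError:  # urllib and socket errors are subclasses of OSError
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the error
            delay = min(cap, base * 2 ** attempt)
            sleep(delay * rng())
```

The `sleep` and `rng` parameters exist only so that the behavior can be checked without actually waiting; in normal use the defaults are kept.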
A typical application at Criteo is composed of a My App application, which is serving content from the internet. This application is using metrics, Kafka, microservices over HTTP or gRPC, SQL Server, caches, and so on. Also, services are very different. For instance, we've got hundreds of databases, but they speak the same common language.
What we did was to build APIs for those common services. We've got an API to call databases, an API to call messaging systems, an API for metrics, an API to create a load balancer for HTTP, and so on. Having this allows us to factor this code into one place and to control its behavior, and it is really a tool for the long term.
For instance, for our databases, you may want to stick to a single instance, while for HTTP load balancing you want to round-robin your calls or use Apache or whatever. It's very important, because those kinds of services have very different technical requirements.
Implementing this in one place allows you to control how your people will use the Consul APIs.
Finally, it would be very helpful if an application could say, whenever it's too loaded, "Please stop sending me so much traffic." Consul supports 3 health check states: passing, warning, and critical.
We added support for changing the weight of requests whenever the state of your application is in warning or passing.
That's very important, because it works dynamically: when an application is overloaded, it just tells Consul, "I'm too loaded; I'm in warning state." And then all the applications targeting this specific instance will send it less traffic because it's in warning state.
It allows us to have a natural way of letting the application recover from excessive load. It's very useful at large scale, because you don't want 1 application to go down. Because as soon as 1 instance is down, another instance will go down, and another instance will go down, and so on, whenever you've got too much traffic.
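On the client side, weight-based load shedding can be sketched as follows. The entries mimic, in simplified form, the items returned by Consul's /v1/health/service/<name> endpoint, where each service carries Weights with Passing and Warning values and each check has a Status. The selection logic and the sample numbers are assumptions for illustration, not the actual Criteo SDK.

```python
import random


def effective_weight(entry):
    """Return the load-balancing weight of one instance, given its checks."""
    statuses = {check["Status"] for check in entry["Checks"]}
    # Consul's default weights are 1 for both passing and warning.
    weights = entry["Service"].get("Weights") or {"Passing": 1, "Warning": 1}
    if "critical" in statuses:
        return 0  # never route to a failing instance
    if "warning" in statuses:
        return weights["Warning"]
    return weights["Passing"]


def pick_instance(entries, rng=random):
    """Pick one instance at random, proportionally to its effective weight."""
    weights = [effective_weight(entry) for entry in entries]
    if sum(weights) == 0:
        return None  # nothing healthy to route to
    return rng.choices(entries, weights=weights, k=1)[0]


# Hypothetical instances: one healthy, one overloaded, one down.
healthy = {"Service": {"ID": "a", "Weights": {"Passing": 10, "Warning": 1}},
           "Checks": [{"Status": "passing"}]}
overloaded = {"Service": {"ID": "b", "Weights": {"Passing": 10, "Warning": 1}},
              "Checks": [{"Status": "passing"}, {"Status": "warning"}]}
down = {"Service": {"ID": "c"}, "Checks": [{"Status": "critical"}]}
```

With these sample weights, the overloaded instance keeps receiving a small share of the traffic (1 part against 10), which gives it room to recover without being removed entirely.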
On the infrastructure side we are using Consul everywhere as well. We're using it for load balancing, for metrics, for alerting, for availability.
And we will see how we configure DNS as well.
One of the key points we learned is that technology on the infrastructure side is changing quite a lot. We wanted to decouple producer from consumer. We call this "inversion of control," and basically it works with a provisioner.
On this slide, I represented 3 provisioners: a Swagger catalog, an auto-alerting system, and VaaS, or virtual IP address (VIP) as a service, which is basically something provisioning load balancers. Those systems are watching Consul services all the time and reacting to the changes of those instances by performing provisioning. That's exactly what all our load balancers are doing.
Whenever an application is registered in Consul, Consul gets notified. VIP as a service gets notified, creates the entry in the load balancer, routes traffic to it.
Whenever the instances on this service are changing, Consul is once again notifying the VaaS on everything that is modified dynamically. It means that Consul is the only repository for all of our infrastructure.
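The watch-and-react loop of such a provisioner can be sketched like this. It relies on the general idea of Consul blocking queries, where the client passes the last index it saw and gets an answer when something changes. The `fetch` callable is hypothetical: in a real provisioner it would call /v1/health/service/<name> with the index parameter and read the X-Consul-Index response header. Here it is injected so the logic can run without a Consul agent.

```python
def watch_service(fetch, on_change, rounds):
    """Follow the instances of a service and report additions and removals.

    `fetch(index)` is a hypothetical callable returning (new_index, instances),
    where instances is an iterable of "address:port" strings.
    `on_change(added, removed)` is called only when the set of instances changed.
    """
    index, known = 0, frozenset()
    for _ in range(rounds):
        new_index, instances = fetch(index)
        current = frozenset(instances)
        added, removed = current - known, known - current
        if added or removed:
            on_change(added, removed)  # e.g. update the load balancer pool
        # If the index ever goes backwards, start again from 0.
        index = new_index if new_index >= index else 0
        known = current
    return known
```

A load balancer provisioner would plug its own pool update into `on_change`, so that the service owner never has to call the load balancer API directly.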
The load balancers are using it, but all our 4 libraries are using it whenever they discover databases or logs or perform HTTP load balancing toward microservices. So it's a clear and unified view of all of our infrastructure and all of our applications.
In the demo of the Consul UI, I showed you some metadata. This metadata is used by other systems, such as the automatic alerting system. It basically scrapes everything that is in the metadata of the service and can take decisions, create alerts automatically, and notify the owner of the service whenever the service is down.
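As an illustration of how an alerting system could derive alerts from service metadata, here is a small sketch. The metadata keys "owner" and "alert_on_down", the shape of the generated alert, and the sample services are all hypothetical; the talk does not describe the actual keys used at Criteo.

```python
def alerts_from_metadata(services):
    """Build alert definitions from a map of service name to metadata.

    Only services that opted in through the (hypothetical) "alert_on_down"
    metadata key get an alert, addressed to their declared owner.
    """
    alerts = []
    for name, meta in sorted(services.items()):
        if meta.get("alert_on_down") != "true":
            continue
        alerts.append({
            "service": name,
            "condition": "no passing instance",
            "notify": meta.get("owner", "unknown"),
        })
    return alerts


# Hypothetical services and metadata, for illustration only.
generated = alerts_from_metadata({
    "ads-api": {"owner": "ads-team", "alert_on_down": "true"},
    "batch-job": {"owner": "data-team"},
})
```

Because the alert is computed from what the service publishes, replacing the alerting tool only means rewriting this translation step, not asking every team to reconfigure anything.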
We've got several of those services, and they are all using the same business semantics to build services over services. That's very interesting, because it completely changes the way we interact with our infrastructure. People don't have to feed yet another Git repository. They can change by themselves the metadata in their service, and the infrastructure will take care of everything.
Everything is magic because you're just publishing your service on the infrastructure, we watch what is published in your service, and we'll take the appropriate decision. That way we can cover all the services we have. All the 4,000 services are covered with alerting if we want, and that's very powerful on the infrastructure side.
DNS is a complex piece of this because, by default on Consul, DNS will reach the Consul server. So it's very important to use ?stale queries, as I said previously, but you can also use cached queries, which we implemented in Consul, to have sub-millisecond queries.
So you've got almost instant queries because all the queries to DNS are locally cached by the agent and do not require a call to the Consul server.
You also have to take care with DNS configuration, especially negative TTL configuration. This supports queries as well.
All of this allows you to have very good performance of DNS with Consul, but it requires some work. We have written an article about that. I encourage you to have a look.
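The DNS settings mentioned above map to options of the dns_config block of the Consul agent configuration. The sketch below generates such a fragment as JSON. The option names (allow_stale, max_stale, use_cache, cache_max_age, service_ttl, soa.min_ttl) exist in Consul's agent configuration, but every value here is an arbitrary example rather than a recommendation, and availability should be checked against the Consul version in use.

```python
import json

# Example values only; tune them to your own latency and freshness needs.
dns_settings = {
    "dns_config": {
        "allow_stale": True,            # any server may answer DNS queries
        "max_stale": "5s",              # how stale an answer is allowed to be
        "use_cache": True,              # answer from the local agent cache
        "cache_max_age": "5s",          # how long a cached answer may be reused
        "service_ttl": {"*": "5s"},     # TTL returned for service lookups
        "soa": {"min_ttl": 5},          # negative-response TTL, in seconds
    }
}

dns_config_json = json.dumps(dns_settings, indent=2)
```

The resulting JSON could be written to a file in the agent's configuration directory.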
Another key point regarding large-scale infrastructure is that whenever you are using several datacenters, you might want to use async refresh of ACLs over WAN.
This is a feature I added in an old version of Consul, but it's not the default. It lets you still have very good performance whenever the link between your datacenters is weak. So use async-cache for ACLs.
Of course, take care of the ARP cache. If you are using large datacenters, you might be hit by that.
And finally, if you have a large infrastructure, be strong on security: enforce writes on the Consul agent only from the local machine, and ensure that your health check scripts cannot be registered using API calls.
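These two recommendations can be sketched as an agent configuration fragment, generated here as JSON. It assumes a Consul version that supports the acl block with a down_policy option and the enable_local_script_checks option; older versions used different option names, so this should be checked against the documentation of the version you run.

```python
import json

# A sketch under the assumptions stated above, not a complete agent configuration.
security_settings = {
    "acl": {
        "enabled": True,
        # Keep serving cached ACLs and refresh them in the background
        # when the link to the primary datacenter is slow or down.
        "down_policy": "async-cache",
    },
    # Allow script checks only when they come from local configuration files,
    # never when they are registered through the HTTP API.
    "enable_local_script_checks": True,
}

security_config_json = json.dumps(security_settings, indent=2)
```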
To conclude, the most important thing we learned is that, for humans, applications, and infrastructure, you should define clear SLAs. You have to define, with all the stakeholders, what matters about Consul. We chose to select a few metrics.
These are not technical metrics, but business metrics. How much time do I need to register a service? How much time do I need to be sure that my key can be seen by all Consul agents in my datacenter? How much time do I need to answer a DNS query? And so on. We define those SLAs.
We are measuring it outside of Consul with small applications, and we have a dashboard. On this dashboard everything is green where we are under the SLA, while it's orange when we are close, and it's red when we are over.
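The color logic of such a dashboard is simple enough to sketch. The 80% threshold used for "close to the SLA" and the sample numbers are assumptions; the talk does not say where Criteo draws the line between green and orange.

```python
def sla_status(measured_seconds, sla_seconds, warn_ratio=0.8):
    """Return the dashboard color for one measurement against its SLA.

    green: comfortably under the SLA; orange: within the last 20% (assumed
    threshold); red: over the SLA.
    """
    if measured_seconds > sla_seconds:
        return "red"
    if measured_seconds >= warn_ratio * sla_seconds:
        return "orange"
    return "green"


# Hypothetical measurement: registering a service took 0.4s against a 2s SLA.
registration_status = sla_status(0.4, 2.0)
```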
Doing this reduced the amount of questions as well. And people know exactly what we agreed with them.
That's all for today. I encourage you to have a look at Consul UI. It's called the Consul TemplateRB. It's open source. You can fork it, you can patch it, you can send a pull request.
We are also doing lots of articles on the Criteo R&D blog.
Thank you very much and stay safe. Goodbye.