How Criteo went from Physical Hosts/DCs to Containers with Consul and Connect
Jan 09, 2019
See how Consul Connect is the keystone of Criteo's advertisement retargeting business infrastructure.
Criteo is a large actor in the advertisement retargeting business. They have run their business on large datacenters with Consul for almost 3 years with more than 22k nodes, up to 2.5M requests/sec and 100Pb.
Historically running only on bare-metal hosts, they are moving their workloads to containers. Consul provides the gateway for all their systems running Windows, Linux, containers and cloud systems. Consul is the keystone of their infrastructure: it acquires locks when needed and provides service discovery using both DNS and advanced libraries. This talk will show how Consul allows them to provision their load balancers (F5 / HAProxy) and configure their C# and Scala libraries to perform CSLB (Client Side Load Balancing).
Since all of their apps are very sensitive to latency, they try to avoid proxies as much as possible and use full mesh connections between their applications. They developed many CSLB libraries with various algorithms and use them for C# and Scala. See how they manage having different hosts (containers and bare-metal servers) and how Connect (with a few evolutions) will help bring new workloads to their infrastructure.
Senior DevOps Engineer, Criteo
Good afternoon ladies and gentlemen. I'm Pierre Souchay. I'm completely jet lagged. It's the first time I do speak in English, so pardon my French, but I hope you will enjoy the talk.
So, about me, I'm Pierre Souchay. I've been working at Criteo for three years. I first started as a software architect. Then I did infrastructure work, managing more than 15,000 bare-metal machines. Now I've just created the Discovery team. The goal of the Discovery team is to operate Consul and to create SDKs so that all of our applications interoperate nicely with Discovery. And of course to provide all the services you expect from Consul, including DNS and so on. In order to do that, we had to create several patches to Consul. I just learned that I was the first external contributor to Consul. But more globally we really need to optimize Consul for specific workloads, and it's very interesting because Discovery is all about naming things and invalidating caches, and everybody knows that's the hard part of computing. It's a very interesting challenge.
So, just a few words about Criteo. Criteo used to be a small French startup that went public and is now listed on NASDAQ. It's pretty big. It's in the advertisement business. We have offices in Paris but also in Palo Alto as well as Ann Arbor and we are recruiting, so if you want to join us, give me a call.
So we have around 800 employees in R&D, around a bit more than 2,700 employees globally. So we are present in all parts of the world.
A little bit more about Criteo infrastructure. It's a pretty big infrastructure. It was initially a Windows shop, so there are lots of Windows bare-metal machines. Globally we have around 25,000 nodes in eight data centers just for production. We also have two data centers in pre-production. Around 50% of the nodes are running Windows and the others are mostly running Linux.
We have a huge Hadoop cluster. We are performing lots of machine learning. We are using Kafka very heavily. We also have a Mesos cluster, which is similar in terms of features to Nomad. We are currently moving away from our bare-metal infrastructure in order to go to containers and I'll dig into that during the presentation.
And of course we've got Consul everywhere. All the machines within Criteo are running a Consul agent. That's an interesting challenge because of the scale of the whole infrastructure.
So during this presentation I'm gonna talk about how we went from physical data centers and physical bare-metal machines to how we want to go to containers using Service Mesh. If you heard about Service Mesh during those talks, you probably have seen Paul's presentation or Baptiste's presentation from HAProxy. We definitely want to go this path and I'm gonna explain why and how. Since we have already been using Service Mesh within Criteo for a few years, I'm gonna explain how we went from a classical architecture to Service Mesh. Do any of you in the room know exactly what Service Mesh is?
So basically, and you see how good I am at drawing pictures, on the left side of this drawing, you'll see a classical load-balancing architecture. So your application is gonna talk to another service using a load balancer. On the right side that's the path we want to achieve, meaning Service Mesh, which is the application is directly connecting to the other services.
So in order to explain the path we took, I'm gonna take a very simple example of an app that could run at Criteo. Naturally we've got very similar kind of applications like that. So this kind of application may have to handle more than 2,500,000 requests per second, for instance, globally. So it's pretty big.
So this application is gonna be very easy. It's an application that displays banners. So the user is on the internet, he's going through a farm of load balancers. For instance, using F5 or HAProxy, whatever you want. And then you've got my app.
Well, in order to work, my app requires services. So you've got metrics, you've got memcache because we've got lots of users. You've got SQL servers. You've got Kafka and so on and so on. And you've got microservices. Actually Criteo took the path of microservices several years ago.
Well, if you think about microservices, you often think about containers or smaller applications. Actually, the microservices we have been using at Criteo for a long time are not micro. They are microservices because they are doing one single thing. For instance, tell me what is the preferred color of this user. Tell me whether I have to display a banner for this user, and so on and so on. So it's kind of a Unix tool doing one thing, but there are several of these and those applications are huge. I mean we are talking about a full bare-metal machine, taking lots of CPU, very optimized for what they are doing. So those are microservices.
Why did we go through this path? Simply because those microservices are used by lots of applications because we've got my app but we've got ten apps like this. All of them need to know what is this user. So we separate it into several teams each microservice, and we are able to scale those services by adding new machines for instance on bare-metal hosts, doesn't matter. So we are able to scale every part of our infrastructure this way. And it used to work well.
The other thing to take into account is latency. In the advertisement business it's very important to have good latency because it's a kind of auction system. If you're too late, you're out of business. I mean even if you say, oh yeah I'm gonna answer to this advertisement, the auction system won't take your answer if you answer too late. So it's very important. It's a part of the business. It's not only QoS, it's really part of the business.
So we are gonna do this application with a traditional model, which is: I'm gonna use a load balancer. I'm gonna use HAProxy because I love this product so let's use it. My application is gonna call the microservice using this load balancer. Well, with the scalability of Criteo, our microservice instance, which is a huge machine, you've got 500 of those. So it's pretty big. And you've got lots of those services everywhere within the company. But it might work.
So this is the bandwidth of one of our microservices. So as you can see we have very, very good engineers working well doing async I/O, stuff like that. They are working very hard for their microservices to be very efficient. This is a real graph I took last week. So this microservice is getting very, very high in terms of bandwidth. Another issue is that by adding a load balancer, our latency did increase a bit, and it's a problem for the business. When you have to answer in 20 milliseconds, every millisecond counts.
Let's do some math. In one single data center, we've got something like more than 2,000 servers. And given the bandwidth I showed you previously, let's say we can put 10 targets per load balancer at peak. So the answer is far too much. Far too much, because it costs a lot to add all those load balancers. If you have to add more than 200 servers only to load balance your systems, it's gonna be complicated, especially if you are using F5 or stuff like that. It's gonna cost a lot of network dollars, and everyone knows that network dollars are very expensive.
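The arithmetic behind that figure can be written down as a short sketch. The two input numbers are the rough ones quoted in the talk; the function name is only illustrative.

```python
import math

def load_balancers_needed(targets: int, targets_per_lb: int) -> int:
    """Number of load balancers needed to front `targets` backend servers."""
    return math.ceil(targets / targets_per_lb)

servers_in_dc = 2000   # "more than 2,000 servers" in one data center
targets_per_lb = 10    # rough peak capacity per load balancer, as assumed in the talk

print(load_balancers_needed(servers_in_dc, targets_per_lb))  # 200
```

With more than 2,000 servers the result only goes up, which is where "more than 200 servers only to load balance" comes from.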
So, we have lots of good engineers, so we have to find a way to avoid paying for this infrastructure, this cost in every of our data centers. So what we did a long time ago—when I say this, it's something like six or seven years ago. We did client-side load balancing. So maybe some of you know Finagle or this kind of things. So we did implement our own load balancing mechanism, so a Service Mesh, which is implemented in an API.
So, we've got libraries. We wrote one library for our C# implementation, we wrote libraries for JVM/Scala, and we've got the best of all worlds, meaning we've got very, very good performance. We don't need all those machines and we've got less latency. So it's really, really great, it was a great idea.
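To make the idea of client-side load balancing concrete, here is a deliberately tiny sketch. It is not Criteo's library (those are written in C# and Scala); the class name and the addresses are hypothetical, and round-robin is just one possible algorithm.

```python
import itertools
from typing import Iterator, List, Tuple

Instance = Tuple[str, int]  # (host, port) of one service instance

class RoundRobinBalancer:
    """The caller keeps the instance list itself and picks the next target,
    so no load balancer sits between the application and the service."""

    def __init__(self, instances: List[Instance]) -> None:
        if not instances:
            raise ValueError("at least one instance is required")
        self._cycle: Iterator[Instance] = itertools.cycle(list(instances))

    def pick(self) -> Instance:
        # Return the next instance, wrapping around at the end of the list.
        return next(self._cycle)

# Hypothetical addresses, as they might come back from service discovery.
balancer = RoundRobinBalancer([("10.0.0.1", 8080), ("10.0.0.2", 8080), ("10.0.0.3", 8080)])
picks = [balancer.pick() for _ in range(4)]
```

The sketch leaves out everything that makes the real problem hard, such as tracking instance health, refreshing the instance list, retries and TLS.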
So there is no application. MyApp is using this CSLB API and talking directly to each of the microservices. The drawback, however: for those who heard the presentation from Baptiste of HAProxy, he said, "Ah, don't do that." And he's right because it's very hard to get it right, very, very hard.
Only the CSLB code, different libraries, is more than 10 kilolines of very, very complex code. Every time you are moving something within this particular code, it's difficult, you really have to get all the implications it's gonna do. If you want a new feature—you want TLS for instance—well, that's very difficult because we oversimplify the stuff and having TLS right is very, very hard.
Another issue is, everybody now wants to get rid of C# code and so on 'cause Go is cool, Rust is even cooler, Python… Everybody wants to use the right language for the task, right? Okay, but how do we deal with this CSLB stuff? It means for each of these languages we have to bring all of our code to the new language. It's messy and it's very complicated. Now we have a discovery team, we are full. So, full, if you want to port your CSLB code to Go or Python or whatever, it's gonna be very difficult for us. So, we're not dealing with that, we don't want to deal with that.
There's another issue. In my first application I told you, "Oh yeah, there's a load balancer," and so on. That's great but still this load balancer has to reach MyApp and MyApp has maybe 500 or 800 instances so you have to do all the provisioning by yourself as well.
So, we did it, meaning that we did implement our own system based on Consul, which is watching all the changes in Consul, live, and which is provisioning F5 load balancers as well as HAProxy. But still it is almost the same job as providing CSLB: lots of work, very difficult. At the scale we had, we had to implement our own Consul template clone, for instance, in order to get those things done, otherwise it was taking all the bandwidth available on the server just to watch all the changes in your infrastructure. We've got more than 1,300 different kinds of services in a single data center. So it's creating lots of bandwidth usage.
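Watching Consul for changes is typically done with blocking queries on its HTTP API. The sketch below is not Criteo's tool; it only assumes the documented `/v1/health/service/<name>` endpoint with the `index` and `wait` parameters and the `X-Consul-Index` response header. The `on_change` callback is a hypothetical hook where a load balancer would be reprovisioned, and the agent address is the usual local default.

```python
import json
import urllib.request
from typing import Callable, List, Optional, Tuple

# A fetch function takes the last seen index and returns (new_index, entries).
Fetch = Callable[[int], Tuple[int, List[dict]]]

def consul_fetch(service: str, base_url: str = "http://127.0.0.1:8500") -> Fetch:
    """Build a fetch function that long-polls the local Consul agent."""
    def fetch(index: int) -> Tuple[int, List[dict]]:
        url = f"{base_url}/v1/health/service/{service}?passing=true&index={index}&wait=5m"
        # The HTTP timeout is a bit longer than the requested wait.
        with urllib.request.urlopen(url, timeout=330) as resp:
            return int(resp.headers["X-Consul-Index"]), json.load(resp)
    return fetch

def watch(fetch: Fetch, on_change: Callable[[List[str]], None], rounds: int) -> None:
    """Call on_change with the sorted 'host:port' list whenever it changes."""
    index = 0
    last: Optional[List[str]] = None
    for _ in range(rounds):
        new_index, entries = fetch(index)
        # If the index ever goes backwards, start again from 0.
        index = new_index if new_index >= index else 0
        targets = []
        for entry in entries:
            # An empty service address means "use the node address".
            host = entry["Service"]["Address"] or entry["Node"]["Address"]
            targets.append(f"{host}:{entry['Service']['Port']}")
        targets.sort()
        if targets != last:
            on_change(targets)
            last = targets

# Usage against a real agent ("my-app" is a hypothetical service name):
# watch(consul_fetch("my-app"), print, rounds=10)
```

One such long-poll per service is cheap, but with more than 1,300 kinds of services the traffic adds up, which matches the bandwidth problem described above.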
Another issue is of course, for the load balancer, you have other issues to tackle which are TLS, virtual domains, and so on. So there are lots of more features than CSLB.
We did work a lot on this and finally we figured out how to do that, but it's a huge amount of work. It's a full team working for, maybe, one year and a half, so it's very, very hard.
Why are we doing all of this? I mean, we used to work without Consul. We introduced Consul three years ago but we did find a way to get the job done. We are using a kind of "discovery mechanism," relying on a database or something like that, pulling the stuff and so on—it did work. But why are we doing all of this? Simply because we want containers.
Everybody wants containers: containers are hype, containers are cheap, they are scalable, they are also very hard. Why are they so hard? We'll see this a bit later.
So, containers give you quick scalability. I mean, you can easily create a new instance of your app. For instance, for bare-metal provisioning, spawning a full instance of a machine running a specific service takes us around three or four hours. It's a long time. You have to image a machine, create this thing, we are using Chef. So it takes a lot of time.
At that time, having a database is not a big issue. I mean, you can poll your system every 20 minutes and say, "Okay, there's a new entry in the load balancer," and so on and so on, and you don't need Consul for that. But as soon as you introduce containers, then you have to be more reactive regarding your changes. Containers are great for microservices because, when you want to scale one of your microservices horizontally, with containers you can spawn a new instance of the μService in a few seconds and be able to answer more requests from the internet, so that's great.
Another thing which is very difficult to do with bare-metal machines is to co-locate several services. Let's say you have a service taking some amount of CPU and some amount of memory, we can say bandwidth as well, and another service using less CPU and more memory. With containers you can combine both to run on a single host. And that's where containers are really great. I mean, it's all about the money, so being able, on a single machine, to run two services with different characteristics is great. And that's the goal of containers and microservices, being able to scale them individually.
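The co-location argument can be illustrated with a small capacity check. All the numbers and names here are made up for the example; a real scheduler such as Mesos takes many more constraints into account.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Resources:
    cpu_cores: float
    memory_gb: float

def fits_together(host: Resources, *services: Resources) -> bool:
    """True if the summed CPU and memory requests fit on the host."""
    cpu = sum(s.cpu_cores for s in services)
    memory = sum(s.memory_gb for s in services)
    return cpu <= host.cpu_cores and memory <= host.memory_gb

host = Resources(cpu_cores=32, memory_gb=256)         # hypothetical machine
cpu_heavy = Resources(cpu_cores=24, memory_gb=32)     # lots of CPU, little memory
memory_heavy = Resources(cpu_cores=4, memory_gb=192)  # little CPU, lots of memory

print(fits_together(host, cpu_heavy, memory_heavy))  # True
print(fits_together(host, cpu_heavy, cpu_heavy))     # False
```

Two services with complementary profiles share one machine, while two copies of the CPU-heavy one would not fit.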
Three years ago we did choose to first introduce Consul and we also took the path of taking Mesos. So it's a bit similar to Nomad. At that time it was one of the nice ways to deploy containers. I know there are several more alternatives such as Kubernetes, of course, and Nomad. So we had to first bridge Mesos to Consul.
Here is an example of one service running. So on the right side of the image you see, you will see one instance which is running Linux in a container and on the bottom ... No, that's the opposite. On the top it's a Windows machine, bare-metal Windows machine running Windows, and on the bottom, a Linux machine in a container. So, for Consul, those instances are similar. Of course you can see we put lots of metadata and so on, but for the old discovery mechanism everything looks the same. I mean, whether you are reaching an instance in a container or on a bare-metal machine, it's the same for the caller. So that's great.
However, we've got an issue because, for instance, we are moving away from Windows, we are moving to Linux, but the runtime is different and moreover you don't want to use very, very large containers. You don't want to use huge amounts of memory and CPU. You want to have smaller instances to be able to scale it horizontally in a fine-grain fashion. So it means that, what you want to put in your container system is smaller instances.
The problem is, if you want to scale up the thing and be able to answer each request from the internet correctly, then you have to put less load on your containers, because they will be able to handle fewer requests per second. So, what we have then is a huge machine being able to take twice or three times the load of a container. So we want to be able to achieve that.
So this is pretty cool, because we added support for weights in Consul a few weeks ago, I think it's ... yeah, starting with version 1.2.3 of Consul. So it means that Consul has full knowledge of whether this app should handle a huge load or a smaller load. So it's very useful, because otherwise it's almost impossible to move away from bare-metal machines to containers.
Great! We've got a new weight attribute in services for Consul. So we want to use it. Great! So, first you have to wait for your Consul pull request to be accepted. Once you've done this, you have to modify your CSLB (client-side load balancing) libraries in order to take this into account. Modify the LBs.
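Here is a sketch of what "taking weights into account" could look like in a client-side library. It assumes entries shaped like Consul's health endpoint output, where each service carries `Weights` with `Passing` and `Warning` values (both defaulting to 1). The selection logic, names and numbers are illustrative, not Criteo's implementation.

```python
import random
from typing import List, Optional

def effective_weight(entry: dict) -> int:
    """Weight of one instance given the status of its health checks."""
    statuses = {check["Status"] for check in entry.get("Checks", [])}
    if "critical" in statuses:
        return 0  # never send traffic to a failing instance
    weights = entry["Service"].get("Weights") or {"Passing": 1, "Warning": 1}
    return weights["Warning"] if "warning" in statuses else weights["Passing"]

def pick(entries: List[dict], rng: Optional[random.Random] = None) -> dict:
    """Pick one instance at random, proportionally to its weight."""
    if rng is None:
        rng = random.Random()
    weights = [effective_weight(e) for e in entries]
    if sum(weights) == 0:
        raise ValueError("no healthy instance available")
    return rng.choices(entries, weights=weights, k=1)[0]

# Hypothetical instances: a bare-metal host taking three times a container's load.
bare_metal = {"Service": {"ID": "bare-metal", "Weights": {"Passing": 3, "Warning": 1}},
              "Checks": [{"Status": "passing"}]}
container = {"Service": {"ID": "container", "Weights": {"Passing": 1, "Warning": 1}},
             "Checks": [{"Status": "passing"}]}

rng = random.Random(42)
sample = [pick([bare_metal, container], rng)["Service"]["ID"] for _ in range(4000)]
share = sample.count("bare-metal") / len(sample)  # expected to be close to 0.75
```

Every CSLB library and every load balancer provisioner needs its own version of this logic, which is the cost described above.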
That's nice. Takes a huge amount of time. Is it worth the effort? I mean we are working so much for a small feature, and I'm not even talking about security. If we want to move away from all bare-metal infrastructure, protected infrastructure, we want to move to the cloud, we want to move to containers, we need to have TLS for instance.
Okay. Weight is one small thing. TLS is far harder to implement. So can we find something that will help us deliver more for all those changes?
Of course! Connect is there.
So Connect offers you the low latency, the weight, you don't have CSLB code to write, and much more. The most important thing for us is we are an on-premises shop. Meaning that all machines are bare-metal servers, just with network cards and so on. We are not using a software defined networking mechanism, such as Calico or whatever. We are not running Kubernetes. We are interested in running that stuff, but it has to be compatible with us.
The big advantage of Consul is it lets you run those kinds of workloads. So it lets us migrate from a pure bare-metal infrastructure to a new one with containers, with fancy stuff such as Kubernetes. So it's great.
So that's the big advantage of Consul, because you don't have to say, "Okay my port is hard-coded and so on". Consul provides all of this. So it's compatible with whatever you want. So that's a great selling point for us.
Another very cool feature of Connect is Intentions. I've been working on our cloud provider for some time, with lots of microservices. And it's almost impossible in the current world of microservices to have a graph of dependencies of services. It's a big deal. Because when you stop a service, who's gonna be impacted? If you stop your microservice, another microservice might use it, other microservices might use this service, and so on and so on. For now, in most architectures, it's almost impossible to get a full graph of what's going on.
Intentions are, of course, good for security. But in the long term they will also allow you to create a full graph of the dependencies of your full data center, or global architecture. So it's very good in the long term to be able to have strict ACLs, and to be able to see the full picture of what is going on within your systems.
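As a sketch of that idea, the code below treats a list of intentions as the edges of a dependency graph and answers "who is impacted if this service stops?". It assumes records with the `SourceName`, `DestinationName` and `Action` fields used by Consul intentions; the service names are hypothetical and wildcard intentions are not handled.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, List, Set

def callers_by_destination(intentions: Iterable[dict]) -> Dict[str, Set[str]]:
    """Map each destination service to the sources allowed to call it."""
    callers: Dict[str, Set[str]] = defaultdict(set)
    for intention in intentions:
        if intention["Action"] == "allow":
            callers[intention["DestinationName"]].add(intention["SourceName"])
    return callers

def impacted_by(service: str, intentions: List[dict]) -> Set[str]:
    """Every service that directly or transitively calls `service`."""
    callers = callers_by_destination(intentions)
    seen: Set[str] = set()
    queue = deque([service])
    while queue:
        current = queue.popleft()
        for caller in callers.get(current, ()):
            if caller != service and caller not in seen:
                seen.add(caller)
                queue.append(caller)
    return seen

# Hypothetical intentions for illustration only.
intentions = [
    {"SourceName": "my-app", "DestinationName": "user-profile", "Action": "allow"},
    {"SourceName": "user-profile", "DestinationName": "memcache", "Action": "allow"},
    {"SourceName": "reporting", "DestinationName": "user-profile", "Action": "allow"},
    {"SourceName": "batch", "DestinationName": "memcache", "Action": "deny"},
]

print(sorted(impacted_by("memcache", intentions)))
# ['my-app', 'reporting', 'user-profile']
```

Denied intentions are skipped because a denied caller cannot depend on the destination through Connect.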
Another cool feature of Consul Connect is being able to switch from one implementation to the other. I mean you've probably heard about the control plane, which is Consul, and the data plane. And the data plane is implemented now with a pure Consul proxy implementation, Envoy, or HAProxy. So you will be able to choose the better implementation for you, and it will work not only for HTTP services, but gRPC, pure TCP stuff. So that's really cool, because you might address most of your needs.
So what we did is we tested Consul Connect a bit in our infrastructure. It was a quick test. We did test it only with the default Consul Connect managed proxy. Mostly because we are running Windows hosts, so having a sidecar was not very practical at that time. And we did not test it on the whole range of our services, but only on the Mesos services. Still it's around 450 different kinds of services, and it's around 10,000 different instances.
So this is a quick result. Please do not take this seriously. Imagine we run the app, we did some basic tests. What we found out is: compared to a direct connection the latency did increase, because we worked very hard to implement direct connections in our home-made Service Mesh. It's a bit slower, you have to do a hop through localhost.
Compared to a load balancing approach it's faster, of course, less latency. You have one hop less. The throughput is a bit worse. But what's interesting is it's just the default implementation, which is not really recommended for your usage. But it still shows you can have some benefits using it. And the big benefit is, now we can use CSLB for all the languages we want.
In the era of containers, everybody wants to use the right language for the job. So people want to use Python, people want to use Go. People want to use C#, Scala, whatever. And they can use it almost for free. That's very interesting.
So all of this makes me think that the future implementation, future tests with Envoy and HAProxy, will probably be far better than those results. In the long term you will be able to use it without paying the costs of writing CSLB code, CSLB libraries and so on. So it's a real benefit for you, because you will avoid having all those classical load balancer stages between your applications and the microservices. You will get good latency, good throughput. So it's gonna be really good.
Another point is you don't have some kind of single point of failure, and you will be able to have a scalable infrastructure globally for all your data centers.
So just to summarize, what Consul Connect and globally Service Mesh do add as a value, which is:
You have less machines.
You don't have to pay the price for lots of stages of load balancers.
You've got network independence—at least with Consul Connect.
I've seen many presentations showing how Kubernetes can work with Consul service discovery. That's a great thing as well. Meaning that from Kubernetes or whatever PaaS mechanism you choose, you can interact with Consul. So that's a very strong selling point as well.
It's got very fast updates on provisioning. Probably better performance than with a real load balancer. So all of this makes me think that this is a kind of future for everybody.
And the biggest selling point, the biggest issue for me with Kubernetes and this stuff is it's a local bubble. Meaning that those services are within Kubernetes but they cannot go outside. So that's the greatest selling point of Consul Connect for me.
Thank you very much.