Self-Service Discovery at Scale With Consul at Bloomberg
Jan 06, 2021
Using HashiCorp Consul, Bloomberg created a self-service product for service discovery that thousands of Bloomberg developers use every day.
Mike Boulding: Hi everyone. I'm Mike. Later on, I'm going to be introducing you to Jeremy. We both work for Bloomberg. Today we're going to be talking about self-service discovery at scale with Consul.
We're going to talk about some of the challenges we faced while trying to bring service discovery into Bloomberg: the scaling problems and how we scaled service discovery to thousands of nodes, and some of the network topologies that we bumped into and how we managed to solve the complexities around those. Then finally, we're going to talk about how we made the whole thing self-service to encourage people to use it and get adoption going.
Why Did Bloomberg Need Service Discovery?
The first question is a good one. To give you an idea: Bloomberg has three primary data centers, but we also have 100+ node sites — smaller data centers dotted around the world. That's a lot of data centers to manage. Those in turn create hundreds of network segments that we have to deploy software across and figure out how to connect.
We also have 20,000+ different kinds of microservices dotted around those networks and data centers. We have 30,000+ servers, all running various different instances of those microservices. That creates quite a lot of instances all around as well. On top of that, one of the world's largest private networks.
All of that together creates a big need for us to put some organization there — and use some kind of service discovery to allow the various bits of software to find, and communicate with, each other throughout our different networks and data centers. When we looked at this, we broke the problem down broadly into three main areas that we wanted to focus on when it came to solving service discovery.
Over the years, Bloomberg has provided many software-as-a-service (SaaS) offerings to its internal users. That is great. It allows them to go away and develop software really quickly on top of our platforms. But we found over the years that as we add more SaaS offerings into that mix — each one tends to have to solve some of the same problems again and again.
One of the bigger problems was service discovery. We would find people solving that with various different mechanisms — and they would eventually bump up against a lot of the same issues trying to scale it out and make it useful everywhere. One thing we wanted to focus on with this initiative was: how can we solve this once, for all of the future problems we might have? How can we make sure that future SaaS offerings don't have to do this — and are therefore a little bit faster to deploy?
Bloomberg internally has a very large service mesh. Whatever we used, we wanted to make sure it was something we could integrate with the mesh, and that it would work and remain compatible with it down the line.
Health-Based DNS Systems
This is a more historical way of doing things, but we wanted to support straight-up DNS-based discovery, so that the many systems whose integration point is DNS could also get great service discovery and take advantage of whatever we built.
At the time we were doing this — about three or four years ago — we were looking around, trying to figure out what the right way to do this was. There were several different options available. One was etcd, which is a nice bit of software; it really gets used for things like Kubernetes today and is a popular and successful piece of software. Another option would have been Apache ZooKeeper. That gets used for things like Kafka today, and it's obviously reliable and solid there too.
We ultimately ended up with Consul. When you compare those products, the reason we ended up with Consul was that, out of the box, it is a service discovery product. It tries to solve service discovery end-to-end. That's awesome because it blends protocols like Raft and gossip together to give a nice mix of AP and CP. The way those two protocols work together to give us a good discovery solution is awesome.
On top of that, it has things like built-in health checks and solutions for a lot of the pieces that make up service discovery. If we'd gone with one of the other options, we would have ended up having to layer those pieces on ourselves. With something like Consul, we could bring it in and it would just work — which was really awesome. Additionally, it has a great set of APIs, again targeted at discovery. We didn't necessarily need to build anything on top of that either.
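To make those built-in health checks concrete, here's a minimal sketch (not Bloomberg's actual code; the service name and addresses are invented) of the payload you'd send to Consul's standard agent API to register a service with an HTTP health check:

```python
def build_registration(name, address, port, interval="10s"):
    """Build the body for a PUT to Consul's
    /v1/agent/service/register endpoint. The Check block tells
    Consul to poll the service's /health URL on that interval and
    mark the instance passing or critical accordingly."""
    return {
        "Name": name,
        "Address": address,
        "Port": port,
        "Check": {
            "HTTP": f"http://{address}:{port}/health",
            "Interval": interval,
        },
    }

# With the `requests` library, registering against a local agent
# would look roughly like:
#   requests.put("http://127.0.0.1:8500/v1/agent/service/register",
#                json=build_registration("pricing-api", "10.0.0.12", 8080))
```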
Thank you. I'm going to hand it over to Jeremy and he's going to talk a little bit more about some of the challenges that we faced as we scaled things out.
Jérémy Rabasco: Thank you, Mike. I'm going to now start to walk you through some of the challenges we faced when building service discovery.
The first of them is how do we go to thousands of nodes and services. When you hear thousands, usually one of the first things that pops into your head is performance. How do we make that scale in terms of raw performance? And there, there were two main questions.
The first one is: can Consul monitor that many items? We just did the experiment. We registered thousands of nodes and thousands of services into Consul, and Consul is happy with it. So, yes it can.
The second question was: can we query those items efficiently? If you have thousands of nodes and services registered into Consul, can you then do thousands of requests? Is Consul still happy? Yes, it's possible, and yes, Consul is still happy about it.
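On the query side, a client typically hits Consul's health endpoint (`/v1/health/service/<name>?passing`) and pulls out the healthy addresses. A small sketch, using the response shape that endpoint returns, of what that parsing looks like:

```python
def healthy_endpoints(health_response):
    """Turn a Consul /v1/health/service/<name>?passing response
    (a list of node+service entries) into (address, port) pairs.
    A service-level address overrides the node address when set."""
    endpoints = []
    for entry in health_response:
        svc = entry["Service"]
        addr = svc.get("Address") or entry["Node"]["Address"]
        endpoints.append((addr, svc["Port"]))
    return endpoints
```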
With performance out of the way, there's a second challenge that is a bit more interesting about scaling to thousands of nodes. It's a management problem — not a problem of managing people, but of managing software rollouts.
We have thousands of machines — I've said it many times already. If you want the Consul agent on all of those machines, that's already a challenge on its own. If you don't have a solid software rollout system, you will run into problems straight away.
If you think about it, you also want to be able to do multiple updates a year. You want new versions of Consul to keep rolling out. If you're not careful — with multiple updates a year across thousands of machines — you might end up doing rollouts continuously and drowning in software rollouts.
And on top of this, at Bloomberg we have a very heterogeneous environment. That means we have different kinds of systems running in the company — for instance, AIX and Solaris. These are systems that wouldn't run any Go program, so you cannot deploy the Consul agent on an AIX or Solaris machine. We needed some way to support people who run software on those systems.
Consul ESM stands for Consul External Service Monitor. It's a piece of software that you can deploy alongside your Consul agents that lets you monitor remotely. What does remote monitoring mean? It means you can monitor machines that don't run the Consul agent — it acts as a remote probe. This is something that helped us a lot in solving this problem, and we liked it a lot. So we contributed pull requests, which is also a good occasion to thank the committers and maintainers for taking the time to review our use case and accept patches.
How Does Consul ESM Solve Our Problem?
You can think of the ESM as a multiplier. One instance of the ESM will be able to monitor dozens or hundreds of different machines. Instead of deploying one agent per machine, you deploy a handful of ESMs and you're able to monitor thousands of machines. Now the multiple updates a year are not so much of a challenge, because the number of agent deployments you have is much smaller.
Finally, it does support AIX and Solaris, because the only requirement for the ESM to work is network connectivity between the ESM and the machines it's monitoring. Thanks to all of this, we know that we can deploy our discovery system across thousands of nodes and services.
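Registering such an external machine goes through Consul's catalog API rather than a local agent; the `external-node` and `external-probe` node metadata is what tells Consul ESM to take over health probing. A sketch, with invented node and service names:

```python
def external_node_registration(node, address, service, port):
    """Build the body for a PUT to /v1/catalog/register for a
    machine that cannot run the Consul agent (e.g. AIX or
    Solaris). The NodeMeta flags mark it for Consul ESM, which
    then probes it remotely over the network."""
    return {
        "Node": node,
        "Address": address,
        "NodeMeta": {
            "external-node": "true",
            "external-probe": "true",
        },
        "Service": {"Service": service, "Port": port},
    }
```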
Tackling a Complex Network Topology
How about network topology? Network topology is very hard to picture in your head, so I'm going to start with pictures straight away. This is the simplest network topology I can think about — where you have only one network segment.
What does it mean to have only one network segment? It means that all of your machines can talk to each other. There's absolutely no problem. You can deploy one Consul server — or rather one set of Consul servers — and it will be able to monitor all of the members of your segment. It's fine. It just works.
What happens if you want a slightly more complex topology with two network segments? To monitor the inside of the segments, you can have one Consul cluster per segment. That will monitor the inside of those segments — it's fine; that still works as before. But the problem is if you want some service from Segment A to discover service from Segment B, for instance, you will need those two Consul clusters to somehow share information. Consul has a feature that can help you with that.
Consul WAN Join (or Wide Area Network Join)
How do you make it work? First, you poke a few holes through your firewalls so that the two Consul clusters are allowed to communicate. Then you tell those two clusters that they are WAN joined. They start talking to each other — gossiping, exchanging information — and then you can do discovery across the different segments this way.
That works well — we in fact did a similar thing for some time. But what happens if you want a truly complex network topology — by which I mean N network segments? N can be big, but 5 is enough to illustrate my point. Here are 5 network segments. What happens if I want to WAN join all of them?
Here's what happens. All of these arrows are holes that you have to poke through firewalls to have pairwise communication available for the WAN join. That's ten, just for 5 network segments. We really had to say no to that solution, because 10 firewall rules for just 5 network segments is way too much. Now think about six, seven, 10, 100 network segments. That just doesn't scale anymore. Also, as a general rule, you don't want to be poking holes in your firewalls all over the place. Those rules are there for a reason — it felt like a bit of a security issue. So what was our workaround?
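The arithmetic behind that "ten" is just the number of unordered pairs of segments, and it grows quadratically:

```python
def wan_join_rules(segments):
    """Pairwise firewall openings needed to fully WAN-join
    `segments` Consul clusters: one per unordered pair,
    i.e. n * (n - 1) / 2."""
    return segments * (segments - 1) // 2

print([wan_join_rules(n) for n in (2, 5, 10, 100)])  # → [1, 10, 45, 4950]
```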
Hierarchical Network Structure
We decided to abandon the fully distributed nature of joining all of those Consul clusters together and tried to find a way where we could pull all of the data — of all of these individual clusters — into one place, and then use that place as the basis for discovery. That would mean less configuration.
Luckily for us, inside Bloomberg we have a special network segment — I'll call it the orchestration segment here — that is very helpful for this. This segment has slightly elevated privileges, in that it can communicate with all of the other segments. That means fewer firewall configurations, which makes it the perfect place to do the absorption of data I described earlier. We put what I call on this slide the management system into this orchestration segment, and it does exactly what I described.
Let me zoom into this management system. This is what it looks like. There's a templating engine — a bit like Consul Template — that pulls data from all of the individual Consul clusters and merges it into one big, unified dataset. It then pushes that dataset to all the different outputs we might need.
One of them is a Redis cache — for instance — that we use as the basis for DNS-based discovery; the engine writes the dataset to that cache, and a CoreDNS instance then serves the DNS data from it. Another output is simply flat files on the file system. We use those as the basis for an HTTP API served through NGINX, and the end result there is more or less like the Consul API.
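The heart of that engine is the merge step. A toy sketch (the snapshot shape here is invented for illustration) of pulling per-segment Consul data into one unified dataset, which the real system then renders out to Redis, DNS, and flat files:

```python
def merge_clusters(cluster_snapshots):
    """Merge per-segment snapshots, shaped as
    {segment: {service_name: [instance, ...]}}, into one unified
    view keyed by service name, tagging each instance with its
    home segment so cross-segment discovery still knows where
    an instance lives."""
    unified = {}
    for segment, services in cluster_snapshots.items():
        for name, instances in services.items():
            for inst in instances:
                unified.setdefault(name, []).append({**inst, "segment": segment})
    return unified
```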
This approach has a few benefits. First, it requires less configuration. That's a bit obvious, because it's the problem we set out to solve with this project, but it's important to mention that we did achieve it — and we were very happy about that.
The second one would be that this is overall more resilient because we put effectively a layer of caching on top of all of the Consul clusters. That would allow us to do some maintenance operations. For instance, we could bring down a full Consul cluster and the data would still be available to our clients through this caching layer.
Finally, it's also interesting to talk about read efficiency. Read efficiency is something you can lose when you go from a fully distributed system to a hierarchical one, because it adds more complexity and some layers of indirection. We worked very hard to make sure we could reclaim some of that efficiency and keep reads fast — because, overall, it's a heavily used system.
This is how we manage the network topology. The two previous challenges were deeply technical. Now I'd like to go through something that is part human, part technical.
Creating Something That’s Entirely Self-Service
This is something that is core to our success within the company, because this is how we get people to like the product and to enjoy using the product — and not have frustration while using the product. There are a few key pieces there. The first one is probably a question that any SaaS provider has to ask themselves at some point. What is the smallest piece of data or the smallest piece of information that I need to give to someone so they can start using my software?
Use of ACLs
For service discovery with Consul, that piece is the ACL: when someone has an ACL token, they can use it to write service information to Consul. We set up the ACLs we gave away in such a way that the person gets full access — full ownership — of a service prefix.
Getting full ownership of a service prefix means that they effectively have their own namespace where they can play and experiment however much they want without risking disturbing other users of the discovery system.
The second thing is that we built a bunch of what I call wrappers or helpers. The main work we put in there was, first, the API wrapper. The API wrapper means that we don't expose the Consul API directly to our clients. We built our own API that then calls into the Consul API.
The main reason is that this allows us to reduce the feature set we expose to our users. This has the effect of lowering the barrier to entry — because they don't need to learn about all of Consul, just about the subset they're interested in.
Secondly, it allows us to make sure that only the features that we have optimized our architecture for are exposed. So, we are not going to expose features that might endanger the discovery system.
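A tiny sketch of what such a wrapper-side check might look like — the operation names and prefix convention here are invented for illustration, not Bloomberg's actual API:

```python
# The wrapper exposes only a small, curated set of operations,
# never the full Consul API surface.
ALLOWED_OPS = {"register", "deregister", "discover"}

def authorize(op, service_name, owned_prefix):
    """Check a request before forwarding it to Consul: reject
    operations we don't expose, and confine writes to the
    caller's own service prefix (their namespace)."""
    if op not in ALLOWED_OPS:
        raise ValueError(f"operation {op!r} is not exposed")
    if op != "discover" and not service_name.startswith(owned_prefix):
        raise PermissionError(
            f"{service_name!r} is outside your prefix {owned_prefix!r}"
        )
    return True
```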
Command Line Interface
On top of this, we built a CLI — a command-line interface — program that lets people use the API without even coding. They can just download the program, run it, and it will do all the work with the API for them. The good thing is — as I said — less coding; it's easier to use. It also lets us discover a lot of the parameters to the API from the environment, because the CLI program runs on the client's machine. A lot of the data the API needs is discoverable, which means we need to ask less of our users, and it's simpler for them.
Finally, we built a bunch of libraries. There's one example there, in Python. The idea is that users don't need to know about DNS, HTTP, or Consul to use service discovery.
They only need to know that they want to discover an instance of a service — and that service is part of a namespace. If they know the name of that instance and the name of the namespace, they can discover the address, ports and everything they need to know to use that service. This is what this library achieves — abstracting away a lot of the concepts.
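In spirit, the library boils down to something like this sketch (the function name and the in-memory registry are invented stand-ins; the real library resolves over HTTP or DNS):

```python
import random

def discover(namespace, service, registry):
    """Resolve one instance of `service` within `namespace`.
    The caller never sees DNS, HTTP, or Consul — just a
    (host, port) pair. `registry` stands in for the remote
    lookup the real library performs."""
    instances = registry[f"{namespace}/{service}"]
    return random.choice(instances)  # naive load spreading
```

A caller only ever writes something like `discover("payments", "gateway", registry)` and connects to the result.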
Tutorials and Documentation
Something that is often overlooked is tutorials and documentation. This is one of the main things that made the self-service experience with our product very nice and not frustrating.
We put a lot of effort into building step-by-step tutorials. This is an example of one of them — and I do mean step-by-step. We guide users through how they can get their ACL, how they can use that ACL to do things, and how they can use all of our libraries and the different APIs.
This is a big investment at the beginning to get nice tutorials. But the reward is that when you get to hundreds or thousands of people using your service, they don't ask as many questions. They're just very happy to use the system on their own without your intervention. That has been a huge time saver for us.
In terms of lowering the barrier to entry, this product that we call Farms — which I'm about to show — is probably the culmination: the idea is to remove the need for any code at all for someone to use service discovery.
They get to this screen — a web UI. They just need to write the list of machines that will be part of their service and give us the list of checks that they need for that service — HTTP, TCP. They click the submit button, and then we do all the work for them in the background: register everything into Consul and make it available in our discovery. Zero code needed, and their service is advertised.
ZooKeeper Migration — A Success Story
I'd like to close with a small success story from the company — a story of how focusing on discovery can, in the end, lead to very nice things. It's the story of another team that provides Apache ZooKeeper as a service within Bloomberg. They decided to use our discovery system as the basis for their discovery needs. As it turns out, different versions of ZooKeeper ended up having different needs for discovery. I've put two examples there that I'm going to tell you more about.
One is the connect string — a piece of configuration that you need to give to your ZooKeeper instance so it can do discovery. The other one is just an endpoint that you need to connect to, to do discovery.
If you think about it, what would be required to build a specialized system for each of these options? Well, the connect string, as I said, is a piece of configuration. That means you would need a proper configuration management system. The endpoint works a bit like a DNS system — so you would need to build your own DNS system to support that branch.
As it turns out, if you have a proper discovery solution that solves discovery for you — as we provide — it only requires a tiny bit of software around it to provide both solutions. We ended up saving them a lot of time by having a solid base that solved discovery for them. They just needed to build the business logic on top of that. The added bonus is that their clients could now use all of our helpers and libraries to do their work.
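The connect-string case really is only a tiny bit of software once discovery is solved. Roughly, assuming discovery hands you a list of (host, port) instances:

```python
def connect_string(instances):
    """Build a ZooKeeper connect string ("host1:port1,host2:port2")
    from a list of discovered (host, port) instances — the thin
    business-logic layer on top of the discovery base."""
    return ",".join(f"{host}:{port}" for host, port in instances)
```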
That was Mike Boulding and myself — and have a good day.