Bloomberg's Consul Story: To 20,000 Nodes and Beyond
Hear the story of how Bloomberg used Consul to create their own service discovery SaaS for their heterogeneous IT environment.
Bloomberg has an interesting heterogeneous software and hardware environment consisting of many architectures, operating systems, and thousands of machine types.
Here are some more stats:
- 5,000+ software engineers
- One of the largest private networks in the world
- 120 billion pieces of data from the financial markets each day, with a peak of more than 10 million messages/second
- 2 million news stories ingested/published each day (500+ news stories ingested/second)
- News content from 125K+ sources
- Over 1 billion messages and Instant Bloomberg (IB) chats handled daily
To manage tasks like service discovery, Bloomberg runs HashiCorp Consul.
This talk will discuss some of the challenges involved when integrating Consul into an existing network of 20K+ nodes. First they had to design a way for nodes with a Consul-incompatible network, hardware, or architecture to work. The process involved 2K+ node scale tests.
To make their infrastructure work with Consul, they built Consul ESM (External Service Monitor), and now they're building a service discovery SaaS platform with Consul and Consul ESM.
Software Engineer, Bloomberg LP
Engineering Manager, Bloomberg LP
Rangan Prabhakaran: Good afternoon, everyone. Thank you all for being here. My name is Rangan, and I have with me Mike. We’re both from Bloomberg, and we’re here to share our Consul story. How we got started and what we learned along the way.
A little bit of background about Bloomberg. We were started by the former mayor of New York. The Bloomberg Terminal is our flagship product. We have over 325,000 subscribers who rely on us for real-time market data, news, and analytics around the world. We also produce news and media, and we have more reporters than The New York Times, as well as some of the other well-known media outlets mentioned on the slides.
We have a reasonably large engineering effort. We have over 5,000 engineers, who make up almost one-fourth of our company. To provide our service, we operate our own data centers, and we have over 200 node sites around the world. We do have one of the largest private networks, and 325,000 clients run some part of our software on their workstations as well.
» A diverse environment
With a company that’s our scale and our age, we have a lot of critical applications and a very diverse environment. No surprise, from the title of the talk: We have over 20,000 nodes. A majority of them are Linux, but in addition to that, we also have big-endian, Solaris, and AIX boxes, ESX VMs, and our own private OpenStack cloud, as well as some applications running on Kubernetes within our environment.
We have a lot of infrastructure service providers as well who provide a wide array of both proprietary and open-source infrastructure services—it could be databases, caches, messaging, middleware, just to name a few—to our application developers and that’s where we come in.
Our team, Bloomberg Managed Services—branded name—we enable application developers to create, update, deploy, and monitor the infrastructure managed services in an easy-to-use, self-service fashion. And also to have a uniform and consistent experience across these different services.
This is all grand to say, and it’s easier said than done. What are some of the common challenges that we have as a common team across different infrastructure service providers? Config management, deployment, service discovery, access control, monitoring, telemetry, and having a uniform API, CLI, and UI. These are some of the different highly available frameworks that we aim to provide to our developers at Bloomberg.
We obviously leverage Consul for service discovery, and to talk more about that I have Mike, who is one of our lead developers on the project.
» Service discovery at Bloomberg
Michael Stewart: I am going to be talking about the fun bits: service discovery and building a platform for that at Bloomberg. To begin, I’m going to talk about our setup: some of the motivation behind our Consul architecture; why we’ve set up our servers and client agents the way that we have; how it might look a little bit different or similar to what you’ve seen. Then we’re going to get into our DNS configuration: how we expect our developers and applications to consume service discovery; how we use prepared query templates to make that a seamless and managed experience. And then just some general recommendations for what you do if you’re running a new Consul cluster, just some things you can mess around with, with an existing cluster.
This talk and our work are mostly around open-source Consul. There are a lot of great Enterprise features that you might be able to use in addition to, or in place of, some of the setup that we mention, but we’re going to purposely ignore those for right now.
» Motivation for Bloomberg’s service discovery platforms
Some of the boxes that we needed to check off before we could offer this as a service: We needed to ensure there’s seamless integration. We have so many developers. We have a lot of applications, things that have been written years ago, things that are being written right now. Not all of them were created with service discovery in mind. And we want to ensure that we are capable of providing service discovery via DNS or via rich APIs.
We wanted to make sure that something that existed 10 years ago could begin making use of this platform today. We also needed service discovery to be globally available. We run very large data centers. We need to ensure that, though our various network areas may not have full connectivity between them, we would be able to discover services running at any of those network areas. We want to make sure that this seems global regardless of how it works behind the scenes and that service providers and consumers in Bloomberg don’t have to worry about that.
And lastly, we need to be highly available. We have a really strict SLA, really strict expectations from our external clients. We’re an infrastructure team, so we need to make sure we meet that same standard, that we’re capable of staying alive if a data center fails, that servers falling over leave us in a good state, that we’re still able to perform and answer queries as fast as we would be in a good situation.
» Bloomberg’s Consul architecture
Now just a little bit of our Consul architecture. Here we have a really simplified Consul overview [5:36]. We have 5 servers. We have on the screen 4 client agents, all gossiping with one another with full communication. They’re sending their updates back to the servers with RPC.
You can imagine that there are thousands of these client agents. What we do for our network areas is we run fully separate Consul clusters using Consul open source. This gives us a completely separate gossip layer, and then we’re able to WAN-join these together. So that way the servers can answer a query for a service instance, running in any network area. This is the Consul data center feature. Rather than using it to isolate or segment our data centers, we use it for our network zones.
» DNS and Consul at Bloomberg
We do things a little bit differently than a traditional Consul setup. Right now, up on the screen, is a traditional setup. Any Consul-related DNS query, you would expect to be sending it to your local agent on your machine. It would be able to answer that for you.
We, like many large companies, already have a DNS setup. On the bottom half of the screen we have a simplified view of ours. We have a global BIND setup. Any DNS requests, Consul-related or not, is going to be sent to that. It’s not going to be answered by something on the machine itself.
For Consul, what we do is, we send all of those requests over to our primary Consul servers in our primary network zone. This does lose a little bit. We are going to be putting more load on the central servers, but it allows us to answer a service discovery DNS query on any machine, whether it’s been around for 10 years or it’s a brand-new machine we’re building today with a Consul agent on it. We do lose out on the source information here, if we wanted to make use of Consul’s really nice, smart DNS responses with nearness.
» Getting EDNS support in Consul
So we worked with HashiCorp to get EDNS support in Consul natively. This means that a DNS query can have its source tagged along with it along multiple hops, and then Consul’s able to extract that. We’re building that out right now, and it really allows this setup to work as seamlessly as a traditional DNS setup for Consul.
And nearly all of our DNS queries are made with prepared query templates, and there are a few reasons for that. One is that they’re parameterized. We wanted to make sure that our service providers and consumers can talk in named terms. We don’t just ask for a service name or a service ID. We split that into a few components, and because of prepared query templates and because of some middleware that we have, we don’t need to modify Consul to support that concept. It’s able to extract fields from a request and marshal them into a service name.
It also provides really nice WAN failover. This is the failover between network zones. Without prepared query templates, you may have to have a client that is capable of retrying a query for different Consul data centers, or for us different network zones. This does all that behind the scenes.
And lastly, it provides nearness. It provides a really nice concept of placeholders to extract the source machine. Traditionally, this was only for a local agent, but again we’ve worked with HashiCorp to have EDNS and HTTP source IP addresses extracted if the request has been along multiple hops.
We have an example here of a prepared query. This is a catch-all prepared query essentially saying, “Any prepared query execution with no specific match will hit this.” We can then extract with a regular expression 3 fields. And then we can marshal that into a single service name. This means that our internal format isn’t exposed to our end users. We can still give the full feature set of Consul, but provide a much more managed service discovery-first framework.
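A catch-all template of the kind described here might look roughly like the sketch below. The field layout follows Consul's prepared query template API, but the three-part query name (application, environment, region), the regular expression, and the "internal-" service-name format are all made up for illustration and are not Bloomberg's actual naming scheme. The helper only mimics locally what Consul does when it interpolates the capture groups.

```python
import re

# Hypothetical catch-all template: an empty Name is used by any query
# execution that has no more specific prepared query.
CATCH_ALL_TEMPLATE = {
    "Name": "",
    "Template": {
        "Type": "name_prefix_match",
        # Three illustrative fields separated by hyphens, e.g. "pricing-prod-nyc".
        "Regexp": r"^([^-]+)-([^-]+)-([^-]+)$",
    },
    "Service": {
        # Consul replaces ${match(N)} with the Nth capture group.
        # The "internal-" prefix and field order are invented for this sketch.
        "Service": "internal-${match(3)}-${match(2)}-${match(1)}",
        "OnlyPassing": True,
    },
}


def render_service_name(template: dict, query_name: str) -> str:
    """Locally mimic how the template turns a query name into a service name."""
    match = re.match(template["Template"]["Regexp"], query_name)
    if match is None:
        raise ValueError(f"{query_name!r} does not match the template regexp")
    service = template["Service"]["Service"]
    # Substitute every ${match(N)} placeholder with the matching group.
    return re.sub(
        r"\$\{match\((\d+)\)\}",
        lambda m: match.group(int(m.group(1))),
        service,
    )
```

With this sketch, a consumer would ask for "pricing-prod-nyc" and never see the internal service name that the lookup is mapped to.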
» Other Consul features used at Bloomberg
Then there’s the Failover NearestN block. This is really important for us. The WAN failover will automatically go to another Consul data center (in our case, again, a network zone) and get a result if it does not find one locally. This means that a service consumer does not need to care. It can discover anything from anywhere. Whether it can connect to that or not, our job is to be service discovery. We want to make sure that we can do that very well.
Near with _ip: This is the feature that we worked with HashiCorp to add. This allows for the EDNS or HTTP source IP to be pulled out. Consul can then go into the catalog, use the network coordinate feature, and give you a smart DNS response, rather than a randomized result.
And lastly, just to point these out, we can set a custom TTL per prepared query. This is really nice because if someone is hammering you for a specific endpoint, you could set a different TTL for a more specific prepared query. You can immediately have that take effect.
IgnoreCheckIDs is another feature that we worked to get added to Consul. It allows you essentially to say, “I want to respect the health of this node or this service, but I know this one known check ID is giving me issues.” This gives you an immediate response to that scenario, and then you can work on the side to fix that as well.
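As a rough sketch, the four options discussed in this section could sit together in one prepared query definition as shown below. The field names come from Consul's prepared query API; the query name, service name, TTL value, and check ID in the usage note are placeholders chosen for illustration.

```python
from typing import Iterable, Optional


def build_prepared_query(
    name: str,
    service: str,
    ttl: str = "10s",
    ignore_check_ids: Optional[Iterable[str]] = None,
) -> dict:
    """Build a prepared query body using the options discussed in the talk."""
    return {
        "Name": name,
        "Service": {
            "Service": service,
            "OnlyPassing": True,
            # Sort results by distance from the caller's source IP
            # (taken from EDNS or the HTTP request) instead of randomizing.
            "Near": "_ip",
            # If nothing healthy is found locally, try the nearest
            # WAN-joined data centers (network zones in Bloomberg's setup).
            "Failover": {"NearestN": 3},
            # Checks whose status should be disregarded for this query.
            "IgnoreCheckIDs": list(ignore_check_ids or []),
        },
        # Per-query TTL, which can be changed without touching agent config.
        "DNS": {"TTL": ttl},
    }
```

For example, `build_prepared_query("pricing", "pricing", ttl="30s", ignore_check_ids=["flaky-disk-check"])` would raise the TTL for a heavily queried service while ignoring one hypothetical misbehaving check.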
» General setup recommendations
Some general recommendations for your setup before we get into some more fun. We highly recommend using ACLs for a Consul cluster if you’re not already, even if you’re in a trusted network. Even if they’re in plaintext and you’re not encrypting your communications, they ensure, for something that has a production SLA, that any action taken on your cluster is intended, at minimum.
Custom metadata is something that we think is really powerful in Consul: being able to tag your services or your nodes with pretty much whatever you want. You can go overboard, but we use this to tag some information, such as versions of middleware running on machines, to add some additional context about our environment on the machine where a service instance was registered. This allows us to build some nice visualizations and to make some decisions. It helps us with debugging as well.
Staging is something that I hope everyone is doing already, but I can’t recommend it enough. You should always have identical clusters that you can run your changes against first and run your Consul upgrades against first. We know that we can mess some things up because we have staging clusters. It allows you to iterate faster. We’ll get into some challenges later, but in at least one situation, it really helped us to have staging clusters.
Monitoring, alarming: You’re scaling a Consul cluster up to 20,000 nodes. You need to know if the gossip layer is keeping up. You need to know if the servers are capable of handling the load, how much memory they’re using, how long a DNS request is taking to serve.
It’s really important that you have dashboards, you’re monitoring your log traces before you have any issues, from day 1.
And then, lastly, access logs. It’s something you could add at any point to a system, but as an infrastructure service that we are trying to have used on 20,000 nodes by 5,000 developers, we want to know what the next feature we should build is, or when we can deprecate something. So access logs are really important and critical for really any piece of core infrastructure.
That’s the basis for our Consul setup. I’m sure you can find similar setups out there.
» Consul and the External Service Monitor
Now we’re going to get to the fun bits: external service monitoring.
So, what is an external service? It is just any service that you want monitored via Consul that is not running on a node that is running Consul itself. And there are a few reasons why you might be in this scenario.
One would be if you are not running on architecture that supports Consul. In our case this might be big-endian. I’m sure there are many architectures that I’m not aware of, but there are situations where you can’t run a Consul binary on a machine.
Probably more common, it’s possible that you have limited network connectivity. Maybe it’s a restricted machine—it can only accept incoming traffic—or it’s a machine that can only accept traffic from a known list of hosts.
These can be very limiting when it comes to the Serf gossip layer. So you might be in a case where running a Consul agent is something that you can’t do for those reasons. And then, who knows why? There are plenty of reasons that you wouldn’t be able to run a Consul agent. I’m interested in hearing anyone else’s. We have a lot of scenarios where we want to support these nodes that are not running Consul as first-class citizens.
We can’t offer a service discovery platform that covers 90% of our infrastructure. We want to make sure that this is a uniform experience for all of our developers on all of our machines.
So to help with this, we’ve worked with HashiCorp to have the Consul External Service Monitor, or Consul ESM, built. The External Service Monitor is an official HashiCorp-built, open-source extension for Consul. It provides you node and service monitoring, and unlike Consul agents, it has no local dependencies on that machine.
Just to keep in mind before we keep going, the word “extension” is important there. The intended use case right now is that you are running this in a cluster where Consul agents are still your primary entry point into Consul.
This is a really useful tool to cover that last 5% or 10%, or whatever percent, that you cannot cover with Consul agents. But it works really well on a cluster that is already running Consul.
» A detailed look at the architecture
Now, just another slightly more detailed Consul architecture diagram. The only change here is our yellow boxes are representing service instances that we are monitoring. These are the same open-source Consul servers, same open-source Consul client agents. We expect these to be available via DNS, via Consul catalog APIs.
So, we move these over. They’re still there. We still have thousands of these client agents and many more service instances on those nodes. And we now add some external nodes. The important detail here is they still have those yellow boxes. We still want them to be monitored. A service consumer should not have to know that these are external nodes. Those yellow boxes will appear the same way.
Today, without the Consul External Service Monitor, the only option you really have is to register these in the catalog, and this is a static entry. You’re telling Consul, “Hey, trust me. This is running at this address. It will stay healthy, or I will tell you if it changes.” You really have no option for automated health checking, because that is expected to be happening at the local agent level. The servers will keep this data forever, but they will not update it themselves.
This is a good workaround, but it’s not a perfect one. It’s not really a fully supported method. You have to build a lot of your own infrastructure to update these in real time.
So, we replace this now. We’re now running the existing Consul agents—no modifications to them—and this Consul External Service Monitor daemon on those machines. These are running in a cluster themselves. They’re participating in the gossip network. They’re using Consul key-value store for coordination between one another, so they’re very easy to manage.
Now we can register those external nodes and the services on the external nodes in the catalog with some additional metadata, telling the ESM that it should go pick those up and start monitoring them. It will then perform health checks. It will then ping those nodes and give you almost a full feature set of Consul, even though you’re not running Consul agents on those nodes.
And this is fully compatible with an existing Consul cluster. This is how we are using the External Service Monitor.
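To make the registration step concrete, here is a minimal sketch of a catalog registration that an ESM instance could pick up. The "external-node" and "external-probe" metadata keys and the check "Definition" block follow the Consul ESM README; the node name, address, service name, port, health URL, and the local Consul address are placeholders. The `register` helper is not called here and assumes a reachable Consul HTTP API.

```python
import json
import urllib.request


def external_registration(node: str, address: str, service: str,
                          port: int, health_url: str) -> dict:
    """Build a /v1/catalog/register body for a node that runs no Consul agent."""
    service_id = f"{service}-{node}"
    return {
        "Node": node,
        "Address": address,
        "NodeMeta": {
            # Marks the node as external so ESM monitors it.
            "external-node": "true",
            # Asks ESM to ping the node and maintain a node-level health check.
            "external-probe": "true",
        },
        "Service": {"ID": service_id, "Service": service, "Port": port},
        "Checks": [
            {
                "Name": f"{service} http check",
                "Status": "passing",
                "ServiceID": service_id,
                # ESM runs this check remotely, from its own daemon.
                "Definition": {"HTTP": health_url, "Interval": "30s", "Timeout": "1s"},
            }
        ],
    }


def register(payload: dict, consul_url: str = "http://127.0.0.1:8500") -> int:
    """Send the registration to Consul (assumed address; no ACL token handling)."""
    request = urllib.request.Request(
        f"{consul_url}/v1/catalog/register",
        data=json.dumps(payload).encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

The registration goes to the catalog rather than to a local agent, which is the key difference from how services on agent-equipped nodes are registered.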
» Using the External Service Monitor
What can the External Service Monitor do? It has some limitations compared to traditional Consul. It is almost purely for service discovery. It can provide you with really rich HTTP and TCP health checks, the exact same way a local Consul agent does, the difference being it will send an HTTP or TCP request from a daemon rather than from the same machine.
But you can still have customized intervals. You’re still using the same Consul APIs to register these. You’re just registering them in the catalog instead of on a local agent.
It can provide you with really nice network coordinates. This was a feature that we really wanted. We wanted to ensure that our big-endian or other machines were considered full first-class citizens. We wanted to be able to have nearness in our prepared queries, even on these nodes. But they’re not in the gossip layer, so we don’t get that for free.
The External Service Monitor is in the gossip layer. It’s running with a local Consul agent, and it’s running in a cluster. So it has a pretty good view of what your network looks like, and it’s pinging these nodes, so it can use latencies to inject into the catalog an estimated network coordinate for all of your external nodes.
This is a really cool feature, and that and health checks really distinguish this from a static catalog registration. It’s also highly available and performant. We’ve worked with HashiCorp to ensure that this runs in a clustered mode using the Consul key-value store. It will shard health checks. If one ESM agent dies, another one should pick it up in its place.
With this type of setup, you can really monitor a lot of nodes, and even more service instances. You can run as many ESM daemons as you want. We are still building this out.
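The idea of sharding checks across a cluster of ESM daemons, and having survivors take over when one dies, can be illustrated with a toy assignment function. This is not Consul ESM's actual algorithm, only a sketch of the general technique; the instance and node names in the usage note are invented.

```python
from typing import Dict, Iterable, List


def assign_checks(nodes: Iterable[str], instances: Iterable[str]) -> Dict[str, List[str]]:
    """Spread external nodes across live monitor instances, round-robin."""
    live = sorted(instances)
    if not live:
        raise ValueError("at least one live monitor instance is required")
    assignment: Dict[str, List[str]] = {instance: [] for instance in live}
    # Sorting makes the result deterministic, so every instance that runs
    # this function over the same inputs computes the same assignment.
    for index, node in enumerate(sorted(nodes)):
        assignment[live[index % len(live)]].append(node)
    return assignment
```

Calling `assign_checks(nodes, ["esm-a", "esm-b", "esm-c"])` and then again with "esm-b" removed shows the remaining two instances covering every node between them.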
We’re really interested in hearing other use cases for this. And we’re really interested in seeing where this project goes. It’s pretty simple right now, but does a very important task. It’s really extensible if there are other use cases.
» Challenges with Consul ESM
Now, some challenges that we faced. This is not a fully inclusive list, but just some things we ran into while we were scaling the Consul cluster up, or some areas where our knowledge wasn’t complete, things that we picked up along the way.
Just to clarify: Anytime we talk about gossip, that is hashicorp/serf. If you have any questions about gossip, its stability, its tunables, the first place to look would be Consul documentation; the next place would be Serf.
The first challenge that we faced was gossip encryption, and not having it. It’s not something that you might expect to need if you’re in a trusted network, but let’s say this scenario up on the screen is true. We have 2 clusters. They’re fully separate. They run their own separate servers. They have their own separate clients. They’re each in their own separate gossip layers.
So, the question will be, “What happens if a single client agent LAN-joins another client agent from the other cluster?” Say I go on a machine and make a bad operator command, and this is in development. For us, this would be 2 staging clusters. Say a bad configuration is rolled out. We’re testing out some changes, and we accidentally have an agent join the wrong agent.
This is not something that Consul would do on its own, but it’s something that’s pretty easy to run into if you’re not careful. You might expect this scenario would be OK. You’d probably just need to fix those 2 nodes, and everything would be fine and happy. But, yeah, there’s a reason there’s fire on the screen.
Because you end up in this scenario. You end up with Cluster AB. Those 2 nodes will gossip with one another. Serf is really good at distributing updates, and Consul is really good at building a cluster out of the gossip layer. That’s one of the purposes of that layer.
So they will all discover one another. The servers will discover one another, and they will be happy, and you will be sad. So, yeah, this is not that easy to split. They will remember each other for, potentially, 72 hours. Which, again, is a great feature when you don’t do this.
So it’s really hard to split, but it’s really easy to avoid. Just use gossip encryption, please. Even if your keys are public, even if you’re in a trusted network, they’re in plaintext, and anyone can go on a machine and look at them. It’s not purely for security. It is also so this scenario can’t happen.
If you give your clusters different gossip keys, they cannot join each other. You might see a log saying that they can’t communicate, and that’s great. That’s what you want to see. And you can enable this on a running cluster. That’s what we did for our production cluster.
Consul’s documentation is really good for this. It has guides on how to enable this in a rolling manner. So that way you don’t have any down time or performance hits.
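As a small illustration of giving each cluster its own key, the sketch below generates two independent gossip keys and lays out the three configuration stages typically used to turn encryption on in a rolling manner. In practice `consul keygen` is the usual way to produce a key; the helper here only approximates it with 32 random bytes, base64-encoded. The `encrypt`, `encrypt_verify_incoming`, and `encrypt_verify_outgoing` names are Consul agent options; the cluster names are placeholders.

```python
import base64
import os
from typing import Dict, List


def generate_gossip_key(num_bytes: int = 32) -> str:
    """Return a random base64-encoded key, similar in shape to `consul keygen` output."""
    return base64.b64encode(os.urandom(num_bytes)).decode("ascii")


def rolling_enable_steps(key: str) -> List[dict]:
    """Agent config fragments to roll out one stage at a time across the cluster."""
    return [
        # 1. Distribute the key, but keep accepting and sending plaintext.
        {"encrypt": key, "encrypt_verify_incoming": False, "encrypt_verify_outgoing": False},
        # 2. Start sending encrypted gossip; still accept plaintext.
        {"encrypt": key, "encrypt_verify_incoming": False, "encrypt_verify_outgoing": True},
        # 3. Require encryption in both directions.
        {"encrypt": key, "encrypt_verify_incoming": True, "encrypt_verify_outgoing": True},
    ]


# One distinct key per cluster, so agents from different clusters cannot join.
CLUSTER_KEYS: Dict[str, str] = {
    name: generate_gossip_key() for name in ("staging-a", "staging-b")
}
```

Because the two staging clusters hold different keys, an accidental LAN join between them fails at the gossip layer instead of merging the clusters.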
» Checking gossip health
The next challenge we’re going to talk about is trusting gossip health. This one is not something that we have hit. It’s more of a hypothetical that we wanted to make sure we had answered before we were using Consul for our production SLA service discovery across 20,000 nodes.
With gossip, with Serf, you have an implicit health check, the Serf health, and that’s great. It is very good at ensuring that any failed node is truly failed. It is very good at not flapping, thanks to the Lifeguard improvements. But if you get into a really diabolical situation, you can even break those assumptions.
So let’s say we have many hundreds of nodes on both sides of this diagram. They’re all in the same Consul cluster. They’re all expected to be in a flat network. They all have connectivity to the servers. What would happen if we broke their connectivity?
Say a firewall rule is rolled out. Something is going on with the network, not a Consul issue, but something that might affect Consul. They still have connectivity to the servers. You could end up in a scenario where they start marking each other as offline. They will come right back, because they still have connectivity to the rest of the cluster, but this shouldn’t really happen. This requires a lot of things to go wrong, but how could we react, or how could we stop this from happening? It turns out it’s really hard to stop this from happening because it breaks a lot of assumptions about gossip, but there are ways that you could prepare if you notice this is happening.
So one way you could stop it would be if you have more WAN-joined clusters. If you know where these clear lines are within your network, then you can stop this. You can have separate gossip layers. You won’t have the nearness between those nodes, but it shouldn’t matter. The option that we are taking is more of a reaction. If you think that this is unlikely but you need to have an answer for your boss of how you would fix it if it were to happen, you can use the IgnoreCheckIDs from prepared queries. So this would allow you to say, “Let me ignore Serf health, purely for service discovery purposes, while my network is having issues. While this firewall is misconfigured, I will immediately stop this flapping.” The flapping will still be happening, but you will be ignoring it. And like I said earlier, that property of a prepared query can be really useful while you are fixing whatever true problem is causing this.
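The reaction described here amounts to adding the Serf health check to a prepared query's IgnoreCheckIDs list. A minimal sketch follows, assuming the check ID is "serfHealth" (the ID Consul uses for its built-in node health check) and that the query definition has already been fetched as a dictionary; the helper name is hypothetical.

```python
import copy

SERF_HEALTH_CHECK_ID = "serfHealth"


def ignore_serf_health(query: dict) -> dict:
    """Return a copy of a prepared query that disregards Serf health."""
    updated = copy.deepcopy(query)
    service = updated.setdefault("Service", {})
    # The field may be missing or null on an existing query.
    ignored = list(service.get("IgnoreCheckIDs") or [])
    if SERF_HEALTH_CHECK_ID not in ignored:
        ignored.append(SERF_HEALTH_CHECK_ID)
    service["IgnoreCheckIDs"] = ignored
    return updated
```

The updated definition would then be written back through the prepared query API, and reverted once the underlying network problem is fixed.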
Another feature, from Enterprise, would be network segments. Not going to go too deep into that, but it will also give you this isolation if you know where your network lines should lie. And then lastly, I believe Consul has recently exposed some more gossip tunables from Serf. We haven’t played around with these much, but in a cluster of this size, it might be worthwhile playing around with some of the Serf timings, just to give nodes a little bit more time to refute their health status changes if this were to happen.
» Using Raft tunables
The next challenge is completely different. It has nothing to do with gossip. This is just some motivation behind why we played with some Raft tunables and a bit of very shallow explanation for the effect of them. I’m not going to go too deeply into Raft and how it works; that would be its whole own talk, which did happen yesterday.
Here we have a simple diagram of 6 servers, which is not an odd number, which should make people upset. So we have 3 servers in 1 physical data center, 3 in another. We want to ensure that they’re treated the same. We don’t want to say that 1 is higher priority than another. We need to be able to handle either data center failing. We want to make sure we don’t have any degradation in performance, that we’re still able to perform reads and writes.
This is a problem because if we cut off that communication now, the expectation would be that at least 1 of the 2 data centers was still a full cluster, or at very least, we could bring down 1 side to leave the other side fully operational until we restored that. What actually happens is you don’t have a majority on either side. Right now we have 3 of 6, 3 of 6. Both sides are trying to have a leader; neither side will ever succeed.
One way you can avoid this, if you do have a stretch cluster across multiple physical data centers, is to have tiebreaker nodes. These are just other nodes that you add, other servers to bring you to an odd number, to make sure that neither location is prioritized over the other. We don’t want to add just a fourth node to one of these, but if I add a tiebreaker node in a third location and it can connect to both sides, suddenly 1 of the 2 sides will have an active leader. That side will be able to perform at the same levels, potentially even faster, since it has less servers to replicate to.
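The arithmetic behind this is the Raft majority rule: a leader needs votes from more than half of all servers. A short sketch of the 3 + 3 split and the effect of a seventh tiebreaker server:

```python
def quorum(total_servers: int) -> int:
    """Smallest number of servers that forms a strict majority."""
    return total_servers // 2 + 1


def can_elect_leader(reachable_servers: int, total_servers: int) -> bool:
    """True if a partition with this many servers can still elect a leader."""
    return reachable_servers >= quorum(total_servers)


# Six servers split 3/3: quorum is 4, so neither side can elect a leader.
assert quorum(6) == 4
assert not can_elect_leader(3, 6)

# Seven servers (3 + 3 + 1 tiebreaker): quorum is still 4, and the side
# that can reach the tiebreaker has 4 servers, so it keeps a leader.
assert quorum(7) == 4
assert can_elect_leader(3 + 1, 7)
assert not can_elect_leader(3, 7)
```

Adding the tiebreaker does not raise the quorum compared with six servers, which is why an odd server count is the usual recommendation.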
Our issue is that we wanted to run these tiebreaker nodes for various reasons. We might want to run them in less resourceful environments or environments where the resources aren’t as guaranteed, maybe bandwidth is shared with some other applications, maybe compute is an issue, but adding more servers does degrade your performance a little bit, and we wanted to ensure these could not become the leader of the cluster. That would be our worst-case scenario: The thing that we did to keep ourselves healthy actually ended up making us less healthy.
To do this, we are looking at HashiCorp Raft. It has a lot of tunables. Consul exposes one, the raft_multiplier. The documentation for this mostly says it’s a number between 1 and 10. Set it to 1 for production; it’s set to 5 by default. Under the hood, what this translates to are a few HashiCorp Raft properties: the election, leader lease, and heartbeat timeouts. Not going to go into exactly what those mean, but the Raft documentation has full definitions.
The gist of it is we set these to a higher value on our tiebreaker nodes, and it makes them significantly less likely to become leader in a cluster. It means they’re going to wait much longer before they attempt to become leader, and that means that our more performant nodes are going to do that first, on average.
It just gives us a more tolerable risk and makes the benefit of these tiebreaker nodes much better than the downside.
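A sketch of the effect, assuming base timeouts of 1000 ms for heartbeat and election and 500 ms for leader lease. These base values are an assumption for this example, not figures from the talk, so check them against the Raft library version in use. The `performance.raft_multiplier` setting is the Consul agent option mentioned above; the value of 10 for tiebreakers is illustrative.

```python
# Assumed base timeouts in milliseconds, before scaling.
BASE_TIMEOUTS_MS = {
    "heartbeat_timeout": 1000,
    "election_timeout": 1000,
    "leader_lease_timeout": 500,
}


def scaled_timeouts(raft_multiplier: int) -> dict:
    """Scale each base timeout by the multiplier, as raft_multiplier does."""
    if not 1 <= raft_multiplier <= 10:
        raise ValueError("raft_multiplier must be between 1 and 10")
    return {name: base * raft_multiplier for name, base in BASE_TIMEOUTS_MS.items()}


# Illustrative agent config fragments: fast timeouts on the main servers,
# slow ones on tiebreakers so they wait longer before starting an election.
PRIMARY_SERVER_CONFIG = {"performance": {"raft_multiplier": 1}}
TIEBREAKER_SERVER_CONFIG = {"performance": {"raft_multiplier": 10}}
```

Under these assumptions a tiebreaker would wait on the order of ten times longer than a primary server before attempting an election, which is what makes it unlikely, though not impossible, for it to win one.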
» DNS tunables
Now on to some DNS tunables. These are potentially less of a challenge, just more things to be aware of that maybe we weren’t at the beginning. Stale results: If they’re not enabled, I’d recommend enabling them, purely for service discovery. If our servers are in a degraded state (in the middle of a leader election, whatever it might be), it is better for us to get some results than no results. If they happen to be a little bit out of date, we can fix that, we can alarm on that. But it’s really important that we stay highly available, so that in that previous slide, if 1 data center does not have an elected leader, we can still respond with something while we are recovering.
Truncation: Consul and any DNS server can set a truncation flag on DNS response. That will let the DNS client—whether it be BIND, dig, or anything you are using—it will let it know that it maybe should retry that request with TCP, get a more full response. We use Consul sometimes to get a full list of available endpoints, meaning that we max out a UDP packet frequently. So if you don’t have this flag enabled, you might be curious why you’re not getting a full set of results. This could be it.
» Setting TTL values
And lastly, TTLs: Something to keep in mind. You should set these to a sane value and think about it, and we really recommend that you’re able to change these on the fly if you notice an issue. If you have dashboards or access logs and you see that something’s being hammered, but TTLs are being respected, then you can raise that value and guard the rest of your infrastructure. It’s also important to make sure that your clients are respecting them, and that you can track down any misconfigured DNS clients out there.
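The three DNS tunables discussed here map onto Consul's `dns_config` agent block. The sketch below builds such a fragment; the option names are Consul's, while the specific durations are illustrative values rather than recommendations from the talk.

```python
import json


def dns_config(service_ttl: str = "5s", max_stale: str = "1h") -> dict:
    """Build an agent config fragment covering stale reads, truncation, and TTLs."""
    return {
        "dns_config": {
            # Let any server answer, even without a leader, rather than fail.
            "allow_stale": True,
            # Upper bound on how out of date a stale answer may be.
            "max_stale": max_stale,
            # Set the truncation flag on oversized UDP responses so that
            # clients know to retry over TCP for the full result set.
            "enable_truncate": True,
            # Default TTL for all service lookups ("*" is the wildcard).
            "service_ttl": {"*": service_ttl},
        }
    }


def render(config: dict) -> str:
    """Serialize the fragment as JSON, ready to drop into a config directory."""
    return json.dumps(config, indent=2, sort_keys=True)
```

TTLs set this way require an agent configuration change, whereas the per-query TTL on a prepared query can be adjusted through the API, which is the more convenient lever when something is being hammered.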
And some final challenges: Raft recovery: Essentially our point here is you should be able to bootstrap your Consul cluster from scratch if you need to. You should be able to fix your servers if they are in a misconfigured state. This isn’t something that you’d ever want to do on production, but it’s something that you shouldn’t be in production without the ability to do.
This we frequently test on our staging clusters, sometimes intentionally, sometimes not so much. You know, when you LAN-join 2 staging clusters, you get to test if you can bring them back up from scratch really quickly.
And gossip conflicts: This is significantly less likely in Consul today, but this is what would happen if you have 2 nodes in the gossip layer with the same node ID, but they are 2 different nodes.
If you have a lot of these conflicts, then you can destabilize the gossip layer. So if you do see logs on your servers, on your agents, about conflicting node IDs, I would definitely try to fix that. It’s something that Consul does much better with now, but you could still run into issues if you copy data directories, bootstrap a VM from another VM, or really if you see it happen, try to track down why.
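One way to track these down is to compare the node IDs that machines report and flag any ID claimed by more than one node name. The sketch below assumes an inventory of records with "ID" and "Node" fields, shaped like entries from Consul's catalog node listing; how that inventory is collected (for example, from each agent's own view of itself) is left open, and the sample names are invented.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def find_node_id_conflicts(nodes: Iterable[dict]) -> Dict[str, List[str]]:
    """Map each node ID used by more than one node name to those names."""
    names_by_id = defaultdict(set)
    for node in nodes:
        node_id = node.get("ID")
        if node_id:  # skip records without an ID
            names_by_id[node_id].add(node["Node"])
    return {
        node_id: sorted(names)
        for node_id, names in names_by_id.items()
        if len(names) > 1
    }
```

A VM cloned together with its Consul data directory would show up here as two node names sharing a single ID.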
» Bloomberg’s integrations for Consul
And lastly, just going to go over some integrations that we have for Consul. We aren’t exposing our users, service providers, or consumers at Bloomberg to raw Consul. We are trying to give them the full service discovery feature set of Consul. There are a few reasons that we built middleware around Consul. (There’s hopefully a lovely word cloud with a lot of words. I’ll say some of them.)
Middleware: We really wanted to abstract our users away from Consul. We wanted to make sure that—they still knew they were using Consul, they still got all of the benefits—but they could talk in multiple name fields instead of a service name or they get automatic tagging of their service instances and nodes with some environment metadata.
We wanted to make sure it’s uniform, so our middleware will ensure that if I’m running on a Linux node, if I’m running on a big-endian node, really anywhere, I will be able to register using 1 API call. Sometimes it goes to my local Consul agent, sometimes it goes to the Consul servers, so that way the External Service Monitor can pick that up. But it’s all 1 API internally. We want to make sure it’s tightly integrated in that form and supported in our staging clusters, so we can play around with it. It is a lot of effort to keep something like this up to date, but it is worth it for a cluster this size, for an engineering effort of our size, to us at least.
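The routing decision inside such a middleware layer can be pictured with a small sketch. Everything here is hypothetical (the function name, the server URL, and the single boolean input); the two paths are Consul's agent and catalog registration endpoints, and the real Bloomberg middleware is not public.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegistrationTarget:
    base_url: str
    path: str


# Placeholder addresses for illustration only.
LOCAL_AGENT_URL = "http://127.0.0.1:8500"
CONSUL_SERVERS_URL = "https://consul-servers.example.internal:8501"


def registration_target(node_runs_agent: bool) -> RegistrationTarget:
    """Pick where a registration should go, hiding the choice from callers."""
    if node_runs_agent:
        # Normal case: register with the agent on the same machine.
        return RegistrationTarget(LOCAL_AGENT_URL, "/v1/agent/service/register")
    # External node: register in the catalog so that ESM can monitor it.
    return RegistrationTarget(CONSUL_SERVERS_URL, "/v1/catalog/register")
```

Callers see one registration call either way, and only the middleware knows whether a given node is backed by a local agent or by the External Service Monitor.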
Internally, we don’t interact with raw Consul, at least people off of our team do not. We built our own REST server, our own CLI using OpenAPI. We try to provide the best user experience to our 5,000-plus engineers that we can. And again, the purpose of a CLI is we have a lot of applications that were built 10 years ago or even yesterday, but not with service discovery in mind, so we wanted to make sure that you could throw something in a start and a stop script and suddenly you’re in our service discovery platform.
User interface: We have built our own. It’s not open source, but there are a few reasons I wanted to show it off, just to show what is possible, talk about some of the differences, and why, for a cluster of our size, it might make sense to build our own on top of Consul’s catalog APIs, in addition to the Consul UI.
We wanted something really user-friendly. Consul’s UI exposes its full feature set, which is great, especially for us managing the Consul clusters. But we wanted to make sure that we only expose our users to service discovery and that we only expose our users to our terms within that. We wanted something really performant. Consul’s getting much better at this with its UI lately, but with 20,000 nodes and many, many more service instances, that’s a lot of data to load.
So, we extract some of that, we use node metadata to give you a top-down view, and we get to integrate this. We get to talk in our terms. So we can do things that Consul can’t do because it doesn’t know of our environment. It doesn’t know our clusters, it doesn’t know how we tag our machines. So for a cluster of this size, it made sense for us to put in this effort.
Going to show off some screenshots to show what’s possible, to tie back to earlier points. Here we just have a view of some of our clusters. This doesn’t look like Consul’s UI at all, but it is being pulled straight from the Consul catalog. It is exactly the same data.