HashiCorp co-founder and CTO Mitchell Hashimoto outlines the problem space for Consul Connect and announces Consul 1.4.
Hello, everyone. I want to start by talking about Consul. The room is pink for Consul. It’s been a really big year for Consul. Last year at HashiConf, we announced Consul 1.0. Consul has been in use by thousands of companies at very high scale for many years. Last year, we were super excited to announce 1.0. This year, we’ve been focused on improving those existing features of Consul, but we also, earlier this year, introduced a brand-new feature into Consul for secure service connectivity called Connect.
I want to start by taking a step back to talk about the original problem that we built Consul to solve. Doing this helps explain and paint the picture for why we’re building this new functionality directly into Consul and what the future holds for Consul. The original problem that we built Consul to solve was the new challenges introduced by breaking down monoliths into discrete services.
Historically, we started with a monolith. With monoliths, when one subsystem wants to talk to another subsystem, such as A to B here, it was pretty easy. It was an in-memory function call. One class calls an exported function of another class, all the data’s in process, the network’s not involved at all. It’s fairly straightforward. When that monolith needs to talk to one of its upstream dependencies, like a database, then that’s usually configured with a static IP.
In the monolith view of the world, there are not many upstream dependencies. Maybe there’s a database, maybe there’s a cache, but there are not many things. So we could just assign static IPs, hard-code them into the configuration files, and just connect directly. Then when we want to control who has access to this thing, we put a load balancer in front of it and we scale horizontally by adding more monoliths or we scale the backend tier similarly.
Then to protect all this from a security perspective, generally you would break this into a few discrete zones. You would have the demilitarized zone in the front, the Wild West, where everyone’s trying to hurt you. You’d protect that with a firewall, allow application users into your application, and then you would protect very sensitive data, the actual data lake, with another firewall in the back so that only applications access data and they’re not behind the load balancer, allowing user access. A pretty straightforward and common approach.
That’s all well and good. What changed when we started splitting monoliths into microservices? We started by taking this and realizing we could break this up into discrete network-connected services. The first question is why would we even want to do this? Well, breaking it up has a number of benefits, and it lowers the surface area of each individual application that we’re deploying, so it’s easier to reason about what features we should add, how do we test it, how do we test it across the board from unit testing, integration testing? We could deploy separately from the other subsystems, and also more downstream services can consume the same shared upstream service. If you had 5 applications before that all needed subsystem B, they don’t need to compile in a copy of subsystem B into each one, they can all network connect back to one.
There are a lot of benefits here. But there’s no real free lunch here. By breaking down these services and getting more agility in what we do, we’ve also made a tradeoff for new operational challenges. The first challenge we run into is that of discoverability. Before, when subsystem A needed to talk to subsystem B, the service discovery tool that handled this for us was our compiler and our linker. They would discover the right memory address to jump to and you would just jump there and execute code. Now, these are network-connected services. The compiler and linker cannot be used for service discovery, and we have a challenge of, How do we find how to talk from A to B. That’s problem number one.
Problem No. 2 is configuration. When you have one monolith, usually it’s fairly easy to configure this thing. That era of application is also heavily correlated with not very many replicas—large servers and not very many replicas. You would just put a giant configuration file that configured all the various subsystems of your application right next to it. Maybe this is a giant XML properties file or whatever it is. It’s not hard to orchestrate. There are not very many servers; you just put it there.
Now you have many more services, and in this basic example of 4, it’s not hard to comprehend, but realistically you’re aiming for hundreds or thousands of services, and now there’s a real, nontrivial challenge: How do we distribute configuration? How do we try to make that configuration almost atomic in its shifts so we’re not waiting an hour where one service is configured with version 1 and one is configured with version 2? And how do we get visibility into what that consistent configuration should be? There are more questions that arise here, but the general category and problem is that of configuration.
The last challenge is this security challenge, and the one that I’ll dive into a little bit more. So in the monolith world, we’ve solved security by breaking up our zones into 3 separate zones separated by firewalls. The way we do this, the category of solution that we use here is segmentation. When we talk about network segmentation, it’s the concept of breaking down a network into a number of sub-networks and controlling the access between them.
The issue with this is it’s a fairly coarse-grained approach that has absolutely no idea of the applications running on top of it. You’re saying segment A could talk to segment B or vice versa, and if you deploy the application into the wrong segment, then you’ve inherently broken the functionality of that application because the security is defined at a different layer. But when you don’t have that many applications and you have a fairly flat network, this is an easy approach to do. It solves the problem by separating the security problem from the deployment problem from the developer problem. It’s a traditional historic siloed approach to breaking down problems. It works when it’s small.
The challenge with microservices and when you start breaking things down is the access patterns aren’t as clear. You don’t have a clear external-to-internal, internal-to-database data pipeline happening in your infrastructure anymore. Now the arrows are going a little bit everywhere. So in this example, we have A talking to B and C. We have C and D talking bidirectionally. We have B talking to D, and nothing else should be able to talk to each other. If you try to imagine: “How do I solve this with firewalls?” you might be able to see a solution, but you’re already requiring here at a minimum of 4, 5, 6 firewalls between these various services and the complexity of configuring these.
People try to do this. Very commonly, they try to do this. The real challenge is as you scale this up. Because the reality of adopting microservices is not that you’re breaking down a monolith into 4 services—maybe that’s the initial step—the reason people adopt microservices is the expectation that one day you’re going to have dozens or hundreds or thousands of services, and it’s going to look like this. So now when you look at these connections going around, try to imagine how you configure firewalls. Whose job is that? What effect does that have on your deployment speeds? And you start to realize the challenge.
The other thing we’ve realized is that these companies breaking down into microservices very often aren’t single-cloud. They’re on multiple clouds and multiple platforms. So you’ll have some things on cloud, you’ll have some things on a platform like Kubernetes, and you’ll have some things on-prem. They’re not silos; they want to talk to each other. The database might be on-prem that a service in cloud needs to talk to. How does it get there? How do you do security between these 2 totally separated networks?
Ideally, what we’d like to do is create these pairwise rules that service A could talk to B, service C and D could talk bidirectionally, and then create the other set of pairwise rules around A could talk to C and B could talk to D. We want to be able to make these as our segmentation rules for security.
So these are the 3 problems that we see when you’re breaking down a monolith into microservices. We consider them the fundamental problems of adopting microservices. There are more challenges ahead, but unless you have answers to these, it’s hard to get there. We combine these categories into what we believe a service mesh must solve in order to provide this base functionality.
For many years, Consul has solved the discovery and configuration problem. Since Consul 0.1, we’ve supported DNS for service discovery, a nice lowest-common-denominator approach that any application can integrate with very, very easily to find anything else. Just make the right DNS call, you get an IP address, and you connect right to it. For configuration, we’ve had a KV store in Consul so that there’s a centralized place that supports blocking queries for immediate updates, a central UI so you can see what all the configuration is across your services, and that’s in the KV and has been there since 0.1.
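The "immediate updates" on the KV store are delivered through long-poll requests that Consul's HTTP API calls blocking queries: the client passes the last index it saw, and the request is held open until the data changes or a wait time elapses. Here is a minimal Python sketch of the client-side loop. The `fetch` callable is a hypothetical stand-in for an HTTP GET against the KV endpoint (sending the `index` query parameter and reading the `X-Consul-Index` response header), and the fake implementation exists only so the sketch runs without a Consul agent.

```python
from typing import Callable, Iterator, List, Optional, Tuple

# Hypothetical transport: fetch(key, last_index) -> (current_index, value).
# A real client would issue an HTTP GET and block until the index moves
# past last_index or the wait time elapses.
Fetch = Callable[[str, int], Tuple[int, Optional[str]]]


def watch_key(fetch: Fetch, key: str, max_rounds: int) -> Iterator[Optional[str]]:
    """Yield the value of `key` each time its index changes."""
    index = 0  # index 0 means "return immediately with the current state"
    for _ in range(max_rounds):
        new_index, value = fetch(key, index)
        if new_index < index:
            # The index went backwards (for example after a restore):
            # start over from scratch rather than waiting forever.
            index = 0
            continue
        if new_index != index:
            index = new_index
            yield value
        # Same index: the wait timed out with no change, so just poll again.


def make_fake_fetch(responses: List[Tuple[int, Optional[str]]]) -> Fetch:
    """Serve canned responses in order, ignoring the arguments (test helper)."""
    it = iter(responses)

    def fetch(key: str, index: int) -> Tuple[int, Optional[str]]:
        return next(it)

    return fetch


if __name__ == "__main__":
    fake = make_fake_fetch([(5, "v1"), (5, "v1"), (9, "v2")])
    for value in watch_key(fake, "app/config", max_rounds=3):
        print("config changed:", value)
```

The second canned response repeats index 5, which models a wait that timed out without a change, so the loop reports only two updates.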
The segmentation problem is one for which, historically, we’ve had our users look elsewhere. You’ve had to find other tooling to solve this problem. A lot of our users were solving this by tying Consul very closely to systems like HAProxy or other firewall-type layers in front of their services. But we heard time and time again from our users and customers that Consul was just so close to solving this for them, and they wanted to see the solution. So for the past couple of years we’ve been planning how to make this a reality. We talked to some users and customers last year as well about previewing what we were planning on doing. Then earlier this year at HashiDays Amsterdam, we announced Consul Connect.
This is built directly into Consul to solve this problem. This is also why it’s not a separate tool and is built directly into Consul, because we can’t solve the segmentation problem without also knowing where our services are and what their configuration is. So, naturally, if we built this separately, we would have had to build a plug-in interface. We would have integrated very closely with Consul anyway. By doing it all in one, we’ve dramatically simplified the deployment and made Connect a much more adoptable, practical tool.
So that was Connect. It’s a feature for secure service-to-service communication using automatic TLS encryption and identity-based authorization. It works everywhere. I’m just going to quickly go over what Connect is in more detail rather than the problem.
Connect is composed of 3 major features: a service access graph, mutual TLS with a CA, and a pluggable data plane. The service access graph is what defines which services can communicate with each other. We use something called “intentions” to define which services can or cannot communicate with each other. These are defined using the service names and are completely separate from the hosts that they’re on. So you could deploy these services anywhere. You don’t need firewalls. You don’t need to do network segmentation, because the security is handled at the service-name layer and not the IP layer.
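As a rough illustration of a service access graph, here is a toy Python model of the example from earlier in the talk (A talks to B and C, C and D talk bidirectionally, B talks to D). The dictionary layout, the `is_allowed` name, and the default-deny behavior are assumptions made for this sketch, not Consul's actual data structures or API.

```python
from typing import Dict, Tuple

# Each key is (source service, destination service); the value says whether
# the connection is allowed. Only service names appear, never IP addresses.
INTENTIONS: Dict[Tuple[str, str], bool] = {
    ("A", "B"): True,
    ("A", "C"): True,
    ("C", "D"): True,
    ("D", "C"): True,  # C and D talk bidirectionally, so both directions are listed
    ("B", "D"): True,
}


def is_allowed(source: str, destination: str,
               intentions: Dict[Tuple[str, str], bool] = INTENTIONS,
               default: bool = False) -> bool:
    """Look up the rule for a service pair, falling back to the default (deny here)."""
    return intentions.get((source, destination), default)


if __name__ == "__main__":
    for src, dst in [("A", "B"), ("B", "A"), ("D", "C"), ("A", "D")]:
        verdict = "allow" if is_allowed(src, dst) else "deny"
        print(f"{src} -> {dst}: {verdict}")
```

Because the rules name services rather than hosts, moving an instance of B to a different machine or network changes nothing in this table.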
Another benefit of this is it’s scale-independent. If you imagine you have, let’s say, 10 copies of a web server and 3 instances of a database and they’re running on separate hosts, the firewall rules for that are either coarse-grained, where you’re saying, “This block of IP addresses, which might include IPs that aren’t running the application, can communicate” (it’s like what Armon said earlier; it’s the simple-policy side), or highly complex, where you create a pairwise “This IP can talk to this IP” rule for every single instance, and you end up with a multiplicative explosion of firewall rules. Either way is really complicated and really expensive because there are a lot of rules to process, and it’s hard for both people and technology to manage.
The service access graph is a lot simpler. Since it happens at service names, it’s completely scale-independent. Ten web servers, 1,000 web servers, 3 databases, 300 databases—it’s just 1 rule, right? The service can or cannot talk to this other service. That’s really important, because that enables us with Consul to replicate the whole service access graph across the entire cluster really cheaply for incredibly high-scale deployments.
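The arithmetic behind "scale-independent" can be sketched in a few lines of Python. The IP addresses and function names below are made up for the illustration; the point is only that per-instance rules grow as the product of instance counts while the service-level rule count stays at one.

```python
from itertools import product
from typing import List, Tuple


def pairwise_ip_rules(sources: List[str], destinations: List[str]) -> List[Tuple[str, str]]:
    """One firewall rule per (source IP, destination IP) pair."""
    return list(product(sources, destinations))


def service_rules(source_service: str, destination_service: str) -> List[Tuple[str, str]]:
    """One rule per service pair, regardless of how many instances exist."""
    return [(source_service, destination_service)]


if __name__ == "__main__":
    web_ips = [f"10.0.1.{i}" for i in range(1, 11)]  # 10 web servers
    db_ips = [f"10.0.2.{i}" for i in range(1, 4)]    # 3 databases

    print(len(pairwise_ip_rules(web_ips, db_ips)))  # 30
    print(len(service_rules("web", "db")))          # 1
```

At 1,000 web servers and 300 databases the pairwise count becomes 300,000, while the service-level count is still 1.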
The second thing we have to do is establish identity and security and encryption. We do this using mutual TLS. TLS is a perfect protocol for us to use here, because TLS was designed specifically for zero-trust networks, the public Internet. That’s designed for a case where the client can’t trust the destination. It’s happening over untrusted networks, and we still need to establish trust in some way. That’s exactly the mindset and philosophy we’re trying to bring into private data centers as well, cloud private data centers. We do this using TLS. TLS provides the identity and encryption. So the identity is baked directly into the certificate. The encryption happens by nature of the transport protocol of TLS.
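To make "mutual" concrete: in ordinary TLS only the server presents a certificate, while in mutual TLS the server also demands one from the client. Below is a minimal sketch using Python's standard `ssl` module. The function names and the optional file-path parameters are assumptions for the illustration, and this is generic mutual TLS, not the Connect implementation; a real deployment would load a certificate chain and the CA roots on both sides.

```python
import ssl
from typing import Optional


def server_context(ca_file: Optional[str] = None,
                   cert_file: Optional[str] = None,
                   key_file: Optional[str] = None) -> ssl.SSLContext:
    """Server side: present our certificate and require one from the client."""
    ctx = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH, cafile=ca_file)
    # This is what makes the handshake mutual: clients without a certificate
    # signed by a trusted CA are rejected.
    ctx.verify_mode = ssl.CERT_REQUIRED
    if cert_file and key_file:
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return ctx


def client_context(ca_file: Optional[str] = None,
                   cert_file: Optional[str] = None,
                   key_file: Optional[str] = None) -> ssl.SSLContext:
    """Client side: verify the server and present our own certificate."""
    ctx = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile=ca_file)
    if cert_file and key_file:
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return ctx


if __name__ == "__main__":
    print(server_context().verify_mode == ssl.CERT_REQUIRED)
    print(client_context().verify_mode == ssl.CERT_REQUIRED)
```

With both contexts requiring verification, each side learns the other's identity from the certificate before any application data flows.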
We also built APIs for getting the public certs and issuing new certs directly into Consul. Consul does this by fronting real CA providers. Consul itself has a built-in CA, so you could use Consul right away, but it’s completely pluggable so you can bring in Vault. We’re working on integrations with Venafi and many more CA providers. But in front of that is a consistent API for verifying certs, signing certs, and more that every application and Consul agent can use.
The really powerful feature that we built on top of this is built-in rotation. One of the biggest challenges with TLS and mutual TLS everywhere and many certs everywhere is, How do we safely rotate them? Initially, getting them out is a tractable problem, but then thinking about, “We’d like to rotate those relatively quickly,” becomes challenging. Consul handles this automatically for you. So we advertise short-lived certificates; I think the default’s around 3 days. After 3 days, we automatically rotate, serve both, and after all the traffic drains out of one, start serving exclusively the other one.
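The rotation decision itself is simple to sketch. The Python below assumes the roughly 3-day lifetime mentioned above and introduces a hypothetical `renew_fraction` so that a replacement is requested before the old certificate expires; neither the function name nor that parameter comes from Consul.

```python
from datetime import datetime, timedelta, timezone

# Assumption for the sketch: certificates live for about 3 days, per the talk.
LEAF_TTL = timedelta(days=3)


def should_rotate(issued_at: datetime, now: datetime,
                  ttl: timedelta = LEAF_TTL,
                  renew_fraction: float = 0.7) -> bool:
    """Return True once the certificate has used up `renew_fraction` of its lifetime.

    Renewing early leaves a window in which the old and the new certificate
    are both valid, so existing connections can drain before the switch.
    """
    return now >= issued_at + ttl * renew_fraction


if __name__ == "__main__":
    issued = datetime(2018, 10, 1, tzinfo=timezone.utc)
    print(should_rotate(issued, issued + timedelta(days=1)))  # False
    print(should_rotate(issued, issued + timedelta(days=3)))  # True
```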
More impressively, we also rotate the CAs for you. So you could rotate intermediate keys, root keys, and the whole CA provider. You could switch from the built-in to Vault to Venafi, and back to Vault, whatever you need as your organization changes. We automatically handle cross-signing CAs, issuing those cross-signed certificates, distributing the right roots, waiting for the traffic to drain, and flipping over so that the whole time during the rotation process your connectivity is preserved across the whole cluster. We could do this across every service, all the time.
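Why cross-signing keeps connectivity intact during a CA change can be shown with a deliberately simplified Python model. Strings stand in for certificates and there is no cryptography here; the names (`is_trusted`, `cross_signed_by`, `root-old`, `root-new`) are hypothetical and chosen only for the illustration.

```python
from typing import Dict, Set


def is_trusted(issuer: str, trusted_roots: Set[str],
               cross_signed_by: Dict[str, str]) -> bool:
    """Accept an issuer that is a trusted root, or one vouched for by a trusted root.

    `cross_signed_by` maps a new root to the old root that signed it.
    """
    if issuer in trusted_roots:
        return True
    signer = cross_signed_by.get(issuer)
    return signer is not None and signer in trusted_roots


if __name__ == "__main__":
    old_clients = {"root-old"}  # peers that have not yet picked up the new root
    cross = {"root-new": "root-old"}

    # A leaf certificate issued under the new root is still accepted by old
    # peers, because the old root has cross-signed the new one.
    print(is_trusted("root-new", old_clients, cross))  # True
    # Without the cross-signature, the same peers would reject it.
    print(is_trusted("root-new", old_clients, {}))     # False
```

Once every peer trusts the new root directly, the cross-signature is no longer needed and the old root can be retired.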
The last thing is the data plane. When we talk about service mesh, it’s important to separate the concept of a control plane and the data plane. The control plane is what’s defining all the rules: Who could talk to whom? What are the configurations of everything? And that’s where Consul firmly sits. It is a control-plane solution. The data plane is the thing that’s sitting in the path of data, verifying the certificates, consuming the configuration from the control plane and reacting to that. For Consul, it’s a pluggable data plane.
We have a built-in proxy in Consul. We just do that so deployment’s easy, but it’s not really meant for production, high-scale usage. For that we bring in pluggable proxies, and you can use whichever proxy you want to use. Visually, what this looks like is this. You have the Consul client, which is the control plane configuring a proxy. You have an application that’s using Connect that wants to talk to another application.
It goes through a local sidecar proxy, which is very performant, just because it’s over loopback and most operating systems optimize this heavily. It goes over the network and then goes into the application. These proxies are totally pluggable. So you could drop an Envoy there. I think, more importantly, you can mix and match.
So one side could be the built-in proxy. The other side could be Envoy. It could be anything else. As long as it speaks the Connect protocol, which is just mutual TLS (it’s not anything novel or exotic), then it all works. This is really important for deployments across heterogeneous workloads, which is the bread and butter of what HashiCorp does. If you have 1 server that’s on Linux, and you have another server that’s on Windows, the best-of-breed proxy, the best-performing proxy, might be totally different, and we allow for that heterogeneity. Or if you have one side that’s not performance-dependent at all and just wants to use the built-in proxy, that’s totally fine, too.
In addition to Connect, we’ve made huge strides on the UI. Consul’s had a UI for many years. We hadn’t updated it in over 3 years. This year we released a brand-new UI that supports every feature of the old one and also adds new features such as Connect intention management. Thanks to our awesome UI team, we’ve also been shipping features right away with support in the UI. So the same day features come out, we’re also supporting them directly in the UI, which historically we’ve never been able to do.
In addition to shipping these features, we’ve also been improving the core of Consul to make stability and scalability better, just for the core features that existed for years. So back in July we were excited to work with a user on seeing the largest single data center deployment of Consul ever. Their single data center deployment was 36,000 nodes. We have larger deployments by node count across multiple data centers, but this was an all-in-one data center, all one single gossip pool/gossip network and it was amazing to see. It didn’t come for free. We had to make dozens of improvements in the core of Consul, but the benefit is that these improvements benefit everybody.
So everybody, as they upgrade Consul, will see lower CPU usage, less network consumption, and more stable leader election under pathological conditions. It just makes everyone’s deployment better thanks to these large-scale deployments. At the same time, we have a user right now that’s preparing to test Connect, which I just talked about, across 20 data centers and thousands of nodes, which is really exciting because that becomes one of the largest production service mesh deployments in the world.
What I want to talk about now is Consul 1.4, which is something that’s available today in a preview version and we’ve been working on for a while. The first thing we’ve done in Consul 1.4 is integrate deeply with Envoy. Envoy is a high-performance, production-hardened proxy. Now you can use that directly with Connect as an option just as easily as the built-in proxy. We automatically configure Envoy. We’ll run it for you and more.
Our Kubernetes integration also uses Envoy by default. So you don’t have to do anything. We’re just running their Docker image for you. We also expose the ability for you to pass through a custom Envoy configuration. So if you want to configure Layer 7 routing, observability, and more, you could do that through Consul through the pass-through Envoy configuration. We’ll look at adding more of those features first-class in Consul, but this is a great way to get all those benefits without having to wait on us.
Just to look at how easy this is: This isn’t just a demo. This is like a real configuration you might see in production. When you define a service, you would define that you want a sidecar for that service. Then to run Envoy, it’s 1 command, consul connect envoy, and then specifying which service you want a sidecar for. What we do there: Envoy isn’t the easiest thing to configure. We generate the Envoy configuration. We set up the right network addresses to point it back to the Consul agent, and then we fork and exec and start whichever Envoy is on the path for you. So you could use multiple Envoy versions as well. This just gets it running really easily.
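The slide itself is not reproduced in this transcript, so the following is a reconstruction rather than the configuration that was shown. This Python sketch generates a JSON service definition that requests a sidecar and builds the matching command line. The service name `web` and port 8080 are placeholders, and the `connect.sidecar_service` field and `-sidecar-for` flag reflect Consul's documented configuration and CLI as best I know them.

```python
import json
from typing import Any, Dict, List


def service_definition(name: str, port: int) -> Dict[str, Any]:
    """Build a service definition that asks Consul to register a sidecar proxy."""
    return {
        "service": {
            "name": name,
            "port": port,
            # An empty sidecar_service block requests a sidecar with defaults.
            "connect": {"sidecar_service": {}},
        }
    }


def envoy_command(name: str) -> List[str]:
    """Build the command that starts Envoy as the sidecar for `name`."""
    return ["consul", "connect", "envoy", "-sidecar-for", name]


if __name__ == "__main__":
    # The JSON could be saved into the agent's configuration directory.
    print(json.dumps(service_definition("web", 8080), indent=2))
    print(" ".join(envoy_command("web")))
```

The command is returned as a list so it could be handed directly to a process launcher without shell quoting concerns.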
The other thing that we’re bringing in Consul 1.4 is a revamped ACL system. The ACL system in Consul hasn’t been touched since early versions. Since then we’ve shipped 2 other ACL systems, policy systems, in Vault and Nomad. We’ve worked with customers at very large scale and learned a lot about how to improve the ACL system and what works at that scale. We’ve now brought the same systems and same learnings that we’ve had there back down to Consul into a new ACL system in Consul.
This has a number of important features. We’ve, first, separated the access token from the policy. So you could have multiple policies and assign tokens to those. You can also now restrict policies by data center. Certain policies are active only in certain data centers. We also support exact-match policies. So before everything was prefix-matched in Consul, which made it difficult in some scenarios to get exactly the right security that you wanted.
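The three ideas in this paragraph (tokens separated from policies, policies scoped to data centers, and exact versus prefix matching) can be modeled in a small Python sketch. All class and function names are hypothetical, and the model is intentionally simpler than a real ACL system: it does not, for example, let write access imply read access or resolve conflicts between overlapping rules.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class ServiceRule:
    name: str
    exact: bool   # True: match only this name; False: match any name with this prefix
    access: str   # "read" or "write"


@dataclass
class Policy:
    name: str
    rules: List[ServiceRule]
    datacenters: Optional[Set[str]] = None  # None means valid in every data center


@dataclass
class Token:
    secret_id: str
    policies: List[Policy]  # a token only points at policies; it holds no rules itself


def rule_matches(rule: ServiceRule, service: str) -> bool:
    return service == rule.name if rule.exact else service.startswith(rule.name)


def allowed(token: Token, service: str, access: str, datacenter: str) -> bool:
    """Grant access if any policy active in this data center has a matching rule."""
    for policy in token.policies:
        if policy.datacenters is not None and datacenter not in policy.datacenters:
            continue
        if any(r.access == access and rule_matches(r, service) for r in policy.rules):
            return True
    return False


if __name__ == "__main__":
    exact_web = Policy("web-exact", [ServiceRule("web", True, "write")], {"dc1"})
    token = Token("example-secret", [exact_web])
    print(allowed(token, "web", "write", "dc1"))        # True
    print(allowed(token, "web-admin", "write", "dc1"))  # False: exact match only
    print(allowed(token, "web", "write", "dc2"))        # False: policy not active in dc2
```

With prefix-only matching, a rule for `web` would also have covered `web-admin`, which is the kind of over-broad grant that exact-match rules avoid.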
Then, finally, as a performance feature, we introduced DC local tokens. Data center local tokens are tokens that aren’t replicated globally. Generally you could have thousands of these in a single data center without pushing any replication burden onto your global deployment. This is super important for performance. The ACL system is completely open source. It’s backwards-compatible with the existing ACL system. We’ve done some neat things to upgrade you.
A big part of that is the UI. This is an example of one of those features where on Day 1, right at launch, we support the new ACL system directly in the UI. A really cool thing we can do with the UI is that when you come in here with a previous ACL system (which, like I said, is backwards-compatible if you’re using the old one), we will guide you through a multi-step flow to upgrade those tokens for you. We’ll guide you through taking the ACLs, turning them into policies, setting up the replication for those policies, and associating them with multiple tokens while preserving the exact same access token ID so that none of your applications need to rotate it. They’ll still keep using the same secret, but in the background we’ll be upgrading that to the new system. So check out the UI. Check out the new ACL system. It all works great.
The last thing that’s exciting that we’ve been working on is Multi-Datacenter Connect. One of the major restrictions of Connect when we launched it in June was that it only worked within a single data center, between 2 services within a single data center. Now with Consul 1.4, a service in data center 1 can communicate with a service in data center 2 with end-to-end TLS encryption across the whole thing, authorized using our intentions. So you can say, “Web can’t talk to DB,” and if DB is in a totally different region, we’ll enforce that across data centers, and we’ll ensure that the connection is encrypted all the way end to end. To make this possible, we replicate and manage the replication of the certificate authority across the multiple data centers. The important thing is this works with any provider.
So the provider backing Consul does not need to be multi-datacenter-aware at all. What we have found as we’ve investigated new CA providers is that very few CA and PKI solutions work across multiple data centers. So using Consul, we replicate and manage that global distribution of certificates for you. This is pretty much magic to see. This is a feature in Consul Enterprise and is available as part of 1.4.
Consul 1.4, as I said, is available today as a preview version. You can go download it right now and see all these features right now.