Connecting Services with Consul: Connect Deep-dive on Usage and Internals
Jun 27, 2018
Paul Banks is one of the engineers that built Consul Connect. In this deep-dive, he explains Consul's architecture, security model, design decisions, and upcoming features.
After HashiCorp co-founder and CTO Mitchell Hashimoto introduced Consul Connect at the keynote for HashiDays 2018, Paul Banks, a HashiCorp software engineer who worked on Connect, provided a deep-dive follow-up.
If you're not very familiar with Consul, this talk starts out with a helpful overview of its client-server architecture. After outlining the architecture, Banks shares how Connect fits into the service mesh category, examining how it provides the control plane and the data plane.
Consul Connect has four features that Banks points out as being key to making Consul's new network segmentation abilities scalable and high-performance:
- Service-level access intentions
- Local agent caching
- Connection-level access enforcement
- Native application integration
Consul Connect also has three key features that help it make service communications highly secure:
- Zero-trust network: Everything inside and outside the datacenter or host is untrusted until verified.
- Proxy to app network trust: Anything that can talk to the private side of a proxy is trusted.
- Service identity rooted in ACLs: Access control lists tie rich policies about who can access what to a secret token.
Consul Connect, like all of HashiCorp's products, was designed with pragmatism in mind. It was built to support both legacy and leading-edge infrastructures. It's protocol and network agnostic, portable, lightweight, and pluggable, taking advantage of the great work others have done in the proxy space (Envoy, HAProxy).
Those are just some of the Connect design decisions Banks explains in great detail. Watch the final five minutes of the talk to learn which features are expected in the GA release of Connect later this year.
Software Engineer, HashiCorp
Thank you. As Mitchell said, my name's Paul and I'm one of the engineers that worked on Connect. I'm super excited to come and explain this new product in a little more depth today. We've already seen this high-level overview. Connect is our new feature for secure service-to-service communication.
If you're watching a recording of this talk later and you didn't already see the keynote that we just saw, then go right back and watch that first, because I'm going to carry on right where we left off there.
In the keynote, we set out some of the motivation and some of these high-level goals we're trying to solve with Connect, and I want to take a little bit more time to dig into how it works and also how it builds on the architecture that we already have in Consul. We'll talk about some of the more important security details and, as Mitchell said, we're going to look at some of the design decisions that we made and a bit of the background for why we chose the things we did. Finally, we're going to talk about how we see this feature developing from today.
» A Look at the Consul Architecture
First up, we're going to talk about how Connect works in a little more detail. Before we do that, not everyone here in the audience is probably a Consul expert, so we're going to review a little bit about how Consul works.
Consul, at its core, has this client-server architecture. We ship a single binary, and the binary can run in either client or server mode. In your data center, you have three or five servers, and then all the other hosts run the agent in client mode. We support multiple data centers and you federate all these data centers together to be a single logical cluster.
The Consul servers are really the brains of your cluster. This is where we store all the state, and there are three or five nodes because we want to make sure that state is highly available and strongly consistent. If the leader node fails, we can automatically fail over to one of the followers and maintain strong consistency. They use the Raft consensus protocol, if you know it, which ensures that every write goes to a majority of these nodes. This is why you need an odd number, three or five usually.
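The majority rule is easy to check with a little arithmetic. The sketch below is illustrative only (the function names are made up for this example) and shows why an even-sized cluster buys no extra fault tolerance over the odd size just below it.

```python
def quorum(servers: int) -> int:
    """Smallest majority of a cluster with `servers` voting nodes."""
    return servers // 2 + 1


def tolerated_failures(servers: int) -> int:
    """How many servers can be lost while a majority is still reachable."""
    return servers - quorum(servers)


for n in range(3, 7):
    print(f"{n} servers: quorum={quorum(n)}, tolerates {tolerated_failures(n)} failure(s)")
```

Four servers tolerate one failure, the same as three, and six tolerate two, the same as five, so the extra node only adds write overhead.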
All reads and writes to this data—that's the service registry for discovery, the key-value store, and now with Connect this service access graph as well—all reads and writes to these things flow through the server nodes. The Consul clients, as I said, these are running on all the other hosts in your data center. They expose both an HTTP and DNS API, and they expose that to the applications that are running locally on the node.
Applications make requests via these APIs, and those requests are transparently forwarded to the servers by the local agent, so your application doesn't have to know anything about where the servers are in your network. The clients just transparently forward for you. The client agents are also responsible for registering these application and service instances with Consul and for running health checks against them to make sure that we will only discover healthy instances of a service.
Blocking queries are a really important aspect of Consul as a control plane. Most of the read endpoints on the HTTP API support long polling so that you can wait for updates to a result efficiently. But it's important to note that HTTP is only being used here on the localhost between the application and the local client. The client issues blocking requests back to the server, but it does it over its custom RPC protocol and it multiplexes multiple in-flight requests over a single TCP connection, so it doesn't matter how many blocking queries your application performs, there's still just a single TCP connection between every agent in your data center and the servers.
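As a rough illustration of the long-polling pattern from the application's side, the sketch below builds blocking-query URLs using the `index` and `wait` query parameters of Consul's HTTP API and only reacts when the returned index changes. The `fetch` callable is a stand-in assumed for this example: a real one would issue an HTTP GET to the local agent and read the `X-Consul-Index` response header. The fake responses and addresses are invented.

```python
from typing import Callable, Iterator, List, Tuple
from urllib.parse import urlencode

# A fetch function takes a URL and returns (index, body).
Fetch = Callable[[str], Tuple[int, object]]


def blocking_url(base: str, path: str, index: int, wait: str = "30s") -> str:
    """Build a long-poll URL: the agent holds the request open until the
    result's index moves past `index` or the `wait` timeout elapses."""
    return f"{base}{path}?{urlencode({'index': index, 'wait': wait})}"


def watch(fetch: Fetch, base: str, path: str, max_polls: int) -> Iterator[object]:
    """Yield the body each time the index changes, for at most max_polls polls."""
    index = 0
    for _ in range(max_polls):
        new_index, body = fetch(blocking_url(base, path, index))
        if new_index < index:
            index = 0  # index went backwards: start over from scratch
            continue
        if new_index != index:
            index = new_index
            yield body


def make_fake_fetch() -> Tuple[Fetch, List[str]]:
    """Hypothetical stand-in for the local agent: change, timeout, change."""
    responses = iter([
        (5, ["10.0.0.1:8080"]),
        (5, ["10.0.0.1:8080"]),                   # wait elapsed, nothing changed
        (7, ["10.0.0.1:8080", "10.0.0.2:8080"]),  # a new instance registered
    ])
    requested: List[str] = []

    def fetch(url: str) -> Tuple[int, object]:
        requested.append(url)
        return next(responses)

    return fetch, requested


fetch, urls = make_fake_fetch()
for instances in watch(fetch, "http://127.0.0.1:8500", "/v1/health/service/web", max_polls=3):
    print("service instances changed:", instances)
```

The second poll returns the same index, so the caller does no work; only real changes wake it up.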
One challenge Consul has to solve: Your applications don't need to know where the servers are but the client does, and so it discovers this. We use a lightweight gossip protocol between all the agents and all the servers in the cluster, which is based on Serf. This gossip protocol gives us membership, so all clients and servers know about each other, and it also gives us a distributed failure detector to know when one of those nodes goes away.
» Consul Connect in Depth
On top of that existing system we have in Consul, Consul Connect adds these three components Mitchell talked about:
- The service access graph
- The certificate authority
- A way to integrate with your application either natively or through a proxy
As we saw before, when we talk about service networking it's really important to distinguish between the control plane and the data plane. Consul is the control plane for this networking solution. It provides the service registry, it provides the access graph, and now it provides the certificate authority, and these are used to configure your data plane.
Your data plane is what transports your actual packets between the applications. Here we see from this slide earlier [6:18], this is the data path more clearly. The clients are configuring the proxies, but all the actual application data is flowing just through the proxies and not through Consul itself. The proxies form the data plane.
Control Plane Flow
Let's look at how this control plane works in a bit more detail. In this example, our application is represented by a proxy, but you get exactly the same flow if you're using the native integration Mitchell talked about. When the proxy first comes up, it requests both root certificates and its leaf certificate for the service that it's representing from its local agent.
As Mitchell mentioned, those leaf certificates use the SPIFFE format, and more or less, what that means is that service identity is encoded in a URI and that URI is added as a URI subject alternative name in the certificate. In this case, the local agent checks its cache and it sees it doesn't yet have a certificate for this service, so it generates a brand-new private key and a certificate signing request or CSR, and sends that up to the server.
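To make "service identity encoded in a URI" concrete, here is a minimal sketch that builds and parses such an identity. The talk does not spell out the path layout, so the `ns/.../dc/.../svc/...` form and the trust domain `example-cluster-id.consul` are assumptions for illustration, as are the type and function names.

```python
from typing import NamedTuple
from urllib.parse import urlsplit


class ServiceIdentity(NamedTuple):
    trust_domain: str
    namespace: str
    datacenter: str
    service: str


def to_uri(ident: ServiceIdentity) -> str:
    """Render the identity as the URI that would go in the certificate's URI SAN."""
    return (f"spiffe://{ident.trust_domain}/ns/{ident.namespace}"
            f"/dc/{ident.datacenter}/svc/{ident.service}")


def parse_uri(uri: str) -> ServiceIdentity:
    """Parse an identity URI, rejecting anything that is not in the assumed layout."""
    parts = urlsplit(uri)
    segments = parts.path.strip("/").split("/")
    if parts.scheme != "spiffe" or len(segments) != 6 or segments[0::2] != ["ns", "dc", "svc"]:
        raise ValueError(f"not a recognised service identity: {uri!r}")
    return ServiceIdentity(parts.netloc, segments[1], segments[3], segments[5])


web = ServiceIdentity("example-cluster-id.consul", "default", "dc1", "web")
print(to_uri(web))
```

Because the identity is a structured URI rather than an IP address, it stays the same no matter which host or port an instance happens to run on.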
As Mitchell showed, there's a pluggable backend on the server which is going to validate the request, sign the certificate, and then return that signed certificate to the agent. Note that the private key never left that host. The client agent can then cache that certificate and it can return it to the proxy. That proxy can now accept new connections and it can initiate ones using the same certificate as its client certificate.
At this point, the story isn't over. The client begins to take an active role in managing the lifecycle of that certificate. First up, it issues a blocking request to the servers to be notified of any changes to the root CA configuration, but it also keeps track of the certificate's expiry time. Meanwhile, the proxy issues a local blocking query to the local agent and makes sure that any changes in either trusted roots or in its own leaf it will notice and be able to reload.
If the roots change across the cluster, because you've configured a new CA or you've rotated that root key, or if the client agent sees that the leaf is getting close to expiring, it will automatically generate a new key and a new CSR, get that signed by the servers, and then deliver it straight back to the proxy by interrupting that blocking query that's being held open. At this point, the proxy can transparently reload its certificates and it can continue accepting and making connections without dropping any packets.
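The renewal decision described here comes down to two checks: has the active root changed, or is the leaf close to expiry? The sketch below is one way to express that. The 90% lifetime threshold, the 72-hour lifetime, and the root identifiers are assumptions chosen for the example, not values taken from Consul.

```python
from datetime import datetime, timedelta, timezone


def should_renew(issued: datetime, expires: datetime, now: datetime,
                 leaf_root_id: str, active_root_id: str,
                 renew_fraction: float = 0.9) -> bool:
    """Return True when the agent should generate a new key and CSR."""
    if leaf_root_id != active_root_id:
        return True  # the CA was reconfigured or the root key was rotated
    lifetime = expires - issued
    # Renew once the assumed fraction of the certificate lifetime has passed.
    return now >= issued + lifetime * renew_fraction


issued = datetime(2018, 6, 27, tzinfo=timezone.utc)
expires = issued + timedelta(hours=72)

print(should_renew(issued, expires, issued + timedelta(hours=10), "root-a", "root-a"))
print(should_renew(issued, expires, issued + timedelta(hours=70), "root-a", "root-a"))
print(should_renew(issued, expires, issued + timedelta(hours=10), "root-a", "root-b"))
```

Renewing well before expiry leaves time to retry if the servers are briefly unreachable, which is what lets the proxy swap certificates without dropping connections.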
Data Plane Flow
When it's time to establish a new connection, the proxy uses Consul's service discovery API, as we discussed, to find a healthy, Connect-enabled instance of the target service in the data center. In this example, we have a proxy for a web service, and it's trying to connect to our DB service.
The proxy discovers the IP address and port of a healthy Connect-enabled instance. In this case, it's going to be the IP and port of the DB proxy, and not of the DB itself, because that's what's listening. But it also receives this URI, which is the expected identity of the service it's connecting to. I've truncated that here just for clarity [10:21]. It can use this URI to ensure that the service it actually connects to is who it expects and prevent man-in-the-middle attacks.
The web proxy can now start this TLS connection. The handshake validates the certificates, both the client certificate and the server one, against the trusted roots that both of these proxies have loaded, in the normal way, and the web proxy can also check that the server's identity matches the one that it discovered through Consul.
Once the DB proxy has validated the identity and validated the certificate chain of the client, it then sends this authorization request to its local agent. It passes along the client's identity URI and also its own identity. It's important to note here that the client agent has already cached the subset of the access graph that matters for the DB service. Any rule that would affect incoming traffic to the DB service is already in memory on the client, so it can answer this authorization call typically, as Mitchell said, in microseconds.
If the connection's denied, then the proxy will just reset. It will close the connection and the handshake fails. In this case, though, web is allowed to speak to DB and so we get this TLS connection established and from this point forward it's completely vanilla, regular TLS. It's important to note that this continues and the certificate management continues for the entire lifecycle of the application instance, so new certificates are going to be picked up and new connections are going to be authorized for the whole lifecycle.
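The two identity checks in this flow can be sketched in a few lines. This is a simplified model, not the agent's real logic: identities are shortened to bare service names, the cached slice of the access graph is a plain dictionary, and the `default_allow` fallback for pairs with no rule is an assumption made for the example.

```python
from typing import Dict, Tuple

# Cached rules that affect one destination: (source, destination) -> allowed?
Intentions = Dict[Tuple[str, str], bool]


def verify_server(expected_identity: str, presented_identity: str) -> bool:
    """Client side: the peer must be the service that discovery told us to expect."""
    return expected_identity == presented_identity


def authorize(intentions: Intentions, source: str, destination: str,
              default_allow: bool) -> bool:
    """Server side: answer from the in-memory rules, once per new connection."""
    return intentions.get((source, destination), default_allow)


db_rules: Intentions = {("web", "db"): True, ("batch", "db"): False}

print(verify_server("db", "db"))                              # identity matches
print(authorize(db_rules, "web", "db", default_allow=False))  # explicit allow
print(authorize(db_rules, "batch", "db", default_allow=False))  # explicit deny
print(authorize(db_rules, "cache", "db", default_allow=False))  # no rule: default
```

Since the lookup runs against memory on the same host and only when a connection is opened, established connections carry no further authorization cost.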
» Performance and Scalability with Consul Connect
Let's look at how this design ensures good performance and scalability. The most important thing to notice, Mitchell said already, is that the service access graph is scale-independent. If there's one web instance or 100, it's still just a single rule. More important than the amount of data that it would require to replicate 100 rules rather than one, is how dynamic they are, how frequently they change.
By comparison, if you have a system based on IP addresses and you have to whitelist 100 web server IPs on every database host, it's not just 100 rules instead of one rule, but those 100 rules change every time you bring up a new web instance, or kill one off, or move one to a different host and so on. So you have a huge amount of churn, as well as a lot of data to move around.
With our Intention graph it's a single rule, and that rule doesn't change unless you actually wanna stop the web from connecting to the DB, which is gonna be rare.
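A small model makes the churn difference visible. The sketch below assumes 100 web instances and three database hosts (hypothetical names and addresses) and counts how many rule entries have to change when a single web instance is replaced.

```python
from typing import List, Set, Tuple


def ip_whitelist(web_ips: List[str], db_hosts: List[str]) -> Set[Tuple[str, str]]:
    """IP-based model: every web IP must be whitelisted on every database host."""
    return {(ip, host) for ip in web_ips for host in db_hosts}


def intentions() -> Set[Tuple[str, str]]:
    """Service-based model: one rule, however many instances exist."""
    return {("web", "db")}


web_ips = [f"10.0.0.{i}" for i in range(1, 101)]
db_hosts = ["db-1", "db-2", "db-3"]

before = ip_whitelist(web_ips, db_hosts)
# Replace one web instance: its old IP goes away and a new one appears.
after = ip_whitelist(web_ips[1:] + ["10.0.1.1"], db_hosts)

print("whitelist entries:", len(before))
print("entries changed by one replacement:", len(before ^ after))
print("intention rules changed:", len(intentions() ^ intentions()))
```

Every instance replacement touches entries on every database host in the IP model, while the service-level rule set is untouched.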
Next, our local agent caching. We've already seen how that makes things super quick in memory for authorization responses. But the graph state for that is also updated using Consul's blocking queries. That scalability and that proven control plane we've already seen lets us keep all of that state pushed out to the edges, into the agent memory, and updated typically in milliseconds even across thousands of nodes.
When proxies restart they can also start serving again immediately because the certificates are cached locally in the agent. So they don't need to generate a new key and CSR and reconfigure themselves, they can just pick right back up where they left off. And since we enforce access control just at connection time, there's no per-request overhead. I'll say a bit more about that in a minute.
Finally, as Mitchell described, if you do have a service that's really performance sensitive, with Connect there's always the option that you integrate with it natively, and have no proxy at all. Terminate the TLS right in your application.
» The Consul Connect Security Model
So we've looked at some of these moving parts, and how the control plane flow works, but we've glossed over just a few important security details. So I'm gonna dig into those now. This is gonna be a pretty high-level look at some of these issues, and we have a much more thorough and rigorous security model on the website, which is gonna be important to read when you're putting this thing into production.
The most important thing to understand is—who do we trust implicitly and who do we not? And when we say "Trust implicitly," we mean who do we allow access to some service, or to some secret without explicit authentication and explicit authorization to access that.
So Consul Connect enables this low-trust, or zero-trust network model inside your data center. As we saw, traditionally you have the secure perimeter, but everything inside is more or less a free for all. In our low-trust model, anything outside of the current host boundary is treated as untrusted. That means all traffic that comes in or out of your host gets encrypted with TLS, gets authenticated by a client certificate, and it gets authorized against this access graph.
So that's the same level of access control that's typically deployed on the public internet for secure services. But that's on every single host in your data center. It's always been the goal of network security teams to minimize the access within your data center, but as we saw from Armon's keynote, that just becomes impractical in this very dynamic world where identity is attached to IP addresses, where things are coming and going quickly. And if you do achieve it, it comes with a high cost, usually in effort to maintain, but also in limiting how agile your development teams can be and how quickly you can adapt to different service communication patterns.
The flip side to this, we don't trust anything outside the box, but at least when you're using a proxy you do have to trust the network inside the machine. Specifically, anything that can talk to the private side of the proxy is assumed to be trusted. It can make outgoing connections with the identity of the application that's running, and so it will inherit all of its privileges. This is typically true though in a sidecar model, where the proxy and the application talk to each other over the localhost, over a loopback device.
It's important to realize though, that any process that can talk on that loopback device is implicitly trusted. They can make outbound connections through that proxy, and they will be granted the same access and given the same identity as the application itself. In general, this is no different than an IP-based security system. For example, if you limit access to your database by whitelisting your web server IP addresses, then any other software that happens to be running on those web server IP addresses is also being implicitly granted that same access.
Where the single host is shared by different services, or where there's potentially untrusted software on the host, you can mitigate that using techniques like network namespacing, or containers. Each proxy can then only speak to the application, or the application can only speak to the proxy that's within its own network namespace. For example, if you run your application and your proxy as Docker containers, and you set them up so that they share a network namespace, then you get a private loopback device that only those two applications can speak on. This is exactly how pods work in Kubernetes if you've used it, and it allows you to keep multiple private loopback networks within your host.
In the case that you have to integrate with an external service where you can't co-locate a proxy, like a cloud database, you can still have Connect manage the access graph and the authentication for you by provisioning one or more dedicated proxy hosts that run a proxy, and that have exclusive private access to that external resource. How you set up that private access is really gonna depend on the service and the platform. It might be through a dedicated subnet, or VPC, or it might be through only distributing access credentials for the service to that limited set of proxy hosts.
ACL-based Service Identity
That leaves us with the important question of how do we decide who to issue certificates to? If anyone can get ahold of a certificate for any service, then obviously none of this security holds. The access graph is not gonna secure anything. Consul has an existing access control list, or ACL, as we like to pronounce it, system, and this ties rich policies about who can access what to a secret token. To ensure the security guarantees of Connect, we need to make sure that ACL system is set up in such a way that we restrict registrations only to trusted applications that have explicit access to register as that service, and therefore obtain the certificate for it.
We don't have time for a deep dive into ACLs today, but I'm just gonna cover a couple of basics and give you a really quick example of this because it's really important to how this model works. The most important thing to note is in Consul right now, the default is to have ACLs disabled completely. So if you enable Connect today, it will still work, it will still enforce the access graph, and it's great for testing. It's great for incrementally rolling out in production, you don't just wanna turn off access on reboot. But, by the time you've gotten to this production and secure setting, you need to have ACLs that can limit the access to register services.
Here's an example of an ACL policy. ACLs have a whole bunch of resources that they can control, but we're just gonna look at services right now. And today, the way ACLs work, the service specifier string here [21:45] is always a prefix match. So that first rule, with an empty string, is actually a wildcard. That's saying "This token grants access to read, or discover any service in the cluster." The second rule is the interesting one here. This is saying "This token grants write access to the web service." This is the permission that's necessary both to register an instance of web, and to obtain a certificate identifying yourself as web.
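The prefix-match behaviour is worth seeing in action, because it has a side effect that is easy to miss. The sketch below models the two rules described above as a dictionary from service prefix to policy. It assumes that the longest matching prefix wins and that a token with no matching rule is denied; both are simplifications for illustration, and the function names are made up.

```python
from typing import Dict

# Higher rank includes the lower ones: write implies read.
RANK = {"deny": 0, "read": 1, "write": 2}


def effective_policy(rules: Dict[str, str], service: str, default: str = "deny") -> str:
    """Pick the policy of the longest rule prefix that matches the service name."""
    matches = [prefix for prefix in rules if service.startswith(prefix)]
    if not matches:
        return default
    return rules[max(matches, key=len)]


def can(rules: Dict[str, str], service: str, needed: str) -> bool:
    """True if the token's rules grant at least the `needed` level on `service`."""
    return RANK[effective_policy(rules, service)] >= RANK[needed]


# The empty prefix matches every service; "web" matches names starting with web.
token_rules = {"": "read", "web": "write"}

print(can(token_rules, "db", "read"))         # discover any service
print(can(token_rules, "db", "write"))        # but not register as db
print(can(token_rules, "web", "write"))       # register and get a cert as web
print(can(token_rules, "web-admin", "write")) # prefix match also covers this name
```

The last line is the catch: under prefix matching, a token with write on `web` can also register as any service whose name starts with `web`, so service names need to be chosen with that in mind.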
You generate the token by submitting that via the UI or the API, and you get back this token ID that we see here. This is the secret. This was only my local machine, don't worry you won't get access to anything. This is the secret that comes back that you need to distribute and have your application register with in order to be allowed, be granted access to make that registration. So you can put this token directly into a JSON or config service description file on the host, or you can register via the API using this token.
Exactly how you manage distributing the secret tokens to your applications in a cluster, that's a whole other talk that I'm not gonna give right now. But, at HashiCorp we have this thing called Vault that we think is kinda useful for that problem.
» Explaining the Consul Connect Design Decisions
We've got some idea about how these components work, and how we make our security guarantees hold. I want to highlight a couple of our design decisions, which might not be the same as the design decisions made by similar solutions to this problem. I wanna just give a little bit of background about why we made them the way we did.
The first one is that we built the core around layer 4 connections. So as a recap, in the OSI model, layer 4 is the transport layer. Think TCP connections, or TLS connections. Application protocols like HTTP or gRPC are up there in layer 7. Some similar service mesh solutions provide really rich layer 7 functionality, kinda first class right out of the box. And in a lot of cases, they tie access enforcement to layer 7 things. So they allow you to limit requests by URL if it's HTTP, or by RPC name.
Protocol-agnostic to the Core
Why didn't we go straight for all those bells and whistles? Well at HashiCorp we have this strongly held principle of pragmatism. We're building this stack—as Armon described right at the start of the keynote—this stack that we believe is the right way to move forward in this new dynamic world, but we want it to actually work for people that run stuff today.
This is a screenshot of the principles page on our website, and we use them to help make our decisions as we build new things. Connect was really motivated by working with a wide variety of organizations who have all sorts of legacy stuff, databases, even mainframes, monolithic applications running in their private data centers. But they also need to interface that with their new cloud native, public cloud, dynamic applications, and we need to support all of their existing workloads with Connect as first-class citizens, as well as their fancy new stuff.
It turns out, not all of their legacy mainframes support gRPC. So we need to be protocol-agnostic right at the core to support all of those workloads first-class. Consul's always aimed to be as widely portable as it can. If you look at service discovery, we use DNS as our primary way. Why? Because every application ever written uses DNS and can take advantage of our service discovery and our load balancing without any changes. You can get more sophisticated if you use the HTTP API, but that comes at the cost of changing your applications and rewriting them. We're doing exactly the same thing with segmentation with Connect.
Minimal Performance Overhead
Keeping authorization at layer 4 also has a lower performance overhead. We've already seen how we make our authorization calls really quick. But even so, if you put that overhead on top of every single HTTP request, or every single RPC call that's coming into your really busy service, that's a whole different ballgame to what we have here. It's a much bigger burden.
In the case where you really do need to expose privileged APIs to just some of your clients, you can just do that trivially by running multiple instances of your identical application. One of them you configure to expose this admin, or privileged API, and then register them as different services. Maybe billing and billing-admin, and at that point you can control them with different rules in the access control graph.
Platform to Build Layer 7 Features
So just like the HTTP service discovery API adds richness on top of our DNS interface, which is widely portable, being layer 4 at the core doesn't stop us from including great layer 7 features like request retries, rate limiting, and canarying. But we're gonna do that through integrating with more sophisticated existing proxies. And they're not core to the product, you can still get all the security and segmentation benefits without using those.
Built-in Proxy With No Major Code Changes
That brings us to our next decision. Why on earth did we build our own proxy? There are obviously some great options out there, Envoy, Linkerd, HAProxy, so many more. We certainly don't intend to provide another general proxy solution with all the same features as these other options. We definitely knew right from the start that our data plane needed to be pluggable, we needed to take advantage of the great work that's already been done in this space.
But another thing that we talk about a lot when we're designing features is focusing on workflow over technology. We want to make the operator experience as simple as possible to get you from plain TCP everywhere, to authenticated and authorized applications with a couple of lines of config change that Mitchell just described. We're releasing Consul 1.2 today that has Connect in it. And if you upgrade an existing cluster and make those three or four lines of config change Mitchell showed, you get all of these features with no further changes to your application. And we can only do that because there's a built-in proxy that's built right into the binary of Consul. There's nothing extra to install, and we can manage that proxy through the agent that's already running on your host.
In terms of performance, it turns out in practice, most of our applications don't have super stringent latency SLOs, and actually the built-in proxy may well work fine. We've not done super scientific benchmarking on it yet, but on my laptop I can run five gigabits per second from one application to another, through two proxies. So my laptop is doing the TLS on both ends, and we can still do five gigabits per second, and the latency introduced is microseconds over localhost.
So it's quite likely that if all you're worried about is performance, most of your applications will be fine with the built-in proxy. So you can really take advantage of that ease-of-use. But of course, if you do have higher performance needs, if you do want to take advantage of more sophisticated layer 7 features, then we'll be pluggable with other proxies.
Network and Platform Agnostic
Another design decision that's important to talk about is that we're very intentionally network and platform agnostic. We've not made any assumptions about your network, all that we require is plain IP connectivity between your hosts. And we've also intentionally avoided relying on things like configuring iptables, or messing with your SDN or your cloud networks, to provide our basic guarantees. That does mean that you'll need to make sure that you do meet the security model that we document. But mostly that's really obvious stuff like not exposing the private side of your proxy over the network generally.
So it's important that our core stay really portable. But we anticipate that there's gonna be a lot more tooling to come, that will help integrate this with popular platforms. Schedulers like Nomad and Kubernetes and other networking tools too.
» Future Roadmap for Connect
Finally, I wanna talk quickly about some of the future things we have for Connect. These are things that aren't in the beta release that's available today, but they're our short-term plans.
Third-party Proxy Integration
The first one we've talked about a lot, and that's integrating with third-party proxies. The plan is to integrate first with Envoy first-class, and then work out how we can pull together the ecosystem. None of these proxies support the same interfaces. So how do we make it pluggable?
We're also, later in the year, gonna have a GA release of Connect, and that will include some enterprise features which are needed by some larger organizations. These are things like:
- First-class multi-data center support
- Certificate revocation, and so on.
ACL System Improvements
As we've seen, we're really dependent on our ACL system to enforce these identity guarantees, and so we've got some work to do to improve them and make them more flexible, but also to really improve the workflow and the tooling around setting everything up in a kind of default, safe, way.
Streamlining Proxy Deployment in Schedulers
We already have demos in the documentation that you can see, for how you can run this today in Nomad and in Kubernetes as a sidecar proxy. But we wanna build out the tooling for that so that it's automatic, so that you can have admission controllers in your Kubernetes cluster that will drop the sidecar in automatically and so on.
Finally, there are just a ton more opportunities we have with Connect. We have loads of ideas. As an example, observability is a really important thing. With Connect, as with other service meshes, we have all this important data about what's actually going on in your data center. Who's talking to who? But beyond just basic metrics and visualization of that information, we could have higher level insights. We know who is supposed to be talking to who. We could tell you things like: your web app is allowed to talk to your billing app, but it never does. Is that a mistake? Should you un-grant that permission? And there's more.
» Community Feedback
The main reason that we're releasing this Beta early, today, without all of this stuff built, is that we really wanna get a lot of feedback and a lot of usage from our community too. We wanna know how you wanna use this thing and what's gonna make it work really well for you. So we're keen to hear that feedback, and it will help shape and prioritize this roadmap.
So to sum up here, Consul is a mature and widely used distributed data center control plane solution. We've already solved a lot of the really hard problems around consistent state, around scaling gossip up to thousands and thousands of nodes. And it's relied on by some huge organizations to run really big clusters that are critical to their business today.
On top of that, Connect is adding automatic certificate distribution and a scalable way to enforce service-level access control. At the data plane, we're shipping a built-in proxy to get started quickly, and we're soon gonna support, and plug in to, the wide ecosystem of sophisticated proxies that are out there.
So please have a go. You can get stuck in with demos and docs and more. Wonderful colleagues behind the scenes have put all of this live while we've been on stage this morning so you can see the Connect announcement blog at that link at the top. We didn't have time for a demo today, but you can try this out for yourself at this link. Our friends at Instruqt have put together a playable demo which sets you five or six challenges for getting your Connect cluster set up. Even cooler than being able to play this online, we have this same demo playable in 8-bit glory on a physical arcade machine over by the HashiCorp booth. So come and see who can get the high score configuring their Connect cluster today.
We're really excited to have this launched and announced, and we are really excited to see what you all use it for, and how you get on with it. Please give us some feedback, it's really important and we just want to invite you to help us shape the future of service discovery with Consul.
We don't have time for Q&A, so please do come and find me, or Mitchell and Armon, and ask us about this stuff. We'll be around by the booth today. We're really excited. Thank you very much.