Mitchell Hashimoto Unveils Consul Connect at HashiDays 2018
Jun 26, 2018
Mitchell Hashimoto unveils HashiCorp's latest innovation: Consul Connect. Learn how Consul's new Connect features elegantly address the challenges of network segmentation in service-oriented applications, making it an ideal service mesh provider.
HashiCorp Consul is the networking glue for large-scale distributed systems at PayPal, Capital One, Stripe, and many other high-profile firms. For the last few years, Consul was primarily known as a service discovery tool. Along with that functionality, it also provided distributed configuration management with a central key-value store.
With Consul's latest 1.2 release—which contains a new system called "Consul Connect"—it now provides all the components of a full-fledged service mesh. New service segmentation features from Connect provide the final piece of the service mesh puzzle.
HashiCorp co-founder and CTO Mitchell Hashimoto takes a deep dive into these new features that are fully free and open source:
Automatically Encrypted Traffic: Use Connect to automatically encrypt all traffic via mutual TLS — meaning it's encrypted while in transit.
Easy Connection Rules: Connect uses a service access graph to allow or deny service communication by creating intentions: rules that use the logical name of a service instead of IP-based rules. This makes rule modifications simple regardless of your application's scale; it doesn’t matter if there is one web server or 100. Intentions can be configured using the UI, CLI, API, or HashiCorp Terraform.
Proxy Sidecars: With Connect, you can attach lightweight sidecar proxies to services without having to modify your application. These sidecars automatically establish inbound and outbound TLS connections, providing the data plane piece of the service mesh pattern out-of-box. Consul can also support third-party proxies such as Envoy.
Native Integration: For the most performance-sensitive applications, Consul Connect APIs support native integration to establish and accept connections without a proxy.
Certificate Management and Rotation: Consul comes with a built-in certificate authority (CA) provider that can integrate with HashiCorp Vault, and can also be extended to support any other PKI system. Connect also rotates both root and leaf certificates automatically with no service interruptions.
SPIFFE-based Identities: Consul uses the SPIFFE specification for service identity. This enables Connect services to establish and accept connections with other SPIFFE-compliant systems.
Connect completes the service mesh capabilities of Consul, providing both a control plane and a data plane. It provides secure service-to-service communication with automatic TLS encryption and identity-based authorization that is performant, easy to manage, and works everywhere.
Read the announcement blog to learn more.
Founder & Co-CTO, HashiCorp
I'm super excited to talk about Consul Connect today. It's something we've been working on in Consul for a very long time. We started talking about it and then designing it over a year ago and we're super excited to bring it out into the public today.
As Armon mentioned, Connect is a feature that's built directly into Consul and so for those who are less familiar with Consul, Consul's a product that we've had that's free and open source since 2014. In that time, it's amassed quite a large community; it's indicative by the GitHub stars. There are over 12,000 GitHub stars on Consul. That's somewhat of a vanity metric, so you could also look at things like downloads and actual usage. Consul gets over a million downloads monthly and these are deduped downloads. Also, we know of multiple customers that are running single clusters as large as 50,000 agents. So all of this is to say that Consul's very popular, it works at scale, and it's known to have a lot of operational stability attached to it.
Some of the users that have talked about Consul publicly are listed here. You could look these up online and find talks associated with them about how they use Consul. And the amazing thing is, all these users use Consul in a way that's very, very core to their infrastructure. The problems that Armond mentioned—discovery and configuration—that Consul has solved for a number of years are extremely important to modern, dynamic infrastructures. Consul plays a critical part for these companies.
So we've built Connect directly into Consul as a new feature especially because it needs discovery and configuration capabilities. But, we get a lot of benefits from that, including building on top of this operational stability and building on top of systems that we know are already mature and work at a very large scale. And you'll see how those are very important pretty soon.
» Introduction to Consul Connect
So just to reiterate, what Connect is, is a feature for secure service-to-service communication with automatic TLS encryption and identity-based authorization that works everywhere. And the really key words in this sentence are: automatic TLS encryption, identity-based authorization, and that it works everywhere. I'm going to dive into each of these in more detail.
But taking a look at where we're coming from once again, in a traditional environment, our view of identity was generally tied to IP addresses and hosts. So, in this very small example, we have two hosts with IPs attached to them and we have a firewall in the middle. We'd create a rule that says, "IP 1 can talk to IP 2." And when we make that connection, it's allowed and it generally happens over plain TCP due to a variety of other complexities.
In a Connect-focused world, we're instead diving deeper into the host. We're not looking at the IP. The IP still exists because a host still exists, but what we actually care about is the services that are running on that host. In this case, if we dig into IP 1, we can see that we have an API service and a web service and if we dive into IP 2, we see that we have a database. In this scenario, we don't have a firewall anymore, we instead create rules that are based on their identity. And when they connect, in this case—API going to the database—we have a rule that says the API can talk to the database. It's allowed and we do this over mutual TLS. This gives us both identity and encryption.
But then, we can also have another rule that says web cannot talk to DB, and so when that connection is attempted, the connection is refused, even though the connection is coming from the same source machine. The IPs in this case really don't matter. We could dive directly into the service. We get fine-grained control and everything is encrypted automatically.
The other exciting thing I have to say is, everything we're gonna talk about today over the next 10 to 20 minutes is free and open source.
Connect is built up of three major components.
The service access graph
The certificate authority
A pluggable data plane
We take those three components and wrap them up into an easy-to-use and operate package. So let's dive into each one of these starting with the service access graph.
» Service Access Graph
The service access graph is how we define which services can actually communicate to each other. In the world of Consul, this is done using something we call Intentions. Intentions define a source and destination service and whether the connection is allowed or denied. Using this, each individual service can have its own rules that are independent of the number of instances you have of that service. So it's scale-independent. In a traditional firewall based world, if you had a 100 web servers that ran on a 100 different machines, that was, at a minimum, a 100 rules if it was talking to one service. Bring up, let's say five databases and now you have a multiplicative explosion of the number of firewall rules you actually need.
In a highly dynamic, ephemeral world, you would need to create tooling to dynamically update these firewall rules as things come and go. And we've found that that's just difficult to scale both organizationally and technically. But with Connect and with Intentions, it's just always one rule. If you have 100 web servers, five databases, and they're scaling up and down, the rule is always web can talk to DB; it's just one rule. It's organizationally simple and technically very very scalable. These intentions can be managed with the CLI, the API, the UI, or Terraform. All four of these methods are available immediately with the launch of Connect.
Looking at an example of the UI, this is what we're launching today (~25:00 in the video). You could view your intentions directly in the UI. Search them, filter them, see which ones to allow, see which ones deny, and they're all sorted by the priority in which they would be applied as well. You could edit Intentions and they take effect almost instantly. You could watch those two services happily connecting together, or set to deny, you hit save, and the connections stop working. These Intentions can also be managed by a very easy-to-use CLI for people who are more CLI friendly. All of these, behind the scenes, are using the same API and we're also launching Terraform resources so you can manage that with code as well.
A really important property of Intentions is that they separate the actual rule creation of what can talk to each other from the service deployment. Likewise, when you create Intentions, the services don't need to exist when you do that. You could say: the web service can talk to the database before the web service or database even exist, so that when those are actually deployed, the connections work or don't work immediately. You could also use the ACLs within Consul to define separately who could actually deploy a service and who would actually manage the rules associated with communication for that service. So you could have different groups of people—if you want—or the same people, that could actually modify the intentions versus actually registering the service.
So those are Intentions. They're extremely easy-to-use. Extremely simple as a data model and they're fast.
» Certificate Authority (CA)
The next important concept is the certificate authority. The certificate authority is the way that we establish two really important properties of Connect. Identity and encryption. To build these properties we use TLS (Transport Layer Security). TLS is a very well-adopted protocol and it has this really nice property that it was designed especially for completely untrusted networks. Specifically the public internet. And that is the type of mentality we're trying to bring down into our infrastructure. We're trying to make that a low-trust environment where we get end-to-end identity and encryption so that can confidently make these connections despite not trusting the network in between or trusting the applications around us. Using TLS, we get identity by baking that directly into the certificate and we get encryption by nature of TLS's transfer protocol.
A challenge with TLS, and a primary reason it's not better adopted within the data center, is actually generating, managing, and rotating certificates. This is a pretty big challenge and one we have quite a bit of experience with thanks to Vault. So what we've done with Consul is actually build all of this directly into it. We've expanded Consul's APIs to support APIs for requesting root certificates, signing new certificates, generating intermediates, and rotation.
Rotation is a really big one. This is something that's generally really tricky. Because, ideally, in a perfect world, you want short-lived certificates so that if certificates were to be compromised, the period of which it's compromised is fairly short even if you have revocation. With Consul, you're able to do that because we can automatically rotate all the certificates with zero disruption to service connectivity. This is all built directly into Consul so as you update the configuration, if you change the root certificates, if you change the way that certificates are generated, Consul automatically orchestrates updating the certificates across thousands of your services and overlaps them so that's never a point in time where service connectivity doesn't work.
We also have an approach with pluggable CA providers so that you could use the PKI system that your organization has chosen to adopt to generate and sign these certificates. So, let's see how that works. One of the ways that Consul could generate certificates is using a built-in CA. We've built a really basic, built-in CA based off of Vault's code directly into Consul. So when you adopt Consul with Connect, there are no external dependencies. You immediately have all the tools you need to start using Connect.
This basic CA is fully functional and the way it works is—the clients in the distributed Consul cluster actually generate the public and private keys locally on their machine. So we distribute the compute requirement of generating all these certificates across your cluster. After generating the keys, they send a CSR over to the server, the server receives this signing request and then returns with a signed certificate. The server itself is the only thing that ever has the private key for the route or intermediate certificates and likewise, the client is the only thing that ever has the private key for the actual leaf certificate. The secret material is distributed through your cluster, so that no single server has access to everything.
Like I said, we have pluggable CA providers, so another provider that works immediately with launch is, of course, Vault. Vault is a way to have a lot more configuration, a lot more control, a lot more policy, over how your certificates are stored and generated. In this scenario, instead of using the built-in CA, what Consul would do is receive the CSR, forward it over to Vault. Vault performs the signing operation, sends the certificate back, and Consul sends it back to the client. In this world, Vault has the private key for the root, and Consul never gets to see that. Vault has all the secret material. The server is just the pass-through for this API. It's important to note the clients always use the same API with Consul, so no matter what CA provider you actually use, the APIs to request root certificates, sign new leaf certificates, etc. are always the same. The server itself is actually the only thing that's communicating to different providers in the back.
A really fun property is—because we can do automatic rotation, you could switch between your CA providers anytime you want. You could get started with the built-in CA provider, get going, and then at any anytime switch to Vault, and we automatically manage the rotation from an old provider and a completely different root certificate, to a new provider and a new root certificate, and generally, even with hundreds or thousands of services, this happens so fast throughout your cluster, you don't even believe the rotation happens, but it did.
Another important note is the format of the certificates. The certificates themselves are, of course, just standard X.509 certificates—the same TLS certificates you're generally used to—but one thing we've done is we've adopted the SPIFFE specification for identity. SPIFFE is a way that's published by the CNCF for encoding identity within a certificate. There's a number of SPIFFE compatible systems out there. By adopting the SPIFFE specification for identity, one of the things we've gained with Connect is interoperability with other systems. Because our certificates that we generate and the certificates that we accept are SPIFFE based, that means that if you have an external system that uses SPIFFE, your Consul services could connect to those SPIFFE services over Connect and it works. You just lose some of the authorization from the intentions of Consul.
And the reverse also works. We could accept connections from SPIFFE-compatible systems into Consul and then keep the same TLS certificates, keep the same identity, keep the same encryption, but on the receiving case, we could also validate that with our service access graph and authorize the connection. This is really, really important for interoperability. So that is a certificate authority.
» Pluggable data plane
The next important thing is the pluggable data plane. In any service mesh solution, it's really important to understand the difference between the control plane and the data plane. The control plane is generally responsible for configuring and defining the rules for service connectivity, routing, and so on, and the data plane is the actual thing that's responsible for mediating and controlling active traffic. With Consul Connect as a control plane solution, we're defining intentions, we're distributing the configuration in the form of TLS certificates, and we're handling all this control.
The data plane; however, is pluggable, and Consul doesn't try to do this itself. The way this works is one of two approaches.
We support a sidecar proxy approach
We also support native integrations directly with Consul.
Both of these approaches use local APIs to access the service graph and the certificates. Because of Consul's proven architecture (that works at scale) of having agents on every machine, we can efficiently updates certificates and root certs in the background, out of the hot path, and create subsets of the graph and cash them on every single machine. So that for active connections, for getting new TLS certs, everything—it's usually accessing locally cached data. So all those APIs calls respond in microseconds. The overhead for using connect on any service is very, very minimal, and both sidecar proxies and native integrations use these APIs.
Sidecar proxies are the first approach. They are a way to gain the benefits of Connect without any code modification to your existing applications. Almost any application could immediately start getting the benefits of Connect with zero change whatsoever to the actual binary itself. And the way this works is by putting a proxy next to it that automatically handles the wrapping and unwrapping of TLS connections. This has a minimal performance overhead. It introduces a new hop, of course, but this hop is strictly localhost. And it's very, very quick. And then the API call, like I said, to Consul is all local, so this generally responds in microseconds and you can't notice it. The benefit of the sidecar proxy approach is that it gives operators the most flexibility. You could choose which proxy you want to use, and whether you want to use a proxy at all, and how you want to sort of run that proxy. There's a lot of flexibility in terms of how this could be deployed.
Consul has two approaches to running sidecar proxies. One we call managed and when we call unmanaged. In the managed approach, Consul will actually start and manage the lifecycle of the proxy for you. It'll send configuration to it, it'll make sure it remains alive, it'll make sure it's listening on the right ports, that the catalog is always up to date, that it's healthy. It sets health checks automatically. The managed approach is really, really easy.
In the unmanaged approach, it's just like running any other application or service in your infrastructure. The operator is responsible for starting the proxy, registering it with Consul, and configuring it. But it's just the first class proxy like anything else, it's just that Consul isn't managing it for you. There's an additional security benefit to using the managed approach. When you use a managed proxy, Consul uses a special ACL token, so that that proxy can only access read-only data related to the application it's proxying itself. That's not currently possible with unmanaged proxies.
Visually, the way this looks is like this (~36:20). If the dotted lines represent a host and we have the Consul client and an application, and our application wants to talk to another application on another host, we would start a proxy—managed or unmanaged—alongside the application representing that individual application. It would receive its configuration—its TLS certs, the port to listen on, etc—from Consul directly. And when the application wants to create a connection, it connects over plain TCP to a loopback address on the machine—not over an untrusted network, over a trusted local network. That proxy then uses Consul's service discovery feature that we've had for years, to find the other Connect-enabled application. So this will connect to another proxy and, as you'll see in a second, it could also just connect directly to a natively integrated application—there might not be another proxy on the destination side. This proxy then connects back to the application.
In this model, the application itself doesn't need to even be aware of TLS at all. It just needs to be able to make basic TCP connections and you get the full benefit of Connect right away.
The proxies themselves are completely pluggable. Consul exposes an API that proxies can integrate with to immediately start using Connect, and it's just one API they really have to integrate with. So this makes it really easy to add new proxies. With the release of Connect, we're also releasing integration with Envoy, thanks to our partners at Solo and their product Gloo. This enables you to use Envoy wherever you want to get higher level features like telemetry, Layer 7 routing, etc. that perhaps other proxies won't offer.
I should note at this point, it's also important that we're shipping with a proxy built into Consul. It's a basic proxy. It doesn't have a lot of features, but it will get you connections right out of the box. This is really important because again, when you deploy Consul and enable Connect, there are no other dependencies you need. Consul is the only system you need to get automatic TLS connections between any two services. But you always have the option to use other proxies and these proxies could change on a service by service basis. Most applications aren't that performance sensitive, so using the built-in proxy is totally fine. But for certain applications that are either performance-sensitive or have the need for higher level features like telemetry, tracing, routing, etc., you could deploy something like Envoy next to those. Some applications require things like Nginx, or HAProxy, or other very specific data plane solutions, and those could also integrate with Connect, and you could use those. So your whole infrastructure could be totally heterogeneous. You could use the right tool for the job, and you could always lean back on the basic built-in proxy for the easiest operational simplicity there is.
The other option you always have is to natively integrate. This is relatively unique to Connect. With native integration, because we're basically just doing standard TLS—it's standard TLS with one extra API call to Consul to authorize a connection—almost any application can very easily integrate with Connect without the overhead of the proxy. This introduces a basically negligible performance overhead. It's standard TLS and the one API call generally responds in one or two microseconds, so it's not noticeable during the handshake. We recommend this really only for applications that have a really strict performance requirements, because it does require code modification and so it's a pretty expensive process to actually get deployed out there. What we recommend instead is just starting with the proxies, seeing how it works, and if you need that extra performance, integrating natively. And of course, natively integrated things and proxies, in terms of Connect, are indistinguishable so they could all connect to each other.
So those are the three components that build up Connect and they're all really important and we think we've exposed them in a really elegant and easy-to-use package.
» Upgrading to Consul Connect
But of course, the whole thing holistically also has to be easy-to-use and operate, and we think we've done that as well. When you upgrade Consul, to get Connect, you add three lines to your servers and restart them one by one. After adding these three lines, Connect is ready to go on your entire cluster. The clients themselves, besides upgrading, don't any configuration changes. And when you register a service, by adding one extra line—no code modification—by adding one extra line to your service definition, you could request that Consul manages a proxy, starts it up, chooses a dynamic port, and registers it with the catalog. So by putting this one line of JSON in your service definition, for the PostgreSQL service, in this case, your Postgres database is now ready to accept identity-based encrypted connections just by reloading Consul.
Another principle challenge of these sorts of solutions is that, in an ideal world, the only exposed listener in all these things is the TLS listener that requires strict identity. But a challenge with that is the developer and operator; human-oriented connections. In the example of the PostgreSQL database, what if I as an operator, need to open a Psql shell into Postgres to perform some analytics? It's kind of tricky to get a TLS certificate, connect, make the right port, all those sorts of things. So we've thought about this as well.
Consul ships with a command that is easy to run, you could run directly on your laptop and it lets you masquerade as any service you have permission for. So in this case, we're running a proxy, we're representing the service web, you need to have the right ALC token to be able to do that, and we're registering an upstream of Postgres on the local port 8181. This will automatically use Consul service discovery to find the right proxy or natively integrated application and exposes it locally on port 8181. So then I just make a normal, unencrypted plain TCP Psql open shell, to localhost on port 8181. And now this connection is actually happening over mutual TLS, fully encrypted, authorized, directly into your datacenter. It's super, super easy to get connections to anything, even if they only expose Connect. And that's what we recommend. Thank you.