Watch Mitchell Hashimoto explain the new features coming in Consul 1.6 service mesh.
Consul's new Mesh Gateways will provide simplified and secure cross-cluster communication without the need to configure a VPN or create complex routing rules, enabling the scale needed for multi-platform and multi-cloud environments.
Cool. Thank you, Armon. Hello, everyone. Armon did a great job of framing the problem and showing the importance that the network plays in being able to really manage these multi-platform, multi-cloud heterogeneous environments, and I'm going to overlap a bit with Armon. I'm excited to talk about the work that the Consul team has done, I think it's really amazing, but I first want to just review what we announced last year right here in Amsterdam, because we've been building on top of that.
Last year we announced a feature in Consul called Consul Connect; and Consul Connect, in one sentence, enables secure service-to-service communication with automatic TLS encryption and identity-based authorization that works everywhere. I went into detail last year and explained all of this; this year I'm just going to show a quick diagram to get us back up to speed. This is similar to what Armon talked about, but simplified. The world before Connect was very much like this: you viewed things at an IP level; and when one service wanted to talk to another, you would reference the other IP and make the connection. This usually happened through a firewall, and it also usually happened with plain TCP. It was usually plain TCP simply because managing TLS is quite difficult; you have to think about how you protect keys, how you rotate keys, how you issue certs, how you distribute certs, so there's a lot going on. Typically, this is what you would see in most data centers: non-encrypted traffic using IPs to route from one to the other.
With Connect, what we did was allow you to view these connections at a more logical service level. So, instead of saying IP1 to IP2, you would think about it in terms of API to database. You would use the service name. The IP still exists, of course, because that layer of the network still exists; but from a connectivity standpoint, it's mostly invisible and automatic to the user. This all happens over mutual TLS because Connect and Consul are managing the CA for you... issuing the certificates, handling rotations, all this is automatic... and you're able to make rules that instead of IP1 to IP2, is API to database, or say more explicitly, "Web cannot talk to the database," and you can enforce this even though API and Web share the same machine, share the same IP.
This is Connect, and it's what we announced last year. It's been working great, we've seen a lot of users adopt it, and we're really excited about it. We've been improving on it, but there are still some challenges. One of the major limitations of this model is that the things at the end of the day still have to be routable. When your API service translates down to an IP, and your database service translates down to an IP, you still have to have connectivity, so they usually have to live in the same VPC, or the same subnet or something like that; and with this multi-platform world, this is more and more becoming untrue even at a simple level. The most common case is that you're running multiple regions, multiple VPCs, maybe in the same cloud platform, and now this isn't free. You don't automatically get access from API to database; you have to somehow set up the connectivity between these two things to make this work. Where this is becoming very apparent is in Kubernetes deployments.
Kubernetes generally wants an overlay network. So, Kubernetes within this one cluster has nice, full connectivity; but as soon as you want to connect either to a VM outside of it or another separate Kubernetes cluster that you might have, this problem arises and it gets very difficult. This sort of thing is happening earlier and earlier, and it's not just a big company problem; it happens basically as soon as you adopt the Kubernetes service, and this is a solvable problem. This isn't an impossibility, "Can't be solved with the technology we have." We have the technology today to solve this, but the challenge is that you essentially require a networking professional that really understands how to set this stuff up, and the process to set it up is typically very manual and labor-intensive. So, common approaches to make this work today may be things like VPN tunnels, IPsec tunnels. If you're doing hybrid cloud private to public, you might be setting up Direct Connect, direct fiber between different data centers. There are answers to do this, but it does get very difficult.
There's also another scenario that Kubernetes tends to introduce to make this even harder, which is that by default many hosted and on-premises Kubernetes distributions create the same IP space for their overlay networks. Instead of having 10.0.1.1 and 2.1, you actually have the same IP space on both sides. Now, you have two pods in two different clusters that actually have identical IP addresses, and how are they supposed to communicate with each other? In this scenario, things like VPNs no longer work and you have to do something a little bit more creative. Again, tractable problem, theoretically solvable problem, but practically very, very difficult, and I haven't seen it happen in a while. For larger companies the problem gets much worse, and so this scenario becomes quite realistic.
Over the past year, I've talked to more than a dozen of our users and customers for whom this is exactly the scenario that they're in. They have a traditional on-premises data center running something like [inaudible 00:27:24], they have some VMs in a cloud such as Amazon and Azure, and now they've adopted a platform like Kubernetes. They're all running separate networks, and yet they want connectivity between all of them. Again, between the on prem and VM and cloud, you could typically set up something like a Direct Connect; the challenge with that is it goes zero to 100 really fast. I mean setting up a Direct Connect is extremely expensive... and you really have to make that investment to make it happen. You can't just try it out, right? This is very difficult to set up. The other challenge is there's no consistency between those. You could do Direct Connect between your EC2 and VMware, but that's not really going to automatically work when VMware VMs want to talk to your Kubernetes pods, so you have to do something different for that and you have to set this all up, so this becomes very difficult.
As a solution to this challenge we built, and I'm excited to announce, a feature called "Mesh Gateways" into Consul. In one sentence, Mesh Gateways are a feature, built on top of Connect, that enables secure service-to-service communication across networks, maintaining complete end-to-end encryption, that also works everywhere, and it's all automatic. Let's go back to this challenge that I posed earlier, the simplified example where you just have two different networks. The way this would work in a Mesh Gateways world is like this: you would deploy a Mesh Gateway device, just software, onto the borders of your network; these could be VMs, they could be containers, they could be anything. The only requirement is that these deployments can talk to both the private network and the external network; and once you deploy them, Consul sees that and automatically routes traffic through the gateways across the different networks.
By having these gateways be limited, you could set up direct firewall rules so that one gateway could only talk to another gateway, so you don't have traffic free-flowing on the public internet or going anywhere; you could lock this down so that only your regions and your added devices could talk to each other, and that enables all of your services within each region or data center or platform to be able to communicate. When you do this, the security is maintained throughout. From API to database, it's encrypted all the way through. The Mesh Gateways cannot and do not decrypt the traffic; they just route it to the next place that it needs to go, and I'll talk more about that in a second.
Going back to the more complicated scenario, it now just looks like this: for every single region or network that you have, whether it's three up to N that you have, you run a gateway device on the edge of each network, and now you have complete connectivity across every region globally. The magic of this is that it really just makes it seem flat, right? From a developer's perspective, when their service wants to talk to the database, they don't need to worry about where that is, how to get there, or anything; you just say the same thing, "database.service.consul," you connect to it, and Consul handles the complexity of getting that routing to the right place, maintaining security throughout, etc.
As I said before, the problem was solvable, the problem was tractable, the issue was that it was very complicated; so our goal was not just to make this work, but to make this work easily, magically, automatically. The way this ends up looking in Consul is like this: you run a built-in command, and in this case it's the Envoy command because we're going to use Envoy; this starts up a Mesh Gateway, and all you have to do to configure it is give it a private address to listen on, and then a public address to listen on. With just this configuration, it'll register with Consul, so Consul's service registry is now aware the gateway exists. When API wants to talk to database, Consul knows the database is in a separate data center (Consul has been data center aware with services since it first came out five years ago); and when it knows this is in a separate data center and it sees that a gateway is registered in its catalog, it begins routing traffic through that gateway.
If you try to connect with a service that was within the same data center, it connects directly to the service and does not use the gateway. You could also run multiple of these. For high-availability reasons, or to balance bandwidth across these gateway devices, you could run as many of these as you want to. They're stateless devices, so you could run two, 20, 50 if you wanted to. You can do whatever you need. You could also set up configuration of whether services should or should not use gateways, so it doesn't need to be global. You could say some services can use the gateway, and other services I never want it going through a gateway across the data center, and you can do this on a service-by-service basis. This isn't completely obvious, but what I'm doing here is configuring the defaults for every single Connect proxy that we have set up, and I'm saying that the default should be to always use the gateway devices for traffic.
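As a rough illustration of the decision just described (a global default plus per-service overrides, with same-data-center traffic always going direct), here is a minimal Python sketch. It is not Consul's implementation or configuration format; the function name `uses_gateway`, its parameters, and the service names are all hypothetical.

```python
from typing import Dict


def uses_gateway(
    service: str,
    source_dc: str,
    dest_dc: str,
    default_use_gateway: bool,
    overrides: Dict[str, bool],
) -> bool:
    """Decide whether traffic to `service` should go through a gateway.

    Hypothetical model: `default_use_gateway` stands in for the global
    proxy default, and `overrides` maps a service name to its own setting.
    """
    # Traffic that stays inside one data center connects directly.
    if source_dc == dest_dc:
        return False
    # A per-service setting wins over the global default.
    return overrides.get(service, default_use_gateway)


# Example: gateways on by default, but "billing" must never cross via a gateway.
print(uses_gateway("database", "dc1", "dc2", True, {"billing": False}))  # True
print(uses_gateway("billing", "dc1", "dc2", True, {"billing": False}))   # False
```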
In addition to building the core feature set, the Consul team has done a good job of enabling gateways with other things in the Consul ecosystem; for example, right away we have support for gateways in the UI. If you find a service that is a gateway in the UI, you'll see next to the name that there's a label that notes that this is a gateway, and you can also click the IP addresses to see what its external addresses, its local addresses, and its data center specific addresses might be.
Let's talk a little bit about the details of how this works so it's not magic. As you might have figured out from the previous slide, the gateway proxy is just Envoy, and so it's really important to understand, and I explained this last year, that Consul and Consul Connect are a control plane; it handles the configuration, the metadata, issuing, so it's getting that data, but the actual bits flow through the data plane and Consul doesn't do that. Consul has a completely pluggable data plane. We have built-in support for Envoy, but Envoy is just implementing a public API; so anything else could register as a proxy, and it could also register as a gateway, and you can mix and match these. You can imagine that the gateway device for your on prem data center may actually be a hardware device. The hardware device could integrate with Consul through its APIs and act as a Mesh Gateway; but in the cloud, you might just use the built-in Envoy software, and that would be there, and you could mix and match these things.
From a security perspective, as I mentioned previously, gateways not only don't see your payload, but they cannot decrypt the payload, so we maintain full end-to-end encryption; there cannot be any man-in-the-middling happening for this traffic. The way we do this is that the gateways do not have access to the private keys to decrypt the data. The routing works by just inspecting the SNI headers which just say something like "database-dot some data center," it's what you're trying to reach; and besides that, they can't actually decrypt it. The only thing that has the private keys to decrypt the payload is the end service.
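To make the SNI idea concrete, here is a small Python sketch of a gateway choosing a next hop from the server name alone, without ever touching the encrypted payload. The `service.datacenter` name format, the function names, and the addresses are assumptions made for illustration; the real SNI layout used by Consul and Envoy is not described in this talk.

```python
from typing import Dict, Tuple


def parse_sni(server_name: str) -> Tuple[str, str]:
    """Split a hypothetical 'service.datacenter' SNI value into its parts."""
    service, sep, datacenter = server_name.partition(".")
    if not sep or not service or not datacenter:
        raise ValueError(f"unexpected SNI value: {server_name!r}")
    return service, datacenter


def forward_target(
    server_name: str,
    local_dc: str,
    local_services: Dict[str, str],
    remote_gateways: Dict[str, str],
) -> str:
    """Pick where to forward the still-encrypted bytes.

    Only the SNI value is inspected. The gateway holds no private keys,
    so this function never receives or needs the plaintext payload.
    """
    service, datacenter = parse_sni(server_name)
    if datacenter == local_dc:
        # Final hop: hand the connection to the service in this data center.
        return local_services[service]
    # Otherwise pass it along to the gateway of the destination data center.
    return remote_gateways[datacenter]


# Example with made-up addresses.
print(forward_target("database.dc2", "dc1", {}, {"dc2": "203.0.113.10:443"}))
```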
Additionally, intentions, which are the rule system that Consul uses for what services can or cannot talk to others, are enforced across data centers, so you're able to set up rules such as allow API to db, and deny Web to db, and you don't think in terms of data center. It doesn't matter where the db is or Web is; as you move these things around, as you move it from on prem to cloud or vice versa, your rules and your security remain enforced. Even if I have to route across data centers, it's all enforced.
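The key property of the rules above is that they are keyed by service name only, never by location. A minimal Python sketch of that idea follows; the data structure, the `is_allowed` function, and the default-deny choice are assumptions for illustration and do not reflect how Consul stores or evaluates intentions internally.

```python
from typing import Dict, Tuple

# (source service, destination service) -> True for allow, False for deny.
Intentions = Dict[Tuple[str, str], bool]

intentions: Intentions = {
    ("api", "db"): True,   # allow API to db
    ("web", "db"): False,  # deny Web to db
}


def is_allowed(
    rules: Intentions,
    source: str,
    destination: str,
    default_allow: bool = False,
) -> bool:
    """Check a connection by service identity only.

    There is deliberately no data center or IP parameter: moving a service
    from on prem to cloud does not change the answer.
    """
    return rules.get((source, destination), default_allow)


print(is_allowed(intentions, "api", "db"))  # True
print(is_allowed(intentions, "web", "db"))  # False
```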
The amazing thing about these Mesh Gateways and Connect in general is the separation of concerns that it can enable. For a developer, they don't need to think about routing; they just talk to whatever service they need to talk to, and Consul handles that for them. For someone like an operator who's setting up these clusters, as long as they configure the gateways correctly, then the entire world is flat, anything can talk to anything and it's there. Then, for someone who's more concerned about security, there's a central intention system that says what can and cannot talk to the others regardless of where it's deployed, and so you create this nice separation of concerns where each group can focus on a specific problem set, but it all works together to enable what each group wants. Mesh gateways are an open-source feature and they're available in Consul 1.6, which has been released for beta today. The Consul team has done a great job; I think that itself would be an amazing announcement, but they've done more. They're continuing to build upon Connect. I'm excited to announce that Consul has completed the much awaited Layer 7 features completely.
Today, routing in Consul looks like this. You would request a Web service, and Consul looks up the healthy instances and returns them directly to you. This is unchanged if you do nothing and upgrade Consul, but we've introduced a number of new features. With the new Layer 7 functionality, service discovery can now look like this. I should mention that each box that I introduce in the middle is totally optional... you could have one and not the other; you could have two, but not three... but if you wanted to do the full path, this is what it would look like. I'm going to walk through an example of what's possible with Consul now if we had everything enabled.
Let's start with HTTP routing. With HTTP routing, you can now configure in Consul Layer 7 HTTP routing rules based on request paths, headers, query parameters, and HTTP methods. Following this example, I'm going to configure some routing rules for our Web service. What I'm doing here is configuring a router for the service Web, and then I'm setting up a bunch of routes. In this case, I'm saying I want to match an HTTP request; and when the path prefix is "/admin," I want to rewrite the destination to the service "admin" rather than the service "Web." In addition, I want the prefix to be rewritten to slash; so when this HTTP request hits the admin service, the prefix is stripped. So, if you said, "Go to web.com/admin/user," it would actually go to the admin service and they would only see /user. What ends up happening is we requested web.service.consul, that matched that routing rule, and so it redirects to the service admin with the path slash, and now it enters the next step in this discovery chain which is traffic splitting.
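The routing step described here can be sketched in a few lines of Python. This only models the behavior from the example (match a path prefix, switch the destination service, rewrite the prefix); the `Route` class, its field names, and `route_request` are hypothetical and are not Consul's configuration schema.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Route:
    path_prefix: str     # prefix to match on the incoming request path
    destination: str     # service that should receive the request instead
    prefix_rewrite: str  # what the matched prefix is replaced with


# Hypothetical router for the "web" service, mirroring the talk's example.
WEB_ROUTES: List[Route] = [
    Route(path_prefix="/admin", destination="admin", prefix_rewrite="/"),
]


def route_request(service: str, path: str, routes: List[Route]) -> Tuple[str, str]:
    """Return the (service, path) the request should be sent to.

    Naive prefix matching: "/administrator" would also match "/admin".
    A real router would be stricter about segment boundaries.
    """
    for route in routes:
        if path.startswith(route.path_prefix):
            rest = path[len(route.path_prefix):]
            new_path = route.prefix_rewrite.rstrip("/") + "/" + rest.lstrip("/")
            return route.destination, new_path
    # No route matched: the request stays with the original service.
    return service, path


print(route_request("web", "/admin/user", WEB_ROUTES))  # ('admin', '/user')
```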
With traffic splitting, we're able to set weights and redirect traffic between multiple subsets of services that are deployed. Continuing this example, we are now configuring the admin service, because remember we were redirected from "web" to "admin." So, Consul is going to look at the traffic-splitting rule for "admin" rather than "web," and we're setting up some splits, and the splits in this case are saying, "Send 10% of traffic to version two, and 90% of traffic to version one." Pretty simple, I think. For this example, let's say we hit the 10%, the 1-in-10 dice roll, and we got subset V2.
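A weighted split like the 10/90 example can be modeled as a walk over cumulative weights. The Python sketch below takes the dice roll as an explicit argument so the behavior is easy to check; `pick_subset` and `ADMIN_SPLITS` are illustrative names, not part of Consul.

```python
from typing import List, Tuple

# (weight in percent, subset name), mirroring the example from the talk.
ADMIN_SPLITS: List[Tuple[float, str]] = [
    (10.0, "v2"),
    (90.0, "v1"),
]


def pick_subset(splits: List[Tuple[float, str]], roll: float) -> str:
    """Choose a subset for one request.

    `roll` is a number in [0, 100), normally drawn at random per request.
    """
    total = sum(weight for weight, _ in splits)
    if abs(total - 100.0) > 1e-9:
        raise ValueError("split weights must add up to 100")
    if not 0.0 <= roll < 100.0:
        raise ValueError("roll must be in [0, 100)")
    cumulative = 0.0
    for weight, subset in splits:
        cumulative += weight
        if roll < cumulative:
            return subset
    # Only reachable through floating-point rounding at the top of the range.
    return splits[-1][1]


print(pick_subset(ADMIN_SPLITS, 5.0))   # v2
print(pick_subset(ADMIN_SPLITS, 50.0))  # v1
```

In real use the roll could come from something like `random.random() * 100`, which stays inside the half-open range the function expects.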
So, we came in as "web," got redirected to "admin," got split onto the V2 path, and now we hit the step of custom resolution. Think of resolution in Consul before as simply, "Find all healthy instances and return them." You can now customize this logic that Consul uses to do a number of different things. You could say, "Find healthy instances that match with certain metadata keys," "Find unhealthy instances," if you want, so you could failover to other data centers if no healthy instances are available. You could customize this resolution logic.
For this example, we're just going to set up resolution logic that knows how to route this V1 and V2 subset. What we're doing here is creating a service resolver for, again, the admin service. We're saying the default subset, if none of our filters match, is V1, assume it's V1, and then we create the filters in order to assign it to a subset. So, we say you're assigned to the V1 subset if you have a metadata key version that is equal to 1, and you have a V2 subset if you have a metadata key version equal to 2, and these filtering rules could be pretty arbitrarily complex logic on a number of different fields within the service; so you could do tags, you could do metadata, and it can get pretty complicated if you want it to. With that, we know that we're looking for metadata version equals 2; and finally at the end of the chain, Consul looks for those instances, looks for the ones that are healthy, and returns those all the way through.
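The resolution step can be sketched as named filters over service instances plus a fallback subset. This Python model is an approximation: the `Instance` class, the filter table, and the choice to apply the default when no subset is requested are assumptions for illustration, not Consul's resolver schema or exact semantics.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Instance:
    address: str
    healthy: bool
    meta: Dict[str, str] = field(default_factory=dict)


# Hypothetical resolver for the "admin" service: one filter per subset.
SUBSETS: Dict[str, Callable[[Instance], bool]] = {
    "v1": lambda inst: inst.meta.get("version") == "1",
    "v2": lambda inst: inst.meta.get("version") == "2",
}
DEFAULT_SUBSET = "v1"


def resolve(instances: List[Instance], subset: Optional[str] = None) -> List[Instance]:
    """Return the healthy instances that belong to the requested subset.

    When no subset is requested, the default subset is used.
    """
    matches = SUBSETS[subset if subset is not None else DEFAULT_SUBSET]
    return [inst for inst in instances if inst.healthy and matches(inst)]


admin_instances = [
    Instance("10.0.0.1:8080", healthy=True, meta={"version": "1"}),
    Instance("10.0.0.2:8080", healthy=True, meta={"version": "2"}),
    Instance("10.0.0.3:8080", healthy=False, meta={"version": "2"}),
]
print([inst.address for inst in resolve(admin_instances, "v2")])  # ['10.0.0.2:8080']
```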
In this example, we started requesting one thing; and because of the HTTP routing, traffic-splitting, and custom resolution, it completely changed the results that you would get by default. Again, all of these boxes are totally optional. You could have just traffic-splitting on with a normal connection, you could only have custom resolution just to do failover, you could set up just basic HTTP routing to set up simple routing between different services, or you could use them all together.
Of course, this seems kind of expensive, right? If every single request to the Web had to go through this entire chain, that's a lot of things to do; Consul is much smarter than that. We could rely on our agent-based model, and what we actually end up doing is on each agent at the edge, when the first request comes in we do this, and then we set up an edge-triggered cache within that client; so that from that point forward when any of these rules change, we push the new results to the client regardless of if there's a request coming in, and so that every request going forward from that is effectively free; it's just looking up on a table of where to route to. If we don't get requests for a certain amount of time, it drops off the cache and you get all that back. So, this is how this works. It's effectively free for high-traffic scenarios, and that's thanks to our client-based agent model. Just like Mesh Gateways, all of this functionality is available open-source in Consul 1.6, the beta that's available today.
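The caching behavior described here (compute on first request, push updates when rules change, drop idle entries) can be sketched as follows. This is a toy Python model and not the agent's actual cache; the `ChainCache` class, its method names, and the explicit `now` timestamps are all hypothetical.

```python
from typing import Callable, Dict, Tuple


class ChainCache:
    """Toy per-agent cache of resolved routing results."""

    def __init__(self, compute: Callable[[str], str], ttl_seconds: float) -> None:
        self._compute = compute  # the expensive walk through the whole chain
        self._ttl = ttl_seconds  # how long an entry may sit idle
        self._entries: Dict[str, Tuple[str, float]] = {}  # service -> (result, last used)
        self.computations = 0    # counts how often the chain was actually walked

    def lookup(self, service: str, now: float) -> str:
        entry = self._entries.get(service)
        if entry is not None and now - entry[1] <= self._ttl:
            result = entry[0]  # cheap path: plain table lookup
        else:
            result = self._compute(service)  # first request, or entry went idle
            self.computations += 1
        self._entries[service] = (result, now)
        return result

    def on_rules_changed(self, service: str) -> None:
        """Push path: refresh a cached entry even if no request is in flight."""
        if service in self._entries:
            _, last_used = self._entries[service]
            self._entries[service] = (self._compute(service), last_used)
            self.computations += 1


rules = {"web": "admin-v1"}
cache = ChainCache(lambda service: rules[service], ttl_seconds=60)
cache.lookup("web", now=0)     # walks the chain once
cache.lookup("web", now=10)    # served from the table
rules["web"] = "admin-v2"
cache.on_rules_changed("web")  # new result pushed into the cache
print(cache.lookup("web", now=20), cache.computations)  # admin-v2 2
```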
Finally, for Consul, I want to talk about Consul and Kubernetes. Armon mentioned the realities and the importance of a multi-platform world; Kubernetes is quite a large platform in that ecosystem. With Consul, we've been integrating deeply with Kubernetes, but one of the first questions is, "What does that mean?" Kubernetes provides a number of the primitives that Consul does on the surface. It has discovery through functionality such as services, kube DNS, kube-proxy, it has service configuration using things like ConfigMaps, and it has some limited set of segmentation using network policies and controllers. The challenge is this generally only works for Kubernetes; and worse, it only works for single-cluster Kubernetes. So, even if you're pure Kubernetes and you have two clusters, these don't magically work together; you have to figure out a way to set up Federation, which doesn't quite work yet, and it's difficult to do. If you have something that is non-Kubernetes, then this is mostly off the table.
The goal with Consul is to unify everything; this is the "works everywhere" point that we make in all our descriptions of our features. Consul works with Kubernetes, it works with cloud, it works with VMs, it's multi-data, multi-cloud, etc., and importantly it exposes the features in a consistent way across all these different environments. But, of course, we don't just want to say it theoretically works, it's just a binary and you could run it anywhere, but it's important to integrate with those ecosystems in order to make it feel natural. So, what we've done with Consul is, for the past year, we've had a project called Consul Kubernetes that provides first-class native integrations with Kubernetes to enable many of these features of Consul to work in a consistent way.
I won't go into detail, but just to highlight some of the features that are available now with Consul and Kubernetes. We've had an official Helm Chart so you could install pretty much all of Consul's features directly into Kubernetes. We have something called an "auto-join provider" so that if machines outside of Kubernetes want to join a Consul cluster within Kubernetes, they don't need to know Kubernetes APIs, pod IPs, how to route to them, anything; they just say it's in Kubernetes, "Here are the rules and filters," and it just goes and finds them and joins them for you to form the cluster. We do catalog syncing so that you could continue within your Kubernetes environment to use first-class Kubernetes services, which are automatically synced into Consul's catalog for external services; and vice versa, external services are synced into Kubernetes service objects that they could use internally to route to all these different services. We do Connect auto-injection so that your pods automatically get the Connect proxy injected into them, so they could use the Connect functionality; they don't need to manually start up Envoy or anything like that.
From a security standpoint, a challenge with something like Consul is we have an ACL system in Consul, and the question is, "How do I get an ACL token to talk to Consul?" With the past couple releases, you can now authenticate to Consul directly with a Kubernetes service account so you don't need an ACL token, we automatically do that auth for you, and there's also automatic TLS. So, if you're in a Kubernetes environment or otherwise, automatic TLS could be enabled so that all the clients talk to the server over TLS with all the certificates automatically managed. Now, this is a cool feature that we've been able to build on top of Connect internally so that even if you're not using Connect externally, the Consul cluster itself could leverage its own service mesh capability to do automatic TLS for all its private communication.
Then, just like UIs, the Kubernetes team with Consul has done a great job of integrating Mesh Gateways right away for day-one release, and so they work right away. The way the Mesh Gateways would work is you check out the latest version of the Helm Chart and you set some configuration, and this is the minimum configuration you would need: you say, "Enable Mesh Gateways," you set the number of replicas... again, this is for high availability, or just more bandwidth availability on your gateways... and then you just tell us what the outside external address to talk to this gateway is, you run your Helm update or Helm install, the gateways are deployed, and all your traffic within Kubernetes can now talk to any other network.
I mentioned this earlier, but a really important use case for this is multiple Kubernetes clusters. So, even if you're pure Kubernetes, you could set up Mesh Gateways now and have your pods be able to route with full-service HTTP routing, traffic-splitting, custom resolution, automatic TLS, authorization all included across Kubernetes clusters even if the two pods have an identical IP address, so you could have the same overlay on both sides and this still works. This is a really, really magical and important feature to enable, again, making the world seem flat and making your services just be able to connect to each other.
It's really fun to talk about all these new features, but it's also important to hear from real users about how they're using our software. In order to do that, I would like to invite to the stage Laurent from Datadog, to talk about their usage of our tools. Thank you.