Presentation

HashiDays 2018 Keynote: Introducing Consul Connect

Consul Connect—a major step forward for the service mesh space—is the focus of the HashiDays 2018 keynote. Learn how these new features make life easier for the operators that manage and secure distributed applications.

HashiCorp opened its HashiDays 2018 conference in Amsterdam this week with major announcements for our open-source Consul product. First released in 2014, Consul is a utility for connecting and configuring applications in a distributed environment—either on-premises or in a public cloud. Consul already runs on more than 5 million machines worldwide and has over a million downloads each month.

In this keynote, Armon Dadgar and Mitchell Hashimoto, the co-founders and co-CTOs of HashiCorp, introduce Consul Connect: an expansion of Consul's service mesh capabilities. With this release, Consul adds service segmentation to its list of high-level features—a list that already includes service discovery and distributed configuration management.

With this announcement, you'll learn how Consul Connect eases network management with:

  • Sidecar proxies: Which bring TLS communication to your services with no dependencies and no application modification.
  • Service Access Graph: Which gives operators centralized, scale-independent control over allowed service communication.

And how Consul improves security with:

  • Automatic Traffic Encryption: With mutual TLS and a low-trust posture.
  • Certificate Management: With a built-in certificate authority (CA) provider that generates, distributes, and rotates certificates with no service disruptions.

There is a small performance impact, but this can be mitigated with Consul's new native integration capabilities—allowing users to build connection management without proxies.

Consul Connect is a major step forward in the service mesh space, and this keynote lays out all the benefits for engineers and managers to consider.

Speakers

  • Mitchell Hashimoto
    Co-founder, HashiCorp
  • Armon Dadgar
    Co-founder & CTO, HashiCorp

Transcript

Good morning, everyone. Thank you so much for joining us, and thank you, Nick, for the very warm introduction. I want to welcome all of you to HashiDays officially. I know Nick already said it, but we have an awesome lineup of talks. It's going to be a really, really exciting event, so we're glad to be able to do it.

For many of you, I expect this is actually the first HashiCorp event you've been to, and I think maybe some of you were here for our actual first ever European event, which was not very far from here in Amsterdam two years ago. Fun little fact: the giant H is actually making a return from the first HashiConf event here. It got saved, and we wanted to bring it back for you.

Last year we made sort of a slight mistake. We took a detour from Amsterdam, and we missed out. We ended up in London last year, so we were very, very excited to be able to bring it back to Amsterdam this year and join all of you.

What you'll see around you is many, many people, a little over 400, in fact. These are your peers in industry, these are folks who are joining us from all around Europe, all around the world. It's practitioners who are here to learn about the state of the art, whether it's in DevOps, in cloud infrastructure, in how we work together and work as teams more efficiently.

So I'd encourage you, don't just talk to the folks you already know. Use this as an opportunity to meet new folks, talk about new things, learn and share what you already know with people who maybe don't already know it.

The HashiCorp Diversity Scholarship

As part of that effort, one of the things we care about deeply at HashiCorp is, how do we make our communities more diverse? A program that we kicked off earlier this year, thanks to Mishra—if you find him anywhere, he is the mastermind behind this—is the HashiCorp Diversity Scholarship.

What we felt is, one of our struggles is, how do we bring in more people from under-represented groups that don't usually find their way to technical conferences like the ones we put on? So what we wanted to do is find ways to bring more of these people to our conferences and be able to meet them and share what we know with them, and bring their perspectives into the community. So we're glad to have several people joining us from all over the world to participate in this event, thanks to the scholarship.

If you know folks who are interested and would like to join us for HashiConf in the US, we have a similar program for that one as well.

The market shift from static to dynamic software environments

In the spirit of sharing knowledge, I think something probably many of us in this room know is we're going through a very, very large transition right now in the market. We're pretty universally moving from a world of private data center, VMware traditional point-and-click administration to—how do we start to embrace public clouds, whether it's AWS, Azure, GCP, all of the above?

There's this sort of transition taking place in the market as we try and both change where we're landing, moving from our private data centers into the cloud, and change the process by which we think about infrastructure, provision it, and move to more of an agile, self-service, DevOps model.

In our view, this changes a lot of things. As we go through this transition, it's not just a shift of one or two tools, it's not just a slight tweak of our process. Our view is, this is a pretty large change in the way we think about delivering applications, and it impacts many different groups.

It impacts the way our operations teams think about provisioning infrastructure. It impacts the way our security teams think about securing our applications, our data, our underlying infrastructure. It changes the way we deploy our applications, how we package it, how we think about CI/CD, what are the run-time environments we use for it? And lastly, it changes the way we think about networking, both at the physical level as well as the application level.

In our view, there is a broader kind of meta-theme. It's this meta-theme around a transition from a world that was much, much more static to a world that's much, much more dynamic and ephemeral and elastic. And as we undergo this change, it starts to break a lot of things.

So whether we're talking about infrastructure, where we used to have dedicated servers that were relatively homogeneous to—now it's very heterogeneous. We're running across multiple environments, and these things are coming and going all the time. We're not provisioning the VM and letting it live for months or years. We're provisioning the container and letting it live for hours or days.

So a very different scale of infrastructure, a very different elasticity of infrastructure. So how do we change how we think about infrastructure? It can't just be the same point-and-click approach as we try and do at orders of magnitude more scale and orders of magnitude faster.

New challenges in security, development, and networking

As we think about the shift in security, we're moving from a world where we largely depended on the four walls wrapping our infrastructure. We had a notion of the network perimeter, and we pinned things on IPs. IPs gave us a sense of identity. We knew this IP is the web server for at least the next few years—to a world where we don't really have four walls. We're kind of an API call away at any time from any node, serving or receiving public-facing traffic. And at the same time, the IP is just this recyclable unit. The VM dies, a new VM comes up, gets the same IP. Containers are recycling it all the time. So how do we think about the change in security as we move to this perimeter-free, low-trust model and lose the identity we had at an IP level? So there's a lot of challenges there.

As we think about our development tier, we are going from maybe a handful of relatively monolithic frameworks—maybe our giant Spring framework or our C# framework—to now we want to support many, many new ones. There's a Cambrian explosion of interesting tools that have come out over the last few years, whether it's our container platform, our Spark big data platform, or event-driven Lambda architectures. There's a huge variety of new platforms we want to explore and leverage, and we want to give our developers a toolchain, so that if their application is more event-oriented, they can use that, or if it's more big data-oriented, they can use something like Spark, and provide all of these as part of the toolkit.

Lastly, there are the changes in networking land, which is, as we're making this shift, as we're shifting from this dedicated, relatively stable infrastructure where we had the notion of an IP, to now this very dynamic ephemeral infrastructure, how do we change our networking to keep up as well? We don't want to think in terms of IPs and manually updating our load balancers. Instead, we want to think in terms of fine-grained services, which might be a container, might be a lambda function, might be a VM. But these things are coming and going and scaling up and down. So in our view, this dynamic change is sort of the underpinning theme to all of these transitions.

What you'll see is that this has been our focus for a long time from a toolchain perspective: how do we lean into this change and build a toolchain that was designed for it? Thinking about cloud as sort of our native operating environment, and really focusing on what that experience should be as we're moving to this world where we want infrastructure as code, we want microservices, we want cloud-oriented infrastructure.

Networking challenges outside the monolith

Today I want to spend a little bit more time talking about what's happening in networking land, and starting with really breaking down the monolith and what changes as we start to do so? When we talk about the monolith, what we usually mean is a single application that has many discrete subcomponents to it. An example I like to use is, suppose we're delivering a desktop banking application. It may have multiple subcomponents. A might be log-in. B might be view balance. C might be transfers. D might be foreign currency. So these are four discrete types of capability, different pages the user's interfacing with, different APIs, things like that. But we're delivering it as a single packaged application, a single monolithic server.

What's nice in this format is, when these systems need to interact with each other, system A needs to call system B, it's easy. We just mark a method in B as being public. We export it, and now A can just do an in-memory function call. No data is leaving the machine, we're just doing a quick function hop over, and then jumping back to A. So there's a lot of nice properties about how we compose these systems together.

But what about things that run outside of the monolith? Because not everything is going to be within the app. What we find is most of the things are really databases. We have most of our logic encapsulated within the application, and we're just talking to databases by and large, through static IPs, when we talk about monolithic apps. As we need to start scaling up these applications, what we typically did was not break out the sub-pieces, because we can't, they're being delivered as a single application. So instead we deliver more copies of the monolith and then split our traffic over multiple instances, using a load balancer approach.

To secure the overall system, we split up into three logical zones. Zone 1, we don't trust at all, is our demilitarized zone, traffic coming in from the internet. Then kind of the middle tier zone, the application zone, where the monolith runs itself. And finally our database, data zone, which is shielded from everything except for the applications.

So what changed? Why do we need to do anything different? What's happened in the meantime? I think the first big thing is we've stopped writing monolithic applications. Instead, over the last few years, there's been a shift away into microservice or service-oriented architectures, where the core crux is, how do we move away from packaging and delivering all of these subsystems as a single deliverable and deliver them as discrete logical services?

The advantage of this is, now if there's a bug in, let's say, our log-in portal, we don't now have to wait for B, C, and D to all be in shape to be able to cut a release and deliver the monolith as one unit. If there's a bug in A, we can patch A and just redeploy A without having to wait for development teams B, C, and D to be ready and build the compiles and redo all of the QA. So it gives us a lot more development agility. We're able to deliver these different discrete capabilities at whatever cadence makes sense, whether it's feature delivery, whether it's bug fixes, whether we're scaling up and down these individual pieces. We gain a lot more flexibility because we're not tightly coupling all of these different functions together.

Service discovery

Unfortunately, like most things in life, there's no free lunch, so what we are gaining with developer efficiency of being able to do and run A as a separate service, we're starting to lose in terms of operational efficiency. There are new challenges that we inherit as a result. The first one, the one that becomes obvious the fastest, is how do we do discovery? Historically, what we used to do is just say mark this as a public function, A can call it, and now it's just an in-memory function hop. Well now, it's not just "mark it as a public service," because it's not compiled into our application. It's not part of the app, it's not even running on the same machine. It's somewhere over the network. So as system A, how do we find, how do we discover system B to call it over our network?

Configuration management

Another challenge we inherit is, how do we configure our distributed application? As a monolith, all of its configuration lived in a single massive properties.xml file, but the advantage of this was it gave all of our application a consistent view. If we changed something into being maintenance mode—we want to do database maintenance or a schema change—we would change the config file, and all of the subsystems A, B, C, and D would believe we're in maintenance mode simultaneously. Versus now, we have a distributed configuration problem. We don't want A to believe we're in maintenance mode while B does not believe we're in maintenance mode. We might get inconsistent behavior at the application level. So how do we deal with the fact that now we have this distributed configuration and coordination problem?

Service segmentation

The last major problem we're inheriting is a security problem. In the traditional monolithic world, we had the three zones, but the challenge was, what we were doing with the three zones was basically segmenting our network. And when we talk about network segmentation, what we're doing is taking a single larger physical network and splitting it into smaller chunks, so segment A and B as part of a larger physical network. What this let us do is restrict the blast radius, so if there was a compromise in segment A, it wouldn't overflow into segment B. We could control the traffic at a coarse grain between these different segments. There were many different techniques for doing this, virtual LANs or VLAN, firewall-based approaches, software-defined network-based approaches. But overall, what these gave us was a relatively coarse-grained way of bucketing different aspects of our infrastructure together, and so each of these segments may have still had dozens or hundreds of discrete services as part of it.

Now, the challenge as we start talking in a microservice architecture is, where do we draw those dividing lines? We still have the line on the left, which goes to our demilitarized zone, and the line to our right, that goes to our data tier zone. But now our internal application zone has a much more complicated service-to-service flow. It's no longer one application that's talking internally via function calls, it's many discrete services talking over a network. So how do we start to draw lines in between it? With a simple example, we might look at these four services and say, well, you can still do the same thing. You can still cross-hatch and put firewalls in between all of these things.

The problem is, this is meant for illustrative purposes. This is a simple example. As we start talking about real infrastructure, it's not four services, it's hundreds. In the case of some of our customers, it's thousands of applications with complicated service-to-service communication flows, where it's no longer obvious where these cut points are. It's not trivial to figure out where do I deploy firewalls and what should my network topology look like to constrain this traffic anymore?

The ideal network segmentation scenario

So how do we think about this problem? In some sense, what it starts with is saying, what's ideal? What would be kind of the perfect scenario? The perfect scenario, we'd be able to move away from a coarse-grained model, where there are hundreds of services as part of a virtual segment, to saying there's actually only the services—the fine-grained boundary only matches exactly two services. It's only the sender and receiver. That's what I've indicated with the orange boxes. What if you could draw your network segment that finely, where you said A can talk to B, and then you maybe have another fine-grained segment that says C and D can talk bi-directionally to each other?

In this arrangement, we bring down the segment to only those services that essentially need to communicate. But these divisions are rarely that clean—it's not easy to cleanly separate them. It might be the case that A actually still needs to talk to C and B still needs to talk to D. So, how do we avoid creating a huge zone that says, "A, B, C, and D can talk to each other freely," right? Instead, we want the ability to overlap these definitions. So, we may have an overlapping definition that says, well, "A can also talk to C," even though we've already defined that A can talk to B as well. But the fact that A can talk to both C and B should not imply that C and B can talk to each other. There shouldn't be associativity of access.

So, what we'd like to do is maintain fine-grained control over how these things actually communicate, without resorting to creating the large segment (the large blast radius) of just saying all of these services can communicate because it's too hard to find the dividing lines between them.

Solving service-oriented networking challenges with Consul

As we talk about the challenges of moving from monolithic architecture to microservice architecture, what we're doing is sort of talking about the trade-off between our developer efficiency and our operational challenge. What we gained was, now all these pieces can all be developed independently, deployed independently, scaled independently. But we've inherited three operational challenges. How do we have all of these pieces discover one another, how do we solve our distributed configuration challenge now that we no longer have one configuration file, and how do we segment our network such that we don't have an enormous blast radius?

The way we think about it is, these three capabilities together are really what a service mesh aims to solve. As we go to a microservice or service-oriented pattern, what are the challenges, and is there a solution that thinks about them in a well-integrated way, as opposed to a patchwork of different technologies we have to bring together?

As we look at how we've solved this problem over the last few years, for a large part we've looked at the first two. For folks who are less familiar with Consul, what we've done in the past is really through two different mechanisms. One is Consul's notion of the service registry: the registry is a central catalog of all of the nodes in your infrastructure, all the services running, and their current health status. The goal is that this registry captures everything that's running, such that you can solve the discovery problem. As a service comes up, it can be programmatically inserted into the registry, and then when any of your downstreams need a route to it, they can basically query the registry online. So instead of using a static IP address that's maybe going to a load balancer, you can just talk to the registry and say, "What are all of the downstream databases?" or, "What are all of the downstream APIs?"

This has historically been integrated using a DNS interface. So, for most applications, there is no change to them. They just start querying for database.service.consul, and behind the scenes that's being translated by Consul into a lookup of the database. It lets us mask the location of services and deal with IPs changing, and instead hardcode a service name and not an IP address.
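
For example, a minimal sketch of that workflow might look like the following (the file name, port, and health check are illustrative): drop a service definition into the agent's configuration directory, reload the agent, and resolve the service over Consul's DNS interface on port 8600.

    # database.json -- an illustrative service definition in the agent's config directory
    {
      "service": {
        "name": "database",
        "port": 5432,
        "check": {
          "tcp": "localhost:5432",
          "interval": "10s"
        }
      }
    }

    # Pick up the new definition, then discover the service via DNS
    consul reload
    dig @127.0.0.1 -p 8600 database.service.consul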

The other challenge we've looked at for a long time was distributed configuration, and our view was, how do you put that into a central key-value store and then expose that with a series of APIs and the ability to block and receive changes in real time? Consul's HTTP API allows you to trigger and notify any time a change is made. So now you can flip a flag that says, "We're going into maintenance mode," and all of your services can get that in real time, as opposed to changing 50 configuration files and redeploying all of your services. That's how we've looked at solving the distributed configuration problem.
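
As a rough illustration (the key name and handler script here are made up), that maintenance-mode flag could be flipped and watched like this:

    # Flip a maintenance-mode flag in the central key-value store
    consul kv put app/config/maintenance true

    # Each service host can watch the key and react in real time; under the hood
    # this uses the HTTP API's blocking queries
    consul watch -type=key -key=app/config/maintenance ./reload-handler.sh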

The question then remains for us, how do we solve segmentation? This has been an exercise left to the reader. So today, we're very, very excited to talk about a first-class solution for this problem that we're calling Consul Connect.

So now I'd like to welcome Mitchell Hashimoto onto the stage to talk to us about Consul Connect in more depth. Thank you so much.

Mitchell Hashimoto introduces Consul Connect

I'm super excited to talk about Consul Connect today. It's something we've been working on in Consul for a very long time. We started talking about it and then designing it over a year ago and we're super excited to bring it out into the public today.

As Armon mentioned, Connect is a feature that's built directly into Consul, and so for those who are less familiar with Consul, Consul's a product that we've had that's free and open source since 2014. In that time, it's amassed quite a large community, as the GitHub stars indicate: there are over 12,000 GitHub stars on Consul. That's somewhat of a vanity metric, so you could also look at things like downloads and actual usage. Consul gets over a million downloads monthly, and these are deduped downloads. Also, we know of multiple customers that are running single clusters as large as 50,000 agents. So all of this is to say that Consul's very popular, it works at scale, and it's known to have a lot of operational stability attached to it.

Some of the users that have talked about Consul publicly are listed here. You could look these up online and find talks associated with them about how they use Consul. And the amazing thing is, all these users use Consul in a way that's very, very core to their infrastructure. The problems that Armon mentioned—discovery and configuration—that Consul has solved for a number of years are extremely important to modern, dynamic infrastructures. Consul plays a critical part for these companies.

So we've built Connect directly into Consul as a new feature especially because it needs discovery and configuration capabilities. But, we get a lot of benefits from that, including building on top of this operational stability and building on top of systems that we know are already mature and work at a very large scale. And you'll see how those are very important pretty soon.

Introduction to Consul Connect

So just to reiterate, what Connect is, is a feature for secure service-to-service communication with automatic TLS encryption and identity-based authorization that works everywhere. And the really key words in this sentence are: automatic TLS encryption, identity-based authorization, and that it works everywhere. I'm going to dive into each of these in more detail.

But taking a look at where we're coming from once again, in a traditional environment, our view of identity was generally tied to IP addresses and hosts. So, in this very small example, we have two hosts with IPs attached to them and we have a firewall in the middle. We'd create a rule that says, "IP 1 can talk to IP 2." And when we make that connection, it's allowed and it generally happens over plain TCP due to a variety of other complexities.

In a Connect-focused world, we're instead diving deeper into the host. We're not looking at the IP. The IP still exists because a host still exists, but what we actually care about is the services that are running on that host. In this case, if we dig into IP 1, we can see that we have an API service and a web service and if we dive into IP 2, we see that we have a database. In this scenario, we don't have a firewall anymore, we instead create rules that are based on their identity. And when they connect, in this case—API going to the database—we have a rule that says the API can talk to the database. It's allowed and we do this over mutual TLS. This gives us both identity and encryption.

But then, we can also have another rule that says web cannot talk to DB, and so when that connection is attempted, the connection is refused, even though the connection is coming from the same source machine. The IPs in this case really don't matter. We could dive directly into the service. We get fine-grained control and everything is encrypted automatically.

The other exciting thing I have to say is, everything we're gonna talk about today over the next 10 to 20 minutes is free and open source.

Connect is built up of three major components.

  • The service access graph

  • The certificate authority

  • A pluggable data plane

We take those three components and wrap them up into an easy-to-use and operate package. So let's dive into each one of these starting with the service access graph.

Service Access Graph

The service access graph is how we define which services can actually communicate with each other. In the world of Consul, this is done using something we call Intentions. Intentions define a source and destination service and whether the connection is allowed or denied. Using this, each individual service can have its own rules that are independent of the number of instances you have of that service. So it's scale-independent. In a traditional firewall-based world, if you had 100 web servers that ran on 100 different machines, that was, at a minimum, 100 rules if they were talking to one service. Bring up, let's say, five databases, and now you have a multiplicative explosion of the number of firewall rules you actually need.

In a highly dynamic, ephemeral world, you would need to create tooling to dynamically update these firewall rules as things come and go. And we've found that that's just difficult to scale both organizationally and technically. But with Connect and with Intentions, it's just always one rule. If you have 100 web servers, five databases, and they're scaling up and down, the rule is always web can talk to DB; it's just one rule. It's organizationally simple and technically very very scalable. These intentions can be managed with the CLI, the API, the UI, or Terraform. All four of these methods are available immediately with the launch of Connect.

Looking at an example of the UI, this is what we're launching today (~25:00 in the video). You could view your intentions directly in the UI, search them, filter them, see which ones allow, see which ones deny, and they're all sorted by the priority in which they would be applied as well. You could edit Intentions and they take effect almost instantly. You could watch two services happily connecting together, set the intention to deny, hit save, and the connections stop working. These Intentions can also be managed by a very easy-to-use CLI for people who are more CLI friendly. All of these, behind the scenes, are using the same API, and we're also launching Terraform resources so you can manage that with code as well.
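
For instance, the CLI side of that workflow is roughly the following (service names are illustrative; exact-match intentions take precedence over the wildcard):

    # One rule covers every instance of web and db, however many are running
    consul intention create -allow web db

    # Deny everything else from reaching the database
    consul intention create -deny '*' db

    # Ask Consul whether a given connection would be authorized
    consul intention check web db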

A really important property of Intentions is that they separate the actual rule creation of what can talk to each other from the service deployment. When you create Intentions, the services don't even need to exist yet. You could say the web service can talk to the database before the web service or database even exists, so that when those are actually deployed, the connections work or don't work immediately. You could also use the ACLs within Consul to define separately who can actually deploy a service and who can manage the rules associated with communication for that service. So you could have different groups of people—if you want—or the same people, who can modify the intentions versus actually registering the service.

So those are Intentions. They're extremely easy to use, extremely simple as a data model, and they're fast.

Certificate Authority (CA)

The next important concept is the certificate authority. The certificate authority is the way that we establish two really important properties of Connect: identity and encryption. To build these properties we use TLS (Transport Layer Security). TLS is a very well-adopted protocol, and it has this really nice property that it was designed especially for completely untrusted networks, specifically the public internet. And that is the type of mentality we're trying to bring down into our infrastructure. We're trying to make that a low-trust environment where we get end-to-end identity and encryption, so that we can confidently make these connections despite not trusting the network in between or trusting the applications around us. Using TLS, we get identity by baking that directly into the certificate, and we get encryption by nature of TLS's transfer protocol.

A challenge with TLS, and a primary reason it's not better adopted within the data center, is actually generating, managing, and rotating certificates. This is a pretty big challenge and one we have quite a bit of experience with thanks to Vault. So what we've done with Consul is actually build all of this directly into it. We've expanded Consul's APIs to support requesting root certificates, signing new certificates, generating intermediates, and rotation.

Rotation is a really big one. This is something that's generally really tricky. Because, ideally, in a perfect world, you want short-lived certificates so that if certificates were to be compromised, the period for which they're compromised is fairly short even if you have revocation. With Consul, you're able to do that because we can automatically rotate all the certificates with zero disruption to service connectivity. This is all built directly into Consul, so as you update the configuration, if you change the root certificates, if you change the way that certificates are generated, Consul automatically orchestrates updating the certificates across thousands of your services and overlaps them so that there's never a point in time where service connectivity doesn't work.
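
As a sketch of what those APIs look like from the agent's perspective (endpoint paths shown here should be checked against the docs for your Consul version):

    # Fetch the currently trusted root certificates from the local agent's cache
    curl http://127.0.0.1:8500/v1/agent/connect/ca/roots

    # Fetch a leaf certificate for the "web" service; the agent keeps this
    # up to date as certificates are rotated
    curl http://127.0.0.1:8500/v1/agent/connect/ca/leaf/web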

We also have an approach with pluggable CA providers so that you could use the PKI system that your organization has chosen to adopt to generate and sign these certificates. So, let's see how that works. One of the ways that Consul could generate certificates is using a built-in CA. We've built a really basic, built-in CA based off of Vault's code directly into Consul. So when you adopt Consul with Connect, there are no external dependencies. You immediately have all the tools you need to start using Connect.

This basic CA is fully functional, and the way it works is that the clients in the distributed Consul cluster actually generate the public and private keys locally on their machine. So we distribute the compute requirement of generating all these certificates across your cluster. After generating the keys, they send a CSR over to the server; the server receives this signing request and then returns a signed certificate. The server itself is the only thing that ever has the private key for the root or intermediate certificates, and likewise, the client is the only thing that ever has the private key for the actual leaf certificate. The secret material is distributed through your cluster, so that no single server has access to everything.

Like I said, we have pluggable CA providers, so another provider that works immediately with launch is, of course, Vault. Vault is a way to have a lot more configuration, a lot more control, a lot more policy, over how your certificates are stored and generated. In this scenario, instead of using the built-in CA, what Consul would do is receive the CSR, forward it over to Vault. Vault performs the signing operation, sends the certificate back, and Consul sends it back to the client. In this world, Vault has the private key for the root, and Consul never gets to see that. Vault has all the secret material. The server is just the pass-through for this API. It's important to note the clients always use the same API with Consul, so no matter what CA provider you actually use, the APIs to request root certificates, sign new leaf certificates, etc. are always the same. The server itself is actually the only thing that's communicating to different providers in the back.
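
As a rough sketch, choosing the Vault provider happens in the servers' Connect configuration, along these lines (the Vault address and token are placeholders, and field names may differ between Consul versions):

    # Server configuration (HCL): use Vault as the Connect CA provider
    connect {
      enabled     = true
      ca_provider = "vault"
      ca_config {
        address               = "https://vault.example.com:8200"
        token                 = "<vault-token>"
        root_pki_path         = "connect_root"
        intermediate_pki_path = "connect_intermediate"
      }
    }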

A really fun property is that because we can do automatic rotation, you could switch between your CA providers anytime you want. You could get started with the built-in CA provider, get going, and then at any time switch to Vault, and we automatically manage the rotation from an old provider and a completely different root certificate to a new provider and a new root certificate. Generally, even with hundreds or thousands of services, this happens so fast throughout your cluster, you don't even believe the rotation happened, but it did.

Another important note is the format of the certificates. The certificates themselves are, of course, just standard X.509 certificates—the same TLS certificates you're generally used to—but one thing we've done is we've adopted the SPIFFE specification for identity. SPIFFE is a specification published by the CNCF for encoding identity within a certificate. There are a number of SPIFFE-compatible systems out there. By adopting the SPIFFE specification for identity, one of the things we've gained with Connect is interoperability with other systems. Because the certificates that we generate and the certificates that we accept are SPIFFE-based, that means that if you have an external system that uses SPIFFE, your Consul services could connect to those SPIFFE services over Connect and it works. You just lose some of the authorization from the intentions of Consul.
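
Concretely, the identity is carried as a SPIFFE ID in the certificate's URI SAN, roughly of this shape (the trust domain, datacenter, and service name are illustrative, and the exact path segments may differ between Consul versions):

    spiffe://<trust-domain>.consul/ns/default/dc/dc1/svc/web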

And the reverse also works. We could accept connections from SPIFFE-compatible systems into Consul and then keep the same TLS certificates, keep the same identity, keep the same encryption, but on the receiving case, we could also validate that with our service access graph and authorize the connection. This is really, really important for interoperability. So that is a certificate authority.

Pluggable data plane

The next important thing is the pluggable data plane. In any service mesh solution, it's really important to understand the difference between the control plane and the data plane. The control plane is generally responsible for configuring and defining the rules for service connectivity, routing, and so on, and the data plane is the actual thing that's responsible for mediating and controlling active traffic. With Consul Connect as a control plane solution, we're defining intentions, we're distributing the configuration in the form of TLS certificates, and we're handling all this control.

The data plane, however, is pluggable, and Consul doesn't try to do this itself. The way this works is one of two approaches:

  • We support a sidecar proxy approach

  • We also support native integrations directly with Consul.

Both of these approaches use local APIs to access the service graph and the certificates. Because of Consul's proven architecture (that works at scale) of having agents on every machine, we can efficiently update certificates and root certs in the background, out of the hot path, and create subsets of the graph and cache them on every single machine. So for active connections, for getting new TLS certs, everything—it's usually accessing locally cached data. So all those API calls respond in microseconds. The overhead for using Connect on any service is very, very minimal, and both sidecar proxies and native integrations use these APIs.
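
For example, the authorization step for an inbound connection is a single call to the local agent, roughly like this (the field values are placeholders, and the response is abbreviated):

    # Ask the local agent whether the presented client certificate may connect
    # to the local "db" service
    curl -X POST http://127.0.0.1:8500/v1/agent/connect/authorize \
      -d '{
            "Target": "db",
            "ClientCertURI": "spiffe://<trust-domain>.consul/ns/default/dc/dc1/svc/web",
            "ClientCertSerial": "<serial>"
          }'
    # Returns a JSON body with "Authorized" (true/false) and a "Reason" string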

Sidecar proxies

Sidecar proxies are the first approach. They are a way to gain the benefits of Connect without any code modification to your existing applications. Almost any application could immediately start getting the benefits of Connect with zero change whatsoever to the actual binary itself. And the way this works is by putting a proxy next to it that automatically handles the wrapping and unwrapping of TLS connections. This has a minimal performance overhead. It introduces a new hop, of course, but this hop is strictly localhost. And it's very, very quick. And then the API call, like I said, to Consul is all local, so this generally responds in microseconds and you can't notice it. The benefit of the sidecar proxy approach is that it gives operators the most flexibility. You could choose which proxy you want to use, and whether you want to use a proxy at all, and how you want to sort of run that proxy. There's a lot of flexibility in terms of how this could be deployed.

Consul has two approaches to running sidecar proxies: one we call managed and one we call unmanaged. In the managed approach, Consul will actually start and manage the lifecycle of the proxy for you. It'll send configuration to it, it'll make sure it remains alive, it'll make sure it's listening on the right ports, that the catalog is always up to date, that it's healthy. It sets up health checks automatically. The managed approach is really, really easy.

In the unmanaged approach, it's just like running any other application or service in your infrastructure. The operator is responsible for starting the proxy, registering it with Consul, and configuring it. It's a first-class proxy like anything else; it's just that Consul isn't managing it for you. There's an additional security benefit to using the managed approach: when you use a managed proxy, Consul uses a special ACL token, so that the proxy can only access read-only data related to the application it's proxying. That's not currently possible with unmanaged proxies.
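
As a sketch, starting an unmanaged sidecar proxy for a local service might look like this (addresses and ports are illustrative; check the flags against your Consul version's documentation):

    # Represent the local "web" service: accept inbound Connect (mutual TLS)
    # traffic on 10.0.0.1:8443, forward it to the application on 127.0.0.1:8080,
    # and register the proxy with the local agent
    consul connect proxy \
      -service web \
      -service-addr 127.0.0.1:8080 \
      -listen 10.0.0.1:8443 \
      -register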

Visually, the way this looks is like this (~36:20). If the dotted lines represent a host and we have the Consul client and an application, and our application wants to talk to another application on another host, we would start a proxy—managed or unmanaged—alongside the application, representing that individual application. It would receive its configuration—its TLS certs, the port to listen on, etc.—from Consul directly. And when the application wants to create a connection, it connects over plain TCP to a loopback address on the machine—not over an untrusted network, over a trusted local network. That proxy then uses Consul's service discovery feature that we've had for years to find the other Connect-enabled application. So this will connect to another proxy and, as you'll see in a second, it could also just connect directly to a natively integrated application—there might not be another proxy on the destination side. This proxy then connects back to the application.

In this model, the application itself doesn't need to even be aware of TLS at all. It just needs to be able to make basic TCP connections and you get the full benefit of Connect right away.

The proxies themselves are completely pluggable. Consul exposes an API that proxies can integrate with to immediately start using Connect, and it's just one API they really have to integrate with. So this makes it really easy to add new proxies. With the release of Connect, we're also releasing integration with Envoy, thanks to our partners at Solo and their product Gloo. This enables you to use Envoy wherever you want to get higher level features like telemetry, Layer 7 routing, etc. that perhaps other proxies won't offer.

I should note at this point, it's also important that we're shipping with a proxy built into Consul. It's a basic proxy. It doesn't have a lot of features, but it will get you connections right out of the box. This is really important because again, when you deploy Consul and enable Connect, there are no other dependencies you need. Consul is the only system you need to get automatic TLS connections between any two services. But you always have the option to use other proxies and these proxies could change on a service by service basis. Most applications aren't that performance sensitive, so using the built-in proxy is totally fine. But for certain applications that are either performance-sensitive or have the need for higher level features like telemetry, tracing, routing, etc., you could deploy something like Envoy next to those. Some applications require things like Nginx, or HAProxy, or other very specific data plane solutions, and those could also integrate with Connect, and you could use those. So your whole infrastructure could be totally heterogeneous. You could use the right tool for the job, and you could always lean back on the basic built-in proxy for the easiest operational simplicity there is.

Native integration

The other option you always have is to natively integrate. This is relatively unique to Connect. With native integration, because we're basically just doing standard TLS—it's standard TLS with one extra API call to Consul to authorize a connection—almost any application can very easily integrate with Connect without the overhead of the proxy. This introduces a basically negligible performance overhead. It's standard TLS, and the one API call generally responds in one or two microseconds, so it's not noticeable during the handshake. We recommend this really only for applications that have really strict performance requirements, because it does require code modification, and so it's a pretty expensive process to actually get deployed out there. What we recommend instead is just starting with the proxies, seeing how it works, and if you need that extra performance, integrating natively. And of course, natively integrated things and proxies, in terms of Connect, are indistinguishable, so they can all connect to each other.

So those are the three components that build up Connect and they're all really important and we think we've exposed them in a really elegant and easy-to-use package.

Upgrading to Consul Connect

But of course, the whole thing holistically also has to be easy to use and operate, and we think we've done that as well. When you upgrade Consul, to get Connect you add three lines to your servers and restart them one by one. After adding these three lines, Connect is ready to go on your entire cluster. The clients themselves, besides upgrading, don't need any configuration changes. And when you register a service, by adding one extra line to your service definition—no code modification—you can request that Consul manages a proxy, starts it up, chooses a dynamic port, and registers it with the catalog. So by putting this one line of JSON in your service definition, for the PostgreSQL service in this case, your Postgres database is now ready to accept identity-based encrypted connections just by reloading Consul.
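
Concretely, that upgrade looks roughly like the following. Note that the managed-proxy service-definition syntax shown here matches the Consul 1.2 era described in this keynote and has changed in later releases, so check the documentation for your version:

    # Added to each server's configuration, then restart the servers one by one
    connect {
      enabled = true
    }

    # postgresql.json -- service definition with the one extra "connect" line
    {
      "service": {
        "name": "postgresql",
        "port": 5432,
        "connect": { "proxy": {} }
      }
    }

    # Reload the agent to register the service and its managed proxy
    consul reload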

Another principal challenge of these sorts of solutions is that, in an ideal world, the only exposed listener in all these things is the TLS listener that requires strict identity. But a challenge with that is developer and operator access: human-oriented connections. In the example of the PostgreSQL database, what if I, as an operator, need to open a psql shell into Postgres to perform some analytics? It's kind of tricky to get a TLS certificate, connect to the right port, all those sorts of things. So we've thought about this as well.

Consul ships with a command that is easy to run, which you could run directly on your laptop, and it lets you masquerade as any service you have permission for. So in this case, we're running a proxy, we're representing the service web, you need to have the right ACL token to be able to do that, and we're registering an upstream of Postgres on the local port 8181. This will automatically use Consul service discovery to find the right proxy or natively integrated application and expose it locally on port 8181. So then I just open a normal, unencrypted, plain TCP psql shell to localhost on port 8181. And now this connection is actually happening over mutual TLS, fully encrypted, authorized, directly into your datacenter. It's super, super easy to get connections to anything, even if they only expose Connect. And that's what we recommend. Thank you.
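
The commands for that workflow look roughly like this (the upstream name and database user are illustrative):

    # On a laptop: masquerade as the "web" service and expose the postgresql
    # upstream on local port 8181 (requires an ACL token authorized for web)
    consul connect proxy -service web -upstream postgresql:8181

    # Then connect with a plain local client; the proxy carries the connection
    # over mutual TLS into the datacenter
    psql -h 127.0.0.1 -p 8181 -U postgres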
