Watch Armon Dadgar explain how Consul integrates Kubernetes workflows in this whiteboard video.
Hello, and welcome. Today, I want to spend a little time talking about Consul and Kubernetes. When we talk about Kubernetes and Consul, I think there's often a bit of confusion around exactly what Consul solves. Doesn't Kubernetes have its own built-in service discovery? How do these two systems work together?
If we start by talking about Kubernetes, then yes, at the core, Kubernetes does include a basic form of service discovery within any given cluster. If I have a single Kubernetes cluster and I have applications A and B, and they're running within that cluster, then they can use Kubernetes' native DNS-based service discovery to find and route to one another within that cluster.
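As a rough illustration of that in-cluster DNS convention, here is a minimal sketch that builds the name application A would resolve to reach B. The function name, the `default` namespace, and the `cluster.local` domain are assumptions for illustration; a cluster can be configured with a different domain.

```python
def k8s_service_dns(service: str, namespace: str = "default",
                    cluster_domain: str = "cluster.local") -> str:
    """Build the in-cluster DNS name for a Kubernetes Service."""
    return f"{service}.{namespace}.svc.{cluster_domain}"


# Application A looking up application B in the same cluster.
print(k8s_service_dns("b"))  # b.default.svc.cluster.local
```

This name only resolves inside the one cluster, which is exactly the limitation the multi-cluster picture runs into.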
Consul starts to play a role once we start to draw a slightly more complicated picture. For a lot of folks, their challenge is they don't have a single Kubernetes cluster. They have multiple.
You might have a different cluster with other applications running on it. Now the challenge is: how do the applications on cluster 1 discover and talk to applications in cluster 2? This becomes one of the first areas where the value of Consul comes into play: how do we do this type of service discovery? Consul can span all of that and act as a single, consistent registry across multiple clusters.
It can sync and integrate with the native registry that Kubernetes has and create a more global visibility of all the different services. That way, applications within a cluster can query Consul to be able to route and talk to services in a different cluster.
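To give a feel for what querying the registry looks like from an application's side, here is a small sketch that turns a service lookup result into a list of addresses. The sample payload is hand-written and trimmed to the fields used; it mimics the shape of a Consul health-endpoint response, and the node names and IPs are made up. No network call is made.

```python
import json

# Hypothetical, trimmed lookup result for service "c": one instance in
# another Kubernetes cluster and one on a VM.
SAMPLE_RESPONSE = json.loads("""
[
  {"Node": {"Node": "cluster-2-node", "Address": "10.2.0.7"},
   "Service": {"Service": "c", "Address": "10.2.4.15", "Port": 8080}},
  {"Node": {"Node": "vm-1", "Address": "10.9.0.3"},
   "Service": {"Service": "c", "Address": "", "Port": 8080}}
]
""")


def endpoints(entries):
    """Return (address, port) pairs, falling back to the node address
    when the service entry does not carry its own address."""
    result = []
    for entry in entries:
        address = entry["Service"]["Address"] or entry["Node"]["Address"]
        result.append((address, entry["Service"]["Port"]))
    return result


print(endpoints(SAMPLE_RESPONSE))
```

The caller gets back every copy of the service regardless of which cluster or platform it runs on.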
Now, this extends beyond multiple Kubernetes clusters. We might have applications that run outside of Kubernetes entirely. For example, you might have a VM-based application that might still be registered with Consul, and we want to enable those applications to interoperate. A might want to discover E or vice versa — E might want to discover A running within Kubernetes.
Then we might also have other platforms. We might be using, for example, ECS (Amazon Elastic Container Service) as a different container-native platform within Amazon. And great, we still have a service discovery challenge of how all these things talk to one another.
The first layer where Consul comes in is acting as a universal service discovery layer. That might be across multiple Kubernetes clusters. It might be between our Kubernetes and our VM-based workloads or multiple container-native platforms such as ECS and Kubernetes.
This becomes the first part: How do all these things discover one another? And we can even connect that service discovery to solutions like API gateways. Consul has native support for an Envoy-based API gateway, but it also integrates with popular solutions like Traefik, NGINX, HAProxy, Kong, etc.
Those API gateways can also similarly query Consul and say, "If this request is coming in and we need to route it to the service, where's that service running? And there are potentially multiple copies of it, so we need to load balance across them."
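As a sketch of the load-balancing half of that, here is a simple round-robin picker over the copies a lookup returned. Round robin is just one possible strategy, chosen here for brevity; the class name and addresses are hypothetical, and real gateways typically offer several algorithms.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hand out service instances in rotation."""

    def __init__(self, instances):
        if not instances:
            raise ValueError("no instances available")
        self._rotation = cycle(list(instances))

    def pick(self):
        """Return the next instance in the rotation."""
        return next(self._rotation)


balancer = RoundRobinBalancer(["10.2.4.15:8080", "10.9.0.3:8080"])
print([balancer.pick() for _ in range(3)])
```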
This first-level challenge is around: how do we create a single universal directory of where all our different services are running? A service registry that allows all our applications to query it and do service discovery. Then we can integrate with things like API gateways to address that discovery challenge. Layer 1 is solving the connectivity challenge across all these different systems.
Layer 2 is where we start getting a little more sophisticated. How do we start to think about service mesh? We start with service discovery and the core catalog first. Then the more sophisticated use cases, as we go deeper, are service mesh.
Now, with service mesh, I think there are multiple benefits we start to get. At the core, what does service mesh even mean? It's a disaggregated pattern. Rather than us trying to push all the traffic through a central set of load balancers, firewalls, and WAFs — and all this network middleware — instead we're going to push a set of software proxies or sidecars to the edge.
Alongside these applications, we are going to run a proxy, for example, Envoy. These proxies are running everywhere. They're alongside these various applications. The problem with having proxies distributed everywhere is that you don't want to have to manage configuration distributed across thousands — or tens of thousands — of nodes.
So, while we're distributing the data plane everywhere with the proxies running everywhere, we're centralizing the control plane and the management metadata to somewhere more central such as Consul. That becomes the heart of service mesh.
The value of doing that is we can now start to define various types of controls. A key one becomes how do we micro-segment our network? Or how do we have explicit security controls around who's allowed to talk to whom?
We might want to define, for example, a central rule that says, "Service A is allowed to talk to service C." We're going to define that rule using the logical identity of the application — in this case, service A and service C.
I don't care what their IP address is, but we've segmented the network. We're saying, great, A can't talk to anything it wants, but it can talk to C. How does that work? We're going to push down a set of certificates to all the applications that are registered with Consul running in service mesh mode.
There'll be a TLS certificate that gets pushed down to all these applications. That certificate follows a common format known as SPIFFE. This provides a universal way to effectively encode the identity of that service — so we know the identity of this service is C within that certificate in a way that can be authenticated peer-to-peer.
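To show what an identity encoded this way can look like, here is a sketch that parses a SPIFFE ID into its parts. It assumes the path layout of namespace, datacenter, and service that Consul documents for service identities; the trust domain and values below are made up.

```python
from urllib.parse import urlsplit


def parse_spiffe_id(spiffe_id: str) -> dict:
    """Split a SPIFFE ID of the assumed form
    spiffe://<trust-domain>/ns/<namespace>/dc/<datacenter>/svc/<service>."""
    parts = urlsplit(spiffe_id)
    if parts.scheme != "spiffe":
        raise ValueError("not a SPIFFE ID")
    segments = parts.path.strip("/").split("/")
    # Pair up key/value path segments: ns/default, dc/dc1, svc/c.
    fields = dict(zip(segments[0::2], segments[1::2]))
    return {
        "trust_domain": parts.netloc,
        "namespace": fields.get("ns"),
        "datacenter": fields.get("dc"),
        "service": fields.get("svc"),
    }


print(parse_spiffe_id("spiffe://example.consul/ns/default/dc/dc1/svc/c"))
```

The point is that the identity is the service name, not an IP address, so it stays stable as instances move.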
We don't have to talk to a central system. When A initiates a connection directly to C, they're going to both present their certificates and establish a Mutual TLS (mTLS) session. This encrypts all the traffic that's moving between these applications but allows C to inspect A's certificate and validate that it is service A. Service A can validate C’s certificate and say, "That is service C." They can check against a logical rule that says, "Should service A and C be allowed to communicate? Yes or no?"
If yes, great. We allow this traffic pattern to take place, and the services can talk. If no, we're going to close that connection and generate an error saying service A and C shouldn't be talking to each other. We're going to reject that.
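The yes-or-no check can be sketched as a lookup keyed on logical service names rather than IP addresses. This is a toy model, not how Consul evaluates its rules: the rule set, the names, and the deny-by-default behavior are all assumptions for illustration.

```python
# Hypothetical rule set: each pair is (source service, destination service).
ALLOWED = {("service-a", "service-c")}


def authorize(source: str, destination: str, rules=ALLOWED) -> str:
    """Accept the connection only if an explicit rule allows it.
    Anything not listed is rejected (assumed default-deny)."""
    return "accept" if (source, destination) in rules else "reject"


print(authorize("service-a", "service-c"))  # accept
print(authorize("service-b", "service-c"))  # reject
```

Because the identities come from the certificates both sides presented, the check needs no knowledge of where A or C is running.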
This becomes the core approach to how we start thinking about segmenting our networks. Oftentimes, we refer to this as micro-segmentation. The reason it's a micro-segment is we're not talking about segmenting on a coarse-grained network boundary.
All these applications might be in the same subnet or the same VPC. But we're now saying it's a very micro segment, even though they might be on the same subnet. In fact, they might be on the same IP address. We're still segmenting to just those services that should be allowed to talk to each other.
B might be on the same machine as A, but it doesn't have any authority to talk to C. That's unlike a traditional network-based approach, where you might segment a whole subnet at a time using a firewall or security group rule. This allows us to segment a little more finely.
As we start getting a little bit fancier — because we're traveling through these proxies and these proxies are smart; they're aware of the protocol, the traffic is flowing through them — we can start to do more interesting things. For example, with Layer 7 management of traffic.
We might want to look and say, "Based on various paths, or various patterns in the request, we're going to allow or deny that request." It starts letting us get more sophisticated. We might also use Layer 7 within an API gateway path to say, "If it's `foo`, that goes to service A, but if it's `bar`, that might go to our traditional VM-based service. In this case, service E."
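A minimal sketch of that path-based routing idea follows. The route table, the `/foo` and `/bar` prefixes, and the service names are illustrative assumptions; real proxies support much richer matching.

```python
# Hypothetical route table: path prefix -> destination service.
ROUTES = [("/foo", "service-a"), ("/bar", "service-e")]


def route(path: str, routes=ROUTES):
    """Return the service for the first matching prefix, or None.
    A prefix matches the exact path or anything beneath it."""
    for prefix, service in routes:
        if path == prefix or path.startswith(prefix + "/"):
            return service
    return None


print(route("/foo/items"))  # service-a
print(route("/bar"))        # service-e
```

Matching on whole path segments keeps a request like `/foobar` from being sent to service A by accident.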
We can start integrating with service mesh some of these Layer 7 constructs — such as path-based routing, more sophisticated policy at that protocol-aware layer — to start doing more interesting things around traffic management, routing, etc.
Then the last piece is how to enable more observability. These proxies, by virtue of sitting in the data path, can collect pretty rich telemetry. We can look at how many requests are being made between A and C. What is the error rate? What is the latency? All that telemetry gets collected by these proxies, and then they can export that to a system, whether that's Datadog or Prometheus or a different APM system.
As a user, you can come in and say, "I want to understand what's my error rate between these services? How much traffic is flowing between all my different applications?" I can see that through my telemetry and monitoring systems because the mesh is enabling me to collect that and have all that observability data.
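To make those questions concrete, here is a sketch that computes request count, error rate, and mean latency between two services from a handful of records. The record shape and the sample numbers are invented for illustration; real proxies export metrics in their own formats, and treating 5xx as the error class is an assumption.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RequestRecord:
    source: str
    destination: str
    status: int
    latency_ms: float


# Made-up sample of traffic a proxy might have observed.
SAMPLE = [
    RequestRecord("a", "c", 200, 10.0),
    RequestRecord("a", "c", 200, 20.0),
    RequestRecord("a", "c", 503, 30.0),
    RequestRecord("a", "c", 200, 20.0),
    RequestRecord("b", "c", 200, 5.0),
]


def summarize(records, source, destination):
    """Summarize traffic from source to destination, or None if there is none."""
    matching = [r for r in records
                if r.source == source and r.destination == destination]
    if not matching:
        return None
    errors = sum(1 for r in matching if r.status >= 500)
    return {
        "count": len(matching),
        "error_rate": errors / len(matching),
        "mean_latency_ms": mean([r.latency_ms for r in matching]),
    }


print(summarize(SAMPLE, "a", "c"))
```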
These become some of the core benefits as we talk about service mesh. It’s thinking about security improvements and manageability around Layer 4 and Layer 7. Obviously, we don't have to only work at Layer 7. We can also do Layer 4 policies. Then observability: can we see it and profile our network and get a better sense of what's happening? Those are some of the core values.
The final piece, as we broaden this to the end-to-end picture of automating a network, is that oftentimes in between some of these applications we still have traditional network middleware. It might be that we have a firewall between service A and service E. E might be a database, for example.
How do we update these things? Often we end up saying our application might be running in Kubernetes. We may be autoscaling it in an automated way. But then we have to file a ticket and wait for someone to manually update the firewall, for example.
This is where Consul integrates with Terraform. We allow you to define with infrastructure as code, using a Terraform template, how you should manage these network appliances. The network appliance might be a firewall, a load balancer, an API gateway, or an underlying network fabric if you're using an SDN or something. With a Terraform module, in an infrastructure-as-code way, we can define what the inputs are and how we should configure the firewall without managing specific IP addresses.
Then we connect that to Consul using what we call Consul-Terraform-Sync. That acts as a bridge between Consul and Terraform.
As service A gets deployed, it might come up with a new IP address, and that gets registered with Consul. We will detect that change and automatically invoke the appropriate Terraform script, which will then update the firewall in an automated way.
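The detect-and-apply loop can be sketched in a few lines. This is only a conceptual model and not how Consul-Terraform-Sync is implemented; the `apply` callback is a hypothetical stand-in for the Terraform run, and the addresses are made up.

```python
def detect_change(previous, current):
    """Return the addresses added and removed between two snapshots."""
    added = sorted(set(current) - set(previous))
    removed = sorted(set(previous) - set(current))
    return added, removed


def sync(previous, current, apply):
    """Call apply with the new address list only when something changed."""
    added, removed = detect_change(previous, current)
    if added or removed:
        apply(sorted(current))
        return True
    return False


# A new instance of service A appears, so the firewall update is triggered.
updates = []
sync(["10.0.0.1"], ["10.0.0.1", "10.0.0.2"], updates.append)
print(updates)
```

When nothing changes, `apply` is never called, so the firewall is only touched when the registry actually moves.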
So, from our app team's perspective, we just deploy and manage our applications. We don't have to care how the network works. That just gets registered and automatically updated rather than the developer having to know, "Oh, I have to file a ticket now to manually go update my firewall or my load balancer or whatever to get traffic."
When we zoom out, what we're trying to enable here with Consul is that focus on the application. The app team should really focus on being able to deploy their app, manage that, and then the network should be automated around them.
Whether that's an API gateway allowing traffic to come in; whether that's east-west, in terms of the policies that allow you to discover and secure your connection to these other services; or whether there's middleware such as load balancers, firewalls, etc., how do those get updated automatically to support the application without having to file a ticket?
This last piece focuses on that network automation. All of this together is how we think about doing service networking end-to-end and where Consul integrates and supports Kubernetes.
Now, if you're asking, "What are the start points? How do users interface with this, deploy this, manage this?" There are a number of different ways.
You can deploy Consul directly onto Kubernetes using a Helm chart. We have native Helm charts to be able to do that. At the same time, we also have what we call the Consul Kubernetes CLI. You can also use that to deploy and manage Consul running on top of Kubernetes.
Then there are a number of ways to interface with it. If users are more comfortable and they want to specify CRDs directly against Kubernetes, they can do that and use native YAML, native CRDs, native Kubernetes workflow to define their policies and controls within Kubernetes — and that will get synced with Consul.
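As a hedged sketch of what such a resource might contain, the snippet below builds a manifest for an "A may talk to C" rule as a Python dictionary and prints it as JSON. The field layout follows my understanding of the `ServiceIntentions` custom resource from the Consul on Kubernetes documentation; treat it as an assumption and check it against the CRD version you run. The service names and the helper function are hypothetical.

```python
import json


def service_intentions(source: str, destination: str) -> dict:
    """Build a manifest allowing source to call destination (assumed schema)."""
    return {
        "apiVersion": "consul.hashicorp.com/v1alpha1",
        "kind": "ServiceIntentions",
        "metadata": {"name": f"{source}-to-{destination}"},
        "spec": {
            "destination": {"name": destination},
            "sources": [{"name": source, "action": "allow"}],
        },
    }


print(json.dumps(service_intentions("service-a", "service-c"), indent=2))
```

The same rule could equally be written by hand as YAML; generating it is only useful if you are already templating your manifests.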
Or, if they want, they can interface directly with Consul as well. That has a number of different endpoints, whether that's a CLI-based approach, API-based approach, infrastructure as code using Terraform, a UI, etc.
There are multiple paths in. For a Kubernetes native user, they might be more comfortable with the CRD. But if you're doing more sophisticated tooling, you might want to use the API or the Terraform providers directly to manage that all.
Hopefully, this gives a sense of how these systems work together. I think it’s exciting when we talk about some of the case studies — folks like Datadog who use Consul, and talk about how they manage that across dozens of Kubernetes clusters.
It's really about enabling the developers to go fast and focus on their application and not have to think about the networking that supports that underneath. Hopefully, this was helpful to learn a little bit more about how Consul and Kubernetes work together.
Thanks so much.