Multi-Cloud Service Networking for Humans
Jul 17, 2019
Paul Banks gives us an explanation of why your teams would want a service mesh, in plain English, and then demos a few new features of HashiCorp Consul 1.6, which now joins the ranks of full-fledged service meshes.
In this talk, Paul Banks, the engineering lead for HashiCorp Consul, explains service mesh from a few different helpful lenses:
- As a pattern
- How it maps to your organization's teams
- How it maps to your infrastructure team
- How it provides network connectivity as a service to developers
Service mesh not only delivers a more decoupled, self-service way to provide standardized network connectivity throughout your organization, it also provides a more granular way to enable security and advanced progressive delivery options (e.g. canary releases, region-specific releases). These are all provided through sidecar proxies that attach to services, so none of this is something you have to bake into your applications.
Consul is a service networking solution that can give organizations these service mesh features or simply provide DNS, dynamic firewall updates, or dynamic load balancer updates.
Consul runs wherever your workloads are, whichever scheduler you're using, whichever platform you're on, whichever cloud provider, and it's multi-datacenter right out of the box.
This presentation will give you a straightforward view of the benefits of service mesh and then demo two features of the newly released beta of Consul 1.6.
Software Engineer, HashiCorp
Hi, I'm Paul. I'm an engineer on the Consul team, and I want to start today with a Venn diagram. That's the wrong slide. Here's the Venn diagram. I was keeping you on your toes. I want to set the scene for some of the terminology we're going to use. We talk about Consul as a service networking tool, and we just saw in the keynote that what we mean by that is that it covers this range of problems with service networking.
We can do service discovery, just through DNS. We can integrate with load balancers, and with applications through our API. Or we can go with a full-fledged service mesh, for these more advanced situations.
I'm going to be talking about service mesh today, but I want to just make it clear that this is kind of one advanced way that you can manage your services, and that for Consul, it's just one of our first-class solutions. I want to think today about service mesh from an organizational perspective. That's the "for Humans" bit in the title, and instead of just jumping straight into the technical features, I want to think about who it's for, and why it came to be.
Earlier this year, James Governor from RedMonk gave a talk at QCon in London, and he introduced a whole new buzzword to our industry, which is always fun, and I'm going to talk a little bit more about that later. But his motivation for that whole talk, if you watch it, is that he felt like there was a lot of hype for the last year or so about service mesh, about the cool things you can do with one, and not a lot of very clear explanations of why you would actually want to use one in the first place, so hopefully we're going to cover some of that.
As I said, we need to understand teams, and organizations, to really understand service mesh, so I'm going to spend the first little bit talking about that, and then the second part, I'm going to show you some demos of some of those cool things we just saw from Mitchell and Armon.
I don't know if you recognize this. In the 60s, Melvin Conway wrote a paper, and it had this assertion in, which has later come to be known as Conway's Law. And it's a little bit of a mouthful, if you try and say that sentence, but it boils down to:
The structure of your organization is going to fundamentally impact the design of your systems, your products, your software, and also the way you run your infrastructure. And it becomes obvious why this is the case, when you consider communication overhead. When you try and scale up a group of people working on a task, as you add more people, if they need to coordinate and have everyone on the same page, and everyone making decisions together, then it gets exponentially more expensive as you add people to that group.
A lot of our organizational structures are built around this idea that we need to keep decisions made in smaller groups, and we need efficient communication for the paths that really matter within the organization. This is what gives rise to the fact that your software modules, the systems you build, they have to align with the structure. Otherwise, they get very hard to work on.
» Service mesh as an architectural & organizational pattern
We'll come back to Conway's Law in a second, but today I want to introduce a distinction between service mesh, as a bag of features and as a cool new technology, and service mesh as a pattern. We talk about the features, things like dynamically routing requests, things like service identity-based security, and observability, but none of those things on their own are truly unique to service mesh. You can implement all of those in other ways, and there might be good reasons to do so.
I think it's important to think about service mesh, at its core, as an architectural pattern, and let's understand why that comes about. Let's imagine a company that ships software services. As the company grows, you end up with multiple development teams, who all need to ship independently. They're shipping different services, different products, perhaps, but it doesn't make sense if all of those teams, tens or hundreds of teams, are just reinventing all of the basic machinery around deploying software: CI and CD, infrastructure provisioning, how they manage their security, and how they manage their service connectivity.
And so what... The natural solution is to have this separate infrastructure team. And the role, and the output of this team is going to vary, depending on the nature of your organization. For some organizations, they may produce tooling, they may produce best practices, they may embed as consultants on development teams. In others, it may be that they have a full, managed platform as a service, and the development teams just kind of deploy straight into that.
However it is, the crucial point is that they do work that can be consumed by development teams. They can avoid this duplication of effort, and they can build infrastructure that's run as a service to developers, so the development teams can self serve, and be responsible for their own application lifecycle.
The service mesh pattern really exists to decouple the concerns of these different groups of humans. It's kind of like Conway's Law applied to infrastructure. We shape our infrastructure design around the teams, and the concerns of the teams that need to operate it. The pattern is all about separating these concerns of reliable, controllable service communication, from the application code and configuration itself.
This decoupling happens in practice, because we deploy service mesh as a sidecar proxy, as a separate process, that sits alongside the application process, and that means that the lifecycle of that proxy, the components of the service mesh, they can be managed, and configured, and upgraded by an infrastructure team, without coordinating that with application teams. So, without coordinating new builds or new rollouts.
The other key point is that service mesh brings this central API for controlling your connectivity, and that allows your infrastructure teams to build tooling on top of the network, so that they can build things like sophisticated rollout mechanisms, automate incident responses, and things like that, that can then be provided as a service to applications, without affecting the application code.
Another big concern that most larger organizations have is heterogeneity in the infrastructure layer. And we heard that in the keynote a little bit with multi- everything, and we'll come back to that in a second, but another example that comes up here is that many organizations will have different application teams, different service teams, working in different technology stacks. Hopefully it's not quite as bad as this slide, otherwise you're in for a fun time, but the idea with service mesh is that it separates the concerns of that reliable communication away from the application team, so you don't have to re-implement controlled retries and things like that in all these different languages, and you are separating the application developer choices from the infrastructure capability.
» What are the features of a service mesh?
We've seen how service mesh, as an overall architectural pattern, aligns with the needs of your organization, so I want to look a little bit more at the features, but still looking at how they service needs within your teams. These are roughly the high-level features you get from a service mesh:
- Dynamic request routing
- Service identity and authorization
- Resiliency patterns (patterns for communicating in a reliable way between services, things like circuit breakers, and rate limits, and retries)
- Consistent observability across all of your services
But these are all technical things. Why are they useful in organizations?
» What are the features of progressive delivery?
Well, here we come back to James Governor's new buzzword. He's actually here today, so I'd love to talk about this some more with him. So, progressive delivery is this umbrella term for a bunch of practices that have been around quite a while. I imagine the vast majority of the audience here, who are practitioners, have done at least one of these things before:
- Canarying a release
- Feature flags to control who gets to see what in production
- Deploying to different regions at different times
- Opt-in public betas
The overall idea is that large organizations can't just deploy all their changes everywhere, as soon as they hit green, and it's not because they're not DevOps enough, it's not because they don't have CD technologies. It's because they need to limit the risk. They need to have managed rollouts with canaries. They need to do phased releases to different user groups. They need to experiment. Maybe the algorithm needs to be tested in production, to see if it's actually performing as well as the old one before it gets rolled out everywhere. You need to do regional rollouts. All of these things are kind of the norm for most large organizations.
» Dynamic request routing
The common theme, or some common patterns behind all of these is they all need the infrastructure, some infrastructure support. You need to, at a minimum, be able to dynamically configure how your traffic is flowing through your system. Note, this doesn't mean you need a service mesh. Many companies have been doing this stuff by dynamically configuring their edge load balancers and similar, but you need dynamic configurability in your infrastructure. You need consistent observability, because there's no point doing a canary release, if you can't actually tell if the canary's doing any better than the baseline and so on, for A/B tests and similar.
But more importantly than all this, you need a central API to manage this stuff, so that you can actually build tooling, and make this the default path for rolling out code for your organization. If you have tens or hundreds of development teams, and they're releasing every day, or maybe multiple times a day, you can't expect them to be going through manual, laborious, canary rollout processes every time they do that, unless you have tooling available for that.
In the last week or so, a paper crossed my path, from Netflix, I think, about automating some of their chaos testing, and if you read that paper, it's fascinating. These are all ingredients they need. They need to be able to dynamically route traffic. They need to be able to automate the changes between those. There was a similar blog post from LinkedIn, about automating load testing within their infrastructure, and all the same concerns come out there. These are all potentially under this umbrella of progressive delivery.
» Service identity and authorization
Service mesh enables these ideas, these techniques, but it enables your infrastructure team to build this out as a service for your application teams. They are really important, but I think these are only a part of the use case for service mesh, so I'm going to expand on progressive delivery. We also want to think about security, so we see these big breaches of major companies rather too often, and it drives home this truth that the perimeter is not secure enough. We can't just rely on people not getting in that front door. But trying to enforce very fine-grained access policy at the IP level, in the network layer, doesn't scale well as your workloads get more dynamic.
The IP address is just really not the right unit of addressability, or the right unit of identity, to be able to control these dynamic, containerized workloads. But as you move that security concern up into the service mesh layer, and you use mutual TLS instead of IP firewalls, it becomes feasible to manage encryption, and authorization, with restrictive access policies everywhere, all the time, for all of your services.
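To make the contrast with IP firewalls concrete, here is a minimal sketch of an access policy keyed on service identity instead of IP address. It is a toy model, not Consul's implementation: the `Rule` type, the `is_allowed` helper, the first-match ordering, and the service names are all assumptions made for illustration. In a real mesh the caller's identity would come from its mutual TLS certificate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    source: str       # calling service identity; "*" matches any service
    destination: str  # target service identity; "*" matches any service
    allow: bool


def is_allowed(rules, source, destination, default_allow=False):
    """Return the decision of the first matching rule, else the default."""
    for rule in rules:
        source_matches = rule.source in (source, "*")
        destination_matches = rule.destination in (destination, "*")
        if source_matches and destination_matches:
            return rule.allow
    return default_allow


# Hypothetical policy: only "web" may call "api"; everything else is denied.
RULES = [
    Rule("web", "api", True),
    Rule("*", "api", False),
]
```

Because the rules name services and not addresses, they stay valid when instances are rescheduled and their IPs change.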
» Resiliency patterns
Businesses also need to find ways to continually improve the reliability of their service, which is especially challenging, because they also need to be shipping things faster, and I've not yet seen a business who are reducing the complexity of their offering. We're always building new services, and getting more complex, so how do we keep things more reliable?
To do so, we need all of these best practices that have grown up in companies that have been doing this service thing for decades now, around short timeouts, limited retries, circuit breakers, rate limits, all these things that help you protect against cascading failures, so when one of your services gets slow, does it take out everything in your entire product?
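As a rough illustration of two of these practices, the sketch below combines limited retries with a very simple circuit breaker. It is an assumption-laden toy: the class and parameter names are hypothetical, timeouts are left out, and a real breaker would also reset after a cool-down period. In a service mesh this logic lives in the sidecar proxy, not in application code.

```python
class CircuitOpenError(Exception):
    """Raised when the breaker refuses to call the upstream at all."""


class CircuitBreaker:
    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self.failures = 0  # consecutive failures seen so far

    def call(self, func, retries=1):
        """Call func with a bounded number of retries, failing fast once open."""
        if self.failures >= self.max_failures:
            raise CircuitOpenError("circuit open; failing fast")
        last_error = None
        for _ in range(retries + 1):
            try:
                result = func()
            except Exception as err:
                self.failures += 1
                last_error = err
                if self.failures >= self.max_failures:
                    break  # stop retrying once the breaker trips
            else:
                self.failures = 0  # any success closes the circuit again
                return result
        raise last_error
```

Once the breaker is open, callers get an immediate error and the struggling upstream receives no further traffic, which is what limits cascading failures.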
Service mesh can provide those features, but again, the important thing is that they can be centrally managed, and with that comes the ability for your infrastructure teams to define policy around how you set up your services. Your infrastructure team maybe could set a policy that you're not allowed to deploy a service in production that doesn't have rate limits set, that doesn't have short timeouts for all of its upstreams, and so on.
» Consistent observability across all of your services
And finally, observability. As all this complexity increases as we've been talking about, we find ourselves needing to understand production in a deeper way. It's no longer okay to just know about the logs, or the metrics for the one service we might own as an application team. We need to have a holistic view of all the services, to be able to understand production issues, to be able to debug, to be able to optimize things. Service mesh can help with this, because you get consistent metrics that are measured in the same way, by the same tooling, across all of your services. You can also enable tracing, for seeing distributed traces of all the requests through your infrastructure.
But even more than that, we're starting to see control systems that can automate things like canary rollouts, things like auto scaling, and I think we'll see more of this in the future. Maybe automated incident response.
But all these kinds of innovations need at least two things. They need input and output. They're control systems. They need consistent data about exactly what's going on. Who's talking to what? How's it performing? And a service mesh, through its observability, gives you that. But then they also need to complete that feedback loop. If they're going to actually help automate the change that they calculate, they need this centralized API and control plane, to go and make the changes.
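The feedback loop described here can be sketched as a single decision function: the input is an observed error rate from the mesh's metrics, and the output is the traffic percentage a tool would then write back through the control plane API. The function name, the step schedule, and the 1% error threshold are hypothetical choices for illustration.

```python
def next_canary_weight(current, error_rate, max_error_rate=0.01,
                       steps=(5, 25, 50, 75, 100)):
    """Decide the next canary traffic percentage from observed metrics."""
    if error_rate > max_error_rate:
        return 0  # roll back: send all traffic to the stable version
    for step in steps:
        if step > current:
            return step  # healthy so far: advance to the next step
    return current  # already fully rolled out
```

A real controller would wrap this in a loop that waits for enough traffic at each step before reading the metrics again.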
» Demo: Consul 1.6
So, that's my perspective, my 'why service mesh' talk. I want to now go through some demos of some of these features we've looked at. First up, we're going to look at some dynamic traffic routing. Now, Mitchell has already talked through some of the features here, but just to give us a high-level overview...
Dynamic routing encompasses a wide range of use cases and possible configurations and features. Just as some further examples, we can do HTTP, or Layer 7-based request routing. So based on things like the path, or the verb of the HTTP request. This can be really useful for migrating from a larger service to a different service, but many other possible use cases.
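As a sketch of what path-based routing configuration might look like, the helper below builds a `service-router` config entry as a Python dictionary. The field names reflect the Consul 1.6 config entry format as best understood here and should be checked against the Consul documentation; the helper name and the service names are hypothetical.

```python
import json


def path_prefix_router(service, prefix, destination):
    """Build a config entry that routes one path prefix to another service."""
    return {
        "Kind": "service-router",
        "Name": service,
        "Routes": [
            {
                "Match": {"HTTP": {"PathPrefix": prefix}},
                "Destination": {"Service": destination},
            }
        ],
    }


# Hypothetical migration: requests to web under /admin go to a new service.
print(json.dumps(path_prefix_router("web", "/admin", "admin"), indent=2))
```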
We can split traffic between different services, different subsets of services, potentially different regions, and the kind of canonical example of this is to do a canary deployment, and we're going to have a look at that in a second.
Finally, we now allow you to have custom logic for how you resolve the set of instances you want to return, and that can be by filtering them down using filters, that can be by failing them over, when they get unhealthy, to another datacenter. You can just flat out redirect one service to a different one, which can help you with renaming, with migrations, things like that. And all of these happen without you needing to change all of your client configuration. You don't need to change the DNS names your clients are hitting. You don't need to change the configuration of your apps. You can just do this centrally, and have your traffic move.
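Two of these resolution behaviors, failover and redirect, might be expressed as `service-resolver` config entries along the following lines. This is a sketch under assumptions: the field names follow the Consul 1.6 config entry format as understood here and should be verified against the documentation, and the helper names, service names, and datacenter names are hypothetical.

```python
def failover_resolver(service, datacenters):
    """Fail over every subset of a service to the given datacenters, in order."""
    return {
        "Kind": "service-resolver",
        "Name": service,
        "Failover": {"*": {"Datacenters": list(datacenters)}},
    }


def redirect_resolver(old_name, new_name):
    """Send all traffic addressed to old_name to new_name instead."""
    return {
        "Kind": "service-resolver",
        "Name": old_name,
        "Redirect": {"Service": new_name},
    }
```

In both cases the clients keep addressing the same service name; only the centrally stored entry changes.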
» Canarying in Consul Connect
So, we've already talked a little bit about canary deploys today, and this is the demo I'm going to walk through now. Before we see the real thing, I just want to make sure we're all clear about what's going on, so we're going to have a web service running, and it's going to be initially directing 100% of its traffic. It's going to be making requests to this backend API service, which initially is just going to be sending to version 1, and then we're going to dynamically change that, to send some traffic to version 2, as well, and we'll see how version 2 performs. We'll watch what's going on using Grafana, to visualize the metrics.
But I'll quickly show the configuration for this. Mitchell showed an example earlier, but it's pretty simple. You can manage these using HCL files, and submitting them via the command line. In this case, I'm just showing you the raw JSON, and how you would just curl it, if you wanted to do that, just to show this is a really simple API for you to integrate tooling against.
In this example, we are splitting by the weights for the two different legs of the split. You can have more than two. These are expressed in percentages, but they're floating point, you can use up to .01% resolution. Each leg is pointing at a different service subset, which Mitchell showed earlier. Think of things like named filters. You can assign a name, like V1, and that gets applied to a filter, like service meta version equals 1, so that's what these are doing here.
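The slide itself is not reproduced in this transcript, so the following is a reconstruction sketch and not the speaker's exact payload. It builds a two-leg `service-splitter` entry and the `service-resolver` entry that defines the `v1` and `v2` subsets. Field names follow the Consul 1.6 config entry format as understood here; the helper names and the `api` service name are assumptions.

```python
def canary_split(service, canary_percent):
    """Build a two-leg split; weights are percentages with 0.01 resolution."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    canary = round(canary_percent, 2)
    stable = round(100 - canary, 2)
    return {
        "Kind": "service-splitter",
        "Name": service,
        "Splits": [
            {"Weight": stable, "ServiceSubset": "v1"},
            {"Weight": canary, "ServiceSubset": "v2"},
        ],
    }


def version_subsets(service):
    """Name two subsets by filtering on the service's version metadata."""
    return {
        "Kind": "service-resolver",
        "Name": service,
        "Subsets": {
            "v1": {"Filter": "Service.Meta.version == 1"},
            "v2": {"Filter": "Service.Meta.version == 2"},
        },
    }
```

Each dictionary could be serialized with `json.dumps` and sent in a `PUT` request to the `/v1/config` endpoint of the Consul HTTP API, which is the kind of call a rollout tool would make.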
I could just demo this by curling. Showing you this, doing some curl, and we'll reconfigure, and we'll see it, but I really wanted a more visual way to show this splitting in action, and how dynamic it is, so I thought a little bit about how I could do that. I need an input that's kind of visual and dynamic, and then some inspiration hit me, and I'm really happy to announce, ladies and gentlemen, that I finally found the use for the touch bar on my Mac. This is... That gets the clap.
This is of course a toy. This is not a new HashiCorp product announcement. Don't go tweeting about touch bar ops, or anything like that, or maybe do. I don't know. No, let's not. But let's see this in action, shall we?
So, this is our Grafana dashboard. You can see at the top, this is the website service. It's receiving a pretty steady, roughly 100 requests a second, and so far, things are looking good. They're all green. We've got HTTP 200s across the board. Now, as we scroll down the dashboard, we've got some metrics down here. This is for the backend, for the API service, and on the left you can see version 1 is getting all the traffic right now, and version 2 is there on the right. The first thing we're going to do is just move over a small amount of traffic. We'll go for about 5% first.
So, we're going to invoke touch bar cam, so that we can see this. Here it comes. That's it. Get the app up, and then we're going to move that to around roughly 5%, and I don't know if you can see, but as soon as they change color there, that's the API request is done to Consul, and Consul's reconfigured all the proxies. Now, it takes a couple of seconds for the traffic to take effect, but you can see it has there. This is actually real time. We were tempted to shift that video a little bit, to make it look even quicker, but I wanted to keep it real time to make the point that Consul updates the proxies within milliseconds, and all the delay you see here is in Grafana, and in Prometheus, and in Envoy itself, doing a graceful reload of its config.
So, 5% is looking pretty good. We've got all green, so probably you want to take your canary deploys a little bit slower than this in production, but we're going to just go ahead and increase that to 50% here, and as soon as that gets fired off, Consul is rebuilding that discovery chain. It's recompiling all of their dependencies. It's figuring out how to configure all the proxies, and then it's pushing that down into proxies, and so you can see it takes about 15, 20 seconds in this case for Grafana to start seeing that change in traffic, but we can see there, 50% of traffic is now going to version 2.
Now, this is still looking pretty good. We've got 100% success rate so far. Our version 2 seems to be holding up. We're not too worried. Before we roll it all the way out, though, we're going to just try it at about 75%, just to see how it's handling the load. And you can see as we increase that, I think it was about 78. Suddenly we're hitting some errors, here. We'll just give it a second, to see if these are transient, because maybe something just glitched when we happened to do that. We'll give it a couple more seconds, but we can see these look pretty consistent. It seems like our new service has some kind of scalability limit, and we don't want to expose our customers, so we roll that right back to 0%, and within a few seconds we're back to just version 1. All the errors have gone away. Our customers are happy, and we can go and debug our rollout, what happened for version 2, from our logs and our metrics. That's canarying, with Consul Connect.
» Mesh Gateways
So, the other big feature we've talked about a lot this morning already is mesh gateways, and I'm going to demonstrate those in action, too. How do I visualize that? Well, we'll see. We mentioned earlier about heterogeneity being a big problem in businesses, and that means multiple cloud providers, multiple runtime platforms, and as Mitchell explained already this morning, these bring the problem of disconnected networks. Whether that's separate subnets, or separate VPCs, or separate cloud networks entirely, that are separated by WAN links, or whether that's just overlay and underlay networks, between Kubernetes, and between other things, we see a lot of cases coming up where just getting basic connectivity is a stumbling block for people to really be able to use these service mesh technologies in practice.
Mesh gateways solve this problem at this service mesh layer, without needing to add a bunch more complexity into our network, to manage things like overlapping IP addresses and so on. They behave like traditional network routers, but they're operating at the TLS layer, not at the IP layer, and as we saw, they're using server name indication, or SNI header, in the TLS handshake, to decide where to route the connection. But then after that, the payload is completely opaque.
All that's required is for the gateways themselves to have a publicly routable IP, or at least an IP that's routable from the other networks that you need to connect from. And because they're not actually terminating TLS themselves, they don't have keys. They actually, physically can't decrypt what's going through them, and we'll see that in real life in a second.
So, to show this off, I'm going to do multi-platform, multi-cloud. We're going to have one service running on AKS, Azure Kubernetes Service, which is on an overlay network, in an Azure VPC, in America, I think. And our backend workload is going to be running on Nomad, in a Google VM cluster, in Europe. Because why not? And we'll look at this.
The spoiler here, folks, is you've already seen this demo. I don't know if anyone was eagle eyed enough to spot, but up there, top left, the API service is running on Nomad in Google. And then in a second, I'll scroll back up, because you may not have noticed at the start, but that web service is running on Kubernetes in Azure. And of course, we could have just faked this Grafana dashboard, that's pretty easy, so we'll just click through and show that this was done for real. We've got the Kubernetes UI in Azure, and you can see in here we've got the web service running there.
And then after this, we'll just click through onto the Consul UI, and you can see on the top left, this is the Azure data center in our Consul cluster, and you can see it's running web, it's not running API, it's running a mesh gateway, as well. Then we can flip over and look at the Nomad UI, and see two versions. We've got a V1 job and a V2 job for API, and we've got a gateway running in Nomad, and this was just listening on a public IP of a Google VM. That's all we set up there.
And then this is the Google data center in Consul, and you can see the things. This is root on both of those gateways. On the left, you've got the egress gateway in Azure, and on the right, you've got the ingress gateway in Google. And we're just using ngrep to look at the actual packets flowing through this gateway.
And you can see that these are encrypted. We don't get anything, even when we're root on the box, looking at the traffic. In fact, the only things that are visible here—we just stopped the ngrep on the left—if we search for it, we can see the only thing here in plain text is this SNI header. And this is how the gateway knows how to route the packets, so this is a new connection being made from Azure, saying, "I need V1 of API, in the default namespace, in Google, internal is a detail," and the rest of it is a trust domain for the specific Consul cluster that we're running in.
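To illustrate how a gateway can route on the SNI value alone, here is a minimal parser for a name laid out the way the speaker reads it aloud: subset, service, namespace, datacenter, the literal `internal`, then the trust domain. The exact layout is an assumption based on that description (real names may, for instance, omit the subset), and the `Target` type, `parse_sni` helper, and example trust domain are hypothetical.

```python
from typing import NamedTuple


class Target(NamedTuple):
    subset: str
    service: str
    namespace: str
    datacenter: str
    trust_domain: str


def parse_sni(sni):
    """Split an SNI name into routing fields without touching the payload."""
    labels = sni.split(".", 5)  # the trust domain may itself contain dots
    if len(labels) != 6 or labels[4] != "internal":
        raise ValueError(f"unexpected SNI layout: {sni!r}")
    subset, service, namespace, datacenter, _, trust_domain = labels
    return Target(subset, service, namespace, datacenter, trust_domain)


# Placeholder trust domain; a real one identifies a specific Consul cluster.
target = parse_sni("v1.api.default.google.internal.example-trust-domain.consul")
```

Everything the gateway needs for its routing decision is in this one plaintext field, which is why it never has to hold keys or decrypt the connection.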
So, that's mesh gateways, fully end-to-end encrypted, running between multiple clouds, and multiple platforms, in Consul Connect.
» Consul: Built for the real world
I want to sum up what we've talked about. We've talked about the service mesh as a pattern, how it maps to your organizational teams, how it maps to your infrastructure team, providing network connectivity as a service to developers. We've talked about how features can enable security, can enable advanced delivery options. And again, that's offered as a service. That's not something you have to bake into your applications.
We talked about Consul providing service networking solutions, whether it's just DNS, whether it's dynamic firewall updates, dynamic load balancer updates, all the way up to this service mesh sophistication, as and when you need it as an organization. But you get to share the technology. You have the same operational knowledge, that you can carry through from the day that you just need some basic DNS discovery, to the day that you want to do canary deploys across multiple cloud regions.
And it's really important to us that Consul runs wherever your workloads are, whichever scheduler you're using, whichever platform you're on, whichever cloud provider. And then mesh gateways, we've introduced to just try and make this much easier to set up and manage. To get over that initial hurdle of, "How do I get packets from one part of the network to another?"
So, as we've heard already, this is all available in Consul 1.6. We have the first beta available on the website today. Please download it. Try it out. Thank you.