Consul Service Mesh: Deep Dive

Sep 20, 2019

See live, in-depth demos of Consul's key service mesh features.

When migrating from monolith to microservices several years back, Nic Jackson, developer advocate at HashiCorp, wished he had something like Consul Service Mesh and its ability to execute progressive delivery deployment techniques. Instead, his team did what every other web development shop was doing at the time: big bang releases, which are considered painful and unnecessary today for any system of significant size.

This talk takes you deep into the Consul Service Mesh code and several examples, conveying helpful information for seasoned DevOps professionals. Examples include new features that allow:

  • Traffic routing
  • A/B testing
  • Traffic resolving
  • Traffic splitting
  • Canary deploys
  • Multi-cluster gateways

You can visit this GitHub repository to play with all of the examples from this talk.


    Nic Jackson

    Developer Advocate, HashiCorp

» Transcript

There are a lot of you folks in here. That many people want to see my demo fail? I know, I know you're out for blood. Look at this, matches my shirt. Now, if only I'd spent as much time on my demos as I have making sure that matched my shirt.

All right, so let's go. Consul Service Mesh is a tool that I absolutely love, and I'm going to tell you why. The reason that I love Consul Service Mesh is because it's a tool that I wish I had when I was developing systems.

» From monolith to microservices: A case study

The last big system that I developed, I was working for an eCommerce company in London, and we did the whole migration thing. We did the migration from monolith to microservices about 5 years ago, and it took a long time. It took us about 18 months to get the tooling in place, to get the process in place, to change code, to build schedulers, to move to cloud.

And the date came, Saturday, 9:00 PM, sitting in the office eating some pizza. Now, I want to caveat this: it was bound to go wrong because somebody ordered anchovy pizza. Nobody should have to suffer that. But we were really excited, and we'd planned that we had to do this downtime. 12:00 AM was the time we were going to put the site into maintenance mode, and then we were going to do this migration.

Now 12:00 AM came, put the site into maintenance mode. Everything started to roll smoothly until about 2:00 AM. And at about 2:00 AM, once we'd drunk about 5 cases of Red Bull, we realized that the migration just wasn't going to happen. We had complexities. We could not get the system switched over, so we had to make the call to roll back. And that was sad.

The whole team was really dejected by having to do that, but we wanted to try again. And try again we did. 5 days later in the office, 9:00 AM, no anchovy pizza this time, guaranteed success, and it was going good, going really good.

2:00 AM came, and we'd got past the point that we were at before. The system was looking good.

4:00 AM, still going good.

6:00 AM, should we make the go/no-go decision? Yes. Turn off the maintenance page.

7:00 AM, getting into the car. Phone rings, CTO. Site's down. Okay, I've not slept for 24 hours. I can deal with this. Back up to the office. Everybody gone.

Eventually we figured out what was wrong.

Was it the deployment?

Was it the new cloud?

Was it the new system? No, we tested it. We tested it again. We had infrastructure testing.

We had all the tests, and it still wasn't working because maybe, just maybe, when we were testing, we were using the development credentials for the payment gateway on the production system... Maybe...

But what I'm driving at is that we had to do that big bang approach because the tooling that we had at the time necessitated that. And where we are now, we've got wonderful tools which allow us to take very, very different approaches and be more graceful in the way that we develop and we operate our systems. They give us extra security and safety. And I love some of these features because I don't want any of you to have to suffer the pain that I did at that point in time.

» The features of Consul Service Mesh

So I'm not just going to talk about that monolith migration. I've got a few patterns that I want to run through with you. We're going to look at some of the new features of Consul Service Mesh. I'm going to give you a quick run-through of some basic configuration, just in case anybody didn't read the caveat in the paragraph in my abstract which said, "I'm not going to go into any basic configuration."

I'm going to look at some of the traffic routing patterns, and I want to apply that to monolith-to-microservices. Just going to try and put some patterns into this. And we're going to look at traffic resolvers, and we're going to look at traffic splitting, multi-cluster gateways, which I love, and service failover. So let's begin.

» Basic configuration

With Consul you're very familiar with this configuration. The service stanzas have been the mainstay of how you register services with the Consul client. And nothing much has changed there except that we added these new blocks which allow you to configure the service proxy, and you can do things in there like configuring the port that you want the data plane to run on. You can do things like configuring your upstreams, but what we introduced with Consul 1.5 is centralized configuration. Whereas the service configuration is specific to the service instance, central configuration applies to all instances of the service it names, in this instance, web.
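As a rough sketch of the kind of service definition described here, the HCL below registers a web service with a sidecar proxy and a single upstream. The port numbers are hypothetical values chosen for illustration and are not taken from the talk's demo.

```hcl
# Service definition loaded by the local Consul client agent.
# All port values are assumptions for illustration.
service {
  name = "web"
  port = 9090

  connect {
    sidecar_service {
      # Port the sidecar proxy (the data plane) listens on.
      port = 20000

      proxy {
        # The web service reaches "payments" via localhost:9091.
        upstreams = [
          {
            destination_name = "payments"
            local_bind_port  = 9091
          }
        ]
      }
    }
  }
}
```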

Here's a simple example of some service defaults. I'm defining the protocol of my web service as HTTP. If you want to do this as JSON, you can also define that same configuration file as JSON. We tried very hard to be able to give you that configuration option and define it as YAML... But we just couldn't do it... Maybe next version...
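A service-defaults entry of the kind described might look like the following minimal sketch. The file name web-defaults.hcl appears later in the talk, but the exact contents of the demo file are an assumption.

```hcl
# web-defaults.hcl (contents assumed)
# Applies to every instance of the "web" service.
Kind     = "service-defaults"
Name     = "web"
Protocol = "http"
```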

All righty. Those configuration entries though, we've got the proxy defaults. So those are global defaults, which you can apply to all instances of your proxy. Things like your metrics configuration, which metrics output format you want to use.
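For illustration, a global proxy-defaults entry could look something like the sketch below. Exposing Prometheus metrics is only one possible choice, and the bind address is a hypothetical value, not something shown in the talk.

```hcl
# Global defaults applied to all sidecar proxies.
Kind = "proxy-defaults"
Name = "global"

Config {
  # Assumed example: have Envoy expose Prometheus-format metrics.
  envoy_prometheus_bind_addr = "0.0.0.0:9102"
}
```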

Service defaults, protocol, a couple of other things that we're going to look at.

Service router, how do we configure L7 traffic routing.

Service splitter, the capability of doing things like canary deployments, weighted traffic to different service instances.

And service resolution, service resolver. Think of that like virtual services in some way. And I'll probably get hammered for saying that.

» The discovery chain process

The way that these work is they work in a very specific order. You don't have to apply all of the things in the chain. You can use a service router on it, you can use a service resolver on it. But when you're using the centralized configuration, it's important to remember that they are applied in a certain order.

So we do routing, then splitting, then resolution.

» Traffic routing

Let's dive into our first example. This is where we're at. We've got this system here, we've got a monolithic web app, and we've got a monolithic payments app. Now, we want to modernize, but we don't want to do the big bang. We want to do a graceful modernization. What I want to get to is this situation. I want to be able to break out the currency part of my payment service and put it into its own isolated service. And I want to do that without reconfiguring or re-coding any of the instances of my web application.

I can do that using service routing. So I'm going to define a service router and it's going to look a little like this. What we have are the capabilities to add various different routes. Here I am going to define a very simple HTTP route. A path of /currency is going to go to my new currency service and any other path is just going to go to my payment service. When it goes through that resolution chain, the service router inside of Consul is going to automatically choose the correct upstream. You don't need to change any configuration in your web application, Consul's going to do all of that for you just by setting up those central configs. So let's give it a look.
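The router described above could be sketched roughly as follows. The slide itself is not reproduced in the transcript, so the exact path prefix and the service names (payments, currency) are assumptions based on the description.

```hcl
# Router attached to the "payments" upstream that web already uses.
Kind = "service-router"
Name = "payments"

Routes = [
  {
    # Requests under /currency go to the new currency service.
    Match {
      HTTP {
        PathPrefix = "/currency"
      }
    }
    Destination {
      Service = "currency"
    }
  },
  {
    # Everything else stays on the monolithic payments service.
    Match {
      HTTP {
        PathPrefix = "/"
      }
    }
    Destination {
      Service = "payments"
    }
  }
]
```

Routes are evaluated in order, so the more specific prefix is listed first.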

What we got? First let me just start my example. I built a few little examples for you and I've built all of this stuff in the lowest common denominator. I built it all just in Docker Compose, so you can take the repo, play with it on your own, and hopefully it will be useful examples for you.

We've got Consul up and running, and we've got our currency service, we've got our payments service, and we've got our sidecar proxies. They'll go healthy in just a second once the health checks kick in. But what we want to start doing is enabling that service routing. So let's have a look. Is that healthy?

The configuration that we're going to apply is this router. Exactly what I showed you on the slide before. Now we have to also look at the service defaults because what we need to do—we're using L7 routing—we need to configure our services to be L7. So, I'm going to use protocol HTTP for those things.

While I'm waiting for that, how do I apply central configuration? I can apply central configuration in a number of different ways. Well, two different ways really. One is I can use the command line and I can use consul config write with a file (web-defaults.hcl). And the other way is that I can also use the API. So if you're taking an infrastructure-as-code approach, which you should, use the API. And as the opposite of write, you've also got read.

I can do consul config list -kind service-defaults. You can see I've got one web config loaded into my system and I can read that with consul config read -kind service-defaults -name web. So that's just coming out the other side there. That's all healthy.

Where we're at so far, if I curl my service, what's happening is my web service is calling the upstream and you can see there that it's hitting payments-v1. Now if I curl my service with my desired path of being able to have currency GDP, I'm still hitting the payment service because I haven't applied any of my configuration yet.

So let's go ahead and do that. consul config write currency-defaults.hcl, payment-defaults.hcl, and then the router. So this file here (payment-router.hcl). Okay, now let me curl my service. The root path is going straight to the old monolithic service as I wanted. If I curl that with my new currency path, you can see that I'm now hitting the currency service.

In my upstream configuration for my web service, you can see there that I've just defined payments. The web service is completely unaware that the service currency exists. What I'm doing is I'm rerouting that traffic using the centralized config, which I think is incredibly neat.

» Traffic resolver

Next example, traffic resolving. Let's take a look at how we can use the traffic resolvers. Why would we want to use traffic resolver? Well I think A/B testing is a pretty neat example. This is what we have. This is our current state, but you know we're still doing a bit of a risky approach. We're not really certain that our current system works.

My code works, I don't know about yours. But wouldn't it be nice to be able to do something like this? Where I can actually have both of those things in existence, and I can route traffic to a very specific subgroup for my new system. That way I can gradually test it. I can try out the new features, do some A/B testing, maybe some dark deploys.

And again back onto the discovery chain, what I'm now going to do is I'm going to introduce this new component, which is the service resolver. And I'm going to use it with the service router. And with a service resolver, what I can do is I can define things like subsets. I'm defining here two service subsets, one for v1 and one for v2. So my v1 service is going to be selected if it's been tagged inside of Consul with metadata version=1. My v2 service metadata version 2.
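A resolver along these lines might look like the sketch below. The metadata key "version" and its values follow the description in the talk, but the exact filter expressions used in the demo are an assumption.

```hcl
Kind = "service-resolver"
Name = "payments"

# Traffic with no explicit subset goes to v1.
DefaultSubset = "v1"

Subsets = {
  v1 = {
    # Instances registered with service metadata version=1.
    Filter = "Service.Meta.version == 1"
  }
  v2 = {
    # Instances registered with service metadata version=2.
    Filter = "Service.Meta.version == 2"
  }
}
```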

An important thing to note there is the default subset. With a default subset, all of my traffic will route to my v1 subset in the absence of any overrides. And that's a very useful thing. With my router, and we looked at an example of how to do HTTP routing on paths, I can also do HTTP routing using HTTP headers. If the HTTP header test group is present with the value B, all of my traffic will route to the v2 subset. If it's not present, I'm going to be using that default subset. So it's kind of cool.
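Header-based routing of this kind could be sketched as follows. The literal header name and value ("testgroup" and "b") are assumptions, since the transcript only gives them as spoken, and the v2 subset is assumed to be defined in a service-resolver for payments.

```hcl
Kind = "service-router"
Name = "payments"

Routes = [
  {
    # Requests carrying the test-group header go to the v2 subset.
    Match {
      HTTP {
        Header = [
          {
            Name  = "testgroup" # assumed header name
            Exact = "b"         # assumed header value
          }
        ]
      }
    }
    Destination {
      Service       = "payments"
      ServiceSubset = "v2"
    }
  }
]
```

Requests that match no route fall through to the resolver, which applies its default subset.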

This is the way it's going to work. We're going to set up this system. payments.service.consul com-something. If we go the route without the header we're going to hit the version 1 of our system. With the header, we're going to hit that version 2. So we can do that slow, gradual transition.

Let's take a look at it. First things first, let me kill my old demo and start my new one. All righty. So my configuration is very, very similar. Again, I'm using that centralized configuration. I'm defining my services because I want to be using L7 and they've got to be protocol aware. So HTTP. You can use gRPC as well, but... service defaults for my payments service. My resolver.

We looked at that on the slide. It's got that default subset, which means in the absence of a subset it will always route to version 1. Version 1 is defined by the metadata stored in Consul. I'm only using service metadata here, but you can actually use any of the filters which are available on the health checks.

And finally my router. You can see my setup there: got my payment service, payment v1 ID and v2 ID, v1 ID metadata version 1, v2 metadata version 2. Now in the absence of any centralized configuration, if I curl those endpoints, what I'm going to get is just standard load balancing. So curl it once, you can see I'm hitting v2, this time I'm hitting v1, because I've got no centralized configuration at present. So let's apply that.

consul config write payment_service_defaults.hcl, setting up that L7 configuration, currency defaults. My web defaults, my service resolver. And finally my service router.

If I curl my endpoint, what I'm going to get is my v1 on my service. Only my v1 on my service, and that's because of that default subset. Now when I apply the header to that, I should see this hit the v2 service. Let's add that header. Right. curl localhost, adding a header, test group B. And there we go, we're hitting the v2 service.

We've taken that big bang approach, we've deconstructed it, we've got the ability to run the old and the new system, and now we've got the ability to be able to subset and route only very, very specific traffic through to our new version. Gracefully doing that rollout, which is kind of neat.

» Traffic splitting

So what's next? Surely that's enough. That's pretty incredible. No? You don't like it? I mean I'll be honest, I wrote none of this, but I'll take all the credit. Let's take a look at our next example. This is going to be one of those talks, because I got so many examples. You know like a lot of talks, anybody watch them at one and a half speed? You want to watch this one at half speed.

Traffic splitting. It's all well and good being able to do that A/B type test. But one of the nice things about modern software practices is we can do these graceful automated rollouts. Being able to slowly roll out a version of new software and to be able to have that have only a very small fraction of traffic while you're checking that it's okay—because remember your code doesn't work and mine does—but it's very, very... in some ways it's a bit tricky, right?

With Consul Connect, you can use the traffic splitting L7 feature. Our desired state is that we initially want both of our versions of our software running. Our version 1 service and our version 2. But we want to send 100% of the traffic to version 1. Then we want to be able to test the water. Let's just check that the version 2 version of the software is behaving as we've expected.

We can do that by checking the various different metrics, and tracing, and other business metrics, and observability. We're going to do that gradually. Once we're confident, we're going to increase that, maybe do a 50/50, and then eventually we can roll that out so that 100% of the traffic is version 2 and we're going to then deprecate version 1. And we're doing it safely with minimum impact and disruption to our customers.

From a discovery chain perspective, we're now going to use these final two elements. We're going to use the service splitter and we're going to use the service resolver. The service resolver is exactly the same as what we used in our A/B test. We're defining the two subsets.

With the splitter, what we can do is define multiple splits. Here I've got two, and I've got two weights, 50/50. So the first split, 50% of traffic is going to go through to service subset v1. 50% to service subset v2. You can put whatever values you want in there as long as the total of all of the splits equals 100%. When the traffic comes in, again, Consul is going to change the way that things are load balanced to the upstream based on those weightings. I don't need to make any changes or any behavioral changes to my application. This is all just going to happen for me by doing this configuration through my central configurations.
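A 50/50 splitter like the one described might be written roughly as below. This is a sketch that assumes the v1 and v2 subsets are defined in a service-resolver for payments.

```hcl
Kind = "service-splitter"
Name = "payments"

# Weights across all splits must add up to 100.
Splits = [
  {
    Weight        = 50
    ServiceSubset = "v1"
  },
  {
    Weight        = 50
    ServiceSubset = "v2"
  }
]
```

Shifting the rollout is then a matter of rewriting the weights, for example 100/0 to start and 0/100 once version 2 is trusted.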

Let's take a look. So here we go. Again, let me just start my demo. Starting my services. Now what I'm doing is I'm putting things into my initial condition. I'm only going to have my version 1 service running because I want to be able to configure my service splitter and my service resolver before I deploy my version 2 service. We've got v1 of our payments running here. And what we're going to do is we're going to configure our service resolver to be able to define those subsets, and we're going to set up our service splitters.

First things first, let's get that configuration written. consul config write currency_default.hcl and again, my payment service defaults, my payments service, resolver. And my web service defaults. If I curl my service, what we're going to see is we're hitting v1 because we've only got our version 1 of our service running, but now let's deploy version two of our service. Oh, I shouldn't have done that. This was going way too well, wasn't it? You wanted blood. You got it. Thank you all. Let's just wait for this to start up.

We've now got our v1 and our v2 services running. Because we've got that centralized configuration already applied, when we curl our service we should only hit the v1 service. We're only going to hit v1. I'm pretty brave and I'm just going to go straight to 50/50, because I like to YOLO things.

So let's apply that splitter configuration. Again, consul config write payments_service_splitter_50_50.hcl. Now, curling my endpoint you can see that I'm getting flip flopping between v1 and v2. 50% of my traffic going to one and 50% going to two. So what if I change that? Well, I'm pretty confident now, everything's working great. I'm going to send 5% to v1 and I'm going to send 95% to v2.

Again, I'm just going to apply that configuration. All of these changes that you're making through central configuration are going to be applied with zero downtime because of the way that Envoy works. Envoy's going to hot reload any of the configuration that we're passing down to it, so it's not going to cause any impact to your existing traffic. 5/95. Let's curl that. Now we're pretty much hitting v2 all of the time. This is going to just call me a liar, but if I do this long enough it will hit v1 eventually. You got anywhere to go to after here? You can trust me that that works. I want to move on because I want to show you something even cooler.

» Canary deployments with A/B testing

We've done A/B testing, we've done traffic splitting, we've looked at service resolution and Canary deployments. What if I could do Canary deployments with A/B testing? That'd be pretty neat, because then I can be even safer. I can ensure that my test group only gets a select amount of my new service whilst I'm rolling out any updates. So my situation is that I want to be able to do this. If somebody is in the test group, they've got the HTTP header test group equals B, then they're going to hit the traffic splitter. Anybody who doesn't is just going to hit the v1 version of the service.

Now I'm actually using all of the elements of the discovery chain. I'm using the router to detect that there's an HTTP header, and then to send it off to a subset or a traffic splitter, and then to the service resolver to be able to determine which version of my service to select. Which is kind of awesome.

Service router, it's exactly like we saw in the A/B test. I'm just going to be using test group, with the exception that I'm defining a root path now and I'm explicitly going to send that to my service subset v1.

That puts all the things into action, and it's an incredibly graceful way of being able to introduce new versions of software into your environment. Let's now do that, and see how that works. I've got my router. All my configuration is already applied, so I'm currently running my service resolver and my service splitter. Let's apply the router.
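One way to express this combination is sketched below, under the assumption that a route destination without a subset is handed to the splitter, while a destination with an explicit subset goes straight to the resolver. The header name and value are hypothetical stand-ins for the test group header mentioned in the talk.

```hcl
Kind = "service-router"
Name = "payments"

Routes = [
  {
    # Test group traffic: no subset given, so the payments
    # service-splitter decides between v1 and v2.
    Match {
      HTTP {
        Header = [
          {
            Name  = "testgroup" # assumed header name
            Exact = "b"         # assumed header value
          }
        ]
      }
    }
    Destination {
      Service = "payments"
    }
  },
  {
    # Everyone else: pinned explicitly to the v1 subset.
    Match {
      HTTP {
        PathPrefix = "/"
      }
    }
    Destination {
      Service       = "payments"
      ServiceSubset = "v1"
    }
  }
]
```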

consul config write payments_service_router.hcl.

Now, if I curl localhost, I'm hitting just my v1 version of my service, because I'm not in my test group. If I'm in my test group that I've got that header, test group B, then you can see that I'm going to hit the different version, version 2. And just to prove that that is actually traffic splitting, I'm going to switch this back to 50/50 and I'm going to apply that. So consul config write payments_service_splitter_50_50. curl localhost still hitting v1 there, test group. There we go. See? I wasn't lying to you, I wouldn't do that.

» Multi-cluster gateways

Where do we go next? Well where we go next is multi-cluster gateways (called Mesh Gateways). And one of the really, really impressive features is the ability that you can now route traffic between different service clusters.

I have my web service, and if my web service needs to talk to my payment service, which is in my PCI protected cluster, Consul's got the capability where that traffic can be transparently routed, regardless of what overlapping networks you may have, regardless of any sort of conflict, you don't need flat networks. Consul gateways are going to work.

And the way that it works is that the web service resolves the payment service to the gateway. The gateway is aware that the payment service resides in a different cluster, so it forwards the request to the DC2 gateway. The DC2 gateway then forwards it on to the payment service. At no point do you need to configure anything specific inside of your service upstreams or anything like that, it just does it. And it does it securely because the gateways are just transparently proxying the requests. They're using SNI headers which give them the ability to detect where the destination is without having to decrypt any of the protocol or the packets. And it's just awesome.

We need to make a couple of small configuration changes, so we need to be able to federate our two datacenters. I've got datacenter 1, and the primary datacenter is DC1. I'm setting up my advertise WAN address so that it can talk gossip across the WAN, and I'm enabling central configuration. For DC2, I'm specifying the datacenter as DC2 and the primary datacenter as DC1. The advertise address is just its local IP, it's running on the dot six.

Then there's the advertise address WAN, which is the WAN connecting the two datacenters, and the retry join, which is going to allow it to automatically join the other datacenter and federate all of its configuration and security, which is required for identity, authorization, and intentions.
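As a hedged sketch, the DC2 server's agent configuration might contain settings like the following. All IP addresses are hypothetical placeholders, and the demo's real files may set additional options.

```hcl
# Agent configuration for a server in the secondary datacenter.
# All addresses below are placeholders, not the demo's values.
datacenter         = "dc2"
primary_datacenter = "dc1"

# Address advertised to agents inside this datacenter.
advertise_addr = "10.0.0.6"

# Address advertised over the WAN that links the datacenters.
advertise_addr_wan = "192.0.2.6"

# Keep retrying to join a DC1 server over the WAN to federate.
retry_join_wan = ["192.0.2.2"]

# Let services pick up centralized configuration entries.
enable_central_service_config = true

connect {
  enabled = true
}
```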

Configuration for that is pretty simple. Again, we can use those service defaults, mesh gateway mode local, mode remote. So what's the difference between local and remote? With local, I'm always going to hit my local gateway, forward to the remote gateway, and then on to the destination. With remote, I'm going to skip the local gateway and I'm just going to go direct to the remote gateway. So I'm saving a hop, but you've got to ask yourself, is that at the expense of your ingress security for your other datacenter? There's two ways that you can set this up.
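A service-defaults entry selecting the gateway mode could look like this minimal sketch. The choice of "local" for the web service is an assumption for illustration.

```hcl
Kind     = "service-defaults"
Name     = "web"
Protocol = "http"

MeshGateway {
  # "local": egress through this datacenter's own gateway first.
  # "remote": connect directly to the destination datacenter's gateway.
  Mode = "local"
}
```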

And let's take a look at how this works. Because I am almost as bad as Clint for timekeeping. We're going to run this. So what am I going to be doing in my configuration? Well, the configuration differences that I need in order to be able to route that traffic are pretty simple. I'm going to define again my service resolver and I'm going to use this time a redirect stanza. I'm giving a hint that when I want to resolve payments, it exists in datacenter DC2. Just to make things interesting, I've actually got my currency service inside of DC1, so payments talks to currency, which is in a different datacenter, and web talks to payments, which is in a different datacenter, because that's how I roll.
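The redirect described here might be sketched as the resolver below. The lowercase datacenter name "dc2" is an assumption about how the demo names its datacenters.

```hcl
Kind = "service-resolver"
Name = "payments"

# Resolve "payments" to instances in the other datacenter.
Redirect {
  Service    = "payments"
  Datacenter = "dc2" # assumed datacenter name
}
```

A matching resolver for currency would point back at the primary datacenter in the same way.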

Looking at the Consul UI I've got my services there. You can see that I've got my federated system, got my payment sidecar running in datacenter 2. And I can go ahead and I can apply my configuration. So consul config write payments_defaults.hcl, consul config write web_defaults.hcl, consul config write currency_defaults.hcl. Then the two resolvers because, again, payments residing in DC2 needs to talk to currency in DC1, and web in DC1 needs to talk to DC2. Currency resolver and my payments resolver.

Now, curl localhost:9090. And what you can see is happening is that there's an upstream for web, it's actually hitting services inside of DC2, so payments running in DC2. And payments running in DC2 is hitting a service and resolving to DC1. And all I've had to do is set up that configuration. It's an incredibly useful feature, incredibly powerful. I really love this thing. I could just go on, and I usually do, but I can't.

» Consul's service mesh features in review

Just to quickly recap some of the features that we've introduced. You've got the ability to do traffic routing: one of the example patterns, maybe monolith to microservices. A/B testing, use the traffic resolver, traffic splitting, the ability to do those canary deploys, dark deploys, multi-cluster gateways, incredibly powerful. Route your traffic between Kubernetes, virtual machines, virtual machines, Kubernetes, bare-metal, you name it you can do it. And service failover, traffic failover between clusters. (See a demo of these features that weren't covered in this talk by checking out Nic's Consul 1.6 webinar.)

I want to thank you for listening, but mostly I want to direct you to our URL there. All of the demos that I've been showing you, I literally just built those up in Docker Compose, because I can, because it's that easy. And I encourage you to just clone that repo, have a play. Let us know what you think. And if you want to chat about things, let's talk about it on the community forums. So discuss.hashicorp.

Thank you so much for listening.
