Demystifying Service Mesh
Sep 23, 2019
HashiCorp Solutions Engineer Stephen Wilson gives one of the clearest introductions to service mesh you'll ever watch.
To do service mesh the right way, you need to start with the right premise, the right philosophy, the right idea. You have to nail service discovery before you even consider service mesh. And in order for us to truly adopt service mesh and next-gen networking, we're going to have to start peeling away this tight coupling against the IP.
In this talk, HashiCorp solutions engineer Stephen Wilson gives one of the best, plain-English explanations of service mesh that you'll find.
- Stephen Wilson, Solutions Engineer, HashiCorp
Hello, everybody. You all enjoying the conference so far?
Today I'd like to start out by giving you a little history lesson, if you will, about the telephone. This is Alexander Graham Bell, the picture you see here, and he is the inventor of what we would call the telephone. When he first invented it, it was primarily a point-to-point communication where you had to have 2 lines hooked together.
But in order to scale that and get to the point where multiple humans could talk to multiple humans at a given point in time, they had to make some level of interconnection, or points of intersect. In doing that, they formulated this idea of the operator.
If I wanted to make a telephone call, I would ring up, not the person I wanted to talk to, but an operator, and tell that operator that I wanted to talk to somebody, and then she would connect me. And the further the call that I wanted to make, the more operators I had to talk to in order to get there. But even at that, this didn't quite scale very well.
We had to move away from the manual aspect of a human interface and move to something more mechanical, in this case what you see here, and even beyond that to a digital interconnection, where now all of a sudden everything is routed automatically across the interwebs of space and time.
The parallels between app dev’s evolution and that of modern communications
That was the evolution of the telephone in order to bring you to modern-day communications. And if you think about the way that we've built applications and how we've interconnected them, we're actually taking a very similar journey.
If you think about it, back in the day with the client/server idea (some of us have mainframes; I remember mainframes), we had that very simplistic client/server mentality, where I had a frontend, and it was very simple for me to route traffic to that monolithic service that I had running kind of out there in space. Everything was self-contained; everything was right with the world.
Then we started to scale this up, very similar to the telephone. We started to add more services, and we started to add virtualization, and we started to add physical switching, virtual switching, physical firewalls, virtual firewalls, all these different components in order to try to scale up the size and scope of the applications that we were trying to use to interface or communicate with others.
It’s time to move beyond tight coupling to IP addresses
And it all comes down to this simple idea: The identity of everything that runs on this network is connected to the simple 4 octets of an IP address. Think about everything you have, every device. In fact, there's a crisis brewing where we're outstripping this idea and moving to IPv6 just because so many IP addresses exist out there.
What this IP address has created is a tight coupling between applications and the underlying physical infrastructure that they interface with. Think about decisions that are made when you deploy an application, and think about all of the dependencies around the network that go into it. The idea of networking micro-segmentation, the idea of firewalling, the idea of load balancing.
We're going to talk about that in a little bit. But this nomenclature of the identity of your service being tightly coupled to IPs is slowing down your ability to not only innovate internally, but to move to that idea of a hybrid cloud or a multi-cloud operating model.
What if there was no IP?
I'm going to ask a question and we're going to go through this. But in order to really start to think about this idea of demystifying a service mesh, you have to look at the problem and ask a controversial question: What if there was no IP? What if you did not have access to an IP?
So I'm Steve Wilson, solutions engineer at HashiCorp, and we're going to drive through this a little bit.
As we walk through and explore this thought exercise of, “What if there was no IP?”, you have to start to ask yourself some fundamental questions about how you would architect and deliver applications.
The first thing is, how would services be able to find each other if they did not have this identity nomenclature of an IP address? Or how would services be able to securely connect to each other? And in that context, how do I start to do the normal things around traffic shaping for critical ideas of deployment strategies, update strategies, failures, those types of things?
As we walk through these questions, I want you to judge these answers by these criteria. The first is, whatever you decide to do, it has to be able to further or bring a higher level of portability to your application delivery, be it running on-prem, in the cloud, in another datacenter. It doesn't matter. You need to bring a level where I don't care where that service runs.
The next thing is, a higher level of agility. I want to be able to update immediately, not just every day, but at will. I also want to be able to lower my overall operational burden, this idea of the cognitive burden. I don't want to have to think as much about how I'm delivering and managing the intercommunication of all these services as I'm delivering them faster, and I have no idea where they're landing. And I want to be able to do that at an extremely low security risk if at all possible.
I know you guys are asking, "We've solved this, right? We've solved this with DNS." We haven't. That's OK. There are others out there that would say, "We have solved it, and it's with YAML." And there would be some people out there who might disagree, like HashiCorp co-founder and CTO Mitchell Hashimoto tweeting, "YAML is never the answer. Not once. Not even once." So let's start to explore these questions a little bit.
Service discovery is foundational to the service mesh
How are services going to be able to find each other? This is foundational to even beginning to break apart this idea of the service mesh. This is the dirty little secret that you typically won't hear being talked about when you hear the buzzword of service mesh.
You normally hear about the mutual TLS connections and you hear about circuit breaking. But no one's coming in going, "Hey, services finding each other is really a core critical component." Because they have assumed that you've solved it. And ultimately this is the expression in human language that you're trying to do: Web-App needs to talk to Order Processing, or needs to find Order Processing.
Load balancers are not the answer
When we go through this, this diagram is what it would look like in a very simplistic idea. I have a load balancer in front of my different versions of my services or multiple copies of the service running. I've assigned a virtual IP to that load balancer to give it some level of static identity. Remember back to that idea of an IP, right? And I've hooked those together with other components, be it load balancers, switches, routers, networking, whatever that may be above it. But ultimately I've given it a DNS entry of web-app.internal or order-processing.internal. And they're able to communicate in that capacity.
By having this load balancer, I'm getting some level of assurance around request balancing, so that I don't flood my services. I'm also able to handle any type of failure and failover within this, and to allow traffic to be routed to healthy nodes.
What I've done here is I've added a layer of abstraction. I've put a virtual IP at the load balancer, the entry point, and then underneath it I've allowed those individual components to have their own IP addresses of any nature. But I can still find it.
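To make that abstraction concrete, here is a rough sketch of what sits behind the virtual IP: requests are handed out round-robin, and unhealthy backends are skipped. The class names, addresses, and health flags are invented for illustration; a real load balancer does a good deal more than this.

```python
from dataclasses import dataclass, field


@dataclass
class Backend:
    address: str          # hypothetical backend IP:port
    healthy: bool = True  # in practice this would come from health checks


@dataclass
class VirtualIP:
    """Toy load balancer: one stable entry point, many backends behind it."""
    backends: list
    _next: int = field(default=0, repr=False)

    def pick(self) -> str:
        # Try each backend at most once, starting where the last call stopped.
        for _ in range(len(self.backends)):
            backend = self.backends[self._next % len(self.backends)]
            self._next += 1
            if backend.healthy:
                return backend.address
        raise RuntimeError("no healthy backends")
```

Callers only ever see the `VirtualIP`; the individual backend addresses can change underneath it, which is exactly the coupling to IPs that the rest of the talk tries to loosen.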
But this creates 2 different teams that now have to communicate simultaneously; in this construct it’s a networking team and the application team. For the audience, this is kind of group therapy. If some of you start to feel a little anxious, that's OK. There are others next to you that are going to feel the same way.
This construct of, "I have an application team, and they're going to deploy a new application, Order-History, and it's out. I've done it and they've told the network team, 'That's now been deployed.'" The network team will say, "Thank you very much. I've received your ticket. You will hear back from us within 72 hours, give or take." And then they go around and they configure a load balancer to go in front of it, and then the networking team will call back to the application team and say, "Your traffic pattern has now been configured."
Maybe I submit a ticket and I can take a week's vacation, because that's how long it'll take for traffic to get routed back to me. This is probably how it's done. We'll do a quick poll. How many of us are feeling this is a little too much like reality? OK, several hands there. So I'm kind of on base here. That makes me feel good.
Now, some of us may say, "OK, we've solved this problem. We have a solution to solve this issue."
Kubernetes is not the answer
Everyone knows this word, Kubernetes. In some corners, this may be a dirty word. But we all see this and we think that this is what's going to solve it. And to an extent it does. It has built-in service discovery, built-in load balancers, built-in internal networking. I could see how this would solve a lot of challenges, because it has all the things.
You're saying, "See, this answers those questions that you posed." But when you go to bolt this in, it doesn't take the place of it; it just becomes an extension of what you already have. Because you still have applications that sit out here on the bottom. And now what you've done is you've deployed a mobile access service. You've deployed a load balancer in front of that, an ingress controller, egress controller. You then hook it to load balancers in front of it in order for it to be able to find everything. And again, it has a VIP that is now a DNS for entry and exit.
And the same pattern happens. I deploy my Kubernetes cluster, and I deploy my application. Now I've deployed the traffic pattern externally so that services can either route to it or services inside can route externally. That's where Consul comes in. And okay, I'm a little biased. But I would probably agree with this even if I didn't work for HashiCorp.
Consul makes things simpler
When you look at this construct of Consul, it makes things a lot more simplistic. Because I'm not concerned with an IP address anymore. What happens here is, when I go and I deploy my Order-History, under the covers Consul is automatically converting that to some type of discoverable route in a nomenclature that's easy to locate. And this construct is service discovery.
And I cannot stress this enough: If you do not get service discovery right, there is no reason for you to do a service mesh. You must do this portion correctly, or your service mesh implementations will not go very well.
If you think about it, when you're running these service meshes, you're thinking about Kubernetes. You're using that service discovery, but you have to find some way to expose that service discovery. To register services externally becomes very challenging. In this construct with Consul, it becomes very easy.
It's something that's understood both inside of the world of Kubernetes and containers and externally in the world of bare metal, in the world of even mainframes. You can do that as well. So this allows you to span across and do service discovery very well in that nomenclature. And here's an example, again, very simple, something everything understands; I mean even my toaster at home understands doing DNS lookups.
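As a minimal sketch of the idea, assuming nothing about Consul's actual API: services register under a name, and callers look them up by that name rather than by a fixed IP. The `Registry` class and the addresses are made up for the example; the `<service>.service.consul` name form is the one that comes up later in this talk.

```python
from collections import defaultdict


class Registry:
    """Toy name-based service registry; this is not Consul's API."""

    def __init__(self):
        self._instances = defaultdict(list)

    def register(self, service, address, port, healthy=True):
        self._instances[service].append(
            {"address": address, "port": port, "healthy": healthy}
        )

    def lookup(self, service):
        # Callers ask by name and only ever see healthy instances.
        return [
            (inst["address"], inst["port"])
            for inst in self._instances.get(service, [])
            if inst["healthy"]
        ]


def dns_name(service, domain="consul"):
    # Builds a name of the form order-history.service.consul
    return f"{service}.service.{domain}"
```

The point of the sketch is that the caller never hard-codes an address: an instance on bare metal and an instance in a container both show up under the same name.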
Connecting services securely
The next question we have to move on and begin to communicate with is this idea of, How can services securely connect to each other? I've solved this challenge of service discovery, that's great. But how do I move in, and why do I want to move in, to this idea of securely connecting to each of these? Let's go back to our example of "Web-App wants to talk to Order-History." But not only do I want it to talk to it, I want to ensure that it's talking to it securely. And I want to ensure that only Web-App can talk to Order-History.
If we're going to decouple the IP, this is the next step. There is a high burden within the networking organization that they're not only managing pure network security; they're managing application security at the networking level.
If you think about the conversations that you've had around micro-segmentation at the IP level, and firewalling, and those types of things, they're bringing a high level of complexity to the network, and they're unable to optimize networks effectively. That's why it's very slow for them to make changes. Because they're not only having to worry about general network security, which they absolutely should be concerned with. But they also have to be highly concerned with application security flowing across that network link.
Firewalls everywhere are not the answer
So back to our example. App team, network team, but now we've added a new team. We've added a third group here, the firewall team. What I've done to ensure network security is put a bunch of firewalls in front of everything.
I'm hearing some laughter from this side, and I think they're laughing because if they didn't they would start crying. Because this is exactly how it looks in their environment.
Now I have firewalls in front of my applications that ensure that those application components can only talk to the load balancer. I have firewalls in front of the load balancer to ensure that they can only talk to other load balancers in front of them. And then I have to put a firewall in front of my Kubernetes cluster to ensure that I can route traffic and protect traffic going in and out of that ingress/egress point.
I was talking to a regional bank recently, and they told me that when you log into their online banking, you traverse 50 firewalls just in the login process.
Then you wonder, Why can't the network team do an upgrade to the latest firmware updates? Or how can they keep up? Because these are the types of complexities. We've grown, we've added more operators to the equation.
Now this is how it works. This is how the conversation goes. App team says, "I've deployed Order-History." Network team says, "Thank you very much. We'll get back to you sometime." Networking team then contacts the firewall team and says, "They put this out here. We can't configure the load balancers until you configure the firewalls."
So then the firewall team has to go through and parse through their 2 million-row Excel spreadsheet in order to validate everything. And then they contact the network team and say, "We think we got it right. It does sound like it works." Then the networking team makes their load balancer adjustments, and then they communicate back to the app team and say, "The traffic pattern is configured."
I talked to a customer 2 weeks ago, and they told me that this process takes them anywhere between 30 and 60 days.
Folks, once you solve the provisioning problem with Terraform, your next step is immediately moving into this realm of dealing with service discovery. And ultimately moving to this idea of service mesh. Because it is going to be the next bottleneck in your deployment workflows that is going to cause you the greatest queue times.
The multi-cloud challenge
And let's not forget that on top of this we're adding all of this fun stuff: Now I've got multiple clouds, and they express networking entirely differently than anybody else. AWS expresses it with all kinds of different connectivity nomenclatures. GCP expresses it differently. Azure expresses it differently than that.
And don't get me wrong; Terraform does a fantastic job of configuring and managing these things. But a lot of times we're building complexity into our Terraform code, all in an effort to maintain that level of network security.
Think about all the routing, all the firewall rules, all the security groups. We're inherently copying and pasting these operating models within our private datacenter and moving them to the cloud.
Now if we add the cloud, we're just adding more players to this group. App team now calls the cloud team, says, "I've deployed." Cloud team then calls the network team says, "We've added a new service, and it's out in GCP." Network team says, "Great, now we go to the …" you know? And the cloud team comes back and says, "We've configured our part. Network team's still going to do their part. They're going to get back to you."
The firewall team does their firewall rules and everything else, then comes back to the network team. The network team then talks to the cloud team, and also talks to the app team, and finally says, "Traffic pattern configured."
I've had customers tell me that sometimes this can take anywhere up to 3 months. Tell me, where are we getting the speed? We're moving closer and closer to this idea (of firewalls on every microservice). I don't know about you, but this scares me.
Let's move into this idea of, How do we secure applications if there is no IP? I'm going to be biased; you are at HashiConf; we are going to talk about Consul here. But let's look at this idea of how, once we've solved service discovery, a service mesh comes into play in this context.
Going beyond IPs with ACLs
Now I have Consul running, and within Consul we have the construct of intentions. We have certificate management and we also have ACLs. What ACLs do is give applications some level of identity that is not IP-based. I can give an application an ACL token. I can utilize Vault to generate that token based off of some type of authenticated machine identity.
Once I have that ACL token, that machine has an identity. It can then be mapped to an intention. And what that intention says is, "Web-App is allowed to talk to Order-History."
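One way to picture an intention is as a rule keyed on service names instead of IP addresses. The sketch below is a simplified model, not Consul's implementation: the `Intention` type, the first-match rule, and the default-deny fallback are assumptions chosen for illustration, and real intentions have precedence and wildcard behavior that is not modelled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Intention:
    source: str       # calling service, by name
    destination: str  # called service, by name
    action: str       # "allow" or "deny"


def is_allowed(intentions, source, destination, default="deny"):
    """Return True if source may call destination.

    The first exact match wins; anything unmatched falls back to the default.
    """
    for rule in intentions:
        if rule.source == source and rule.destination == destination:
            return rule.action == "allow"
    return default == "allow"
```

With a single rule of `Intention("web-app", "order-history", "allow")`, Web-App gets through and every other caller is refused, with no IP address anywhere in the decision.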
Now what happens is that we use our favorite proxy. Everyone knows Envoy. I don't need to go into too much detail about what Envoy does. But the key here is that, because of the service discovery capabilities within Consul and because of the ability to have a distributed key-value store, I can manage it centrally in a control plane and be able to deploy to Envoy and dynamically configure it on the fly.
That intention then translates and says, "This is the configuration for Web-App Envoy proxy. This is the configuration for an Order-History Envoy proxy." And then what happens is that a certificate is then dynamically generated that is mutually signed between Web-App and Order-History. So that when Web-App talks to Order-History, it's now talking over a private tunnel.
In the context of our Web-App, instead of trying to hit order-history.service.consul or order-history.mydomain.internal, whatever it might be (and having to manage a port like 1531 or 9000, or anything of that nature), the application can simply talk to localhost on any port. On the Order-History side, you can maintain default ports, because the communication traffic is now running on non-privileged ports between 20,000 and 22,000.
Now you can go to your network team and say, "The only thing I care about is that bits can flow from A to B on this port range of 20,000 to 22,000."
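From the application's point of view, this might look something like the sketch below. The upstream table and the local port number 9191 are hypothetical; only the 20,000 to 22,000 range comes from the talk.

```python
# Hypothetical mapping: upstream service name -> local port the sidecar listens on.
LOCAL_UPSTREAMS = {"order-history": 9191}

# Port range quoted in the talk for proxy-to-proxy traffic.
MESH_PORT_RANGE = range(20000, 22001)


def upstream_url(service, path="/"):
    """URL the application calls; the local proxy handles the secure hop."""
    port = LOCAL_UPSTREAMS[service]
    return f"http://127.0.0.1:{port}{path}"


def in_mesh_range(port):
    """True if a port falls inside the range the network team has to open."""
    return port in MESH_PORT_RANGE
```

The application code only knows about `127.0.0.1`; where Order-History actually runs, and on which port, is the proxy's problem.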
Security baked into the deployment pipeline
I have offloaded all of the application security and communication bits and put them into the application configuration. I'm baking security into my deployment pipeline. And if any of you have read any of the DevOps handbooks or The Phoenix Project or anything of that nature, that is exactly what they tell you to do: that security must be baked into the application deployment pipeline.
Here's what it looks like. It's a very simple artifact. And within Consul Enterprise, we also are bringing to bear the ability for you to maintain a level of consistency from a security model perspective where you can comb through these and ensure that certain global policies would be enacted and configured.
Now these can be checked into version control. The big thing here is that a change is now a pull request. And in that construct, my documentation updates as a matter of the change being implemented.
Instead of me changing some type of communication and then having to go back and document it, simply by making the change in the environment, the output is the documentation of what the environment is. Think of it like dynamic CMDB-type stuff.
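Because the rules are plain data sitting in version control, a review step can comb through them before a change is merged. Here is a rough sketch of such a check. The rule format (dictionaries with `source`, `destination`, and `action` keys) and the policy itself, that no rule may allow a wildcard source, are both invented for the example and are not a Consul Enterprise feature.

```python
def lint_rules(rules):
    """Return human-readable violations of one made-up global policy:
    no rule may allow traffic from a wildcard source."""
    problems = []
    for index, rule in enumerate(rules):
        if rule.get("action") == "allow" and rule.get("source") == "*":
            problems.append(
                f"rule {index}: wildcard source may not be allowed into "
                f"{rule.get('destination')}"
            )
    return problems
```

Run as part of review, an empty result means the change passes; anything else is a comment on the pull request.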
Now if we go back to our same situation: I have Web-App; I have Order-History; I have all these different things. I now have federated datacenters with Consul, and I have all of these different services that can talk to each other. But I'm adding the cloud in here. We can't forget the cloud.
Now I have the cloud in here, and I have services running in AWS—and you saw the announcement from Azure. You're going to have a managed Consul within Azure. You can hook that right into your Consul on-prem—now I have all this stuff, and we have all the secure communication.
The mesh gateway reduces complexity in multi-datacenter networking
Then you go, "Wait a minute. The nightmare scenario: If I go to my networking team and tell them that this diagram on screen is what it's going to look like, they are not going to like this. This sounds great in theory. But this is not a scenario that people are going to be happy about."
And we thought about that. We thought very long and hard about how we would go about solving this particular challenge. And we came up with something very elegant in nature and simplistic, but it makes a lot of sense. What we have done is that we've created this idea of a mesh gateway. And now these mesh gateways sit in front of your services between datacenters. And what happens is that the applications talk to these mesh gateways, and then the gateways forward the traffic appropriately.
So now, in these multi-datacenter gateways, I can define how traffic is allowed to flow. For instance, I can say, "Traffic can only stay in my local datacenter; it can't escape." Think of this as being maybe a default service policy with a tag of "dev." So dev instances can never reach outside of their local datacenter that they're spun up in. Or this may be a prod configuration. If you look where it says "mesh gateway," it's saying "remote," allowing me to say I can communicate externally or communicate remotely.
And so the last question becomes more of a benefit than it really is a challenge. And that is, How do I programmatically do traffic shaping based on changes in strategy?
We heard earlier today Mitchell Hashimoto talking about this idea of acquiring another company. How do I wrangle and bring them in? How do I manage any type of updating, rolling updates, and failure?
In this context, in the diagram on the screen, you can see that this can be kind of challenging. That can be kind of tough. But if we look at it in this context, in this diagram showing when route requests are based on HTTP parameters, this is what you're trying to do from the traffic-routing perspective. I'm trying to say I have a Web-App, that a certain URI endpoint points to a specific service.
The beauty of wanting to go down this route of being able to shape based on any type of HTTP parameter is I can take my big, ugly Java monolithic application, and I can begin to peel them off by URI callout and splay them out over microservices over time. Versus having to say, "I've got to break this all apart in one shot to get it out there."
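The peeling-off idea can be sketched as a route table keyed on URI prefix, with the monolith as the fallback. The prefixes and service names below are hypothetical, and longest-prefix matching is an assumption made for the example, not a statement about how Consul evaluates routes.

```python
# Hypothetical route table: URI prefix -> service that now owns it.
ROUTES = [
    ("/admin", "admin-service"),
    ("/orders/history", "order-history"),
]
DEFAULT_SERVICE = "java-monolith"


def route(path, routes=ROUTES, default=DEFAULT_SERVICE):
    """Pick the service for a request path using longest-prefix match."""
    best = None
    for prefix, service in routes:
        # Match the prefix itself or anything below it, but not "/administrator".
        matches = path == prefix or path.startswith(prefix.rstrip("/") + "/")
        if matches and (best is None or len(prefix) > len(best[0])):
            best = (prefix, service)
    return best[1] if best else default
```

Each time another piece is carved out of the monolith, one more entry goes into the table; everything not yet carved out keeps landing on the default.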
I also want to be able to do some level of traffic shaping so that I can iterate faster. Remember, we want to move more agile in this. And I also want to be able to handle some type of failover. What happens if Order-History is no longer available? Well, in this context of having the mesh gateway and having all the interconnectivity between this, now I can handle not only HTTP traffic routing but failover as well.
Think about it in this context: I take a service, I peel it off, I place it in maybe GKE or AKS, or something of that nature and I put it at the /endpoint-admin. When I have a service mesh in place, something like Consul, when my frontend hits that endpoint, it will be routed dynamically over my mesh gateway, securely into the cloud, through into my Kubernetes cluster and be processed.
What if I want to move it the next day? Maybe I want to iterate again and bring it to my second version, or I have some type of failure and I want to move it to GCP. Just by simply it coming online—because again, I've solved that service discovery problem—wherever it is, it will be able to be found. And as long as traffic in bits can route from A to B, I'll be able to service that request.
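That "wherever it is, it will be found" behavior can be sketched as a lookup that walks datacenters in preference order and returns the first healthy instance. The catalog layout, datacenter names, and addresses are all assumptions made for illustration.

```python
def resolve(service, catalog, preferred_order):
    """Return (datacenter, address) for the first datacenter, in preference
    order, that has a healthy instance of the service."""
    for datacenter in preferred_order:
        for instance in catalog.get(datacenter, {}).get(service, []):
            if instance["healthy"]:
                return datacenter, instance["address"]
    raise LookupError(f"no healthy instance of {service}")
```

If the on-prem copy of Order-History fails and a copy comes online in GCP, the same call starts returning the GCP instance, and the caller does not change.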
Again, super simple, super human-readable. It's documented right here within the codebase. This is auditable. It's static, meaning I can comb through it. I can look for any type of errors. I can validate that specific security permissions are being honored.
And a great aspect of this is now the developer has access to define this. The developer can define this, it can be reviewed, and it can be checked in. And what I start to do now is I start to break apart that IP.
Now Consul becomes that disintermediating layer, allowing the network teams to move effectively and efficiently. So now they can have more optimized networks for general security practice, but also for a higher level of throughput. Now they can do those upgrades, they can do those firmware updates. Because their concern is not as much about maintaining application security as it is about maintaining general good networking security practices.
And on top of that, optimizing the network for overall throughput.
No need to wait for networking teams when security is baked in
And the application teams, they're not encumbered by waiting for the networking teams to do all the application security mappings, because that's now all been offloaded into the application space.
So when we look at this in a development pipeline, in development, as an application developer delivering services, when my application lands in development, because it's talking to localhost, everything about where it's landed is defined by the underlying system. So environmental configuration, secure access controls, network optimization, all those things are defined but yet abstracted away.
As this moves across your pipeline, the application picks those up and inherits those from the environment in which it runs. No longer am I worrying about development instances accidentally talking to production databases. Or worse yet, having some type of production instances talking to development instances. Because I have baked that off as just an artifact of where I land.
In closing, if you think about this, just like we've modernized and moved from the old phone to where we are in this modern take on telecommunications, in order for us to really begin to move forward in adoption of service mesh and of next-gen networking implementations, we're going to have to start peeling away this tight coupling against the IP.
A couple takeaways: To do service mesh right, you need to start with the right premise, the right philosophy, the right idea. I know I can't say, "No IP ever," because then you go, "How are we going to get traffic routing and TCPs? Very important." I get all that. But it's more of a philosophy, right? I have to act as if I don't have access to it. Let's start from that premise and work from that particular constraint.
The next thing is, service discovery is table stakes in this shift. If you see me after, you ask a question, my first question to you is going to be, "How are you solving your service discovery challenge?" You absolutely must solve this first.
You must also measure your ideas against agility, portability, and resiliency. How fast can I move? What’s my breadth of where I can actually move to? And how can I take a hit and keep on moving forward?
And, look, Consul's awesome. If you haven't tried it, you should go check it out. There's a ton of stuff down in the hub.
Also, if you're going to be here tomorrow, find Nic Jackson’s presentation. Because he's going to use the examples that we talked about in this talk and go into the nuts and bolts of how it works and how you do it.
He's going to go really deep. He's not going to talk a whole lot about what we've talked about here. He's going to jump right into the nuts and bolts of, How do I stand this up? How do I get it working? And how does it actually look?
Thank you, everybody. I really appreciate your time. I hope you found this informative. Thank you.