Service mesh technologies have gained a lot of interest over the past couple of years. Although the concept of a service mesh might not be new to people, some of the implementation details are. This video explores some of the key concepts while offering guidance and implementation advice.
Hi, my name is Mishra. At HashiCorp we often get asked about service meshes. Service mesh technologies have gained a lot of interest over the past couple of years. Even though the concept of a service mesh might not be new to people, the implementation details are new to people. In this video, we will explore some concepts behind the service mesh.
Here we have a service called Service A. We have another service called Service B. These can be a VM or container, or they can be running on a bare metal server; it doesn't matter. How would Service A talk to Service B? You could hardcode the IP for Service B in the config of Service A, which is easy. You can hardcode the hostname or the IP and make an HTTP call, and you're off to the races. In reality, this is not the case. You have multiple instances of services (Service B, Service C), and any of these can receive traffic on the other end.
Service discovery
We have a service discovery problem. How does Service A talk to Service B? What instance of Service B should Service A talk to? We also often assume that everything works fine in a datacenter, on a network. In practice, a network is not always reliable.
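As a rough illustration of the service discovery question above, here is a minimal sketch in Python. It assumes a hypothetical in-memory registry; the service name, addresses, and the `resolve` function are made up for the example, and a real system would query a catalog such as Consul instead of a dictionary.

```python
import random

# Hypothetical registry: service name -> list of (host, port) instances.
REGISTRY = {
    "service-b": [("10.0.0.11", 8080), ("10.0.0.12", 8080), ("10.0.0.13", 8080)],
}

def resolve(service, registry=REGISTRY, rng=random):
    """Return one registered instance of `service`, chosen at random."""
    instances = registry.get(service, [])
    if not instances:
        raise LookupError(f"no instances registered for {service!r}")
    return rng.choice(instances)

host, port = resolve("service-b")
```

Random choice is only one possible selection strategy; round-robin or health-aware selection would slot into the same function.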
Retries and circuit breaking
A call from Service A to Service B might fail. Should you retry? That's another question that you have to answer. Is it even safe to retry? Certain calls, like a database call, may or may not be safe to retry. Those are the kinds of challenges you see in service-to-service communication as well.
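The retry question can be sketched like this. It is a minimal example, assuming the caller knows whether an operation is idempotent (safe to repeat); the function name `call_with_retries` and its parameters are hypothetical, not part of any particular library.

```python
import time

def call_with_retries(fn, *, idempotent, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call fn(); retry on ConnectionError only when the call is safe to repeat."""
    # A non-idempotent call (for example, a write that may already have landed)
    # gets exactly one attempt.
    tries = max(1, attempts) if idempotent else 1
    for attempt in range(tries):
        try:
            return fn()
        except ConnectionError:
            if attempt == tries - 1:
                raise  # out of attempts: surface the failure to the caller
            sleep(base_delay * 2 ** attempt)  # exponential backoff between attempts
```

The `sleep` parameter exists only so the backoff can be swapped out, for instance in tests.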
Identity
Now let's talk about security, especially if you go into the whole cloud-native space, or zero trust networks, in which you don't have control over your network. How do you go about encrypting traffic from Service A to Service B?
Encryption
It might be okay, inside your DMZ or your own network, to use plain text for certain service-to-service communication that is not sensitive. But in reality, we need some form of encryption.
You can use things like Mutual TLS to achieve this service-to-service encryption. To do that you have an identity problem. Each service needs to be given a unique identity. Let's say Service A might get a certificate called Service A. In this case, this instance of Service B might get a cert—let’s call it Service B1. You have to distribute these certificates across different instances of services—they might be running on bare metal or VM or container, it doesn't matter.
Now let's take a case in which you have established the Mutual TLS connection between Service A and Service B. Let's say this actually works. Let's say, Service A and Service B had the right certificates. They're able to make the Mutual TLS connection. They're able to successfully talk to each other on the network. But the question is, are they actually allowed to talk to each other? That's where you have this problem of whether there's a policy involved—of whether they're allowed to talk to each other.
You might have a set of policies that define, in terms of the business logic, whether Service A and Service B should be allowed to talk to each other. For example, should a user profile service be allowed to access the billing content? Those are the types of things that you define in policies. You do need some form of authorization there. So you have the AuthZ problem.
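The authorization check described above can be sketched as a lookup over (source, destination) pairs. This is a minimal illustration; the service names and the `is_allowed` helper are hypothetical, and the default-deny choice is an assumption rather than something stated in the talk.

```python
# Hypothetical policy table: (source, destination) -> allowed?
POLICIES = {
    ("user-profile", "billing"): False,
    ("web", "user-profile"): True,
}

def is_allowed(source, destination, policies=POLICIES, default=False):
    """Return True if `source` may call `destination`; unknown pairs use `default`."""
    return policies.get((source, destination), default)
```

Note that this check happens after identity is established: a valid certificate proves who the caller is, and the policy decides what that caller may reach.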
These are some of the concerns when it comes to service-to-service communication, especially when you move into the world of microservices, and especially if you're running these microservices on the cloud.
Before we go any further, I’d like to explore some of the first principles behind service-to-service communication. There are two things that I like to explore when it comes to the model that services take when they communicate with each other: One’s the smart client approach, and the other one is the smart network approach. Or you could call it the smart mesh approach.
Let's say we have a Service A and Service B, very similar to the example before. They are able to talk to each other. All the smarts that come with service-to-service communication (circuit breaking, retries, traffic shaping, and things like that) are built into the codebase for the app, in this case, Service A.
Service A contains everything it needs to do the service call. This might be achieved using a set of libraries. You might include a Java library or a Golang library, based on what you use in your company, and get that for free as part of the code. Overall, the system is easy to reason about. Everything you're seeing over the network, and the type of calls you're making, are all defined in code. It's fairly easy to reason about the system as a whole.
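To make the smart client idea concrete, here is a minimal sketch of the kind of logic such a library might embed in the calling service: a simple circuit breaker. The class name, thresholds, and half-open behavior are assumptions made for illustration; this is not how any specific library implements it.

```python
import time

class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected without being tried."""

class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: allow a trial call; one more failure reopens it.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success closes the circuit fully
        return result
```

In the smart client model, every service that wants this behavior has to link in and upgrade code like this itself.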
Those are the kinds of things that you see in the smart client approach. A lot of companies have come out and talked about it. One of the successful ones, Netflix, has talked about it pretty frequently: they have used libraries like Hystrix, Ribbon, and Eureka to achieve this service-to-service communication. These are libraries that they share in the organization, and developers include them and get the smarts into the application.
So now let's explore the smart network—or the service mesh. Let's say you have the same example. You have Service A, and you have Service B. So, you introduce another process which is outside of the application process—the proxy, or the smart proxy. This proxy is running alongside the application, and it's running in a VM, for example. Let's say in this case you're running Service A in a VM and Service B is running in Kubernetes, in a Kubernetes Pod.
This proxy is alongside the application in the VM, and is a sidecar in the Kubernetes Pod—easy. This proxy has all the smarts built-in. You offload all those responsibilities to this proxy. The prime motive of this proxy is to make the service-to-service communication easy—to do traffic shaping for you, authorization, generate certificates, and so on.
In this case, the way the service calls happen is Service A talks to this proxy on a local port, which is completely secure—it's in the same VM. This proxy forwards the request and does some form of service discovery to figure out where this instance of Service B is running—in this case, Kubernetes—and then forwards that call up to Service B.
This is how the flow progresses: Service A to the local proxy, to the proxy on Kubernetes, and down again over localhost to Service B. In this approach, little to no code change is required in the Service A code or Service B code.
All that code and all that smarts is built into the proxy instead. This is a huge advantage over the other approach, in which you had to ship these libraries across your organization, maintain the versions, and update the libraries—which would cause chaos in your organization because you have to coordinate these changes across many different groups.
On the other hand, in terms of complexity and reasoning about the whole network, you now have yet another process that you have to account for when you think about the system as a whole. Your application owners might see an error over the network. They think it comes from the application, but the proxy might be causing those issues.
Those are the kind of questions and the kind of concerns when it comes to the service mesh. The service mesh uses some form of telemetry—or does things like observability, which addresses those failures when it comes to seeing errors in production and going about debugging them.
Those are the pros and cons of using a service mesh, or smart network. So let's dive deep into the service mesh architecture itself. In terms of the big parts that the service mesh brings in, there's the control plane, and there's the data plane. This proxy requires some form of data to configure itself. That is provided either through a config file or over an API.
A lot of proxies, like Envoy, expose an API that you can use to configure them at runtime. Other proxies, like NGINX, can be given a file to read; they reload the process and you're off to the races. Those are the types of things that you do to configure the proxy.
This part here is the responsibility of the control plane. The control plane is responsible for configuring and providing that service discovery data. It's also responsible for managing authorization. It's responsible for storing the routes when it comes to what service should be talking to what service. That's all the responsibility of the control plane.
The proxy itself is the data plane. Its responsibility is to do the heavy lifting, which is to route packets. It's responsible for things like enforcing retries, enforcing circuit breaking, making sure it's able to generate certificates and use them to do that Mutual TLS connection between the services.
That summarizes the control plane side of things, and the data plane side of things, which make up this service mesh system.
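The split between the two planes can be sketched with two small classes. This is a toy model under stated assumptions: `ControlPlane`, `Proxy`, and their methods are hypothetical names, configuration is handed over as a plain dictionary, and nothing here reflects the actual Consul or Envoy APIs.

```python
from dataclasses import dataclass, field

@dataclass
class ControlPlane:
    catalog: dict = field(default_factory=dict)  # service -> list of addresses
    allow: set = field(default_factory=set)      # {(source, destination)}

    def snapshot_for(self, service):
        """Config for one proxy: the upstreams it may reach and their addresses."""
        return {dst: list(self.catalog.get(dst, []))
                for (src, dst) in self.allow if src == service}

@dataclass
class Proxy:
    service: str
    upstreams: dict = field(default_factory=dict)

    def configure(self, snapshot):
        # The data plane only applies what the control plane hands it.
        self.upstreams = snapshot

    def route(self, destination):
        addrs = self.upstreams.get(destination)
        if not addrs:
            raise PermissionError(f"{self.service} has no route to {destination}")
        return addrs[0]

cp = ControlPlane(catalog={"service-b": ["10.0.0.11:8080"]},
                  allow={("service-a", "service-b")})
proxy = Proxy("service-a")
proxy.configure(cp.snapshot_for("service-a"))
```

The control plane holds the catalog and the policy; the proxy holds only the slice it needs and does the per-request work.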
When we talk about HashiCorp Consul, the Consul servers make up the control plane, and the Consul clients make up the data plane. The clients can give you a sidecar using Envoy. Or you can even configure other proxies to work with Consul's data plane.
That gives you a summary of the service mesh—how it applies to Consul, how it maps there, and all the other options that you have with smart client, smart networking. That's the first set of principles when we talk about the first principles for service-to-service communication.
The second set of principles is protocol awareness. This might mean you're using a Layer 4 data plane versus a Layer 7 data plane. When I talk about the data plane, I mean the actual proxy that's responsible for routing the packets.
With Layer 4, you have protocols like UDP or TCP. With Layer 7, you might see protocols like HTTP that are more popular. With UDP and TCP, you can do things like database calls and so on. Those work fine, no problem. They have an insane amount of performance when it comes to service-to-service communication and service-to-database communication over the data proxy, which is super interesting.
On the HTTP side, you get this rich insight into a request so that you can do more complicated actions on the service request. So you can do traffic shaping based on a host header. Or you can do traffic shaping based on a domain name. That gives power to the developers to do things like split testing and cookie-based routing. It's useful for doing service-to-service communication in the modern world.
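Here is a minimal sketch of what Layer 7 awareness makes possible: choosing a backend from the host header or a cookie. The backend names, the `beta` cookie, and the `billing.internal` host are hypothetical, and header keys are assumed to be lowercase already.

```python
from http.cookies import SimpleCookie

def choose_backend(headers):
    """Pick a backend pool from HTTP request headers (keys assumed lowercase)."""
    cookie = SimpleCookie(headers.get("cookie", ""))
    # Cookie-based routing: opted-in users go to the canary pool.
    if "beta" in cookie and cookie["beta"].value == "1":
        return "service-b-canary"
    # Host-based routing: a dedicated hostname maps to its own service.
    if headers.get("host") == "billing.internal":
        return "billing"
    return "service-b"
```

A Layer 4 proxy could not make either decision, because it never parses the request to see the headers.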
Both of these have pros and cons. In the Layer 7 case, you can do more complex actions. In Layer 4, you're not inspecting the packet, so you can't do those complex actions. But on a performance basis, Layer 4 might be more performant than Layer 7.
I hope you learned more about service meshes in this video. To learn more about the concepts behind service mesh, go to our learn platform.