What is the Crawl, Walk, Run Journey of Adopting Consul?
Jun 19, 2019
Network automation can address the challenges of microservice networking, and Consul can take you through the three phases of that automation journey.
- Armon Dadgar, Co-founder and CTO, HashiCorp
Hi, my name is Armon, and today we’re going to talk about the challenges of microservices networking.
As we adopt dynamic infrastructure, particularly in cloud environments, what challenges does that bring, and how do we solve them with Consul in an incremental way? What’s the crawl, walk, run journey of solving some of these network automation challenges without breaking everything about how we do networking?
When we talk about microservice networking and the challenges associated, I think it’s useful to step back and look at a bigger network. And when I talk about the bigger network, there are 2 key network paths to talk about. One is north-south, and when we talk about north-south, that’s traffic flowing from the public internet and our end users into our networks, so flowing into us, versus east-west, which is traffic flowing within our data center between different services.
When we talk about these 2 key flows, there are a bunch of key systems to consider. The first is, as traffic comes north-south from users over the public internet, what are the systems that are going to be touched? Common ones will include firewalls and load balancers. These load balancers will then ultimately bring us back to our different applications.
Maybe we have a web app and we have an API service and these services might need to communicate east-west to other services like a database. And our database may also have an internal firewall that protects it. So this flow from our API server, for example, to the database is an east-west flow, versus this path from our user to our web service, which is a north-south flow.
A more dynamic network
Now, as we start to talk about these challenges within the context of microservices, what are we really talking about? One aspect of this is, in a traditional world, these pieces were all relatively static, and some of the key network components were hardware. So as we talk about microservices in a cloud environment, we can’t necessarily ship our hardware devices anymore to the cloud. So some of these things have to be treated and managed like software.
The other thing is our applications become much more dynamic. We might want to be able to push 5 updates or 10 updates a day to our web server. We might want to dynamically scale up and down our API servers based on load. As we start to do these things, we put pressure on these different networking pieces. As an example, if we autoscale our web servers and APIs, we now need to update the load balancer, and potentially our ingress firewalls, much more frequently.
These things now need to be updated to allow traffic to new web servers as they scale up and then disallow access as they scale down and those IPs are no longer relevant. Similarly, as we talk about these firewalls internally, if I scale up the number of APIs, I need to update the internal firewalls and internal load balancers to allow access between these different services.
When we start to talk about the key functions within the network, in various places what we’re worried about is authorization. When we deploy a firewall, we’re using it to manage which services are authorized to talk to one another. And our load balancers are acting as routing and naming devices. So to our user, they’re hitting api.hashicorp.com. That’s the name. But internally we might be mapping that to different services depending on the endpoint that they’re hitting.
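To make the routing role concrete, here is a minimal sketch of mapping one public name to different internal services by endpoint. The route table, the service names, and the longest-prefix rule are all assumptions made for illustration; they are not taken from any particular load balancer.

```python
# Hypothetical route table: path prefix -> internal service name.
ROUTES = {
    "/users": "user-service",
    "/orders": "order-service",
}


def route(path: str, default: str = "web") -> str:
    """Pick the internal service for a request path (longest prefix wins)."""
    matches = [p for p in ROUTES if path == p or path.startswith(p + "/")]
    if not matches:
        return default
    return ROUTES[max(matches, key=len)]


# The user only ever sees one name; the endpoint decides the backend.
print(route("/users/42"))
print(route("/"))
```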
Strains on the network
These are key functions that exist in the network, but now we’re going to put pressure on them by virtue of having many, many more services. As we adopt the microservice architecture, we have many more of these things, but we’re also gonna make that much more dynamic. Instead of updating every few months, they’re going to update a few times a day, potentially.
We’re going to put pressure on all of these network services to update themselves. When we talk about this whole world, it’s these key functions. Authorization managed by things like firewalls. How do we update them? How do we update our load balancers as they provide things like naming and traffic management?
And then as we talk about our API gateways, how do they provide ingress as well as some amount of routing and filtering? These are the key pieces that we put a bunch of pressure on with our microservice architecture and that we’d like to deal with in a more sophisticated way so that our network doesn’t become the bottleneck that’s preventing our application teams from being able to deliver value.
Consul's Crawl, Walk, Run Journey
As we talk about the challenges of microservices in the networking environment, these are really what we’re trying to solve with Consul, primarily through the use of network automation.
When we talk about Consul, it’s natural to describe sort of a crawl, walk, and then run journey. I think with anything, it’s hard to go from 0 to 60. I think there’s incremental value at different stages of automation throughout our infrastructure.
The crawl phase
So I think at the most basic level, if we start off talking about the crawl phase, it starts by acknowledging, “OK, I have these applications, you know, let’s say A and B, that need to communicate with one another. And today we might constrain them to flow through, you know, a firewall that’s going to provide authorization, and a load balancer that’s going to provide naming. And the application is going to transit something like this. It’s going to hardcode IP 1 to go to the load balancer, and the load balancer will bring us back to IP 2, which is the end service.”
And so the first-level challenge in this journey is, How do we stand up Consul to act as a central registry? And this registry, again powered by Consul, what it does is as these services get deployed, they get registered. So Consul has a sort of bird’s-eye view of all of the applications running in our infrastructure. And it might be that A is running on a physical machine in our private data center, and B is running in a container on top of Kubernetes in the cloud. It doesn’t really matter. What matters is the registry has the sort of bird’s-eye view of all of the infrastructure, and where it’s running.
What this enables us to do right away is, now this registry that has the central knowledge of what’s running where can be queried through a number of interfaces. So this gets exposed either as DNS or as a rich HTTP API. And so now application A, instead of hardcoding IP 1, can hardcode a name. They can say, “I want to talk to service B, at b.service.consul.” And that’s going to get resolved via DNS to an actual IP.
So in this case we won’t have to hardcode this IP. That’ll get resolved by Consul to IP 2, and we don’t need to use the load balancer as a naming construct. We’re not hardcoding an IP to just take us to a load balancer to bring us back to the service. Instead, we’re letting Consul fill this in dynamically and take us back to the right IP.
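As a rough illustration of the crawl phase, the sketch below models a registry that services register with and that callers query by name instead of by IP. The `Registry` class and the address are hypothetical stand-ins, not Consul’s API; the only detail borrowed from Consul is the `<service>.service.consul` naming form used by its DNS interface.

```python
from dataclasses import dataclass, field


@dataclass
class Registry:
    """Toy stand-in for a central service registry (not Consul's real API)."""

    # Maps a logical service name to the addresses of its registered instances.
    instances: dict[str, list[str]] = field(default_factory=dict)

    def register(self, service: str, address: str) -> None:
        """Record a newly deployed instance under its logical service name."""
        self.instances.setdefault(service, []).append(address)

    def resolve(self, dns_name: str) -> list[str]:
        """Resolve a '<service>.service.consul' name to instance addresses."""
        suffix = ".service.consul"
        if not dns_name.endswith(suffix):
            raise ValueError(f"not a service name: {dns_name}")
        service = dns_name[: -len(suffix)]
        return list(self.instances.get(service, []))


registry = Registry()
registry.register("b", "10.0.0.2")  # hypothetical address standing in for "IP 2"

# Application A hardcodes a name, not an IP.
print(registry.resolve("b.service.consul"))
```

The caller never sees the load balancer’s address; the name is the only thing it has to know.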
So what this helps with is, as we scale this service, we add more instances of B. Maybe the initial instance fails. Consul’s able to dynamically mask all of this. Instead of returning IP 2, we’ll return IP 3 or potentially IP 4—one of the other healthy instances of B that are still running somewhere in the network.
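The masking of failed instances can be sketched as a lookup that only ever answers with healthy instances. The `Instance` type, the addresses, and the random choice among healthy instances are assumptions for illustration; the real health-checking mechanism is not modeled here.

```python
import random
from dataclasses import dataclass


@dataclass
class Instance:
    address: str
    healthy: bool = True


def resolve_healthy(instances: list[Instance], rng: random.Random) -> str:
    """Return the address of one healthy instance, chosen at random."""
    healthy = [i.address for i in instances if i.healthy]
    if not healthy:
        raise LookupError("no healthy instances registered")
    return rng.choice(healthy)


# Hypothetical addresses standing in for IP 2, IP 3, and IP 4.
service_b = [Instance("10.0.0.2"), Instance("10.0.0.3"), Instance("10.0.0.4")]
service_b[0].healthy = False  # the initial instance fails

# The caller still gets a usable answer: one of the remaining instances.
answer = resolve_healthy(service_b, random.Random(0))
print(answer)
```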
So we start to move away from using a load balancer just as a fixed naming device, and really only need to use it in the cases where there are many millions of requests being made, or thousands of requests per second, and we need to have a load balancing capability, not just a naming capability. So this becomes sort of the crawl.
The walk phase
As we talk about the walk, the next step of this is, instead of managing manual updates to load balancers and firewalls, what we can do is establish almost a publish/subscribe relationship. As these services launch, they’re basically publishing to the registry and saying, “Hey, I’m a new instance of B; I’m now available.” And what we can do is subscribe to these changes and use those to drive automation.
So instead of booting a new instance of B and then filing a ticket against our load balancer team saying, “There’s a new IP 3 that’s running an instance of B. Go update the load balancer,” this piece gets automated. As an ops team, we simply deploy a new instance of B, or maybe it’s autoscaling up and down. That gets published, and then the subscription is where we inject our automation. We subscribe to the changes to the registry. Great, there’s a new instance of B. We use that to automatically update the load balancer, and now new traffic will flow to it.
The same idea applies to the firewall. We subscribe and say there’s a new instance of B. Great, authorize the existing instances of A to talk to this new IP of B, and update the firewall dynamically. So you move to this publisher/subscriber model, where the registry is allowing the networking team and the operations team to decouple their workflow, but do it still in real time.
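The walk phase can be sketched as a registry that notifies subscribers whenever an instance registers. Everything here is a hypothetical in-memory model: the `Registry` class, the callbacks, and the addresses are made up for illustration, and the load balancer and firewall are plain Python collections rather than real devices.

```python
from typing import Callable

# A subscriber receives (service, address) for every new registration.
Subscriber = Callable[[str, str], None]


class Registry:
    """Toy registry that publishes registrations to subscribers."""

    def __init__(self) -> None:
        self._subscribers: list[Subscriber] = []

    def subscribe(self, callback: Subscriber) -> None:
        self._subscribers.append(callback)

    def register(self, service: str, address: str) -> None:
        # Publishing: every subscriber hears about the new instance.
        for callback in self._subscribers:
            callback(service, address)


load_balancer_pool: dict[str, list[str]] = {}
firewall_rules: set[tuple[str, str]] = set()
A_INSTANCES = ["10.0.0.1"]  # hypothetical existing instances of A


def update_load_balancer(service: str, address: str) -> None:
    """Add the new instance to the service's backend pool."""
    load_balancer_pool.setdefault(service, []).append(address)


def update_firewall(service: str, address: str) -> None:
    """Authorize existing instances of A to reach a new instance of B."""
    if service == "b":
        for source in A_INSTANCES:
            firewall_rules.add((source, address))


registry = Registry()
registry.subscribe(update_load_balancer)
registry.subscribe(update_firewall)

# Deploying a new instance of B is the only manual step; no ticket is filed.
registry.register("b", "10.0.0.3")
print(load_balancer_pool, firewall_rules)
```

The ops team and the networking team only share the registry, which is what lets their workflows stay decoupled.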
The run phase
Then, as we go to the run phase, when we talk about service mesh, this is this idea of, How do we actually move these core functions out of the network entirely? Effectively, what we’ve done is solve some of these earlier challenges of authorization and naming and ingress within the network with different appliances.
So how do we move that out of the network? The way this works at its core is, we deploy a proxy alongside each of our applications. This might be, for example, Envoy. So on both sides we’re going to deploy a proxy alongside the application. When A wants to talk to B, it actually talks outward through this proxy. The outbound proxy is responsible for the service discovery and the load balancing. It figures out which instance of B, of the potentially many, we should actually send traffic to. And then it initiates the connection, and the receiving end is going to force an upgrade to mutual TLS. So both sides have to present a certificate to say, “Who are you?”
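What “forcing mutual TLS” means on the receiving side can be sketched with Python’s standard `ssl` module: the server context is told to require a certificate from the client. This is only an illustration of the idea, not how Envoy or Consul configure it, and the function name and file-path parameters are hypothetical.

```python
import ssl
from typing import Optional


def make_mtls_server_context(
    cert_file: Optional[str] = None,
    key_file: Optional[str] = None,
    ca_file: Optional[str] = None,
) -> ssl.SSLContext:
    """Build a server-side TLS context that demands a client certificate."""
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    # CERT_REQUIRED makes the handshake fail unless the peer presents a
    # certificate that validates against the trusted CA.
    context.verify_mode = ssl.CERT_REQUIRED
    if cert_file and key_file:
        # The receiving side's own identity (paths are hypothetical).
        context.load_cert_chain(certfile=cert_file, keyfile=key_file)
    if ca_file:
        # The CA used to verify the calling side's certificate.
        context.load_verify_locations(cafile=ca_file)
    return context


# No files are loaded here, so this only demonstrates the setting itself.
context = make_mtls_server_context()
print(context.verify_mode)
```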
And so the receiving end is upgrading us to mutual TLS, and the reason is that this enables us to do authorization. So now we have a strong notion that A is talking to B and B is talking to A. This is given to us by mutual TLS. And we can tie that back to some central rule in the registry. So as part of the registry, we could manage a set of rules that say, “A is allowed to talk to B, but B is not allowed to talk to D.”
And now what we get is, using the certificate, we’ve established an encrypted communication between these 2 services. We have a strong notion of their identity, so the receiver can check, Is there a rule that allows A to talk to B? If so, great. Allow the traffic back, and now A is talking to B through these 2 proxies. If not, we close the connection and fire off an audit trail saying, “Some random app A just tried to talk to me. It shouldn’t be allowed to talk to me.”
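The receiver’s check can be sketched as a lookup against a set of logical rules, with a denial recorded for auditing. The `Intentions` class below is a hypothetical model rather than Consul’s API, and treating any pair without an explicit allow rule as denied is an assumption of this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Intentions:
    """Toy rule store keyed by logical service identity, not by IP."""

    allowed: set[tuple[str, str]] = field(default_factory=set)
    audit_log: list[str] = field(default_factory=list)

    def allow(self, source: str, destination: str) -> None:
        self.allowed.add((source, destination))

    def authorize(self, source: str, destination: str) -> bool:
        """Return True if a rule allows the call; otherwise log and deny."""
        if (source, destination) in self.allowed:
            return True
        # Assumption: no matching rule means the connection is closed.
        self.audit_log.append(f"denied: {source} -> {destination}")
        return False


rules = Intentions()
rules.allow("a", "b")

# The identities would come from the certificates presented over mutual TLS.
print(rules.authorize("a", "b"))
print(rules.authorize("b", "d"))
print(rules.audit_log)
```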
So what this allows us to do that’s super interesting is, first, we no longer care about any guarantee our central network provides. Effectively, we’re saying, “I don’t trust this network. I’m going to encrypt all the traffic over it. I’m going to use TLS to establish my identity on both sides.” This network could be a private internal network, it could be the public internet, there could be an attacker on the network. It doesn’t matter, because we’re not trusting or assuming that this network is secure. This goes back to what you would call a zero-trust network. We’re assuming the network is adversarial.
So that has a really nice property: We don’t have to care what’s in the middle of it. The other nice property is, we’ve basically shifted the function of authorization and naming to the edge. So we don’t need to deploy firewalls and load balancers in the middle of our network anymore. That sort of east-west traffic path, we’ve pushed all that functionality to the edge of the nodes within the same box, and so we don’t need to deploy a whole bunch of firewalls and load balancers all around our infrastructure. The result of that ends up being a simpler overall network architecture.
The final really important piece is, when we talked about these rules of “Service A is allowed to talk to service B,” or “Service B is not allowed to talk to service D,” what we’re talking about is a logical service: the logical service A to the logical service B. It doesn’t matter if there’s 1, 10, or 1,000 copies of A or 1, 10, or 1,000 copies of B. The rule is the same. And this is very unlike a firewall, where the rules are IP 1 to IP 2. So if I have 50 web servers and I have 5 databases, this same thing might turn into 250 firewall rules (50 × 5), the equivalent of what we would express, at a logical level, as a single rule: web to database.
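The arithmetic behind that comparison is just a cross product of source and destination addresses. The sketch below enumerates it, using made-up IP ranges for the 50 web servers and 5 databases.

```python
from itertools import product

# Hypothetical addresses for 50 web servers and 5 databases.
web_servers = [f"10.0.1.{i}" for i in range(1, 51)]
databases = [f"10.0.2.{i}" for i in range(1, 6)]

# An IP-based firewall needs one rule per (source, destination) pair.
ip_rules = list(product(web_servers, databases))

# A logical ruleset needs one rule, no matter how many copies are running.
logical_rules = [("web", "database")]

print(len(ip_rules), len(logical_rules))
```

Scaling the web tier changes the length of `ip_rules` but leaves `logical_rules` untouched.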
Dynamic registration and discovery
A key property here is, by talking about the ruleset (the intentions, in Consul) in a logical way rather than a physical way with IPs, the scale of management is potentially hundreds or thousands of times less complex. This forms the crawl, walk, run journey for Consul. As we map it back to these challenges, it fits really well within this east-west path that is going to be highly dynamic. In the crawl phase, as we add a bunch of services, how do we make registration and discovery more dynamic? We don’t hardcode IP addresses. We don’t need to worry about static things breaking as we’re moving very quickly.
As we move to the walk phase, these components that we need to update, these networking middleware pieces, we can drive all of those now against a common registry. So the common registry ends up being the driver against all of these different pieces. And at the run scale, we can actually simplify this east-west path and push these functions out of the way.
There’s still a place for load balancers and firewalls in this north-south path, but we can simplify the east-west flow and remove a bunch of the middleware, which either imposes a performance penalty, because our traffic has to flow through it, or imposes a management penalty, because we have to go manage, potentially, hundreds of these devices. So this forms the ultimate crawl, walk, run journey for Consul.
I hope you found this video useful. I’d recommend going to hashicorp.com and either going to our Resources page, where you can find a lot more in terms of best practices and guidance and some of the challenges around microservices networking, or the Consul product page to learn more about how Consul can help with network automation.