Ensuring Cross-Cloud High Availability for Your Applications

This session features a deep dive into using HashiCorp Consul Connect in multiple datacenters.

Consul has features that ensure your application will remain available even if an entire data center goes offline. In this talk, you'll learn about using Consul's:

- WAN gossip pool
- Prepared queries
- Failover policies


  • Bryan Krausen, Senior Solutions Architect, AHEAD


Today's session is called "Ensuring Cross-Cloud High Availability for Your Applications." That's a long title for: We're going to talk about Consul and prepared queries and failover policies and stuff like that.

Let's get started with a Gartner quote: The average cost of IT downtime is $5,600 a minute. If you do the math, that's over $300,000 an hour. That could be not just lost revenue for a company, but also reputational costs and the cost of all your engineers and admins trying to fix the problems.

Downtime is bad

But the exact dollar amount really doesn't matter for this presentation. The point is that we all agree that downtime is bad.

We all want to make sure that our applications are always highly available. Let's look at applications running in a single cloud. A lot of people have their traditional datacenter. They have maybe some legacy hardware, legacy applications. And if you're running on these things, I'm sorry I called your stuff legacy. But most people are moving over to public cloud platforms: AWS, Azure, GCP, etc.

A lot of times we're moving from these legacy architectures over to more modern frameworks. We're moving into things like microservices, Kubernetes containers. It allows us to move quickly. The problem is that, in this case, we're still in a single cloud. And what happens when services in the single cloud start to go down?

It may take hours or days to recover, especially if you're in a physical datacenter. You maybe have to order new hardware. In the cloud, maybe you have to rebuild things like a VPC. A lot of times it's infrastructure as code, so that part doesn't matter, but you've still got to get your data from one side to the other in order to get your applications up and running.

If we take into consideration that previous quote, the average cost is usually pretty high for running in a single cloud.

When we move to applications running in multiple clouds, that's better, right? As smart administrators, engineers, and architects, we're now deploying our applications across multiple clouds.

The problem is, if we have a fire, not necessarily a literal fire, but if something happens in one cloud, how do we migrate all of our data to the other cloud?

This doesn't have to be from AWS to Azure, or Azure to GCP. It could be within AWS. It could be multi-regions or something like that. But it's still going to take us probably hours to recover. We may have a bulk of data that we need to move from us-east-1 to us-east-2, or something like that.

It's not necessarily going to cost us as much money as if we are in a single datacenter, but it's still going to cost us an average amount of downtime.

So again, reputational costs, application downtime, maybe that application is a revenue-generating application. It's still bad.

Now we move up so we have applications running in multiple clouds, but this time behind Consul—because this is the Consul talk. In our case, we have 2 clouds. There's a cluster in each cloud, and they're federated. Maybe our frontend is talking to our backend.

What happens when our backend dies? The application goes out and reaches to Consul, and it automatically connects to the surviving service, and now the application never goes down. In this case, there's not really any need to recover, and the cost is very minimal. It could be just the cost of Consul or the costs of running your applications up in another cloud provider.

How do we get all that straightened out with Consul?

Quick introduction: I'm Bryan Krausen. I'm a senior solutions architect at AHEAD. AHEAD just won partner of the year yesterday. It was pretty awesome. We've got some partner certs, working with the HashiCorp team. Started a newsletter.

I have a Vault training course coming out, and I've participated in a lot of HashiCorp user groups and Hashi talks this year.

Why use Consul?

This is not a sales pitch, but it plays into what we're talking about here.

Consul has the ability to register any application. It doesn't matter if you're using Go or JavaScript or whatever. All those services can register themselves with Consul and be part of the service registry there.

It doesn't matter what platform we're running on. We can be running on Kubernetes, VMware, Pivotal. And it doesn't matter where the applications are hosted. They could be in Azure, AWS, or on-premises. Consul has the ability to take all those services no matter where they're running and provide connectivity between them all.

Let's talk about what Consul looks like in a single cloud. We have our Consul cluster, and we have our hosts and services, so it does service registration.

In the Consul cluster, we have a registration for a new service. It doesn't matter what it is, let's say MongoDB. We've got service registrations, and as the services are registered with Consul, Consul is doing health checking. Consul understands which services are healthy, which services are not, and where to send traffic.
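For readers who want to see what such a registration might carry, here is a minimal sketch that builds a request body in the shape accepted by Consul's agent HTTP API (`PUT /v1/agent/service/register`). The port, the TCP check target, and the check interval are illustrative assumptions, not values from the talk.

```python
import json


def build_registration(name, port, check_interval="10s"):
    """Build a body for PUT /v1/agent/service/register.

    The TCP check target and the interval are illustrative assumptions.
    """
    return {
        "Name": name,
        "Port": port,
        "Check": {
            # Consul marks the service unhealthy if this TCP connect fails.
            "TCP": f"localhost:{port}",
            "Interval": check_interval,
        },
    }


payload = build_registration("mongodb", 27017)
print(json.dumps(payload, indent=2))
```

The health check travels with the registration, which is what lets Consul decide later which instances should receive traffic.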

When we want connectivity between our frontend and backend, maybe our frontend does either a DNS query or an API query to Consul to figure out: Where's my healthy backend? We get a response from Consul, and then we make a direct connection from our frontend to our backend. Pretty simple.
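The API side of that lookup can be sketched roughly as follows. The function parses a response in the shape returned by Consul's health endpoint (`GET /v1/health/service/<name>`) and keeps only instances whose checks are all passing. The sample data is hand-written for illustration and the addresses are hypothetical.

```python
def healthy_endpoints(entries):
    """Extract (address, port) pairs from a health-endpoint response.

    Keeps only instances whose checks are all passing. Falls back to the
    node address when the service did not register its own address.
    """
    endpoints = []
    for entry in entries:
        checks = entry.get("Checks", [])
        if all(check.get("Status") == "passing" for check in checks):
            service = entry["Service"]
            address = service.get("Address") or entry["Node"]["Address"]
            endpoints.append((address, service["Port"]))
    return endpoints


# Hand-written sample in the shape of the endpoint's JSON (not real data).
sample = [
    {"Node": {"Address": "10.0.0.11"},
     "Service": {"Address": "", "Port": 8080},
     "Checks": [{"Status": "passing"}]},
    {"Node": {"Address": "10.0.0.12"},
     "Service": {"Address": "", "Port": 8080},
     "Checks": [{"Status": "critical"}]},
]
backends = healthy_endpoints(sample)
print(backends)
```

The frontend would then open its connection directly to one of the returned endpoints; Consul itself is not in the data path.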

Consul and the multi-cloud model

That's basically how Consul works. What about in multi-clouds?

We have our service registration over in Azure. Again, it could be AWS over here and AWS over here, different regions; it doesn't really matter. I just put the icons on the diagram to show you that they were different.

Over in AWS, we may have the same services that are registered. Notice that the services in each respective cloud will register themselves with the Consul cluster that's in that cloud. We're not doing any cross-cloud. We don't have anything in Azure being registered in AWS, and vice versa.

That's how service registration will work with multi-clouds.

Accessing applications in the cloud

We have a single cloud in this diagram.

We have our user here, and we have our Consul cluster. In the middle we have 3 services that have registered themselves with Consul. And we have our user down here. She makes a request to Consul for the application. In this example we're using the HashiCorp website. All is good. It's probably a DNS query. Maybe this user's coming from a desktop.

So what happens when services start failing? We have a service on the left that failed. But if we refresh our page, the application is still running because Consul is acting almost like a traditional load balancer in a sense, where Consul knows that that service is no longer healthy. You notice we tore down the line between Consul and the app there.

So we're no longer going to send traffic over there. Same thing with the middle one. That goes down. We're still up because we have one surviving service that's providing the service for this user. But if the third one dies, then we have a problem. Now our user's sad.

So that's how it works in a single cloud. If we don't have additional connectivity, maybe to a secondary cloud or a DR site or something like that, that we can fail over to, then our users are not going to be able to access our application.

What does it look like in multi-cloud? Again, we have our Consul cluster. Over on the left, we have Azure. And we have the same thing over in AWS on the right. Again, our user makes the requests; we get our application.

As the services start going down, we're still good. But after this third one, the user doesn't have access to the application. The service is down. The problem is that we have all these surviving services over in AWS. How does the user access those applications that have survived in AWS?

And at this point we don't have federation. We need to federate our Consul clusters. And so we enable communication between them, and our users access services regardless of which cloud they are being hosted in.

Federation of Consul clusters

What does it take to federate a Consul cluster? Federation is loose coupling of the datacenters. What that means is we can have 2 datacenters or 2 clusters, and they can do their own thing.

Each has its own cluster to ensure availability of services. But they're also communicating with each other. And because they're not tightly coupled, if one side goes down, it doesn't necessarily impact the other one. So we can still have surviving datacenters if one happens to go down.

Federation uses a Consul WAN gossip pool. Consul has 2 different types of pools: the local LAN pool and the WAN gossip pool. All of the servers, no matter what datacenter they live in, participate in the WAN gossip pool. When we federate, we require connectivity between our clouds.

Obviously, the Consul clusters have to have network communication between the 2. I have an asterisk there because with Consul 1.6, we have mesh gateways, which can enable that communication for you. So if you don't have that communication between Azure and AWS, or on-prem, etc., you can use the mesh gateways and still do this.

The last thing I want to point out is, federation is not replication. We're federating, we're communicating between the clusters, but we're not replicating any data between the clusters. If you have a KV store over here and a KV store over here, we're not replicating any of that data with federation, nor are we syncing it between the 2.

We're not replicating anything; we're just enabling communication.

What does this look like? We've got our 2 clouds. We've got our services that are registered in each separate cloud. And then we have, as I mentioned before, a LAN gossip pool that lives on both sides, AWS and Azure. That's all the clients and all the servers communicating to ensure uptime for the services and who's a part of the cluster itself.

Now we have the WAN gossip pool, which enables that communication between all clusters that are federated together.
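To make the difference between the two pools concrete, here is a small toy model. It is not Consul code; the agent names are hypothetical. It only illustrates the membership rule: every agent in a datacenter joins that datacenter's LAN pool, while only the servers from each federated datacenter join the shared WAN pool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Agent:
    name: str
    datacenter: str
    is_server: bool


def lan_pool(agents, datacenter):
    """Every agent (client or server) in a single datacenter."""
    return {a.name for a in agents if a.datacenter == datacenter}


def wan_pool(agents):
    """Only servers, across all federated datacenters."""
    return {a.name for a in agents if a.is_server}


# Hypothetical agents in two federated datacenters.
agents = [
    Agent("azure-server-1", "azure", True),
    Agent("azure-client-1", "azure", False),
    Agent("aws-server-1", "aws", True),
    Agent("aws-client-1", "aws", False),
]
print(sorted(lan_pool(agents, "azure")))
print(sorted(wan_pool(agents)))
```

Clients never appear in the WAN pool, which keeps cross-cloud gossip traffic limited to the handful of servers in each cluster.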

What does it look like from an access perspective? In this case, we have our user, we're accessing

In this case, my datacenters are called Azure and AWS. If you have a user that's accessing everything in Azure, etc., you don't have to add the datacenter because that's like a default.

The user makes the request to Consul: "I'm trying to connect to an application." Consul responds: "Go here." The user makes the direct connectivity to the application. But what happens if the application goes down? Something happens in Azure, and all your services go down. We can give the user at this point a different URL.

Before, remember, we had Azure at the bottom; now we have AWS. We can still talk to the same Consul cluster, assuming that the Consul cluster is still up if your services die. We can make the same request to our local Consul cluster. And we're saying, "I want"

Consul is going to forward that to the AWS cluster, make that request to say, "Where is this service?" We're going to get a response to our user, and the user's going to make a direct connection to our application that lives in AWS.
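As a rough illustration of how the datacenter shows up in the name being requested, the helper below builds Consul DNS names in the standard `<service>.service[.<datacenter>].<domain>` form. The service name `web` is a placeholder and is not the name used in the talk.

```python
def service_dns_name(service, datacenter=None, domain="consul"):
    """Build a Consul service lookup name.

    Omitting the datacenter means the local datacenter of the Consul
    agent that answers the query.
    """
    parts = [service, "service"]
    if datacenter:
        parts.append(datacenter)
    parts.append(domain)
    return ".".join(parts)


# "web" is a hypothetical service name used only for illustration.
local_name = service_dns_name("web")
remote_name = service_dns_name("web", datacenter="aws")
print(local_name)
print(remote_name)
```

The second name is what a client would have to switch to after a local failure, which is exactly the inconvenience that prepared queries remove.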

That's how federation works. Notice that what we had to do is change the URL that we give the user.

Prepared queries

And we don't necessarily want to do that. That's kind of inconvenient for the user. It's not really user-friendly. How do we ensure that DNS doesn't change for our users, but we still get that failover between the 2? We introduce prepared queries.

Prepared queries are objects that are defined at the datacenter level. We'd have prepared queries in Azure and in AWS. They are used to filter the results of service requests.

We'll go into an example in a second.

The prepared queries are usually invoked by applications. When you have a frontend talking to a mid-tier or something like that, or microservices trying to communicate to each other, they'll reach out using DNS. And you have the prepared query set up, and that's how they reach the desired services that you want.

They can also be used by humans, as we saw in our example.

If we're trying to go to our website, we're going to have our URL for Consul. A lot of times we're going to use something like a CNAME. It's more of a friendly URL for the user. But humans can invoke those and retrieve the correct result.

Let's look at a quick example. This is a simple prepared query. The name of it is hashiconf-app. The service is conference-app. We've identified a tag that we want our user to hit, 9.10, which is today's date. Because we're creating a prepared query, instead of hashiconf-app.service.consul at the end, now we have hashiconf-app.query.consul at the end.
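A query like this one could be expressed as a request body for Consul's prepared query API (`POST /v1/query`). The sketch below only builds that body and the resulting DNS name from the values mentioned in the talk; sending it to a real cluster is left out.

```python
import json


def build_prepared_query(name, service, tags):
    """Body for POST /v1/query: route `name` to instances of `service`
    that carry all of the given tags."""
    return {
        "Name": name,
        "Service": {
            "Service": service,
            "Tags": list(tags),
        },
    }


query = build_prepared_query("hashiconf-app", "conference-app", ["9.10"])
dns_name = f"{query['Name']}.query.consul"
print(json.dumps(query, indent=2))
print(dns_name)
```

Clients look up the query by its name, so the service name and tags behind it can change without clients noticing.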

We'll have our Consul cluster over here, and we've got, say, 3 services that are registered on the right. These are all instances of the conference app. When they registered themselves, they had service tags of 9.10, 9.11, and 9.12. The idea was 9.10 is today, so that'd be Day 1. Maybe HashiCorp says, "If people are hitting the conference app, I want to show the schedule for Day 1." Day 2 would be 9.11, and maybe Day 3 would be feedback or something like that for the conference.

But we always want our users to hit this hashiconf-app URL. Our user makes the request and she gets directed to the one that has a tag of 9.10, because that's our prepared query. If we change that—say, "Tomorrow we want people to hit our new service"—all we have to do is change the tag and then change the prepared query.

What didn't change is the URL that the user had to target. We're still saying, "User, go to URL xyz," and on the backend, within Consul, all we have to do is change the prepared query, and now we're redirecting the user to a different service.
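The tag switch can be modeled in a few lines. This is a simplified simulation of the filtering, not Consul's implementation, and the instance addresses are made up.

```python
def resolve(query, instances):
    """Return addresses of instances that carry every tag the query asks for."""
    wanted = set(query["tags"])
    return [i["address"] for i in instances if wanted <= set(i["tags"])]


# Three registered instances of the conference app (addresses are hypothetical).
instances = [
    {"address": "10.0.0.1", "tags": ["9.10"]},
    {"address": "10.0.0.2", "tags": ["9.11"]},
    {"address": "10.0.0.3", "tags": ["9.12"]},
]

query = {"name": "hashiconf-app", "tags": ["9.10"]}
day_one = resolve(query, instances)

# Only the query definition changes; the name users target stays the same.
query["tags"] = ["9.11"]
day_two = resolve(query, instances)

print(query["name"], day_one, day_two)
```

The name `hashiconf-app` never changes between the two lookups, yet the traffic lands on a different instance.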

This can be used when you are maybe upgrading an application. With 1.6, you can even do more stuff, like Layer 7 routing and things like that.

We can maybe send some traffic over to one side, to your newer app versus the version that's stable right now. More a canary situation.

That's the idea behind prepared queries. We want to make sure that the URL stays consistent.

This works great when we're talking about local, but how do we ensure that it's cross-cloud? We want to introduce failover policies to our prepared queries.

Failover policies are an extension of prepared queries: when you're writing a prepared query, you add your failover policies. The failover policies are transparent to our applications. They have no idea what's really going on. They're just receiving traffic from the user, even though the request had originally gone through Consul.

The failover policy will determine the target for the request, meaning the location of where the service is. Not which service within that local datacenter, but, “Am I going to serve this service from the local datacenter, Azure, AWS, Datacenter 1, Datacenter 2?” However we have it set up.

Let's look at the failover policy types. What's available? We have our static policy. That's a fixed list. You would go in there and say, "I want DC 1, DC 2." It's going to follow that order of failover. That's telling Consul, "I'm gonna make the decisions around here."

Then we have our dynamic policy, which uses Consul network coordinates. It will send to the nearest DC, and that's based upon the round trip time from where the client is.

That's really telling Consul, "I'm going to set it up; you manage it for me. You tell the user where the closest service is and where they're going to get the best performance from."

And then we have a hybrid policy, which will use the round trip time first and then will use the alternate DCs that you have. It's kind of a combination of static and dynamic. That's like, "Let's do a little bit of both to ensure both performance and failover."

We want to use our round trip time to ensure that we get the fastest application, the one closest to us. But we also ensure that, if it doesn't matter in regards to speed, we have this kind of backup list here, DC 1, DC 2, so if something is not working, we can failover those DCs as needed.
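In the prepared query API, these three policy types map onto the `Failover` block of the query's `Service` section: `Datacenters` is the fixed list and `NearestN` is the round-trip-time-based option. The sketch below builds all three variants; the datacenter names `dc1` and `dc2` and the value of `NearestN` are illustrative assumptions.

```python
def with_failover(query, nearest_n=0, datacenters=()):
    """Return a copy of a prepared query body with a Failover block attached."""
    service = dict(query["Service"])
    failover = {}
    if nearest_n:
        # Dynamic part: try the N nearest datacenters by round trip time.
        failover["NearestN"] = nearest_n
    if datacenters:
        # Static part: a fixed, ordered list of datacenters to try.
        failover["Datacenters"] = list(datacenters)
    service["Failover"] = failover
    return {**query, "Service": service}


base = {"Name": "hashiconf-app", "Service": {"Service": "conference-app"}}

static = with_failover(base, datacenters=["dc1", "dc2"])
dynamic = with_failover(base, nearest_n=2)
hybrid = with_failover(base, nearest_n=2, datacenters=["dc1", "dc2"])

for policy in (static, dynamic, hybrid):
    print(policy["Service"]["Failover"])
```

The hybrid form simply sets both fields, which matches the description of using round trip time first and the fixed list as a backup.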

A quick example. We have our 2 Consul clusters, Azure and AWS; I just picked 2 clouds. These could be different regions in the same cloud. If you're an AWS shop and you're using us-east-1 and us-west-1 or something like that, that could be the same thing. You could also have a traditional datacenter on the left and AWS on the right. It doesn't really matter. It works the same way.

We have our user, again. Our user wants to access our app again.

The request is made from our frontend app to Consul: "I want to connect to the backend." It establishes connectivity to our backend, and now the user is presented with the application. This is more like thinking about microservices.

We break apart our application and enable communication between our microservices.

However, again, if our database fails, how does the app continue to work?

The frontend again makes a request to Consul, because, remember, in Consul, the default TTL on DNS is going to be 0. Every time we need to reach out to that database, we're going to make another DNS request.

It makes the request to Consul. Consul's like, "This service is no longer available, but this one is." So the application makes the call to the backend, and the application is refreshed, and everything's working. So you didn't really go down.
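The decision Consul makes at that moment can be approximated with a simplified model: try the local datacenter first, then the nearest remote datacenters by round trip time, then any fixed list, and return the first one that still has healthy instances. This is a sketch of the behavior, not Consul's actual code, and the round trip times and addresses are invented.

```python
def failover_order(local_dc, rtt_ms, nearest_n=0, datacenters=()):
    """Order in which datacenters are tried: local, nearest by RTT, fixed list."""
    order = [local_dc]
    remote_by_rtt = sorted((dc for dc in rtt_ms if dc != local_dc), key=rtt_ms.get)
    for dc in remote_by_rtt[:nearest_n]:
        order.append(dc)
    for dc in datacenters:
        if dc not in order:  # skip datacenters already tried
            order.append(dc)
    return order


def first_healthy(order, healthy):
    """Return the first datacenter in `order` with healthy instances."""
    for dc in order:
        if healthy.get(dc):
            return dc, healthy[dc]
    return None, []


# Hypothetical measurements and health state: the Azure database has failed.
rtt_ms = {"azure": 1.0, "aws": 42.0, "onprem": 80.0}
healthy = {"azure": [], "aws": ["10.1.0.5:5432"], "onprem": ["192.168.1.9:5432"]}

order = failover_order("azure", rtt_ms, nearest_n=1, datacenters=("aws", "onprem"))
chosen_dc, endpoints = first_healthy(order, healthy)
print(order, chosen_dc, endpoints)
```

Because the lookup is repeated on every request, the frontend picks up the surviving datacenter on its next call without any change on its side.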

A quick review

Legacy requests would be considered DNS, going into DNS, configuring A records and stuff like that. We can automate that kind of stuff if we're using appliances that support API calls and things like that.

If we're using traditional DNS, like on Microsoft or something, it's much harder to do.

If we want local availability, we can introduce the prepared queries. We can say, "We want to upgrade versions," or, "We want to send folks over to these sets of servers for our application."

And then we have cross-cloud availability, which is the failover policy.

A couple of quick key takeaways here. DNS is pretty simple to do, but it's not really flexible in terms of providing high availability.

Some newer things are. If you're using something like AWS or some other appliances, you can set up failover policies or routing policies within them. But traditional DNS is not really flexible.

Prepared queries are dynamic. So we can ensure that our application is up, and if we have failures, we can migrate users over to the other, surviving services.

But prepared queries are only local. They're not necessarily cross-cloud or cross-datacenter.

Then we have our failover policy. Our failover policy is our ideal solution for multi-cloud apps. And that's it.
