Resiliency and Recovery with Consul Enterprise

We recently hosted a webinar on the importance of resiliency and recovery when using HashiCorp Consul for business-critical applications. In this blog, we’ll revisit some of the topics covered in that webinar and explore how Consul Enterprise helps address the challenges organizations face when migrating applications to the cloud.

»Resiliency: Proactively preventing issues

Oft quoted is the phrase, “the best offense is a good defense.” Typically, this is in reference to sports or strategy, but for this use case, we’re talking about building a strong defense against network failures. To do that, organizations need to take proactive steps to address the inevitable challenges they’ll face when scaling up.

What do we mean by proactive? We mean leveraging features that can reduce the likelihood of unforeseen events causing an outage. Since outages impact the whole organization, it makes sense that we build features into our enterprise offering that address these challenges.

»Automated Upgrades

Upgrading software can be challenging. For enterprises, the upgrade process is typically a large, planned undertaking that requires a sort of “all hands on deck” approach. The reason for this is the danger of incompatibilities between various software instances. While most software is able to work fairly seamlessly across different versions, there is always the chance of a breaking change or a failure resulting from an outdated instance. Rather than having to wait for these broad, one-time upgrade sessions, a better approach would be to enable your tool to conduct its own upgrades and switch over to the newer version once the risk of a failure has been mitigated.

Consul Enterprise handles the upgrade process for organizations by slowly updating the Consul deployment on a server-by-server basis. Once enough servers with the desired version have been added to maintain a quorum, Consul will start demoting outdated servers, which can then easily be removed. This can be an incremental process, and it minimizes the interruptions that upgrading can create.
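To make the sequence above concrete, here is a minimal Python sketch of the swap. It is an illustration only, not Consul's implementation: the function name `upgrade_step`, the server names, and the threshold (at least as many new-version servers as old voters) are assumptions made for this example.

```python
def upgrade_step(old_voters, new_servers):
    """Return (voters, non_voters) after one evaluation of the cluster.

    Assumed policy for this sketch: new-version servers stay non-voting
    until there are at least as many of them as old-version voters, so
    the swap never shrinks the voting set.
    """
    if len(new_servers) < len(old_voters):
        # Not enough new-version servers yet: the old servers keep voting.
        return list(old_voters), list(new_servers)
    # Enough new-version servers: promote them and demote the old ones,
    # which can then be removed from the cluster.
    return list(new_servers), list(old_voters)


old = ["s1-v1", "s2-v1", "s3-v1"]  # hypothetical server names

voters, non_voters = upgrade_step(old, ["s4-v2", "s5-v2"])
print(voters)  # the old servers still vote

voters, non_voters = upgrade_step(old, ["s4-v2", "s5-v2", "s6-v2"])
print(voters)  # the new servers have taken over
```

The point of the sketch is the ordering: voting rights move only after the replacement set is complete, so the cluster is never left with fewer voters than it started with.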

»Enhanced Read Scalability

Upgrading is just a small part of building a resilient platform. While it’s nice to have and enables faster adoption of newer versions, it doesn’t address the challenge of unforeseen events, like sudden spikes in inbound traffic. Organizations need to question whether or not their networks are ready for a sudden 5x or 10x spike in traffic. This could be from unexpected demand or a service issue causing a backlog of requests to get suddenly sent all at once. Regardless of the how, the questions to consider are: what happens when that does occur and is my network prepared?

To solve for this unknown risk, Consul Enterprise offers a feature called enhanced read scalability. Enhanced read scalability enables organizations to build a buffer of additional capacity by deploying additional non-voting servers to a Consul deployment with read capabilities. By not participating in voting, these servers are less resource-intensive but still participate in the data replication process. As more requests come in, these servers can shoulder some of the additional burden. In the event of a voting member crashing or needing to be reset, these servers can be promoted to become voting members so quorum can be maintained.
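One way to see why non-voting servers add read capacity without making consensus more expensive is to look at the quorum size. The Python sketch below is illustrative only; the `Server` class and the server names are assumptions, and the majority rule is the standard one for Raft-style consensus.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    voter: bool


def quorum_size(servers):
    """Majority of voting servers; non-voters are not counted."""
    voters = sum(1 for s in servers if s.voter)
    return voters // 2 + 1


# Three voting servers (hypothetical names).
cluster = [Server(f"voter-{i}", True) for i in range(3)]
before = quorum_size(cluster)

# Add three non-voting servers for extra read capacity.
cluster += [Server(f"replica-{i}", False) for i in range(3)]
after = quorum_size(cluster)

print(before, after)  # the quorum size is unchanged by the non-voters
```

Doubling the number of servers this way leaves the number of votes needed for a write unchanged, which is why these servers can absorb read traffic without slowing down agreement among the voters.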

»Advanced Federation

Read scalability servers are great for running a less operationally intensive deployment of Consul, but that burden returns as the cluster is spread out across more datacenters, regions, or clouds. The larger the cluster, the more servers need to participate in the gossip protocol, and the harder the deployment becomes to manage. Problems can arise from health check latency or from general strain on resources. To alleviate this, Consul Enterprise offers advanced federation.

Advanced federation enables each datacenter to operate independently and only communicate with select others. For instance, assume that an organization is running Consul agents in five datacenters (dc1, dc2, dc3, dc4, dc5). Using advanced federation, operators could stipulate that dc1 is the primary datacenter and that all other datacenters participate in WAN gossip with dc1, but not with each other. This type of architecture is commonly referred to as hub/spoke and is a great way to centralize information while allowing each spoke to operate on its own. Services running in each datacenter still participate in LAN gossip and handle requests, but do not have to replicate data everywhere else.
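The difference between the two topologies for the five datacenters in this example can be counted directly. The Python sketch below is only arithmetic on datacenter pairs; the function names are made up for illustration and nothing here calls Consul.

```python
from itertools import combinations


def full_mesh_links(datacenters):
    """Every datacenter gossips over the WAN with every other one."""
    return list(combinations(datacenters, 2))


def hub_spoke_links(hub, datacenters):
    """Every spoke gossips over the WAN with the hub only."""
    return [(hub, dc) for dc in datacenters if dc != hub]


dcs = ["dc1", "dc2", "dc3", "dc4", "dc5"]

mesh = full_mesh_links(dcs)
spokes = hub_spoke_links("dc1", dcs)

print(len(mesh))    # 10 datacenter pairs in a full mesh
print(len(spokes))  # 4 pairs when dc1 is the hub
```

A full mesh grows as n(n-1)/2 pairs while hub/spoke grows as n-1, so the gap widens quickly as datacenters are added.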

»Network Segments

So what about inside a single datacenter? Advanced federation creates operational efficiencies on a large scale, but not every organization is running multiple datacenters. In another scenario, an enterprise may be running a single datacenter but only need certain services to be able to discover one another. Network segments enable organizations to restrict LAN gossip between client and server agents within a datacenter.

Think of this as a more lightweight version of Advanced Federation, running specifically inside of a datacenter. Again, this helps reduce the operational burden on datacenter resources because the scale of the gossip pool has been reduced. Less information is being shared across the entire cluster which enables operators to allocate more resources to the servers with the largest demand.
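As a rough illustration of how segmenting shrinks each gossip pool, the Python sketch below counts the members of each pool. It assumes that each client agent belongs to exactly one segment while the servers participate in every segment; the segment and agent names are hypothetical.

```python
from collections import Counter


def pool_sizes(client_segments, server_count):
    """Gossip pool size per segment: its clients plus all servers."""
    clients_per_segment = Counter(client_segments.values())
    return {
        segment: clients + server_count
        for segment, clients in clients_per_segment.items()
    }


# Hypothetical client agents mapped to the segment they join.
clients = {
    "web-1": "frontend",
    "web-2": "frontend",
    "db-1": "backend",
}

sizes = pool_sizes(clients, server_count=3)
print(sizes)  # {'frontend': 5, 'backend': 4}

# Without segments, every agent shares one pool.
print(len(clients) + 3)  # 6
```

With only three clients the saving is small, but the same arithmetic applies when each segment holds hundreds of agents.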

»Recovery: Reacting to when things go wrong

Hardening a Consul deployment is a great way to build confidence that it can handle business-critical applications. Taking the proactive approach and improving network resiliency will help head off some of the curveballs that can be thrown at organizations. However, even the best-built networks will experience something unpredictable.

There is always a chance that despite best efforts, a failure occurs somewhere and applications go down. It could be the result of a cloud region outage, a construction company knocking out a fiber optic cable, a misconfiguration that results in data loss, or any number of other issues. To be prepared for these circumstances, Consul Enterprise enables organizations to head off failures due to outages and quickly restore data in case of corruption or misconfigurations.

»Redundancy Zones

It’s a fairly common hypothetical, but a failure and subsequent outage can be extremely challenging for enterprises. If a cloud provider experiences a regional outage, everyone running applications in that region needs to scramble to resolve it and migrate workloads to a new region. With Consul, enterprises can get ahead of that concern through redundancy zones. With redundancy zones, a non-voting member (one of the read scalability servers we mentioned earlier) is designated as a failover that can take over for a voting server in the event of an outage.

In the example in our learn guide, we have six agents running across three regions (three voting and three non-voting). Normally, a three-server deployment can tolerate the loss of only one server: if a second server fails or loses connectivity before the first is replaced, Consul can no longer achieve quorum and the cluster becomes unavailable. With redundancy zones enabled, Consul recognizes the failure and promotes one of the active non-voting servers to participate in the quorum, restoring the cluster's fault tolerance. This reduces the risk of an outage due to circumstances beyond the organization’s control, like a regional outage, and it can also be leveraged in case of internal failures or interruptions in service.
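The promotion behavior can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Consul's autopilot logic: the `Server` class, the zone and server names, and the rule of preferring a standby in the failed voter's own zone are all chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    zone: str
    voter: bool
    alive: bool = True


def handle_failure(servers, failed_name):
    """Mark a server as failed and promote a standby if it was a voter.

    Returns the name of the promoted server, or None if nothing changed.
    """
    failed = next(s for s in servers if s.name == failed_name)
    failed.alive = False
    if not failed.voter:
        return None
    failed.voter = False
    standbys = [s for s in servers if s.alive and not s.voter]
    if not standbys:
        return None
    # Assumed preference: a standby in the same zone first, else any zone.
    standbys.sort(key=lambda s: s.zone != failed.zone)
    standbys[0].voter = True
    return standbys[0].name


# Six servers across three zones: one voter and one standby per zone.
cluster = [
    Server(f"{zone}-{role}", zone, voter=(role == "voter"))
    for zone in ("a", "b", "c")
    for role in ("voter", "standby")
]

promoted = handle_failure(cluster, "a-voter")
live_voters = [s.name for s in cluster if s.alive and s.voter]
print(promoted)     # a-standby
print(live_voters)  # ['a-standby', 'b-voter', 'c-voter']
```

After the failure the cluster is back to three live voters, so it can again tolerate the loss of one more server.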

»Automated Backups

Despite all the precautions taken, there may still come a time when an outage is inevitable. The question at that point is: how close to the most recent stable state can I restore the deployment? Some enterprises may trust engineers to craft a custom solution for capturing backups, but what happens when those individuals move on to the next assignment? A solution that consistently captures information and simplifies restoring it eliminates the need for custom solutions.

Consul’s automated backup capabilities provide that. Consul Enterprise's snapshot agent periodically captures current data, like K/V entries, the service catalog, prepared queries, sessions, and ACLs, and stores it in a remote location, like S3. In the event of data loss, operators can run a restore to pull the latest snapshot. Ideally, this feature would be a last resort, but it’s also useful in the event that Consul needs to be rolled back to a previous state because of some other internal issue.
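For a sense of what a backup and restore cycle involves, the Python sketch below builds the `consul snapshot save` and `consul snapshot restore` commands and picks the newest snapshot file by name. It does not reproduce the Enterprise automation described above. The directory, the file naming scheme, and the helper names are assumptions for this example, and the commands are only constructed, not executed.

```python
from datetime import datetime, timezone
from pathlib import Path


def snapshot_path(directory, now):
    """Timestamped file name, so names sort in chronological order."""
    return Path(directory) / f"consul-{now:%Y%m%dT%H%M%SZ}.snap"


def save_command(path):
    """Arguments for taking a snapshot with the Consul CLI."""
    return ["consul", "snapshot", "save", str(path)]


def restore_command(path):
    """Arguments for restoring a snapshot with the Consul CLI."""
    return ["consul", "snapshot", "restore", str(path)]


def latest_snapshot(paths):
    """Newest snapshot by file name, or None if there are none."""
    return max(paths, key=lambda p: p.name) if paths else None


# Hypothetical backup directory and fixed timestamps for the example.
backups = [
    snapshot_path("/backups", datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc)),
    snapshot_path("/backups", datetime(2024, 1, 2, 12, 30, tzinfo=timezone.utc)),
]

print(save_command(backups[0]))
print(restore_command(latest_snapshot(backups)))
```

A real scheduler would pass these argument lists to something like `subprocess.run` on a timer and copy the resulting files to remote storage; the timestamped names keep "restore the latest" a simple lookup.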

»Conclusion

Let’s go back to the initial quote, “the best offense is a good defense.” In the case of Consul, we feel that quote should read, “the best way to recover from an outage is to implement features to help prevent it.” While these features cannot prevent unforeseen events from occurring, they can position an organization to recover quickly and efficiently with minimal data loss. For more information about Consul Enterprise, please visit our product page.

