We are excited to release HashiCorp Consul 0.8. Consul is a critical infrastructure service that organizations rely on for service discovery, key/value storage, and health checks.
The focus for Consul 0.8 is operational stability with Autopilot features and federation improvements. Consul is highly available with three or more servers in a cluster, but it is complex to operate amidst failures, newly introduced servers, and upgrades. The goal of Autopilot is to automate Consul cluster operations so that it can be run in an auto-scaling group or scheduler without worry. Consul Enterprise has enhanced Autopilot functionality to automate the complete server cluster upgrade process and safely increase cluster size for read scalability.
Federation improvements aim to simplify joining and operating Consul clusters across global infrastructure. With this release, Consul open source gains join flooding and soft-fail features to ensure that clusters are properly connected across datacenters. Consul Enterprise introduces network areas to enable advanced networking topologies such as hub-and-spoke.
Read on to learn more about the features in this release. You can also read the 0.8 changelog for details on complete ACL support, a more uniform CLI interface, and much more.
Introducing Consul Autopilot
Autopilot simplifies Consul operations by automating complex workflows, such as:
- Dead Server Cleanup: Remove a server that died unexpectedly and was replaced with a new server
- Server Health Checking: Determine the health of the Consul cluster
- Stable Server Introduction: Safely add new servers and remove old servers
Consul Enterprise has features for companies seeking higher reliability and additional operational simplicity:
- Server Read Scaling: Non-voting servers add more read scaling for stale queries
- Redundancy Zones: Promote hot standby servers in situations where it's hard to have all active servers (e.g., running more than 3 servers with only 3 availability zones)
- Upgrade Migrations: Orchestrate in-place Consul upgrades
Future releases of Consul will add more Autopilot functionality to support new kinds of automation for common operational tasks.
Autopilot Safety Features
Autopilot safety features describe the set of Autopilot functionality that protects users and operators from putting Consul into an unavailable state. Consul 0.8 introduces non-voting servers, stable server introduction, and dead server cleanup as initial safety features.
At the core of Autopilot is the notion of non-voting servers, which join the Consul server cluster but do not participate in leader election voting. This functionality allows operators to increase the size of Consul clusters without affecting stability.
Stable Server Introduction only allows healthy servers to join a cluster, and only if the new server will create an odd number of servers in the quorum of voting servers. For example, a 4th server would stay a non-voter until a 5th server is added, and then they will both be promoted to voters together once they are healthy and stable.
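The odd-voter rule in that example can be modeled in a few lines. This is a minimal sketch, not Consul's implementation: the function name `promote_stable` is hypothetical, and it assumes each non-voter counted has already been judged healthy and stable.

```python
from typing import Tuple


def promote_stable(voters: int, healthy_nonvoters: int) -> Tuple[int, int]:
    """Return (voters, nonvoters) after promoting as many healthy
    non-voters as possible while keeping the voter count odd.

    Hypothetical model of the rule described above, not Consul code.
    """
    total = voters + healthy_nonvoters
    # Largest odd number that does not exceed the number of candidates.
    target = total if total % 2 == 1 else total - 1
    # This model only promotes; it never reduces the current voter count.
    target = max(target, voters)
    promoted = target - voters
    return voters + promoted, healthy_nonvoters - promoted


# A 4th server stays a non-voter; adding a 5th promotes both together.
print(promote_stable(3, 1))
print(promote_stable(3, 2))
```

With three voters, one extra server leaves the voter count at three, while two extra servers bring it to five.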
Dead Server Cleanup automatically removes failed, unhealthy servers from a cluster, so operators no longer need to force-leave the failed node or wait 72 hours for the cluster to reap it. Together, Stable Server Introduction and Dead Server Cleanup make operating and upgrading Consul clusters an easier process.
Autopilot enhances the upgrade process for both open source and Enterprise users. Open source operators should provision one new server at a time. Once the new server is healthy and has joined the cluster, destroy an old server. Repeat this process until every server in the cluster runs the new version of Consul.
Enhanced Autopilot (Enterprise)
Enhanced Autopilot adds convenience features to automate tedious tasks and improve scalability. Consul 0.8 introduces automatic upgrades and redundancy zones as initial Enhanced Autopilot functionality.
The upgrade pattern for Consul Enterprise users is to deploy a complete cluster of new servers and then wait for the upgrade to complete. As the new servers join the cluster, the server introduction logic checks the version of each Consul server. If that version is higher than the version running on the current set of voters, Autopilot holds off on promoting the new servers to voters until the number of new servers matches the number of old servers. Autopilot then begins to promote the new servers and demote the old ones.
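The readiness check in that upgrade flow can be sketched as follows. The function names are hypothetical and this is not Consul's code. It assumes plain dotted numeric versions such as "0.8.0", and it reads "matches" as "at least as many", which is an interpretation rather than something the release notes state.

```python
from typing import List, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a dotted numeric version like '0.8.0' into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))


def migration_ready(voter_versions: List[str], nonvoter_versions: List[str]) -> bool:
    """Return True once the servers running a newer version are at least
    as numerous as the current voters (hypothetical model)."""
    if not voter_versions:
        return False
    newest_voter = max(parse_version(v) for v in voter_versions)
    newer = [v for v in nonvoter_versions if parse_version(v) > newest_voter]
    return len(newer) >= len(voter_versions)


old_voters = ["0.7.5", "0.7.5", "0.7.5"]
print(migration_ready(old_voters, ["0.8.0", "0.8.0"]))
print(migration_ready(old_voters, ["0.8.0", "0.8.0", "0.8.0"]))
```

In this model, two new servers are not enough to start replacing three old voters; the third new server tips the check to true.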
Redundancy zones make it possible to have more servers than availability zones. For example, in an environment with three availability zones it's now possible to run one voter and one non-voter in each availability zone, for a total of six servers. If an availability zone is completely lost, only one voter will be lost, so the cluster remains available. If a voter is lost in an availability zone, Autopilot will promote the non-voter to voter automatically, putting the hot standby server into service quickly.
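The standby promotion described above can be modeled with a small data structure. This is an illustrative sketch only: the zone layout, server names, and `promote_standbys` function are hypothetical and do not reflect how Consul stores or processes this state.

```python
from typing import Dict, List


def promote_standbys(zones: Dict[str, Dict[str, List[str]]]) -> List[str]:
    """For every zone that has lost its voter, promote one non-voter.

    Mutates `zones` in place and returns the names of promoted servers.
    """
    promoted = []
    for name in sorted(zones):
        zone = zones[name]
        if not zone["voters"] and zone["nonvoters"]:
            server = zone["nonvoters"].pop(0)
            zone["voters"].append(server)
            promoted.append(server)
    return promoted


# Three availability zones, one voter and one hot standby in each.
zones = {
    "az1": {"voters": ["s1"], "nonvoters": ["s2"]},
    "az2": {"voters": ["s3"], "nonvoters": ["s4"]},
    "az3": {"voters": ["s5"], "nonvoters": ["s6"]},
}
zones["az1"]["voters"].remove("s1")  # the voter in az1 fails
print(promote_standbys(zones))
```

If an entire zone is lost, both of its lists are empty, so nothing is promoted there and the remaining two voters still form a majority of three.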
Autopilot is managed via a new HTTP endpoint and CLI command:
$ consul operator autopilot get-config
CleanupDeadServers = true
LastContactThreshold = 200ms
MaxTrailingLogs = 250
ServerStabilizationTime = 10s
RedundancyZoneTag = ""
DisableUpgradeMigration = false

$ consul operator autopilot set-config -cleanup-dead-servers=false
Configuration updated!
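For scripting around the CLI, the `get-config` output is simple enough to parse. The helper below is a hypothetical sketch that assumes the command prints one `Key = value` pair per line, as in the output above. It converts `true`/`false` to booleans and leaves everything else as a string.

```python
def parse_autopilot_config(output: str) -> dict:
    """Parse 'Key = value' lines into a dict (hypothetical helper)."""
    config = {}
    for line in output.splitlines():
        if "=" not in line:
            continue  # skip blank or unrelated lines
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip()
        if value in ("true", "false"):
            config[key] = value == "true"
        else:
            # Durations and numbers stay as strings; quotes are dropped.
            config[key] = value.strip('"')
    return config


SAMPLE = """CleanupDeadServers = true
LastContactThreshold = 200ms
MaxTrailingLogs = 250
ServerStabilizationTime = 10s
RedundancyZoneTag = ""
DisableUpgradeMigration = false"""

config = parse_autopilot_config(SAMPLE)
print(config["CleanupDeadServers"], config["LastContactThreshold"])
```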
For more information on Autopilot, please see the Autopilot Guide.
Join Flooding and Soft-fail
Consul has always had support for multiple datacenters, and Consul 0.8 builds upon this with new federation features that allow operators to more easily manage complex federations of Consul clusters across multiple datacenters.
Join flooding uses information about the Consul servers within each datacenter to ensure that all servers are properly joined onto the WAN, or to any network areas which are discussed below. Prior to join flooding, it was easy to forget to join a new server onto the WAN, which would mean that not all servers would be able to properly route requests to remote Consul datacenters. Now as long as at least one server is joined, all others will join the WAN automatically too.
Soft-fail takes the same approach to server connection management added in Consul 0.7.0 for clients connecting to Consul servers within a datacenter and applies it to server-to-server communication with remote Consul datacenters. If there are issues between some subset of remote datacenters, Consul will still route requests to a datacenter as long as requests to it are still flowing. Prior to soft-fail, any pair of datacenters having connectivity problems could cause all Consul servers to stop sending requests to those datacenters. Soft-fail makes large federations of Consul datacenters much more robust to connectivity issues.
Network Areas for Advanced Topologies (Enterprise)
Consul Enterprise adds a completely new network area mechanism for federating Consul datacenters that allows for new topologies, like hub-and-spoke networks. Consul's existing WAN support is built upon the same gossip mechanism that's used inside of Consul's datacenters, which requires all participating servers to be in a fully connected mesh with an open gossip port (8302/tcp and 8302/udp) in addition to the server RPC port (8300/tcp). In organizations with large numbers of Consul datacenters it becomes difficult to support a fully connected mesh, and it's often desirable to have topologies like hub-and-spoke with central management datacenters and spoke datacenters that can't interact with each other.
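A quick calculation shows why a full mesh gets harder to maintain as the number of datacenters grows. The sketch below counts datacenter-to-datacenter links only, which is a simplification since the WAN mesh is actually formed between individual servers. It also assumes a single hub, and the function names are illustrative.

```python
def full_mesh_links(datacenters: int) -> int:
    """Every datacenter connects to every other one: n * (n - 1) / 2 pairs."""
    return datacenters * (datacenters - 1) // 2


def hub_and_spoke_links(datacenters: int) -> int:
    """One hub connects to each of the remaining datacenters: n - 1 links."""
    return max(datacenters - 1, 0)


for n in (3, 10, 30):
    print(n, full_mesh_links(n), hub_and_spoke_links(n))
```

For ten datacenters this gives 45 pairwise links in a full mesh versus 9 links to a single hub.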
Network areas are created on each side of a pair of Consul datacenters and then are joined together, forming a link between the two datacenters. Once the link is created, Consul agents can make queries to the remote datacenter in service of API requests (KV reads/writes, catalog and health queries, etc.) and even DNS requests for remote resources, just as is possible with Consul's WAN support today. Consul datacenters can participate in any number of network areas, as well as datacenters on the WAN, which eases migration to this new feature.
Here's an example creating an area to join together two Consul datacenters:
(dc1) $ consul operator area create -peer-datacenter=dc2
Created area "cbd364ae-3710-1770-911b-7214e98016c0" with peer datacenter "dc2"!

(dc2) $ consul operator area create -peer-datacenter=dc1
Created area "2aea3145-f1e3-cb1d-a775-67d15ddd89bf" with peer datacenter "dc1"!

(dc1) $ consul operator area join -peer-datacenter=dc2 127.0.0.2
Address    Joined  Error
127.0.0.2  true    (none)

(dc1) $ consul operator area members
Area                                  Node        Address         Status  Build  Protocol  DC   RTT
cbd364ae-3710-1770-911b-7214e98016c0  node-1.dc1  127.0.0.1:8300  alive   0.8.0  2         dc1  0s
cbd364ae-3710-1770-911b-7214e98016c0  node-2.dc2  127.0.0.2:8300  alive   0.8.0  2         dc2  581.649µs

(dc1) $ consul kv put -datacenter=dc2 hello world
Success! Data written to: hello
Unlike the WAN, traffic between network areas is all performed via server RPC (8300/tcp) so it can be secured with just TLS. This is much easier to manage across organizations, and eliminates the need to manage gossip keyrings in addition to TLS. For more information on network areas, please see the Advanced Federation Guide.
There are several important considerations to read about before upgrading to Consul 0.8, and some steps are required post-upgrade to fully enable the new features described in this post. Please see the Upgrade Guide for more details.
As Consul adoption continues to increase, we're committed to making Consul easier to operate and implement in complex environments. The Autopilot and federation features in open source make it safer to operate Consul clusters in one or many datacenters. Download Consul 0.8 to test out these new features!
For Consul Enterprise, the Autopilot and federation features are aimed at enterprises with increased scalability and complexity requirements. Enhanced Autopilot features automate the upgrade process and increase read scalability, and the federation improvements allow enterprises to adopt Consul in more advanced networking topologies. Contact us to trial Consul Enterprise!