HashiCorp Consul 0.7

Sep 14 2016 James Phillips

We are excited to release Consul 0.7, a major update with many new features and improvements. This release focused on making it easier to operate Consul clusters, and built key foundations for continued operational improvements in future releases.

Consul is a modern datacenter runtime that provides service discovery, configuration, and orchestration capabilities in an easy-to-deploy single binary. It is distributed, highly available, and proven to scale to tens of thousands of nodes with services across multiple datacenters.

There are a huge number of features, bug fixes, and improvements in Consul 0.7. Here are some of the highlights:

You can download Consul 0.7 here

Key/Value Store Transactions

Consul 0.7 introduces support for key/value store transactions, which allow multiple operations to be grouped into a single atomic request. Here's an example response:

{
  "Results": [
    {
      "KV": {
        "LockIndex": 1,
        "Key": "/app/lock",
        "Flags": 0,
        "Value": null,
        "Session": "119a5e6d-4f67-db86-4a24-7ba515807fcf",
        "CreateIndex": 25,
        "ModifyIndex": 25
      }
    },
    ...
  ],
  "Errors": null
}

Note that since values are being sent via JSON, they are always Base64 encoded.
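To illustrate, here's a sketch in Python of building a transaction payload for the new endpoint (the keys, values, and index shown are illustrative):

```python
import base64
import json

def kv_set(key, value):
    # Values must be Base64 encoded since they travel inside JSON.
    return {"KV": {"Verb": "set", "Key": key,
                   "Value": base64.b64encode(value.encode()).decode()}}

def kv_cas(key, value, index):
    # Check-and-set: applies only if the key's ModifyIndex still equals index.
    return {"KV": {"Verb": "cas", "Key": key,
                   "Value": base64.b64encode(value.encode()).decode(),
                   "Index": index}}

# The list of operations is PUT to /v1/txn and applied atomically.
payload = json.dumps([kv_set("app/config", "hello"),
                      kv_cas("app/version", "2", 25)])
print(payload)
```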

The Results list (truncated in the example) has the outcome of each individual operation, and these also include the resulting index information for use in future check-and-set operations. The index feedback is useful even in operations involving a single key, since this previously had to be re-queried after an operation. If problems occur, a structured Error list helps map errors to the specific operation that failed.

We've intentionally given the transaction and result lists a flexible structure, wrapping each operation in a KV member, in order to leave transactions open to other types of operations in future versions of Consul, such as modifying tags on a service in the catalog after obtaining a lock.

Consul Operator Improvements

Consul 0.7 contains several important improvements to make it easier for operators running Consul clusters.

This version of Consul upgrades to "stage one" of the v2 HashiCorp Raft library. The new library offers improved handling of cluster membership changes and recovery after a loss of quorum. It also provides a foundation for new features that will appear in future Consul versions once the migration to the full v2 library is complete. In particular, Consul will be able to support non-voting servers that can be on standby without affecting the Raft quorum size. This will make it simpler for Consul itself to orchestrate server replacement after a failure, and to provision servers across availability zones with one voter and one non-voter in each zone.

In addition to the new Raft library, Consul's default Raft timing is now set to work more reliably on lower-performance servers, which allows small clusters to use lower cost compute at the expense of reduced performance for failed leader detection and leader elections. A new Server Performance Guide provides details on server performance requirements, and guidance on tuning the Raft timing.
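The timing can be adjusted through the performance block introduced in this release; a configuration sketch (5 is the conservative default in 0.7, and lower values trade more server load for faster leader failure detection):

```json
{
  "performance": {
    "raft_multiplier": 5
  }
}
```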

Finally, a new consul operator command, HTTP endpoint, and associated ACL now allow Consul operators to view and update the Raft configuration. A stale server can be removed from the Raft peers without requiring downtime. This is also a good foundation for future Consul operator tools, and command and HTTP endpoint parity enable interactive use and automation for these operations. The operator ACL with separate read and write controls also allows for delegation of diagnosis and repair privileges.

Here's an example consul operator command output for the raft subcommand, viewing important Raft information:

$ consul operator raft -list-peers
Node   ID   Address   State     Voter
alice                 follower  true
bob                   leader    true
carol                 follower  true

See Consul Operator Command and Consul Operator Endpoint for details, as well as the updated Outage Recovery Guide.
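Complementing -list-peers, a stale server can be dropped from the Raft configuration with the -remove-peer subcommand (the address shown is illustrative):

```shell
$ consul operator raft -remove-peer -address="10.0.1.8:8300"
```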


Lifeguard

We've developed novel extensions to the underlying gossip protocol to make failure detection more robust. This results in less node health flapping in environments with unstable network or CPU performance. We call this feature Lifeguard.

The underlying failure detection is built on top of Serf, which in turn is based on SWIM. SWIM assumes that the local node is healthy, in the sense that soft real-time processing of packets is possible. However, when the local node is experiencing CPU or network exhaustion, this assumption can be violated. The result is that the serfHealth check status can occasionally flap, triggering false monitoring alarms, adding noise to telemetry, and causing the cluster as a whole to waste CPU and network resources diagnosing a failure that may not truly exist.

Lifeguard completely resolves this issue with novel enhancements to SWIM.

The first extension introduces a "nack" message to probe queries. If the probing node notices it is missing "nack" messages, it becomes aware that it may be degraded and slows down its failure detector. As nack messages begin arriving again, the failure detector is sped back up.
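A rough sketch of this self-awareness idea (illustrative only, not Consul's actual implementation): the node maintains a local health score that scales its probe timeout, so a degraded node gives its peers more time to respond:

```python
class LocalHealth:
    """Sketch of Lifeguard's local-health idea (illustrative, not Consul's code)."""

    def __init__(self, max_score=8):
        self.score = 0          # 0 means the local node looks healthy
        self.max_score = max_score

    def on_missed_nack(self):
        # Missing nacks suggests we are too slow at processing packets.
        self.score = min(self.max_score, self.score + 1)

    def on_probe_success(self):
        # Successful probes gradually speed the failure detector back up.
        self.score = max(0, self.score - 1)

    def probe_timeout(self, base_timeout):
        # A degraded node relaxes its failure detector proportionally.
        return base_timeout * (self.score + 1)

health = LocalHealth()
health.on_missed_nack()
health.on_missed_nack()
print(health.probe_timeout(1.0))
```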

The second change introduces a dynamically changing suspicion timeout before declaring another node as failed. The probing node will initially start with a very long suspicion timeout. As other nodes in the cluster confirm a node is suspect, the timer accelerates. During normal operations the detection time is actually the same as in previous versions of Consul. However, if a node is degraded and doesn't get confirmations, there is a long timeout which allows the suspected node to refute its status and remain healthy.
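The shrinking timeout can be sketched as a function of independent confirmations (an illustrative formula capturing the idea, not Consul's exact code): with no confirmations the suspected node gets the full window to refute, and with full confirmation the timeout matches the old fixed value.

```python
import math

def suspicion_timeout(confirmations, expected, min_s, max_s):
    # Starts at max_s with no confirmations; as independent peers confirm
    # the suspicion, the timeout shrinks logarithmically toward min_s.
    frac = math.log(confirmations + 1) / math.log(expected + 1)
    return max(min_s, max_s - (max_s - min_s) * frac)

print(suspicion_timeout(0, 3, 5.0, 30.0))  # no confirmations: full window
print(suspicion_timeout(3, 3, 5.0, 30.0))  # fully confirmed: fast detection
```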

Lifeguard makes Consul 0.7 much more robust to degraded nodes, while keeping failure detection performance unchanged. There is no additional configuration for Lifeguard; it tunes itself automatically.

Lifeguard is the first result of HashiCorp Research to see production application, and we'll be publishing a paper and submitting it to academic conferences next year.

ACL Replication

Consul's ACL system requires a single datacenter to be designated as the authoritative ACL datacenter, and requests to modify ACLs or retrieve a policy for a token are always forwarded to this datacenter. There are existing configuration controls to allow a local datacenter's cache to be extended in the event of a partition, but any tokens not in the cache cannot be resolved locally. In addition, it's tricky to create a backup ACL datacenter or migrate the ACL datacenter.

Consul 0.7 adds a built-in ACL replication capability that's easy to configure using a single new acl_replication_token parameter. Once this is set on the servers in a datacenter, they will begin an automatic replication process that copies the full set of ACLs in the ACL datacenter to the local datacenter. In the event of a partition, if the ACL down policy is set to "extend-cache", the locally replicated ACLs can be used to resolve tokens. This also provides a simple tool to move the full set of ACLs from one datacenter to another.
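A configuration sketch for servers in a secondary datacenter (datacenter names are illustrative and the token is a placeholder):

```json
{
  "datacenter": "dc2",
  "acl_datacenter": "dc1",
  "acl_down_policy": "extend-cache",
  "acl_replication_token": "<replication-token>"
}
```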

There's a new API that allows operators to monitor the status of the replication process:

$ curl localhost:8500/v1/acl/replication
{
  "Enabled": true,
  "Running": true,
  "SourceDatacenter": "dc1",
  "ReplicatedIndex": 5,
  "LastSuccess": "2016-09-13T06:31:50Z",
  "LastError": "0001-01-01T00:00:00Z"
}

This makes it much easier to configure Consul in a way that's robust to extended partitions from the ACL datacenter.

Upgrade Details

Consul 0.7 is a major release with some changes that may require attention when upgrading; please review the changelog and the upgrade documentation before upgrading existing clusters.
