Today we are releasing Serf 0.8. Serf is a decentralized solution for cluster membership, failure detection, and orchestration. Serf is in use in some huge deployments (more than 10,000 machines in a single cluster), and powers Consul and Nomad.
This release brings improvements in Serf's gossip protocol which provide better robustness for applications that rely on Serf to detect the health of nodes in a cluster. It also includes some smaller updates and one important bug fix.
Read on to learn more about the gossip protocol improvements in Serf 0.8.
We've developed novel extensions to the underlying gossip protocol that make failure detection more robust. This results in less node health flapping in environments with unstable network or CPU performance. We call this feature Lifeguard.
The underlying failure detection is based on SWIM. SWIM assumes that the local node is healthy, in the sense that soft real-time processing of packets is possible. However, when the local node is experiencing CPU or network exhaustion, this assumption can be violated. The result is that node health statuses can occasionally flap, triggering false monitoring alarms, adding noise to telemetry, and causing the overall cluster to waste CPU and network resources diagnosing a failure that may not truly exist.
Lifeguard completely resolves this issue with novel enhancements to SWIM.
The first extension introduces a "nack" message to probe queries. If the probing node realizes it is missing "nack" messages then it becomes aware that it may be degraded and slows down its failure detector. As nack messages begin arriving, the failure detector is sped back up.
The second change introduces a dynamically changing suspicion timeout before declaring another node as failed. The probing node will initially start with a very long suspicion timeout. As other nodes in the cluster confirm a node is suspect, the timer accelerates. During normal operations the detection time is actually the same as in previous versions of Serf. However, if a node is degraded and doesn't get confirmations, there is a long timeout which allows the suspected node to refute its status and remain healthy.
Lifeguard makes Serf 0.8 much more robust to degraded nodes, while keeping failure detection performance unchanged. Lifeguard requires no additional configuration; it tunes itself automatically.
Lifeguard is the first result from HashiCorp Research to reach production applications, and we'll be publishing a paper and submitting it to various academic conferences next year.
Serf 0.8 has been designed to "upshift" automatically as nodes are upgraded to start taking advantage of the new Lifeguard features. For most configurations, upgrading will just require an agent restart with the new binary.
More details are available on the upgrade process here.
If you experience any issues, please report them on GitHub.