Today we are releasing Serf 0.7. Serf is a decentralized solution for cluster membership, failure detection, and orchestration. Serf is in use in some huge deployments (more than 10,000 machines in a single cluster), and powers Consul and Nomad.
This release brings a major new network tomography subsystem which allows you to build a map of network round trip times for your cluster. It also includes a number of smaller improvements around better handling of misconfigured and misbehaving networks.
Read on to learn more about the major new features in 0.7.
Serf's underlying gossip protocol has all nodes in a cluster perform random probes of other nodes at regular intervals to detect node failures. A nice side effect of these probes is that nodes get a measurement of the network round trip time to a different node each probe period. Serf 0.7 takes advantage of these measurements by feeding them into a network tomography subsystem, based primarily on an algorithm from academic research called Vivaldi.
The Vivaldi algorithm works in a manner that's similar to a physics simulation of a system of nodes connected by springs. Nodes start out bunched up at the origin, but as they learn information about the distances to their peers over time, they adjust their positions in order to minimize the energy stored in the springs. The end result of this simulation is a set of "network coordinates" that allow the RTT to be estimated between any two nodes in the cluster by performing a simple calculation.
Serf 0.7 adds new commands and API endpoints related to network coordinates.
Here are some examples using the new
serf rtt command which allows operators
to interactively query RTT estimates:
# Get the estimated RTT from current node to another $ serf rtt nyc3-server-1 Estimated nyc3-server-1 <-> nyc3-server-2 rtt: 1.091 ms
» Get the estimated RTT between two other nodes from a third node
$ serf rtt nyc3-server-1 nyc3-server-3 Estimated nyc3-server-1 <-> nyc3-server-3 rtt: 1.210 ms
Serf 0.7 also exposes raw network coordinates for use in any external application via its RPC protocol. For more details on how to use raw network coordinates, see the Serf internals guide.
TCP Fallback Probes
Serf 0.7 adds a TCP fallback probe to the gossip protocol's node failure detector in order to help operators diagnose a common misconfiguration where TCP traffic is allowed between nodes but not UDP. A log message will alert the operator to the problem, but the node will still be probed successfully, preventing a flappy failure detection. This also helps ride out brief periods of high packet loss by providing a more reliable alternate path to probe another node.
Serf 0.7 has been designed to "upshift" automatically as nodes are upgraded to start taking advantage of network tomography features and new protocol features such as the TCP fallback probe. For most configurations, upgrading will just require an agent restart with the new binary.
More details are available on the upgrade process here.
The next release will be focused on more sophisticated algorithms for node failure detection, as well as a number of improvements that are under development from the community.
If you experience any issues, please report them on GitHub.