From the First Photocopy to Modern Failure Detection in Distributed Systems

Jul 11, 2019

Learn about the Gossip & SWIM protocols for managing group membership and failure detection in a distributed system, and learn how HashiCorp Consul & Nomad build on Gossip with "Lifeguard" extensions from HashiCorp Research.


  • Nic Jackson

    Nic Jackson

    Developer Advocate, HashiCorp

The challenge of distributed systems has been around ever since companies have had n+1 computers in a network. The scenario of managing membership for a group of nodes and detecting any failures has challenged engineers and been optimized for decades. However, many of the core algorithms still used today date back over 30 years.

One of the more well-known distributed systems challenges came in the 1980s when Xerox was experimenting with distributed email servers. From their experiments came the research paper: Epidemic Algorithms for Replicated Database Maintenance. From that initial algorithm, SWIM and the Gossip protocol were born, and at 5000+ machines, Gossip has still not been beaten to date in terms of performance.

In this session from Istanbul Tech Talks 2019, HashiCorp developer advocate Nic Jackson will look at how Xerox's epidemic algorithm became the Gossip protocol used in SWIM to manage group membership and failure detection in modern distributed systems.

Gossip is the protocol used in HashiCorp Consul for service discovery and configuration. HashiCorp Nomad, a cluster scheduler, also uses Gossip.

Both of these tools received novel improvements in the last year by implementing extensions from a HashiCorp Research project called Lifeguard which makes the Gossip protocol more robust. Jackson will explain how these extensions work and where their inspiration came from.

Stay Informed

Subscribe to our monthly newsletter to get the latest news and product updates.

Your browser is out-of-date!

Update your browser to view this website correctly. Update my browser now