From the First Photocopy to Modern Failure Detection in Distributed Systems
Jul 11, 2019
Learn about the Gossip & SWIM protocols for managing group membership and failure detection in a distributed system, and learn how HashiCorp Consul & Nomad build on Gossip with "Lifeguard" extensions from HashiCorp Research.
Developer Advocate, HashiCorp
The challenge of distributed systems has been around ever since companies have had n+1 computers in a network. The scenario of managing membership for a group of nodes and detecting any failures has challenged engineers and been optimized for decades. However, many of the core algorithms still used today date back over 30 years.
One of the more well-known distributed systems challenges came in the 1980s when Xerox was experimenting with distributed email servers. From their experiments came the research paper: Epidemic Algorithms for Replicated Database Maintenance. From that initial algorithm, SWIM and the Gossip protocol were born, and at 5000+ machines, Gossip has still not been beaten to date in terms of performance.
In this session from Istanbul Tech Talks 2019, HashiCorp developer advocate Nic Jackson will look at how Xerox's epidemic algorithm became the Gossip protocol used in SWIM to manage group membership and failure detection in modern distributed systems.
Both of these tools received novel improvements in the last year by implementing extensions from a HashiCorp Research project called Lifeguard which makes the Gossip protocol more robust. Jackson will explain how these extensions work and where their inspiration came from.