You. Must. Build. A. Raft! Consul's Consensus Protocol Explained
If you've used Consul, you may have heard of the Raft consensus algorithm it uses. This talk will explain Raft in plain english.
Consensus is a fundamental problem in our distributed systems to achieve system reliability. The Raft consensus algorithm sets out to solve this problem, in a different approach than the first solution, Paxos. But how does Raft make decisions, and why should anyone trust it?
This talk will break down Raft and perhaps give you some more understanding as to why HashiCorp chose it for Consul's consensus protocol.
Cloud systems engineer, Cloudreach
I subtitled this, Your Tidal Wave of Information, because most of what I'm going to tell you isn't really going to stick. But you might fall in love with an idea here and you can go home and research it yourself.
There's something really important I want to say before we get into the tech: Right now a lot of engineers I look up to are going through a hard time. I think that we remember, no matter what, that all of us are people, and everyone just wants to be treated as a person—that we have value, and that we're all human. Yesterday, Kris Nova, an engineer I look up to who's smart and funny and awesome, she shared this at All Things Open. She wanted to see the support from the community, from people who look up to her, share this on Twitter. I think if you can take a picture of this, Tweet it, share it at some of your friends and family. Just let them know you care and you support them. That'd really mean a lot. You just want to show people they're not alone. If you want to hear more about this, there's the diversity breakfast on Thursday. I see you guys pulling out your phones, and again, I really appreciate it.
All these slides will be on my Twitter, which is @schristoff25. If you have any questions or comments, concerns, there's a Google form you can submit things anonymously or not anonymously. Also, my slides are on my website.
My name's Sarah Christoff. I am a Kuberbaby, and I'm a systems engineer at Cloudreach, which means whenever anyone says containers or Kubernetes, I am summoned out of my cave, and I go and help them. All right, and this talk is based off a steam game called you must build a boat. It's like a boat, but it's simpler because it's a Raft. You should buy this game, it's only $5.
Now we're going to get into the fun part. The reason I'm interested in this is because I wanted to build the perfect system—something redundant, not too complex, with little latency. I think it's something we all try and do. But the more I would try to add in one circle the more complex it would become. If I wanted lower latency I would reduce stuff in redundancy. If I wanted it to be less latent it would be more complex. This led to a debate between me and my boss—who isn't actually a cat. A couple months ago I told my boss about a problem I was having, where I wanted a multi region, multi master setup because everything has to be connected everywhere and we want so much redundancy. He basically told me, "Well, because of the latency between health checks, there might be false positives." Because of that, it's a problem with Raft that we couldn't do my idea.
I didn't really believe him, but I don't want to tell him that, because he's my boss. I made a talk about it instead and so now you're going to find out if I'm right or if I'm wrong and if I have a job after. He probably won't fire me. We're going to go over to a lot, so here's something just to prep your brains.
We'll start by building a base. I'm pretty sure a lot of you know about distributed systems but just in case.
Then from there we'll go into the Byzantine General's problem,
Byzantine Fault Tolerance mainly because I just didn't know what Byzantine meant. That's why we cover it.
Then from there we do consensus, Paxos and Raft. A lot of distributed drama happens, so just be prepped.
All right, so a centralized system is just that. It's central. Although it's easier to manage, it's a single point of failure and there's a lot of restrictions, and you lose a lot of flexibility. However, a distributed system allows us to gain more flexibility with our resources, but it adds complexity. Just so you know why we're dealing with this. This leads us into the…
» Byzantine General's problem
In 1982, Leslie Lamport and his bros, who's names I forget, wrote a paper called the Byzantine General's Problem. In this paper they tried to abstract a way some kind of the tech talk so that normal people can understand consensus and why we have these problems.
Let's talk about the Byzantine General's Problem. Imagine a city and the Byzantine empire decided they didn't really like this city. I don't know, it looked weird. They were like, "We're going to take this over. We're going to surround it." There are four other battalions that must coordinate and attack on this city and they all must do it at the same time to succeed. If they don't then the lose the city, everyone fails, it's bad. Flying torches signals, that can't be used. Enemy could see them. I don't know, they could figure it out. The generals have employed a messenger to communicate and coordinate the attack.
All the messages need to be sent to the other four battalions and confirm understanding. But that means all the generals have to be loyal, right? We can't have any bad actors. You could send messages on horseback to the other battalions, but how could you validate those messages? How can you make sure no one has intercepted your message, and your messenger is lying to you? Lamport says this is a solution, and it's really hard to read on a slide and in normal English, so I made pictures for it. In this algorithm, Lamport and his friends say the generals can reach a consensus if two thirds of the generals are loyal. If more than one third, are just traitors then consensus will not be met, and the enemy will win. Let's say lieutenant three being a traitor, all of the lieutenants are copying each commander's order and then confirming with one another, lieutenant two, will figure out the majority.
Since it received attack, attack eat from this previous slide ... See, attack, attack, eat then the attack is the majority and it will pass. Everyone will go attack. Let's say our general, our big boy, orange cat up there, he's a bad guy and he sends everyone different things to do. I want the nap time one. The lieutenants will still reach a consensus amongst themselves. It will all just be different things. Because of that there's probably a default command they can retract to, which is retreat, because they can't actually reach a specific consensus amongst themselves. They say, "All right, I got attack." "I got nap." "I got eat. Do any of these match each other? No, all right, so we can't reach a consensus. We'll just not do anything. We'll retreat." I guess that's doing something. This flows into…
» Byzantine Fault Tolerance
…which I think is super cool and also super sad. A Byzantine Failure in your system must be able to tolerate a class of failures that belong to the Byzantine General's problems. They're the most difficult and unpredictable failure modes to deal with. This is needed in airplane systems, NASA really likes it, power plants, all that important stuff.
Think of it like this: Whenever two of four processors showed different outcomes to different observers that can be considered as a Byzantine Fault. This is really specific slide in a slide deck that I copied from NASA. That makes me sound really cool. They said it was important so I'm going to make it to you guys. They said a Byzantine Fault is basically this. You are observing the Byzantine Fault like an action. I'll have a slide that shows an actual Byzantine Fault. Every time you go and check something it shows a different outcome, it's a Byzantine Fault. But a Byzantine Failure is when the system is gone because of a Byzantine Fault. Just know the difference, I guess. NASA said to.
Here's another really cool slide I took. Byzantine Fault Tolerance is crucial for companies like NASA where failure could result in life or death. NASA shares what they believe is the first picture of a Byzantine Fault where four computers disagreed and they could not reach a consensus, continuing to show different results to observers. I think if you guys want to dig into this deeper I have a bunch of sources, but basically it was a hardware issue. Some bus they had cracked and so each computer they had observing this was showing different results and no one knew how to fix it.
Now we're going to get into…
» Practical Byzantine Fault Tolerance
This was a paper written by two people we'll cover later. Basically what happened is they stated that this only uses one message round trip to execute read only operations and two messages round trip to do write and read. As you can see from this photo here it gets a little fun.
To cope with the system actively running processes that may be trying to mislead the system, EPFT requires more replicas than the traditional 2F + 1 replicas that are needed in normal consensus problems. If you want to be Byzantine Fault Tolerant you need more stuff.
How does this work? Say we want to withstand two failures. We need seven nodes. First the client sends the request to the primary. The primary multicasts this message to all of its backups. Each backup takes that request and it replies to the client after executing it. The client waits for the failure, plus one, so three replies from different replicas with the same result. Say the client doesn't receive enough replies or it doesn't receive any replies at all. Kind of implying that our primary is a bad guy. If the client does not receive the reply soon enough it broadcasts the request to all replicas. If the request has already been processed the replicas will simply resend the reply. Basically what will happen is if the primary isn't sending these out it will be voted out, which is really cool. It means we can withstand a bad leader.
I forget his name, but he works at Microsoft, and he writes this article called, The Saddest Moment. It's about speakers that give talks about reliability and distributed systems. Basically, what he tries to show is there no way to be completely reliable, completely Byzantine Fault Tolerant without actually losing your mind. This is a quote. Oh, his name was James Mickens. I have it on the slide. This is a quote by James Mickens. It became very true to me as I dove deeper into these white papers. I'm starting to believe, basically that distributed systems reliability consensus is a lie and we'll just keep adding more nines into things. I don't know. This is a good paper. You should read it.
Now we're going to talk about the consensus problem and all of this will come full circle in the end.
» The consensus problem
…ensures that a distributed system makes the correct decision even if one or more processes have failed. You may need your application to use consensus when utilizing distributed database architecture, using multi-master setup of your favorite orchestrator, or using both.
I don't have to do this, but I keep doing it. It's like a TV remote. Say you have five nodes and a client wants set a value. First, how do we know which node handles or receives the request? Second, how do we get them all to agree? Then what if one of them catches on fire? Can they still agree? This is my favorite one. There are some guidelines for a consensus protocol to make sure they are used in a variety of systems. These look really fun and easy to read, but the more you think about them the more your brain hurts. Here's how I understood them:
For termination; If a process is correct it will decide something.
Incorrect processes may or may not decide things.
Once a correct answer is decided then that correct answer to all loyal or correct process, that is the correct answer. There we go.
If a process decides something then it must have been proposed by a process that is correct .
Every correct process must agree on the same value.
This can be really confusing. But guess what? Paxos is more confusing. Paxos comes in many different flavors and is a family of protocols. We will be focusing on single decree Paxos because someone told me it was the easiest to understand. I hope. Lynch and Liskov wrote a paper saying the consensus problem could be solved. I hope no one knows Leslie Lamport in here, because he will come and he will find me for saying this, but because of this Lamport decided, out of spite, to prove them wrong. So he wrote a paper in 1989 that was like, "This is actually right. The consensus problem can be solved." Then it took him 10 years to get that paper published because no one understood it. It was rejected three times. A while later after everyone continued to say it's really hard to understand he wrote Paxos Made Simple, and no one understood that either.
I'll break it down like this: Say for Christmas you parents say you could get a cat or a dog. Just for clarification I love both animals equally. These aren't my animals at all. You and your two other siblings need to agree amongst each other if you're going to get a cat or a dog. Your parents don't want to deal with you, so you got to figure it out yourselves. First, you approach each of your siblings and one of you are like, "What's happening." They're surfers, okay. Then you would approach each of your siblings and see if they made up their mind if they want a cat or a dog. For simplicity sake we'll say they're undecided. Then you propose getting a cat. That's a really cool cat. Your sibling agrees and you do this to each one of your siblings until you have the majority of them agree that they would like a cat. But since there's only three of you, you only need to get one of your other siblings. Sorry little sibling.
This is how Paxos works, but we'll dive a little bit deeper. In Paxos there are three main roles. Each node can be all roles at once if they decide:
A proposer sends a proposed value to a set of acceptors.
An acceptor may accept the proposed value. The value is chosen when a majority of acceptors have accepted it. We're going to say accept a lot. Acceptors are allowed to accept more than one proposal, but we're not going to get into that too deep.
Also, learners: They just chill out. When everything's done and decided people send stuff to learners so they remember. Paxos has four primary phases we'll cover. Prepare, promise, accept and accepted. It starts with this.
Prepare: The proposer sends a message to all acceptors. The important bit is that proposal number up there.
Promise: An acceptor responds to this with a promise never to accept any proposal with a number less than 9,001, so that old proposals don't suddenly get ratified. It also sends the highest number proposal to the acceptor, that the acceptor has accepted so the proposer can substitute this value for its own. It's basically saying, "I promise never to accept anything old and also if your value isn't the highest value here's the highest value."
Accept: When the acceptor has gotten the acknowledge by the majority of acceptors it sends out the command
Acceptplease. At this time the acceptors can choose to accept or deny.
Accepted: Once it accepted then it sends all of this to all the learners and the original proposer to basically say, "This is what we did today."
Some really hard algebra, I just want to prep you. Paxos needs 2m+1 servers to tolerate m failures. If I want to tolerate two failures I need five servers. I did this because when I look at algebra my brain turns off.
Let's go over some failures:
If a proposer fails in the Prepare phase, what would typically occur is the proposer can wait indefinitely for the acceptors' response. Nothing really happens.
If a proposer sends something and it's waiting for the acceptors to send their accepts back the proposer can either wait indefinitely or if it dies no one cares, sorry.
If it fails during the accept phase then eventually another proposer will come around with a higher identifier. When it gets to the accept phase one of the nodes will tell the new proposer about the previous proposer's proposal. The new proposer will update their value to the previous proposer's proposal. Say you got 9,002 come in, because 9,001 died, and 9,002 say, "We're going to get a dog." By the time the acceptors were going to send back, "Yeah, we're not going to accept any other proposals," it also sent like, "Hey, the last guy who was here, we don't know where he is, but he said we were going to get a cat." 9,002 will be like, "Sick, we're going to get a cat," and that's it. It's more complex than that, but trust me, you don't want to know.
If an acceptor fails no one really ... there's more of them. Unless the majority of them fail then that's a problem, but if one or two fail, you all good.
» Leaderless Byzantine Paxos
This is where we get into some fun drama. In 2011, Lamport released a white paper called, Leaderless Byzantine Paxos. It's short. It's sweet. It's two pages. It rolls off the tongue. What it says is practical Byzantine fault tolerance is bad and wrong, because it uses a leader. That was actually the summary of the paper, but we can get into some fun parts.
In leaderless Byzantine Paxos, everyone has a virtual leader. When an incoming message is sent out to each server all leaders then have to make the same decision at the same time to reach a consensus. The main point of failure in this is it relies on a perfect synchronous system. Because of that nodes that might be lagging behind or experiencing some latency will cause false positives.
This is Leslie Lamport and he said basically if it falls behind or it's latent then it's bad. That was about it. You don't have to read the whole thing.
This is from the Google Chubby Authors. You can read that. I'll say some other stuff. Paxos can be thought of as the "choose your own adventure" way of implementing consensus. It is a base and on top of it you have to decide how you're going to handle a whole slew of things. Every white paper that is out there, not just Leslie's, but Google has written a ton. Anyone who's implemented Paxos has written a ton. It shows you a piece of the puzzle on how to implement their own flavor of Paxos.
It's pretty complex and especially if you want to put this in production. You're going to be looking at multi decree Paxos. It's hard for people to understand. There was another paper where someone said, "I'm pretty sure there's only five people who actually understand Paxos." One of those people is Leslie Lamport, so he doesn't count.
We're going to get into something that's a little less complex and that's Raft. Raft was started to make consensus algorithms more understandable. Paxos ramps up in complexity and the information you can get on it varies by user case. Between its cute marketing and friendly "it wasn't Paxos" approach Raft became widely used. As Paxos had a choose your own adventure play style, Raft assists you with setting up a production ready distributed system. It started off in 2013 or 2012. I don't know. I Googled it a lot and I couldn't find anything. Basically, some college students were like, "We don't understand Paxos. It's really hard. We're going to make something that's a little bit more understandable."
How does Raft work? In the beginning when you start up Raft there's no leader. Each node is assigned a randomized timeout. They're counting down and they're waiting for something (we'll cover what later). The first node to reach the end of its timeout—it's the orange node, because he had the shortest time out—will basically ask to be the leader. At this point, as in Candidate status. It asks everyone, "Hey, please vote for me. I'm really cool."
The elected leader will continually check up on the other nodes. Oh yeah, the slide is funny too. The elected leader will check up on the other nodes to make sure they're okay one he's elected. Basically you have a timer and every couple of seconds the leader is like, "Hey, 'sup guys? You still alive?" Every time they get that, "'Sup guys?" they gain more hearts and their timers reset. It sounds better if you call it a heartbeat and their heart grows each time the leader messages them. It's like a nice budding romance. This is how you make tech fun, guys.
In Raft, like Paxos, we have three different states:
The Follower, which is where most of our servers are,
The Candidate where they're trying to be like the leader, and
The actual Leader.
Raft is divided into terms and basically there's only one leader per term or there is no leader. Maybe because there was a conflict. You can only be leader if you have an up to date log. (We'll discuss that in log replication—how that's handled.)
Just to cover the basics, there are three end goals we can have during leader election. Say you don't hear from your leader. You're really anxious and sad and you hit your timeout and you decide I'm the captain now. You request others to vote for you. Either…
You become the leader and you send out heartbeats to everyone,
Somebody else becomes the leader and you're back to where you were, or
Nobody wins and a new term starts and you try again.
Whoever hits their timeout first gets to do this.
For the safety of the log one of the primary times the leader will be rejected is if the candidate log is out of sync. If you see in the bottom ... Where is it? Yeah, right there. That's really important, so just know that the different color is a new term. Then the index is the top number. The value you don't really have to care about, but it's just there. In this image we can see that the log for the orange server is out of sync. Therefore he will be denied. If the index is lower than the rest of the candidates it will also be denied. See how he doesn't have a blue one? That's why. That's the tl;dr on that slide.
Raft is designed around the log and keeping the log up to date. They say that and throw Lamport to the wind, because they say he didn't do that. This is important because Raft will automatically repair inconsistent logs, just by normal operations.
This is going to get a little complex. Here entries one through six are all committed, because we have three of them in each node. Does that make sense? Because we have a majority of them in each node appended then they're committed. A committed entry is one that is in a majority of the servers. If there was two of six it wouldn't be committed. However, the seventh entry isn't committed like that, like what I just said. It's like I planned this out. The seventh entry isn't committed because only two servers have it.
Say the leader is trying to append something and there's an inconsistent follower like that third guy right there. When a leader tries to append the inconsistent follower log will check at the preceding index and term. Basically, the leader is like, "Hey dude, you should be on the blue term. You should have index six, which is blue, and you should have nothing in seven, because I'm trying to add stuff in seven." This guy is like, "No thank you, friend. Not my thing." Then he's like, "All right dude, I'm going to send you more entries to see where we can match up, so then we can have a more consistent log." Say his five was different and they didn't match up; The leader would keep sending this dude a ton of data and be like, "Please fix your log," until the follower figured it out and was like, "I can fix this now."
Since we covered failures with Paxos I feel like I have to do it with Raft, or it's not nice. Say a client sends a message to the leader in transit. It's in transit or the leader dies when it arrives immediately, so basically it's in cyberspace. In this case the message is gone. How do you deal with this data loss? How do you ensure and guarantee that these messages will be read by a leader? Usually you just have to resend the message. When I gave this talk to my friends that's what they asked and they told me to put it in my talk, so I did, so they can't be mad.
We'll briefly cover Tangaroa, which is Byzantine Fault Tolerant Raft. Full circle. It uses digital signatures to authenticate messages. This prevents men in the middle attacks. Each actor in the cluster needs to have each other keys to decrypt the message. While this adds a level of complexity it lets each actor trust one another—it is inspired by practical Byzantine Fault Tolerance from 1999, yes.
And the leader can be overthrown. This allows for disloyal leaders not to take advantage of the system. This is possible by disallowing one leader to be in charge of the consecutive turns. By introducing a round-robin election we can remove the bad leaders from power. Basically one leader can't be a leader each term forever. We just have to keep switching those guys up. Since the nodes can broadcast every entry to each other it adds another layer of trust due to the sharing of information, but also another layer of complexity.
This is really cool. I didn't read too much of it, because I was like, "I just need more Byzantine stuff in my talk." I would check this out.
Here's a brief summary in a graph. I said I would cover the drama. I feel like I didn't do it justice. Just to cover,
Basically, a long time ago someone wrote a paper that the consensus problem could be solved.
Lamport was like, "No way dude." Then he solved it by accident. Probably on purpose.
Then he spent 10 years trying to get his like, "I solved this," published. Three people denied it. Probably because they didn't understand it. Then it was finally accepted.
Then Lamport wrote some other stuff. Oh yeah, Paxos Made Simple, because no one understood the thing that they accepted and no one understood Paxos Made Simple either.
In 2011, leader at Byzantine Paxos was written, two pages by Leslie Lamport basically saying he hated this paper that was written in 1999, or 1998, which is really cool.
Then in 2012, 2013, Raft is born. In that paper everyone was really happy. That was my favorite paper to read.
What are you going to do with this information when you get home? You can use it to sound really cool in meetings. What else do you need it for? If anyone says, "Let's use Paxos," you can be like, "No, sir. No thank you." If someone tells you that X problem is happening because of Raft you may be able to tell them that's wrong, like me with my boss, but not really because he's kind of right. But I'm kind of more right.
The application we were using that I wanted multi region, multi master, was an application I didn't really like. That's where I'm biased. Basically, Raft should be able to self heal if there's problems with latency. Also, because this is because of the randomized timers, but because most of our multi region, multi master setup would be distributed in the same kind of US zone. Nothing is going to be that latent. Network latency between regions, I think, someone who knows more about networks can come yell at me, I don't think it's going to be that bad. I don't think we're going to have a bunch of false positives. You could have false positives, but Raft would handle it. I'm going to call it a tie. Mainly because I don't want to get fired.
All right, almost done. These are slides are really cool, right? Super adult and stuff. Check out my bro, @cosmicmeoww. She will do slides for you and you will look as cool as I do.
Then I think this is important to check out. Here's all the stuff I had to read to do this talk. A lot of stuff. When you see speakers and they're nervous and they're weird like me up on stage, you give them a high five or a Red Bull and be like, "Thanks for reading all these papers. Wow. I bet that was a lot." I have a Red Bull. I just can't drink it.
Anyways, thank you. Check out my Tweet. We're all done here folks.