Practical Distributed Consensus using HashiCorp/raft
Sep 19, 2017
James Nugent from Joyent shows us how HashiCorp/raft can help maintain consensus in a distributed system.
Consensus in a distributed system is a hard computer-science problem. For example, if you have two clients that “simultaneously” try to update the same store with different data, but each writes to a different store replica, then who wins?
The standard computer-science solution to the problem, Paxos, is notoriously hard to implement correctly. Luckily, HashiCorp includes a Raft library, which does it for you. It’s the same library that HashiCorp Consul uses, for example.
Watch James unpick the most complex of problems to show how HashiCorp/raft can help you solve them. He shows how to use the library with Go, and then demos consensus in a three-node cluster.
- James NugentCo-Founder, Event Store Ltd.
I'm going to show a lot of code today, which is also Opensource, so there's no point in trying to follow along too closely with the code, but it's all available, the whole thing's Opensource using the HashiConf libraries.
So the first thing we're gonna talk about before we look at some code is What is Distributed Consensus? Let's say we have a database or a server, which has a single node. We have clients which can talk to it to say, for example, "Save this piece of information for us". We have [00:00:30] a server. It has a value associated with it. In this case it's just an integer. Our initial state is 0. Our client can send over commands like, store 11 in the system.
If we only have one node, consensus is really easy. Client says, "Store 11." Server stores 11 then says, "Okay, done." What if you have two clients concurrently? Then a single node system can just queue the request, supply them in order, deal with them in order and then reply [00:01:00] and you nave the same basic system. The consensus can be achieved very easily with one node.
The problem is, if we have stateful systems we prefer not to run single nodes because all kinds of things can happen to these single nodes. The hardware can fail. AWS can destroy your incidence. You could accidentally buy dull dims and have the box kennel panic, not that I'm bitter about this this morning.
[00:01:30] The problem is, as soon as you've got more than one server, getting them to agree becomes a hard problem. The client sends a request for example, store 11, and it talks to a particular server. How do we ensure that the servers do the correct thing? It becomes a much bigger problem when we have two sets of clients talking at the same time, because even if one server was coordinating every request, how do we guarantee that these things get applied to every server in the same order? This is particularly important if you have non-commutative request types. It's fair enough [00:02:00] if you're going to say, "Set this value to this." If you're going to say, "Read the value. Update it to add three, and then store it back," then the order of operations is very important, and it has to be the same everywhere in order for the system to maintain state.
That's the fundamental problem. There's a little bit more to it than that, specifically there's some problems with fault tolerance. It's a classic distributed systems problem that we're not going to get into today, because it's far too early for that, called the Byzantine Generals [00:02:30] problem, one of the fundamental aspects of distributed systems, where distributed consensus tries to address.
The first prior op for distributed consensus is an algorithm called Paxos, which was devised in the late 90s by Lesley Lamport, who I believe was at DC at the time, now at Microsoft research, has been for a long time. The paper was published in 98, something like that. The system was in use since the late 80s. [00:03:00] This is some excerpts from the paper. The paper was called the Part Time Parliament and the consensus model was hypothetically modeled around a Greek island and their system of government.
I don't know how big the text is there, but given that it's an academic paper about computer science that starts with the words, recent archeological discoveries on the island of Paxos reveal that the parliament function, despite the, whatever. [00:03:30] It's not the type of typical computer science paper that you start with.
It continues in that vein. Consequently, Paxos is notoriously difficult to implement properly. Lesley himself, the original author then published another paper a few years later called Paxos Made Simple and goes on to say it was regarded as difficult to understand, perhaps because the presentation was Greek to many readers. It was literally Greek in many cases.
[00:04:00] That lead to a whole rash of other papers about Paxos. There's one here called Paxos Made Practical. This one is a hub into what's to come here. Starts with Paxos is a simple protocol, which is technically correct until you actually try to implement it.
The problem is it actually works and is complete and proven. Consequently, because all other distributed consensus problems are reducible to the same set of primitives, it led to the quote from [00:04:30] Mike Burrows who designed Chubby for Google, saying that, "Distributed consensus algorithms fall into three categories: Paxos, Paxos with unnecessary extra cruft, or broken."
Raft falls into the second category, but it's designed with a different purpose. Also, the attribution may be wrong. Late 90s Google probably not the most reliable source for exactly who said what.
This is actually my favorite one. This was written by the Google guys that did Spanner eventually. [00:05:00] Despite the existing literature in the field, building such a database using Paxos, proved to be non trivial, which may be the understatement of the millennium. Will be an interesting problem.
A research student at I believe Stanford, Diego, did a research problem specifically to design an algorithm that had the same guarantees and benefits as Paxos, but could actually be explained. [00:05:30] The paper, which came out of that is called In Search of an Understandable Consensus Algorithm, and the algorithm is called Raft.
Luckily we haven't yet had a paper called, Raft Made Simple, so hopefully it's simple enough. Let's take a quick look at how Raft works and then we'll go into looking at how HashiConf's library implements it and how you can use it in your own systems.
Servers end up in one of three states. Every server [00:06:00] has one of three states at a given time. They're either leaders, followers, or candidates. When you start up a cluster of servers, every server starts in the follower state. There's no leader of this cluster. At some point, if they don't hear from a leader, which they won't because there isn't one, they become candidates and solicit votes from all the other members of the cluster. These things know how to talk to each other by some transport mechanism.
[00:06:30] The candidate will solicit votes from all the other members in the cluster, and other members of the cluster will reply. When we reach a simple majority, then it's time for the electoral. Oh wait, hold on. Then, once we have a majority, then we have a leader. A cluster is actually formed and we have a leader node and we have two follower nodes in this case. You need a majority of nodes to agree on this thing, so any overlapping majority is [00:07:00] guaranteed to have the correct value for everything in there.
Once we have a formed cluster, then all changes to the state of the system go through one node, which is the leader. Now, many systems allow you to go and talk as a client to any node in the system, and they will forward requests around to make it a bit more usable, but practically what happens is everything gets forwarded through the leader, and the leader is responsible for state management.
What happens when the client says, "Store 11," and you distributed [00:07:30] integer store, possibly the most over engineered system possible. The leader says, "Add 11 stored and event to the log," which is replicated out to the followers. Bear in mind we don't update any actual state values here. They always have their initial state until a bit later on.
The followers come back to the leader when they've received and persisted this commit log entry, and say, "Okay, [00:08:00] we're done now." At that point the leader changes its own state to match the value that was just committed, and then sends commit messages out to all the followers. The followers go and update their state and then they reply saying, "Okay." Then you can reply with an okay to the client. Obviously this is the happy path and there are many different failure scenarios that can occur here. That's the effective flow every rank through a Raft system.
[00:08:30] Actually implementing it is not that hard. It's a fairly straightforward algorithm. There is some nuance to it. Luckily, especially for people who are using Go, there are lots of high quality implementations. Two in particular are commonly used. The first one is etcd Raft, which is used in edcd and therefor cubineti and that kind of thing. The second one is HashiCorp Raft which is used for Console, nomad and probably other things as well. This is the one we're going to look at.
[00:09:00] HashCorp's Raft, apart from having a failing build right now is a library that manages the replicated log and it manages transport for you. It effectively gives you a very simple programming interface over a distributed system.
Let's take a look at a minimal example. The example we've got to go look at is we have three servers. [00:09:30] We're going to form them into a cluster. They talk over HTTP to clients and they talk over TCP among themselves. The library deals with that for you. You don't have to deal with intra cluster communication if you use this library.
A client, we're just going to use Carl because it's easy. If we go digging inside this thing, effectively the architecture is straightforward. There's also a missing line. We have an HTTP process and then [00:10:00] we have a server node process and we have the Raft log itself, so we'll use every instance of this library and then we have some storage.
Let's get to the practical bit and actually go look at some code. Before we do that, we can actually go run this thing. We have this cluster that's been running for a while. I have three nodes here. If we go start up this demo, we can just give it some ports to use, [00:10:30] and we'll see what happens here. Where are we. Okay cool. We have one node up. Then we can go and join the second node to it and the third node to it. This worked a second ago so hopefully it will. Okay. That's slightly [00:11:00] embarrassing. Okay, I do know what's wrong, however we'll just go look at some code instead. It's a lot easier.
The thing is implemented in Go. For people who have not used Go, it should be fairly straightforward to let me make the text size here a little bigger. [00:11:30] Nope. Sorry, did not check the font size of this. Okay.
Our program is fairly straightforward. There are some options we need to pass it. They live in this confix structure. [00:12:00] The basic options we need to pass through to use a consensus library here to build a system, we need to give it an IP address to use. We need to give it a port to use for Raft communication. We need to give it the address of at least one other node in the cluster. From there we can go and find all the other nodes that it needs to go talk to.
Actually, the vast majority of this little application is just plumbing around the library. We do a lot of things like validating configuration, that kind of thing, all [00:12:30] fairly straight forward. The actual meat of the thing is in this finite state machine. Raft itself and the library deals with replicating a commit log. It says nothing about how to actually use the entries in this commit log. It's effectively a log of events that have happened to the system and we need some way of taking an event and applying it to the state so we can actually make use of the state.
In this case, the structures that we're replicating [00:13:00] around this log are events. Because we only have a single integer to store, we have two things we can do. We have a store and we have a delete. To read, we don't need to go through the Raft system, although you could if you wanted to. Every time we get a new commit applied to this log, and by the time we get it, we will guarantee that it's replicated to all of the nodes. Every node has that data. [00:13:30] We need to apply it to our state. We have this interface to apply it. It's fairly straightforward. We get a lot entry, which gives us some information about some meta data about the actual message in the Raft log. We have our actual state structures. We need to protect our state with a lock, and then we have the actual state that's just an integer.
We can go through and say, "Every time we get a Raft log entry, we're going to go and [00:14:00] apply it to our state structure, but first we're going to get it serialized as [inaudible 00:14:05] for convenience." First thing we need to do is go unmarshal it so that we can actually read what the value is. Then we need to go and update our state. We take a lock. We defer unlocking it and then we put the value into our state. Now, this is a fairly straightforward example, because I was trying to keep the focus on the library plumbing and not the actual state management. This is exactly the same process the Console uses to say go add a new service to the catalog. Console will get [00:14:30] an event replicated around it's log that says, okay so a new service has joined. It's on this node with these ports and these tags and that kind of thing.
Console will go through, unmarshal the event, and then go and update the local cast log in exactly the same way. It's obviously work to be done, but the pattern is exactly the same.
The way we actually use this, again is fairly straightforward. [00:15:00] All we have to do to configure the entire replicated state machine system is effectively this. We construct object with a couple of parameters, and bolt them together. In this case, the different components we need, we need a transport, so we need TCP. We need somewhere to keep the data, so we need to keep the logs and then we need to keep the actual state as it's being written.
Then, for larger scale [00:15:30] systems, there's a concept of a snapshot. We have to implement them, but for the kind of scale we're talking about here, it makes no difference to us.
The other thing we need is a list of where our other loads are. There's a simplistic implementation of that here, which serves quite well I believe, which just stores nodes in adjacent file next to the directory.
Having constructed all these dependencies and configured it with our options, we can create a new [00:16:00] rough mode and run it, and it kind of just works, except it didn't work. Let's go figure out why it didn't work.
Let's go bootstrap this cluster. Bootstrapping is an interesting phase. Remember we started off with the idea that every candidate, sorry every [00:16:30] node starts in a follower state, and then there's timeouts which cause them to become candidates and then to have a leader election, to determine who's going to lead the cluster from that point forward. In this particular case, it's kind of wasteful. What we'd rather is that we know one node starts in the leader state already and other nodes can go join it, so that we don't have like a bootstrapping dependency problem.
In this case, what we do is effectively enable a mode where if this instance of the service doesn't hear [00:17:00] from another leader, then it will become the leader automatically. Bootstrap expect I believe in the Console configuration when you configure that does basically the same thing.
Okay. Here. This one enters the leader state, and here's the problem. As I moved around the conference my IP address changed. Actually we can solve that using another little HashiCorp library. Does anybody here write network utilities in Go in a regular basis? [00:17:30] Cool, this is a library that you should know about then. I built it in specifically because I knew this would happen. There's a library that HashiCorp has called Go [inaudible 00:17:40]. If you run Console, this is really useful to you.
You can use this kind of template syntax in Console configuration everywhere and it will go through and figure out. In this case it will just take the first private IP of the box, but you can do things like take the first IP address in this system that has a route to this network [00:18:00] and take the largest subnet of that and things. There's a lot you can do with it. That should solve our problem here.
Oh okay. Okay. That didn't solve my problem. That should solve my problem. What we've done is said, this node here is going to come up. It's going to go and join this other node. It's going to talk to the HTTP [00:18:30] port of this first node. It's going to add itself as a peer. Anyone that's run Console, these log messages are probably quite familiar because they're the same start up sequences you get with Console, in a slightly different log format, but they're effectively the same thing.
What we now have is a two node cluster. If we go and start the third one, where [00:19:00] are we. Got that. Okay. Wrong one in the wrong place. Then, the third node will join. We can go and see the logs show us that we updated our peer set so we now know about two other nodes and every one of these things, just automatically joins.
[00:19:30] The Raft library is managing all of that for us. There actually very little of our code has been run at this point. At this point we have a replicated fate system.
Let's look at how we make requests to it. If I go over to a new window here, and I of course need to go change this. I use curl and just go get key. [00:20:00] We get back a value zero. We've never actually set a value here, so this is just the value that we set, a default value.
Let's go make another test request that sets the value instead. If I go look at this, we're going to set a new value of 11. You can imagine this is a web API that's doing it instead. Something like the service catalog for Console. For now, we're just going to say, "Set this new value." If we go to curl dash D [00:20:30] at test request and post this URL, then as we press enter on this, we can watch the logs here and see we get a user request logged here, which was a post and it went to key.
There's no actual logging indicating that the value changed, so let's go check that it did what we expected here by going and retrieving the value. Now, we're at value 11. We can go and talk to any of these [00:21:00] servers, and they'll all come up with value 11. The way this is implemented from a client's perspective is just as a simple HTTP server. Using Go standard library, the way we handle a write request is to decode the request [00:21:30] from Jason. We go create the message that we're going to go and put in our replicated log. We say it's a set operation, and it has the new value here. We go serialize that to go an apply to our log. Then this single line will go and insure that this value is replicated across our entire system.
It will give us back an object which will at some point in time, either give us an error, if there's a problem applying the log, and we can just replace the client with a [00:22:00] that didn't work kind of response.
If it did work, we can just reply with an okay, which is what happened. If you go look in the API layer of Console, this is all it does really. It goes through. It serializes requests and it goes and puts entries on the commit log. Then the FSM section of it, after the data has made it to every node, the commit phase, that was our [00:22:30] prepare phase if you like, where we distribute the data to every node. The commit phase takes the data that we replicated and transforms the state representation to represent the new value. Because we have a lot and not any other kind of structure, it's serialized. You're guaranteed that every operation will be applied in the same order on every node.
If we have some kind of non-commutative operation, [00:23:00] that will still work. We don't have to do anything. We don't have to make any special affordances for that kind of behavior, unlike if we were doing our own coordination.
Now, this code is open source. There is no documentation for it really. As far as I'm aware, the most minimal example of using HashiCorp Raft without spelunking through the Console code and without involving any third party dependencies.
There are a couple of really good examples around that connect [00:23:30] HashiCorp's Raft to a key value source, something like [inaudible 00:23:36] to make a distributed transaction, to make a distributed key value store that's consistent and tolerant. Another popular database these days is Cockroach DB. That uses Raft under the hood as well, within shots to replicate data around its own internal representation of sequel.
When it comes down to it though, all of those things are just doing [00:24:00] what this little demo does. They're applying log messages to a log, waiting for the library to replicate the log around, and then they're mutating their own state in accordance with the instruction of the log message.
There are a lot of use cases for this. Many, not that many, but a few years ago, there's a bank in New York that uses Raft effectively as its security master distribution system. It gets stock market data in. [00:24:30] It replicates it around to different regions, and to different systems within the bank. Using Raft effectively, not HashiCorp Raft in this case, but the model is there.
This is a very powerful primitive and a really very powerful primitive ramped in an easy to use library. If you go try and implement this thing yourself, then as I noticed the Console team are down here, it's challenging. [00:25:00] Not because of the easy path. You can go implement the easy path in like half a day or something. This implementation of it on the other hand, is scaled up to tens of thousands of nodes in the cluster if not more.
If you have this problem, and I'm aware that probably very few people actually have this problem. Most people are more appropriately using Consoles on primitives versus building this themselves. If [00:25:30] you actually have this problem, this is a very nice library solution. It's one of the overlooked things that HashiCorp has done that's just generally useful to everybody.
If it's your sphere of, if you have a need for it, then I would absolutely recommend going and checking out the source code for this for a minimal example of how it works. Then, mapping the primitives from this example on to the Console database. Go download the source for Console, [00:26:00] and go look for the same primitives that are present in this demo so go look for the finite state machine. Go look for the Raft log. Go look for the server mode and the HTTP server. You'll get a much deeper insight into how Console actually works. None of them works in exactly the same way. Nomad is a bit more, slightly different but it works. Use it he same library underneath and it's effectively the same.
Now, I'm aware that there's limit to how practical a talk can be when it's firstly about distributed consensus, [00:26:30] be it 10 in the morning the day after the party, and see at a conference which is clearly not an academically focused conference. Hopefully, even if it's not directly useful to you, hopefully going through this small example here and reading through the code that's published and looking at the block diagrams of how it works, will give you if northing else, a deeper understanding of how console works so that when you're operating Console, you can understand its failure modes in a somewhat more understandable [00:27:00] manner.
The code for this is on Get Hub. Wow, that really doesn't want me to end huh. That's not even the right presentation. The code for this talk and I will make [00:27:30] the slides available as well, is all up on my Get Hub HashiConf Raft. The whole thing is less than lines of my own code, not counting the libraries for things like the logging and Raft itself, is well worth if you want to understand Console or Nomad's consensus on a more detailed level. Going and checking that out, mapping it on to the code base.
With that, thanks for listening. Enjoy the coffee.