RAFT: - An Understandable Consensus Algorithm
Introduction to Consensus
At work, we use a lot of products that guarantee consistency in a distributed system. For example, we use Apache Kafka and Apache NiFi. We use these internally for message queues, to stream large amounts of data, and to transfer large amounts of data. We also use Kubernetes. All these systems need to store data in a distributed manner. For example, cluster configuration, membership details current state needs to be stored. For this, Kafka and NiFi use Apache Zookeeper and Kubernetes uses etcd. Both Zookeeper and etcd are implementations of consensus algorithms to help store this data.
Storing data in a distributed system is hard. Every piece of information must be accurately replicated across all the machines, in a consistent manner. This means the system must account for network issues, server downtime and related issues. Consensus algorithms help solve these issues.
Consensus Algorithms typically deal with replicated state machines. For a given series of inputs, a state machine always returns the same output. So, if you replicate the state machine across a bunch of nodes and can maintain a consistent log of the inputs to the state machine, you have achieved consensus.
Raft is a consensus algorithm with the goal of being understandable. The authors of the paper argue that even though Paxos has existed for a while, there is no true to spec implementation of it because once you add all the real world factors (for example, changes in cluster configuration), the algorithm becomes very difficult to understand. Also, Paxos allows for lot of variations in its implementation. If there is no “true to spec” implementation, it becomes difficult to formally prove that your implementation does consensus correctly.
Glossary of Terms
- Client: The machine that issues an instruction. An instruction consists of the operation and a unique serial number that can identify the instruction.
- Leader: The server that can take requests from clients and is responsible in determining the order of inputs. There is only one leader.
- Follower: The servers that replicate the data based on requests from the inputs.
- Candidate: Servers that want to become leader. This is a transitionary state, and either become a leader or follower.
- Term: This is a number that increments every single time the leader changes.
- Entry: A persisted input in the log. An entry is uniquely identified by the index and the term number at time of persistence. I.e., all the entries with same index number and term are the same command.
- Committed Entry: An entry that has been recreated in most of the servers.
How Raft Works
Raft works on a leader-follower basis. The leader serves the clients, leader sends the data to the followers to replicate. It's the leader's resposibility to ensure that the consistency is maintained.
To explain how Raft works, I will split this section into multiple scenarios. In each scenario I will explain the current “setup” of the Raft system and then explains what happens in that scenario. These scenarios cover every possible the state the system can be in, and some states resolve into older states.
The Start Scenario
Setup: The system has just been deployed. The system has not yet received a request from a client.
Process: In this scenario, the system must decide on a system to become a leader. The process by which a leader is selected is called Leader Election. This process is the same at any point in time.
Each server (follower) has a randomized timer in which it must get a heartbeat from the leader. If it does not, it increments the term number, becomes a candidate and votes for itself. The candidate then sends a call with its term number requesting for a vote from other servers.
Vote Acceptance:
A vote is granted to the candidate if:
- If it has not already given its vote in that term.
- The committed index number in the candidate is greater than or equal to the follower
- The term number is equal to or greater than the follower’s own term number.
The candidate that receives the most votes is the leader. The system can now serve requests via this leader.
The Normal Operation Scenario
**Setup:**In this scenario, there is an established leader and all the servers are connected to the leader and have up to date logs.
Process: Only the leader can accept requests. If any other server receives a request, it rejects it and forwards the client to the leader. The leader receives a request, then
- The leader adds that entry to its log.
- The leader sends the log to all the servers.
- Most of the servers respond with a successful entry into the log
- The leader commits the entry.
- The leader applies the entry to its state machine, and then sends the output to the client.
In this situation, all the servers have the exact same log as the leader. This is the ideal scenario, and the leader continues to perform requests.
The Lagging Follower
Setup: In this scenario, there is an established leader and all the servers are connected to the leader. This follower has not received all the requests to replicate the last two requests.
Process: The follower needs to catch up to the leader.
- The leader sends a request to replicate a new entry. Along with it, it sends the previous index and the term number of the previous committed entry.
- The follower checks if the previous entry matches the request. The match fails, so it rejects the request due to inconsistency.
- Upon receiving a reject due to inconsistency, the leader sends the previous log just for that follower (The leader maintains a state for each follower what index needs to be served)
- The above steps repeat until the previous entry matches what the leader sends. Then the follower commits the requested entry. Then it again receives the subsequent commands and commits them into its log.
The Partitioned Follower Type A
Setup: An up to date follower has lost connectivity from the other servers. In this time its timer runs out and turns into a candidate. At this point it also regains connectivity with the other servers. No new request has been made at this time.
Process:
- The partitioned follower turns into a candidate.
- It requests a vote from the other followers. They accept because they are at a lower term.
- Leader gets a request for a vote from a candidate with a higher term. It becomes a follower and accepts the vote.
- Candidate becomes leader.
The Partitioned Follower Type B
Setup: An up to date follower has lost connectivity from the other servers. In this time its timer runs out and turns into a candidate. At this point it also regains connectivity with the other servers. New requests have been made and committed by the leader.
Process:
- The partitioned follower turns into a candidate.
- It requests a vote from the other followers. They reject because the committed index is lower.
- The servers that reject the vote, however, increment their term. A new election process will start, and a new leader will be elected.
The Partitioned Leader Type A
Setup: The leader has lost connectivity from the other servers. The leader has received 2 requests in this time and adds them to its log. No follower times out, and connectivity is regained.
Process:
- A heartbeat is sent by the leader. A heartbeat message is a no op request to append entries.
- The followers receive the heartbeat where the previous entry does not match. This resolves into The Lagging Follower situation.
The Partitioned Leader Type B
Setup: The leader has lost connectivity from the other servers. The leader has received 2 requests in this time and adds them to its log. A new leader is elected among the other clusters. The partitioned leader accepts a request and add it to the log. Partitioned leader now regains connectivity. The actual leader has also accepted a request.
Process:
- The partitioned leader sends a heartbeat.
- Heartbeat is rejected due to a stale term number and immediately becomes a follower.
- It receives a request to append an entry in the same index that it has already written. It compares term numbers, and since the new one is greater it overwrites the entry.
Note: This is one scenario. The partitioned leader may also receive a call, which it accepts as it has a higher term. Either way, it becomes a follower.
Leader Crash
Setup::: The leader crashes after committing the result but before returning the result to the client.
Process:
- The client resends the same command with the same serial number.
- The new leader has the serial number in it is committed log so immediately returns the result.
Partitioned Candidates
A partitioned candidate will resolve to the same scenarios as a partitioned follower.
Additional Notes
- The leader always appends data. The overwrite only happens if it has converted to a follower.
- The leader always has the highest committed index.
The Raft paper also talks about how to handle configuration changes and log compaction and snapshotting. I am not going into those details here as they are extensions to these basic operating steps. The fact that these are only so few possible states for the system is why Raft is understandable. The authors of the paper call this the reduction of state space.
Further Reading
As mentioned, the paper itself talks about log compaction, cluster membership changes and configuration changes.
Yugabyte DB is a new distributed database that internally uses Raft. Here’s a link to a guest lecture where they talk about the challenges of using Raft in a DB and how they overcome it using Leader Leases and other interesting solutions.