Cryptocurrencies such as Bitcoin and smart contract platforms like Ethereum are recent milestones in the evolution of consensus algorithms that blockchain represents. Consensus, getting distributed processes to agree on a single value, is a fundamental problem in computer science. Distributed processing is difficult. In fact, there are logical proofs showing pretty conclusively that there won’t be a single perfect algorithm for handling consensus in an asynchronous system made of imperfect nodes. As long as there is the possibility of a fault, there is a certainty of a problem.
It’s important to have a solid understanding of the fundamentals, since we are tasked with identifying an imperfect consensus algorithm that is still sufficient for complex distributed use cases. First, we’ll discuss consensus algorithms. Next, we’ll look into current algorithms addressing byzantine fault tolerance, followed by emerging algorithms. With that overview of the landscape, we can start to get a handle on selecting an algorithm for a particular use case. We’ll wrap up by looking, through a glass darkly, at what might be.
The Problem with Consensus
All consensus algorithms address termination, agreement and integrity properties. Termination, a liveness property, means all correct (non-faulty) processes eventually decide on a value. Agreement, a safety property, means that all correct processes decide on the same value; variations of the property differ in how many processes must share your value (all, some quorum or even one). Integrity means the decided value is valid within the dictates of the protocol, i.e., it was actually proposed by some process, and each process decides at most once. A consensus algorithm must also cope with the failure of participating nodes, where failure can refer to either a fail-stop or a byzantine fault. When a process simply stops communicating, this is called a fail-stop. A byzantine fault refers to a process that continues to communicate but sends faulty, possibly malicious, data. Spoiler alert: this is not good.
In a synchronous network, a message m is sent from node x to node y with the understanding that there exists an upper bound t on the amount of time a message can spend traversing the network and an upper bound p on the amount of time the slowest machine could reasonably take to process it. A message m is sent from x to y and an ack is returned to x. It’s easy to reason about a fail-stop condition by setting a timeout for the round-trip communication: 2t + p(x) + p(y). In an asynchronous network, we have no t and no p. Instead, the concept of a callback is used; x will send m to y and expect a reply at some unbounded future time f. This unbounded f effectively places an upper limit on the reliability of distributed asynchronous systems and, by extension, rules out a single, totally effective consensus algorithm, aka the FLP Impossibility.
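To make the difference concrete, here is a minimal Python sketch of timeout-based failure detection. The bounds, host and port are illustrative assumptions; the point is only that the timeout is meaningful under synchrony and meaningless without it.

```python
import socket

# Illustrative bounds for a synchronous network (seconds); real values
# would come from measured network and processing limits.
T = 0.050    # upper bound on one-way message delay
P_X = 0.010  # upper bound on processing time at x
P_Y = 0.020  # upper bound on processing time at y

# Worst-case round trip: x -> y (T), process at y (P_Y), ack y -> x (T),
# handle the ack at x (P_X).
SYNC_TIMEOUT = 2 * T + P_X + P_Y

def ping(host, port):
    """Treat a missed deadline as a fail-stop.

    This conclusion is only safe when the bounds above really hold; in an
    asynchronous network the reply may still arrive at some unbounded
    future time f, so a timeout proves nothing.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(SYNC_TIMEOUT)
        sock.sendto(b"ping", (host, port))
        try:
            data, _ = sock.recvfrom(1024)
            return data == b"ack"
        except socket.timeout:
            return False  # only safe to call this a fail-stop under synchrony
```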
In 1985, Michael Fischer, Nancy Lynch and Michael Paterson published a cheery little paper showing that in an asynchronous network where messages may be delayed but not lost, there is no consensus algorithm that is guaranteed to terminate in every execution for all starting conditions if even one node may fail-stop. Since there is no upper bound on f, there is no way to be sure a failure has occurred; it may just be a very long f. There cannot exist a deterministic consensus protocol that guarantees safety, liveness and fault tolerance in an asynchronous system where fail-stop is a possibility. FLP further implies that no agreement can be guaranteed in an asynchronous system with byzantine faults, since those are strictly harder to handle than fail-stops. There has been an enormous amount of work in this area in the decades since this paper, and you’ll see all the good solutions referencing back to this limitation.
The Journey Begins
In 1990 at Cornell, Fred Schneider proposed replicated state machines as a mechanism for implementing a fault-tolerant asynchronous network by replicating servers and coordinating client interactions with the server replicas. Posit a deterministic state machine, which reads a fixed sequence of deterministic commands. Deterministic commands produce deterministic outputs (write set values) for a given input (read set values). Now imagine a network where the servers are deterministic state machines. A client can issue a command to a network of these replicated state machines and expect that any one of them will return the same value, as long as three conditions are met. First, all replicas must share the same initial state. Second, determinism must hold. Finally, there must be coordination among nodes to ensure that all operations are carried out in the same order.
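Here is a minimal sketch of that idea in Python; the command format and replica count are illustrative assumptions, not a real RSM implementation.

```python
class ReplicatedStateMachine:
    """A minimal deterministic state machine (illustrative only).

    Every replica starts from the same initial state and applies the same
    commands in the same order, so every replica computes the same values.
    """

    def __init__(self):
        self.state = {}  # identical initial state on every replica
        self.log = []    # agreed, ordered sequence of commands

    def apply(self, command):
        # Commands are deterministic: same input state -> same output state.
        op, key, value = command
        self.log.append(command)
        if op == "set":
            self.state[key] = value
        return self.state.get(key)


# Three replicas fed the same ordered commands end in the same state.
replicas = [ReplicatedStateMachine() for _ in range(3)]
commands = [("set", "a", 123), ("set", "b", 456)]
for cmd in commands:
    for r in replicas:
        r.apply(cmd)
assert all(r.state == replicas[0].state for r in replicas)
```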
This deterministic state machine model makes it very simple to reason about replication, which is why it is such a foundational concept. First, the minimum number of machines required to manage a single fault is three: one to have a fault (a = 321) and the other two to have the correct answer (a = 123). If there are two failures, with one server reporting a = 321 and another claiming that a = 567, we can see that our tiny network is insufficient. (At this point, we will not address an attack where the majority of machines are corrupted and a = 123 is the lone voice in the wilderness, but we will get there.) In order to realistically provide for faults, an RSM network must be composed of 2f + 1 nodes, where f represents the upper bound on the number of nodes assumed to be faulty.
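The voting arithmetic fits in a few lines; the node names and values below are just the ones from the example above.

```python
from collections import Counter

def quorum_size(n):
    """Smallest strict majority of n replicas."""
    return n // 2 + 1

def max_crash_faults(n):
    """With n = 2f + 1 replicas, a majority survives f fail-stop faults."""
    return (n - 1) // 2

def majority_value(reports):
    """Return the value reported by a majority of replicas, or None."""
    value, votes = Counter(reports.values()).most_common(1)[0]
    return value if votes >= quorum_size(len(reports)) else None

# One faulty replica out of three is outvoted ...
print(majority_value({"n1": 123, "n2": 123, "n3": 321}))  # 123
# ... but a second fault leaves no majority in a three-node network.
print(majority_value({"n1": 123, "n2": 321, "n3": 567}))  # None
```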
If Jorge Luis Borges were great at math and knew about RSMs, he would have written The Part-Time Parliament. Instead, it fell to LaTeX inventor and distributed systems guru Leslie Lamport. While trying to demonstrate the impossibility of an effective consensus algorithm for distributed asynchronous systems, he ended up writing a great one. Win some, lose some. Assume three classes of agents: proposers, acceptors and learners. Any node in the network can play the role of any type of agent. Agents can send messages to each other. Agents can operate at any speed, can fail by stopping and can restart. Messages can take any amount of time to be delivered, can be duplicated and can be lost, but cannot be corrupted. This is a basic asynchronous, non-byzantine network model. Paxos is a leader-based protocol. A leader, also called a proposer, in response to a command sent by a client, starts a new instance of the Paxos protocol, which runs on every server in the network, and assigns each instance a different, sequential number. If a leader fails, a new leader is elected using a randomized selection algorithm, as implied by FLP. Coming to a consensus on a client request requires one or more rounds.
Each round is broken up into two phases: prepare and accept. A message contains two components: a value sent from the client, called a proposal, and a proposal number. The leader sends a prepare message containing the proposal number to a majority of acceptors. If that number is larger than the number of any prepare request the acceptor has already responded to, the acceptor promises to ignore any lower-numbered proposals and sends back the highest-numbered proposal it has previously accepted, if any. Once the proposer receives responses from a majority of acceptors, it chooses the value of the highest-numbered (most recent) proposal returned, or its own client value if none was returned, and sends an accept message with that value and its proposal number to the majority of acceptors. An acceptor will accept the proposal as long as it has not already responded to a prepare message with a higher proposal number. If a proposer does not receive a majority of responses, or if multiple proposers send conflicting messages, the round fails and another round begins with a higher proposal number. Otherwise, the round ends with consensus achieved by the majority.
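A minimal sketch of the acceptor’s side of a single Paxos instance, assuming in-memory state and simple tuple-shaped messages (both illustrative), looks something like this:

```python
class Acceptor:
    """Sketch of single-instance Paxos acceptor state, not a full protocol.

    An acceptor only needs to track the highest prepare number it has
    promised and the highest-numbered proposal it has accepted.
    """

    def __init__(self):
        self.promised_n = -1        # highest prepare number responded to
        self.accepted_n = -1        # number of the accepted proposal, if any
        self.accepted_value = None  # value of the accepted proposal, if any

    def on_prepare(self, n):
        # Phase 1b: promise to ignore lower-numbered proposals and report
        # the last accepted proposal so the proposer can adopt its value.
        if n > self.promised_n:
            self.promised_n = n
            return ("promise", n, self.accepted_n, self.accepted_value)
        return ("reject", n, self.promised_n, None)

    def on_accept(self, n, value):
        # Phase 2b: accept unless we already promised a higher number.
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted_n = n
            self.accepted_value = value
            return ("accepted", n, value)
        return ("reject", n, self.promised_n, None)
```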
Acceptors only need to remember the highest-numbered proposal they have accepted and the highest-numbered prepare request to which they have responded. A learner must discover the value that has been accepted by a majority of acceptors, either by having every learner query every acceptor or by having some distinguished learner do the work and communicate the result. Most implementations of Paxos use a leader who serves as both distinguished learner and distinguished proposer.
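The learner’s bookkeeping, whether performed by every learner or by one distinguished learner on behalf of the rest, is essentially majority counting over "accepted" notifications; the message shape here is again an illustrative assumption.

```python
from collections import Counter

class Learner:
    """Sketch of a learner waiting for a majority of acceptors to report
    the same accepted proposal (names and message shape are illustrative)."""

    def __init__(self, n_acceptors):
        self.quorum = n_acceptors // 2 + 1
        self.reports = {}  # acceptor id -> (proposal number, value)

    def on_accepted(self, acceptor_id, n, value):
        self.reports[acceptor_id] = (n, value)
        (best_n, best_value), votes = Counter(self.reports.values()).most_common(1)[0]
        if votes >= self.quorum:
            return best_value  # chosen: a majority accepted this proposal
        return None            # consensus not yet learned
```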
Similar algorithms followed Paxos. State machine replication can be considered an active replication strategy, while primary-backup replication is a passive strategy. ZAB, the Zookeeper Atomic Broadcast protocol, is used by Apache Zookeeper and is often compared to Paxos. While both ZAB and Paxos share the same leader election strategy, ZAB opted for a passive replication strategy since correct ordering was more important than raw performance. Zookeeper stores data in znodes, which may or may not have a hierarchy of children, similar to a filesystem. Kafka, prior to 0.9, stored its consumer offsets in Zookeeper. If those offsets were out of sequence due to a new leader election in Zookeeper, you would not be able to correctly look up your messages in Kafka because you would be unable to traverse the znode hierarchy. I use this as an example since it happened to me. According to a comparative study by Dr. Schneider, systems based on passive replication strategies are good for compute-intensive operations with moderate payload sizes, while active replication systems like Paxos provide predictable, low-latency performance for short operations. Raft is similar to Paxos, providing some additional boundaries and clarification around who can become a leader and when, and is generally considered easier to understand.
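For a sense of what that znode traversal looks like, here is a sketch using the Python kazoo client; the ZooKeeper address, consumer group and topic names are illustrative, and the offset path layout shown is the one used by the old (pre-0.9) Kafka consumer.

```python
from kazoo.client import KazooClient

# Walk the znode hierarchy the old Kafka consumer used for offsets.
# Host, consumer group and topic names are made up for illustration.
zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

offsets_path = "/consumers/my-group/offsets/my-topic"
for partition in zk.get_children(offsets_path):
    data, stat = zk.get(f"{offsets_path}/{partition}")
    print(f"partition {partition}: offset {data.decode()}")

zk.stop()
```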
Up Next
A basic understanding of fail-stop tolerant consensus algorithms forms a basis for understanding byzantine-fault tolerant algorithms in the next post.