Cryptocurrencies such as Bitcoin and smart contract platforms like Ethereum are recent examples of the evolution of consensus algorithms blockchain represents. Consensus, getting distributed processes to agree on a single value, is a fundamental problem in computer science. Distributed processing is difficult. In fact, there are logical proofs that show pretty conclusively that there won’t be a single perfect algorithm for handling consensus in an asynchronous system made of imperfect nodes. As long as there is the possibility of a fault, there is a certainty of a problem.
It’s important to have a solid understanding of the fundamentals since we are tasked with identifying an imperfect consensus algorithm sufficient for complex distributed use cases. First, we’ll discuss consensus algorithms. Next, we’ll look into current algorithms addressing byzantine fault tolerance, followed by emerging algorithms. With that overview of the landscape, we can start to get a handle on selecting an algorithm for a particular use case. We’ll wrap up by looking through a glass darkly about what might be.
What Else is Out There
We have discussed algorithms that are currently being used in production systems that address fail-stop and byzantine fault tolerance in an asynchronous network. But as we know from FLP Impossibility, there are no perfect solutions. There are necessarily weaknesses as well as strengths in both PBFT and PoW and production has a way of making weaknesses clear. We’ll address some of the alternative algorithms in both the Proof of Work and BFT space. The analysis can be straightforward since the algorithms have so many different characteristics is direct opposition.
PoW provides a permissionless, decentralized node identity management framework that allows for a very large number of nodes and clients. BFT allows for only permissioned nodes where all of nodes know each other’s ID’s which results for a limited number of nodes (<= 20) but still allows for a large number of client. The performance contrast from a throughput and latency perspective is also stark. The throughput of PoW is limited by chain forks while BFT can process 10K tx/ sec, which is comparable to transaction systems that only allow for fail-stop tolerance. The performance latency of a BFT algorithm is limited only by network latency while PoW has a multi-block confirmation process that results in high latency. PoW is a probabilistic algorithm so there is no true consensus finality while the deterministic characteristic of BFT allows for finality.
PoW is great for a building a highly scalable pemissionless network suitable for transactions where consensus finality can be probabilistic and latency can be high. The block frequency is about ten minutes (based on this calculation) and it can take an hour or two for consensus to be reached. BFT is great for building a small permissioned network suitable for providing a very low latency transaction platform with true consensus finality. It’s also worth noting that PoW is horrible related to power consumption and relies on physical clock timestamps for block validity. Naively if you want the whole world to be able to process 10 transactions a second, use PoW; if you want to enable a consortium to process 10,000 transactions a second, use BFT. If you want to build a new consensus algorithm, move either of those dials. Before that, we need to expand the use cases to include smart contracts.
There are two clarifications that should make it easier to envision a smart contract within the contest of a consensus algorithm. A smart contract is a program designed to be executed exactly as defined by a developer so the terms of the contract relationship can be enforced by cryptographic code rather than lawyers, Imagine that bitcoin is a smart contract in that it is a value transfer mechanism where validation is contingent on a network of nodes agreeing on a predefined set of condition. Cryptocurrency can be considered a simple implementation of a smart contract. Ethereum is an attempt to build a generalized technology that provides developers a platform to implement smart contract on a distributed, transaction-based state machine. One major distinction you’ll find as a developer is that Ethereum provides a Turing-complete language (ex Solidity) so that a developer can bring their own use case while Bitcoin uses Script, which is a Forth-like language that is not Turing-complete. For a gross oversimplification, you use a Forth-like language when you want to do something very specific correctly and at a low-level and you use a higher level language like C++, Python or JavaScript (Solidity’s influences) if you want to be able to tack generalized use cases where correctness is more … elusive.
Beyond BFT
PBFT is fast, but suffers from node scaling issues since communication between nodes is quadratic or O(n^2 ). Fail-stop tolerant algorithms like Paxos suffer from the same limitation, but since these are usually confined to an enterprise it was not considered a practical limitation. The Linux Foundation, with the support of IBM, developed Hyperledger Fabric. Hyperledger Fabric provides a pluggable consensus model and currently supports PBFT and a PBFT-derivative called SIEVE as well as a Paxos-based algorithm that has BFT built in called XFT. All of these implementations are made to support a relatively small number of known clients (<=20 nodes); Both Ripple and Stellar have taken BFT and introduced the concept of federation.
In BFT, all nodes must be trusted, which means there is a mechanism for registration and identification. In Federated BFT, only some nodes are trusted and each node must include a reference to at least one trusted node in its node list. Stellar limits the amount of nodes that must be considered trusted by introducing the concept of a quorum slice. A quorum refers to a set of nodes that are needed to reach agreement while a quorum slice is a subset of a quorum need to reach agreement in just one particular node.In this system, each node chooses its own quorum slices, allowing for open participation. The Ripple protocol implements proof of consensus. Ripple requires each node to define a Unique Node List (UNL) which refers to other nodes that are trusted by the given node not to collude against it. Consensus in the Ripple network is achieved by each node by consulting other nodes in its UNL. When an 80% super-majority is reached, the candidate set becomes a valid block and this Last Closed Ledger is added to the Ripple blockchain.
PoX
There are a variety of algorithms for the blockchain that attempt to prove some x using cryptography that fall into the PoX family; Proof of Work, Proof of Stake, Proof of Importance, Proof of Publication, Proof of Elapse Time, Proof of Burn and likely others. Proof of Stake makes for a good candidate to contrast with Proof of Work since it’s currently in use as a technology (Ethereum) and demonstrates a practical implementation based on a philosophy (cypherpunk). I think that Proof of Elapsed Time is compelling because of the tight integration between hardware and software.
Proof of Stake (PoS) brings an underlying philosophy as Vitalik Buterin explains. The basic premise is that modern cryptography places the defender in a stronger position than the attacker and he feels that at a fundamental level Proof of Work places the cost of attack and the cost of defense on equal footing. By relying on penalties to enforce security rather than reward, he argues the system is closer to the cypherpunk spirit and, consequently, more soundly engineered. Critics feel that he over-emphasizes the threat of the 51% attack and under-emphasizes the true cost of an adversary’s attack. As you may have guessed from my first paragraph, I am not convinced that “on medium to long time scales, humans are quite good at consensus.” I definitely agree that “making systems that are easier to defend than they are to attack is also simply sound engineering”. And there’s really no argument that PoW take a lot of energy. So what is meant by rewarding validators and punisher perpetrators?
PoS attempts to address the larger electric consumption issues of PoW as well as some of the inequality around GPU-centric large mining operations by replacing miners with validators. A validator purchases ownership (stake) in the virtual currency, buying a block creation chance. Instead of a user spending say $5000 buying mining equipment to win a mining reward, $5000 worth of cryptocurrency is used to buy proportionate block creation chances in the blockchain system by becoming a validator. The PoS algorithm pseudo-randomly selects validators for block creation, thereby ensuring that no validator can predict its turn in advance. These validators are rewarded slightly for their potential lost opportunity cost plus the burden of maintaining nodes and keeping the private key private and other maintenance issues. The bulk of the costs are born from penalties that are orders of magnitude higher than the validator’s costs. Using a pseudo-random selection algorithm rather than relying on brute-force calculation levels out the power consumption but exposes different vulnerabilities. PoW suffers from the 51% vulnerability where more than half the users are corrupt. PoS suffers from the Nothing-At-Stake vulnerability, where there is no incentive to choose the correct block. Casper uses security deposits and the threat of losing said security deposit to punish such shenanigans to address this vulnerability.
While PoS definitely addresses the power consumption issues of PoW, it does so by radically altering the algorithm, which has shown itself to be successful in production. Bitcoin exists, therefore PoW works. But does PoW need to kill trees? Intel may have an alternative.
Intel’s Proof of Elapsed Time (PoET) uses a randomized process to finalize the block by electing a leader from an open, permissionless network. All nodes in the network validators) generate a random wait time and go to sleep, freeing up CPU cycles. The validator with the shortest wait time for a particular transaction block is elected the leader; waking up, committing a block to the blockchain and notifying the rest of the network.. There are two functions that authenticate this election process. The first function creates a time for the transaction block that is guaranteed to have been created by that particular enclave while the other function verifies this guarantee and creates an attestation that the validator did wait the allotted time.
Leader selection is a potential vulnerability, so Intel developed a specification for a Trusted Execution Environment (TEE) as well as an implementation called Software Guard Extension that uses new secure CPU instructions. These secure CPU instructions can be used to ensure the safety and randomness of the leader election while eliminating the costs associated with power and specialized GPU processing required by PoW. As a side-effect, wider adoption will free up GPUs for high end gaming video cards. In PoET, every validator requests a wait time from a trusted function running in a TEE called an enclave.
Up Next
Now that we have covered the landscape of distributed consensus algorithms, we can start to see how they can be practically applied categories of use cases.