Cryptocurrencies such as Bitcoin and smart contract platforms like Ethereum are recent examples of the evolution in consensus algorithms that blockchain represents. Consensus, getting distributed processes to agree on a single value, is a fundamental problem in computer science. Distributed processing is difficult. In fact, there are formal proofs, most famously the FLP impossibility result, showing that no consensus algorithm can be perfect in an asynchronous system made of imperfect nodes. As long as there is the possibility of a fault, there is a certainty of a problem.
It’s important to have a solid understanding of the fundamentals, since we are tasked with identifying an imperfect consensus algorithm sufficient for complex distributed use cases. First, we’ll discuss consensus algorithms in general. Next, we’ll look at current algorithms addressing byzantine fault tolerance, followed by emerging algorithms. With that overview of the landscape, we can get a handle on selecting an algorithm for a particular use case. We’ll wrap up by looking through a glass, darkly, at what might be.
How to Narrow It Down
The good news is that a few obvious non-functional features will help narrow down which algorithms are appropriate, but we need some new terms to evaluate the appropriateness of one algorithm over another. A network can be permissioned or permissionless: the former is open only to known participants while the latter is fully open. Trust refers to whether the nodes participating specifically in consensus are known or unknown. Node scalability of the peer network is the maximum number of nodes that can be involved in consensus, while client scalability is the number of clients that can request services of those nodes. Transaction rate measures how quickly consensus is reached and transactions are confirmed. Transaction finality refers to how long it takes for a transaction to be considered final by all nodes. Finally, there may or may not be an explicit cost of participation, which determines how much a node must invest to take part in voting.
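To make these criteria concrete, here is a minimal sketch, in Python with hypothetical field names of my own choosing, of how you might record them for side-by-side comparison. The two example profiles preview the discussion below; details vary by implementation.

```python
from dataclasses import dataclass

@dataclass
class ConsensusProfile:
    """Non-functional features of a consensus algorithm (illustrative only)."""
    name: str
    permissioned: bool       # open only to known participants?
    trusted_nodes: str       # "all", "some", or "none"
    node_scalability: str    # max nodes that can be involved in consensus
    client_scalability: str  # clients that can request services of the nodes
    transaction_rate: str    # how quickly consensus confirms transactions
    finality: str            # "deterministic" or "probabilistic"
    participation_cost: str  # explicit cost to participate, if any

pow_profile = ConsensusProfile(
    "PoW", permissioned=False, trusted_nodes="none",
    node_scalability="very high", client_scalability="very high",
    transaction_rate="low", finality="probabilistic",
    participation_cost="electricity for mining")

pbft_profile = ConsensusProfile(
    "PBFT", permissioned=True, trusted_nodes="all",
    node_scalability="low (~20 nodes)", client_scalability="high",
    transaction_rate="high", finality="deterministic, immediate",
    participation_cost="none explicit (likely a membership fee)")
```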
These non-functional features are best considered in context with their interrelationships. In today’s interconnected world, the best place to start is with scalability. Nothing kills the network effect quite like barriers to participation. Client scalability is the key here, because math: the value of a network grows roughly with the square of its participants (Metcalfe’s Law), and clients are where those numbers live. But it’s not the only factor. Whether a network is permissioned or permissionless affects how scalable it is, but mostly through how node scalability is provided. In a permissionless network where anyone can be a node, it’s easy to see why some treat permissionlessness as the main factor in scale. But it depends on how data and process are distributed. Mastercard, for example, provides payment services to companies, not consumers. It has a relatively small set of nodes compared to clients. The example illustrates why you need to decouple scalability from permission and trust in your thinking.
If your use case involves a consortium, a network of registered broker-dealers, hotel chains, or any other finite number of large legal entities you can imagine, you should consider a permissioned network. Seriously, why build out a highly scalable permissionless model when the legal department is going to require contracts anyway? I have a post that recommends starting your PoC as a Proof of Compliance if you work in a regulated industry; not every decision is driven by technical concerns alone. Trust is related to permissioning and applies to the BFT derivatives: are all nodes trusted, or only some? Again, this is as much about regulation as it is about architecture. If you just want, and can legally provide for, a global network of individual consumers, then permissionless is the way to go. You probably heard about this technology via a permissionless network like Bitcoin, whose goal is to be ubiquitous. Technically, any permissionless network can also be permissioned, but this makes little sense in practice.
Whenever you are considering a distributed persistence model as opposed to a relational database, you need to ask when a transaction will be considered final. We are always going to be dealing with eventual consistency, but is that eventuality measured in milliseconds or minutes? Is the result deterministic or probabilistic?
Service Level Agreements are almost as fundamental to your use case as permissioning. If your network is completely open but so slow that no one can use it, you haven’t built something useful. Knowing how your algorithm reaches consensus will give you insight into latency so you can align with your SLA.
At this point, you can make some informed decisions based on your high-level requirements.
How Do They Measure Up
PoW-based systems use a permissionless network based on an untrusted participation model, so scalability is very high: anyone can join the network, and it must scale appropriately. Solving a cryptographic puzzle is a resource-intensive process, so there is a cost of participation, measured in the electricity used to power computation on GPUs, and the transaction rate is low. Transaction finality is probabilistic, since multiple blocks can be mined simultaneously due to network latency. PoET uses a hardware-based solution for leader election that would likely make it more performant than PoW, but not as fast as BFT.
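To see why that cost is real, here is a toy sketch of the hash puzzle a PoW miner grinds through. The difficulty here is tiny and purely illustrative; real networks like Bitcoin use a far harder target (and double SHA-256):

```python
import hashlib

def mine(block_data: str, difficulty_bits: int = 20) -> int:
    """Find a nonce such that SHA-256(block_data + nonce) falls below a
    target with `difficulty_bits` leading zero bits. Toy illustration only."""
    target = 1 << (256 - difficulty_bits)
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{block_data}{nonce}".encode()).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce  # proof found; trivially verifiable by anyone
        nonce += 1

print(mine("block 1: alice pays bob 5"))  # expect roughly 2**20 hashes of work
```

Finding the nonce is expensive; checking it takes one hash. That asymmetry is the whole trick.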
PoW requires a miner to solve a complex cryptographic puzzle, while BFT uses voting by replicas for state changes. Adding a block to a blockchain is a probabilistic function in all of the PoX algorithms, just for different reasons. In PoW and PoET, multiple blocks can be mined simultaneously due to network latency. In PoS, validators can vote in parallel on multiple chains. While PoW and PoET are probabilistic due to external factors, and PoS could technically minimize this behavior through disincentivization mechanisms like Ethereum’s Casper, all three ultimately exhibit probabilistic transaction finality.
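That probabilistic finality can be quantified. The Bitcoin whitepaper models the chance that an attacker controlling a fraction of the hash power ever catches up from z blocks behind; here is a sketch of that calculation, assuming the whitepaper's Poisson model:

```python
from math import exp, factorial

def attacker_success(q: float, z: int) -> float:
    """Probability an attacker with fraction q of total hash power ever
    overtakes the honest chain once a transaction is z blocks deep
    (the catch-up model from the Bitcoin whitepaper)."""
    p = 1.0 - q                 # honest hash power
    lam = z * (q / p)           # expected attacker progress while z blocks are mined
    prob = 1.0
    for k in range(z + 1):
        poisson = exp(-lam) * lam**k / factorial(k)
        prob -= poisson * (1 - (q / p) ** (z - k))
    return prob

# With 10% attacker hash power, a handful of confirmations makes reversal unlikely.
for z in (0, 2, 6):
    print(f"z={z}: {attacker_success(0.10, z):.6f}")
```

The probability never reaches zero, which is exactly why PoW finality is described as probabilistic: you wait for enough confirmations that reversal is negligible, not impossible.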
PBFT-based systems use a permissioned network based on a trusted participation model, so node scalability is very low, roughly 20 nodes or fewer. There is no explicit cost of participation, although there is no doubt a membership fee. Using state change logic rather than a blockchain allows for immediate transaction finality and a high transaction rate.
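The low node count follows from the arithmetic of the protocol: tolerating f byzantine nodes requires at least 3f + 1 replicas, and the prepare/commit phases involve all-to-all messaging, so traffic grows with the square of the node count. A quick sketch of those numbers:

```python
def pbft_requirements(f: int) -> tuple[int, int, int]:
    """For f tolerated byzantine faults: replicas needed, quorum size,
    and the rough per-phase message count that limits node scalability."""
    n = 3 * f + 1           # minimum replicas to tolerate f byzantine faults
    quorum = 2 * f + 1      # matching votes needed to commit a state change
    messages = n * (n - 1)  # all-to-all traffic in the prepare/commit phases
    return n, quorum, messages

for f in (1, 3, 6):
    n, quorum, msgs = pbft_requirements(f)
    print(f"f={f}: n={n}, quorum={quorum}, ~{msgs} messages per phase")
```

At around 20 nodes the quadratic message load is already in the hundreds per phase, which is why PBFT deployments stay small.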
Paxos and BFT algorithms are only appropriate for permissioned networks and so, by extension, is Hyperledger Fabric. PoS and PoET can reasonably be used for both permissioned and permissionless networks. While technically any implementation that supports permissionless membership can also support a permissioned model, it makes little sense to do so with PoW and Federated BFT. Trust is related to permissioning and applies to BFT: in PBFT, all nodes in the permissioned network must be trusted, or at least registered and identifiable; in Federated BFT, only some nodes must be trusted. BFT, which uses state change logic rather than a blockchain, offers immediate transaction finality.
You should be evaluating systems that use BFT variants if you have a strictly permissioned network. PoW and Federated BFT are appropriate for permissionless networks. PoS and PoET are options for either type. Your SLA will help narrow down your options further: if you have a permissionless network that requires fast transactions, you need Federated BFT. Ripple and Stellar are fintech companies in the payment processing space, where big, fast, open networks make sense. Bitcoin took the PoW route because decentralization was valued more highly than transaction rate in the cryptocurrency space. If you could reasonably describe your user base as a consortium with smart contracts as a use case, you should do a deep dive on Hyperledger. If you are looking at smart contracts that may include a permissionless use case, then PoS-based systems like Ethereum may be appropriate. These rules of thumb are summarized in the sketch below.
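Pulling the rules of thumb together, a hypothetical selection helper might look like the following. The inputs and outputs simply mirror the guidance above, nothing more; a real evaluation weighs many more factors.

```python
def candidate_algorithms(permissioned: bool, needs_fast_finality: bool) -> list[str]:
    """Map the two big requirements onto the algorithm families discussed
    above. Illustrative only, not an exhaustive decision procedure."""
    if permissioned:
        # Known, trusted participants: BFT variants give immediate finality;
        # PoS and PoET also work in permissioned settings.
        return ["PBFT (e.g., Hyperledger Fabric)", "PoS", "PoET"]
    if needs_fast_finality:
        # Open network that still needs fast transactions.
        return ["Federated BFT (e.g., Ripple, Stellar)"]
    # Open network where decentralization outweighs transaction rate.
    return ["PoW (e.g., Bitcoin)", "PoS (e.g., Ethereum)", "PoET"]

print(candidate_algorithms(permissioned=False, needs_fast_finality=True))
```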
This is a good time to double-check and make sure your use case can’t be solved with a relational database. I’m serious; only go down the asynchronous distributed network path when the well-worn, comfortable, efficient models have demonstrably failed. The same is true for managing byzantine fault tolerance. There are a great many use cases where a combination of fail-stop and reasonable logic at the protocol level is sufficient. This stuff is not as fun as vendors make it seem and the internet is full of reference architectures and best practices made by people who have awareness but not experience. Just saying.
Up Next
If you read all the way to here and conclude that your use case does not justify this amount of architectural overhead, then this has been time well spent. The same is true if you now have a better idea of which tools to bring with you to tackle some of these big problems. However, if you still have needs that are not addressed by any of the existing algorithms, the next and last post may be helpful.