Table of Contents

Introduction

In a distributed system, such as distributed database, computing, blockchain, and more, multiple nodes need to archive consensus to ensure the smooth and reliable operation of the entire system. Based on fault tolerance capabilities, consensus algorithms can be divided into Crash Fault Tolerance (CFT) and Byzantine Fault Tolerance (BFT).

Crash Fault Tolerance (CFT): Assumes all nodes are honest and consensus can be achieved despite occasional failures.
Byzantine Fault Tolerance (BFT): Consensus can be reached even in the presence of a certain proportion of malicious nodes.

Classic CFT algorithms, such as Paxos and Raft, are widely used in distributed storage systems. These algorithms offer faster processing capabilities and can tolerate the failure of more than half of the nodes.

While BFT algorithms tend to have lower performance and can tolerate no more than one-third of faulty nodes, they are particularly well-suited for blockchain scenarios. This blog will focus on BFT and its practical implementation-PBFT algorithm.

Byzantine Generals Problem

So, what is Byzantine fault tolerance? let’s look at the problem first.

Let’s say there are N generals and each general controls an army
The generals communicate with each other through message passing
They need to agree on a common decision, either to attack or retreat
Some m of these generals may bt traitors who send inconsistent or false information to prevent the loyal generals from reaching consensus.

The core difficulty of the Byzantine problem lies in the potential for traitorous generals to send conflicting information, making it hard for loyal generals to reach an agreement. For example, a traitor might tell one group of generals to “attack” and another group to “retreat,” causing confusion.

Byzantine fault tolerance

Byzantine Fault Tolerance (BFT) refers to the ability of a distributed system to correctly handle Byzantine faults. Its core concepts are as below:

Fault Tolerance: The system can continue to operate correctly and reach consensus despite a certain number of nodes exhibiting Byzantine faults.
Consensus Mechanism: All honest nodes in a distributed system can reach a unanimous decision even in the presence of some malicious nodes.
Redundancy and Verification: The system uses redundancy and multiple verification mechanisms to identify and disregard incorrect or malicious information.

PBFT

PBFT is an implementation of Byzantine fault tolerance. It has high performance and low latency, and can solve the problem of untrusted nodes.

Participant

Client
- Sends requests to the primary node, expecting the operations to be executed
- Also participates in detecting the behavior of the primary node; if the primary node fails to process the request within a reasonable time, the client can trigger a view change.
Primary Node(also known as Leader Node)
- Responsible for initiating and coordinating the consensus process in each round.
- Each round has one and only one Primary node
- If the primary node is deemed dishonest or performs poorly, it can be replaced through a view change mechanism.
Replica Nodes:
- Receive requests from the primary node and achieve consensus through a series of message exchanges and voting
- Each replica node validates the requests sent by the primary node and exchanges validation information with other replica nodes
- The number of replica nodes is typically $n≥3f+1$ where f is the maximum number of Byzantine faulty nodes the system can tolerate.

Why 3f+1? There are two paragraphs in the paper that can give a clear explanation

Even so, there must still be enough responses thatthose from non-faulty replicas outnumber those from faulty ones, i.e., $n-2f>f$ . Therefore $n>3f$ .

For simplicity, we assume $3f+1$ where is the maximum number of replicas that may be faulty; although there could be more than $3f+1$ replicas,the additional replicas degrade performance (since more and bigger messages are being exchanged) without providing improved resiliencygi

Process

First, the client sends a message to the leader node 0, and then the leader node starts the PBFT three-phase protocol include pre-prepare, prepare and commit.

pre-prepare phase

After receiving the request from client, the primary node sends a pre-prepare message to all the replica nodes. The pre-prepare message includes the informations as below.

PRE_PREPARE : Protocol phase identification
View Number: Current view
Sequence Number: The unique increasing sequence number of a broadcast message from the primary node
Message Digest: Hash digest of the message to prevent tampering

prepare phase

Each replica node will receive the pre-prepare message and verify its validity. After successful verification, the replica nodes will send a prepare message to all the nodes. When the replica node collects a sufficient number (at least $2f+1$) of prepare message, it proceeds to the next phase.

commit phase

After collecting enough prepare messages in the prepare phase, the replica node sends a commit message to all other nodes.
The replica node collects a sufficient number (at least $2f+1$) of commit messages before executing the client request and replying with the result.

Leader selection

The selection of the leader follows a deterministic approach based on a round-robin mechanism or through view changes when the current leader is deemed faulty.

If there are n replica nodes, the view number is v, the use the formula $leader = mod(v,n)$ to select the leader.

View change

The view-change protocol provides liveness by allowing the system to make progress when the primary fails.

When?

The view change is triggered by Timer. When the replica receives a request, it starts a timer, and resets the timer if there is a timer running at this time, but when the master node goes down, the replica i will time out in the current view v, and at this time the replica i will trigger the view-change operation to switch the view to v+1.

How?

Initiating
- When the timer of a replica node expires, it stops accepting messages (except for checkpoint, view-change, and new-view messages) and multicasts a VIEW-CHANGE message to all replicas.
New View preparation
- The new primary node for view v+1 waits to receive 2f valid VIEW-CHANGE messages from other replicas.
- Upon receiving these messages, the primary multicasts a NEW-VIEW message to all other replicas.
New View
- a replica node accepts the NEW-VIEW message if it is correctly signed, contains valid view-change messages.

During the view-change, the system is unavailable. However, messages that are not agreed upon during the view-change will be re-prepared.

Conclusion

PBFT ensures high fault tolerance, energy efficiency, and low latency in distributed systems, making it particularly well-suited for permissioned blockchain environments. Compared with traditional BFT algorithm, it lowers the systematic complexity from the exponential level to the polynomial level.

However, PBFT faces significant challenges in scalability due to its quadratic communication overhead and complex view change mechanisms. It also requires a strong trust model, making it more appropriate for controlled, permissioned networks rather than open, permissionless environments.

The PoS+PBFT-like algorithm used by some public chain platforms, such as Aptos-BFT and Tendermint, improves upon the PBFT algorithm, making it well-suited for blockchain systems.

Understand pBFT consensus algorithm