6.824 Practical Byzantine Fault Tolerance (2012 modified notes)

We’ve considered many fault-tolerance protocols

Can one handle a larger class of failures?

The PBFT paper’s approach:

Let’s assume the worst case:

What are the attacker’s powers?

What faults can’t happen?

Example use scenario:

RM:
  echo A > grade
  echo B > grade
  tell YM "the grade file is ready"
YM:
  cat grade

A faulty system could:

Bad BFT designs

Let’s try to design our own byzantine-fault-tolerant RSM

Design 1: Wait for all servers

What’s wrong with design 1?

Design 2: Wait for \(f+1\) out of \(2f+1\)

What’s wrong with design 2’s 2f+1?

Design 3: Wait for \(2f+1\) out of \(3f+1\)

What about handling multiple clients?

Let’s have a primary to pick order for concurrent client requests

What can a faulty primary do?

  1. send wrong result to client
  2. different ops to different replicas
  3. ignore a client op

General approach to handling faulty primary

  1. replicas send results direct to client
  2. replicas exchange info about ops sent by primary
  3. clients notify replicas of each operation, as well as primary each replica watches progress of each operation if no progress, force change of primary

Can a replica execute an operation when it first receives it from primary?

Design 4: Almost PBFT (no view change)

[??]

   REQ  PRE-P  PREPARE  REPLY
 C
S0
S1
S2
S3

Remember our strategy:

If the primary is non-faulty, can faulty replicas prevent correct progress?

If the primary is faulty, will replicas detect any problem? Or can primary cause undetectable problem?

Results of the primary sending diff ops to diff replicas?

How to resume operation after faulty primary?

When does a replica ask for a view change?

Is it OK to trigger a view change if just one replica asks?

For now, let’s defer the question of how many replicas must ask for a view change.

Who is the next primary?

View change design 1 (not correct)

Will all non-faulty replicas agree about operation numbering across view change?

Problem:

Can new primary ask all replicas for set of operations they have executed?

Solution:

Final design: PBFT operation protocol

View change design 2 (correct)

Can the new primary omit some of the reported recent operations?

Paper also discusses

What are the consequences of more than \(f\) corrupt servers?

What if the client is corrupt?

Suppose an attacker can corrupt one of the servers