6.824 2009 Lecture 18: Security: Byzantine Fault Tolerance

so far we have been assuming fail-stop behavior
  computer/network either executes the protocol correctly or halts
    (and maybe is repaired after a while)

even fail-stop is hard to cope with
  did the server crash, or is the network broken?
  is the network broken, or just delaying packets?
  is the network partitioned?
  what if fail-stop failures/repairs are so frequent we can do no work?
  if the system is quiet for a while, ping+paxos will produce a useful configuration

now we're going to be more ambitious
  use replication to help with security

what is Byzantine behavior?
  a larger set of ways in which a computer can misbehave
  includes fail-stop faults
  plus bugs (i.e. incorrect execution)
  plus intentional malice
  plus conspiracies -- multiple bad nodes cooperating to trick the good ones
    (devious, surreptitious)

what kinds of Byzantine behavior could cause trouble in the labs?
  lock server primary gives away the same lock to multiple clients
  lock server backup responds to pings, but does nothing else
    will be kept in the Paxos config, but will prevent progress
  backup might fake a double-grant from the primary,
    trick the system administrator into thinking the primary is faulty,
    and thus trick the admin into replacing a good primary with a malicious backup
  backup DDoSs the primary, Paxos causes the backup to take over as primary,
    then it gives away locks to multiple clients

specific assumptions in the paper's Byzantine fault model
  network can delay, duplicate, re-order, or drop any message
  "faulty" vs "non-faulty" replicas
  faulty replicas may be controlled by a malicious attacker
  the attacker:
    supplies the code that faulty replicas run
    knows the code the non-faulty replicas are running
    can read network messages
    cannot guess non-faulty replicas' cryptographic keys
    knows the faulty replicas' keys
    can force messages to be delayed

how can we possibly cope with Byzantine replicas?
  assume no more than some fraction of the replicas are faulty

the basic Byzantine fault tolerance approach
  replicated state machines
  ask them all to do each operation
  compare the answers from the replicas
  if enough agree, perhaps that's the correct answer

straw man 1: [client, n servers]
  n servers
  client sends request to all of them
  waits for all n to reply
  only proceeds if all n agree
  why might this not work well?
    a faulty replica can stop progress by disagreeing

straw man 2: let's have replicas vote
  2f+1 servers, assume no more than f are faulty
  if client gets f+1 matching replies, success
  otherwise, failure
  why might this not work well?
    client can't wait for replies from the last f replicas
      they might be faulty, never going to reply
    so it must be able to make a decision after n-f replies, i.e. f+1 if n=2f+1
    but f of the first f+1 replies might be from faulty replicas!
      i.e. f+1 replies are not enough to vote
    also waiting for f+1 of 2f+1 doesn't ensure that a majority of good nodes executed

straw man 3:
  3f+1 servers, of which at most f are faulty
  client waits for 2f+1 replies
  client takes the majority of those 2f+1
    (see the client-side sketch below)
  why does this work? informally...
    client will get at least 2f+1 replies: all non-faulty replicas eventually respond
    at most f of those 2f+1 are faulty
    so at least f+1 of the replies are from non-faulty replicas, and they agree
    thus the majority of the 2f+1 replies is the correct answer

what about handling multiple clients?
  non-faulty replicas must process operations in the same order!
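before moving on to the primary-based design, here is a minimal sketch of straw man
3's client-side voting rule, in Go; the Reply type, the channel of replies, and the
string result are all hypothetical simplifications (real replies would be signed and
would identify the request):

  package main

  import "fmt"

  // Hypothetical reply: a replica's ID and the result it claims.
  type Reply struct {
  	ReplicaID int
  	Value     string
  }

  // awaitResult collects replies until 2f+1 distinct replicas have answered,
  // then returns any value reported by at least f+1 of them.
  func awaitResult(replies <-chan Reply, f int) (string, bool) {
  	counts := map[string]int{}
  	seen := map[int]bool{}
  	for len(seen) < 2*f+1 {
  		r := <-replies
  		if seen[r.ReplicaID] {
  			continue // ignore duplicate replies from the same replica
  		}
  		seen[r.ReplicaID] = true
  		counts[r.Value]++
  	}
  	for v, c := range counts {
  		if c >= f+1 {
  			return v, true
  		}
  	}
  	return "", false // cannot happen if at most f replicas are faulty
  }

  func main() {
  	f := 1 // 3f+1 = 4 replicas; wait for 2f+1 = 3 replies; need f+1 = 2 matches
  	ch := make(chan Reply, 4)
  	ch <- Reply{0, "granted"} // non-faulty
  	ch <- Reply{3, "denied"}  // faulty replica lies
  	ch <- Reply{1, "granted"} // non-faulty
  	if v, ok := awaitResult(ch, f); ok {
  		fmt.Println("result:", v) // prints "granted"
  	}
  }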
let's use a primary to choose the operation order
  the primary picks an order and assigns sequence numbers
  for now, assume the primary is non-faulty

straw man 4: with a primary
  primary picks a sequence #
  primary sends a *signed* message to each replica (including itself)
    from here on, every message is signed by its sender using public key crypto
  primary waits for 2f+1 replies
    f+1 must match
  primary replies to the client

  [timing diagram: REQ, MSG, REP, REP among client C and replicas 0, 1, 2, 3]

what if the primary is faulty?
  it might send different operations to different replicas
  or send them in different orders to different replicas
  or send a wrong result to the client
  or do nothing

general approach to handling a faulty primary
  clients notify the replicas of each operation, as well as the primary
  each replica watches the progress of each operation
  if no progress, force a view change -> new primary
  the view change must sort out the state of the last operation

can a replica execute an operation when it first receives it from the primary?
  no: maybe the primary gave different ops to different replicas
  if we execute before we're sure, we've wrecked the replica's state
  so we need a second round of messages
    to make sure all good replicas got the same op

straw man 5:
  3f+1 servers, one is the primary, f faulty, the primary might be faulty
  client sends request to the primary AND to each replica
  primary sends PRE-PREPARE(op, n) to the replicas
  each replica sends PREPARE(op, n) to all replicas
  if a replica gets PREPARE(op, n) from 2f+1 replicas (incl. itself)
    (with the same op and n)
    it executes the operation, possibly modifying state
    and sends a reply to the client
  client is happy when it gets f+1 matching replies

  [timing diagram: REQ, PRE-PREPARE, PREPARE, REPLY among client C and replicas 0, 1, 2, 3]

does this cope with the primary sending different ops to different replicas?
  yes: replicas won't see 2f+1 matching PREPAREs
  also handles the primary assigning different sequence #s
  result: the system will stop

does this cope with the primary lying to the client about the reply?
  yes: each replica sends a reply, not just the primary
  result: the system will stop

does this cope with the primary not sending operations to the replicas?
  yes: replicas receive the request from the client
    but don't see a PRE-PREPARE, or don't see PREPAREs
  result: the system will stop

how to resume operation after a faulty primary?
  need a view change to choose a new primary
  a view change does *not* choose a set of "live" servers
    2f+1 of 3f+1 deals with faulty servers

when to trigger a view change?
  if a replica sees a client op but doesn't see 2f+1 matching PREPAREs (after a timeout)

what if faulty replicas try to trigger a view change to shoot down a non-faulty primary?
  require view-change requests from 2f+1 replicas

who is the next primary?
  need to make sure faulty replicas can't always make themselves the next primary
  view number v
  primary is v mod (number of replicas)
  so the primary rotates among the servers
  at most f faulty primaries in a row

view change straw man (sketched in code below)
  replicas send VIEW-CHANGE requests to the *new* primary
  new primary waits for 2f+1 view-change requests
  new primary announces the view change w/ NEW-VIEW
    includes the 2f+1 VIEW-CHANGE requests
    as proof that enough replicas wanted to change views
  new primary starts numbering operations at the last n it saw + 1

will all non-faulty replicas agree about operation numbering across a view change?
  problem:
    I saw 2f+1 PREPAREs for operation n, so I executed it
    the new primary did not, so it did not execute it
      maybe the new primary didn't even see the PRE-PREPARE for operation n,
      since the old primary only waited for 2f+1 replies
      or the old primary may never have sent PRE-PREPARE to the next primary
    thus the new primary may start numbering at n, yielding two different ops #n
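for concreteness, here is a minimal Go sketch (with hypothetical names and types) of
the view-change straw man from the new primary's side: primary rotation by view number,
waiting for 2f+1 VIEW-CHANGE requests, and the naive resume-numbering rule that the
problem above breaks; real VIEW-CHANGE and NEW-VIEW messages are signed, and the full
protocol below also carries each sender's PREPARE sets:

  package main

  import "fmt"

  // Hypothetical VIEW-CHANGE request: the sender and the highest sequence
  // number it has seen.
  type ViewChange struct {
  	ReplicaID int
  	LastSeq   int
  }

  // nextPrimary rotates the primary with the view number, so at most f
  // faulty primaries can occur in a row.
  func nextPrimary(view, nReplicas int) int {
  	return view % nReplicas
  }

  // maybeNewView collects VIEW-CHANGE requests; once 2f+1 distinct replicas
  // have asked, the new primary can announce NEW-VIEW (carrying those requests
  // as proof) and resume numbering after the highest sequence number it saw.
  func maybeNewView(reqs []ViewChange, f int) (startSeq int, ok bool) {
  	seen := map[int]bool{}
  	for _, vc := range reqs {
  		if !seen[vc.ReplicaID] {
  			seen[vc.ReplicaID] = true
  			if vc.LastSeq > startSeq {
  				startSeq = vc.LastSeq
  			}
  		}
  	}
  	if len(seen) < 2*f+1 {
  		return 0, false // not enough evidence that a view change is warranted
  	}
  	return startSeq + 1, true
  }

  func main() {
  	f := 1
  	fmt.Println("primary for view 5 of 4 replicas:", nextPrimary(5, 3*f+1))
  	reqs := []ViewChange{{0, 10}, {1, 12}, {3, 11}}
  	if start, ok := maybeNewView(reqs, f); ok {
  		fmt.Println("new view starts numbering at", start) // prints 13
  	}
  }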
can the new primary ask all replicas for the set of operations they have executed?
  doesn't work: the new primary can only wait for 2f+1 replies
    faulty replicas may reply, so the new primary may not wait for me

solution: don't execute an operation until we're sure a new primary will hear about it
  add a third phase: PRE-PREPARE, PREPARE, then COMMIT
  only execute after COMMIT

operation protocol:
  client sends op to the primary
  primary sends PRE-PREPARE(op, n) to all
  all send PREPARE(op, n) to all
  after a replica receives 2f+1 matching PREPARE(op, n)
    it sends COMMIT(op, n) to all
  after receiving 2f+1 matching COMMIT(op, n)
    it executes op

view change:
  each replica sends the new primary 2f+1 PREPAREs for recent ops (if it has them)
  new primary waits for 2f+1 VIEW-CHANGE requests
  new primary sends a NEW-VIEW msg to all replicas with
    the complete set of VIEW-CHANGE msgs
    a list of every op for which some VIEW-CHANGE contained 2f+1 PREPAREs
      i.e. the list of final ops from the last view

informal argument why, if a replica executes an op, the new primary will learn of it
  the replica only executed after receiving 2f+1 COMMITs
  maybe f of those were lies, from faulty replicas, who won't tell the new primary
  but f+1 COMMITs were from replicas that got matching PREPAREs from 2f+1 replicas
  the new primary waits for VIEW-CHANGE requests from 2f+1 replicas
  (f+1) + (2f+1) = 3f+2 > 3f+1, so those two sets must overlap
  so at least 1 of those f+1 will report its PREPARE set to the new primary

can a replica trick the new primary into believing a manufactured op?
  no -- the replica must present 2f+1 PREPAREs for that op
  the PREPAREs must have the original signatures
  so the non-faulty nodes must have seen that op

can the new primary omit some of the reported recent operations?
  replicas must check that the primary announced the right set
    compare the NEW-VIEW against the PREPARE info in the VIEW-CHANGE msgs it includes

the paper also discusses
  checkpoints and logs to help good nodes recover
  various cryptographic optimizations
  optimizations to reduce the # of msgs in the common case
  fast read-only operations

what are the consequences of more than f corrupt servers?
  can the system recover?

suppose a single server has a fail-stop fault (e.g. it is powered off)
  can the system still survive f additional malicious faulty nodes?
  will it be live?
  can bad nodes trick it into executing incorrect operations?

what if the client is corrupt?

suppose we had a technique to provide Byzantine fault tolerance (BFT)
  would it be useful to apply it to a set of identical servers?
  how about non-identical servers run by the same organization?
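to tie the operation protocol together, here is a minimal Go sketch of a replica's
three-phase handling (PRE-PREPARE, PREPARE, COMMIT) as described above; it is a
hypothetical simplification, not the paper's implementation: no signatures, no view
numbers, no client replies, no view change, and "broadcast" is just a direct method
call on every replica:

  package main

  import "fmt"

  // Per-sequence-number bookkeeping for one operation.
  type entry struct {
  	op        string
  	prepares  map[int]bool // replica IDs that sent matching PREPAREs
  	commits   map[int]bool // replica IDs that sent matching COMMITs
  	committed bool
  	executed  bool
  }

  type Replica struct {
  	id    int
  	f     int
  	peers []*Replica // all replicas, including this one
  	log   map[int]*entry
  }

  func (r *Replica) slot(seq int, op string) *entry {
  	e, ok := r.log[seq]
  	if !ok {
  		e = &entry{op: op, prepares: map[int]bool{}, commits: map[int]bool{}}
  		r.log[seq] = e
  	}
  	return e
  }

  // PrePrepare: the primary's PRE-PREPARE(op, n) arrives; broadcast PREPARE to all.
  func (r *Replica) PrePrepare(seq int, op string) {
  	r.slot(seq, op)
  	for _, p := range r.peers {
  		p.Prepare(r.id, seq, op)
  	}
  }

  // Prepare: after 2f+1 matching PREPAREs (including our own), broadcast COMMIT.
  func (r *Replica) Prepare(from, seq int, op string) {
  	e := r.slot(seq, op)
  	if op != e.op {
  		return // mismatched op for this sequence number: ignore
  	}
  	e.prepares[from] = true
  	if len(e.prepares) >= 2*r.f+1 && !e.committed {
  		e.committed = true
  		for _, p := range r.peers {
  			p.Commit(r.id, seq, op)
  		}
  	}
  }

  // Commit: after 2f+1 matching COMMITs, execute the operation.
  func (r *Replica) Commit(from, seq int, op string) {
  	e := r.slot(seq, op)
  	if op != e.op {
  		return
  	}
  	e.commits[from] = true
  	if len(e.commits) >= 2*r.f+1 && !e.executed {
  		e.executed = true
  		fmt.Printf("replica %d executes %q at seq %d\n", r.id, op, seq)
  	}
  }

  func main() {
  	f := 1
  	var replicas []*Replica
  	for i := 0; i < 3*f+1; i++ {
  		replicas = append(replicas, &Replica{id: i, f: f, log: map[int]*entry{}})
  	}
  	for _, r := range replicas {
  		r.peers = replicas
  	}
  	// Simulate a non-faulty primary's PRE-PREPARE reaching every replica.
  	for _, r := range replicas {
  		r.PrePrepare(1, "grant(lock1)")
  	}
  }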