Note: These lecture notes were slightly modified from the ones posted on the 6.824 course website from Spring 2015.
Consistency so far:
Sequential and release consistency are slow:
Paxos:
A simple example -- optimistic peer-to-peer chat
Diagram:
m0 m1 m2
\ /\ /\
\------------/ /
\ /
\------------------------/
Do we care about message ordering for chat?
What went wrong in this example?
Suppose this is an auction chat program:
Joe Fred Alice
$10 -->
20
<-- -->
<-- winner is $20
If there were a 4th person, Sam:
Joe Fred Alice Sam
$10 --> sees $10
20
<-- --> does not see $20
<-- winner is $20 --> sees winner is $20
So to Sam this might not make sense. His problem is that Sam didn't know what Alice knew when she sent her message.
Definition: x causally precedes y
x precedes y if:
x, then M0 does yx, M0 sends msg to M1, M1 does yx and y are generally writes, or msgs, or file versionsy causally depends on x"Definition: causal consistency
x causally precedes y, everyone sees x before yPros, cons:
m, send current history set, toom until has received everything in m's setHistory sets will grow huge -- can we abbreviate?
VT[i]=x => sender had seen all msgs from node i up through #xVT comparisons
a and b denote the VTs associated with msgs A and Ba < b or are they concurrent a || b)a < b, a || ba < b if two conditions hold:
i:
a[i] <= b[i]
a summarizes a proper prefix of bb's sender and a's sender have both seen the same # of messages from host ib's sender has seen more recent message from host i than a's sender has seenj, s.t. a[j] < b[j]
a causally precedes b
b's sender has definitely seen more recent message from host i than a's sender has seena || b if:
a[i] < b[i] and a[j] > b[j]Many systems use VT variants, but for somewhat different purposes
x preceded event ya causally precedes b, CBCAST delivers a first[diagram: node, msg buf, VC, chat app]
APP ^
| |
-----|------|-----------
\ / | CBCAST
.
--------- vector
| m3 | clock
--------- VT
| wait |
---------
| m1 |
VC
VCi[j] = k means node i has seen all msgs from j up through message ksend(m) at node i:
VCi[i] += 1broadcast(m, i, VCi)receive(m, i, mv) at node j:
j's CBCAST library buffers the messagemv <= VCj, except mv[i] = VCj[i] + 1j has seen every msg that causally precedes m
VCj[i] = mv[i]mCode:
on receive(message m, node i, timestamp v):
release when:
this node's vector clock VT >= v EXCEPT FOR v[i] = VT[i] + 1
Example:
All VCs start <0,0,0>
M0 sends msg1 w/ <1,0,0>
M1 receives msg1 w/ <1,0,0>
M1 sends msg2 w/ <1,1,0>
M2 receives msg2 w/ <1,1,0> -- must delay because don't have msg1
M2 receives msg1 w/ <1,0,0> -- can process, unblocks other msg
Why fast?
M0 sends <1,0>M1 sends <0,1>Causal consistency still allows more surprises than sequential
=> if CBCAST present x and then y that does not imply x happened before y necessarilyTreadMarks uses VTs to order writes to same variable by different machines:
M0: a1 x=1 r1 a2 y=9 r2
M1: a1 x=2 r1
M2: a1 a2 z=x+y r2 r1
Could M2 hear x=2 from M1, then x=1 from M0?
How does M2 know what to do?
VTs are often used for optimistic updating of replicated data
Scenario:
H1 for a while, sync changes to H2H2, sync changes to H3H3, sync to H1Constraint: No Lost Updates
x2 over version x1 if
x2 includes all updates that are in x1.Example 1:
Focus on a single file
H1: f=1 \----------\
H2: \-> f=2 \ /--> ???
H3: \-> tell H2 --/
What is the right thing to do?
Is it enough to simply take file with latest modification time?
Yes in this case, as long as you carry them along correctly.
I.e. H3 remembers mtime assigned by H1, not mtime of sync.
Example 2:
mtime = 10 | mtime = 20 | mtime = 25
H1: f=1 --\ f=2 /-->
H2: \--> f=0 --/
H3:
H2's mtime will be bigger.
Should the file synchronizer use "0" and discard "2"?
No! They were conflicting changes. We need to detect this case.
Modification times are not enough by themselves
What if there were concurrent updates?
How to decide if one version contains all of another's updates?
H2: H1/T1,H2/T2 H3: H1/T1H1: H1/T1,H1/T2 H2: H1/T1,H2/T3x supersedes version y:
y's history is a prefix of x's history.We can use VTs to compress these histories!
VT[i]=x => file version includes all of host i's updates through #xVTs for Example 1:
v1=<1,0,0>v2=<1,1,0>v1 < v2, so H2 ignores H3's copy (no conflict since <)v2 > v1, so H1/H3 would accept H2's copy (again no conflict)VTs for Example 2:
v1=<1,0,0>v2=<2,0,0>v3=<1,1,0>< nor > v1
What if there are conflicting updates?
How to think about VTs for file synchronization?
What about file deletion?
H1: f=1 ->H2H2: del ->H1H1:<1,0> H2<1,1>, so delete wins at H1H1: f=1 ->H2 f=2H2: del ->H1H1:<2,0> vs H2:<1,1> -- conflictHow to delete the VTs of deleted files?
Is it enough to wait until all hosts have seen the delete msg?
"Wait until everyone has seen delete" doesn't work:
H1: ->H3 forgetH2: f=1 ->H1,H3 del,seen ->H1 ->H1H3: seen ->H1H2 needs to re-tell H1 about f, deletion, and f's VT
Diagram:
| Phase 1 | Phase 2 | Phase 3 (forget f's VT)
H1: del f \ | seen f -\-> | done f -\-> |
H2: \--> | seen f -/-> (bcast) | done f -/-> (bcast) |
H3: |--> | seen f -\-> | done f -\-> |
Working VT GC scheme from Ficus replicated file system
A classic problem with VTs:
Many file synchronizers don't use VTs -- e.g. Unison, rsync