Note: These lecture notes were slightly modified from the ones posted on the 6.824 course website from Spring 2015.
Consistency so far:
Sequential and release consistency are slow:
Paxos:
A simple example -- optimistic peer-to-peer chat
Diagram:
m0 m1 m2
\ /\ /\
\------------/ /
\ /
\------------------------/
Do we care about message ordering for chat?
What went wrong in this example?
Suppose this is an auction chat program:
Joe Fred Alice
$10 -->
20
<-- -->
<-- winner is $20
If there were a 4th person, Sam:
Joe Fred Alice Sam
$10 --> sees $10
20
<-- --> does not see $20
<-- winner is $20 --> sees winner is $20
So to Sam this might not make sense. His problem is that Sam didn't know what Alice knew when she sent her message.
Definition: x
causally precedes y
x
precedes y
if:
x
, then M0 does y
x
, M0 sends msg to M1, M1 does y
x
and y
are generally writes, or msgs, or file versionsy
causally depends on x
"Definition: causal consistency
x
causally precedes y
, everyone sees x
before y
Pros, cons:
m
, send current history set, toom
until has received everything in m
's setHistory sets will grow huge -- can we abbreviate?
VT[i]=x =>
sender had seen all msgs from node i
up through #x
VT comparisons
a
and b
denote the VTs associated with msgs A
and B
a < b
or are they concurrent a || b
)a < b, a || b
a < b
if two conditions hold:
i
:
a[i] <= b[i]
a
summarizes a proper prefix of b
b
's sender and a
's sender have both seen the same # of messages from host i
b
's sender has seen more recent message from host i
than a
's sender has seenj, s.t. a[j] < b[j]
a
causally precedes b
b
's sender has definitely seen more recent message from host i
than a
's sender has seena || b
if:
a[i] < b[i]
and a[j] > b[j]
Many systems use VT variants, but for somewhat different purposes
x
preceded event y
a
causally precedes b
, CBCAST delivers a
first[diagram: node, msg buf, VC, chat app]
APP ^
| |
-----|------|-----------
\ / | CBCAST
.
--------- vector
| m3 | clock
--------- VT
| wait |
---------
| m1 |
VC
VCi[j] = k
means node i
has seen all msgs from j
up through message k
send(m)
at node i
:
VCi[i] += 1
broadcast(m, i, VCi)
receive(m, i, mv)
at node j
:
j
's CBCAST library buffers the messagemv <= VCj
, except mv[i] = VCj[i] + 1
j
has seen every msg that causally precedes m
VCj[i] = mv[i]
m
Code:
on receive(message m, node i, timestamp v):
release when:
this node's vector clock VT >= v EXCEPT FOR v[i] = VT[i] + 1
Example:
All VCs start <0,0,0>
M0 sends msg1 w/ <1,0,0>
M1 receives msg1 w/ <1,0,0>
M1 sends msg2 w/ <1,1,0>
M2 receives msg2 w/ <1,1,0> -- must delay because don't have msg1
M2 receives msg1 w/ <1,0,0> -- can process, unblocks other msg
Why fast?
M0 sends <1,0>
M1 sends <0,1>
Causal consistency still allows more surprises than sequential
=>
if CBCAST present x
and then y
that does not imply x
happened before y
necessarilyTreadMarks uses VTs to order writes to same variable by different machines:
M0: a1 x=1 r1 a2 y=9 r2
M1: a1 x=2 r1
M2: a1 a2 z=x+y r2 r1
Could M2 hear x=2 from M1, then x=1 from M0?
How does M2 know what to do?
VTs are often used for optimistic updating of replicated data
Scenario:
H1
for a while, sync changes to H2
H2
, sync changes to H3
H3
, sync to H1
Constraint: No Lost Updates
x2
over version x1
if
x2
includes all updates that are in x1
.Example 1:
Focus on a single file
H1: f=1 \----------\
H2: \-> f=2 \ /--> ???
H3: \-> tell H2 --/
What is the right thing to do?
Is it enough to simply take file with latest modification time?
Yes in this case, as long as you carry them along correctly.
I.e. H3 remembers mtime assigned by H1, not mtime of sync.
Example 2:
mtime = 10 | mtime = 20 | mtime = 25
H1: f=1 --\ f=2 /-->
H2: \--> f=0 --/
H3:
H2's mtime will be bigger.
Should the file synchronizer use "0" and discard "2"?
No! They were conflicting changes. We need to detect this case.
Modification times are not enough by themselves
What if there were concurrent updates?
How to decide if one version contains all of another's updates?
H2: H1/T1,H2/T2 H3: H1/T1
H1: H1/T1,H1/T2 H2: H1/T1,H2/T3
x
supersedes version y
:
y
's history is a prefix of x
's history.We can use VTs to compress these histories!
VT[i]=x
=> file version includes all of host i
's updates through #x
VTs for Example 1:
v1=<1,0,0>
v2=<1,1,0>
v1 < v2
, so H2 ignores H3's copy (no conflict since <
)v2 > v1
, so H1/H3 would accept H2's copy (again no conflict)VTs for Example 2:
v1=<1,0,0>
v2=<2,0,0>
v3=<1,1,0>
<
nor >
v1
What if there are conflicting updates?
How to think about VTs for file synchronization?
What about file deletion?
H1: f=1 ->H2
H2: del ->H1
H1:<1,0> H2<1,1>
, so delete wins at H1H1: f=1 ->H2 f=2
H2: del ->H1
H1:<2,0> vs H2:<1,1> -- conflict
How to delete the VTs of deleted files?
Is it enough to wait until all hosts have seen the delete msg?
"Wait until everyone has seen delete" doesn't work:
H1: ->H3 forget
H2: f=1 ->H1,H3 del,seen ->H1 ->H1
H3: seen ->H1
H2 needs to re-tell H1 about f, deletion, and f's VT
Diagram:
| Phase 1 | Phase 2 | Phase 3 (forget f's VT)
H1: del f \ | seen f -\-> | done f -\-> |
H2: \--> | seen f -/-> (bcast) | done f -/-> (bcast) |
H3: |--> | seen f -\-> | done f -\-> |
Working VT GC scheme from Ficus replicated file system
A classic problem with VTs:
Many file synchronizers don't use VTs -- e.g. Unison, rsync