6.824 2015 Lecture 17: PNUTS

Note: These lecture notes were slightly modified from the ones posted on the 6.824 course website from Spring 2015.

PNUTS

Diagram:

Region R1                        Region R2
---------                        ---------

 W1 Mesage broker                 W1 Message broker
 W2     (replicated)              W2     (replicated)
 W3                               W3
 ..         Tablet controller     ..         Tablet controller
                (replicated)                     (replicated)

    Router1 Router2 ...              Router1 Router2 ...     

    SU1 SU2 SU3 ...                  SU1 SU2 SU3 ...

Updates

Write semantics

Write order to single records

Name        Where       What
----        -----       ----
Alice       home        asleep
Bob

When would you care about stale data?

Reads vs. staleness

    read-any(key) -> fast read, just executes the read on the SU and does
                   not wait for any writes to propagate

    read-critical(key, ver) -> returns the read record where ver(record) >= ver
     - useful for reading your own writes
     - true when you have one webpage in a single tab
     - if you update your shopping cart in one tab, then the other tab
       will not be aware of that version number from the first tab

    read-latest(key) -> will always go to the master copy and read the latest
                      data there

Writes, atomic updates

Example: increment a counter in a record

    test-and-set-write(ver, key, newvalue) -> always gets sent to the master
        region for the key. look at the version and if it matches provided
        one then update the record with the new value

        // implementing v[k]++
        while true:
            (x, v) = read-latest(k)
            if test-and-set-write(k, v, x+1)
                break

Question of the day

Alice comes back from spring break and she:

Can her mom see her photos due to out-of-order writes?

If Alice has all the photos her mom can see in a single record, then no.

Alice   |   ACL     | List of photos
-------- ----------- ----------------
            mom         p7, p99

Assuming the code her mom is executing reads the full record (ACL + photos) when doing the check, and doesn't first read the ACL, wait a while and then read the photos

Failures

If webapp server fails in the middle of doing a bunch of writes, then only partial info would have been written to PNUTS, possibly leading to corruption.

If SU crashes and reboots, it can recover from disk and MB can keep retrying it

What happens when SU loses its disk? It needs to recover the data.

Performance

Evaluation mostly focuses on latency and not on throughput. Maybe this is specific to their needs.

Not clear how they can support millions of users with MBs that can only do hundreds of writes per second.

Why is it taking them 75ms to do a local update, where everyone is in the same region?

6.824 notes

Brian F. Cooper, Raghu Ramakrishnan, Utkarsh Srivastava, Adam Silberstein, Philip Bohannon, Hans-Arno Jacobsen, Nick Puz, Daniel Weaver and Ramana Yerneni. PNUTS: Yahoo!'s Hosted Data Serving Platform. Proceedings of VLDB, 2008.

Why this paper?

What is PNUTS' overall goal?

Diagram:

[world, browsers, data centers]

Overview

Diagram:

3 regions, browsers, web apps, tablet ctlrs, routers, storage units, MBs]

Why replicas of all data at multiple regions?

What are the drawbacks of a copy at each region?

What is the data and query model?

How do updates work?

PNUTS has a "record master" for each record

So the complete update story (some guesswork):

App wants to update some columns of a record, knows key

  1. app sends key and update to local SU1
  2. SU1 looks up record master for key: SI2
  3. SU1 sends update request to router at SI2
  4. router at SI2 forwards update to local SU2 for key
  5. SU2 sends update to local Message Broker (MB)
  6. MB stores on disk + backup MB, sends vers # to original app how does MB know the vers #? maybe SU2 told it or perhaps SU2 (not MB) replies to original app
  7. MB sends update to router at every region
  8. every region updates local copy

Puzzles:

All writes are multi-region and thus slow -- why does it make sense?

How does a read-only query execute?

What if app needs to increment a counter stored in a record?

test-and-set-write(version#, new value) gives you atomic update to one record - master rejects the write if current version # != version# - so if concurrent updates, one will lose and retry

TestAndSet example:

  while(1):
    (x, ver) = read-latest(k)
    if(t-a-s-w(k, ver, x+1))
      break

The Question

How to change record's master if no failures?

What if we wanted to do bank transfers?

Is lack of general transactions a problem for web applications?

What about tolerating failures?

App server crashes midway through a set of updates

SU down briefly, or network temporarily broken/lossy

SU loses disk contents, or doesn't automatically reboot

MB crashes after accepting update

MB is a neat idea

Record's master region loses network connection

Evaluation

Evaluation focuses on latency and scaling, not throughput

5.2: time for an insert while busy

What is 5.2 measuring? from what to what?

Why 75 ms?

Is it 75 ms of network speed-of-light delay?

Is the 75 ms mostly queuing, waiting for other client's operations?

End of 5.2 suggests 40 ms of 75 ms in in SU

But only 33 ms (not 75) for "ordered table" (MySQL/Innodb)

5.3 / Figure 3: effect of increasing request rate

Stepping back, what were PNUTS key design decisions?

  1. replication of all data at multiple regions
  2. relaxed consistency -- stale reads
  3. only single-row transactions w/ test-and-set-write
  4. sequence all writes thru master region

Next: Dynamo, a very different design