6.824 2015 Lecture 4: "Flat Datacenter Storage" Case Study

Note: These lecture notes were slightly modified from the ones posted on the 6.824 course website from Spring 2015.

Flat datacenter storage

Flat Datacenter Storage, Nightingale, Elson, Fan, Hofmann, Howell, Suzue, OSDI 2012

Why are we looking at this paper?

What is FDS?

High-level design -- a common pattern

Why is this high-level design useful?

Motivating app: MapReduce-style sort

The abstract's main claims are about performance.

Q: Does the abstract's 2 GByte/sec per client seem impressive?

Q: The abstract claims recovery from a lost disk (92 GB) in 6.2 seconds

What should we want to know from the paper?

API

Q: Why are 128-bit blob IDs a nice interface? - Why not file names?

Q: Why do 8 MB tracts make sense? - (Figure 3...)

Q: What kinds of client applications is the API aimed at? - and not aimed at?

Layout: how do they spread data over the servers?

Example four-entry TLT with no replication:

  0: S1
  1: S2
  2: S3
  3: S4
  suppose hash(27) = 2
  then the tracts of blob 27 are laid out:
  S1: 2 6
  S2: 3 7
  S3: 0 4 8
  S4: 1 5 ...
  FDS is "striping" blobs over servers at tract granularity

Q: Why have tracts at all? Why not store each blob on just one server? - What kinds of apps will benefit from striping? - What kinds of apps won't?

Q: How fast will a client be able to read a single tract?

Q: Where does the abstract's single-client 2 GByte/sec number come from?

Q: Why not the UNIX i-node approach?

Q: Why not hash(b + t)?

Q: How many TLT entries should there be?

The system needs to choose server pairs (or triplets &c) to put in TLT entries

Q: How about:

   0: S1 S2
   1: S2 S1
   2: S3 S4
   3: S4 S3
   ...

Q: Why is the paper's n^2 scheme better?

Example:

   0: S1 S2
   1: S1 S3
   2: S1 S4
   3: S2 S1
   4: S2 S3
   5: S2 S4
   ...
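
A sketch of how such a table could be built, assuming (as the example
suggests) one entry per ordered pair of distinct servers; the function name
and types are made up for illustration.

  package main

  import "fmt"

  // buildTLT constructs a replication-level-2 tract locator table with one
  // entry per ordered pair of distinct servers, matching the example above:
  // n servers yield n*(n-1) entries, so when any one server fails, every
  // other server is its partner in some entry and can help re-replicate in
  // parallel. (Sketch only; for higher replication levels the paper adds
  // extra, randomly chosen servers to each entry.)
  func buildTLT(servers []string) [][]string {
    var tlt [][]string
    for _, a := range servers {
      for _, b := range servers {
        if a != b {
          tlt = append(tlt, []string{a, b})
        }
      }
    }
    return tlt
  }

  func main() {
    for i, entry := range buildTLT([]string{"S1", "S2", "S3", "S4"}) {
      fmt.Println(i, entry)
    }
  }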

Q: Why do they actually use a minimum replication level of 3?

Adding a tractserver

Extending a tract's size

How do they maintain the n^2-plus-one arrangement as servers leave and join?

Unclear.

Q: How long will adding a tractserver take?

Q: What about client writes while tracts are being transferred?

Q: What if a client reads/writes but has an old tract table?

Replication

Q: Why don't they send writes through a primary?

Q: What problems are they likely to have because of lack of primary?

What happens after a tractserver fails?

Example of the tracts each server holds:

  S1: 0 4 8 ...
  S2: 0 1 ...
  S3: 4 3 ...
  S4: 8 2 ...
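
A rough sketch, with made-up names, of the reaction this example implies when
S1 fails: the metadata server re-versions the table and fills S1's slot in
each affected entry with a randomly chosen live server, so the surviving
replicas (S2, S3, S4 above) each re-copy their share of S1's tracts in
parallel.

  package main

  import (
    "fmt"
    "math/rand"
  )

  // recoverFrom sketches the metadata server's reaction when a tractserver
  // fails: bump the table version (the paper versions individual entries; a
  // single counter keeps the sketch short) so clients holding a stale TLT
  // get rejected and refetch it, then substitute a randomly chosen live
  // server into every entry that listed the failed one. The surviving
  // replica in each such entry then copies its tracts to the newcomer, and
  // because the n^2 table spread the failed server's data over all the
  // others, those copies run in parallel. (A real implementation would avoid
  // picking a server already present in the entry.)
  func recoverFrom(failed string, tlt [][]string, alive []string, tltVersion *int) {
    *tltVersion++
    for _, entry := range tlt {
      for j, s := range entry {
        if s == failed {
          entry[j] = alive[rand.Intn(len(alive))]
        }
      }
    }
  }

  func main() {
    tlt := [][]string{{"S1", "S2"}, {"S3", "S1"}, {"S2", "S4"}}
    version := 1
    recoverFrom("S1", tlt, []string{"S2", "S3", "S4"}, &version)
    fmt.Println(version, tlt)
  }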

Q: Why not just pick one replacement server?

Q: How long will it take to copy all the tracts?

Q: If a tractserver's network breaks and is then repaired, might the server serve old data?

Q: If a server crashes and reboots with disk intact, can contents be used?

Q: When is it better to use 3.2.1's partial failure recovery?

What happens when the metadata server crashes?

Q: While metadata server is down, can the system proceed?

Q: Is there a backup metadata server?

Q: How does rebooted metadata server get a copy of the TLT?

Q: Does their scheme seem correct?

Random issues

Q: Is the metadata server likely to be a bottleneck?

Q: Why do they need the scrubber application mentioned in 2.3?

Performance

Q: How do we know we're seeing "good" performance? What's the best you can expect?

Q: What's the limiting resource for the 2 GByte/sec single-client number?

Q: Figure 4a: Why does it start low? Why does it go up? Why does it level off? Why at that particular level of performance?

Q: Figure 4b shows random r/w as fast as sequential (Figure 4a). Is this what you'd expect?

Q: Why are writes slower than reads with replication in Figure 4c?

Q: Where does the 92 GB in 6.2 seconds come from?
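
A back-of-the-envelope, assuming the roughly 1,000-disk cluster used for the
paper's largest recovery experiment:

   92 GB / 6.2 s ~ 15 GB/s of aggregate recovery traffic
   15 GB/s over ~1,000 disks ~ 15 MB/s of reads plus 15 MB/s of writes per disk

That is well under one disk's sequential bandwidth, so the plausibility comes
from the n^2 TLT spreading the failed disk's data (and its replacements) over
every other disk, not from any single fast disk or NIC.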

How big is each sort bucket?