6.824 2015 Lecture 15: Spanner

Note: These lecture notes were slightly modified from the ones posted on the 6.824 course website from Spring 2015.

Intro

Spanner paper, OSDI 2012

Spanner and 'research'

Historical context

Bigtable paper, OSDI 2006

Why Spanner?

Megastore, started ca. 2006, built on top of Bigtable

Dremel, data analysis at Google, started ca. 2008

Transactions

Percolator, general purpose transactions

Spanner

TrueTime came along... (the story: they learned of someone in New York working on distributed clocks and realized it could be useful for their concurrency control)

Globally synchronized clocks

Were we wrong with Bigtable?

Yes, and no:

Imagine you are running a startup. What long-term issues can be postponed?

Startup dilemma:

What do you have the skill/ability/will/vision to do?

Interesting questions

Why has the Bigtable paper arguably had a bigger impact on both the research and technology communities?

Why do systems researchers insist on building scalable key-value stores (and not databases)?

Lessons

Lesson 0

Timing is everything. Except luck trumps timing.

You can't plan timing when the world is changing: design the best you can for the problems you have in front of you

TrueTime happened due to a fortuitous confluence of events and people (i.e., luck). Same with Bigtable. Spanner's initial design (before 2008) was nowhere near what Google has now: they had anti-luck until the project was restarted in 2008.

Lesson 1

Build what you need, and don't overdesign. Don't underdesign either, because you'll pay for it.

Lesson 2

Sometimes ignorance really is bliss. Or maybe luck.

If you have blinders on, you can't overreach. If we had known we needed a distributed replicated database with external consistency in 2004, we would have failed.

Lesson 3

Your userbase matters.

Wrap up

You can't buy luck. You can't plan for luck. But you can't ignore luck.

You can increase your chances of being lucky:

What does Spanner lack?

Maybe disconnected access: Can we build apps that use DBs and can operate offline?

See the disconnected-operation work in the Coda file system.

6.824 notes

Spanner: Google's Globally-Distributed Database, Corbett et al., OSDI 2012

Why this paper?

What are the big ideas?

This is a dense paper! I've tried to boil down some of the ideas to simpler form.

Sharding

Idea: sharding

Simplified sharding outline (lab 4):
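A minimal Go sketch of this kind of sharding, roughly in the lab-4 style. All names here (NShards, Config, key2shard, groupFor) are illustrative assumptions, not the lab's or Spanner's exact API:

```go
package sharding

import "hash/fnv"

// Sketch: a config service ("shard master") assigns shards to replica
// groups; clients map keys to shards and consult the current config to
// find the right group for a key.

const NShards = 10 // fixed number of shards (illustrative)

// Config is one numbered assignment of shards to groups; a larger Num
// is a newer configuration.
type Config struct {
	Num    int                // configuration number
	Shards [NShards]int64     // shard index -> replica group id (gid)
	Groups map[int64][]string // gid -> server addresses in that group
}

// key2shard maps a key to its shard.
func key2shard(key string) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() % NShards)
}

// groupFor returns the servers a client should contact for key,
// according to configuration cfg.
func groupFor(cfg Config, key string) []string {
	return cfg.Groups[cfg.Shards[key2shard(key)]]
}
```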

Q: What if a Put is concurrent w/ handoff?

Q: What if there's a failure during handoff? E.g., the old group thinks the shard has been handed off, but the new group fails before it thinks so.

Q: Can two groups think they are serving a shard?

Q: Could old group still serve shard if can't hear master?
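One hedged sketch of how the lab-4 style handles these handoff questions on the server side: every operation is checked against the configuration the group is currently operating under, and operations for shards the group no longer owns are rejected. This continues the hypothetical package above (reusing Config and key2shard) and is not Spanner's actual protocol:

```go
package sharding // same hypothetical package as the sketch above

type Err string

const (
	OK            Err = "OK"
	ErrWrongGroup Err = "ErrWrongGroup"
)

// Group is one replica group's state, simplified to a single node.
type Group struct {
	gid    int64             // this group's id
	config Config            // configuration the group is operating under
	data   map[string]string // key/value data for the shards it owns
}

// Put is assumed to run only after the operation has been ordered by
// the group's replication log (e.g. Paxos), so all replicas agree on
// which configuration was in force when it executes.
func (g *Group) Put(key, value string) Err {
	if g.config.Shards[key2shard(key)] != g.gid {
		// We don't own this key's shard in our current config
		// (e.g. it was just handed off); the client must fetch
		// the latest config and retry at the new group.
		return ErrWrongGroup
	}
	g.data[key] = value
	return OK
}
```

Ordering the check through the group's log ensures that, within a group, all replicas agree on exactly which operations were accepted before a handoff.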

Idea: wide-area synchronous replication

Considered impractical until a few years ago

What's changed?

Actual performance?

Spanner reads from any Paxos replica

Q: Could we write to just one replica?

Q: Is reading from any replica correct?

Example of problem: a photo-sharing site. An ACL (access control list) says who can see my photos. I remove my mom from the ACL, then add an embarrassing photo; mom's web server reads the data on her behalf.

Order of events:

  1. W1: I write ACL on group G1 (bare majority), then
  2. W2: I add image on G2 (bare majority), then
  3. mom reads image -- may get new data from G2
  4. mom reads ACL -- may get old data from lagging G1 replica

This system is not acting like a single server!

This problem is caused by a combination of:

How can we fix this?

  1. Make reads see latest data
  2. Make reads see consistent data

Here's a super-simplification of Spanner's consistency story for read-only clients

How does that fix our ACL/image example?

  1. W1: I write ACL, G1 assigns it time=10, then
  2. W2: I add image, G2 assigns it time=15 (> 10 since clocks agree)
  3. mom picks a time, for example t=14
  4. mom reads ACL at t=14 from a lagging G1 replica -- the replica waits until it has seen all writes through t=14, so she sees the ACL update (time=10 <= 14)
  5. mom reads image from G2 at t=14 -- she does not see the image, since its time=15 > 14; either way, the two reads are consistent
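A minimal Go sketch of the read side of this story: a multi-version replica serves a read "as of" a client-chosen timestamp, waiting first if it is lagging. The names (Replica, Apply, ReadAt) and single-node structure are illustrative assumptions, not Spanner's code; the replication layer that delivers committed writes is assumed, not shown:

```go
package snapshot

import "sync"

// Version is one committed value of a key, tagged with the commit
// timestamp of the write that produced it.
type Version struct {
	ts    uint64
	value string
}

// Replica stores multiple versions per key and tracks how far it has
// applied the (assumed) replication log, in timestamp order.
type Replica struct {
	mu          sync.Mutex
	applied     uint64               // every write with ts <= applied has been applied here
	appliedCond *sync.Cond           // signaled when applied advances
	versions    map[string][]Version // key -> versions in increasing ts order
}

func NewReplica() *Replica {
	r := &Replica{versions: make(map[string][]Version)}
	r.appliedCond = sync.NewCond(&r.mu)
	return r
}

// Apply installs a committed write; the replication layer is assumed
// to deliver writes in timestamp order.
func (r *Replica) Apply(key, value string, ts uint64) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.versions[key] = append(r.versions[key], Version{ts: ts, value: value})
	r.applied = ts
	r.appliedCond.Broadcast()
}

// ReadAt returns key's value as of timestamp t: it waits until this
// replica has seen every write with timestamp <= t, then returns the
// newest version whose timestamp is <= t.
func (r *Replica) ReadAt(key string, t uint64) (string, bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	for r.applied < t {
		r.appliedCond.Wait() // a lagging replica must catch up first
	}
	var val string
	var ok bool
	for _, v := range r.versions[key] {
		if v.ts > t {
			break
		}
		val, ok = v.value, true
	}
	return val, ok
}
```

The wait loop in ReadAt is the "catch up" delay discussed below.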

Q: Is it reasonable to assume that different computers' clocks agree?

Q: What may go wrong if servers' clocks don't agree?

A performance problem: reading client may pick time in the future, forcing reading replicas to wait to "catch up"

A correctness problem:

Sequence of events:

  1. W1: I write ACL on G1 -- stamped with time=15
  2. W2: I add image on G2 -- stamped with time=10

Now a client read at t=14 will see image but not ACL update

Q: Why doesn't Spanner just ensure that the clocks are all correct?

TrueTime (section 3)
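For reference, the TrueTime API (paper section 3) exposes now() as an interval [earliest, latest] guaranteed to contain the true absolute time, plus after() and before() helpers. A Go rendering of that interface; the uncertainty bound here is a made-up constant, whereas real TrueTime derives it from GPS receivers and atomic clocks:

```go
package truetime

import "time"

// TTInterval is the result of Now(): a range guaranteed (by the time
// infrastructure) to contain the true absolute time.
type TTInterval struct {
	Earliest time.Time
	Latest   time.Time
}

// epsilon stands in for the instantaneous uncertainty bound; the real
// system computes it from its time sources.
const epsilon = 5 * time.Millisecond

// Now returns an interval assumed to contain true time.
func Now() TTInterval {
	t := time.Now()
	return TTInterval{Earliest: t.Add(-epsilon), Latest: t.Add(epsilon)}
}

// After reports whether t has definitely passed.
func After(t time.Time) bool {
	return Now().Earliest.After(t)
}

// Before reports whether t has definitely not yet arrived.
func Before(t time.Time) bool {
	return Now().Latest.Before(t)
}
```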

Q: How does TrueTime choose the interval?

Q: Why are GPS time receivers able to avoid this problem?

Spanner assigns each write a scalar time

The danger:

So what we want is:

How does Spanner assign times to writes?

Does this work for our example?

Why the "Start" rule?

Why the "Commit Wait" rule?

Q: Why commit wait; why not immediately write the value with the chosen time?

Q: How long is the commit wait?

This answers today's question: a large TrueTime uncertainty requires a long commit wait, so the Spanner authors are interested in accurate, low-uncertainty time.
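A sketch of the Start and Commit Wait rules in Go, reusing Now and After from the TrueTime sketch above (same hypothetical package). commitWrite and its apply callback are illustrative names; in the paper the timestamp is chosen by the Paxos leader and the wait can overlap with replication:

```go
package truetime // same hypothetical package as the TrueTime sketch

import "time"

// commitWrite assigns a write its timestamp and delays making it
// visible, per the Start and Commit Wait rules (paper section 4.1.2).
// apply stands in for replication plus lock release.
func commitWrite(apply func(ts time.Time)) time.Time {
	// Start rule: the commit timestamp s is no less than
	// TT.now().latest when the commit begins. Combined with commit
	// wait, any write that finished before this one started has a
	// timestamp less than true time now, and hence less than s.
	s := Now().Latest

	// Commit Wait: don't make the write visible (or release its
	// locks) until s is guaranteed to be in the past, i.e. until
	// TT.after(s) holds. This is why a large TrueTime uncertainty
	// means a long wait.
	for !After(s) {
		time.Sleep(time.Millisecond)
	}

	apply(s) // now it is safe to expose the write at timestamp s
	return s
}
```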

Let's step back

Why is this timestamp technique interesting?

There's a lot of additional complexity in the paper