Deferred Index Updates and concurrency (wb)

2.2.3 Deferred Index Updates and concurrency

The basic strategy is to allow a key insert/delete that causes a block to split/become empty to complete, but to then queue up the parent-update/block delete on some process that will retry deferred updates in the background until they succeed.

The problem is that deletes contain two separate operations that can wait: the parent update and the predecessor update. The two updates can be done separately if the block is first marked "dead" so that no insertions into it can occur. Unfortunately, there is a class of rare and complex screw cases involving the correct ordering of deletes and inserts that happen to involve the same key. One might for example delete a block, deleting its key, and then subsequent splits can reintroduce and re-delete it. If operations at the index level are deferred, the ordering of these deferred operations determines whether the resulting tree is correct.

Making block-delete atomic (as defined above) greatly simplifies this process. The block being inserted/deleted can then serve as its own semaphore.

We have decided to adopt a lazy update strategy. That is, rather than keeping around queues of pending parent-updates, we just throw them away if they can’t be executed immediately. We can get away with this because the updates for level N+1 can be reconstructed from the chain at level N.

Now, a deferred update only affects performance if the affected path is actually encountered: if we find we have to chain across two blocks, it means a parent update hasn’t occurred yet; if we encounter an empty block, it means a delete update hasn’t occurred. Our idea is to fix these "inefficiencies" on demand, that is, only when we run into one will we expend the effort to fix it. For the moment, assume ALL block deletes and insert updates are deferred.

The basic algorithm is:

If we have to chain forward in searching for a key, there must be pointer missing at the next level (deferred insert). So we attempt to insert it. Forward chaining to reach the NEXT and PREV of a key is normal and hence shouldn’t cause update attempts.
Whenever we reach an empty block, we attempt to delete it, except for the case where we are doing an insert which should go in that block. The delete attempt will fail UNLESS the relevant parent updates are complete, that is, if and only if the block is within a fully reduced span.

One major concern has been that an I/O failure during a block delete can leave a "dead" block, that is, one which can’t be reached from its parent level. It is still in a chain but there is no down pointer to it and no search can terminate at that node.The problem is that once a block becomes dead we need to prevent it from being inserted into or restored into use, because that could result in entries being out-of-order (suppose that, while the block’s dead, some inserts of keys less than its split key occur. They’ll be directed into the next block by the index, and be posted there. If we then restore the block, the key ordering will be incorrect.) But it turns out that we can prevent this by observing which B-tree operations can encounter which types of deferred-update situation, as follows:

If we think of the tree as a set of nested SPANS, where the SPAN of an index entry is the set of entries in the blocks it SPANS, we note that during operations using search only - GET,PUT,REM – the locus of search stays strictly within the SPAN of some entry in the root. We enforce that a block not be deleted until its parent pointers are completely updates, that is, its pointer has a span of exactly one block. Now suppose a block delete fails halfway, leaving a empty block without any parent pointer. Such a block is unreachable by FIND-NODE and hence by INSERT! This means that if INSERT encounters an empty block, it must be valid to insert into it.

This has several interesting consequences:

This means that only the operations that can possibly chain ACROSS spans have to worry about dead blocks: the NEXT and PREV operations. (What exactly is the effect on PREV?)
This means that "dead" blocks can’t be deleted (unless we look for them specially). But they (a) are only created by a rare kind of event, and (b) won’t hurt anything, so long as we arrange that NEXT and PREVs ignore the empty blocks (ie. ignore the value of their split keys).
The way we have defined deferred-delete reconstruction, if the block isn’t within a fully-reduced span, the delete simply can’t succeed. This means that there is no point in trying to delete a blank block when we’re chaining though a span, that is, we should only try to delete empty blocks encountered (1) when following a DOWN pointer in FIND-ENT, or (2) due to a NEXT operation (or PREV?).

[Having realized this, we can actually let block-deletes be two-part operations (again). The only additional complexity is that then we’d need to implement a method for detecting dead blocks, that is, differentiating them from empty blocks in the middle of large spans, which we mustn’t try to delete. We just check if the block is continued in some span! HOW TO DO THAT EFFICIENTLY?]

In detail, the method of detecting deferred operations goes like this:

RECONSTRUCTING DEFERRED PARENT-UPDATES (missing DOWN pointers):
Whenever we have to follow the NEXT pointer of a block B (in FIND-NODE), we should attempt a PARENT-INSERT-UPDATE using the (key,value) pair (split(B),next(B)). FIND-NODE needs to be fixed to chain right rather than down when the "dead zone" is encountered; and it should not attempt a PARENT-INSERT-UPDATE in this case.

PARENT-INSERT-UPDATE can fail if:
1. some other process is already doing the update (the first one to lock the parent wins, the other will fail out);
2. the key split(B) is already present (means a DELETE is pending – this shouldn’t really occur, though);
3. the update requires a block split but no blocks are available.
RECONSTRUCTING DEFERRED BLOCK DELETES:
Whenever we reach an empty block in FIND-NODE or a NEXT/PREV operation, we’ll attempt a PARENT-DELETE-UPDATE.

PARENT-DELETE-UPDATE should fail when
1. the key is found but points to a different block (meaning the containing span isn’t fully reduced);
2. the block is NAME-locked (this means its safe to leave name-locks around, the worst that can happen is they cause block deletes to be deferred some);
3. someone else is doing a PARENT-DELETE-UPDATE on this block (this can be guaranteed by having the delete process first name-lock the block being deleted;
4. some block needing to be modified (parent or prev) is locked.
We need to fix the code that does the NEXT-KEY operation to not stop at potentially-dead blocks. CHAIN-FIND need NOT ignore blank blocks.
In practice, we’ll want to trigger the update routines at the normal times as well, ie. try insert-updates after block splits and delete-updates after a block becomes empty.

There are a number of new statistics we should keep; these include:
- deferred PARENT-INSERT-UPDATEs (PUDs)
- insert-updates that succeed
- insert-updates that fail (measures overhead of the method)
- insert-update failure rate
- deferred block deletes
- deferred block deletes that succeed
- deferred block deletes that fail (measures overhead of the method)
- delete failure rate
- count of dead blocks found (requires extra work)
(The number of chain-forwards is also a measure of the overhead.)

[Note: this mechanism also supports a simple queuing method: keep a list of the block numbers at which updates were deferred, and in times of low usage do FINDs to them, which will update them as a side-effect …]

————

[Other strategies were considered but were either significantly more complex or over-heady, or introduced unnecessary delays:

One hack is to serialize the queue of postponed index INSERT and DELETE operations by brute force: to do them in exactly the order they arrived in. Possibly simpler alternative method: sort by key and timestamp them!

I think we also decided that the interpreter could simply devote a process to each deferred operation, since we want to shift resources toward accomplishing the deferred operations if too many queue up.

[WANG] maintains a queue of deleted operation on the DESTINATION block. This has the disadvantage that whenever a block is split or merged the deferred-operation queues have to be split or merged as well – ugh!.]