The basic strategy is to allow a key insert/delete that causes a block to split/become empty to complete, but to then queue up the parent-update/block delete on some process that will retry deferred updates in the background until they succeed.
The problem is that deletes contain two separate operations that can wait: the parent update and the predecessor update. The two updates can be done separately if the block is first marked "dead" so that no insertions into it can occur. Unfortunately, there is a class of rare and complex screw cases involving the correct ordering of deletes and inserts that happen to involve the same key. One might for example delete a block, deleting its key, and then subsequent splits can reintroduce and re-delete it. If operations at the index level are deferred, the ordering of these deferred operations determines whether the resulting tree is correct.
Making block-delete atomic (as defined above) greatly simplifies this process. The block being inserted/deleted can then serve as its own semaphore.
We have decided to adopt a lazy update strategy. That is, rather than keeping around queues of pending parent-updates, we just throw them away if they can’t be executed immediately. We can get away with this because the updates for level N+1 can be reconstructed from the chain at level N.
Now, a deferred update only affects performance if the affected path is actually encountered: if we find we have to chain across two blocks, it means a parent update hasn’t occurred yet; if we encounter an empty block, it means a delete update hasn’t occurred. Our idea is to fix these "inefficiencies" on demand, that is, only when we run into one will we expend the effort to fix it. For the moment, assume ALL block deletes and insert updates are deferred.
The basic algorithm is:
One major concern has been that an I/O failure during a block delete can leave a "dead" block, that is, one which can’t be reached from its parent level. It is still in a chain but there is no down pointer to it and no search can terminate at that node.The problem is that once a block becomes dead we need to prevent it from being inserted into or restored into use, because that could result in entries being out-of-order (suppose that, while the block’s dead, some inserts of keys less than its split key occur. They’ll be directed into the next block by the index, and be posted there. If we then restore the block, the key ordering will be incorrect.) But it turns out that we can prevent this by observing which B-tree operations can encounter which types of deferred-update situation, as follows:
If we think of the tree as a set of nested SPANS, where the SPAN of an index entry is the set of entries in the blocks it SPANS, we note that during operations using search only - GET,PUT,REM – the locus of search stays strictly within the SPAN of some entry in the root. We enforce that a block not be deleted until its parent pointers are completely updates, that is, its pointer has a span of exactly one block. Now suppose a block delete fails halfway, leaving a empty block without any parent pointer. Such a block is unreachable by FIND-NODE and hence by INSERT! This means that if INSERT encounters an empty block, it must be valid to insert into it.
This has several interesting consequences:
[Having realized this, we can actually let block-deletes be two-part operations (again). The only additional complexity is that then we’d need to implement a method for detecting dead blocks, that is, differentiating them from empty blocks in the middle of large spans, which we mustn’t try to delete. We just check if the block is continued in some span! HOW TO DO THAT EFFICIENTLY?]
In detail, the method of detecting deferred operations goes like this:
Whenever we have to follow the NEXT pointer of a block B (in FIND-NODE), we should attempt a PARENT-INSERT-UPDATE using the (key,value) pair (split(B),next(B)). FIND-NODE needs to be fixed to chain right rather than down when the "dead zone" is encountered; and it should not attempt a PARENT-INSERT-UPDATE in this case.
PARENT-INSERT-UPDATE can fail if:
Whenever we reach an empty block in FIND-NODE or a NEXT/PREV operation, we’ll attempt a PARENT-DELETE-UPDATE.
PARENT-DELETE-UPDATE should fail when
In practice, we’ll want to trigger the update routines at the normal times as well, ie. try insert-updates after block splits and delete-updates after a block becomes empty.
There are a number of new statistics we should keep; these include:
(The number of chain-forwards is also a measure of the overhead.)
[Note: this mechanism also supports a simple queuing method: keep a list of the block numbers at which updates were deferred, and in times of low usage do FINDs to them, which will update them as a side-effect …]
[Other strategies were considered but were either significantly more complex or over-heady, or introduced unnecessary delays:
One hack is to serialize the queue of postponed index INSERT and DELETE operations by brute force: to do them in exactly the order they arrived in. Possibly simpler alternative method: sort by key and timestamp them!
I think we also decided that the interpreter could simply devote a process to each deferred operation, since we want to shift resources toward accomplishing the deferred operations if too many queue up.
[WANG] maintains a queue of deleted operation on the DESTINATION block. This has the disadvantage that whenever a block is split or merged the deferred-operation queues have to be split or merged as well – ugh!.]