We use the B-link tree format (see [SAGIV]). Each leaf block contains some number of (key,value) pairs; each non-leaf (ie. index) block contains (key, down-pointer) pairs. Each block contains one additional key value, the SPLIT key, that is strictly greater than any other key in the block. The tree is augmented by chaining together the nodes of each level of the tree in split-key order (the NEXT pointers).
We allow the tree to be missing any index-insertion or block-deletion updates so long as certain key-order invariants are maintained:
Blocks are chained in order of increasing split key at each level, and all valid blocks appear in the chain.
Split keys are unique at each level.
Every DOWN pointer from level N+1 points to a block B at the next lower level N whose split-key S is less than or equal to the key S1 associated with the pointer.
If the split key for block B (at level N) is strictly less than the key associated with the ptr from level N+1 (S < S1), then it must be the case that
It is invalid for a down pointer to contain a key not present as a split key at the next level.
Note that a key in an index must match its block’s split key exactly; if a key K is less than the split-key S of the block B it points to, searches for intervening keys will be misdirected (to the next block); and if K is greater than S, then splits of the block after B at key values K’ where S<K’<K will be mis-inserted in B’s paren, because K’ should logically go AFTER K, but K’<K.
The notion of the span of an index entry is useful. We note that each block split can be thought of as an EXTENSION of some span at the next-higher level, while each PARENT-INSERT-UPDATE can be thought of as a corresponding span REDUCTION. A span that has only one block in it will be called FULLY REDUCED; a b-tree is fully reduced when all its spans are fully reduced, meaning that all pending/deferred insert-updates have been performed. Lastly, we can express rule C1 more succinctly in terms of spans:
SPANs must be well-formed (span must be closed; keys must match exactly)
The SPANS of the entries at any index must not overlap.
We originally had index nodes in (value,key) order because it made insertions simple, but abandoned that because it made all the insertion/deletion propagation code special-case. In fact, because of the asymmetry of INSERT and DELETE, either one or the other will require that a single operation sometimes modify two blocks (see Insertion method). However, using (key,value) ordering at all levels of the tree simplified the code considerably.