We use the B-link tree format (see [SAGIV]). Each leaf block contains some number of (key,value) pairs; each non-leaf (ie. index) block contains (key, down-pointer) pairs. Each block contains one additional key value, the SPLIT key, that is strictly greater than any other key in the block. The tree is augmented by chaining together the nodes of each level of the tree in split-key order (the NEXT pointers).
We allow the tree to be missing any index-insertion or block-deletion updates so long as certain key-order invariants are maintained:
Note that a key in an index must match its block's split key exactly; if a key K is less than the split-key S of the block B it points to, searches for intervening keys will be misdirected (to the next block); and if K is greater than S, then splits of the block after B at key values K' where S<K'<K will be mis-inserted in B's paren, because K' should logically go AFTER K, but K'<K.
The notion of the span of an index entry is useful. We note that each block split can be thought of as an EXTENSION of some span at the next-higher level, while each PARENT-INSERT-UPDATE can be thought of as a corresponding span REDUCTION. A span that has only one block in it will be called FULLY REDUCED; a b-tree is fully reduced when all its spans are fully reduced, meaning that all pending/deferred insert-updates have been performed. Lastly, we can express rule C1 more succinctly in terms of spans:
We originally had index nodes in (value,key) order because it made insertions simple, but abandoned that because it made all the insertion/deletion propagation code special-case. In fact, because of the asymmetry of INSERT and DELETE, either one or the other will require that a single operation sometimes modify two blocks (see Insertion method). However, using (key,value) ordering at all levels of the tree simplified the code considerably.