Another cool, but frequently overlooked, data structure in the tree family is called the B-tree. A B-tree is a search tree, very similar to a BST in concept, but optimized differently.
BSTs provide logarithmic time operations, where the performance
is fundamentally bounded by the number of comparisons. B-trees also
provide logarithmic performance with a logarithmic number of
comparisons – but the performance is worse by a constant factor. The difference is that B-trees are designed around a different tradeoff: the B-tree is designed to minimize the number of tree nodes that need to be examined, even if that comes at the cost of doing significantly more comparisons within each node.
Why would you design it that way? It’s a different performance tradeoff. The B-tree is a data structure designed not for use in memory, but instead for
use as a structure in hard storage, like a disk drive. B-trees are the
basic structure underlying most filesystems and databases: they provide
an efficient way of providing rapidly searchable stored structures. But
retrieving nodes from disk is very expensive. In comparison to
retrieving from disk, doing comparisons is very nearly free. So the design
goal for performance in a B-tree is to minimize disk access; and when disk access is necessary, to localize it as much as possible – to minimize the number of retrievals, and even more importantly, to minimize the number of nodes on disk that need to be updated when something is inserted.
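To see why node size matters so much, a little arithmetic helps: the number of levels – and so the number of disk reads per lookup – is logarithmic in the branching factor. The specific numbers below (a billion keys, 500 keys per node) are made up for illustration:

```python
import math

# Illustrative numbers, not from the post: how many node reads does a
# lookup need for a billion keys?
keys = 1_000_000_000

# A binary tree needs about log2(n) levels...
binary_height = math.ceil(math.log(keys, 2))   # 30

# ...while a B-tree whose nodes each hold 500 keys (a hypothetical
# order chosen so one node fills a disk block) needs about log_500(n).
btree_height = math.ceil(math.log(keys, 500))  # 4

print(binary_height, btree_height)
```

Thirty disk seeks versus four: the extra in-node comparisons are a bargain.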
An order-N B-tree is a search tree with the following properties:
- Each node contains no more than N children.
- Every node (except the root and the leaves) contains at least N/2 values.
- All leaves are empty nodes, and occur at the same level of the tree.
- Every node with K children contains K-1 keys.
- Given a node N1, with a K’th child node NK, the keys in NK are greater than the (K-1)’th key in N1, and less than the K’th key in N1.
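Those invariants can be sketched as a quick structural check. This is illustrative code, not from the post; the node layout (a sorted list of keys plus a list of children) is an assumption, and nodes with no children stand in for the bottom row, leaving the post's empty leaf nodes implicit:

```python
class BTreeNode:
    # One node: a sorted list of keys, with one child on each side of
    # every key, so len(children) == len(keys) + 1 for interior nodes.
    def __init__(self, keys, children=None):
        self.keys = keys
        self.children = children or []

def check_invariants(node, order):
    # No more than `order` children, hence at most order - 1 keys.
    assert len(node.keys) <= order - 1
    # Keys within a node are kept in sorted order.
    assert node.keys == sorted(node.keys)
    if node.children:
        # A node with K children contains K - 1 keys.
        assert len(node.children) == len(node.keys) + 1
        for i, child in enumerate(node.children):
            # The i'th child's keys lie strictly between the (i-1)'th
            # and i'th keys of this node.
            if i > 0:
                assert all(k > node.keys[i - 1] for k in child.keys)
            if i < len(node.keys):
                assert all(k < node.keys[i] for k in child.keys)
            check_invariants(child, order)

# Roughly the tree from the worked example later in the post (the
# rightmost child's values, 35 and 40, are made up for illustration):
root = BTreeNode([10, 30], [BTreeNode([3, 8, 9]),
                            BTreeNode([15, 18, 21]),
                            BTreeNode([35, 40])])
check_invariants(root, order=4)
```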
The brilliance of the B-tree is that it’s a structure that minimizes access to disk while maintaining a nearly perfect balance. It’s really an amazingly beautiful and elegant structure.
The key to it is the beautiful insert operation. The tree grows from the leaves – but it only adds levels at the root. That makes the balancing work – and work in a way that guarantees that you’ll be able to keep the tree in balance doing nothing but modifying nodes in a single linear chain from the leaf up to the root.
To insert a value, you start by picking the correct place to insert. The set of non-empty nodes at the bottom of the tree is guaranteed to be a sorted list of values,
running from the left-most node over to the right, in order.
You find the new value’s place in that list – and that gives you the insertion node. If the
insertion node is already full (it already holds the maximum of N-1 values), then you split the node. You take the
median value in the node, and use it as a pivot. You take the values smaller than the
pivot, and turn them into a new node named “L”; and the values larger than the pivot,
and make them into a new node named “R”. Then you move the pivot into the parent
of the insertion node, and make L its left child, and R its right. So the former leaf node is now two nodes, with one of its values promoted up into its parent.
If the parent node already had the maximum number of values, then you split it too, and so on. You keep bumping up the tree until you get to the root – if you need to split the root, you take the pivot from that split, and turn it into a new root with only one value. (That’s why the second invariant makes an exception for the root node.)
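The whole insert-and-split procedure can be sketched in a few dozen lines. This is a minimal illustration under a few assumptions not stated in the post: keys are distinct, `order` is the maximum number of children per node, and when a full node has an even number of keys we split on the lower of the two middle values (either middle value would preserve the invariants):

```python
import bisect

class Node:
    def __init__(self, keys=None, children=None):
        self.keys = keys or []          # sorted keys
        self.children = children or []  # empty for a bottom-level node

class BTree:
    def __init__(self, order):
        self.order = order              # max children per node
        self.root = Node()

    def insert(self, key):
        split = self._insert(self.root, key)
        if split:                       # the root itself split:
            pivot, right = split        # grow the tree by one level
            self.root = Node([pivot], [self.root, right])

    def _insert(self, node, key):
        i = bisect.bisect_left(node.keys, key)
        if not node.children:           # bottom level: insert here
            node.keys.insert(i, key)
        else:                           # descend into the i'th child
            split = self._insert(node.children[i], key)
            if split:                   # the child split: absorb its pivot
                pivot, right = split
                node.keys.insert(i, pivot)
                node.children.insert(i + 1, right)
        if len(node.keys) >= self.order:        # overflow: split this node
            mid = (len(node.keys) - 1) // 2     # lower median as pivot
            pivot = node.keys[mid]
            right = Node(node.keys[mid + 1:], node.children[mid + 1:])
            node.keys = node.keys[:mid]
            node.children = node.children[:mid + 1]
            return pivot, right         # hand the pivot up to the parent
        return None
```

Note that a split only ever touches the overflowing node, its new sibling, and its parent – the single leaf-to-root chain the post describes.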
As usual, examples help to understand. Look at the order-4 B-tree in the figure below.
In an order-4 B-tree, every node (except the root) must have at least two values. Suppose we wanted to insert the value “20”. That would go in the middle child node, between 18 and 21. But that would result in a node with four values, and the maximum
number of values per node is three. So we need to split it. We’ll pick 18 as the pivot. That will give us two new nodes – one with only one value (15), and one with two (20, 21). The pivot, 18, will be inserted into the parent, giving us a parent with the three values 10, 18, 30, and four children. The resulting B-tree is shown below.
Now, suppose we wanted to insert the value 7. That would split the leftmost child. Using 8 as a pivot, we’d get two new nodes: (3, 7) and (9), and we’d insert 8 into the root node. But that would result in the root node having four values. So we split the root node. The root pivot is 18, so we get two new nodes, (8, 10) and (30), and
a new root, containing only (18), as shown below.
Deletes are very similar, except that they involve contractions. When a node shrinks too small because of deletions, it gets merged with a neighbor; after the contraction, if the resulting merged node has too many values (for example, if the neighboring node was full before the merge), then you split the newly merged node. So you either end up removing a node from the tree, or redistributing the values between the two nodes in a balanced manner.
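The contraction step can be sketched in isolation. This is an illustrative helper with a hypothetical name and a dict-based node representation (neither is from the post), shown only for the simplest case, where the underflowing node and its right sibling are at the bottom level:

```python
def repair_underflow(parent, i, max_keys):
    """After a deletion leaves parent['children'][i] with too few keys,
    merge it with its right sibling through the separating key in the
    parent; if the merged run is too big, split it again -- which
    amounts to redistributing the values evenly between the two nodes.
    (A sketch for bottom-level children only.)"""
    left, right = parent['children'][i], parent['children'][i + 1]
    merged = left['keys'] + [parent['keys'][i]] + right['keys']
    if len(merged) <= max_keys:
        # True merge: the separator moves down, one node disappears.
        parent['children'][i:i + 2] = [{'keys': merged, 'children': []}]
        del parent['keys'][i]
    else:
        # Redistribute: split the merged run on its median; the
        # median becomes the new separator in the parent.
        mid = len(merged) // 2
        left['keys'] = merged[:mid]
        parent['keys'][i] = merged[mid]
        right['keys'] = merged[mid + 1:]
```

As with insert, everything happens on one short chain: the two siblings and their parent.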
As usual for my data-structure posts, there’ll be an example implementation in another post, to follow as soon as I have time to write it.