Another cool, but frequently overlooked, data structure in the tree family is called the B-tree. A B-tree is a search tree, very similar to a BST in concept, but optimized differently.
BSTs provide logarithmic time operations, where the performance is fundamentally bounded by the number of comparisons. B-trees also provide logarithmic performance with a logarithmic number of comparisons, though worse by a constant factor. The difference is that B-trees are designed around a different tradeoff: the B-tree is designed to minimize the number of tree nodes that need to be examined, even if that comes at the cost of doing significantly more comparisons.
Why would you design it that way?
The B-tree is a data structure designed not for use in memory, but instead for use as a structure in hard storage, like a disk drive. B-trees are the basic structure underlying most filesystems and databases: they provide an efficient way of building rapidly searchable stored structures. Retrieving nodes from disk is very expensive; in comparison to retrieving from disk, doing comparisons is very nearly free. So the performance design of a B-tree aims to minimize disk access, and when disk access is necessary, to localize it as much as possible: to minimize the number of retrievals, and even more importantly, to minimize the number of nodes on disk that need to be updated when something is inserted.
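To make that concrete, here's a minimal sketch of a B-tree search in Python. The names (`BTreeNode`, `fetch_node`) are my own inventions for illustration, not any real library's API; the point is that each expensive node fetch does a lot of work, while all the extra comparisons happen cheaply in memory:

```python
import bisect

class BTreeNode:
    """One B-tree node: a sorted list of keys, plus (for internal
    nodes) one child reference for each gap between keys."""
    def __init__(self, keys, children=None):
        self.keys = keys          # sorted keys stored in this node
        self.children = children  # None if this node is a leaf

def btree_search(node, key, fetch_node):
    """Search for `key` starting at `node`.

    `fetch_node` stands in for the expensive operation: reading a
    child node from disk. Everything else is cheap in-memory work."""
    while True:
        # Many comparisons, all in memory: locate `key` among this
        # node's (possibly hundreds of) keys.
        i = bisect.bisect_left(node.keys, key)
        if i < len(node.keys) and node.keys[i] == key:
            return (node, i)      # found in this node
        if node.children is None:
            return None           # reached a leaf; key isn't there
        # One expensive disk access per level of the tree.
        node = fetch_node(node.children[i])

# In-memory toy: children hold nodes directly, so fetching is a no-op.
leaf1 = BTreeNode([1, 3, 5])
leaf2 = BTreeNode([9, 11, 13])
root = BTreeNode([7], children=[leaf1, leaf2])
assert btree_search(root, 11, lambda child: child) == (leaf2, 1)
```

With a branching factor of, say, 1000, a tree holding a billion keys is only about three levels deep, so a search needs to fetch at most three nodes from disk.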
During my Haskell tutorial, I used balanced binary search trees as an example. I've had people write to me asking for a non-functional-programming treatment of them, because the Haskell one got too tied up in how to do things without assignments, and with tail recursion.
Binary search trees (BSTs) are another one of the really fundamental simple data structures. They're conceptually similar to heaps: they're also a kind of ordered binary tree with O(lg n) performance, but they've got a different set of basic invariants, reflecting a different set of performance goals.
The goal of a structure like a BST is to have a variable-size structure where you can cheaply and easily increase or decrease its size, and where inserting, deleting, and searching for values are all inexpensive. BSTs are a good choice for implementing things like dictionaries, where you have a key associated with a value, or where you want to be able to search for an arbitrary member quickly.
BSTs are very widely used: for example, the default ordered collections, sets, and maps in the C++ standard library are all implemented as a kind of BST (in fact, the very kind that I'm going to describe below).
If you have a meaningful ordering operation, then a BST is a good choice: expanding and contracting the structure happen automatically every time you do an insert or remove; and insert, delete, and search are all bounded by lg(n), where n is the number of values in the tree.
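To show how simple the core idea is, here's a minimal sketch in Python of the plain, unbalanced version; the rebalancing machinery needed to actually guarantee the lg(n) bound is left out:

```python
class Node:
    """One node of an unbalanced binary search tree, used as a
    key/value dictionary."""
    def __init__(self, key, value):
        self.key, self.value = key, value
        self.left = None    # subtree of keys smaller than self.key
        self.right = None   # subtree of keys larger than self.key

def insert(root, key, value):
    """Insert (or update) a key, returning the new root."""
    if root is None:
        return Node(key, value)
    if key < root.key:
        root.left = insert(root.left, key, value)
    elif key > root.key:
        root.right = insert(root.right, key, value)
    else:
        root.value = value  # key already present: replace the value
    return root

def search(root, key):
    """Return the value stored for `key`, or None if it's absent."""
    while root is not None:
        if key < root.key:
            root = root.left
        elif key > root.key:
            root = root.right
        else:
            return root.value
    return None

# Usage: build a tree, then look things up.
root = None
for k, v in [("banana", 2), ("apple", 1), ("cherry", 3)]:
    root = insert(root, k, v)
print(search(root, "apple"))   # 1
print(search(root, "durian"))  # None
```

The invariant doing all the work is the ordering: everything in a node's left subtree is smaller than its key, and everything in its right subtree is larger, so every comparison discards an entire subtree.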
Last post, I described the basic idea of the binary heap data structure. But I didn't talk about how to implement it, and there's a very cool way of implementing it: optimally space efficient, very fast, and with respectable cache locality for moderate-sized heaps.
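As a preview of the trick: the heap is stored in a flat array, with the children of the element at index i living at indices 2i+1 and 2i+2, so the tree needs no pointers at all. Here's a minimal max-heap sketch in Python:

```python
def sift_up(heap, i):
    """Restore the heap invariant after appending at index i."""
    while i > 0:
        parent = (i - 1) // 2
        if heap[i] <= heap[parent]:
            break
        heap[i], heap[parent] = heap[parent], heap[i]
        i = parent

def sift_down(heap, i):
    """Restore the heap invariant after replacing the value at index i."""
    n = len(heap)
    while True:
        largest = i
        for child in (2 * i + 1, 2 * i + 2):
            if child < n and heap[child] > heap[largest]:
                largest = child
        if largest == i:
            break
        heap[i], heap[largest] = heap[largest], heap[i]
        i = largest

def push(heap, value):
    heap.append(value)
    sift_up(heap, len(heap) - 1)

def pop_max(heap):
    """Remove and return the largest value in O(lg n) time."""
    top = heap[0]
    last = heap.pop()
    if heap:
        heap[0] = last
        sift_down(heap, 0)
    return top

# Quick check: push a few values, pop them back in descending order.
h = []
for v in (3, 1, 4, 1, 5, 9, 2, 6):
    push(h, v)
print([pop_max(h) for _ in range(len(h))])  # [9, 6, 5, 4, 3, 2, 1, 1]
```

Because a binary heap is a complete tree, the array has no holes, which is where both the space efficiency and the cache behavior come from.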
One of the most neglected data structures in the CS repertoire is
the heap. Unfortunately, the jargon is overloaded, so “heap” can mean
two different things – but there is a connection, which I’ll explain
in a little while.
In many programming languages, when you can dynamically create
objects in memory, the space from which you allocate the memory to
create them is called the heap. That is not what I’m talking about.
What I am talking about is an extremely simple but powerful structure that is used for maintaining a collection of values from which you want to be able to quickly remove the largest one. This may sound trivial, but it turns out to be an incredibly useful structure. You can use this basic structure to implement a very fast sorting algorithm; to maintain a prioritized set of values/tasks; to manage chunks of memory; etc. (The prioritized set is the most common case in my experience, but that's probably because I've spent so much of my career working on distributed systems.)
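For a concrete example of the prioritized-set case: Python's standard heapq module implements exactly this kind of array-backed heap. It's a min-heap rather than a max-heap, so the usual idiom for popping the highest-priority item first is to negate the priorities (the tasks here are made up, of course):

```python
import heapq

tasks = []  # the heap lives in an ordinary list

# heapq is a min-heap, so negate priorities to get
# "largest priority first" behavior.
heapq.heappush(tasks, (-9, "page the on-call engineer"))
heapq.heappush(tasks, (-1, "rotate the logs"))
heapq.heappush(tasks, (-5, "retry the failed request"))

while tasks:
    priority, task = heapq.heappop(tasks)
    print(-priority, task)  # prints priorities 9, 5, 1 in that order
```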