Category Archives: Data Structures

Mathematical Data Structures Part 1: Binary Heaps

Having a PhD in computer science, it’s only natural that I love data structures. In particular, I’m fascinated by the way that the math factors into how we structure data. Data structures sit at a beautiful intersection between information theory and algorithms: the way that a good data structure is built is a reflection of what information it really needs to maintain. The best data structure encodes exactly the information it needs in order to do its job – no more, and no less. The mathematical impacts of that are beautiful, and sometimes surprising. To try to demonstrate that, I’m going to take a couple of posts, and work my way through one of my favorite examples of a surprising outcome, in a structure called a Fibonacci heap.

A heap is a structure designed to solve a common problem. You’ve got a collection of objects, each of which has an associated numeric value. You want, at any time, to be able to find and remove the largest value in the collection, and to be able to add new elements to it. Those two operations are the core of the heap. Some variations also allow you to increase the value of objects inside the heap, or to remove values other than the maximum.

There are a lot of different ways to implement a heap. One obvious one is to just maintain a sorted sequence of objects. The problem with that is performance: some of the common operations are painfully slow!

Using the sorted sequence approach, removing the largest value is easy: you just remove the last element of the sequence. That’s very fast: it’s constant time. But you also need to be able to add values to the heap, and that’s not so good.

There are two basic ways of implementing a sequence: an array, or a linked list. In both cases, the performance isn’t acceptable. If we used an array, then in order to add a new object to the collection, we’d need to:

  1. Find the correct position for it in the array. We can do that by doing a binary search, which takes time O(lg n) where n is the length of the array. This step isn’t bad – in general, we’re pretty happy with O(lg n) operations.
  2. Insert the value into the array – which means shifting all of the elements that come after it one place to the right. That’s O(n) time, which is pretty crappy.
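
To make those costs concrete, here’s a quick sketch of the sorted-array insert in Python (my own illustration, using the standard bisect module – not code from a real heap implementation):

  import bisect

  def sorted_insert(seq, value):
    # Binary search for the insertion point: O(lg n).
    pos = bisect.bisect_left(seq, value)
    # Shifting everything after that point one slot right: O(n).
    seq.insert(pos, value)

  values = [0, 2, 3, 5, 7, 9]
  sorted_insert(values, 6)    # values is now [0, 2, 3, 5, 6, 7, 9]

The binary search is cheap; it’s the shifting that kills you.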

In the linked list approach, inserting the value isn’t a problem – it’s a constant time operation. But finding the position where it should be inserted takes linear time. So either way, we’re stuck with O(n) inserts – which is unacceptable.

The problem with the sorted sequence approach isn’t really the kind of structure we use to maintain the sorted list; the problem is that we’re maintaining more information than we need. At any time, we want to be able to find the largest element of the heap quickly – we don’t care about the relative order of any pair of values that doesn’t include the largest element of the collection. But if we keep a sorted list, then every time we insert an element, we spend a lot of time on comparisons whose results we don’t really care about.

To be able to make it faster, we need to build a data structure that doesn’t waste time and effort computing and maintaining information that we don’t want.

So our goal is to find ways of building structures that always let us both find the largest element quickly and add new elements quickly, without maintaining more information than is really necessary. We’ll start off with a simple but good version, and then work our way through to better ones.

The most basic implementation of a heap is called a binary heap. A binary heap is a binary tree with two key properties:

  1. Every node in the tree is larger than its children.
  2. The tree is left-full: every level of the tree is full except for the last; and the last level is filled in from left to right.

The left-full property might seem a bit strange, but it turns out to be pretty straightforward. A binary heap can be implemented using an array. The root node is stored in the first position of the array; its children are in positions 2 and 3; the children of node 2 are stored in positions 4 and 5; the children of node 3 are stored in positions 6 and 7. In general, using one-based indices, for any node N, its children are stored in positions 2N and 2N+1. Adding a new leaf to the tree can always be done by just appending one value to the array, and the left-full property just means that the new leaf always fills the leftmost open position in the last level of the tree – which is exactly what appending to the array gives you.
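
As a concrete illustration (a tiny Python sketch of my own, using the example heap that shows up later in this post), here’s the array layout, along with the zero-based index arithmetic that the code below uses:

  # The heap from the examples below, laid out as an array:
  #
  #   index:  0  1  2  3  4  5  6  7  8  9
  #   value:  9  8  7  5  6  2  1  4  0  3
  #
  # With zero-based indices, node i's children live at 2i+1 and 2i+2,
  # and its parent at (i - 1) / 2 (integer division).
  heap = [9, 8, 7, 5, 6, 2, 1, 4, 0, 3]

  def children(i):
    return (2 * i + 1, 2 * i + 2)

  def parent(i):
    return (i - 1) // 2

  assert children(1) == (3, 4)    # 8's children are 5 and 6
  assert heap[parent(9)] == 6     # 3's parent is 6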

Implementing a heap this way is simple:

  1. To get the maximum value, you just look at the first element of the array – O(1).
  2. To remove the largest element, you save the value from the first element of the array. Then you remove the last element from the array, put it into the first position, and bubble it down – swapping it with the larger of its children whenever that child is bigger than it. We’ll look at this in more detail, but the bubble-down process is O(lg n) in the worst case.
  3. Inserting a new element is done by adding it to the end of the array, and then bubbling up, by comparing it to its parent, and swapping if it’s bigger than its parent. Again, it’s O(lg n).

I’m going to show code for this. For fun, I wrote the code in a language called xtend. Xtend is a Java extension that cleans up the syntax, gets rid of semicolons, improves the type system, adds lambdas, and does a few other really neat things.

The whole beast is just a wrapper around an array:

class BinHeap<T extends Comparable<T>> {
  val ArrayList<T> _contents

  new() {
    _contents = new ArrayList<T>()
  }

  ...
}

If you know Java, this is mostly clear. In xtend, you write constructors using the name “new” instead of the name of the class being constructed.

Then we’ll set up some utilities to make other stuff easier to write.

  def leftChildPosition(int pos) {
    2 * (pos + 1) - 1
  }

  def rightChildPosition(int pos) {
    2 * (pos + 1)
  }

  def int parentPosition(int pos) {
    if (pos == 0) {
      throw new MaxHeapException()
    } else {
      (pos + 1)/ 2 - 1
    }
  }

  def void swap(int one, int two) {
    val T first = _contents.get(one)
    _contents.set(one, _contents.get(two))
    _contents.set(two, first)
  }

Again, these should be straightforward. The only tricky thing is that the JVM uses zero-based arrays – so the left child of the node in position N is (2*(N+1) - 1): we need to add one to the node number to shift to one-based position; and then subtract one from the result to switch back to zero-based position. We do a similar thing for each of the other position computations.

Now we can get to the interesting bits. How do we get values into the heap?

def insert(T v) {
  val idx = _contents.size()
  _contents.add(v)
  bubbleUp(idx)
}

Insert is exactly what I described in prose above: append the new value onto the end of the array, and then bubble it up. Bubbling is the interesting part:

  private def void bubbleUp(int pos) {
    if (pos > 0) {
      val parentPos = parentPosition(pos)
      if (_contents.get(pos) > _contents.get(parentPos)) {
        swap(pos, parentPos)
        bubbleUp(parentPos)
      }
    }
  }

Bubbling up from a position P compares P to its parent. If it’s bigger than its parent, it swaps positions with the parent, and then tries to continue bubbling up from its new position.

For example, imagine we had a tree like:

  9
    8
      5
        4
        0
      6
        3
    7
      2
      1

Now, suppose we wanted to add the value “10” to this. We’d add 10 to the end of the array, which would make it a child of 6. That would give us:

  9
    8
      5
        4
        0
      6
        3
        10
    7
      2
      1

So, we’d compare 10 to its parent – it’s bigger, so we’d swap:

  9
    8
      5
        4
        0
      10
        3
        6
    7
      2
      1

Then we’d compare 10 to its new parent, 8. It’s bigger, so we swap:

  9
    10
      5
        4
        0
      8
        3
        6
    7
      2
      1

And finally, we’d compare 10 to its new parent, 9. It’s bigger so we swap, and then we’re done.

  10
    9
      5
        4
        0
      8
        3
        6
    7
      2
      1

Appending to the end of the array is constant time, so the dominant time cost is the bubbling. The maximum possible number of swaps in the bubbling process is the depth of the tree – and the depth of a left-full binary tree with N members is \lfloor \lg N \rfloor. So it’s O(lg n) swaps, and the overall cost of an insert is O(lg n).
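
To spell out where that bound comes from: because the tree is left-full, a tree whose depth is d (counting edges from the root down to the deepest leaf) holds at least 2^d and at most 2^{d+1} - 1 values. So:

2^d \le n \le 2^{d+1} - 1 \iff d = \lfloor \lg n \rfloor

A value being bubbled up moves one level per swap, so it can make at most \lfloor \lg n \rfloor swaps.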

Getting the largest value is trivial:

def getMax() {
  _contents.get(0)
}

Removing the largest value is a lot like adding a value: again, the real work involves the last element of the array, followed by a bubbling process – only this time we’ll bubble in the opposite direction:

  def removeMax() {
    if (_contents.size == 0) {
      throw new MaxHeapException()
    } else {
      val result = getMax()
      val last = _contents.remove(_contents.size() - 1)
      if (_contents.size() > 0) {
        _contents.set(0, last)
        bubbleDown(0)
      }
      result
    }
  }

Bubbling down is similar to bubbling up, but it’s a bit more complicated, because we need to look at both children.

  private def void bubbleDown(int pos) {
    val rightChildPos = rightChildPosition(pos)
    val leftChildPos = leftChildPosition(pos)
    if (leftChildPos >= _contents.size) {
      return
    }
    // Try to bubble left if there is no right child, or if the left child is
    // bigger than the right.
    if (rightChildPos >= _contents.size || _contents.get(leftChildPos) > _contents.get(rightChildPos)) {
      if (_contents.get(pos) < _contents.get(leftChildPos)) {
        swap(pos, leftChildPos)
        bubbleDown(leftChildPos)
      }
    } else {
      // Try to bubble right.
      if (_contents.get(pos) < _contents.get(rightChildPos)) {
        swap(pos, rightChildPos)
        bubbleDown(rightChildPos)
      }
    }
  }

The process is almost the same as bubbling up, but moving in the opposite direction. We're starting with a parent node, and comparing it to its children. If it's smaller than either of its children, then we swap it with the larger child, and continue bubbling down from there.

For example, let's look at the same heap we looked at for insert:

  9
    8
      5
        4
        0
      6
        3
    7
      2
      1
  

If we want to remove 9, we set the value 9 aside, and then remove 3 from the end of the array, and put it at the root of the tree:

  3
    8
      5
        4
        0
      6
    7
      2
      1

Then we'd compare 3 against its two children, 8 and 7. Since 8 is the larger child, we swap 8 for 3:

  8
    3
      5
        4
        0
      6
    7
      2
      1

Now we compare 3 with its new children, 5 and 6. 6 is bigger, so we swap 6 with 3:

  8
    6
      5
        4
        0
      3
    7
      2
      1

3 has no children, so we're done: it's bubbled down as far as it can go.

Note: I messed up this example in the original version of the post. Thanks to John Armstrong for pointing it out.

The cost here is the same as insert, for the same reason. The dominant cost is the bubbling, and the bubbling is bounded by the depth of the tree. So removing the maximum is also O(lg n).

It's worth noting that heaps can be used to build a very reasonable sorting algorithm – this is essentially heapsort. To sort a collection, just insert all of the elements of the collection, and then remove them one by one. It's O(n lg n), and it's conceptually quite simple. It's not widely used, because the old classic quicksort is usually faster in practice – not in big-O terms, but because it ends up with a smaller constant. (In big-O notation, something that takes 3(lg n) steps and something that takes 6(lg n) steps are both O(lg n), but the one whose constant is 3 is still twice as fast as the one whose constant is 6.)
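
Here's a minimal sketch of that sorting idea in Python – using the standard library's heapq module, which implements a binary min-heap over a plain list (so it hands values back smallest-first, instead of the max-heap order used above):

  import heapq

  def heap_sort(values):
    # "Insert everything" is the heapify step: O(n).
    heap = list(values)
    heapq.heapify(heap)
    # Then remove the minimum n times, at O(lg n) each: O(n lg n) total.
    return [heapq.heappop(heap) for _ in range(len(heap))]

  print(heap_sort([9, 8, 7, 5, 6, 2, 1, 4, 0, 3]))
  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]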

Basic Data Structures: Hash Tables

I’m in the mood for a couple of basics posts. As long-time readers might know, I love writing about data structures.

One of the most important and fundamental structures is the hashtable. In fact, a lot of modern programming languages have left hashtables behind, for reasons I’ll discuss later. But if you want to understand data structures and algorithmic complexity, hashtables are one of the essentials.

A hashtable is a structure for keeping a list of (key, value) pairs, where you can look up a value using the key that’s associated with it. This kind of structure is frequently called a map, an associative array, or a dictionary.

For an example, think of a phonebook. You’ve got a collection of pairs (name, phone-number) that make up the phonebook. When you use the phonebook, what you do is look for a person’s name, and then use it to get their phone number.

A hashtable is one specific kind of structure that does this. I like to describe data structures in terms of some sort of schema: what are the basic operations that the structure supports, and what performance characteristics does it have for those operations.

In those schematic terms, a hashtable is very simple. It’s a structure that maintains a mapping from keys to values. A hashtable really only needs two operations: put and get:

  1. put(key, value): add a mapping from key to value to the table. If there’s already a mapping for the key, then replace it.
  2. get(key): get the value associated with the key.

In a hashtable, both of those operations are extremely fast.

Let’s think for a moment about the basic idea of a key-value map, and what kind of performance we could get out of a couple of simple, naive ways of implementing it.

We’ve got a list of names and phone numbers. We want to know how long it’ll take to find a particular name. How quickly can we do it?

How long does that take, naively? It depends on how many keys and values there are, and what properties the keys have that we can take advantage of.

In the worst case, there’s nothing to help us: the only thing we can do is take the key we’re looking for, and compare it to every single key. If we have 10 keys, then on average, we’ll need about 5 comparisons before we find the key we’re looking for. If there are 100 keys, it’ll take about 50. If there are one million keys, it’ll take an average of half a million comparisons before we can find the value.

If the keys are ordered – that is, if we can compare one key to another not just for equality, but we can ask which came first using a “less than or equal to” operator, then we can use binary search. With binary search, we can find an entry in a list of 10 elements in 4 steps. We can find an entry in a list of 1000 keys in 10 steps, or one in a list of one million keys in 20 steps.

With a hashtable, if things work right, in a table of 10 keys, it takes one step to find the key. 100 keys? 1 step. 1000 keys? 1 step. 1,000,000,000 keys? Still one step. That’s the point of a hashtable. It might be really hard to do something like generate a list of all of the keys – but if all you want to do is look things up, it’s lightning fast.

How can it do that? It’s a fairly simple trick: the hashtable trades space for time. A hashtable, under normal circumstances, uses a lot more space than most other ways of building a dictionary. It makes itself fast by using extra space in a clever way.

A hashtable creates a bunch of containers for (key, value) pairs called buckets. It creates many more buckets than the number of (key, value) pairs that it expects to store. When you want to insert a value into the table, it uses a special kind of function called a hash function on the key to decide which bucket to put the (key, value) pair into. When you want to look up the value associated with a key, it again uses the hash function on the key to find out which bucket to look in.

It’s easiest to understand by looking at some actual code. Here’s a simple, not at all realistic implementation of a hashtable in Python:

  class Hashtable(object):
    def __init__(self, hashfun, size):
      self._size = size
      self._hashfun = hashfun
      self._table = [[] for i in range(size)]

    def hash(self, key):
      return self._hashfun(key) % self._size

    def put(self, key, value):
      self._table[self.hash(key)].append((key, value))

    def get(self, key):
      for k,v in self._table[self.hash(key)]:
        if k == key:
          return v
      return None

If you’ve got a good hash function, and your hashtable is big enough, then each bucket will end up with no more than one value in it. So if you need to insert a value, you find an (empty) bucket using its hashcode, and dump it in: one step. If you need to find a value given its key, find the bucket using its hashcode, and return the value.
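
For example, here’s how you’d use the little class above, with Python’s built-in hash function standing in for a real hash function (the names and numbers are just made-up sample data):

  table = Hashtable(hash, 64)
  table.put("alice", "555-1234")
  table.put("bob", "555-9876")

  print(table.get("alice"))     # 555-1234
  print(table.get("nobody"))    # None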

There are two big problems with hashtables.

First, everything depends on the quality of your hash function. If your hash function maps a lot of values to the same bucket, then your performance is going to suck. In fact, in the worst case, it’s no better than just searching a randomly ordered list. Most of the time, you can come up with a hash function that does a pretty good job – but it’s a surprisingly tricky thing to get right.

Second, the table really needs to be big relative to the number of elements that you expect to store in it. If you set up a hashtable with 40 buckets, and you end up with 80 values stored in it, your performance isn’t going to be very good. (In fact, it’ll be slightly worse than just using a binary search tree.)

So what makes a good hash function? There are a bunch of things to consider:

  1. The hash function must be deterministic: calling the hash function on the same key must always produce the same result. If you’re writing a Python program like the one I used as an example above, and you compute the hash from the values of the key object’s fields, then changing those fields will change the hashcode – and the table will no longer be able to find the value!
  2. The hash function needs to focus on the parts of the key that distinguish between different keys, not on their similarities. To give a simple example, in some versions of Java, the default hash function for objects is based on the address of the object in memory. All objects are stored in locations whose address is divisible by 4 – so the last two bits are always zero. If you did something simple like just take the address modulo the table size, then all of the buckets whose numbers weren’t divisible by four would always be empty. That would be bad.
  3. The hash function needs to be uniform. That means that it needs to map roughly the same number of input values to each possible output value. To give you a sense of how important this is: I ran a test using 3125 randomly generated strings, using one really stupid hash function (xoring together the characters), and one really good one (djb2). I set up a small table, with 31 buckets, and inserted all of the value into it. With the xor hash function, there were several empty buckets, and the worst bucket had 625 values in it. With djb2, there were no empty buckets; the smallest bucket had 98 values, and the biggest one had 106.

So what’s a good hash function look like? Djb2, which I used in my test above, is based on integer arithmetic using the string values. It’s an interesting case, because no one is really entirely sure of exactly why it works better than similar functions, but be that as it may, we know that in practice, it works really well. It was invented by a guy named Dan Bernstein, who used to be a genius poster in comp.lang.c, when that was a big deal. Here’s djb2 in Python:

def djb2(key):
  hash = 5381
  for c in key:
    hash = (hash * 33) + ord(c)
  return hash

What the heck is it doing? Why 5381? Because it’s prime, and it works pretty well. Why 33? No clue.
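
If you want to see the difference for yourself, here’s the skeleton of the kind of test I described above (the xor hash is my reconstruction of the “really stupid” function, and the exact numbers you get will depend on how you generate the test strings):

  import random
  import string

  def xor_hash(key):
    # The "really stupid" hash: just xor the characters together.
    h = 0
    for c in key:
      h = h ^ ord(c)
    return h

  def bucket_sizes(hashfun, keys, buckets=31):
    counts = [0] * buckets
    for k in keys:
      counts[hashfun(k) % buckets] += 1
    return counts

  keys = [''.join(random.choices(string.ascii_lowercase, k=random.randint(5, 20)))
          for _ in range(3125)]
  for name, fn in [("xor", xor_hash), ("djb2", djb2)]:
    sizes = bucket_sizes(fn, keys)
    print(name, "smallest bucket:", min(sizes), "biggest bucket:", max(sizes))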

Towards the beginning of this post, I alluded to the fact that hashtables have, at least to some degree, fallen out of vogue. (For example, the standard ordered map types in several languages – like C++’s std::map – are implemented with balanced trees such as red-black trees, rather than hashtables.) Why?

In practice, it’s rarely much faster to use a hashtable than to use a balanced binary tree like a red-black tree. Balanced trees have better worst-case bounds, and they’re not as sensitive to the properties of the hash function. And they make it really easy to iterate over all of the keys in a collection in a predictable order, which makes them great for debugging purposes.

Of course, hashtables still get used constantly. The most commonly used data structures in Java code include, without a doubt, HashMap and HashSet, which are both built on hashtables. You usually don’t have to implement them yourself, and system libraries usually provide a good default hash function for strings, so you’re generally safe.

There’s also a lot of really fascinating research into designing ideal hash functions for various applications. If you know what your data will look like in advance, you can even build something called a perfect hash function, which guarantees no collisions. But that’s a subject for another time.

Introducing Algebraic Data Structures via Category Theory: Monoids

Since joining foursquare, I’ve been spending almost all of my time writing functional programs. At foursquare, we do all of our server programming in Scala, and we have a very strong bias towards writing our Scala code very functionally.

This has increased my interest in category theory. As a programming language geek, I’m obviously fascinated by data structures. And category theory provides a really interesting handle on a generic way of looking at data structures.

Historically (as much as that word can be used for anything in computer science), we’ve thought about data structures primarily in algorithmic and structural ways.

For example, binary trees. A binary tree consists of a collection of linked nodes. We can define the structure recursively really easily: a binary tree is a node, which contains pointers to at most two other binary trees.

In the functional programming world, people have started to think about things in algebraic ways. So instead of just defining data structures in terms of structure, we also think about them in very algebraic ways. That is, we think about structures in terms of how they behave, instead of how they’re built.

For example, there’s a structure called a monoid. A monoid is a very simple idea: it’s an algebraic structure with a set of values S, one binary operation *, and one value i in S which is an identity value for *. To be a monoid, these objects must satisfy some rules called the monoid laws:

  1. \forall s \in S: s * i = s, i * s = s
  2. \forall x, y, z \in S: (x * y) * z = x * (y * z)

There are some really obvious examples of monoids – like the set of integers with addition and 0 or integers with multiplication and 1. But there are many, many others.

Lists with concatenation and the empty list are a monoid: for any list,
l ++ [] == l, [] ++ l == l, and concatenation is associative.

Why should we care if data structures like these are monoids? Because we can write very general code in terms of the algebraic construction, and then use it over all of the different operations. Monoids provide the tools you need to build fold operations. Every kind of fold – that is, every operation that collapses a sequence of values into a single value – can be defined in terms of monoids. So you can write a fold operation that works on lists, strings, numbers, optional values, maps, and god-only-knows what else. Any data structure which is a monoid is a data structure with a meaningful fold operation: monoids encapsulate the requirements of foldability.
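
Here’s a small sketch of that idea in Python (the names are mine, just for illustration – in Scala or Haskell you’d express this with typeclasses, but here I’m simply passing the monoid around explicitly): a monoid is just an identity value plus an associative operation, and one generic fold works for every monoid.

  class Monoid(object):
    def __init__(self, identity, op):
      self.identity = identity
      self.op = op

  def fold(monoid, values):
    # Collapse a sequence into a single value using nothing but the
    # monoid's identity element and its associative operation.
    result = monoid.identity
    for v in values:
      result = monoid.op(result, v)
    return result

  int_add = Monoid(0, lambda a, b: a + b)
  int_mul = Monoid(1, lambda a, b: a * b)
  list_concat = Monoid([], lambda a, b: a + b)

  print(fold(int_add, [1, 2, 3, 4]))            # 10
  print(fold(int_mul, [1, 2, 3, 4]))            # 24
  print(fold(list_concat, [[1], [2, 3], []]))   # [1, 2, 3]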

And that’s where category theory comes in. Category theory provides a generic method for talking about algebraic structures like monoids. After all, what category theory does is provide a way of describing structures in terms of how their operations can be composed: that’s exactly what you want for talking about algebraic data structures.

The categorical construction of a monoid is, alas, pretty complicated. It’s a simple idea – but defining it solely in terms of the composition behavior of function-like objects does take a bit of effort. But it’s really worth it: when you see a monoidal category, it’s obvious what the elements are in terms of programming. And when we get to even more complicated structures, like monads, pullbacks, etc., the advantage will be even clearer.

A monoidal category is a category with a functor, where the functor has the basic properties of an algebraic monoid. So it’s a category C, paired with a bifunctor – that is, a two-argument functor ⊗:C×C→C. This bifunctor is the categorical version of the algebraic monoid’s binary operation; in this setting it’s called the tensor. To make it a monoidal category, we need to take the tensor operation and pin down the properties that it needs to have. They’re called its coherence conditions, and basically, they’re the properties that are needed to make the diagrams that we’re going to use commute.

So – the tensor functor is a bifunctor from C×C to C. There is also an object I∈C, which is called the unit object, which is basically the identity element of the monoid. As we would expect from the algebraic definition, the tensor functor has two basic properties: associativity, and identity.

Associativity is expressed categorically using a natural isomorphism, which we’ll name α. For any three objects X, Y, and Z, α includes a component αX,Y,Z (which I’ll label α(X,Y,Z) in diagrams, because subscripts in diagrams are a pain!), which is a mapping from (X⊗Y)⊗Z to X⊗(Y⊗Z). The natural isomorphism says, in categorical terms, that the two objects on either side of its mappings are equivalent.

The identity property is again expressed via natural isomorphisms. The category must include an object I (called the unit), and two natural isomorphisms, called λ and ρ. For any object X in C, λ and ρ contain components λX and ρX, such that λX maps from I⊗X to X, and ρX maps from X⊗I to X.

Now, all of the pieces that we need are on the table. All we need to do is explain how they all fit together – what kinds of properties these pieces need to have for this to work; that is, for these definitions to give us a structure that behaves like the algebraic notion of a monoid, but built in category theory. The properties are, more or less, exact correspondences with the associativity and identity requirements of the algebraic monoid. But with category theory, we can say it visually. The two diagrams below each describe one of the two properties.

[Figure: the two commutative diagrams that define a monoidal category – a pentagon for associativity, and a triangle for identity.]

The upper (pentagonal) diagram must commute for all A, B, C, and D. It describes the associativity property. Each arrow in the diagram is a component of the natural isomorphism over the category, and the diagram describes what it means for the natural isomorphism to define associativity.

Similarly, the bottom diagram defines identity. The arrows are all components of natural isomorphisms, and they describe the properties that the natural isomorphisms must have in order for them, together with the unit I to define identity.
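
For reference, the same conditions can be written equationally (this is just the standard textbook form of the two diagrams). The pentagon says that for all objects A, B, C, and D:

\alpha_{A,B,C \otimes D} \circ \alpha_{A \otimes B,C,D} = (1_A \otimes \alpha_{B,C,D}) \circ \alpha_{A,B \otimes C,D} \circ (\alpha_{A,B,C} \otimes 1_D)

And the triangle says that for all objects A and B:

(1_A \otimes \lambda_B) \circ \alpha_{A,I,B} = \rho_A \otimes 1_B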

Like I said, the categorical definition is a lot more complicated than the algebraic one. But look at the diagrams: you can see folding in them, in the chains of arrows in the commutative diagrams.

Finger Trees Done Right (I hope)

A while ago, I wrote a couple of posts that claimed to talk about finger trees. Unfortunately, I really botched it. I’d read a bunch of data structure papers, and managed to get myself thoroughly scrambled. What I wrote about was distantly related to finger trees, and it was useful for understanding how finger trees work – but it was not, in any way, shape, or form, actually a description of finger trees. Since then, I’ve been meaning to write a proper post explaining finger trees – but with the work on my book, and with chaos at work, I just haven’t had the time. This time, in order to do my damnedest to make sure that I don’t screw it up again, I’m basically going to describe finger trees over a couple of posts by walking through the best finger-tree paper that I could find. The paper is “Finger Trees: a simple general-purpose data structure”, by Ralf Hinze and Ross Paterson. This might be the paper that introduced the structure, but I’m not sure.

The point of finger trees is pretty simple. It’s very similar to the point of zippers. Programming in functional languages is terrific. As I’ve described before, there are a lot of advantages to writing functional code. But there are also a lot of places where a naive implementation of an algorithm using a functional data structure is dreadfully inefficient. Functional code may be prettier, more maintainable, and more reusable – but imperative code is frequently much more efficient. When you’re doing an operation that, conceptually, modifies a piece of a complex data structure, functional code can really suck. Finger trees give you a way around that – for many common updatable data structures, you can build finger-tree versions that are very close to, or fully as good as, their imperative, in-place-updating counterparts.

Continue reading

Zippers: Making Functional "Updates" Efficient

In the Haskell stuff, I was planning on moving on to some monad-related
stuff. But I had a reader write in, and ask me to write another
post on data structures, focusing on a structure called a
zipper.

A zipper is a remarkably clever idea. It’s not really a single data
structure, but rather a way of building data structures in functional
languages. The first mention of the structure seems to be a paper
by Gerard Huet in 1997, but as he says in the paper, it’s likely that this was
used before his paper in functional code — but no one thought to formalize it
and write it up. (In the original version of this post, I said the name of the guy who first wrote about zippers was “Carl Huet”. I have absolutely no idea where that came from – I literally had his paper on my lap as I wrote this post, and I still managed to screw up his name. My apologies!)

It also happens that zippers are one of the rare cases of data structures
where I think it’s not necessarily clearer to show code. The concept of
a zipper is very simple and elegant – but when you see a zippered tree
written out as a sequence of type constructors, it’s confusing, rather
than clarifying.

Continue reading

Finger Tree Update: I forgot something

As an alert commenter pointed out, I left out one really important thing in
my earlier post about finger trees. That’s what I get for trying to write when
I’m sick :-). Seriously, this is a point that’s implied by the post as it stands, but never explicitly stated – and since it’s really important, it
should be stated explicitly.

The monoid-annotated tree structure doesn’t replace the original
data structure: it’s superimposed on it.

So, as I said: cons-cell style lists are ubiquitous and beloved by
functional and non-functional programmers. In finger trees, you’re
not getting rid of them. The point of finger trees is to let
you keep the convenient data structure, with its basic operations
and properties intact, and to augment it with a tree that lets you search it efficiently.

To illustrate: here’s the number list example from yesterday’s post. The
original list is at the bottom, with green pointers representing the
original cons-list next-pointers. The monoid-annotated tree is on top,
with red pointers. The combination of the original list with the
monoid annotated tree is a finger tree.

[Figure: the counted finger tree – the original cons list along the bottom, with green next-pointers, and the monoid-annotated tree superimposed above it, with red pointers.]

The point of this is that you’ve still got your cons list. All of the
beautiful recursive/iterative algorithms that walk the list continue to
work exactly the way that they would in the traditional cons-list: for code that walks the list, the fact that there are finger-tree pointers
sitting on top of the list is irrelevant – and, in fact, completely invisible. For algorithms that want to search the list, the tree structure is there,
and allows the searches to be performed much more quickly than they could be on
the traditional list. The superposition of those two structures is the
genius of the finger tree.

Finally: Finger Trees!

For ages, I’ve been promising to write about finger trees. Finger trees are
an incredibly elegant and simple structure for implementing sequence-based
data structures. They’re primarily used in functional languages, but there’s nothing
stopping an imperative-language programmer from using them as well.

In functional programming languages, lists are incredibly common. They show up everywhere; they’re just so easy to use for recursive iteration that they’re ubiquitous. I can’t think of
any non-trivial program that I’ve written in Haskell, Lisp, or OCaml that doesn’t use lists. (Heck, I’ve wound up using cons-cell libraries a couple of times in C++.)

Of course, lists aren’t perfect. In fact, for some things, they absolutely suck. Suppose I’ve got a really long list – like a list of names and phone numbers from a telephone book. My local phone book has 600 pages of names; and the pages are incredibly dense – there’ve got to be at least a couple of hundred names per page. So in a phone book for a suburban region of New York City, the list of phone numbers will have 120,000 names. Now imagine that I want to look
up one name in that list – on average, I’m going to have to chase through 60,000 pointers. 60,000 isn’t a huge number for a computer, but still, chasing 60,000 pointers is a pretty big deal. If I stored them in an array, it would be a lot less convenient for some tasks, but I could use
binary search to find names, and I could find any name in no more than 17 comparisons – and the whole shebang would be contiguous, so I wouldn’t even be chasing pointers.

What finger trees do is give me a way of representing a list that has both the convenience
of the traditional cons list, and the search efficiency of the array based method.

Continue reading

Two-Three Trees: a different approach to balance

This post is very delayed, but things have been busy.

I’m working my way up to finger trees, which are a wonderful
functional data structure. They’re based on trees – and like many
tree-based structures, their performance relies heavily on the balance
of the tree. The more balanced the tree is, the better they perform.

In general, my preference when I need a balanced tree is a red-black
tree, which I wrote about before. But there are other kinds of balanced trees,
and for finger trees, many people prefer a kind of tree called a 2/3 tree.

A two-three tree is basically a B-tree with a maximum of two values per node.
If you don’t know what a B-tree is, don’t worry. I’m going to explain all the details.

A two-three tree is a tree where each node contains either one or two values. If
the node isn’t a leaf, then its number of children is one more than its number
of values. The basic value constraints are the same as in a binary search tree:
smaller values to the left, larger values to the right.

The big difference in the 2/3 tree is balance: it’s got a much stronger
balance requirement. In a 2/3 tree, every path from the tree root to the leaves
is exactly the same length.
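
Just to make the shape concrete, here’s a tiny sketch of the node structure in Python (my own illustration – it only captures the structural invariants, not the insertion and node-splitting logic, which is where all the interesting work happens):

  class Node23(object):
    """A 2/3-tree node: one or two values, and (for interior nodes)
    one more child than it has values."""
    def __init__(self, values, children=None):
      assert len(values) in (1, 2)
      self.values = sorted(values)
      self.children = children          # None for a leaf
      if children is not None:
        assert len(children) == len(values) + 1

  # A tiny tree holding 1..5, with every root-to-leaf path the same length:
  tree = Node23([2, 4], [Node23([1]), Node23([3]), Node23([5])])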

Continue reading

Gap Buffers, or, Don't Get Tied Up With Ropes?

Last week, I promised to continue my discussion of ropes. I’m going to
break that promise. But it’s in a good cause.

If you’re a decent engineer, one of the basic principles you should
follow is keeping things as simple as possible. One of the essential skills
for keeping things simple is realizing when you’re making something
overly complicated – even if it’s cool, and fun, and interesting.

Working out the rope code for more complex operations, I found myself
thinking that it really seemed complex for what I was doing. What I’m
doing is writing myself a text editor. I’m not writing a string
processing framework for managing gigabytes of data; I’m just writing a
text editor.

So I decided to try, for the purposes of testing, to look at one of
the simpler data structures. A classic structure for implementing editor
buffers is called a gap buffer. I’ll explain the details below.
The drawback of a gap buffer is that large cursor motions are relatively
expensive – the cost of moving the cursor is proportional to the distance
that you move it.
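
The idea is simple enough to sketch in a few lines. Here’s a toy Python version (just my illustration – not the Java code from the test I describe below): the buffer is a single array with a gap at the cursor position; inserting a character fills in one slot of the gap, and moving the cursor means copying characters from one side of the gap to the other.

  class GapBuffer(object):
    def __init__(self, capacity=16):
      self._buf = [''] * capacity
      self._gap_start = 0            # the cursor position
      self._gap_end = capacity       # one past the end of the gap

    def insert(self, ch):
      # Amortized constant time: just fill in one slot of the gap.
      if self._gap_start == self._gap_end:
        self._grow()
      self._buf[self._gap_start] = ch
      self._gap_start += 1

    def move_left(self):
      # Moving the cursor copies one character across the gap, so a big
      # cursor motion costs time proportional to the distance moved.
      if self._gap_start > 0:
        self._gap_start -= 1
        self._gap_end -= 1
        self._buf[self._gap_end] = self._buf[self._gap_start]

    def _grow(self):
      # Deliberately dumb resizing: double the buffer, keeping the text
      # before and after the gap in place.
      old = self._buf
      self._buf = old[:self._gap_start] + [''] * len(old) + old[self._gap_end:]
      self._gap_end = self._gap_start + len(old)

    def text(self):
      return ''.join(self._buf[:self._gap_start] + self._buf[self._gap_end:])

A move_right is just the mirror image, and deleting a character is just widening the gap.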

So I wrote a test. I implemented a basic gap buffer. Then I built a
test that did 300,000 inserts of five characters; then slid the cursor
from the end, to the beginning, back to the end, back to the beginning,
and back to the end again. Total execution time? 150 milliseconds. And
that includes resizing the buffer 12 times, because of a deliberately
stupid buffer sizing and copying strategy.

And with that, out the window went the ropes. In Java, a slapped-together,
sloppy piece of test code, doing things in a dumb brute-force way, handled a
workload far heavier than anything an editor will encounter in routine use –
and the gap buffer’s performance was still more than adequately fast. (Based
on some experiments I did back at IBM, delays of 1/10th of a second are
roughly when people start to notice that an editor is slow. If you can respond
in less than 1/10th of a second, people don’t perceive a troublesome delay.)
If the gap buffer can achieve the needed performance that easily, then there’s
absolutely no reason to do anything more complicated.

Continue reading

Ropes: Twining Together Strings for Editors

It’s been a while since I’ve written about any data structures. But it just so happens that I’m actually working on implementing a really interesting and broadly useful data structure right now, called a Rope.

A bit of background, to lead in. I’ve got this love-hate relationship with some of the development tools that Rob Pike has built. (Rob is one of the Unix guys from Bell Labs, and was one of the principal people involved in both the Plan9 and Inferno operating systems.) Rob has implemented some amazing development tools. The two that fascinate me are called Sam and Acme. The best and worst feature of both is a sort of extreme, elegant minimalism. There’s no bloat in Rob’s tools, no eye-candy, no redundancy. They’re built to do a job, and do it well – but not to do any more than their intended job. (This can be contrasted against Emacs, which is a text editor that’s grown into a virtual operating system.) The positive side of this is that they’re incredibly effective, and they demonstrate just how simple a programmer’s text editor should be. I’ve never used another tool that is more effective than Acme or Sam. In all seriousness, I can do more of my work more easily in Sam than I can in Emacs (which is my everyday editor). But on the other hand, that extreme minimalist aesthetic has the effect of strictly eliminating any overlaps: there’s one way to do things, and if you don’t like it, tough. In the case of Acme and Sam, that meant that you used the mouse for damn-near everything. You couldn’t even use the up and down arrows to move the cursor!

This is a non-starter for me. As I’ve mentioned once or twice, I’ve got terrible RSI in my wrists. I can’t use the mouse that much. I like to keep my hands on my keyboard. I don’t mind using the mouse when it’s appropriate, but moving my hand from the keyboard to the mouse every time I want to move the cursor? No. No damned way. Just writing this much of this post, I would have had to go back and forth between the keyboard and mouse over 50 times. (I was counting, but gave up when I hit 50.) A full day of that, and I’d be in serious pain.

I recently got reminded of Acme, because my new project at work involves using a programming language developed by Rob Pike. And Acme would really be incredibly useful for my new project. But I can’t use it. So I decided to bite the bullet, and use my free time to put together an Acme-like tool. (Most of the pieces that you need for a prototype of a tool like that are available as open-source components, so it’s just a matter of assembling them. Still a very non-trivial task, but a possible one.)

Which, finally, leads us back to today’s data structure. The fundamental piece of a text editor is the data structure that you use to represent the text that you’re editing. For simplicity, I’m going to use Emacs terminology, and refer to the editable contents of a file as a Buffer.

How do you represent a buffer?

As usual with data structures, you start by asking: What do I need it to do? What performance characteristics are important?

In the case of a text buffer, you can get by with a fairly small set of basic operations:

  • Fast concatenation: concatenating blocks of text needs to be really fast.
  • Fast insert: given a point in a block of text, you need to be able to quickly insert text at that point.
  • Fast delete: given two points in a block of text, you need to be able to quickly remove the text between those points.
  • Reasonably fast traversal: Lots of algorithms, ranging from printing out the text to searching it, are based on linear traversals of the contents. This doesn’t have to be incredibly fast – it’s an intrinsically linear process, and it’s usually done in the context of something with a non-trivial cost (I/O, regular-expression scanning). But you can’t afford to make it expensive.
  • Size: you need to be able to store effectively unlimited amounts of text, without significant performance degradation in the operations described above.

Continue reading