Advanced Haskell Data Structures: Red-Black Trees

[Image: unbalanced-trees.jpg – the same set of values inserted in two different orders, producing one well-balanced tree and one badly unbalanced tree.]

So, we’ve built up some pretty nifty binary trees – we can use the binary tree both as the basis of an implementation of a set, and as the basis of an implementation of a dictionary. But our implementation has had one major problem: it’s got absolutely no way to maintain balance.

What that means is that depending on the order in which things are inserted into the tree, we might get excellent performance, or we might do no better than a linear list. For example, look at the trees in the image above. A tree containing the same values can wind up looking quite different. With a good insert order, you can wind up with a nicely balanced tree: the minimum distance from root to leaf is 3, and the maximum is 4. Take the same values and insert them in a different order, and you get a rotten tree: the minimum distance from root to leaf is 1, and the maximum is 7. So depending on luck, you can get a tree that gives you good performance, or one that performs no better than a plain old list. Adding a bit of randomization can often give you reasonably good performance on average – but if you’re using a tree, it’s probably because O(n) complexity is just too high. You want the O(lg n) complexity that you get from a balanced binary tree – and not just sometimes.

To fix that, you need to change the structure a bit, so that as you insert things, the tree stays balanced. There are several different approaches to doing this. The one that we’re going to look at is based on labeling nodes in ways that allow you to very easily detect when a serious imbalance is developing, and then re-arrange the tree to re-balance it. There are two major versions of this, called the AVL tree and the red-black tree. We’re going to look at the red-black tree. Building a red-black tree is as much a lesson in data structures as it is in Haskell, but along with learning about the structure, we’ll see a lot about how to write code in Haskell, and particularly about how to use pattern-matching for complex structures.

We’ll start with a basic definition. A red-black tree is a normal binary search tree, except that each node is assigned a color, which is either red or black, and there are several invariant properties that must hold about the coloring of the tree:

  1. The root of the tree is always black.
  2. All branches of the tree end in a null leaf, which is black.
  3. All children of red nodes are black.
  4. For all nodes in the tree, all downward paths from the node to a leaf contain the same number of black nodes.

If these invariants are maintained, they guarantee that the tree is approximately balanced: for the entire tree, and for every subtree of it, the longest path from the root to a leaf is no more than twice the length of the shortest path from the root to a leaf.

We’ll start by writing the Haskell type declaration for a red/black tree. We’ll just do it as a tree of ordered values to keep things simple.

data Color = Red | Black deriving (Eq, Show)

data (Ord a) => RedBlackTree a = RBTNode a Color (RedBlackTree a) (RedBlackTree a)
           | RBTEmpty deriving (Eq,Show)

Also, for convenience, we’ll write a few accessor functions that we’ll use later on. Something interesting to note about these accessors is that most of them use non-exhaustive patterns: there are values of type RedBlackTree a for which they are undefined. If you call rbtLeftChild, rbtRightChild, or rbtValue on a tree whose value is RBTEmpty, you’ll get a runtime error. (rbtColor is the exception: it treats an empty tree as black, which matches the second invariant.)

It is, at the very least, considered bad style to write non-exhaustive functions. It’s actually a way of cheating the type system. You’re claiming that you’re writing a function from type T to type U, but in fact, there are values of T for which the function won’t work: the real type of the function is T’ -> U, where T’ is a subset of T. But you can’t say that in Haskell – so you’re cheating. To be more concrete, you’re writing functions like rbtLeftChild which claims that for any red-black tree passed to it, it will return a valid red-black tree. But in fact, that’s only true for the subset of red-black trees that were built with the RBTNode constructor; for other values, the function will fail.

The best solution to make it exhaustive would be to use the Maybe type to allow you to return a valid value for all trees passed as inputs. But that would make the code much more complex, unless we used monads – and we’re not ready for monads yet.
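
Just to illustrate the idea, here’s a sketch (my addition, not used in the rest of the code) of what a total version of one of the accessors could look like with Maybe:

-- Not part of the implementation below: a total accessor using Maybe.
rbtMaybeLeftChild :: (Ord a) => RedBlackTree a -> Maybe (RedBlackTree a)
rbtMaybeLeftChild (RBTNode _ _ l _) = Just l
rbtMaybeLeftChild RBTEmpty          = Nothing

Every caller would then have to pattern match on the Maybe before it could touch the child, which is exactly the extra complexity we’re avoiding for now.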

rbtLeftChild :: (Ord a) => RedBlackTree a -> RedBlackTree a
rbtLeftChild (RBTNode _ _ l _) = l

rbtRightChild :: (Ord a) => RedBlackTree a -> RedBlackTree a
rbtRightChild (RBTNode _ _ _ r) = r

rbtValue :: (Ord a) => RedBlackTree a -> a
rbtValue (RBTNode v _ _ _) =  v

rbtColor :: (Ord a) => RedBlackTree a -> Color
rbtColor (RBTNode _ c _ _) = c
rbtColor RBTEmpty = Black
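
Since we’ll want to check our work later, here’s one more addition of my own (not part of the original implementation): a function that tests whether a tree satisfies the invariants from the previous section, using nothing but the machinery we’ve already defined.

-- A validity check (an addition for testing, not part of the tree code itself):
-- the root must be black, every red node must have black children, and every
-- downward path from a node must pass through the same number of black nodes.
rbtValid :: (Ord a) => RedBlackTree a -> Bool
rbtValid t = rbtColor t == Black && redsOk t && blackHeight t /= Nothing
  where
    redsOk RBTEmpty = True
    redsOk (RBTNode _ Red l r) =
        rbtColor l == Black && rbtColor r == Black && redsOk l && redsOk r
    redsOk (RBTNode _ Black l r) = redsOk l && redsOk r
    -- blackHeight returns (Just n) if every downward path contains exactly n
    -- black nodes (counting the black nulls), and Nothing if the paths disagree.
    blackHeight RBTEmpty = Just (1 :: Int)
    blackHeight (RBTNode _ c l r) =
        case (blackHeight l, blackHeight r) of
            (Just lh, Just rh) | lh == rh ->
                Just (lh + (if c == Black then 1 else 0))
            _ -> Nothing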

Inserting data into the tree is where things get interesting. It starts off the same as how you insert into a typical BST: search for the correct position, and then insert the value as a new leaf node. But in a red-black tree, the new node needs a color. New nodes are always red – so you’re inserting a red node. Now you need to check to make sure that you’re not violating any of the tree invariants. If you are, then you need to fix it.

To keep things reasonably clean and separate, we’ll use the tail-calling version of tree insert, and then tail-call a rebalance function when the basic insert is complete. Rebalance will fix the balance of the tree, and do the tree re-assembly as it climbs up the tree.

rbtInsert :: (Ord a) => RedBlackTree a -> a -> RedBlackTree a
rbtRebalance :: (Ord a) => RedBlackTree a -> [RedBlackTree a] -> RedBlackTree a
--rbtRebalance               focus              ancestors

rbtInsert node v =
    rbtInsertTailCall node v []

rbtInsertTailCall node@(RBTNode v color left right) newval path
      | v > newval = rbtInsertTailCall left newval (node:path)
      | otherwise = rbtInsertTailCall right newval (node:path)
rbtInsertTailCall RBTEmpty v path =
     rbtRebalance (RBTNode v Red RBTEmpty RBTEmpty) path

As we rebalance the tree, there will be lots of places where we want to “rebuild” nodes to patch in the insertion change; as usual, we’ll separate that into its own function.

-- Reconstruct takes a child node and a parent node, and creates a replacement
-- for the parent node with the child in the appropriate position. It allows
-- the color of the new node to be specified.
reconstructNode node@(RBTNode v c l r) parent@(RBTNode pv pc pl pr) color =
   if (pv > v)
      then (RBTNode pv color node pr)
      else (RBTNode pv color pl node)

Now, we need to think about what we’re going to do to keep the tree balanced as we walk back up the insertion path fixing the tree. There are two things we can do to make the tree respect the invariants: we can re-color nodes, or we can pivot subtrees.

[Image: unbalanced-tree.jpg – a lopsided binary search tree, with one node to the left of the root and seven nodes to the right.]

Pivoting a tree is an interesting operation – it’s a process of swapping a node and one of its children to rotate a section of the tree. Suppose we have a binary search tree like the one in the diagram to the side. It’s poorly balanced; it’s got only one node to its left, but 7 nodes to its right. To correct this by pivoting, what we’ll do is take node 6 – currently a child of the root, and rotate the tree counterclockwise around it, so that 6 becomes the root, the old root (2) becomes the left child of 6, and the old left child of 6 (node 4) becomes the right child of the old root.

[Image: balanced.jpg – the same tree after a left pivot around node 6.]

So after the pivot, our tree looks like this. This operation was a left pivot; a right pivot does the same kind of thing, but rotating the tree clockwise instead of counterclockwise.

So let’s go ahead and write the pivot operations. We’ll write two pivot functions: one for each direction. We’ll pass the pivot operation a subtree whose root and child in the appropriate direction are to be rotated. In addition, we’ll also add a parameter for managing the color of the new root node. In some cases, we’ll want to swap the colors of the nodes being moved; in other cases, we won’t. So we’ll put a boolean parameter in to specify whether or not to swap the colors.

-- pivot left tree at root; the second parameter indicates whether or not to swap
-- colors of the nodes that are being moved.
rbtPivotLeft :: (Ord a) => RedBlackTree a -> Bool -> RedBlackTree a
rbtPivotLeft (RBTNode rootval rootcolor sib (RBTNode focval foccolor focleft focright)) swap =
    (RBTNode focval newrootcolor oldroot focright) where
          newrootcolor = if swap then rootcolor else foccolor
          oldrootcolor = if swap then foccolor else rootcolor
          oldroot = RBTNode rootval oldrootcolor sib focleft


rbtPivotRight :: (Ord a) => RedBlackTree a -> Bool -> RedBlackTree a
rbtPivotRight (RBTNode rootval rootcolor (RBTNode focval foccolor focleft focright) sib) swap =
       (RBTNode focval newrootcolor focleft oldroot) where
             newrootcolor = if swap then rootcolor else foccolor
             oldrootcolor = if swap then foccolor else rootcolor
             oldroot = RBTNode rootval oldrootcolor focright sib

So, let’s try taking a look at how the pivots work. First, we need to construct some trees to rebalance. We’ll just do it manually, since the insert code isn’t properly finished yet.

twentyseven = RBTNode 27 Black RBTEmpty RBTEmpty
twentytwo = RBTNode 22 Black RBTEmpty RBTEmpty
twentyfive = RBTNode 25 Black twentytwo twentyseven
sixteen = RBTNode 16 Black RBTEmpty RBTEmpty
twenty = RBTNode 20 Black sixteen twentyfive
twelve = RBTNode 12 Black RBTEmpty RBTEmpty
fifteen = RBTNode 15 Black twelve twenty
two = RBTNode 2 Black RBTEmpty RBTEmpty
seven = RBTNode 7 Black RBTEmpty RBTEmpty
five = RBTNode 5 Black two seven
ten = RBTNode 10 Black five fifteen

That produces an unbalanced binary tree that looks like this:

[Image: test-balance-before.jpg – the unbalanced test tree, rooted at 10.]

RBTNode 10 Black
  (RBTNode 5 Black                          -- 10 left
    (RBTNode 2 Black RBTEmpty RBTEmpty)     -- 5 left
    (RBTNode 7 Black RBTEmpty RBTEmpty))    -- 5 right
  (RBTNode 15 Black                         -- 10 right
    (RBTNode 12 Black RBTEmpty RBTEmpty)    -- 15 left
    (RBTNode 20 Black                       -- 15 right
      (RBTNode 16 Black RBTEmpty RBTEmpty)  -- 20 left
      (RBTNode 25 Black                     -- 20 right
        (RBTNode 22 Black RBTEmpty RBTEmpty)      -- 25 left
        (RBTNode 27 Black RBTEmpty RBTEmpty))))   -- 25 right

Let’s do a quick test, and try doing a left pivot on the root.

*Main> rbtPivotLeft ten False
RBTNode 15 Black (RBTNode 10 Black (RBTNode 5 Black (RBTNode 2 Black RBTEmpty RBTEmpty) (RBTNode 7 Black RBTEmpty RBTEmpty)) (RBTNode 12 Black RBTEmpty RBTEmpty)) (RBTNode 20 Black (RBTNode 16 Black RBTEmpty RBTEmpty) (RBTNode 25 Black (RBTNode 22 Black RBTEmpty RBTEmpty) (RBTNode 27 Black RBTEmpty RBTEmpty)))
*Main>

Cleaned up, that looks like this:

[Image: test-balance-after.jpg – the test tree after a left pivot at the root, now rooted at 15.]

RBTNode 15 Black
  (RBTNode 10 Black
    (RBTNode 5 Black
      (RBTNode 2 Black RBTEmpty RBTEmpty)
      (RBTNode 7 Black RBTEmpty RBTEmpty))
    (RBTNode 12 Black RBTEmpty RBTEmpty))
  (RBTNode 20 Black
    (RBTNode 16 Black RBTEmpty RBTEmpty)
    (RBTNode 25 Black
      (RBTNode 22 Black RBTEmpty RBTEmpty)
      (RBTNode 27 Black RBTEmpty RBTEmpty)))

Much better – that’s much closer to a balanced tree! So now that we know how to do the pivot, and we’ve seen that it works correctly, we can look at building the rebalance code.
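
As one more quick sanity check (my own addition, assuming the definitions above are loaded), the two pivots are mirror images of each other, so a right pivot with no color swap should exactly undo the left pivot we just did:

*Main> rbtPivotRight (rbtPivotLeft ten False) False == ten
True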

With pivots out of the way, we can start looking at how to decide what operations to do to rebalance the tree. When we’re doing an insert, we end up inserting a red node on the bottom of the tree. It’s got two children, both null, which are considered black. If the parent of our new node is black, then everything is fine; we haven’t altered the number of black nodes on any path from a node to a leaf. So we’re done. But if the parent is red, then we’ve got a red child of a red node, so we need to do some fixing.

Fixing an imbalance in a red-black tree can (and in fact often will) trigger a cascade of changes. But part of what makes the structure so elegant is that we only need to look at the local structure immediately around the new insert; and when we’ve corrected that, there’s only one place where the next problem could be. In every case where we’re rebalancing, we can look at a specific problem, fix it, and then immediately move to where the next potential problem is. To code this, we’ll look at it in terms of a focal node, which is the node causing the immediate problem we’re fixing; and we’ll fix the problem by looking at the local context of the focus.

The potential cases we can encounter are:

  1. The focal node is the root of the tree. In that case, we make it black. That adds one black node to every path in the tree, which leaves us with a valid tree, so we’re done.
  2. The focal node is red, but has a black parent. Again, that’s fine. No problem.
  3. The focal node is red, and its parent is also red. Then we need to look at its uncle; that is, the node that is the sibling of its parent. If the new node, its parent, and its uncle are all red, then we change the color of the parent and uncle to black, and the grandparent to red (see the small example after this list). After this, the grandparent becomes the focal node, and we continue our tree-fixing with the new focus.
  4. Here’s where it gets a bit messy. If the focal node and its parent are both red, but the uncle is black, then we’re going to need to pivot. Getting the pivot right is tricky. There are four cases:
    1. If the focal node is the right child of its parent, and the parent is the left child of the grandparent, then we do a left pivot of the focal node and its parent, and the former parent becomes the new focal node.
    2. If the focal node is the left child of its parent, and the parent is the right child of the grandparent, then we do a right pivot of the focal node and its parent, and the former parent becomes the new focus.
    3. If the focal node is the left child of its parent, and the parent is the left child of the grandparent, then we do a right pivot of the parent and the grandparent, swap the colors of the parent and grandparent, and the parent becomes the focus.
    4. If the focal node is the right child of its parent, and the parent is the right child of the grandparent, then we do a left pivot of the parent and the grandparent, swap the colors of the parent and grandparent, and the parent becomes the focus.
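
To make the recoloring case (case 3) concrete, here’s a small hand-built example of my own, with hypothetical values: the focus is the red 1, its parent 2 and uncle 7 are both red, so we blacken the parent and the uncle, redden the grandparent 5, and make the grandparent the new focus.

-- Hypothetical before/after trees for the recoloring case; an illustration,
-- not part of the implementation.
recolorBefore, recolorAfter :: RedBlackTree Int
recolorBefore =
    RBTNode 5 Black
        (RBTNode 2 Red (RBTNode 1 Red RBTEmpty RBTEmpty) RBTEmpty)
        (RBTNode 7 Red RBTEmpty RBTEmpty)
recolorAfter =
    RBTNode 5 Red
        (RBTNode 2 Black (RBTNode 1 Red RBTEmpty RBTEmpty) RBTEmpty)
        (RBTNode 7 Black RBTEmpty RBTEmpty)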

Ok, there’s the algorithm for rebalancing. How can we code it in Haskell? We’ve got a list of the nodes from the insertion path, in leaf to root order. When we look at the rebalance, we can see that there are a bunch of different cases which we can separate via pattern matching:

  1. The focus is the root of the tree. We can select this case by using an empty list as the pattern for the ancestors parameter. Once we’ve gotten to the root, the tree is balanced, and the only corrective thing we may need to do is make the root black. The second clause below handles the closely related case where the only remaining ancestor is the root; it just patches the focus into the root. So:

    -- Root is focus; no matter what color it is, just make it black
    rbtRebalance (RBTNode v _ left right) [] = RBTNode v Black left right
    rbtRebalance node@(RBTNode v _ left right) (parent@(RBTNode pv pc pl pr):[])
                   | pv > v = RBTNode pv pc node pr
                   | otherwise = RBTNode pv pc pl node
    
  2. Also very simple is the case where the focus is black. In that case, we don’t need to do anything except patch in the insert, and continue up the tree. Again, we can select that case just by pattern matching.
    -- black node - just patch in the change, and climb.
    rbtRebalance focus@(RBTNode fv Black left right) (parent@(RBTNode pv pc pl pr):ancestors)
             | pv > fv = rbtRebalance (RBTNode pv pc focus pr) ancestors
             | otherwise = rbtRebalance (RBTNode pv pc pl focus) ancestors
    
    
  3. Next, we’ve got the case of a red node with a black parent. We can identify it by using “RBTNode v Red left right” as a pattern for the focus, and “RBTNode _ Black _ _” as a pattern for the parent. A red node with a black parent is OK, as long as the subtree under the red is balanced; and since we’re balancing from the bottom up, we know that everything beneath this node is balanced. So:
    rbtRebalance focus@(RBTNode fv Red left right) (parent@(RBTNode pv Black pl pr):ancestors) =
           rbtRebalance (reconstructNode focus parent Black) ancestors
    
  4. Now we’re getting to the interesting cases, which are the cases where both the node and its parent are red. We can separate two cases here: cases where we’ll fix the problem using a pivot, and cases where we’ll fix it using a recoloring. The way to distinguish them is by looking at the uncle of the focus node; that is, the sibling of the node’s parent. The red-red case is complicated enough that instead of writing out huge pattern expressions, we’ll simplify it by separating the function into several layers of calls, each of which does a phase of the pattern match. We want to separate out the cases where we’ve got a red node with a red parent and a red uncle, and the cases where we’ve got a red node with a red parent and a black uncle.

    If the focus, its parent, and its uncle are all red, then we’re in a recoloring case; if the focus and its parent are red, and the uncle is black, then we’re in a pivot case.

    rbtRebalance focus@(RBTNode v Red left right) (parent@(RBTNode _ Red _ _):ancestors) =
        rebalanceRedRedNode focus parent ancestors
    

    To be able to recognize sub-cases when we have a red node/red parent, we need to be able to look at the path from the grandparent to the focus, and the color of the uncle. So we’ll write some helper functions to get those.

    uncleColor node parent grandparent =
        if (parent == rbtLeftChild grandparent)
            then rbtColor (rbtRightChild grandparent)
            else rbtColor (rbtLeftChild grandparent)
    
    data TwoStepPath = LeftLeft | LeftRight | RightLeft | RightRight
    
    pathFromGrandparent :: (Ord a) => RedBlackTree a -> RedBlackTree a -> RedBlackTree a -> TwoStepPath
    pathFromGrandparent node@(RBTNode v _ l r) parent@(RBTNode pv _ pl pr) grand@(RBTNode gv _ gl gr)
            | pv < gv  && v < pv  = LeftLeft
            | pv < gv  && v >= pv = LeftRight
            | pv >= gv && v < pv  = RightLeft
            | pv >= gv && v >= pv = RightRight
    
    

    To actually handle the red node/red parent, first we separate out the case where the red parent is the root of the tree – there are no more ancestors on the insertion path. In that case, we can just climb to root, and do the correction from there.

    -- node is red, parent is red, but parent is root: just go to parent(root), and fix
    -- from there.
    rebalanceRedRedNode focus@(RBTNode fv fc fl fr) parent@(RBTNode pv pc pl pr) [] =
         rbtRebalance (reconstructNode focus parent Red) []
    

    Otherwise, we need to check whether the uncle was red or black. If it was red, we do a recolor correction; if it was black, we figure out what kind of pivot to do. We’ll use a bunch of helper functions to make it easy.

    rebalanceRedRedNode focus parent (grand@(RBTNode gv gc gl gr):ancestors) =
        if (uncleColor focus parent grand) == Red
            then recolorAndContinue focus parent grand ancestors
        else case (pathFromGrandparent focus parent grand) of
                 LeftLeft -> rbtRebalance (pivotGrandparentRight focus parent grand) ancestors
                 LeftRight -> rbtRebalance (pivotParentLeft focus parent) (grand:ancestors)
                 RightLeft -> rbtRebalance (pivotParentRight focus parent) (grand:ancestors)
                 RightRight -> rbtRebalance (pivotGrandparentLeft focus parent grand) ancestors
    

    The code above is really just using patterns for case selection. The actual work is in the helper functions that get called. They’re all simple functions. First, we have some custom pivot functions – one for each direction for pivoting around a parent (the cases where the node is left of the parent and the parent is right of the grandparent, or vice versa), and one for each direction for pivoting around a grandparent (both node and parent are left children, or both are right children).

    pivotGrandparentLeft node parent@(RBTNode pv pc pl pr) grand@(RBTNode gv gc gl gr) =
         rbtPivotLeft  (RBTNode gv gc gl (RBTNode pv pc pl node)) True
    
    pivotGrandparentRight node parent@(RBTNode pv pc pl pr) grand@(RBTNode gv gc gl gr) =
         rbtPivotRight  (RBTNode gv gc (RBTNode pv pc node pr) gr) True
    
    pivotParentLeft node parent@(RBTNode pv pc pl pr) =
         rbtPivotLeft (RBTNode pv pc pl node) False
    
    pivotParentRight node parent@(RBTNode pv pc pl pr) =
        rbtPivotRight (RBTNode pv pc node pr) False
    

    And a function to do the recoloring for when the uncle was red:

    recolorAndContinue focus@(RBTNode v c l r) parent@(RBTNode pv pc pl pr) grand@(RBTNode gv gc gl gr) ancestors =
          let path = pathFromGrandparent focus parent grand
              uncle = (case path of
                        LeftLeft -> gr
                        LeftRight -> gr
                        RightLeft -> gl
                        RightRight -> gl)
              newUncle = if (uncle == RBTEmpty)
                             then RBTEmpty
                             else (RBTNode (rbtValue uncle) Black (rbtLeftChild uncle) (rbtRightChild uncle))
              newparent = reconstructNode focus parent Black
              newGrandParent = (case path of
                  LeftLeft -> (RBTNode gv Red newparent newUncle)
                  LeftRight -> (RBTNode gv Red newparent newUncle)
                  RightLeft -> (RBTNode gv Red newUncle newparent)
                  RightRight -> (RBTNode gv Red newUncle newparent))
              in rbtRebalance newGrandParent ancestors
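
Before trying the big test, it’s worth tracing a tiny example by hand (this walkthrough is mine, assuming the code above is loaded). Inserting 1, 2, and 3 in order: 1 becomes a black root; 2 is inserted as its red right child, which is fine; inserting 3 creates a red 3 under the red 2 with a black (empty) uncle, which is the RightRight pivot case, so 2 gets pivoted up, takes the old root’s black color, and 1 becomes its red left child:

*Main> foldl rbtInsert RBTEmpty [1, 2, 3]
RBTNode 2 Black (RBTNode 1 Red RBTEmpty RBTEmpty) (RBTNode 3 Red RBTEmpty RBTEmpty)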
    

And that, finally, is it. For the binary search tree without balancing code, the worst case is inserting a list of values in order. So let’s try that, to see how well this works.

*Main> foldl (\x y -> rbtInsert x y) RBTEmpty [1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
11, 12, 13, 14, 15, 16]
RBTNode 4 Black (RBTNode 2 Black (RBTNode 1 Black RBTEmpty RBTEmpty) (RBTNode 3 Black RBTEmpty RBTEmpty))
(RBTNode 8 Red (RBTNode 6 Black (RBTNode 5 Black RBTEmpty RBTEmpty) (RBTNode 7 Black RBTEmpty RBTEmpty))
(RBTNode 12 Black (RBTNode 10 Red (RBTNode 9 Black RBTEmpty RBTEmpty) (RBTNode 11 Black RBTEmpty RBTEmpty))
(RBTNode 14 Red (RBTNode 13 Black RBTEmpty RBTEmpty) (RBTNode 15 Black RBTEmpty (RBTNode 16 Red RBTEmpty RBTEmpty)))))

Since that’s completely illegible, let’s clean it up, and look at it in picture form:

[Image: result-tree.jpg – the red-black tree produced by inserting 1 through 16 in order, rooted at 4.]

The shortest path from root to leaf is [4,2,1]; the longest is [4,8,12,14,15,16]. Just like we promised: the longest is no more than twice the shortest. It’s a pretty good search tree, the rebalancing work isn’t terribly expensive, and it amortizes nicely over a long run of inserts. Each insert ends up costing O(lg n) – which the plain binary search tree only manages when you get lucky with the insertion order.
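
If you’d rather check that claim mechanically than read it off the picture, here’s one last helper of my own (not part of the implementation):

-- Minimum and maximum distance from the root down to a null; an addition for
-- checking balance, not part of the tree code itself.
rbtDepths :: (Ord a) => RedBlackTree a -> (Int, Int)
rbtDepths RBTEmpty = (0, 0)
rbtDepths (RBTNode _ _ l r) =
    let (lmin, lmax) = rbtDepths l
        (rmin, rmax) = rbtDepths r
    in (1 + min lmin rmin, 1 + max lmax rmax)

Running it on the tree above should give (3,6): the deepest null is exactly twice as far from the root as the shallowest, right at the limit the invariants allow.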

5 thoughts on “Advanced Haskell Data Structures: Red-Black Trees”

  1. Dan McKinley

    Mark, are you familiar with this paper?
    http://www.eecs.usma.edu/webs/people/okasaki/jfp99.ps
    The gist of it is that imperative implementations of RB trees tend to be more complicated because destructive updates to the tree are possible. In functional languages, you can’t do that, so red-black might as well be boiled down to a very simple pattern match of four cases. Not totally sure what your sources are but it looks closer to a reproduction of an imperative algorithm.

  2. lrs

    I like your diagrams – what are you using to draw the trees? I’ve been coding binary trees in Haskell too, and I’m looking for a way to visualize them.
    The best I can do so far is dump a representation of the tree to a text file in DOT format and run graphviz on it.

  3. Federico

    Mark, would you mind making a post about zippers or tries? It seems that those are the most used data structures.

  4. Wry Mouth

    Hmm. This is reminiscent of how I teach my middle school students “how to factor.” Draw two factorizations of, say, 24 on the board, and point out how the “balanced” one is faster than the “unbalanced” one. Maybe, someday, one of them will be a computer programmer and this will resonate! ;o/

  5. Curt Sampson

    Dan McKinley makes an excellent point. From p. 27 of Chris Okasaki’s Purely Functional Data Structures, the book that expands upon the work in his thesis:

    One of the reasons this implementation is so much simpler than typical presentations of red-black trees…is that it uses subtly different rebalancing transformations. Imperative implementations typically split the four dangerous cases considered here into eight cases, according to the color of the sibling of the red node with a red child. Knowing the color of the red parent’s sibling allows the transformations to use fewer assignments in some cases and to terminate rebalancing early in others. However, in a [pure] functional setting, where we are copying the nodes in question anyway, we cannot reduce the number of assignments in this fashion nor can we terminate copying early, so there is no point in using the more complicated transformations.

    The book is one I think that any Haskell programmer who builds his own data structures should own.

