Suppose you’ve got a huge graph – millions of nodes. And you know that it’s not connected – so the graph actually consists of some number of pieces (called the connected components of the graph). And there are constantly new vertices and edges being added to the graph, but nothing is ever removed. Some questions you might want to ask about this graph at a particular point in time are:
- How many components are there in the graph?
- Which component is vertex X in?
- Are vertices X and Y in the same component?
All of these questions are variants of a classic computer science problem called
union-find, which comes up in an astonishing number of different contexts. The name comes from
the two basic operations in the solution: union and find. The division of the graph into
components is a partition of the graph's vertices into disjoint sets, and union-find
is the version of the disjoint-set problem where the sets can be modified
over time.
The basic idea of the classic union-find solution is to maintain a collection of disjoint sets, where each set contains the vertices of one connected component. Each of the
sets has one member, arbitrarily selected, called the set's representative: as long as a set
remains disjoint (no edges are added which connect it to another component), its representative never
changes; when two sets are merged (an edge is added connecting two components), the representative
of the merged set is the representative of one of the two sets that were merged. The only way to refer to a set is by its representative.
This gives us a data structure with three operations. One is add, which adds a new set of objects (connected vertices). Another is union, which takes two objects (vertices), and merges the sets containing those objects (adding an edge). The last is find, which takes an object, and returns the representative of the set to which that object belongs.
If you think of it as a list of lists, it's pretty obvious how this all works. Adding a set of objects
is just linking a new list into the list of sets. Union is taking one set, and appending it to another. Find is searching the lists to figure out which set contains a particular element.
type setMember = (value, child, parent)

def AddSet(sets, newSetRepresentative):
   sets.add(newSetRepresentative)

def Union(sets, setOneMember, setTwoMember):
   setOneRepresentative = sets.Find(setOneMember)
   setTwoRepresentative = sets.Find(setTwoMember)
   sets.remove(setTwoRepresentative)
   setOneRepresentative.append(setTwoRepresentative)

def Find(sets, v):
   if (v.parent == null):
      return v
   else:
      return Find(sets, v.parent)
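To make the pseudocode above concrete, here's a minimal runnable Python sketch of the naive version, using parent pointers. The class and function names here are my own choices for illustration, not part of the original.

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.parent = None  # None means this node is a representative


def find(node):
    # Climb the parent chain until we reach the representative.
    while node.parent is not None:
        node = node.parent
    return node


def union(a, b):
    # Merge the set containing b into the set containing a by
    # attaching b's representative under a's representative.
    rep_a, rep_b = find(a), find(b)
    if rep_a is not rep_b:
        rep_b.parent = rep_a


a, b, c = Node("A"), Node("B"), Node("C")
union(a, b)                # A and B are now in the same set
print(find(b) is find(a))  # True
print(find(c) is find(a))  # False: C is still a singleton
```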
In this version, the costs are also easy to figure out. Using a doubly linked list for the sets, we
get addition in O(1); find is O(N) (since you're following the chain of links from a node to its
parent until you reach the representative, which is the only node with no parent, it can take up to N
steps, where N is the number of objects in the set of sets); and union is also O(N) (union requires two
finds, each of which takes O(N), followed by finding the end of one of the lists, where the other set will be
linked in, which is also O(N)). If you've got really big sets, and you're going to do a lot of operations,
then this is a really awful running time.
How can we improve that? The easiest way is to take each set, and replace it with a tree. That's called
the disjoint-set forest solution. In a disjoint-set forest, a node can have more than one child, but each
node has either zero parents (if it's a representative) or one. In this representation, the implementation
of find is exactly the same: climb up the parent chain to find the representative, which takes
O(lg n), provided the trees remain shallow and bushy. Union is again two finds (O(lg n)) followed by
attaching one tree as a child of the other tree's representative (O(1)), so it's also O(lg n).
Now, the question is, how can we guarantee that the forest operations are really O(lg n)? That's only true if the trees
stay shallow and close to balanced. Unbalanced, deep trees could degenerate to exactly the same behavior as
the list implementation. Fortunately, maintaining an approximate, bushy balance is easy: for each tree, we
maintain something called its rank. A tree of one vertex gets rank 0. When we merge trees,
we always merge into the tree with the higher rank (leaving the rank unchanged), unless the trees being merged
have equal rank N. In that case, we pick one arbitrarily, and assign the result rank N+1. That's actually
enough to keep the cost of all operations at O(lg N).
type setMember = (value, child, parent, rank)

def AddSet(sets, newSetRepresentative):
   sets.add(newSetRepresentative)
   newSetRepresentative.rank = 0

def Union(sets, setOneMember, setTwoMember):
   setOneRepresentative = sets.Find(setOneMember)
   setTwoRepresentative = sets.Find(setTwoMember)
   if (setOneRepresentative.rank > setTwoRepresentative.rank):
      higherRankRepresentative = setOneRepresentative
      lowerRankRepresentative = setTwoRepresentative
   else:
      higherRankRepresentative = setTwoRepresentative
      lowerRankRepresentative = setOneRepresentative
   if (higherRankRepresentative.rank == lowerRankRepresentative.rank):
      higherRankRepresentative.rank = higherRankRepresentative.rank + 1
   higherRankRepresentative.addChild(lowerRankRepresentative)
   sets.remove(lowerRankRepresentative)

def Find(sets, v):
   if (v.parent == null):
      return v
   else:
      return Find(sets, v.parent)
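Here's a runnable Python sketch of union by rank as described above (the names and the little merging demo are mine). Merging eight singletons pairwise, then merging the pairs, then the fours, shows the rank growing only logarithmically:

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.parent = None
        self.rank = 0  # a one-vertex tree starts at rank 0


def find(node):
    while node.parent is not None:
        node = node.parent
    return node


def union(a, b):
    rep_a, rep_b = find(a), find(b)
    if rep_a is rep_b:
        return  # already in the same set
    # Always attach the lower-rank tree under the higher-rank one.
    if rep_a.rank < rep_b.rank:
        rep_a, rep_b = rep_b, rep_a
    rep_b.parent = rep_a
    if rep_a.rank == rep_b.rank:
        rep_a.rank += 1


nodes = [Node(i) for i in range(8)]
# Merge singletons into pairs, pairs into fours, fours into one set.
for step in (1, 2, 4):
    for i in range(0, 8, 2 * step):
        union(nodes[i], nodes[i + step])
print(find(nodes[7]).rank)  # 3, i.e. lg 8: the worst case for 8 nodes
```

Note that the rank only increases when two trees of equal rank are merged, which is why it can reach N+1 only by doubling the tree's size.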
In fact, we can improve this even further in a simple way. Instead of maintaining rank, we can just
munge the tree. Each time we do a union, we randomly pick one of the two trees, and attach it
as a child of the representative of the other. Each time we do a find, we keep a list of each node visited on the path to
the representative, and re-link all of them as direct children of the representative. This is called path compression. Once in a
while, you'll have to do the expensive operation of re-linking a bunch of nodes after doing a climb
up the tree (with potential cost O(lg n)), but as a result, a bunch of finds that formerly would have taken O(lg n) will now take O(1). This one is easier to see with an example than with pseudo-code. Imagine
we had a set of trees like this one, where the representatives are in red:
With this set of trees, we can do a Union(A,J). The representative of the set containing A is O, and J is a representative. So we merge those two sets, and get the following:
Now, suppose we want to do a Find(C). What we’d do is trace the path from C to its representative: C, to A, to O, to J. The representative is J. But we also keep the list of what we visited before reaching J – (C, A, O) – and link them as direct children of J, giving us this:
So doing the find(C) was fairly expensive – it required us to visit three nodes, and re-link two of them. But now, doing finds on C, A, and O will be fast next time. In fact, this approach does a better job of keeping the trees shallow and bushy than rank does, with no extra storage overhead. And the cost of both union and find remains an amortized O(lg n).
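The walkthrough above can be sketched in a few lines of Python. This is my own illustrative version of find with path compression: after climbing to the representative, every node visited on the way is re-linked as a direct child of it.

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.parent = None  # None means this node is a representative


def find(node):
    visited = []
    while node.parent is not None:
        visited.append(node)
        node = node.parent
    # node is now the representative; compress the path we walked.
    for v in visited:
        v.parent = node
    return node


def union(a, b):
    # No rank here: just attach one representative under the other.
    rep_a, rep_b = find(a), find(b)
    if rep_a is not rep_b:
        rep_b.parent = rep_a


# Build a deliberately deep chain: d -> c -> b -> a (a is the rep).
a, b, c, d = (Node(x) for x in "abcd")
d.parent, c.parent, b.parent = c, b, a
find(d)  # one expensive find...
print(d.parent is a and c.parent is a)  # True: both now point straight at a
```

After that single find, the chain has collapsed: every subsequent find on b, c, or d is a single parent-pointer hop.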