{"id":503,"date":"2007-09-06T11:40:25","date_gmt":"2007-09-06T11:40:25","guid":{"rendered":"http:\/\/scientopia.org\/blogs\/goodmath\/2007\/09\/06\/graph-searches-and-disjoint-sets-the-union-find-problem\/"},"modified":"2007-09-06T11:40:25","modified_gmt":"2007-09-06T11:40:25","slug":"graph-searches-and-disjoint-sets-the-union-find-problem","status":"publish","type":"post","link":"http:\/\/www.goodmath.org\/blog\/2007\/09\/06\/graph-searches-and-disjoint-sets-the-union-find-problem\/","title":{"rendered":"Graph Searches and Disjoint Sets: the Union-Find Problem"},"content":{"rendered":"<p> Suppose you&#8217;ve got a huge graph &#8211; millions of nodes. And you <em>know<\/em> that it&#8217;s not connected &#8211; so the graph actually consists of some number of pieces (called the connected components of the graph). And there are constantly new vertices and edges being added to the graph, but nothing is ever removed. Some questions you might want to ask about this graph at a particular point in time are:<\/p>\n<ul>\n<li> How many components are there in the graph?<\/li>\n<li> Which component is vertex X in?<\/li>\n<li> Are vertices X and Y in the same component?<\/li>\n<li> How many components are there?<\/li>\n<\/ul>\n<p> All of these questions are variants of a classic computer science problem, called<br \/>\nunion-find, which also comes up in an astonishing number of different contexts. The reason for the name is that in the representation of the solution, there<br \/>\nare two basic operations: union, and find. Basically, the division of the graph into<br \/>\ncomponents is also a partition of the vertices of the graph into disjoint sets: union find<br \/>\nis a problem which focuses on a particular kind of disjoint set problem, where you can modify<br \/>\nthe sets over time.<\/p>\n<p><!--more--><\/p>\n<p> The basic idea of the classic union-find solution is to create a list of sets, where each of the member sets contains a collection of the members of one of the sets (vertices of the connected subgraph). Each of the<br \/>\nsets has one member, randomly selected, and called the set&#8217;s <em>representative<\/em>: as long as a set<br \/>\nremains disjoint (no edges are added which connect it to another component), its representative never<br \/>\nchanges; when two sets are merged (an edge is added connecting two components), the representative<br \/>\nof the merged set will be the representative of one of the two sets that were merged. The <em>only<\/em> way to refer to a set is by using its representative.<\/p>\n<p> This gives us a data structure with 3 real operations. One is to add a set of objects (connected vertices) to it. Another is <em>union<\/em> which specifies two objects (vertices), and merges the sets containing those objects (adding an edge). The last is <em>find<\/em> which takes an object, and returns the representative of the set to which that object belongs.<\/p>\n<p> If you think of it as lists of lists, it&#8217;s pretty obvious how this all works. Adding a set of objects<br \/>\nis just linking that list into the set. Union is taking one set, and appending it to another. Find is searching the lists to figure out which set contains a particular element.<\/p>\n<pre>\ntype setMember = (value, child, parent)\ndef AddSet(sets, newSetRepresentative):\nsets.add(newSet)\ndef Union(sets, setOneMember, setTwoMember):\nsetOneRepresentative = sets.Find(setOneMember)\nsetTwoRepresentative = sets.Find(setTwoMember)\nsets.remove(setTwoRepresentative)\nsetOneRepresentative.append(setTwoRepresentative)\ndef Find(sets, v): if (v.parent == null): return v else: return Find(sets, v.parent)\n<\/pre>\n<p> In this version, the costs are also easy to figure out. Using a doubly linked list for the sets, we<br \/>\nget addition costs as O(1); find is O(N) (since you&#8217;re following the list of links from a node to its<br \/>\nparent until you get to the representative, which is the only node with no parent, it could take up to N<br \/>\nsteps, where N is the number or objects in the set of sets); and union is also O(N) (union requires two<br \/>\nFinds, each of which take O(N); and then finding the end of one of the lists, where the other set will be<br \/>\nlinked in, which is also O(N)). If you&#8217;ve got really big sets, and you&#8217;re going to do a lot of operations,<br \/>\nthen this is a really awful running time.<\/p>\n<p> How can we improve that? The easiest way is to take each set, and replace it by a tree. That&#8217;s called<br \/>\nthe <em>disjoint forest<\/em> solution. In a disjoint forest, a node can have more than one child, but each<br \/>\nnode has either zero (if its a representative) or one parent. In this representation, the implementation<br \/>\nof find is exactly the same: climb up the parent chain to find a representative, which takes amortized<br \/>\nO(lg n), provided the trees remain shallow and bushy. Union is again two finds (lg n) followed by<br \/>\nattaching one tree a a child of the other trees representative (O(1)), so it&#8217;s also O(lg n). <\/p>\n<p> Now, the question is, how can we say that the forest is really O(lg n)? That&#8217;s only true if the trees<br \/>\nare shallow and close to balanced. Unbalanced, deep trees could degenerate to exactly the same scenario as<br \/>\nthe list implementation. Fortunately, maintaining an approximate, bushy balance is easy: For each tree, we<br \/>\nmaintain something called <em>rank<\/em>. For a tree of one vertex, we assign rank=0. When we merge trees,<br \/>\nwe always merge into the tree with higher rank (leaving the rank unchanged), unless the trees being merged<br \/>\nhave equal rank=N. In that case, we pick one arbitrarily, and assign the result rank=N+1. That&#8217;s actually<br \/>\nenough to keep the amortized cost of all operations at O(lg N).<\/p>\n<pre>\ntype setMember = (value, child, parent)\ndef AddSet(sets, newSetRepresentative):\nsets.add(newSetRepresentative)\nnewSetRepresentative.rank = 1\ndef Union(sets, setOneMember, setTwoMember):\nsetOneRepresentative = sets.Find(setOneMember)\nsetTwoRepresentative = sets.Find(setTwoMember)\nif (setOneRepresentative.rank &gt; setTwoRepresentative.rank):\nhigherRankRepresentative = setOneRepresentative\nlowerRankRepresentative = setTwoRepresentative\nelse:\nhigherRankRepresentative = setTwoRepresentative\nlowerRankRepresentative = setTwoRepresentative\nhigherRankRepresentative.addChild(lowerRankRepresentative)\nsets.remove(lowerRankRepresentative)\nif (higherRankRepresentative.rank == lowerRankRepresentative.rank):\nhigherRankRepresentative.rank++\ndef Find(sets, v):\nif (v.parent == null):\nreturn v\nelse:\nreturn Find(sets, v.parent)\n<\/pre>\n<p> In fact, we can improve this even more in a simple way. Instead of maintaining rank, we can just<br \/>\nmunge the tree. Each time we do a union, we just randomly pick one of the two tree, and attach is<br \/>\nas a child of the representative of the other.  Each time we do a Find operation, we keep a list of each node visited on the path to<br \/>\nthe representative, and re-link all of them as children of the representative. This way, once is a<br \/>\nwhile, you&#8217;ll have to do the expensive operation of relinking a bunch of nodes after doing a climb<br \/>\nup the tree (with potential cost O(lg n)), but as a result, a bunch of Finds that formerly would have taken O(lg n) will now take O(1). This one is easier to see with an example than with pseudo code. Image<br \/>\nwe had a set of trees like this one, where the representatives are in red:<\/p>\n<p><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" alt=\"union-find-initial.png\" src=\"https:\/\/i0.wp.com\/scientopia.org\/img-archive\/goodmath\/img_258.png?resize=325%2C160\" width=\"325\" height=\"160\" \/><\/p>\n<p> With this set of trees, we can do a Union(A,J). The representative of the set containing A is O, and J is a representative. So we merge those two sets, and get the following:<\/p>\n<p><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" alt=\"union-find-union.png\" src=\"https:\/\/i0.wp.com\/scientopia.org\/img-archive\/goodmath\/img_259.png?resize=249%2C219\" width=\"249\" height=\"219\" \/><\/p>\n<p> Now, suppose we want to do a Find(C). What we&#8217;d do is trace the path from C to its representative: C, to A, to O, to J. The representative is J. But we also keep the list of what we visited before reaching J &#8211; (C, A, O) &#8211; and link them as direct children of J, giving us this:<\/p>\n<p><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" alt=\"union-find-find.png\" src=\"https:\/\/i0.wp.com\/scientopia.org\/img-archive\/goodmath\/img_260.png?resize=317%2C194\" width=\"317\" height=\"194\" \/><\/p>\n<p> So doing tho find(C) was fairly expensive &#8211; it required us to visit three nodes, and re-link two. But now, doing  finds on C, A, and G will be fast next time. In fact, this approach does a better job of keeping the trees shallow and bushy than rank, with no extra storage overhead. And the cost of both union and find remains an amortized O(lg n).<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Suppose you&#8217;ve got a huge graph &#8211; millions of nodes. And you know that it&#8217;s not connected &#8211; so the graph actually consists of some number of pieces (called the connected components of the graph). And there are constantly new vertices and edges being added to the graph, but nothing is ever removed. Some questions [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":false,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[80,25,56],"tags":[],"class_list":["post-503","post","type-post","status-publish","format-standard","hentry","category-computational-complexity","category-graph-theory","category-set-theory"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_shortlink":"https:\/\/wp.me\/p4lzZS-87","jetpack_sharing_enabled":true,"jetpack_likes_enabled":true,"_links":{"self":[{"href":"http:\/\/www.goodmath.org\/blog\/wp-json\/wp\/v2\/posts\/503","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/www.goodmath.org\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/www.goodmath.org\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/www.goodmath.org\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/www.goodmath.org\/blog\/wp-json\/wp\/v2\/comments?post=503"}],"version-history":[{"count":0,"href":"http:\/\/www.goodmath.org\/blog\/wp-json\/wp\/v2\/posts\/503\/revisions"}],"wp:attachment":[{"href":"http:\/\/www.goodmath.org\/blog\/wp-json\/wp\/v2\/media?parent=503"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/www.goodmath.org\/blog\/wp-json\/wp\/v2\/categories?post=503"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/www.goodmath.org\/blog\/wp-json\/wp\/v2\/tags?post=503"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}