{"id":327,"date":"2007-02-28T15:30:26","date_gmt":"2007-02-28T15:30:26","guid":{"rendered":"http:\/\/scientopia.org\/blogs\/goodmath\/2007\/02\/28\/not-quite-basics-sorting-algorithms\/"},"modified":"2007-02-28T15:30:26","modified_gmt":"2007-02-28T15:30:26","slug":"not-quite-basics-sorting-algorithms","status":"publish","type":"post","link":"http:\/\/www.goodmath.org\/blog\/2007\/02\/28\/not-quite-basics-sorting-algorithms\/","title":{"rendered":"Not Quite Basics: Sorting Algorithms"},"content":{"rendered":"<p> Multiple people have written to me, after seeing yesterday&#8217;s <a href=\"http:\/\/scienceblogs.com\/goodmath\/2007\/02\/basics_algorithm_1.php\">algorithms<\/a> basics post, asking me to say more about sorting algorithms. I have to say that it&#8217;s not my favorite topic &#8211; sorting is one of those old bugaboos that you can&#8217;t avoid, but which gets really dull after a while. But there is a kernel of interest to it &#8211; sorting can be used to demonstrate a lot of interesting ideas about<br \/>\ncomputational complexity.<\/p>\n<p><!--more--><\/p>\n<p> In that vein, let&#8217;s start with a little bit of complexity. There&#8217;s actually a trick based on Kolmogorov\/Chaitin algorithmic information theory that can tell us about about the <em>minimum<\/em> number of comparisons necessary to sort a list of values. Knowing that, we can understand how good or bad a particular algorithm is.<\/p>\n<p> Consider <em>all possible<\/em> lists of N distinct values. How many possibilities are there? N!. How many lists containing exactly those N values are in sorted order? 1. So &#8211; what we&#8217;re doing by sorting is basically equivalent to looking for a <em>permutation function<\/em> that will re-order the members of the list into sorted order. Since there are N! different possible lists, that means that for a particular unsorted list, we need to select the correct one out of N! permutation functions to get the sorted list. The way we do that is by performing comparisons within the list to identify which ordering of the list values we have, and therefore which of the permutation functions is the right one.<\/p>\n<p> So &#8211; what&#8217;s the <em>minimum<\/em> amount of information we need to generate using comparisons to identify which permutation function will be the correct one? Well, the simplest way of identifying permutation functions is by assigning a unique number to each one. So what we need to do in order to sort is <em>at least<\/em> generate the index of the correct permutation function. There are N! permutations; that means that we need <em>log<\/em>(N!) bits to identify the function. log(N!) is proportional to n&times;log(N). So at a minimum, we need to generate N&times;log(N) bits of information. A comparison instruction generates one bit of information &#8211; so we need to do <em>at least<\/em> <b>O(N&times;log(N))<\/b> comparisons.<\/p>\n<p> I don&#8217;t know about you, but when I first saw that proof, I thought it was incredibly nifty.<\/p>\n<p> So &#8211; on to some interesting sort functions. 
For each, we&#8217;ll assume that we&#8217;re given a list L<sub>1<\/sub>&#8230;L<sub>N<\/sub> numbers, and that we want to return a result list R<sub>i<\/sub>&#8230;R<sub>N<\/sub> which contains the list in sorted order.<\/p>\n<h3>Insertion Sort<\/h3>\n<p> Probably the most intuitive sorting algorithm is called <em>insertion sort<\/em>, and if you&#8217;ve ever done any kind of filing for an office, you&#8217;ve probably used it.<\/p>\n<p> For simplicity, assume that &lt; will report true if its right parameter is undefined; this will allow me to present the algorithm without cluttering it with boundary-checking.<\/p>\n<pre>\nStart with an empty result list: R&larr;[]\nfor each L<sub>i<\/sub> &isin; L:\nsearch R for the first position j where L<sub>i<\/sub>&lt;R<sub>j<\/sub>\ninsert L<sub>i<\/sub> into R immediately following R<sub>j<\/sub>.\n<\/pre>\n<p> So let&#8217;s look at sorting a simple list: [3,5,2,8,4,1,9,12]:<\/p>\n<ol>\n<li> Initial result = [].<\/li>\n<li> Insert 3: [3]<\/li>\n<li> Insert 5: [3,5]<\/li>\n<li> Insert 2: [2,3,5]<\/li>\n<li> Insert 8: [2,3,5,8]<\/li>\n<li> Insert 4: [2,3,4,5,8]<\/li>\n<li> Insert 1: [1,2,3,4,5,8]<\/li>\n<li> Insert 9: [1,2,3,4,5,8,9]<\/li>\n<li> Insert 12: [1,2,3,4,5,8,9,12]<\/li>\n<\/ol>\n<p> What&#8217;s the complexity of this? Well, it depends on two things: how is the list stored; and how will we find the right position for inserting another value?<\/p>\n<p> Suppose we stored the numbers as a contiguous array in memory. Then to <em>find<\/em> the insert position, we can actually use binary search &#8211; so it will take log (n) steps to find the insert position. So to insert N values, we&#8217;re looking at N log(N) comparisons &#8211; great, right?<\/p>\n<p> Not so fast. Each time we insert a value, we need to <em>copy<\/em> all of the values greater than it. So while we&#8217;ll do no more than N log(N) compares, we might end up doing <b>O<\/b>(n<sup>2<\/sup>) copies.<\/p>\n<p> What if we stored the numbers in a linked list? In a linked list, we can&#8217;t use the binary search trick to find the insert position. So we&#8217;re looking at <b>O<\/b>(n<sup>2<\/sup>) comparisons, and N copies.<\/p>\n<p> So insertion sort isn&#8217;t a great algorithm. In fact, it&#8217;s not even really better than bubble sort in terms of complexity.<\/p>\n<h3>Heap Sort<\/h3>\n<p> Heap sort is sort-of a variation on insertion sort &#8211; only instead of doing the insert into a flat list, we&#8217;re inserting into a binary search tree which we&#8217;ll call a heap. So, if we use something like <a href=\"http:\/\/scientopia.org\/blogs\/goodmath\/2007\/01\/balanced-binary-trees-in-haskell\">a red-black tree<\/a> which makes sure that the tree is balanced, and doesn&#8217;t<br \/>\ndegenerate into a list, then we&#8217;re looking at <b>O<\/b>(N log(N)) comparisons, and <b>O<\/b>(N) copies for a linked tree implementation. <\/p>\n<p><em>(For the more advanced folks out there, I&#8217;m lying a little bit here. There are a bunch of different data structures that implement a heap &#8211; a heap is really a structure where you can add values, and when you remove a value from it, it always returns the smallest value in the heap.)<\/em><\/p>\n<p> So looking at the previous example, and assuming we&#8217;re using a perfect balancing method, after the inserts,  we&#8217;d end up with a heap that looked like: [5,[2,[1],[3,[],[4]]],[9,[8],[12]]. 
### Heap Sort

Heap sort is sort-of a variation on insertion sort – only instead of doing the insert into a flat list, we're inserting into a binary search tree, which we'll call a heap. So if we use something like [a red-black tree](http://scientopia.org/blogs/goodmath/2007/01/balanced-binary-trees-in-haskell), which makes sure that the tree stays balanced and doesn't degenerate into a list, then we're looking at **O**(N log(N)) comparisons, and **O**(N) copies for a linked tree implementation.

*(For the more advanced folks out there, I'm lying a little bit here. There are a bunch of different data structures that implement a heap – a heap is really a structure where you can add values, and when you remove a value from it, it always returns the smallest value in the heap.)*

So looking at the previous example, and assuming we're using a perfect balancing method, after the inserts we'd end up with a heap that looked like: [5,[2,[1],[3,[],[4]]],[9,[8],[12]]]. To retrieve from it, we do an in-order traversal of the tree: [1,2,3,4,5,8,9,12].

The main problem with heap sort is that implementing balanced trees is a pain.

### QuickSort

Quicksort is a nifty algorithm, although it's really rather overrated. It's got *worst case* performance **O**(N²), but average **O**(N log(N)). Too many people forget about that "worst case" part, and consider quicksort to be the be-all and end-all of sorting algorithms.

It's a kind of recursive algorithm known as a *divide and conquer* algorithm, which means that it works by dividing its input into parts, getting the solutions for the parts, and then combining them.

```
If L has fewer than two elements, it's already sorted: return it.
Pick a random element Lᵢ, and call it the pivot.
Split the rest of the list (everything except the pivot) into two sub-lists:
    LT = {Lⱼ | Lⱼ < pivot} and
    GT = {Lⱼ | Lⱼ ≥ pivot}
Let SortedLT = Sort(LT)
Let SortedGT = Sort(GT)
Concatenate SortedLT, pivot, and SortedGT.
```

Let's look at our example again, L=[3,5,2,8,4,1,9,12]:

1. Let pivot=8: LT=[3,5,2,4,1], GT=[9,12].
2. Now recurse:
    1. Sort LT: pivot=2; LT=[1], GT=[3,5,4]… Continue until you have SortedLT=[1,2,3,4,5].
    2. Sort GT: SortedGT=[9,12].
3. Concatenate: result = [1,2,3,4,5] + [8] + [9,12] = [1,2,3,4,5,8,9,12].

If the pivot you select each time is close to the median value of the list, then this will run in **O**(n lg n) time. Unfortunately, you can't always easily find the median. In general, either the list is in random order (in which case picking the first element of the list as the pivot is as good as any other), or the list is in almost sorted order (in which case picking the element at the midpoint of the list will usually work well). Unfortunately, it's possible to always end up picking the smallest value in the list as the pivot – in which case you're going to see **O**(n²) performance.

The *nice* thing about quicksort is that on average you can get good performance, *and* you can do it on a contiguous array *in-place*, meaning that you don't need to allocate any extra memory to store copies of parts of the list. Combine that with average **O**(n lg n) performance, and it's a pretty attractive algorithm: easy to implement, space efficient, and usually fast.
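Here's a minimal in-place Python sketch (mine – the pseudocode above doesn't commit to a particular partitioning scheme; this uses a random pivot with a Lomuto-style partition, one standard way to do the split):

```python
import random

def quicksort(a, lo=0, hi=None):
    """Sort list a in place between indices lo and hi, inclusive."""
    if hi is None:
        hi = len(a) - 1
    if lo >= hi:                       # base case: 0 or 1 elements
        return
    # Pick a random pivot and stash it at the end (Lomuto partition).
    p = random.randint(lo, hi)
    a[p], a[hi] = a[hi], a[p]
    pivot = a[hi]
    i = lo                             # a[lo:i] holds elements < pivot
    for j in range(lo, hi):
        if a[j] < pivot:
            a[i], a[j] = a[j], a[i]
            i += 1
    a[i], a[hi] = a[hi], a[i]          # put the pivot between the halves
    quicksort(a, lo, i - 1)            # recurse on the < side
    quicksort(a, i + 1, hi)            # recurse on the >= side

nums = [3, 5, 2, 8, 4, 1, 9, 12]
quicksort(nums)
print(nums)  # [1, 2, 3, 4, 5, 8, 9, 12]
```

Note that the swaps all happen within the one array – no sub-list copies – which is exactly the in-place, space-efficient property described above.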
### Merge Sort

My favorite sort algorithm. This one works best on linked lists – otherwise, it blows a bunch of memory on copying. Like quicksort, it's a divide and conquer algorithm; but instead of doing the comparisons ahead of time like quicksort, it does the comparisons at merge time.

```
If L has fewer than two elements, it's already sorted: return it.
Split L into two sublists, A and B. (Note: no comparisons. Just break the list in half.)
Sort A into SortedA and B into SortedB.
while both SortedA and SortedB are non-empty:
    if SortedA₁ < SortedB₁ then:
        remove SortedA₁ from SortedA and append it to the result.
    else:
        remove SortedB₁ from SortedB and append it to the result.
append whichever list is non-empty to the result.
```

An example really helps to see how this works. Again, using [3,5,2,8,4,1,9,12]:

1. Result=[]
2. A=[3,5,2,8], B=[4,1,9,12]
3. Recurse, getting SortedA=[2,3,5,8] and SortedB=[1,4,9,12]
4. SortedA[1]=2, SortedB[1]=1. So remove SortedB[1] and append it to Result. Result=[1], SortedA=[2,3,5,8], SortedB=[4,9,12]
5. SortedA[1]=2, SortedB[1]=4. So remove SortedA[1] and append it to Result. Result=[1,2], SortedA=[3,5,8], SortedB=[4,9,12]
6. SortedA[1]=3, SortedB[1]=4. So remove SortedA[1] and append it to Result. Result=[1,2,3], SortedA=[5,8], SortedB=[4,9,12]
7. SortedA[1]=5, SortedB[1]=4. So remove SortedB[1] and append it to Result. Result=[1,2,3,4], SortedA=[5,8], SortedB=[9,12]
8. SortedA[1]=5, SortedB[1]=9. So remove SortedA[1] and append it to Result. Result=[1,2,3,4,5], SortedA=[8], SortedB=[9,12]
9. SortedA[1]=8, SortedB[1]=9. So remove SortedA[1] and append it to Result. Result=[1,2,3,4,5,8], SortedA=[], SortedB=[9,12]
10. SortedA is empty, so append SortedB to the result: Result=[1,2,3,4,5,8,9,12].

Mergesort always does **O**(n lg n) comparisons. But it can also wind up doing **O**(n lg n) copying, and a naive array implementation needs **O**(n) extra space for its merge buffers, compared to no extra data space for in-place quicksort. (Both use some stack space; but in the average case, the stack space is dwarfed by the data, and very similar between the two algorithms.)
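And a minimal Python sketch of mergesort (mine again – list-based rather than the linked-list version favored above, so it pays the copying cost just described):

```python
def merge_sort(l):
    """Return a new sorted list, mirroring the pseudocode above."""
    if len(l) < 2:                     # base case: already sorted
        return l
    mid = len(l) // 2
    a = merge_sort(l[:mid])            # split with no comparisons...
    b = merge_sort(l[mid:])
    result = []
    i = j = 0
    while i < len(a) and j < len(b):   # ...all comparisons happen at merge time
        if a[i] < b[j]:
            result.append(a[i]); i += 1
        else:
            result.append(b[j]); j += 1
    result.extend(a[i:] or b[j:])      # append whichever list is non-empty
    return result

print(merge_sort([3, 5, 2, 8, 4, 1, 9, 12]))  # [1, 2, 3, 4, 5, 8, 9, 12]
```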
[&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":false,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[74,80,30],"tags":[],"class_list":["post-327","post","type-post","status-publish","format-standard","hentry","category-basics","category-computational-complexity","category-information-theory"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_shortlink":"https:\/\/wp.me\/p4lzZS-5h","jetpack_sharing_enabled":true,"jetpack_likes_enabled":true,"_links":{"self":[{"href":"http:\/\/www.goodmath.org\/blog\/wp-json\/wp\/v2\/posts\/327","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/www.goodmath.org\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/www.goodmath.org\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/www.goodmath.org\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/www.goodmath.org\/blog\/wp-json\/wp\/v2\/comments?post=327"}],"version-history":[{"count":0,"href":"http:\/\/www.goodmath.org\/blog\/wp-json\/wp\/v2\/posts\/327\/revisions"}],"wp:attachment":[{"href":"http:\/\/www.goodmath.org\/blog\/wp-json\/wp\/v2\/media?parent=327"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/www.goodmath.org\/blog\/wp-json\/wp\/v2\/categories?post=327"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/www.goodmath.org\/blog\/wp-json\/wp\/v2\/tags?post=327"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}