Not Quite Basics: Sorting Algorithms

Multiple people have written to me, after seeing yesterday’s algorithms basics post, asking me to say more about sorting algorithms. I have to say that it’s not my favorite topic – sorting is one of those old bugaboos that you can’t avoid, but which gets really dull after a while. But there is a kernel of interest to it – sorting can be used to demonstrate a lot of interesting ideas about
computational complexity.

In that vein, let’s start with a little bit of complexity. There’s actually a trick based on Kolmogorov/Chaitin algorithmic information theory that can tell us about about the minimum number of comparisons necessary to sort a list of values. Knowing that, we can understand how good or bad a particular algorithm is.

Consider all possible lists of N distinct values. How many possibilities are there? N!. How many lists containing exactly those N values are in sorted order? 1. So – what we’re doing by sorting is basically equivalent to looking for a permutation function that will re-order the members of the list into sorted order. Since there are N! different possible lists, that means that for a particular unsorted list, we need to select the correct one out of N! permutation functions to get the sorted list. The way we do that is by performing comparisons within the list to identify which ordering of the list values we have, and therefore which of the permutation functions is the right one.

So – what’s the minimum amount of information we need to generate using comparisons to identify which permutation function will be the correct one? Well, the simplest way of identifying permutation functions is by assigning a unique number to each one. So what we need to do in order to sort is at least generate the index of the correct permutation function. There are N! permutations; that means that we need log(N!) bits to identify the function. log(N!) is proportional to n×log(N). So at a minimum, we need to generate N×log(N) bits of information. A comparison instruction generates one bit of information – so we need to do at least O(N×log(N)) comparisons.

I don’t know about you, but when I first saw that proof, I thought it was incredibly nifty.

So – on to some interesting sort functions. For each, we’ll assume that we’re given a list L1…LN numbers, and that we want to return a result list Ri…RN which contains the list in sorted order.

Insertion Sort

Probably the most intuitive sorting algorithm is called insertion sort, and if you’ve ever done any kind of filing for an office, you’ve probably used it.

For simplicity, assume that < will report true if its right parameter is undefined; this will allow me to present the algorithm without cluttering it with boundary-checking.

Start with an empty result list: R←[]
for each Li ∈ L:
search R for the first position j where Li<Rj
insert Li into R immediately following Rj.

So let’s look at sorting a simple list: [3,5,2,8,4,1,9,12]:

  1. Initial result = [].
  2. Insert 3: [3]
  3. Insert 5: [3,5]
  4. Insert 2: [2,3,5]
  5. Insert 8: [2,3,5,8]
  6. Insert 4: [2,3,4,5,8]
  7. Insert 1: [1,2,3,4,5,8]
  8. Insert 9: [1,2,3,4,5,8,9]
  9. Insert 12: [1,2,3,4,5,8,9,12]

What’s the complexity of this? Well, it depends on two things: how is the list stored; and how will we find the right position for inserting another value?

Suppose we stored the numbers as a contiguous array in memory. Then to find the insert position, we can actually use binary search – so it will take log (n) steps to find the insert position. So to insert N values, we’re looking at N log(N) comparisons – great, right?

Not so fast. Each time we insert a value, we need to copy all of the values greater than it. So while we’ll do no more than N log(N) compares, we might end up doing O(n2) copies.

What if we stored the numbers in a linked list? In a linked list, we can’t use the binary search trick to find the insert position. So we’re looking at O(n2) comparisons, and N copies.

So insertion sort isn’t a great algorithm. In fact, it’s not even really better than bubble sort in terms of complexity.

Heap Sort

Heap sort is sort-of a variation on insertion sort – only instead of doing the insert into a flat list, we’re inserting into a binary search tree which we’ll call a heap. So, if we use something like a red-black tree which makes sure that the tree is balanced, and doesn’t
degenerate into a list, then we’re looking at O(N log(N)) comparisons, and O(N) copies for a linked tree implementation.

(For the more advanced folks out there, I’m lying a little bit here. There are a bunch of different data structures that implement a heap – a heap is really a structure where you can add values, and when you remove a value from it, it always returns the smallest value in the heap.)

So looking at the previous example, and assuming we’re using a perfect balancing method, after the inserts, we’d end up with a heap that looked like: [5,[2,[1],[3,[],[4]]],[9,[8],[12]]. To retrieve from it, we do an in-order traversal of the tree: [1,2,3,4,5,8,9,12].

The main problem with heap sort is that implementing balanced trees is a pain.


Quicksort is a nifty algorithm, although it’s really rather overrated. It’s got worst case performance O(N2); but average O(N log(N)). Too many people forget about that “worst case” part, and consider quicksort to be the be-all and end-all of sorting algorithms.

It’s a kind of recursive algorithm known as a divide and conquer algorithm, which means that it works by dividing its input into parts, getting the solutions for the parts, and then combining them.

Pick a random element Li, and call it the pivot.
Split the list into two sub-lists:
LT={Li | Li < pivot} and
GT={Li | Li ≥ pivot}
Let SortedLT = Sort(LT)
Let SortedGT = Sort(GT)
Concatenate SortedLT, pivot, and SortedGT.

Let’s look at our example again: L=[3,5,2,8,4,1,9,12]:

  1. Let pivot=8: LT=[3,5,2,4,1], GT=[9,12].
  2. Now recurse:
    1. Sort LT: pivot=2; LT=[1], GT=[3,5,4]… Continue until you have
    2. Sort GT: SortedGT=[9,12].
  3. Concatenate: result=[1,2,3,4,5] + [8] + [9,12] = [1,2,3,4,5,8,9,12].

If the pivot you select each time is close to the median value of the list, then this will run in O(n lg n) time. Unfortunately, you can’t always easily find the median. In general, either the list is in random order (in which case, picking the first element of the list as the pivot is as good as any other); or the list is in almost sorted order (in which case, picking the element at the midpoint of the list will usually work well.) Unfortunately, it’s possible to always end up picking the smallest value it the list as the pivot – in which case, you’re going to see O(n2) performance.

The nice thing about quicksort is that on average, you can get good performance, and you can do it on a contiguous array in-place, meaning that you don’t need to allocate any extra memory to store copies of parts of the list. Combine that with average O(n lg n) performance, and it’s a pretty attractive algorithm: easy to implement, space efficient, and usually fast.

Merge Sort

My favorite sort algorithm. This one works best on linked lists – otherwise, it blows a bunch of memory in copying. Like quicksort, it’s a divide and conquer algorithm; but instead of doing the comparisons ahead of time like in quicksort, it does the comparisons at merge time.

Split L into to sublists, A and B. Note - no comparisons. Just break the list in
Sort A into SortedA and B into SortedB.
while both SortedA and SortedB are non-empty:
if SortedA1 < SortedB1 then:
remove SortedA1 from SortedA and append it to the result.
remove SortedB1 from SortedB and append it to the result.
append whichever list is non-empty to the result.

An example really helps to see how this works. Again, using [3,5,2,8,4,1,9,12]:

  1. Result=[]
  2. A=[3,5,2,8], B=[4,1,9,12]
  3. Recurse getting SortedA=[2,3,5,8], and SortedB=[1,4,9,12]
  4. SortedA[1]=2, SortedB[1]=1. So remove B[1] and append it to Result. Result=[1], SortedA=[2,3,5,8], SortedB=[4,9,12]
  5. SortedA[1]=2, SortedB[1]=4. So remove A[1] and append it to Result. Result=[1,2], SortedA=[3,5,8], SortedB=[4,9,12]
  6. SortedA[1]=3, SortedB[1]=4. So remove A[1] and append it to Result. Result=[1,2,3], SortedA=[5,8], SortedB=[4,9,12]
  7. SortedA[1]=5, SortedB[1]=4. So remove B[1] and append it to Result. Result=[1,2,3,4], SortedA=[5,8], SortedB=[9,12]
  8. SortedA[1]=5, SortedB[1]=9. So remove A[1] and append it to Result. Result=[1,2,3,4,5], SortedA=[8], SortedB=[9,12]
  9. SortedA[1]=8, SortedB[1]=9. So remove A[1] and append it to Result. Result=[1,2,3,4,5,8], SortedA=[], SortedB=[9,12]
  10. SortedA is empty, so append SortedB to the result: Result=[1,2,3,4,5,8,9,12].

Mergesort is always O(n lg n) comparisons. But it can also wind up using O(n lg n) extra space for copies, compared to no space for in-place quicksort. (Both use some stack space, but quicksort doesn’t use data space; and in the average case, the stack space is both dwarfed by the data, and very similar between the two algorithms.)

0 thoughts on “Not Quite Basics: Sorting Algorithms

  1. Ørjan Johansen

    The main problem with heap sort is that implementing balanced trees is a pain.

    It is easier to implement a binary heap, one of the other heap structures you allude to. It can even be done in-place, like quicksort. But of course it cannot beat the latter in simplicity.

  2. Flaky

    To nitpick: Technically Quicksort requires O(log n) storage for the stack, the worst case in terms of storage, although in practice the number of extra bytes needed is usually insignificant and can be statically allocated (per each instance of the algorithm).

  3. matt

    I’m not sure I can even conceive of a geekier topic of discussion than “my favourite sort algorithm.” High praise 🙂
    My personal faves are ones I’ve never had occasion to implement, but they have a kind of charm: radix sort — O(n) performance where applicable, though perhaps misleadingly; and bogosort — the proverbial worst possible sort algorithm, in which you randomly shuffle your list and then check whether it happens to be in order — a procedure that seems to have a lot to say about life…

  4. Tommy

    It is probably obvious… but could you please explain?!
    Given “There are N! permutations” (OK.)
    Then, why does the following follow???
    “that means that we need log(N!) bits”

  5. Nick Johnson

    How does Radix sort figure into this theory of sorting complexity? Radix sort is O(n), albeit with a large constant multiplier that makes it barely faster (and often slower) than O(n log n) sorts.

  6. Josiah Carlson

    My personal favorite sorting algorithm is heapsort. Why? Most libraries come with a sane heap implementation (using an array addressed like a binary heap), and implementing sorting from it is fairly simple. Even modifying the library-supplied heap to do in-place sorting (your output array is your input array) is generally quite easy. You also don’t need to rely on a balanced tree (red-black, avl, 2-3, etc.) implementation – only an array that is indexable in O(1) time.
    While mergesort is O(1) extra space for lists, and can be made to perform in O(1) extra space for sorting arrays (, the algorithms for doing so are nontrivial. I tend to stay away from mergesort (except for the variant used as Python’s list.sort()), if only because I usually have to be concerned about allocating temporary space.
    The trick with making quicksort O(1) space, including recursion stack space, is when breaking up the sequence into two halves, always recurse on the smaller one first. Alternatively, if you always wanted to pick the median, there is a deterministic selection algorithm that is able to choose any ith item in a sequence if it were sorted (median is fine) in O(n) time, though it has a constant of around 7 or so (which would make a deterministic quicksort around 8nlogn, as opposed to heapsort which is in the worst-case 2nlogn). The typical suggestion is to choose 3 elements randomly and select the median of those three (also known as the median of 3 method).

  7. Nick Johnson

    Clarification to my above post: Radix is actually O(mn), where m is the number of digits (eg, bits) in the value being sorted, but if you hold that constant (as you would when, for example, sorting a list of 32bit integers), you get O(n).

  8. Josiah Carlson

    Re: Radix Sort
    Because Radix Sort is not comparison based (it is address based), it is typically not analyzed in the same way. Generally speaking, you talk about radix sort as taking k passes over n elements, determining relative ordering of the next most significant log(n)/k bits in each pass, using O(n**(1/k)) extra space.

  9. Mark C. Chu-Carroll

    Radix sort is cheating 🙂
    It claims to be O(N)… And it is, sort of, if you cheat.
    The trick to radix sort is that it claims to take a fixed number of iterations – and so as a constant factor, that can be eliminated.
    But the number of iterations is dependent on the numeric range – each time you add another digit to the numeric range, radix sort needs another iteration. So in general, to sort N D-digit numbers, it’s O(N*D). But what’s D? Well, in the worst case (a non-sparse list of numbers), D is log(N). So really, radix sort is still O(n log n), it’s just that it hides the “log n” factor in the numeric representation, and pretends that it’s constant.

  10. Mark C. Chu-Carroll

    In general, to represent a number N in binary, it takes ceiling(log2(N)) bits. So, for example, to represent the number 12 in binary, it takes 4 bits; the log2(12) is roughly 3.6.
    So if we want to know how many bits we need to represent N!, it’s roughly log2(N!).
    Does that help?

  11. Nick Johnson

    Mark: Not exactly – there’s nothing prohibiting duplicates. You could have a very large N but a very small D.

  12. Suresh

    Hi Mark,
    just another nitpick. You don’t need to invoke Kolmogorov complexity to argue the bound really. In fact, the cleanest argument is just that the height of the decision tree is log(N!) because it has N! leaves and is a binary tree. Arguing from information-theoretic considerations forces you to discuss whether a comparison is really deciding one bit of the answer or not, yadda yadda yadda. not that it’s wrong, but that it seems overkill.

  13. Mark C. Chu-Carroll

    Sure it’s overkill, but it’s fun overkill. Like I said, sort algorithms aren’t really my favorite subject; throwing in a bit of K-C complexity made it more fun for me to write :-).

  14. _Arthur

    Badly implemented, QuickSort can take O(n) space on the stack.
    It suffice to always sort the smallest sub-partition first, and the stack space goes down to a manageable O(log n).
    One old trick is to, when partitions are small, say 20 elements, to sort such small subpartitions using Insertion Sort instead of recursing on QuickSort. InsertionSort really shine on such small arrays.
    One even better trick is, when partitions are 20 elements or smaller, to NOT sort them, to leave them unsorted. You call InsertionSort on the whole array once the (partial) QuickSort is done. That way, instead of calling the InsertionSort subroutine hundreds of time, you call it once. No element will be moved more than 20 places (none will cross a pivot). The cost for InsertionSort to skip over pivots is negligible/insignificant.

  15. _Arthur

    When you only have medium-sized arrays to sort, say, 10,000 elements, ShellSort is the perfect sort. It is about as easy as InsertionSort to program, when theres dozen of ways to make serious errors of programming when programming QuickSort.
    To ShellSort you sort elements 1, 11, 21, 31 … with InsertionSort, same with elements, 2, 12, 22, 32 … You sort sub-arrays separated by by some increment. You repeat with a smaller increment, and the final sort is an InsertionSort with a 1 increment.
    The “increment” series must not be multiples. The one I use is 40, 13, 4, 1 ; you construct in by doing inc=3*inc+1 while inc

  16. Pseudonym

    Mark: If you think that radix sort is cheating, then your “proof” that you need O(N log N) time for a comparison-based sort is also cheating, because it assumes that comparisons take constant time. It’s O(N log N) comparisons all right, but not O(N log N) time.
    MSD recursive radix sort on “strings” (arbitrary-length sequences of fixed-size numbers, where comparing two fixed-size numbers is constant time) has asymptotic worst-case complexity of O(S), where S is the size of the key data measured in the fixed-size number unit (e.g. bytes).
    Comparison-based sort on the same data would take O(K N log N) where K is the average number of base comparisons required per real comparison. On non-pathological data, that works out to O(N log N log log N) time.

  17. Xanthir, FCD

    Just so this doesn’t get lost in the comments of the Basics:Algorithm post, I recommend everyone check out this site on various sorting algorithms:
    It’s a very cool little site, with imbedded java animations showing each sort in operation. It’s a bit difficult to tell exactly what’s going on with stuff like quicksort or shellsort, but for most of them you get a really intuitive idea of what’s happening. It also lets you see the source of each sort, for the curious.

  18. bigTom

    With a few modifications quick-sort can be made pretty efficient. One can pick a random entry for the median -then the chance of getting anywhere near worst can can be vanishing small. Also if N is large pick several random values, and use the median of them oas the “median” and on average it will require fewer passes. With a smallish amount of additional storage you can make a version that vectorizes. And memory access per pass is to a couple of stride-1 data streams, i.e. modern
    computer architectures whose memory access is very-far from random can run at high efficiency.
    My other favorite is single pass radix, which is usefull if you have integer data where N is much larger thanthe range of values. This is important in symbolic factorization (large sparse linear algebra). I believe it is called distribution counting sort (but I might be wrong). Unfortuantely it requires scratch storage of size N!

  19. _Arthur

    BigTom, I never used a random pivot, because most random() function calls cost more than a hundred integer comparisons.
    I always use median-of-3, the median value of the first, last and middle element of the partition, as pivot. It cost 5 comparisons to get the median. It protects you from 0-sized supbartitions. The degenerate case would require that the first, last and middle element all be the 3 largest or the 3 lowest of a partition, a rare case with real-world data.
    With real world data, you often get data that is partialy ordered, or clustered, or, for some reason, sorted but in reverse order. Your QuickSort must be able to handle those cases, more than a “random” degenerate case.
    The most common degenerate case for QuickSort is when you try to sort all equal values, say an array of 1 million zeroes. Some QuickSorts revert to O(N**2) with that case, others are speeded up. Make sure your implementation handle that case well.
    Most only check their QuickSort with randomized input: that’s a mistake.
    Also test your QuickSort with reverse-ordered data, another degenerate case.

  20. Andrew Wade

    Merge Sort
    My favorite sort algorithm. This one works best on linked lists – otherwise, it blows a bunch of memory in copying.

    I recall reading a description of how to do a merge sort with data stored on magnetic tape (that wouldn’t all fit into main memory). Not too many alternatives in that situation! One other advantage is that it can be a stable sort: the sort doesn’t change the order of equal entries. On occasion this property turns out to be useful.

  21. _Arthur

    Andrew, modern multi-way Merges would expect to have the data on Hard Drives, and would try to maximize the use of RAM. For that kind of sort, it becomes important to avoid triggering virtual memory — if possible –, not an easy task with current OSes.
    The way I see it, you would try to determine how much *REAL* memory your program can count on, say 128Mb on 1GB RAM, then read the first 128 Mb of the 20 Gb of data to be sorted, sort it in-memory, then withe the 128Mb block; repeat on the next 128Mb block of data, white in back in a temporary file. You track of those 128 Mb block as a tree (yes, that consumes memory space too).
    In the next passes, you’d load the first 32Mb of the 4 128Mb files that begins with the lowest values (your tree keeps track of those), and merge it in a 4-ways memory merge, writing blocks of output and reading blocks of new input.
    The newly created temporary output files will be at least 4*128Mb sorted values big.
    Another approach would be to use the OS to map the merge files into virtual address space. As long you only access the files sequentially, the virtual memory overhead should be identical to a standard disk read. You must be extremely familiar with your OS for that one to work.

  22. Anonymous

    There are N! permutations; that means that we need log(N!) bits to identify the function.
    In general, to represent a number N in binary, it takes ceiling(log2(N)) bits.

    So now I know why log(N) is popping up so often in the metric. (Or perhaps relearn, because I have a nagging suspicion I have seen this before. If so, I hope it will stick this time!)
    I was going to ask about log/log2, since N bits would presumably adress 2^N choices. So it is the Ordo in O(N×log(N)) that permits us to change log2 to log, I gather.
    Yes, this is nifty, but I’m also glad to have it sorted out!
    Btw, seeing the detailed concerns in diverse sortings, I’m reminded of the earlier remark about compilers being better at optimizing code than humans. It is a pity data is so contingent. Perhaps I should wish for a software that ingests some chosen bits and typical sizes of a sorting problem and spits out an optimized algorithm for the current machine…

  23. brh

    Heap sort is sort-of a variation on insertion sort – only instead of doing the insert into a flat list, we’re inserting into a binary search tree which we’ll call a heap.

    A heap is a binary tree not a binary search tree.

  24. Harald Korneliussen

    Josiah wrote: “My personal favorite sorting algorithm is heapsort. Why? Most libraries come with a sane heap implementation (using an array addressed like a binary heap), and implementing sorting from it is fairly simple.”
    well, most libraries come with a sane sorting algorithm, as well 😉

  25. Cherez

    I always found Selection Sort to be the most straightforward sorting algorithm, it might deserve its place here. I also have always seen Heap Sort to be an optimal implementation of Selection Sort (to be honest, I don’t see how it’s like Insertion Sort in any implementation I’ve read.)
    Anyway, I’ve found that on normal (random) data, a normal Quicksort outperforms everything else. An out of place Merge Sort will run in 2-3 times as long, and Heap Sort runs in about 2-3 times as long as Merge.
    All this based on linear arrays, I’ve not done much sorting of linked lists.

  26. Sharon Curtis

    Sorry to be pedantic, but the “at least O(N×log(N))” doesn’t actually make any sense. The “O” symbol is actually an upper bound. So if you say (e.g.) that an algorithm is O(N) then you are saying it’s no worse than linear. A constant-time algorithm would also be O(N), although you wouldn’t usually say that because it’s confusing.
    The symbol for a lower bound is Ω (capital omega), so when you say something like “sorting takes at least order N log N comparisons”, the way to write that is “sorting takes Ω(N×log(N)) comparisons”.
    If you have a bound that is both a lower and upper bound, then you can use the symbol Θ (capital theta). So you could say “heapsort takes order N log N comparisons” and write that down as “heapsort takes Θ(N×log(N)) comparisons”.
    For most purposes, using the upper bound symbol O works just fine because giving an upper bound is usually sufficient. But when you say something “at least O(…)” you’re saying something is at least something bounded above by … and that doesn’t convey any information at all. Like saying “x is at least some number smaller than y” gives you no information about what x might be.

  27. Luna_the_cat

    Merge sort:
    Ok, so how would you do that in Perl? Given an array of variable length, dependent on the previous calculation….

  28. Andrew Wade

    In the next passes, you’d load the first 32Mb of the 4 128Mb files that begins with the lowest values (your tree keeps track of those), and merge it in a 4-ways memory merge, writing blocks of output and reading blocks of new input.
    The newly created temporary output files will be at least 4*128Mb sorted values big.

    I hadn’t thought of greater than 2-way merges, but of course that makes perfect sense in this situation. The limitation to scaling this up to a merge of all lists at once would be that you need to read the source lists/files in large enough chunks to get good performance from the hard drives, so you’re limited by available memory.

    Another approach would be to use the OS to map the merge files into virtual address space. As long you only access the files sequentially, the virtual memory overhead should be identical to a standard disk read. You must be extremely familiar with your OS for that one to work.

    You’d want to be extremely familiar with your OS anyway; you want to avoid duplicating data between the OS filesystem cache and the application buffers, and you want the OS to drop data from the OS filesystem cache when you’re done with it. (Closing a file when you’re done with it will not be enough.) Fortunately almost all OSes have good POSIX support nowadays¹, so you could either use mmap() in combination with madvise(…, MADV_SEQUENTIAL) (my preference), or open(…, O_DIRECT) in combination with read() and write() (if you can do a better job of scheduling IO than the OS).
    ¹ Linux and Unixes, Windows NT and derivatives (2000, XP, …), OS X. Probably some others as well.

  29. KeithB

    Sedgewick: Algorithms in C (vol 1) discusses all these and has some *excellent* diagrams showing how each sort works, and why they fail on pathological data.

  30. bigTom

    Arthur you are right, you have to do some special things to avoid the pathological cases. In fact in the worst case of only one unique data value, a naive implementation can give O(infinity) (if all ties go the same way). I never do pure quick-sort, once a sub-problem is small enough (the value can be tuned for best performance), I just leave it in place, a bubble sort can be used at the end to cleanup the unfinished work.

  31. _Arthur

    Tom, from the top of my head, it takes much less time to merge 4 blocks in memory, than to READ one from the disk. So your sorting application must schedule and start asynchronously the disk reads of the merge files well in advance.
    It is likely that the application would spend most of its time blocked, waiting for IO.
    Increasing the merge order might help in that context, ask the OS to read, say, 16 different mergefiles, so to have 16 read operations ongoing at any time. That might keep both the CPU and the IO busy all the time. Still need a strategy to minimize the number of passes.

  32. Steve

    I think perhaps you brushed off Radix Sort a bit too quickly.
    If we let w = max(n) and d =1 + floor(log_k(w)), then radix sort will have d passes on n inputs and worst case running time of O(n log_k n). This is just as you said. But this is why Radix is not considered to be a _general_ purpose sort.
    When the problem’s conditions are such at w is constant then d is constant, and radix sort will run in O(n) time. There are many cases where w *is* constant which makes radix sort attractive because we attain asymptotically linear time sorting in those cases. This is not a “cheat”.
    You have to know when Radix Sort (or Counting Sort or Bucket Sort) is the right solution to your problem…which brings out a host of other issues with asymptotic analysis, for example:
    1) The hypothetical RAM machine upon which it is based may differ significantly from the actual machine/language upon/in which algorithms are implemented (comparisons may not be constant time, etc).
    2) Asymptotic analysis says that O(n log n) is better than O(n^2) for some n=n_0. If you don’t reach n_0 in your problem domain, it may actually be better/faster to implement, maintain and run the O(n^2) algorithm.

  33. _Arthur

    Another important characteristic of QuickSort is ith high locality. The Partition() subalgorithm scans the sub array by accessing values sequentially in 2 sequences ( in opposite directions or in the same direction, depending on your partition() flavor).
    That has tremendous implication both in L2 caching implications, and in virtual memory efficency.
    Modern CPU detect sequential memory accesses, and speculatively prefetch cache lines from main memory to the L2 cache, a 100-clocks operation nowadays. So the partition() algorith will not stall witing for memory accesses, because the values at vec[i] and vec[k] will already have been plucked from memory, and are ready at hand in the L2 cache.
    Same thing happens virtual memory wise, a 4K page will be intensively used, and then no longer accessed for a great while, a good behavior, VM-wise
    Compare to a ShellSort, that access elements v[1],v[1001],v[2001],v[3001] … and then later v[2],v[1002],v[2002],v[3002]… Those value will cause repeated VM page faults, and, even if present in main memory, will cause memory stalls every time, because when the program comes back to access v[2], the cache line containing both v[1] and v[2] is no longer in the cache. and when you start the next pass, smae thing happen.
    ShellSort and HeapSort are not well-behaved, cachewise, while QuickSort and MergeSort (and RadixSort, too, damnit) exhibit excellent localization characteristics.

  34. dave

    One nifty thing about radix sort is that if you set up data-on-paper-cards properly, you can easily parallelize away the N term and radix sort it in O(D) time:
    Punch holes along one side of the cards, and encode the values in a binary representation by cutting or not-cutting between the edge and the hole. Then stack them so that the holes line up and run a long rod through the holes, and you can lift out the ones with that hole not-cut. Then just move them to the back/front and repeat on the hole for the next bit until you run out of holes.
    Also on the subject of parallelizing: If you can throw O(N) parallelism at it, you can bubble sort in O(N) time without needing to do anything clever. Though I suspect (not having sat down and thought about it) that with a little bit of cleverness you can parallelize an O(N log N) sort to reduce either the time or the parallelism needed to O(log N), so bubble sort still loses.
    External sorting has always been best done by sorting chunks as large as you can fit in main memory and then doing N-way merges with the largest value of N you can. Unless you’re limited by the number of tapes you can mount at once, you probably want to merge everything at once, since that way you only have to read and write everything twice (once for the chunk sort and once for the merge). If you’re serious enough about sorting big datasets to be worrying about the speed of your external sort, you can afford enough memory to keep a few disk blocks worth for every file you’re working with (one block per file for pending I/O and one for the in-memory merge, with blocks sized at the minimum “efficient” transfer size for the hardware/filesystem).
    Of course, in real life, the right one to choose is usually either insertion sort or whatever-your-library-has, since most sorts are on small datasets and insertion sort is hard to beat for small datasets and easy to implement.

  35. Josiah Carlson

    Re: _Arthur and random pivots
    Depending on the quality of the random number you are after, you don’t necessarily need to spend much time generating it. For example, the Wichman-Hill PRNG takes 3 multiples, 3 additions, 4 modulos, and 3 divides – and it’s probably overkill for quicksort. You can do fairly well with just your input parameters – addresses of your input, start, end, and the first few values. You will have to do a modulo, but the other mixing can be some shifts and xors.
    Also, median of 3 can be done in 3 comparisons:
    if low > mid: swap low and mid
    if mid > high: swap mid and high
    if low > mid: swap low and mid
    median is mid

  36. _Arthur

    Yes, Josiah, many Quicksort coders will blindly follow the “use a random value for pivot” recommendation without realizing that the call for the random() function entails a high overhead, and the divides and modulo are the most costly CPU operations, in clock cycles.
    And, if the random() routine your calling is bigger than one expect, it can potentially crowd out the partition() code out of the instruction cache, adversely affecting performance.
    That’s why I highly recommend the median-of-3 choice of pivot: it is trivial to program and use only the cheapest instructions (comparisons), and it garantees your pivot not to be the worst possible choice.
    If you happen to sort already-sorted data, a case often encountered in real-world circumstances, it even provides the best possible choice for pivot!
    QuickSort is a sorting algorithm prone to revert to O(N**2) on some pathological input.
    One protection against that is to use the median-of-3 choice of pivot.
    Another is to make sure your Partition() sub-algorithm doesn’t needlessly swap data when partitionning all-equal values.

  37. Magnus

    Torbjörn, you asked about log₁₀ vs. log₂ and yes it’s O that lets us switch between the two. log₂ N/log₂ 10 = log₁₀ N that is they differ only by a constant and hence O(log₂ N) = O(log₁₀ N).


Leave a Reply