Data Structures and Algorithms

Quicksort

Faster sorting algorithms tackle two main weaknesses of the simple sorts:

They might compare every distinct pair of elements
- Each comparison learns only one piece of information
- Contrast with binary search, which rules out $N/2$ candidates with the first comparison
They often move elements one place at a time (bubble and insertion)
- Even if the element is "far" out of place
- Contrast with selection sort, which moves each element to its final place

Quicksort

Quicksort is a recursive sorting algorithm which is "easy" to implement. It works well with a variety of input data, and uses $O(1)$ additional memory, plus memory for the recursive stack frames: $O(\log N)$ of them on average, but $O(N)$ in the worst case.

Quicksort is a divide and conquer algorithm.

The base case: Arrays of length 0 or 1 are trivially sorted.

The recursive step:

Pick an element elt to partition the array around
Form an array of (LHS) elt (RHS) (divide)
For all x in the left hand side, x <= elt
For all y in the right hand side, y >= elt
Sort the left hand side and right hand side recursively (conquer)

// Defined under "Simple partition"
int partition(int a[], int left, int right);

void quicksort(int a[], int left, int right) {
    // Base case: zero or one element is already sorted
    if (left >= right) return;

    // Recursive step
    // partition returns the pivot's final index: every element to
    // the left of that index is <= the pivot, and every element to
    // the right of it is >= the pivot.
    int pivot = partition(a, left, right);
    quicksort(a, left, pivot - 1);
    quicksort(a, pivot + 1, right);
}

Note that the pivot itself never has to be sorted again: after partitioning, it is already in its final position. The central question in quicksort is this: how do we partition?

How to partition

It would be ideal if the pivot split the array exactly in the middle. A pivot that does that is the median. So why not just find the median and pivot on it? Because finding the exact median is itself expensive.

Simple alternative: just pick any element. If the array is in random order, you can always pick the first item, and it is just as likely to be the median as any other element. This isn't guaranteed to be a good pick, but on average the bad picks are balanced out by the good ones.

Example

Let's say your dataset is $(2, 9, 3, 4, 7, 5, 8, 6)$, and you use a simple heuristic: choose the last element as the pivot, which is 6.

You start scanning from the left hand side. As long as the element you're looking at is less than the pivot, it is already on the correct side. Likewise, when scanning from the right hand side, as long as the element you're looking at is greater than or equal to the pivot, it is already on the correct side.

First, the left scan stops at 9, which is greater than the pivot and shouldn't be on the left. So you scan from the right to find something less than your pivot. The first of these values is 5. Swap 9 and 5.

$(2, 5, 3, 4, 7, 9, 8, 6)$

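Next, the left scan stops at 7, which is greater than your pivot. The right scan meets the left scan without finding anything smaller than the pivot, so 7 sits where the pivot belongs. So you swap 6 and 7.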

$(2, 5, 3, 4, 6, 9, 8, 7)$

Now, you just have to quicksort $(2, 5, 3, 4)$ and $(9, 8, 7)$.

Worst case for last element

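A sorted array: $(1, 2, 3, 4, 5)$. The last element is always the largest, so every partition leaves the right hand side empty and each recursive call shrinks the problem by only one element.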

Simple partition

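#include <utility>  // std::swap

// Partitions a[left..right] around its last element and returns the
// pivot's final index.
int partition(int a[], int left, int right) {
    int pivot = right--;  // pivot is the last element; right now ends the unscanned range
    while (true) {
        // Skip elements that are already on the correct side
        while (a[left] < a[pivot])
            left++;
        while (left < right && a[right] >= a[pivot])
            right--;

        if (left >= right) break;  // the two scans have met

        std::swap(a[left], a[right]);
    }

    // a[left] is the first element >= the pivot, so the pivot belongs here
    std::swap(a[left], a[pivot]);

    return left;
}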
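
To check the worked example against the code, here is a small self-contained sketch. It restates the last-element partition and the quicksort function from these notes so that it compiles on its own. The `assert` precondition is an addition for illustration and is not part of the original functions.

```cpp
#include <cassert>
#include <utility>  // std::swap

// Last-element partition, restated from the notes.
int partition(int a[], int left, int right) {
    assert(left <= right);  // added precondition: the range must be non-empty
    int pivot = right--;
    while (true) {
        while (a[left] < a[pivot]) left++;
        while (left < right && a[right] >= a[pivot]) right--;
        if (left >= right) break;
        std::swap(a[left], a[right]);
    }
    std::swap(a[left], a[pivot]);
    return left;
}

// Quicksort, restated from the notes.
void quicksort(int a[], int left, int right) {
    if (left >= right) return;
    int pivot = partition(a, left, right);
    quicksort(a, left, pivot - 1);
    quicksort(a, pivot + 1, right);
}
```

On the example dataset, `partition(a, 0, 7)` returns index 4 and leaves the array as $(2, 5, 3, 4, 6, 9, 8, 7)$, matching the trace in the example. Calling `quicksort(a, 0, 7)` afterwards finishes the sort.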

Another way to partition

Since the last item was just an arbitrary pick, in randomly ordered data you would be just as likely to find the median at the end, the middle, or anywhere else. But maybe your data is partially sorted, which makes the last item a poor pivot. So why don't we try the item in the middle instead?

int partitionMiddle(int a[], int left, int right) {
    // Find physical middle of range (written this way to avoid overflow)
    int pivot = left + (right - left) / 2;

    // Move this to the end
    std::swap(a[pivot], a[right]);

    // Go on as normal
    return partition(a, left, right);
}
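
To see why the middle element helps on sorted data, the following sketch measures how deep the recursion goes for each pivot choice. The helpers `sortDepth` and `depthOnSorted` are hypothetical names introduced for this illustration. `partitionLast` is the last-element partition from these notes under a different name.

```cpp
#include <algorithm>  // std::max
#include <cassert>
#include <utility>    // std::swap
#include <vector>

// Last-element partition, as in the notes.
int partitionLast(int a[], int left, int right) {
    int pivot = right--;
    while (true) {
        while (a[left] < a[pivot]) left++;
        while (left < right && a[right] >= a[pivot]) right--;
        if (left >= right) break;
        std::swap(a[left], a[right]);
    }
    std::swap(a[left], a[pivot]);
    return left;
}

// Middle-element partition: move the middle element to the end, then reuse partitionLast.
int partitionMiddle(int a[], int left, int right) {
    int mid = left + (right - left) / 2;
    std::swap(a[mid], a[right]);
    return partitionLast(a, left, right);
}

// Hypothetical helper: sorts a[left..right] and returns the number of
// nested partitioning levels that were needed.
int sortDepth(int a[], int left, int right, bool useMiddle) {
    if (left >= right) return 0;
    int p = useMiddle ? partitionMiddle(a, left, right)
                      : partitionLast(a, left, right);
    int leftDepth = sortDepth(a, left, p - 1, useMiddle);
    int rightDepth = sortDepth(a, p + 1, right, useMiddle);
    return 1 + std::max(leftDepth, rightDepth);
}

// Hypothetical helper: depth needed for the already sorted array 0, 1, ..., n-1.
int depthOnSorted(int n, bool useMiddle) {
    assert(n > 0);
    std::vector<int> v(n);
    for (int i = 0; i < n; i++) v[i] = i;
    return sortDepth(v.data(), 0, n - 1, useMiddle);
}
```

For a sorted array of 127 elements, this sketch reports a depth of 126 with the last element as pivot and a depth of 6 with the middle element.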

Analysis

Cost of partitioning $N$ elements: $O(N)$, since each element is compared with the pivot a constant number of times.

Worst case: Pivot always leaves one side empty
- $T(N) = N + T(N - 1) + T(0)$
- $T(N) = N + T(N - 1)$
- $T(N) = O(N^2)$
Best case: Pivot divides elements equally
- $T(N) = N + T(N/2) + T(N/2)$
- $T(N) = O(N \log N)$
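
The two bounds follow from unrolling the recurrences. The derivation below is a sketch that treats the cost of sorting an empty or one-element subarray as a constant $c$, and assumes $N = 2^k$ in the best case.

```latex
% Worst case: one side is always empty
\begin{align*}
T(N) &= N + T(N-1) \\
     &= N + (N-1) + T(N-2) \\
     &= N + (N-1) + \dots + 1 + T(0) \\
     &= \frac{N(N+1)}{2} + c = O(N^2)
\end{align*}

% Best case: both sides have size N/2, with N = 2^k
\begin{align*}
T(N) &= N + 2\,T(N/2) \\
     &= N + 2\left(\frac{N}{2} + 2\,T(N/4)\right) = 2N + 4\,T(N/4) \\
     &= kN + 2^k\,T(N/2^k) \\
     &= N \log_2 N + N\,T(1) = O(N \log N)
\end{align*}
```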

Pros and Cons

Advantages:

On average, $O(N \log N)$
Efficient memory usage
Thoroughly analyzed and understood

Disadvantages:

Worst case $O(N^2)$
Not stable: long-distance swaps can change the relative order of equal elements
Fragile (simple implementation mistakes can be very hard to find and fix)

Improvements

Improving Splits

Since a bad split leads to the $O(N^2)$ worst case, choosing a good split is really important. Any single fixed choice could always turn out to be the worst one. However, it's too expensive to actually compute the best one (the median).

Rather than compute the median, sample it! Pick three elements and take their median. This is very likely to give you better performance than a single pick. Sampling is a very powerful technique!
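
A minimal sketch of this idea follows. It assumes the three samples are the first, middle, and last elements of the range, which is a common choice but not one these notes specify. `medianOfThree` and `partitionMedianOfThree` are hypothetical names, and `partition` is the last-element partition from these notes.

```cpp
#include <cassert>
#include <utility>  // std::swap

// Last-element partition, as in the notes.
int partition(int a[], int left, int right) {
    int pivot = right--;
    while (true) {
        while (a[left] < a[pivot]) left++;
        while (left < right && a[right] >= a[pivot]) right--;
        if (left >= right) break;
        std::swap(a[left], a[right]);
    }
    std::swap(a[left], a[pivot]);
    return left;
}

// Hypothetical helper: returns whichever of the indices i, j, k holds
// the median of a[i], a[j], a[k].
int medianOfThree(const int a[], int i, int j, int k) {
    if (a[i] < a[j]) {
        if (a[j] < a[k]) return j;     // a[i] < a[j] < a[k]
        return (a[i] < a[k]) ? k : i;  // a[j] is the largest
    }
    if (a[i] < a[k]) return i;         // a[j] <= a[i] < a[k]
    return (a[j] < a[k]) ? k : j;      // a[i] is the largest
}

// Assumed sampling scheme: first, middle, and last element of the range.
int partitionMedianOfThree(int a[], int left, int right) {
    assert(left <= right);
    int mid = left + (right - left) / 2;
    int m = medianOfThree(a, left, mid, right);
    std::swap(a[m], a[right]);  // move the sampled median to the end
    return partition(a, left, right);
}
```

With this scheme, both a sorted and a reverse-sorted array of seven elements are split at index 3, the exact middle.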

Other

In divide and conquer, most of the recursive calls sort "little" subarrays, so reduce the cost of these. Insertion sort is faster than quicksort on small arrays, so bail out of quicksort when size < k. Then either insertion sort each small subarray as you reach it, or leave the small subarrays unsorted and run a single (fast!) insertion sort pass over the whole array at the end.
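
The first variant (insertion sort each small subarray) might look like the sketch below. The cutoff `kCutoff = 8` is an assumed value chosen for illustration; a good value has to be found by measurement. `hybridQuicksort` and `insertionSort` are hypothetical names, and `partition` is the last-element partition from these notes.

```cpp
#include <cassert>
#include <utility>  // std::swap

const int kCutoff = 8;  // assumed threshold, tune by measuring

// Sorts a[left..right] by insertion; does nothing for empty ranges.
void insertionSort(int a[], int left, int right) {
    for (int i = left + 1; i <= right; i++) {
        int value = a[i];
        int j = i - 1;
        while (j >= left && a[j] > value) {
            a[j + 1] = a[j];  // shift larger elements one place right
            j--;
        }
        a[j + 1] = value;
    }
}

// Last-element partition, as in the notes.
int partition(int a[], int left, int right) {
    assert(left <= right);
    int pivot = right--;
    while (true) {
        while (a[left] < a[pivot]) left++;
        while (left < right && a[right] >= a[pivot]) right--;
        if (left >= right) break;
        std::swap(a[left], a[right]);
    }
    std::swap(a[left], a[pivot]);
    return left;
}

void hybridQuicksort(int a[], int left, int right) {
    // Bail out of quicksort when fewer than kCutoff elements remain
    if (right - left + 1 < kCutoff) {
        insertionSort(a, left, right);
        return;
    }
    int pivot = partition(a, left, right);
    hybridQuicksort(a, left, pivot - 1);
    hybridQuicksort(a, pivot + 1, right);
}
```

The size check replaces the usual base case: ranges of length 0 or 1 are below the cutoff, and insertion sort leaves them untouched.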

What if many elements are equal?
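
One way to explore this question is to run the last-element partition from these notes on an array whose elements are all equal. In the sketch below, `pivotIndexOnEqualKeys` is a hypothetical helper introduced for that experiment.

```cpp
#include <cassert>
#include <utility>  // std::swap
#include <vector>

// Last-element partition, as in the notes.
int partition(int a[], int left, int right) {
    int pivot = right--;
    while (true) {
        while (a[left] < a[pivot]) left++;
        while (left < right && a[right] >= a[pivot]) right--;
        if (left >= right) break;
        std::swap(a[left], a[right]);
    }
    std::swap(a[left], a[pivot]);
    return left;
}

// Hypothetical helper: builds an array of n equal values and reports
// the index where partition places the pivot.
int pivotIndexOnEqualKeys(int n) {
    assert(n > 0);
    std::vector<int> v(n, 5);
    return partition(v.data(), 0, n - 1);
}
```

For this partition the pivot lands at index 0 whatever the array size, so the left hand side is empty and the right hand side holds the remaining $N - 1$ elements. That is the same lopsided split as the worst case in the analysis.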

Performance of Sorting Algorithms

Sorting algorithms that have a worst-case $O(N^2)$ time:

Bubble
Insertion
Selection
Quicksort

Heapsort does not belong in this group: its worst case is $O(N \log N)$.

Memory usage:

In place:
- Bubble
- Insertion
- Selection
- Heapsort