nullprogram.com/blog/2014/08/29/
The C standard library includes a quicksort function called qsort()
.
It sorts homogeneous arrays of arbitrary type. The interface is exactly
what you’d expect given the constraints of the language.
void qsort(void *base, size_t nmemb, size_t size,
int (*compar)(const void *, const void *));
It takes a pointer to the first element of the array, the number of
members, the size of each member, and a comparator function. The
comparator has to operate on void *
pointers because C doesn’t have
templates or generics or anything like that. That’s two interfaces
where type safety is discarded: the arguments passed to qsort()
and
again when it calls the comparator function.
One of the significant flaws of this interface is the lack of context
for the comparator. C doesn’t have closures, which in other languages
would cover this situation. If the sort function depends on some
additional data, such as in Graham scan where points are
sorted relative to a selected point, the extra information needs to be
smuggled in through a global variable. This is not reentrant and
wouldn’t be safe in a multi-threaded environment. There’s a GNU
extension here, qsort_r()
, that takes an additional context
argument, allowing for reentrant comparators.
Quicksort has some really nice properties. It’s in-place, so no
temporary memory needs to be allocated. If implemented properly it
only consumes O(log n) space, which is the stack growth during
recursion. Memory usage is localized, so it plays well with caching.
That being said, qsort()
is also a classic example of an API naming
mistake. Few implementations actually use straight
quicksort. For example, glibc’s qsort()
is merge sort (in
practice), and the other major libc implementations use a hybrid
approach. Programs using their language’s sort function
shouldn’t be concerned with how it’s implemented. All the matters is
the interface and whether or not it’s a stable sort. OpenBSD made the
exact same mistake when they introduced arc4random()
,
which no longer uses RC4.
Since quicksort is an unstable sort — there are multiple possible
results when the array contains equivalent elements — this means
qsort()
is not guaranteed to be stable, even if internally the C
library is using a stable sort like merge sort. The C standard
library has no stable sort function.
Comparator Composability
The unfortunate side effect of unstable sorts is that they hurt
composability. For example, let’s say we have a person
struct like
this,
struct person {
const char *first, *last;
int age;
};
Here are a couple of comparators to sort either by name or by age. As
a side note, strcmp()
automatically works correctly with UTF-8 so
this program isn’t limited to old-fashioned ASCII names.
#include <string.h>
int compare_name(const void *a, const void *b)
{
struct person *pa = (struct person *) a;
struct person *pb = (struct person *) b;
int last = strcmp(pa->last, pb->last);
return last != 0 ? last : strcmp(pa->first, pb->first);
}
int compare_age(const void *a, const void *b)
{
struct person *pa = (struct person *) a;
struct person *pb = (struct person *) b;
return pa->age - pb->age;
}
And since we’ll need it later, here’s a COUNT_OF
macro to get the
length of arrays at compile time. There’s a less error prone
version out there, but I’m keeping it simple.
#define COUNT_OF(x) (sizeof(x) / sizeof(0[x]))
Say we want to sort by name, then age. When using a stable sort,
this is accomplished by sorting on each field separately in reverse
order of preference: a composition of individual comparators. Here’s
an attempt at using quicksort to sort an array of people by age then
name.
struct person people[] = {
{"Joe", "Shmoe", 24},
{"John", "Doe", 30},
{"Alan", "Smithee", 42},
{"Jane", "Doe", 30}
};
qsort(people, COUNT_OF(people), sizeof(struct person), compare_name);
qsort(people, COUNT_OF(people), sizeof(struct person), compare_age);
But this doesn’t always work. Jane should come before John,
but the original sort was completely lost in the second sort.
Joe Shmoe, 24
John Doe, 30
Jane Doe, 30
Alan Smithee, 42
This could be fixed by defining a new comparator that operates on both
fields at once, compare_age_name()
, and performing a single sort.
But what if later you want to sort by name then age? Now you need
compare_name_age()
. If a third field was added, there would need to
be 6 (3!) different comparator functions to cover all the
possibilities. If you had 6 fields, you’d need 720 comparators!
Composability has been lost to a combinatorial nightmare.
Pointer Comparison
The GNU libc documentation claims that qsort()
can be made
stable by using pointer comparison as a fallback. That is, when
the relevant fields are equivalent, use their array position to
resolve the difference.
If you want the effect of a stable sort, you can get this result by
writing the comparison function so that, lacking other reason
distinguish between two elements, it compares them by their
addresses.
This is not only false, it’s dangerous! Because elements may be sorted
in-place, even in glibc, their position will change during the sort.
The comparator will be using their current positions, not the
starting positions. What makes it dangerous is that the comparator
will return different orderings throughout the sort as elements are
moved around the array. This could result in an infinite loop, or
worse.
Making it Stable
The most direct way to work around the unstable sort is to eliminate
any equivalencies. Equivalent elements can be distinguished by adding
an intrusive order
field which is set after each sort. The
comparators will fall back on this sort to maintain the original
ordering.
struct person {
const char *first, *last;
int age;
size_t order;
};
And the new comparators.
int compare_name_stable(const void *a, const void *b)
{
struct person *pa = (struct person *) a;
struct person *pb = (struct person *) b;
int last = strcmp(pa->last, pb->last);
if (last != 0)
return last;
int first = strcmp(pa->first, pb->first);
if (first != 0)
return first;
return pa->order - pb->order;
}
int compare_age_stable(const void *a, const void *b)
{
struct person *pa = (struct person *) a;
struct person *pb = (struct person *) b;
int age = pa->age - pb->age;
return age != 0 ? age : pa->order - pb->order;
}
The first sort doesn’t need to be stable, but there’s not much reason
to keep around two definitions.
qsort(people, COUNT_OF(people), sizeof(people[0]), compare_name_stable);
for (size_t i = 0; i < COUNT_OF(people); i++)
people[i].order = i;
qsort(people, COUNT_OF(people), sizeof(people[0]), compare_age_stable);
And the result:
Joe Shmoe, 24
Jane Doe, 30
John Doe, 30
Alan Smithee, 42
Without defining any new comparators I can sort by name then age just
by swapping the calls to qsort()
. At the cost of an extra
bookkeeping field, the number of comparator functions needed as fields
are added is O(n) and not O(n!) despite using an unstable sort.