Digital sorting

Many array sorting algorithms (like Insertionsort, Selectionsort, Mergesort, Quicksort, and Heapsort) are based on comparisons between the elements of the array. This chapter introduces a sorting algorithm based on counting the elements of the array. This algorithm, in turn, is the mainspring of the Radixsort method for sorting short strings(This chapter was adapted from section 5.1 of the book by Sedgewick and Wayne.)

Table of contents:

Arrays of small integers

Suppose we want to rearrange in increasing order an array v[0 . . n−1] whose elements are integers in the range

0 . . R−1

where R is a small number.  This range is the universe of the array.  The sorted array will have a run of elements equal to 0, followed by a run of elements equal to 1, etc., and finally a run of elements equal to R−1.  The figure shows an array over the universe 0 . . 4 before and after sorting:

2 3 3 4 1 3 0 3 1 2 2 1 2 4 3 4 4 2 3 4
0 1 1 1 2 2 2 2 2 3 3 3 3 3 3 4 4 4 4 4

Exercises 1

  1. Array of bits.  Write a function to rearrange in increasing order an array v[0 . . n−1] whose elements are 0s and 1s.
  2. DNA.  Write a function to rearrange in increasing order an array whose elements are the characters A, C, G, and T of the genetic code.  Begin by transforming the set  A C G T  into the numeric interval 0 . . 3.

Sorting by counting

To sort an array v[0..n-1] whose universe is known in advance, we can resort to a simple counting procedure which does not compare one element to another. To begin, we need the idea of frequency.

The frequency of an element r of the universe 0..R-1 is the number of occurrences of r in the array v[0..n-1].  This number will be denoted by  f[r].  In the example above, where the universe is 0..4, the frequencies are given by the following table f[0..4]:

  0 1 2 3 4
f 1 3 5 6 5

The frequency of predecessors of an element r of 0..R is the sum f[0] + ... + f[r-1], denoted by  fp[r].  Note that we added R to the universe and that the sum that defines fp[r] does not include f[r].  Of course fp[0] is zero since the sum is empty.  In the example above, the frequency of predecessors is given by the following table:

  0 1 2 3 4 5
fp 0 1 4 9 15 20

Observe that fp[r] is the starting index of the run of elements equal to r in the sorted array. In the example above, the run of 1's begins at index 1, the run of 2's begins at index 4, the run of 3's begins at index 9, etc.

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
0 1 1 1 2 2 2 2 2 3 3 3 3 3 3 4 4 4 4 4

Now, the array v[0..n-1] can be sorted in two steps:  first, a table fp is used to copy the elements of v[0..n-1] to the correct positions of an auxiliary array aux[0..n-1];  then, the array aux is copied back to the space occupied by v.  The following code implements this algorithm:

// This function rearranges v[0..n-1] in
// increasing order assuming that the
// elements of the array belong to the
// universe 0..R-1.

void 
countingsort (int *v, int n, int R) 
{
   int r;
   int *fp, *aux;
   fp = malloc ((R+1) * sizeof (int));
   aux = malloc (n * sizeof (int));

   for (r = 0; r <= R; ++r) 
      fp[r] = 0;
   for (int i = 0; i < n; ++i) {
      r = v[i];
      fp[r+1] += 1; 
   }
   // A
   for (r = 1; r <= R; ++r)
      fp[r] += fp[r-1];
   // B
   for (int i = 0; i < n; ++i) {
      r = v[i];
      aux[fp[r]] = v[i];
      fp[r]++;  // C
   }
   // D 
   for (int i = 0; i < n; ++i)
      v[i] = aux[i];

   free (fp);   
   free (aux);
}

On line A, fp[r] is the frequency of r-1, i.e., fp[r] == f[r-1] for every r in 0..R-1.  On line B, fp[r] is the frequency of predecessors of r; hence, fp[r] will be the starting index of the run of elements whose value is r.  Line C increments fp[r] in preparation for the next element of value r in the array.  On line D, aux[0..n-1] is in increasing order.

Time consumption.  The function countingsort consumes  n + R  units of time.  If R is small (no more than some multiple of n), this is better than the  n log n  time bound of algorithms like Mergesort, Quicksort, and Heapsort.

Stability.  The function countingsort produces a stable sort (i.e., it does not change the relative positions of equal-value elements).  This stability is the basis for the use of the function in the digital sorting algorithm to be examined in the next section.

Exercises 2

  1. Prove that the function countingsort is correct.
  2. Correctness check.  Write a program to test, experimentally, the correctness of the countingsort function. (See the analogous exercise for Insertionsort.) Your program should receive the value of R from the command line.
  3. Criticize the following simplified version of function countingsort:
    int *f = malloc (R * sizeof (int)); 
    for (int r = 0; r < R; ++r)  f[r] = 0;
    for (int i = 0; i < n; ++i)  f[v[i]] += 1; 
    int i = 0;
    for (int r = 0; r < R; ++r) 
       for (int k = 0; k < f[r]; ++k)
          v[i++] = r;
    free (f);
    
  4. Array of structs.  Suppose the array v[0..n-1] represents a set of points in the plane. Each point is a pair (x, y) of integers with x in the range 0..999. Adapt the code of countingsort to sort the array with respect to the x coordinates.
  5. Arrays of letters.  Adapt the countingsort function to sort an array whose elements belong to the set of characters A..Z.
  6. Stability.  Show that the countingsort function is stable, i.e., that it preserves the relative order of elements of same value.

Digital sorting of fixed-length strings

Let v[0..n-1] be an array of strings, all of the same length. (You may assume that all the strings are ASCII, though this is not essential.)  Now consider the problem of

rearranging the array in lexicographic order.

In the context of this problem, the bytes of the strings are traditionally called digits, even if they do not belong to the set 0..9.

Example.  Suppose that the strings are car licence plates.  (We could restrict ouselves to the forty three characters from '0' to 'Z'.)  Here is an example with 10 plates, each 7 digits long:

FON1723
EAD3312
CDA7891
FAJ4021
DOG1125
BAT7271
GIZ1234
BAT7328
BIG8733
CAT9955

(This example is from the book by Sedgewick and Wayne.)  To put this array in lexicographic order, we can call countingsort several times:  first, to sort by the last digit, then to sort by the one-but-last digit, and so on:

CDA7891  EAD3312  FAJ4021  DOG1125  CDA7891  EAD3312  BAT7271
FAJ4021  FAJ4021  DOG1125  GIZ1234  EAD3312  FAJ4021  BAT7328
BAT7271  FON1723  GIZ1234  FON1723  DOG1125  BAT7271  BIG8733
EAD3312  DOG1125  BAT7271  EAD3312  BIG8733  BAT7328  CAT9955
FON1723  BAT7328  EAD3312  FAJ4021  FAJ4021  CAT9955  CDA7891
BIG8733  BIG8733  BAT7328  BAT7271  FON1723  CDA7891  DOG1125
GIZ1234  GIZ1234  FON1723  BAT7328  BAT7271  BIG8733  EAD3312
DOG1125  CAT9955  BIG8733  CDA7891  BAT7328  GIZ1234  FAJ4021
CAT9955  BAT7271  CDA7891  BIG8733  CAT9955  DOG1125  FON1723
BAT7328  CDA7891  CAT9955  CAT9955  GIZ1234  FON1723  GIZ1234

Since countingsort is stable, the array of strings will be eventually in lexicographic order.  Notice that it is essential to apply countingsort from right to left, i.e., starting from the last digit.

Algorithm.  The following code implements the algorithm used in the example.  The parameter W represents the common length of all the strings (7 in the example above) and the parameter R (smaller than 256) is the size of the universe of digits:

typedef unsigned char byte;

// Rearranges in lexicographic order the
// array v[0..n-1] of strings. Each v[i]
// is a string v[i][0..W-1] whose elements
// belong to the set 0..R-1.

void 
digitalsort (byte **v, int n, int W, int R) 
{
   int *fp;
   byte **aux;
   fp = malloc ((R+1) * sizeof (int));
   aux = malloc (n * sizeof (byte *));

   for (int d = W-1; d >= 0; --d) {
      int r;
      for (r = 0; r <= R; ++r) 
         fp[r] = 0;
      for (int i = 0; i < n; ++i) {
         r = v[i][d];
         fp[r+1] += 1; 
      }
      for (r = 1; r <= R; ++r) 
         fp[r] += fp[r-1]; 
      for (int i = 0; i < n; ++i) {
         r = v[i][d];
         aux[fp[r]] = v[i]; 
         fp[r]++; 
      }
      for (int i = 0; i < n; ++i) 
         v[i] = aux[i];
   }

   free (fp);
   free (aux);
}

The code ignores the null byte, \0, at the end of each string and therefore can be applied to any array of arrays of bytes (provided the value of each byte is in the range 0..R-1).

This sorting algorithm is digital because it sorts the strings digit-by-digit.  The algorithm is also known as

Time consumption.  The digitalsort function consumes  (n+R)W  units of time.  If R is small (at worst n), the time consumption is proportional to  Wn,  that is, to the total number of digits in the input.

Animations.  There are several animations of the Radixsort algorithm on YouTube. Here is a sample:

Exercises 3

  1. What happens if we replace the line  for (d = W-1; d >= 0; --d)  by  for (d = 0; d < W; ++d) in the code of digitalsort?
  2. Apply the digital sort algorithm to the array of strings
    no is th ti fo al go pe to co to th ai of th pa
    
  3. Write a version of digitalsort for arrays of ASCII strings of variable length.
  4. Write a function that receives integers n and W and produces an array of n random strings with W ASCII characters each. (This function will generate test data for digitalsort.)
  5. Correctness check.  Write a program to test, experimentally, the correctness of your implementation of the digital sort algorithm. (See the analogous exercise for Insertionsort.)
  6. Performance test.  Write a program to time your implementation of the digital sort algorithm. (See the analogous exercise for Mergesort.)  Run experiments for W equal to 1, 2, 4, 8.
  7. Implement a digital sort algorithm for linked lists.