Hashing is used extensively in applications
and deserves recognition
as one of the cleverer inventions
of computer science.
— E.S. Roberts, Programming Abstractions in C
This chapter uses a small counting problem to introduce a data structure known as a hash table. This structure is responsible for accelerating many algorithms that depend on searching, inserting, and deleting items in a table.
Suppose we are given a sequence of positive integers on the standard input stdin. These integers will be called keys. The keys are in arbitrary order and there may be many repeated keys. Consider now the following counting problem:
find the number of occurrences of each key.
For example, the number of occurrences (also known as the frequency) of each key in the sequence 4998 9886 1933 1435 9886 1435 9886 7233 4998 7233 1435 1435 1004 is given by the following table:
key   | 1004 | 1435 | 1933 | 4998 | 7233 | 9886
occur |    1 |    4 |    1 |    2 |    2 |    3
Our counting problem has an important requirement: the counting must be done online, that is, each key must be processed as soon as it is received from the input stream, so that, at each step, we have the counting result relative to the part of the sequence seen so far.
The performance of any algorithm for this problem will be measured by the time taken to count one key, i.e., to process the current element of the sequence. Ideally, we would like this time to be constant, that is, independent of the number of keys already seen (as well as of the number of distinct keys already seen). But we shall be forced to be content with something less than this ideal.
Four solutions. We shall discuss four different algorithms for the problem. The first two are very simple but slow. The next two are much faster because they use the hashing technique. Though simple, the first two algorithms provide an important introduction to the next two.
The algorithms are discussed in terms of two parameters: N and R . Parameter N is the length of the input stream, that is, the number of keys in the input sequence. The value of N can be very large (millions or even billions), but the number of distinct keys is typically much smaller.
Parameter R is a number greater than any key. In the example above, we can take R to be 10000. The set 0..R-1 is the universe of keys. Typically, most keys in the universe do not occur in the input stream.
We begin with an algorithm known as direct addressing. Though very simple, this algorithm contains the seed of the hashing technique.
Let's assume that the keys are of type int and that R keys fit comfortably in the (random access) memory of our machine. We can then use a table tb[0..R-1] to record the number of occurrences of each key.
int *tb; tb = malloc (R * sizeof (int));
The direct addressing algorithm initializes the array tb with zeros and iterates the following routine: read a key k from the input stream and execute the following function:
void count (int k) { tb[k]++; }
After each execution of this function, for each k in 0..R-1, the value of tb[k] will be the number of occurrences of k in the part of the sequence seen up to this moment.
Performance. The amount of time consumed by each execution of count is constant, that is, it does not depend on R and does not depend on the number of keys already processed. (In particular, it does not depend on N.) Hence, the direct addressing algorithm is very fast.
But the algorithm is only practical if R is small and known explicitly in advance. Moreover, the algorithm can waste a lot of space in memory. For example, if R is 10000 and the sequence contains only 1000 distinct keys, then 90% of the table tb will be idle.
Our second algorithm stores the frequencies of keys in a linked list. The cells of the list have the following structure:
typedef struct record cell; struct record { int key, occur; cell *next; };
If p is the address of a cell then p->occur will be the number of occurrences of the key p->key. If p and q are addresses of two different cells then p->key will be different from q->key. The linked list will be pointed to by the global variable tb:
cell *tb;
The counting algorithm initializes tb with NULL and iterates the following routine: read a key k from the input stream and call the following function to count the key:
void count (int k) {
    cell *p;
    p = tb;
    while (p != NULL && p->key != k)
        p = p->next;
    if (p != NULL)
        p->occur += 1;
    else {
        p = malloc (sizeof (cell));
        p->key = k;
        p->occur = 1;
        p->next = tb;
        tb = p;
    }
}
Performance. In the worst case, each execution of count consumes time proportional to the number of distinct keys already read. Therefore, the execution of count may become slower and slower as the input stream is read. If all the keys are distinct, the last executions of count may consume time proportional to N.
Even in the average case, typical of practical applications, the count function may consume time proportional to one half of the number of distinct keys already read, which is not very good.
On the happy side, the linked list algorithm does not waste space, since the number of cells in the list is no greater than the number of distinct keys.
The next algorithms for the counting problem will combine the good characteristics of the two previous algorithms. But before going into this, we must introduce the concept of hashing. We begin by establishing terminology and describing the ideas in an abstract manner; concrete implementations will be given in subsequent sections.
We assume that the frequencies are stored in a table tb[0 .. M-1]. The parameter M is known as the modulus. The value of M and the exact nature of the elements of tb will remain undefined for now. But you should imagine that, somehow, each element of tb records the number of occurrences of some key. We say that tb is a hash table (or dispersion table). The size M of the table is usually much smaller than the size R of the universe of keys. Hence, a typical element of tb will have to take care of two or more keys.
To find the position in the table corresponding to a key k, we need to convert k into an index between 0 and M-1. Any function that does the conversion, taking the universe 0..R-1 of the keys into the set 0..M-1 of indices, is a hash function. In this chapter, we shall indicate by
hash (k, M)
a call to a hash function for a key k. The number hash (k, M) is the hash code of key k. A very popular hash function takes each key k to k % M , that is, to the remainder of the division of k by M. If M is 100, for example, then k % M consists of the two last decimal digits of k.
If a hash function takes two keys to the same index, we have a collision. If M is smaller than R (and, even more so, if M is smaller than the number of distinct keys), then collisions are inevitable. If M is 100, for example, the function remainder-on-division-by-M will make all the keys that have the same last two decimal digits collide.
A good hash function must spread, i.e., disperse, the keys very well over the set 0..M-1 of indices. A function that takes every key to an even index, for example, is not good. A function that depends on only a few digits of the key is also not good. Unfortunately, no hash function is good for all the sets of keys extracted from the universe 0..R-1. To begin dealing with this difficulty, it is usual to choose a prime number for M, since this tends to reduce the number of collisions. See, for example, the set of keys in the first column of the following table, and consider the division of each key first by the nonprime 100 and then by the prime 97. (This example was copied from the book by Sedgewick and Wayne.) Observe that there are more collisions in the first case:
k   | k%100 | k%97
212 | 12    | 18
618 | 18    | 36
302 | 2     | 11
940 | 40    | 67
702 | 2     | 23
704 | 4     | 25
612 | 12    | 30
606 | 6     | 24
772 | 72    | 93
510 | 10    | 25
423 | 23    | 35
650 | 50    | 68
317 | 17    | 26
907 | 7     | 34
507 | 7     | 22
304 | 4     | 13
714 | 14    | 35
857 | 57    | 81
801 | 1     | 25
900 | 0     | 27
413 | 13    | 25
701 | 1     | 22
418 | 18    | 30
601 | 1     | 19
In general, finding a good hash function is more of an art than a science.
Now that we took care of spreading the keys over the interval 0..M-1, we must invent a way to resolve the collisions, that is, a way to store all the keys that a hash function sends to the same position of the hash table. The following sections describe two approaches to this question.
The hashing technique has two ingredients: a hash function and a mechanism for resolving collisions. The previous section discussed the first ingredient; this section takes care of the second.
The collisions in a hash table can be resolved by means of linked lists: all the keys that have a given hash code are stored in a linked list (in arbitrary order). The cells of the lists are equal to those of algorithm 2:
typedef struct record cell; struct record { int key, occur; cell *next; };
The hash table tb[0..M-1] is implemented as an array and the elements of the array are pointers to the linked lists:
cell **tb; tb = malloc (M * sizeof (cell *));
For each index h, the linked list tb[h] will contain all the keys that have hash code h.
The resulting counting algorithm is known as hashing with chaining and can be seen as a combination of the algorithms 1 and 2 discussed above. The algorithm initializes all the elements of the array tb with NULL and iterates the following routine: read a key k from the input stream and then run the following function:
void count (int k) {
    int h = hash (k, M);
    cell *p = tb[h];
    while (p != NULL && p->key != k)
        p = p->next;
    if (p != NULL)
        p->occur += 1;
    else {
        p = malloc (sizeof (cell));
        p->key = k;
        p->occur = 1;
        p->next = tb[h];
        tb[h] = p;
    }
}
Performance. In the worst case, the hash function takes all the keys to the same position in the hash table and therefore all the keys end up in the same linked list. In this case, the performance is no better than that of algorithm 2: each execution of count consumes time proportional to the number of keys already read from the input stream.
In the average case, typical of practical applications, the performance of count is much better. If the function hash spreads the keys well over the set 0..M-1, all the linked lists will have nearly the same length, and we can expect the time consumption of count to be bounded by something proportional to
n / M ,
where n is the number of keys read so far. If M is 997, for example, we can expect the function to be nearly 1000 times faster than that of algorithm 2. We should, of course, choose M to be large enough for the M lists to be short (say a few dozen elements) but not so large that many of the lists will be empty.
This section describes a second way of resolving collisions in a hash table: all the keys that collide are stored in the table itself.
The elements of the hash table tb[0..M-1] are cells that have only two fields:
typedef struct record cell; struct record { int key, occur; };
cell *tb; tb = malloc (M * sizeof (cell));
As the counting proceeds, some of the cells of the table tb will be vacant while others will be occupied. In the vacant cells, the key field will have value -1. In the occupied cells, the key will belong to 0..R-1 and occur will be the corresponding number of occurrences. If a cell tb[h] is vacant, we can guarantee that no key in the part of the input stream already seen has hash code equal to h. But if tb[h] is occupied, we cannot conclude that the hash code of tb[h].key is equal to h.
Each key k of the input stream will be counted in the following manner. Let h be the hash code of k. If the cell tb[h] is vacant or has key equal to k, the contents of the cell will be updated. Else (i.e., if tb[h] is occupied and tb[h].key ≠ k), the algorithm looks for the next cell in tb that is vacant or has key equal to k.
The implementation of these ideas leads to the hashing with linear probing algorithm. The algorithm begins by setting key = -1 and occur = 0 in every cell. Then, it iterates the following routine: read a key k from the input stream and call the following function:
void count (int k) {
int c, probe, h;
h = hash (k, M);
for (probe = 0; probe < M; probe++) {
c = tb[h].key;
if (c == -1 || c == k) break; // *
h = (h + 1) % M;
}
if (probe >= M)
exit (EXIT_FAILURE);
if (c == -1)
tb[h].key = k;
tb[h].occur++;
}
The function does at most M probings, i.e., attempts to find a good cell (see the line marked * in the code). The search fails only if the table tb is full. In that case, the execution of the function is aborted. (It would have been better to resize the table tb and continue working.)
Suppose, for example, that the size M of the hash table is 10 (we are using a nonprime number to make the example simpler). Suppose also that hash (k, M) is defined as k % M. If the input stream is
333 336 1333 333 7777 446 556 999
then the final state of the hash table tb[0..9] will be as follows:
h | key  | occur
0 |  999 |   1
1 |   -1 |   0
2 |   -1 |   0
3 |  333 |   2
4 | 1333 |   1
5 |   -1 |   0
6 |  336 |   1
7 | 7777 |   1
8 |  446 |   1
9 |  556 |   1
Performance. In the worst case, the hash function takes all the keys to the same position of the hash table and therefore the keys occupy consecutive cells of the table. In this case, the performance is no better than that of algorithm 2: each execution of count consumes time proportional to the number of keys already read from the input stream.
In the average case, typical of practical applications, the performance of count is much better. If more than half of the cells of the hash table are vacant (as would happen if we used resizing) and the hash function spreads the keys well over the set 0..M-1, an execution of the count function will not need to do more than a few probings to find a good cell. So, the time consumption of one execution of count will be practically independent of the number of keys already read.
In many applications, the keys are strings (especially ASCII strings) rather than numbers. How can we build a hash table in this case? We could, for example, use a hash table of size 256 and a hash function that takes each string to the numerical value of its first byte. (All the strings that begin with 'a', for example, would be taken to position 97 of the table.) But this would spread the set of keys poorly over the interval 0..255.
A general way of dealing with string keys involves two steps: first, convert the string into an integer number; second, submit this number to a hash function. For an obvious conversion, treat each string as an integer in base 256. The string "abcd", for example, is converted into the number 97×256³ + 98×256² + 99×256¹ + 100×256⁰, which is equal to 1633837924. This kind of conversion can easily produce numbers greater than INT_MAX, and therefore lead to an arithmetic overflow. If the calculations are done with int variables, the result can be strictly negative, and that would be disastrous. To mitigate this, we can use unsigned variables, thereby making sure that the result is between 0 and UINT_MAX even if an overflow occurs:
typedef char *string;

unsigned convert (string s) {
    unsigned h = 0;
    for (int i = 0; s[i] != '\0'; i++)
        h = h * 256 + s[i];
    return h;
}
To submit the result of the conversion to the hash function, just do hash (convert (s), M). If hash is the remainder on division by M, just do convert(s) % M.
A good alternative is to do the conversion interleaved with the calculation of the hash function. The result may not be identical to convert(s) % M, but will do the job of spreading the keys over the set 0..M-1:
int string_hash (string s, int M) {
    unsigned h = 0;
    for (int i = 0; s[i] != '\0'; i++)
        h = (h * 256 + s[i]) % M;
    return h;
}
If the string is "abcd" and M is 101, for example, the index computed by string_hash is 11:
(  0 * 256 +  97) % 101 = 97
( 97 * 256 +  98) % 101 = 84
( 84 * 256 +  99) % 101 = 90
( 90 * 256 + 100) % 101 = 11
The use of base 256 is not mandatory; we can use any other base. For reasons that are not too obvious, it is better to use a prime number, say 31, as the base.
Exercise. Replace 256 by 1 in the function convert. Show that all the permutations of the string s will collide.