This chapter deals with the conversion of strings to numbers and with the basic operations on strings implemented in the string library.
Table of contents:
A decimal character chain is any sequence of decimal digits (i.e., characters 0 to 9) possibly preceded by the character + or − . A decimal string is any ASCII string that represents a decimal character chain. For exemple, the decimal string 45 57 56 55 54 0 represents (see the ASCII table) the character chain −9876 .
How to convert a decimal string into the corresponding number of type int? The function atoi (the name is a shorthand for alphanumeric-to-int) of the stdlib library receives a decimal string and returns the corresponding int. Unfortunately, the function does not check whether the string represents a valid int (in particular, it does not check whether the string is too long) and so creates opportunities for errors. This can be exploited by malicious third parties, especially when used to treat command line arguments. Therefore, it is better to use the more complex strtol function.
The strtol function (the name is a shorthand for string-to-long) of the stdlib library converts a decimal string into the corresponding long int. This function accepts decimal strings as well as strings in other bases: the third argument specifies the base. For example,
strtol ("-9876", NULL, 10)
returns the number −9876 and strtol ("-9876", NULL, 16) returns −39030. We shall not discuss the meaning of the second argument for now.
We say that a byte b is smaller than a byte c if the number represented by b in binary notation is smaller than the number represented by c. For example, 01000001 is smaller than 01000010. In the set of bytes whose first bit is 0, this order relation between bytes is consistent with the usual order ('A' < 'B' < ... < 'Z') of the letters in the ASCII alphabet according to the ASCII table.
Now that we defined the order relation between bytes, we can deal with the order relation between strings. We say that a string s is lexicographically smaller than a string t if the first byte of s that differs from the corresponding byte of t is smaller than that byte of t. More precisely, s is lexicographically smaller than t if
s[k] < t[k]
for the smallest k such that s[k] ≠ t[k]. Moreover, if such k does not exist then s and t are equal or one is a proper prefix of the other; in the second case, the shorter string is considered lexicographically smaller that the longer one.
A sequence of strings is in lexicographic order
if each string in the sequence
is lexicographically smaller than (or equal to)
each of the subsequent strings.
(Some people say alphabetic order
in place de
lexicographic order
,
but this is incorrect.)
In the case of ASCII strings,
the lexicographic order is analogous to the order
in which words appear in a dictionary
(except that it puts all the uppercase letters
before all the lowercase).
For example, the following list of strings is in lexicographic order:
a ab abc abcd abd ac acd ad b bc bcd bd c cd d
(See also the Lexicographic order entry in the appendix.)
char *a, *b; a = "apple"; b = "banana"; if (a < b) printf ("%s is smaller than %s\n", a, b); else printf ("%s is smaller than %s\n", b, a);
char *a, *b; a = "apple"; b = "banana"; if (*a < *b) printf ("%s is smaller than %s\n", a, b); else printf ("%s is greater than %s\n", a, b);
The string library contains various string manipulatation functions. We describe next the most important ones.
String length. The function strlen receives a string and returns its length. This function could be written thus:
unsigned int strlen (char *s) { int k; for (k = 0; s[k] != '\0'; ++k) ; return k; }
String copy. The function strcpy receives two strings and copies the second (including the final null byte) to the space occupied by the first. The original content of the first string is lost. Do not call this function if the length of the first string is smaller than the length of the second. (Buffer overflow is a frequent cause of security bugs!) This function could be written like this:
void strcpy (char *s, char *t) { int i; for (i = 0; t[i] != '\0'; ++i) s[i] = t[i]; s[i] = '\0'; }
(Actually, the function strcpy is not of type void: it returns its first argument. This is useful in some situations, but in general the user is only interested in the side effect of the function and ignores its return value.)
The function strcpy can be used as follows to obtain an effect similar to that of the example in the beginning of the chapter Strings and character chains:
char s[10]; strcpy (s, "ABC");
String compare. The function strcmp compares two strings lexicographically, byte-for-byte. It returns a strictly negative number if the first string is lexicographically smaller than the second, returns 0 if the two strings are equal, and returns a strictly positive number if the first string is lexicographically greater than the second. This function could have been coded like this:
int strcmp (char *s, char *t) { int i; for (i = 0; s[i] == t[i]; ++i) if (s[i] == '\0') return 0; unsigned char si = s[i], ti = t[i]; return si - ti; }
Though the parameters of the function are of type char *, the function behaves as if they were of type unsigned char *. (One must remember that the value of the expression si - ti is computed in int arithmetic rather than modulo 256.)
If s and t are ASCII strings, the lexicographic order in based on the ordering of characters (... 0 .. 9 .. A B .. Z .. a b .. z ...) in the ASCII table. For example, the following sequence of words is in lexicographic order:
Acapulco Seattle acarus amoeba among naive native pinata pitch role rubber seat
String collate. To compare non-ASCII strings you should use the function strcoll from the string library. The behavior of this function depends on the value of environment variable LC_COLLATE on your system. We shall assume that our system uses UTF-8 encoding and that the value of LC_COLLATE is en_US.UTF-8. (Type locale on the terminal to learn the value of this variable.)
Function strcoll receives two strings that represent
character chains
and decides the relative positions of these chains
in a dictionary order
that respects the norms of the local language,
as determined by LC_COLLATE.
The function returns a strictly negative number
if the first chain comes before the second,
returns zero if the two chains are equal,
and returns a strictly positive number
if the first chain comes after the second.
The following list, for example, is in dictionary order
:
Acapulco acarus amoeba amœba among naive naïve native pinata piñata pitch role rôle rubber seat Seattle
strcpy (s, t)and
s = t?
if (strcmp (s, t) < 0)and
if (s < t)?
char a[7], b[7]; strcpy (a, "orange"); strcpy (b, "apple");
char *a = malloc (7), *b = malloc (7); strcpy (a, "orange"); strcpy (b, "apple");
char *a, *b; a = "orange"; b = "apple";