What are ASCII and UTF-8 encodings of character chains?

A large part of all the data in the world is in text form. In other words, a large part of all the data has the form of chains of characters. Many of these chains are represented by strings in digital files. This chapter discusses the relation between character chains and strings.

Strings

In C language, a string is an array of bytes in which the null byte 00000000 is interpreted as a sentinel that signals the end of the relevant part of the array. Here is a small example of a string:

(the symbols 0 and 1 represent bits). In this example, only the first 4 bytes constitute the string.

The length of a string is its number of bytes, not counting the final null byte. (Hence, the string in the example above has length 3.) A string is empty if its length is zero.

Each byte of a string is treated as a char, and therefore a string is an array of chars. (We could have said unsigned char; this would not make any difference in practice.) To break the association between char and character, we may use byte as a synonym for char.

The address of a string is the address of its first byte. In informal discussions, we blurr the distinction between a string and its address. So, the expression consider a string str must be understood as consider a string whose address is str.

The following code fragment builds a string str one byte at a time. After the code fragment is executed, the array str[0..3] contains the string 01000001 01000010 01000011 00000000. The part str[4..9] of the array is ignored.

We can treat strings as a new data type. To do this, use the typedef below and start writing string instead of char *.

Exercises 1

Write a boolean function to decide whether two strings of same length differ in exactly one byte.
Write a function that will receive a byte b and return a string whose only byte is b.
Empty vs null. Suppose that str is a string. What is the difference between the statements str is empty and str is NULL?
Piece of string. Write a function that will receive a string str and positive integers i and j and return a string with the same contents as the segment str[i..j]. Write two versions: in the first, your function must not allocate new space and can make changes to str; in the second, your function must return a copy of the segment str[i..j] and cannot make changes to str.
Write a boolean function that receives strings s and t and decides whether s is a segment of t. Write a program that uses your function to count the number of occurrences of a string s in a string t.

Character chains

A character chain is a sequence of characters. This page, for example, is a long character chain. The length of a character chain is the number of characters in the chain. For example, the character chain

In digital files, and in computer memory, character chains are represented by strings. For chains of ASCII characters, each character is represented by a single byte, as defined by the ASCII table. Strings of this kind will be called ASCII strings. The value of each byte of such a string belongs to the interval 0..127. For example, the character chain i <= 99 is represented by the string 105 32 60 61 32 57 57 0.

Now consider character chains in which not every character belongs to the ASCII alphabet. Since we are assuming that out programming environment uses UTF-8 encoding, each non-ASCII character (such as £, À, ñ, ó, ≤, etc.) is represented by two, three, or even four bytes and these bytes do not belong to the interval 0..127. For example, the character chain i ≤ 99 is represented by the string 105 32 226 137 164 32 57 57 0 . The character ≤ takes up 3 bytes; the other five characters take up 1 byte each since they belong to the ASCII alphabet. The details of this representation are discussed in chapter Unicode and UTF-8.

The UTF-8 code was defined in such a way that the character \0 is the only one whose code contains a null byte. Hence, the first null byte of the string indicates the end of the string as well as the end of the character chain that the string represents.

Exercises 2

Suppose that a string str is a UTF-8 representation of a character chain s. Give an upper bound and a lower bound on the length of str as a function of the length of s. Repeat the exercise assuming that str is an ASCII string.
★ Write a boolean function to decide whether a given string is ASCII or not.
Palindromes. Write a boolean function to decide whether a given character chain is a palindrome (i.e., can be read from left to right or from right to left indifferently). The chain is represented by a string. Write a program to test your function.
Write a function that will receive an ASCII string and display a table with the number of occurrences of each character in the string. Write a program to test your function.
★ Write a function that will receive an ASCII string and replace each segment of two or more spaces by a single space. Redo the exercise without the restriction to ASCII strings.
★ Number of characters. Write a function that will receive a string and calcule the length of the character chain that the string represents in UTF-8 encoding. (Hint: Study the structure of the UTF-8 code.) [Solution: ./solutions/string4.html]

Reading and writing character chains

To obtain the character chain that a given string represents, we must decode the string. In the case of ASCII strings, the decoding is simple: just look up the ASCII table. In general, however, the decoding is more complex (partly because not every string is a valid UTF-8 representation of a character chain). Fortunately, the function printf with format specification %s takes care of the decoding. For example, the code fragment

The format specification %s can also be used with the scanf function. For example, the program fragment below reads a character chain (without white-spaces) from the keyboard, adds a null byte to its end, and stores the resulting string in the array str:

Constant character chains

To specify a constant character chain, just wrap it in double quotes (characters 34). Constant character chains (also known as literals) figure frequently in C programs: the first argument of the functions printf and scanf are typically such constants.

When a constant character chain is assigned to a string, it is converted into a sequence of bytes and a null byte is added at the end. For example, after the code fragment

the string str contains 65 66 67 0. The characters of a constant character chain may contain non-ASCII characters; we can, for example, say

Exercises 3

What is the effect of the following two code fragments? What is wrong with the second fragment?
```
byte *s;
s = "ABC";
printf ("%s\n", s);
```
```
byte s[20];
s = "ABC";
printf ("%s\n", s);
```
What is the difference between "A" and 'A'?
What is the difference between "mno" and "m\no"? What is the difference between "MNOP" and "MN0P"? What is the difference between "MN\0P" and "MN0P"?
Rewrite the function countvowels without the restriction to ASCII strings. Suppose that the string represents a text in Spanish and count the letters á, ó, É, etc. as vowels.
Delete diacritics. Write a function to delete all the diacritical marks from a given character chain, replacing À by A, ñ by n, ó by o, and so on. The input character chain is represented by a string in UTF-8 encoding. Assume that the input character chain contains only characters used in Spanish, and therefore each character occupies at most 2 bytes. The output of the function must be an ASCII string.