What is Unicode? What is UTF-8 encoding of characters?

This chapter gives a quick introduction to Unicode, encoding schemes, and UTF-8. For more on the subject, see the references at the end of this page.

Characters

A character is a typographic symbol used to write text in some language. (This definition is not perfect, but it will suffice.) Here are some examples of characters:

The number of characters used by the different languages in the world is huge. Ordinary English uses just 94 characters, but we are exposed to many other languages, sometimes several languages in the same sentence. To this we must add the special characters used by different areas of science.

To begin organizing this Tower of Babel, we must give names to all the characters. The Unicode Consortium, a group of IT companies, has reserved room for more than 1 million numerical names (known as code points) and has so far assigned them to well over 100 thousand characters. Here is a tiny sample of the list of characters and their numerical names:

In this sample, the numerical names of the characters are written in decimal notation. In general, however, these names are written in hexadecimal notation and preceded by U+:

The set of all the characters on the Unicode list could be called Unicode alphabet and we could say that each character of this alphabet is a Unicode character. (If the aspirations of the Unicode project are justified, then all the characters in the world are Unicode characters.)
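The conversion from a decimal Unicode number to the U+ notation is just a change of base. Here is a minimal sketch in C; the function name codepoint_name is a hypothetical choice made for this illustration, not part of any library.

```c
#include <stdio.h>

/* Writes the conventional "U+XXXX" name of the code point cp into buf,
   using at most size bytes (including the terminating '\0').
   Returns the number of characters the full name needs, as snprintf does.
   (codepoint_name is a name invented for this sketch.) */
int codepoint_name(unsigned long cp, char *buf, size_t size)
{
    /* At least 4 hexadecimal digits, padded with zeros on the left. */
    return snprintf(buf, size, "U+%04lX", cp);
}
```

For example, the Unicode number 8734 yields the name U+221E and the number 65 yields U+0041.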

ASCII characters

The first 128 characters of the Unicode alphabet are the most important. This set of characters goes from U+0000 to U+007F and is known as the ASCII alphabet. The elements of this alphabet will be called ASCII characters. The ASCII alphabet contains letters, decimal digits, punctuation, and some special characters. The list of the 128 ASCII characters and their Unicode numbers is recorded in the ASCII table.

Unfortunately, the ASCII alphabet is not sufficient to write text in languages like Spanish or French, since it lacks letters with diacritics.

Encoding schemes

How can we store Unicode characters in digital files and in memory? We could represent each character by its Unicode number written in binary notation. But this would require 3 bytes per character, which is very inefficient given that 1 byte is enough for the most common characters. We must, therefore, resort to more complex representations.

An encoding scheme (or character encoding) is a table that associates a sequence of bytes with each Unicode number, and therefore with each Unicode character. The sequence of bytes associated with a character is the code of the character. The next sections examine two encodings: ASCII and UTF-8.

ASCII encoding

The ASCII code is very simple: the Unicode number of each character is written in binary notation. This code is used only for the ASCII alphabet. Since this alphabet has only 128 characters, the ASCII code uses only 1 byte per character and the first bit of this byte is 0. Here is a sample of the code table:

Unicode		ASCII	hexa

U+0021	!	`00100001`	`x21`
U+0022	"	`00100010`	`x22`
U+002D	-	`00101101`	`x2D`
U+0039	9	`00111001`	`x39`
U+0041	A	`01000001`	`x41`
U+0042	B	`01000010`	`x42`
U+0061	a	`01100001`	`x61`
U+0062	b	`01100010`	`x62`
U+007E	~	`01111110`	`x7E`
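The binary column of such a table can be produced mechanically from the value of each byte. The following is a minimal sketch; the names byte_to_binary and is_ascii_byte are hypothetical, chosen only for this illustration.

```c
/* Writes the 8 bits of byte b, most significant bit first, as a string
   of '0' and '1' characters. The array out must have room for 9 chars.
   (byte_to_binary is a name invented for this sketch.) */
void byte_to_binary(unsigned char b, char out[9])
{
    for (int i = 0; i < 8; i++)
        out[i] = (b & (0x80 >> i)) ? '1' : '0';
    out[8] = '\0';
}

/* Returns 1 if b can be the ASCII code of a character, that is,
   if the first bit of b is 0. Returns 0 otherwise. */
int is_ascii_byte(unsigned char b)
{
    return (b & 0x80) == 0;
}
```

Applied to the byte x41 (the code of A), byte_to_binary produces the string 01000001.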

(Why not use all the 8 bits of a byte? We could then encode an additional 128 characters. The ISO-LATIN-1 code does exactly this, but the table is rarely used nowadays. The ISO-LATIN-1 set includes the characters ª ± º ¼ ½ ¾ À Á Â Ã Ç È É Ê Ì Í Î Ò Ó Ô × Ù Ú Û à á â ã ç è é ê ì í î ò ó ô õ ÷ ù ú û among others. The numerical names of these characters are the same in the ISO-LATIN-1 table and the Unicode table.)

UTF-8 encoding

If we were to use a fixed number of bytes per character, we would need 3 bytes to represent each character of the Unicode alphabet. This would be rather wasteful, since 1 byte is sufficient for the most common characters. The solution is to resort to a multibyte code that employs a variable number of bytes per character: some characters use 1 byte, others use 2 bytes, and so on.

The most widely used multibyte code is known as UTF-8. It associates a sequence of 1 to 4 bytes (8 to 32 bits) with each Unicode character. The first 128 characters use the good old ASCII code of 1 byte per character. The remaining characters have a longer code. Here is a tiny sample:

Unicode		UTF-8 code	hexa

U+0021	!	`00100001`	`x21`
U+0022	"	`00100010`	`x22`
U+002D	-	`00101101`	`x2D`
U+0039	9	`00111001`	`x39`
U+0041	A	`01000001`	`x41`
U+0042	B	`01000010`	`x42`
U+0061	a	`01100001`	`x61`
U+0062	b	`01100010`	`x62`
U+007E	~	`01111110`	`x7E`
U+00A3	£	`11000010 10100011`	`xC2A3`
U+00C1	Á	`11000011 10000001`	`xC381`
U+00F1	ñ	`11000011 10110001`	`xC3B1`
U+00F3	ó	`11000011 10110011`	`xC3B3`
U+03A3	Σ	`11001110 10100011`	`xCEA3`
U+2212	−	`11100010 10001000 10010010`	`xE28892`
U+221E	∞	`11100010 10001000 10011110`	`xE2889E`
U+2264	≤	`11100010 10001001 10100100`	`xE289A4`
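The rows of this table follow a fixed bit pattern: a 1-byte code has the form 0xxxxxxx, a 2-byte code has the form 110xxxxx 10xxxxxx, a 3-byte code has the form 1110xxxx 10xxxxxx 10xxxxxx, and a 4-byte code has the form 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx, where the x positions hold the bits of the Unicode number. The following sketch computes the UTF-8 code of a Unicode number under that pattern. The function name utf8_encode is a hypothetical choice for this illustration.

```c
/* Stores the UTF-8 code of the Unicode number cp in out (which must have
   room for 4 bytes) and returns the length of the code in bytes.
   Returns 0 if cp is not a valid Unicode number for a character.
   (utf8_encode is a name invented for this sketch.) */
int utf8_encode(unsigned long cp, unsigned char out[4])
{
    if (cp < 0x80) {                      /* 0xxxxxxx */
        out[0] = (unsigned char) cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char) (0xC0 | (cp >> 6));
        out[1] = (unsigned char) (0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                   /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                     /* reserved range, not characters */
        out[0] = (unsigned char) (0xE0 | (cp >> 12));
        out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char) (0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* 11110xxx and three 10xxxxxx */
        out[0] = (unsigned char) (0xF0 | (cp >> 18));
        out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char) (0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                             /* beyond the Unicode range */
}
```

For example, for U+00F1 (ñ) the function stores the bytes xC3 xB1, and for U+221E (∞) it stores xE2 x88 x9E, in agreement with the table.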

Decoding. Since the number of bytes per character is not fixed, the decoding of a sequence of bytes is not easy. How do we know where the code of one character ends and the code of the next character begins? The UTF-8 encoding scheme was designed so that the first bits of the code of a character indicate how many bytes the code occupies. If the first bit is 0, and therefore the value of the first byte is smaller than 128, then this is the only byte of the character. If the value of the first byte belongs to the interval 192 .. 223 then the code of the character has two bytes. If it belongs to the interval 224 .. 239 then the code has three bytes, and if it belongs to the interval 240 .. 247 then the code has four bytes. Every byte after the first one has a value in the interval 128 .. 191.
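The decoding rule can be sketched in C as follows. The names utf8_length and utf8_decode are hypothetical, chosen for this illustration. The sketch checks only the structure of the code (a first byte followed by the right number of continuation bytes) and does not reject every malformed sequence that a strict decoder would.

```c
#include <stddef.h>

/* Returns the number of bytes (1 to 4) of the UTF-8 code whose first
   byte is first, or 0 if first cannot be the first byte of a code.
   (utf8_length and utf8_decode are names invented for this sketch.) */
int utf8_length(unsigned char first)
{
    if (first < 128) return 1;                   /* 0xxxxxxx */
    if (first >= 192 && first <= 223) return 2;  /* 110xxxxx */
    if (first >= 224 && first <= 239) return 3;  /* 1110xxxx */
    if (first >= 240 && first <= 247) return 4;  /* 11110xxx */
    return 0;  /* 128..191 are continuation bytes; 248..255 are unused */
}

/* Decodes the character whose code begins at s[0], where s has n bytes.
   Stores the Unicode number in *cp and returns the number of bytes
   consumed, or returns 0 if the code is truncated or malformed. */
int utf8_decode(const unsigned char *s, size_t n, unsigned long *cp)
{
    if (n == 0) return 0;
    int len = utf8_length(s[0]);
    if (len == 0 || (size_t) len > n) return 0;
    /* The first byte carries 7, 5, 4 or 3 bits of the Unicode number. */
    unsigned long value = (len == 1) ? s[0] : (s[0] & (0xFFu >> (len + 1)));
    for (int i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;     /* must be 10xxxxxx */
        value = (value << 6) | (s[i] & 0x3F);    /* append 6 more bits */
    }
    *cp = value;
    return len;
}
```

For example, when applied to the bytes xE2 x89 xA4, utf8_decode consumes 3 bytes and produces the Unicode number of U+2264 (≤).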

Assume UTF-8. The C programming language does not prescribe any specific encoding scheme. But the most used encoding is UTF-8. The present site assumes that all the text files, be they programs or data, use UTF-8 code. (But in many examples, only the ASCII subset of UTF-8 is used.)

Exercises 1

Consider the following sequence of bytes, written in decimal notation. What character chain does the sequence of bytes represent in UTF-8 encoding?
```
118 91 110 93 32 61 32 226 136 158
```
The following sequence of bytes is written in hexadecimal notation. What character chain do these bytes represent in UTF-8 encoding?
```
x41 x74 x65 x6E x63 x69 xC3 xB3 x6E x21
```
Write a function to receive a file containing text in UTF-8 encoding and decide whether each byte of the file represents one character (that is, whether the alphabet of the file is ASCII).
Write the sequences of bytes that represent each of the following character chains in UTF-8 encoding:
- ASCII string
- Atención!
- piña colada
- $50 ≈ £42
- π = 3.14±0.01
- ⌊9.9⌋ = 9
- v[n] = ∞
(Consult the page UTF-8 encoding table and Unicode characters. You may have to use the go to other block button.)

How is my file encoded?

There is no way of knowing, with certainty, which encoding a given text file uses. The author of the file must announce, outside the file, the encoding scheme he/she used.

There are utilities (such as file, for example) that scan a file and try to guess, with some degree of confidence, its encoding scheme.
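One simple heuristic that such a guess can rely on is the rigid structure of UTF-8: a sequence of bytes that violates this structure is certainly not UTF-8, while a long sequence that respects it probably is. The following sketch illustrates the idea. It is not a description of how the file utility actually works, and the function name looks_like_utf8 is a hypothetical choice.

```c
#include <stddef.h>

/* Returns 1 if the n bytes of s have the structure of UTF-8 text
   (each first byte is followed by the right number of continuation
   bytes) and 0 otherwise. This is only a plausibility test: it does
   not check everything that a strict UTF-8 validator would check.
   (looks_like_utf8 is a name invented for this sketch.) */
int looks_like_utf8(const unsigned char *s, size_t n)
{
    int pending = 0;  /* continuation bytes still expected */
    for (size_t i = 0; i < n; i++) {
        unsigned char b = s[i];
        if (pending > 0) {
            if ((b & 0xC0) != 0x80) return 0;  /* expected 10xxxxxx */
            pending--;
        }
        else if (b < 0x80) pending = 0;            /* 0xxxxxxx */
        else if ((b & 0xE0) == 0xC0) pending = 1;  /* 110xxxxx */
        else if ((b & 0xF0) == 0xE0) pending = 2;  /* 1110xxxx */
        else if ((b & 0xF8) == 0xF0) pending = 3;  /* 11110xxx */
        else return 0;  /* a continuation byte out of place, or unused */
    }
    return pending == 0;  /* the last code must be complete */
}
```

A file that contains only ASCII characters passes this test trivially, so a positive answer is informative only when the file has bytes with value 128 or more.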

If you know the encoding scheme used by your file, you can use the iconv filter to change the encoding. You can, for example, convert an ISO-LATIN-1 file into an equivalent UTF-8 file.
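The conversion from ISO-LATIN-1 to UTF-8 is simple enough to sketch by hand, because the value of each ISO-LATIN-1 byte is the Unicode number of the character: bytes below 128 are copied unchanged and every other byte becomes a 2-byte UTF-8 code. The sketch below only illustrates this particular conversion and is not a substitute for iconv. The function name latin1_to_utf8 is a hypothetical choice.

```c
#include <stddef.h>

/* Converts the n bytes of ISO-LATIN-1 text in src to UTF-8.
   The result is stored in dst, which must have room for 2*n bytes.
   Returns the number of bytes stored in dst.
   (latin1_to_utf8 is a name invented for this sketch.) */
size_t latin1_to_utf8(const unsigned char *src, size_t n, unsigned char *dst)
{
    size_t k = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned char c = src[i];
        if (c < 0x80) {
            dst[k++] = c;  /* ASCII character: same code in UTF-8 */
        } else {
            /* Unicode numbers 128..255 have a code of the form
               110xxxxx 10xxxxxx. */
            dst[k++] = (unsigned char) (0xC0 | (c >> 6));
            dst[k++] = (unsigned char) (0x80 | (c & 0x3F));
        }
    }
    return k;
}
```

For example, the ISO-LATIN-1 byte xF1 (ñ) is converted into the pair of bytes xC3 xB1.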

Exercises 2

The utilities od and hexdump print the sequence of (numerical values of) the bytes of a file. Use one of these utilities to study the contents of a file and guess the encoding it uses.
The function isalpha in the ctype library decides whether a given ASCII character is a letter. Write an extension of isalpha that will recognize letters with diacritics. Your function must receive a string that contains the UTF-8 code of a character and decide whether the code represents a valid letter (with or without a diacritic mark on it).

Unicode number	character

33	!
34	"
45	-
57	9
65	A
66	B
97	a
98	b
126	~
163	£
193	Á
241	ñ
243	ó
931	Σ
8722	−
8734	∞
8804	≤

`x69`	`x20`	`xE2`	`x89`	`xA4`	`x20`	`x39`	`x39`
i	␣		≤		␣	9	9
