This chapter gives a quick introduction to Unicode, encoding schemes, and UTF-8. For more on the subject, see the references at the end of this page.
A character is a typographic symbol used to write text in some language. (This definition is not perfect, but it will suffice.) Here are some examples of characters:
! " - 9 A B a b ~ £ Á ñ ó Σ − ∞ ≤
The number of characters used by the different languages in the world is huge. Ordinary English uses just 94 characters, but we are exposed to many other languages, sometimes several languages in the same sentence. To this we must add the special characters used by different areas of science.
To begin organizing this Tower of Babel, we must give names to all the characters. The Unicode Consortium, a group of IT companies, has assigned numerical names (known as code points) to well over 100,000 characters, out of a space of 1,114,112 possible code points. Here is a tiny sample of the list of characters and their numerical names:
Unicode number | character |
33 | ! |
34 | " |
45 | - |
57 | 9 |
65 | A |
66 | B |
97 | a |
98 | b |
126 | ~ |
163 | £ |
193 | Á |
241 | ñ |
243 | ó |
931 | Σ |
8722 | − |
8734 | ∞ |
8804 | ≤ |
In this sample, the numerical names of the characters are written in decimal notation. In general, however, these names are written in hexadecimal notation and preceded by U+:
Unicode | character |
U+0021 | ! |
U+0022 | " |
U+002D | - |
U+0039 | 9 |
U+0041 | A |
U+0042 | B |
U+0061 | a |
U+0062 | b |
U+007E | ~ |
U+00A3 | £ |
U+00C1 | Á |
U+00F1 | ñ |
U+00F3 | ó |
U+03A3 | Σ |
U+2212 | − |
U+221E | ∞ |
U+2264 | ≤ |
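The conversion from the decimal form to the U+ form is just a change of base. Here is a minimal sketch in C; the function name unicode_name is an assumption made for this example, not part of any library.

```c
#include <stdio.h>

// Writes the U+ name of the code point cp into buf and returns buf.
// The name has at least 4 uppercase hexadecimal digits.
// (unicode_name is a hypothetical name chosen for this sketch.)
// The buffer must have room for 9 bytes: "U+10FFFF" plus the null byte.
char *unicode_name(unsigned long cp, char buf[9]) {
    snprintf(buf, 9, "U+%04lX", cp);
    return buf;
}
```

For example, unicode_name(8804, buf) stores the string U+2264 in buf, which agrees with the row of ≤ in the two tables.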
The complete list of characters and their Unicode numbers can be seen on the Wikipedia page List of Unicode characters or the Wikibooks page Unicode / Character reference.
The set of all the characters on the Unicode list could be called the Unicode alphabet, and we could say that each character of this alphabet is a Unicode character. (If the aspirations of the Unicode project are justified, then all the characters in the world are Unicode characters.)
The first 128 characters of the Unicode alphabet are the most important. This set of characters goes from U+0000 to U+007F and is known as the ASCII alphabet. The elements of this alphabet will be called ASCII characters. The ASCII alphabet contains letters, decimal digits, punctuation, and some special characters. The list of the 128 ASCII characters and their Unicode numbers is recorded in the ASCII table.
Unfortunately, the ASCII alphabet is not sufficient to write text in languages like Spanish and French, since it lacks letters with diacritics.
How can we store Unicode characters in digital files and in memory? We could represent each character by its Unicode number written in binary notation. But this would require 3 bytes per character, which is very inefficient given that 1 byte is enough for the most common characters. We must, therefore, resort to more complex representations.
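The figure of 3 bytes can be checked with a few lines of C. The sketch below assumes that the largest Unicode number is U+10FFFF, a fact not stated in the text above; the name bytes_needed is chosen here only for illustration.

```c
// Returns the number of whole bytes needed to write n in binary notation.
// The result is at least 1, so the number 0 still occupies one byte.
// (bytes_needed is a hypothetical name chosen for this sketch.)
int bytes_needed(unsigned long n) {
    int bytes = 1;
    while (n > 0xFF) {   // n does not fit in the 8 bits of one byte
        n >>= 8;         // discard the lowest byte
        bytes++;
    }
    return bytes;
}
```

With this function, bytes_needed(0x10FFFF) is 3, while bytes_needed(0x41) is only 1: the letter A would waste two of its three bytes in a fixed-width representation.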
An encoding scheme (or character encoding) is a table that associates a sequence of bytes with each Unicode number, and therefore with each Unicode character. The sequence of bytes associated with a character is the code of the character. The next sections examine two encodings: ASCII and UTF-8.
The ASCII code is very simple: the Unicode number of each character is written in binary notation. This code is used only for the ASCII alphabet. Since this alphabet has only 128 characters, the ASCII code uses only 1 byte per character and the first bit of this byte is 0. Here is a sample of the code table:
Unicode | character | ASCII code | hexa |
U+0021 | ! | 00100001 | x21 |
U+0022 | " | 00100010 | x22 |
U+002D | - | 00101101 | x2D |
U+0039 | 9 | 00111001 | x39 |
U+0041 | A | 01000001 | x41 |
U+0042 | B | 01000010 | x42 |
U+0061 | a | 01100001 | x61 |
U+0062 | b | 01100010 | x62 |
U+007E | ~ | 01111110 | x7E |
The last column shows the ASCII code written in hexadecimal notation.
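The binary column of a table like this one can be produced by examining the 8 bits of a byte, from the most significant to the least significant. The following is a minimal sketch; the name byte_to_bits is an assumption made for this example.

```c
// Stores in bits the 8 binary digits of the byte b,
// most significant bit first, followed by a null byte.
// Returns bits. (byte_to_bits is a hypothetical name.)
char *byte_to_bits(unsigned char b, char bits[9]) {
    for (int i = 0; i < 8; i++)
        bits[i] = (b & (0x80 >> i)) ? '1' : '0';   // test bit 7-i of b
    bits[8] = '\0';
    return bits;
}
```

For example, byte_to_bits('A', bits) produces the string 01000001. For every ASCII character, the first digit of the result is 0.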
(Why not use all 8 bits of a byte? We could then encode an additional 128 characters. The ISO-LATIN-1 code does exactly this, but it is rarely used nowadays. The ISO-LATIN-1 set includes the characters ª ± º ¼ ½ ¾ À Á Â Ã Ç È É Ê Ì Í Î Ò Ó Ô × Ù Ú Û à á â ã ç è é ê ì í î ò ó ô õ ÷ ù ú û among others. The numerical names of these characters are the same in the ISO-LATIN-1 table and the Unicode table.)
If we were to use a fixed number of bytes per character, we would need 3 bytes. The solution is to resort to a multibyte code, which employs a variable number of bytes per character: some characters use 1 byte, others use 2 bytes, and so on.
The most widely used multibyte code is known as UTF-8. It associates a sequence of 1 to 4 bytes (8 to 32 bits) with each Unicode character. The first 128 characters use the good old ASCII code of 1 byte per character. The remaining characters have a longer code. Here is a tiny sample:
Unicode | character | UTF-8 code | hexa |
U+0021 | ! | 00100001 | x21 |
U+0022 | " | 00100010 | x22 |
U+002D | - | 00101101 | x2D |
U+0039 | 9 | 00111001 | x39 |
U+0041 | A | 01000001 | x41 |
U+0042 | B | 01000010 | x42 |
U+0061 | a | 01100001 | x61 |
U+0062 | b | 01100010 | x62 |
U+007E | ~ | 01111110 | x7E |
U+00A3 | £ | 11000010 10100011 | xC2A3 |
U+00C1 | Á | 11000011 10000001 | xC381 |
U+00F1 | ñ | 11000011 10110001 | xC3B1 |
U+00F3 | ó | 11000011 10110011 | xC3B3 |
U+03A3 | Σ | 11001110 10100011 | xCEA3 |
U+2212 | − | 11100010 10001000 10010010 | xE28892 |
U+221E | ∞ | 11100010 10001000 10011110 | xE2889E |
U+2264 | ≤ | 11100010 10001001 10100100 | xE289A4 |
(The last column shows the UTF-8 code in hexadecimal notation.) The list of UTF-8 codes of all the Unicode characters can be seen on the page UTF-8 encoding table and Unicode characters or on the Wikibooks page Unicode / Character reference. For example, the character string i ≤ 99 is represented in UTF-8 by the sequence of bytes
x69 | x20 | xE2 | x89 | xA4 | x20 | x39 | x39 |
i | ␣ | ≤ | ␣ | 9 | 9 |
where ␣ indicates the space character.
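The UTF-8 code of a character can be computed from its Unicode number instead of being looked up in a table. The text above does not give the rule; the sketch below follows the standard UTF-8 bit layout, in which the first byte carries a length prefix and each remaining byte carries 6 bits of the number after the prefix 10. The name utf8_encode is an assumption made for this example.

```c
// Writes the UTF-8 code of the code point cp into out and returns
// the number of bytes written (1 to 4). Returns 0 if cp has no
// UTF-8 code (above U+10FFFF or in the range U+D800..U+DFFF).
// (utf8_encode is a hypothetical name chosen for this sketch.)
int utf8_encode(unsigned long cp, unsigned char out[4]) {
    if (cp < 0x80) {                       // 0xxxxxxx
        out[0] = (unsigned char) cp;
        return 1;
    }
    if (cp < 0x800) {                      // 110xxxxx 10xxxxxx
        out[0] = (unsigned char) (0xC0 | (cp >> 6));
        out[1] = (unsigned char) (0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    // 1110xxxx 10xxxxxx 10xxxxxx
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                      // reserved range, not characters
        out[0] = (unsigned char) (0xE0 | (cp >> 12));
        out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char) (0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  // 11110xxx and three 10xxxxxx
        out[0] = (unsigned char) (0xF0 | (cp >> 18));
        out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char) (0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For example, utf8_encode(0x2264, out) returns 3 and stores the bytes xE2 x89 xA4 in out, which is the code of ≤ used in the byte sequence above.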
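Decoding. Since the number of bytes per character is not fixed, the decoding of a sequence of bytes is not easy. How do we know where the code of one character ends and the code of the next character begins? The UTF-8 encoding scheme was designed so that the first bits of the code of a character indicate how many bytes the code occupies. If the first bit is 0, and therefore the value of the first byte is smaller than 128, then this is the only byte of the character. If the value of the first byte belongs to the interval 192 .. 223 (first bits 110), then the code of the character has two bytes. If it belongs to 224 .. 239 (first bits 1110), the code has three bytes, and if it belongs to 240 .. 247 (first bits 11110), the code has four bytes. Every byte after the first has a value in the interval 128 .. 191 (first bits 10), so it can never be mistaken for the first byte of a character.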
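The decoding rule can be sketched in C as follows. The names utf8_length and utf8_decode are assumptions made for this example. The sketch checks only the structure of the code (a valid first byte followed by the right number of continuation bytes); it does not reject every ill-formed sequence, such as a character encoded with more bytes than necessary.

```c
// Returns the number of bytes of the code whose first byte is b,
// or 0 if b cannot be the first byte of a UTF-8 code.
// (utf8_length and utf8_decode are hypothetical names.)
int utf8_length(unsigned char b) {
    if (b < 0x80) return 1;                 // 0xxxxxxx:   0..127
    if (b >= 0xC0 && b <= 0xDF) return 2;   // 110xxxxx: 192..223
    if (b >= 0xE0 && b <= 0xEF) return 3;   // 1110xxxx: 224..239
    if (b >= 0xF0 && b <= 0xF7) return 4;   // 11110xxx: 240..247
    return 0;             // continuation byte (128..191) or 248..255
}

// Decodes the character whose code begins at s[0]. Stores its
// Unicode number in *cp and returns the length of the code in bytes.
// Returns 0, leaving *cp untouched, if the bytes are not well formed.
int utf8_decode(const unsigned char *s, unsigned long *cp) {
    int len = utf8_length(s[0]);
    if (len == 0) return 0;
    if (len == 1) {
        *cp = s[0];
        return 1;
    }
    // The first byte contributes 5, 4, or 3 bits for len 2, 3, or 4.
    unsigned long value = s[0] & (0xFF >> (len + 1));
    for (int i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)   // every other byte must be 10xxxxxx
            return 0;
        value = (value << 6) | (s[i] & 0x3F);
    }
    *cp = value;
    return len;
}
```

To walk through a string, call utf8_decode repeatedly and advance by the value it returns. Since a null byte is not a continuation byte, the function never reads past the end of a null-terminated string.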
Assume UTF-8. The C programming language does not prescribe any specific encoding scheme, but the most widely used encoding is UTF-8. The present site assumes that all the text files, be they programs or data, are encoded in UTF-8. (But in many examples, only the ASCII subset of UTF-8 is used.)
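Try decoding the following two sequences of UTF-8 bytes by hand (the first is written in decimal notation, the second in hexadecimal):
118 91 110 93 32 61 32 226 136 158
x41 x74 x65 x6E x63 x69 xC3 xB3 x6E x21
(Consult the page UTF-8 encoding table and Unicode characters. You may have to use the go to other block button.)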
There is no way of knowing, with certainty, which encoding a given text file uses. The author of the file must announce, outside the file, the encoding scheme used.
There are utilities (such as file) that scan a file and try to guess, with some degree of confidence, its encoding scheme.
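One simple guess of this kind is to test whether the bytes of the file have the structure of UTF-8 text. The sketch below illustrates the idea only; it is not a description of how the file utility works, and the name looks_like_utf8 is an assumption made for this example. A positive answer is not a proof: a file containing only ASCII bytes passes the test, and so may a file in some other encoding, by coincidence.

```c
#include <stddef.h>

// Returns 1 if the n bytes of buf have the structure of UTF-8 text
// and 0 otherwise. This is a heuristic sketch: it checks first bytes
// and continuation bytes, but not every rule of the UTF-8 standard.
// (looks_like_utf8 is a hypothetical name.)
int looks_like_utf8(const unsigned char *buf, size_t n) {
    size_t i = 0;
    while (i < n) {
        unsigned char b = buf[i];
        size_t extra;   // number of continuation bytes expected
        if (b < 0x80) extra = 0;
        else if (b >= 0xC2 && b <= 0xDF) extra = 1;
        else if (b >= 0xE0 && b <= 0xEF) extra = 2;
        else if (b >= 0xF0 && b <= 0xF4) extra = 3;
        else return 0;                     // cannot start a character
        if (i + extra >= n) return 0;      // code is cut off at the end
        for (size_t k = 1; k <= extra; k++)
            if ((buf[i + k] & 0xC0) != 0x80)
                return 0;                  // not of the form 10xxxxxx
        i += extra + 1;
    }
    return 1;
}
```

For example, the word Atención stored in ISO-LATIN-1 contains the byte xF3 followed by the letter n, which is not a continuation byte, so the function returns 0 for it.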
If you know the encoding scheme used by your file, you can use the iconv filter to change the encoding. You can, for example, convert an ISO-LATIN-1 file into an equivalent UTF-8 file.
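The conversion from ISO-LATIN-1 to UTF-8 is simple enough to be written by hand, because the numerical name of each ISO-LATIN-1 character coincides with its Unicode number. The sketch below is not the way iconv is implemented; it handles only this one conversion, and the name latin1_to_utf8 is an assumption made for this example.

```c
#include <stddef.h>

// Converts the n bytes of ISO-LATIN-1 text in src into UTF-8.
// Stores the result in out, which must have room for 2*n bytes,
// and returns the number of bytes stored.
// (latin1_to_utf8 is a hypothetical name chosen for this sketch.)
size_t latin1_to_utf8(const unsigned char *src, size_t n,
                      unsigned char *out) {
    size_t j = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned char c = src[i];
        if (c < 0x80) {
            out[j++] = c;   // ASCII character: the code is the same
        } else {
            // Code points 128..255 take two bytes: 110xxxxx 10xxxxxx.
            out[j++] = (unsigned char) (0xC0 | (c >> 6));
            out[j++] = (unsigned char) (0x80 | (c & 0x3F));
        }
    }
    return j;
}
```

For example, the ISO-LATIN-1 byte xF3 (the letter ó) becomes the two bytes xC3 xB3, while every ASCII byte is copied unchanged.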