Programming in D - Characters

Characters

Characters are building blocks of strings. Any symbol of a writing system is called a character: letters of alphabets, numerals, punctuation marks, the space character, etc. Confusingly, building blocks of characters themselves are called characters as well.

Arrays of characters make up strings. We have seen arrays in the previous chapter; strings will be covered two chapters later.

Like any other data, characters are also represented as integer values that are made up of bits. For example, the integer value of the lowercase 'a' is 97 and the integer value of the numeral '1' is 49. These values are merely a convention, assigned when the ASCII standard was designed.
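As a minimal sketch of this idea, the following program converts character literals to their integer values and back. The variable names are made up for this illustration:

```d
import std.stdio;

void main() {
    // A character literal converts implicitly to its integer code
    int codeOfA = 'a';
    int codeOfOne = '1';

    writeln("'a' is ", codeOfA);      // 'a' is 97
    writeln("'1' is ", codeOfOne);    // '1' is 49

    // The conversion works in the other direction as well
    char fromCode = 97;
    writeln("97 is '", fromCode, "'");    // 97 is 'a'
}
```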

In many programming languages, characters are represented by the char type, which can hold only 256 distinct values. If you are familiar with the char type from other languages, you may already know that it is not large enough to support the symbols of many writing systems. Before getting to the three distinct character types of D, let's first take a look at the history of characters in computer systems.

History

ASCII Table

The ASCII table was designed at a time when computer hardware was very limited compared to modern systems. Having been based on 7 bits, the ASCII table can have 128 distinct code values. That many distinct values are sufficient to represent the lowercase and uppercase versions of the 26 letters of the basic Latin alphabet, numerals, commonly used punctuation marks, and some terminal control characters.

As an example, the ASCII codes of the characters of the string "hello" are the following (the commas are inserted just to make it easier to read):

104, 101, 108, 108, 111

Every code above represents a single letter of "hello". For example, there are two 108 values corresponding to the two 'l' letters.
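The list above can be reproduced with a short program. This is only a sketch: it relies on a foreach loop and a string literal, both of which are covered in later chapters.

```d
import std.stdio;

void main() {
    // Visit every character of "hello" and print its numeric code
    foreach (char c; "hello") {
        write(cast(int) c, ' ');
    }
    writeln();    // prints: 104 101 108 108 111
}
```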

The codes of the ASCII table were later increased to 8 bits to become the Extended ASCII table. The Extended ASCII table has 256 distinct codes.

IBM Code Pages

IBM Corporation has defined a set of tables, each of which assigns the codes of the Extended ASCII table from 128 to 255 to one or more writing systems. These code tables allowed supporting the letters of many more alphabets. For example, the special letters of the Turkish alphabet are a part of IBM's code page 857.

Despite being much more useful than ASCII, code pages have some problems and limitations: In order to display text correctly, it must be known what code page a given text was originally written in. This is because the same code corresponds to a different character in most other tables. For example, the code that represents 'Ğ' in table 857 corresponds to 'ª' in table 437.

In addition to the difficulty in supporting multiple alphabets in a single document, alphabets that have more than 128 non-ASCII characters cannot be supported by an IBM table at all.

ISO/IEC 8859 Code Pages

The ISO/IEC 8859 code pages are a result of international standardization efforts. They are similar to IBM's code pages in how they assign codes to characters. As an example, the special letters of the Turkish alphabet appear in code page 8859-9. These tables have the same problems and limitations as IBM's tables. For example, the Dutch digraph ĳ does not appear in any of these tables.

Unicode

Unicode solves all problems and limitations of previous solutions. Unicode includes more than a hundred thousand characters and symbols of the writing systems of many human languages, current and old. (New ones are constantly under review for addition to the table.) Each of these characters has a unique code. Documents that are encoded in Unicode can include all characters of separate writing systems without any confusion or limitation.

Unicode encodings

Unicode assigns a unique code for each character. Since there are more Unicode characters than an 8-bit value can hold, some characters must be represented by at least two 8-bit values. For example, the Unicode character code of 'Ğ' (286) is greater than the maximum value of a ubyte.

The way characters are represented in electronic mediums is called their encoding. We have seen above how the string "hello" is encoded in ASCII. We will now see three Unicode encodings that correspond to D's character types.

UTF-32: This encoding uses 32 bits (4 bytes) for every Unicode character. The UTF-32 encoding of "hello" is similar to its ASCII encoding, but every character is represented with 4 bytes:

0,0,0,104, 0,0,0,101, 0,0,0,108, 0,0,0,108, 0,0,0,111

As another example, the UTF-32 encoding of "aĞ" is the following:

0,0,0,97, 0,0,1,30

Note: The order of the bytes of UTF-32 may be different on different computer systems.

'a' and 'Ğ' are represented by 1 and 2 significant bytes respectively, and the values of the other 5 bytes are all zeros. These zeros can be thought of as filler bytes to make every Unicode character occupy 4 bytes each.

For documents based on the basic Latin alphabet, this encoding always uses 4 times as many bytes as the ASCII encoding. When most of the characters of a given document have ASCII equivalents, the 3 filler bytes for each of those characters make this encoding more wasteful compared to other encodings.

On the other hand, there are benefits of representing every character by an equal number of bytes. For example, the next Unicode character is always exactly four bytes away.

UTF-16: This encoding uses 16 bits (2 bytes) to represent most of the Unicode characters. Since 16 bits can have about 65 thousand unique values, the remaining (less commonly used) Unicode characters must be represented by two 16-bit values, 4 bytes in total.

As an example, "aĞ" is encoded by 4 bytes in UTF-16:

0,97, 1,30

Note: The order of the bytes of UTF-16 may be different on different computer systems.

Compared to UTF-32, this encoding takes less space for most documents, but because some characters must be represented by more than 2 bytes, UTF-16 is more complicated to process.

UTF-8: This encoding uses 1 to 4 bytes for every character. If a character has an equivalent in the ASCII table, it is represented by 1 byte, with the same numeric code as in the ASCII table. The rest of the Unicode characters are represented by 2, 3, or 4 bytes. Most of the special characters of the European writing systems are among the group of characters that are represented by 2 bytes.

For most documents in western countries, UTF-8 is the encoding that takes the least amount of space. Another benefit of UTF-8 is that the documents that were produced using ASCII can be opened directly (without conversion) as UTF-8 documents. UTF-8 also does not waste any space with filler bytes, as every character is represented by significant bytes. As an example, the UTF-8 encoding of "aĞ" is:

97, 196,158
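The three encodings can be compared in a short program. The sketch below assumes the c, w, and d string literal suffixes and the representation() function of std.string, none of which have been introduced yet; "\u011e" is the Unicode escape for 'Ğ'. Note that the UTF-16 and UTF-32 code units are printed as whole 16-bit and 32-bit values, so the bytes 1,30 above appear as 286 (1 × 256 + 30):

```d
import std.stdio;
import std.string : representation;

void main() {
    // The suffixes select UTF-8, UTF-16, and UTF-32 literals
    string  utf8  = "a\u011e"c;
    wstring utf16 = "a\u011e"w;
    dstring utf32 = "a\u011e"d;

    // representation() exposes the code units as plain integers
    writeln(utf8.representation);     // [97, 196, 158]
    writeln(utf16.representation);    // [97, 286]
    writeln(utf32.representation);    // [97, 286]

    // Size in bytes: number of code units times the size of one unit
    writeln(utf8.length  * char.sizeof);     // 3
    writeln(utf16.length * wchar.sizeof);    // 4
    writeln(utf32.length * dchar.sizeof);    // 8

    // A character beyond the 16-bit range needs two UTF-16 code units
    writeln("\U0001F600"w.length);    // 2
}
```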

The character types of D

There are three D types to represent characters. These types correspond to the three Unicode encodings mentioned above. Copying from the Fundamental Types chapter:

Type	Definition	Initial Value
char	UTF-8 code unit	0xFF
wchar	UTF-16 code unit	0xFFFF
dchar	UTF-32 code unit and Unicode code point	0x0000FFFF

Contrary to some other programming languages, characters in D may consist of different numbers of bytes. For example, because 'Ğ' must be represented by at least 2 bytes in Unicode, it doesn't fit in a variable of type char. On the other hand, because dchar consists of 4 bytes, it can hold any Unicode character.
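The sizes and initial values in the table can be checked with the .sizeof and .init properties. The following is a small sketch; the casts to uint are there only to print the values as numbers:

```d
import std.stdio;

void main() {
    // Size of each character type in bytes
    writeln(char.sizeof);     // 1
    writeln(wchar.sizeof);    // 2
    writeln(dchar.sizeof);    // 4

    // Initial values, printed as hexadecimal numbers
    writefln("0x%X", cast(uint) char.init);      // 0xFF
    writefln("0x%X", cast(uint) wchar.init);     // 0xFFFF
    writefln("0x%08X", cast(uint) dchar.init);   // 0x0000FFFF

    // A dchar can hold 'Ğ' (286), which does not fit in a char
    dchar letter = '\u011e';
    writeln(cast(uint) letter);    // 286
}
```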

Character literals

Literals are constant values that are written in the program as a part of the source code. In D, character literals are specified within single quotes:

    char  letter_a = 'a';
    wchar letter_e_acute = 'é';

Double quotes are not valid for characters because double quotes are used when specifying strings, which we will see in a later chapter. 'a' is a character literal and "a" is a string literal that consists of a single character.

Variables of type char can only hold letters that are in the ASCII table.

There are many ways of inserting characters in code:

Most naturally, typing them on the keyboard.
Copying from another program or another text. For example, you can copy and paste from a web site, or from a program that is specifically for displaying Unicode characters. (One such program in most Linux environments is Character Map (charmap on the terminal).)
Using short names of the characters. The syntax for this feature is \&character_name;. For example, the name of the Euro sign is euro and it can be specified in the program as follows:
```
    wchar currencySymbol = '\&euro;';
```
See the list of named characters for all characters that can be specified this way.
Specifying characters by their integer Unicode values:
```
    char a = 97;
    wchar Ğ = 286;
```
Specifying the codes of the characters of the ASCII table either by \value_in_octal or \xvalue_in_hexadecimal syntax:
```
    char questionMarkOctal = '\77';
    char questionMarkHexadecimal = '\x3f';
```
Specifying the Unicode values of the characters by using the \ufour_digit_value syntax for wchar, and the \Ueight_digit_value syntax for dchar (note u versus U). The Unicode values must be specified in hexadecimal:
```
    wchar Ğ_w = '\u011e';
    dchar Ğ_d = '\U0000011e';
```

These methods can be used to specify the characters within strings as well. For example, the following two lines have the same string literals:

    writeln("Résumé preparation: 10.25€");
    writeln("\x52\&eacute;sum\u00e9 preparation: 10.25\&euro;");
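To confirm that these notations are interchangeable, the following sketch compares characters that are specified in different ways. The variable names are chosen just for this example:

```d
import std.stdio;

void main() {
    // Integer value, \u escape, and \U escape all name the same 'Ğ'
    wchar byValue = 286;
    wchar byShortEscape = '\u011e';
    dchar byLongEscape = '\U0000011e';
    writeln(byValue == byShortEscape);         // true
    writeln(byShortEscape == byLongEscape);    // true

    // Octal 77 and hexadecimal 3f are both 63, the code of '?'
    writeln('\77' == '?');     // true
    writeln('\x3f' == '?');    // true

    // The named character euro is the code point U+20AC
    writeln('\&euro;' == '\u20ac');    // true
}
```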

Control characters

Some characters only affect the formatting of the text; they don't have a visual representation themselves. For example, the new-line character, which specifies that the output should continue on a new line, does not have a visual representation. Such characters are called control characters. Some common control characters can be specified with the \control_character syntax.

Syntax	Name	Definition
\n	new line	Moves the printing to a new line
\r	carriage return	Moves the printing to the beginning of the current line
\t	tab	Moves the printing to the next tab stop

For example, the write() function, which does not start a new line automatically, would do so for every \n character. Every occurrence of the \n control character within the following literal represents the start of a new line:

    write("first line\nsecond line\nthird line\n");

The output:

first line
second line
third line
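Control characters have numeric codes like any other character. As a small sketch, the following program prints the codes of the three control characters in the table and uses \t to line up two columns (the exact column position depends on the tab stops of the terminal):

```d
import std.stdio;

void main() {
    writeln(cast(int) '\n');    // 10
    writeln(cast(int) '\r');    // 13
    writeln(cast(int) '\t');    // 9

    // Each \t moves the printing to the next tab stop
    write("name\tvalue\n");
    write("x\t42\n");
}
```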

Single quote and backslash

The single quote character itself cannot be written within single quotes because the compiler would take the second one as the closing character of the first one: '''. The first two would be the opening and closing quotes, and the third one would be left alone, causing a compilation error.

Similarly, since the backslash character has a special meaning in the control character and literal syntaxes, the compiler would take it as the start of such a syntax: '\'. The compiler would then look for a closing single quote character, fail to find one, and emit a compilation error.

For those reasons, the single quote and the backslash characters are escaped by a preceding backslash character:

Syntax	Name	Definition
\'	single quote	Allows specifying the single quote character: '\''
\\	backslash	Allows specifying the backslash character: '\\' or "\\"
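Both escapes can be tried in a short program. This is only a sketch with made-up variable names and an arbitrary example path:

```d
import std.stdio;

void main() {
    char singleQuote = '\'';
    char backslash = '\\';

    writeln(singleQuote);    // prints a single quote character
    writeln(backslash);      // prints a single backslash character

    // Inside a string literal the single quote needs no escape,
    // but the backslash still does
    writeln("It's in C:\\temp");    // It's in C:\temp
}
```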

The std.uni module

The std.uni module includes functions that are useful for working with Unicode characters. The complete list of these functions is in the module's documentation.

The functions whose names start with "is" answer certain questions about characters. The result is false or true depending on whether the answer is no or yes, respectively. These functions are useful in logical expressions:

isLower: is it a lowercase character?
isUpper: is it an uppercase character?
isAlpha: is it a Unicode alphabetic character?
isWhite: is it a whitespace character?

The functions whose names start with "to" produce new characters from existing ones:

toLower: produces the lowercase version of the given character
toUpper: produces the uppercase version of the given character

Here is a program that uses all those functions:

import std.stdio;
import std.uni;

void main() {
    writeln("Is ğ lowercase? ", isLower('ğ'));
    writeln("Is Ş lowercase? ", isLower('Ş'));

    writeln("Is İ uppercase? ", isUpper('İ'));
    writeln("Is ç uppercase? ", isUpper('ç'));

    writeln("Is z alphabetic? ",       isAlpha('z'));
    writeln("Is \&euro; alphabetic? ", isAlpha('\&euro;'));

    writeln("Is new-line whitespace? ",   isWhite('\n'));
    writeln("Is the underscore whitespace? ", isWhite('_'));

    writeln("The lowercase of Ğ: ", toLower('Ğ'));
    writeln("The lowercase of İ: ", toLower('İ'));

    writeln("The uppercase of ş: ", toUpper('ş'));
    writeln("The uppercase of ı: ", toUpper('ı'));
}

The output:

Is ğ lowercase? true
Is Ş lowercase? false
Is İ uppercase? true
Is ç uppercase? false
Is z alphabetic? true
Is € alphabetic? false
Is new-line whitespace? true
Is the underscore whitespace? false
The lowercase of Ğ: ğ
The lowercase of İ: i
The uppercase of ş: Ş
The uppercase of ı: I

Limited support for ı and i

The lowercase and uppercase versions of the letters 'ı' and 'i' are consistently dotted or undotted in some alphabets (e.g. the Turkish alphabet). Most other alphabets are inconsistent in this regard: the uppercase of the dotted 'i' is undotted 'I'.

Because the computer systems have started with the ASCII table, traditionally the uppercase of 'i' is 'I' and the lowercase of 'I' is 'i'. For that reason, these two letters may need special attention. The following program demonstrates this problem:

import std.stdio;
import std.uni;

void main() {
    writeln("The uppercase of i: ", toUpper('i'));
    writeln("The lowercase of I: ", toLower('I'));
}

The output is according to the basic Latin alphabet:

The uppercase of i: I
The lowercase of I: i

Characters are converted between their uppercase and lowercase versions normally by their Unicode character codes. This method is problematic for many alphabets. For example, the Azeri and Celt alphabets are subject to the same problem of producing the lowercase of 'I' as 'i'.
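One possible workaround is to handle the special letters before deferring to std.uni. The following is a minimal sketch for the Turkish alphabet only; toUpperTurkish and toLowerTurkish are hypothetical helpers written for this example and are not part of std.uni:

```d
import std.stdio;
import std.uni : toLower, toUpper;

// Hypothetical helpers: they handle the two Turkish special cases
// and defer to std.uni for every other character.
dchar toUpperTurkish(dchar c) {
    return (c == 'i') ? '\u0130' : toUpper(c);    // i -> İ
}

dchar toLowerTurkish(dchar c) {
    return (c == 'I') ? '\u0131' : toLower(c);    // I -> ı
}

void main() {
    writeln("The Turkish uppercase of i: ", toUpperTurkish('i'));
    writeln("The Turkish lowercase of I: ", toLowerTurkish('I'));
}
```

Such helpers are correct only when the program knows that the text is in that alphabet; the characters themselves do not carry that information.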

There are similar problems with sorting: Many letters like 'ğ' and 'á' may be sorted after 'z' even for the basic Latin alphabet.
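The sorting problem can be observed with the sort() function of std.algorithm, which has not been covered yet. In this sketch 'ç' and 'ğ' are written with their Unicode escapes; because the default comparison uses the character codes, both end up after 'z':

```d
import std.algorithm : sort;
import std.stdio;

void main() {
    // ç is U+00E7 (231) and ğ is U+011F (287);
    // both codes are greater than the code of 'z' (122)
    dchar[] letters = [ 'z', '\u011f', 'b', '\u00e7', 'a' ];

    // The default comparison orders characters by their Unicode codes
    sort(letters);

    writeln(letters);    // abzçğ
}
```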

Problems with reading characters

The flexibility and power of D's Unicode characters may cause unexpected results when reading characters from an input stream. This contradiction is due to the multiple meanings of the term character. Before expanding on this further, let's look at a program that exhibits this problem:

import std.stdio;

void main() {
    char letter;
    write("Please enter a letter: ");
    readf(" %s", &letter);
    writeln("The letter that has been read: ", letter);
}

If you try that program in an environment that does not use Unicode, you may see that even the non-ASCII characters are read and printed correctly.

On the other hand, if you start the same program in a Unicode environment (e.g. a Linux terminal), you may see that the character printed on the output is not the same character that has been entered. To see this, let's enter a non-ASCII character in a terminal that uses the UTF-8 encoding (like most Linux terminals):

Please enter a letter: ğ
The letter that has been read:   ← no letter on the output

The reason for this problem is that the non-ASCII characters like 'ğ' are represented by two codes, and reading a char from the input reads only the first one of those codes. Since that single char is not sufficient to represent the whole Unicode character, the program does not have a complete character to display.

To show that the UTF-8 codes that make up a character are indeed read one char at a time, let's read two char variables and print them one after the other:

import std.stdio;

void main() {
    char firstCode;
    char secondCode;

    write("Please enter a letter: ");
    readf(" %s", &firstCode);
    readf(" %s", &secondCode);

    writeln("The letter that has been read: ",
            firstCode, secondCode);
}

The program reads two char variables from the input and prints them in the same order that they are read. When those codes are sent to the terminal in that same order, they complete the UTF-8 encoding of the Unicode character on the terminal and this time the Unicode character is printed correctly:

Please enter a letter: ğ
The letter that has been read: ğ

These results are also related to the fact that the standard inputs and outputs of programs are char streams.

We will see later in the Strings chapter that it is easier to read characters as strings, instead of dealing with UTF codes individually.
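Until then, the relationship between the two char values and the single Unicode character can be sketched without any input at all. The following program assumes the decode() function of the std.utf module, which combines the UTF-8 codes at a given position of a string into one dchar:

```d
import std.stdio;
import std.utf : decode;

void main() {
    // "\u011f" is 'ğ'; as a UTF-8 string it consists of two char values
    string input = "\u011f";
    writeln(input.length);    // 2

    // decode() combines the codes that start at the given index into
    // one Unicode character and advances the index past them
    size_t index = 0;
    dchar letter = decode(input, index);

    writeln(cast(uint) letter);    // 287
    writeln(index);                // 2
}
```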

D's Unicode support

Unicode is a large and complicated standard. D supports a very useful subset of it.

A Unicode-encoded document consists of the following levels of concepts, from the lowermost to the uppermost:

Code unit: The values that make up the UTF encodings are called code units. Depending on the encoding and the characters themselves, Unicode characters are made up of one or more code units. For example, in the UTF-8 encoding the letter 'a' is made up of a single code unit and the letter 'ğ' is made up of two code units.
D's character types char, wchar, and dchar correspond to UTF-8, UTF-16, and UTF-32 code units, respectively.

Code point: Every letter, numeral, symbol, etc. that the Unicode standard defines is called a code point. For example, the Unicode code values of 'a' and 'ğ' are two distinct code points.
Depending on the encoding, code points are represented by one or more code units. As mentioned above, in the UTF-8 encoding 'a' is represented by a single code unit, and 'ğ' is represented by two code units. On the other hand, both 'a' and 'ğ' are represented by a single code unit in both UTF-16 and UTF-32 encodings.
The D type that supports code points is dchar. char and wchar can only be used as code units.

Character: Any symbol that the Unicode standard defines and what we call "character" or "letter" in daily talk is a character.
This definition of character is flexible in Unicode, which brings a complication. Some characters can be formed by more than one code point. For example, the letter 'ğ' can be specified in two ways:
- as the single code point for 'ğ'
- as the two code points for 'g' and '˘' (combining breve)
Although they would mean the same character to a human reader, the single code point 'ğ' is different from the two consecutive code points 'g' and '˘'.
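The difference between the two forms can be sketched as follows. The example assumes that the combining breve is the code point U+0306, and it uses byGrapheme and normalize from std.uni, neither of which is covered in this chapter:

```d
import std.range : walkLength;
import std.stdio;
import std.uni : byGrapheme, normalize, NFC;

void main() {
    dstring precomposed = "\u011f"d;     // single code point for 'ğ'
    dstring combined    = "g\u0306"d;    // 'g' and the combining breve

    // Same character for a reader, different code point sequences
    writeln(precomposed.length);         // 1
    writeln(combined.length);            // 2
    writeln(precomposed == combined);    // false

    // Both are a single character in the daily sense of the word
    writeln(precomposed.byGrapheme.walkLength);    // 1
    writeln(combined.byGrapheme.walkLength);       // 1

    // Normalizing to the composed form (NFC) makes them comparable
    writeln(normalize!NFC(combined) == precomposed);    // true
}
```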

Summary

Unicode supports all characters of all writing systems.
char is for UTF-8 encoding; although it is not suitable to represent characters in general, it supports the ASCII table.
wchar is for UTF-16 encoding; although it is not suitable to represent characters in general, it can support letters of multiple alphabets.
dchar is for UTF-32 encoding; as it is 32 bits, it can also represent code points.
