Programming in D – Tutorial and Reference
Ali Çehreli

Characters

Characters are building blocks of strings. Any symbol of a writing system is called a character: letters of alphabets, numerals, punctuation marks, the space character, etc. Confusingly, building blocks of characters themselves are called characters as well.

Arrays of characters make up strings. We have seen arrays in the previous chapter; strings will be covered two chapters later.

Like any other data, characters are also represented as integer values that are made up of bits. For example, the integer value of the lowercase 'a' is 97 and the integer value of the numeral '1' is 49. These values are merely a convention, assigned when the ASCII standard was designed.

In many programming languages, characters are represented by the char type, which can hold only 256 distinct values. If you are familiar with the char type from other languages, you may already know that it is not large enough to support the symbols of many writing systems. Before getting to the three distinct character types of D, let's first take a look at the history of characters in computer systems.

History
ASCII Table

The ASCII table was designed at a time when computer hardware was very limited compared to modern systems. Having been based on 7 bits, the ASCII table can have 128 distinct code values. That many distinct values are sufficient to represent the lowercase and uppercase versions of the 26 letters of the basic Latin alphabet, numerals, commonly used punctuation marks, and some terminal control characters.

As an example, the ASCII codes of the characters of the string "hello" are the following (the commas are inserted just to make it easier to read):

104, 101, 108, 108, 111

Every code above represents a single letter of "hello". For example, there are two 108 values corresponding to the two 'l' letters.
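As a minimal sketch, the codes listed above can be printed by a short D program. It assumes the representation function of the std.string module, which exposes the code units of a string as integers; codesOf is a hypothetical helper name used only for this illustration.

```d
import std.stdio;
import std.string : representation;

// Hypothetical helper: returns the integer codes of the
// characters of an ASCII string.
immutable(ubyte)[] codesOf(string text) {
    return text.representation;
}

void main() {
    writeln(codesOf("hello"));
}
```

The program should print [104, 101, 108, 108, 111], the same five codes as above.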

The codes of the ASCII table were later increased to 8 bits to become the Extended ASCII table. The Extended ASCII table has 256 distinct codes.

IBM Code Pages

IBM Corporation has defined a set of tables, each one of which assigns the codes of the Extended ASCII table from 128 to 255 to one or more writing systems. These code tables allowed supporting the letters of many more alphabets. For example, the special letters of the Turkish alphabet are a part of IBM's code page 857.

Despite being much more useful than ASCII, code pages have some problems and limitations: In order to display text correctly, it must be known what code page a given text was originally written in. This is because the same code corresponds to a different character in most other tables. For example, the code that represents 'Ğ' in table 857 corresponds to 'ª' in table 437.

In addition to the difficulty in supporting multiple alphabets in a single document, alphabets that have more than 128 non-ASCII characters cannot be supported by an IBM table at all.

ISO/IEC 8859 Code Pages

The ISO/IEC 8859 code pages are a result of international standardization efforts. They are similar to IBM's code pages in how they assign codes to characters. As an example, the special letters of the Turkish alphabet appear in code page 8859-9. These tables have the same problems and limitations as IBM's tables. For example, the Dutch digraph ij does not appear in any of these tables.

Unicode

Unicode solves all problems and limitations of previous solutions. Unicode includes more than a hundred thousand characters and symbols of the writing systems of many human languages, current and old. (New ones are constantly under review for addition to the table.) Each of these characters has a unique code. Documents that are encoded in Unicode can include all characters of separate writing systems without any confusion or limitation.

Unicode encodings

Unicode assigns a unique code for each character. Since there are more Unicode characters than an 8-bit value can hold, some characters must be represented by at least two 8-bit values. For example, the Unicode character code of 'Ğ' (286) is greater than the maximum value of a ubyte.

The way characters are represented in electronic mediums is called their encoding. We have seen above how the string "hello" is encoded in ASCII. We will now see three Unicode encodings that correspond to D's character types.

UTF-32: This encoding uses 32 bits (4 bytes) for every Unicode character. The UTF-32 encoding of "hello" is similar to its ASCII encoding, but every character is represented with 4 bytes:

0,0,0,104, 0,0,0,101, 0,0,0,108, 0,0,0,108, 0,0,0,111

As another example, the UTF-32 encoding of "aĞ" is the following:

0,0,0,97, 0,0,1,30

Note: The order of the bytes of UTF-32 may be different on different computer systems.

'a' and 'Ğ' are represented by 1 and 2 significant bytes respectively, and the values of the other 5 bytes are all zeros. These zeros can be thought of as filler bytes to make every Unicode character occupy 4 bytes each.

For documents based on the basic Latin alphabet, this encoding always uses 4 times as many bytes as the ASCII encoding. When most of the characters of a given document have ASCII equivalents, the 3 filler bytes for each of those characters make this encoding more wasteful compared to other encodings.

On the other hand, there are benefits of representing every character by an equal number of bytes. For example, the next Unicode character is always exactly four bytes away.
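The following sketch illustrates that benefit under the assumption that the text is stored in D's dstring type (an array of dchar) and, for comparison, in the UTF-8 based string type. Strings are covered in a later chapter; they are used here only to contrast fixed-width and variable-width code units.

```d
import std.stdio;

void main() {
    dstring utf32 = "aĞb"d;   // one dchar per character
    string  utf8  = "aĞb";    // 'Ğ' takes two char code units

    // Indexing a dstring reaches whole characters directly
    writeln(utf32[1]);          // the second character
    writeln(utf32.length);      // number of dchar code units

    // Indexing a string reaches individual UTF-8 code units
    writeln(cast(int)utf8[1]);  // first code unit of 'Ğ'
    writeln(utf8.length);       // number of char code units
}
```

The dstring has 3 elements and its second element is 'Ğ' itself, whereas the string has 4 elements and its second element is only the first UTF-8 code unit of 'Ğ' (196).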

UTF-16: This encoding uses 16 bits (2 bytes) to represent most of the Unicode characters. Since 16 bits can have about 65 thousand unique values, the other (less commonly used) 35 thousand Unicode characters must be represented using additional bytes.

As an example, "aĞ" is encoded by 4 bytes in UTF-16:

0,97, 1,30

Note: The order of the bytes of UTF-16 may be different on different computer systems.

Compared to UTF-32, this encoding takes less space for most documents, but because some characters must be represented by more than 2 bytes, UTF-16 is more complicated to process.

UTF-8: This encoding uses 1 to 4 bytes for every character. If a character has an equivalent in the ASCII table, it is represented by 1 byte, with the same numeric code as in the ASCII table. The rest of the Unicode characters are represented by 2, 3, or 4 bytes. Most of the special characters of the European writing systems are among the group of characters that are represented by 2 bytes.

For most documents in western countries, UTF-8 is the encoding that takes the least amount of space. Another benefit of UTF-8 is that the documents that were produced using ASCII can be opened directly (without conversion) as UTF-8 documents. UTF-8 also does not waste any space with filler bytes, as every character is represented by significant bytes. As an example, the UTF-8 encoding of "aĞ" is:

97, 196,158
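
The three encodings of "aĞ" shown above can be reproduced with a small program. This is only a sketch: it assumes the c, w, and d string literal suffixes and the representation function of std.string, and it prints whole code units rather than individual bytes.

```d
import std.stdio;
import std.string : representation;

void main() {
    // The c, w, and d suffixes select UTF-8, UTF-16, and UTF-32
    writeln("aĞ"c.representation);   // UTF-8 code units  (ubyte)
    writeln("aĞ"w.representation);   // UTF-16 code units (ushort)
    writeln("aĞ"d.representation);   // UTF-32 code units (uint)
}
```

The first line of the output should be [97, 196, 158]. The other two lines should both be [97, 286], because a UTF-16 or UTF-32 code unit is printed as a single integer: 286 is the value of the byte pair 1,30 (1 × 256 + 30).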

The character types of D

There are three D types to represent characters. These types correspond to the three Unicode encodings mentioned above. Copying from the Fundamental Types chapter:

Type    Definition                                 Initial Value
char    UTF-8 code unit                            0xFF
wchar   UTF-16 code unit                           0xFFFF
dchar   UTF-32 code unit and Unicode code point    0x0000FFFF

Compared to some other programming languages, characters in D may consist of different numbers of bytes. For example, because 'Ğ' must be represented by at least 2 bytes in Unicode, it doesn't fit in a variable of type char. On the other hand, because dchar consists of 4 bytes, it can hold any Unicode character.
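
The sizes and initial values of the three types can be checked with the .sizeof and .init properties. The following is a minimal sketch; the __traits(compiles, ...) check is used here only to show that the compiler rejects storing 'Ğ' in a char.

```d
import std.stdio;

void main() {
    // Sizes in bytes of the three character types
    writeln(char.sizeof, " ", wchar.sizeof, " ", dchar.sizeof);

    // Initial values, printed as integers
    writeln(cast(int)char.init, " ",
            cast(int)wchar.init, " ",
            cast(int)dchar.init);

    // 'Ğ' (286) does not fit in a char but fits in wider types
    static assert(!__traits(compiles, { char c = 'Ğ'; }));
    wchar w = 'Ğ';
    dchar d = 'Ğ';
    writeln(w, " ", d);
}
```

The first two lines of the output should be 1 2 4 and 255 65535 65535, which are the sizes of the three types and the decimal equivalents of their initial values.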

Character literals

Literals are constant values that are written in the program as a part of the source code. In D, character literals are specified within single quotes:

    char  letter_a = 'a';
    wchar letter_e_acute = 'é';

Double quotes are not valid for characters because double quotes are used when specifying strings, which we will see in a later chapter. 'a' is a character literal and "a" is a string literal that consists of a single character.

Variables of type char can only hold letters that are in the ASCII table.

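There are many ways of inserting characters in code: typing them directly on the keyboard, copying them from another program or document, using named character entities with the \&name; syntax (e.g. \&eacute; and \&euro;), and specifying their codes in hexadecimal with the \x, \u, and \U syntaxes (e.g. \x52 and \u00e9).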

These methods can be used to specify the characters within strings as well. For example, the following two lines have the same string literals:

    writeln("Résumé preparation: 10.25€");
    writeln("\x52\&eacute;sum\u00e9 preparation: 10.25\&euro;");
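
As a hedged illustration, the following sketch compares directly typed characters with escaped spellings of the same characters. It assumes the \x, \u, and \U hexadecimal syntaxes and the \&name; named character entity syntax; the particular characters are chosen only for this example.

```d
import std.stdio;

void main() {
    // Each comparison pairs a directly typed character with an
    // escaped spelling of the same character
    assert('R' == '\x52');          // two hexadecimal digits
    assert('é' == '\u00e9');        // four hexadecimal digits
    assert('é' == '\U000000e9');    // eight hexadecimal digits
    assert("é" == "\&eacute;");     // named character entity
    assert("€" == "\&euro;");

    writeln("All spellings agree.");
}
```

If any of the spellings differed, the corresponding assert would stop the program before the final message is printed.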

Control characters

Some characters only affect the formatting of the text; they don't have a visual representation themselves. For example, the new-line character, which specifies that the output should continue on a new line, does not have a visual representation. Such characters are called control characters. Some common control characters can be specified with the \control_character syntax.

Syntax   Name              Definition
\n       new line          Moves the printing to a new line
\r       carriage return   Moves the printing to the beginning of the current line
\t       tab               Moves the printing to the next tab stop

For example, the write() function, which does not start a new line automatically, would do so for every \n character. Every occurrence of the \n control character within the following literal represents the start of a new line:

    write("first line\nsecond line\nthird line\n");

The output:

first line
second line
third line
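
Control characters are ordinary characters with integer codes of their own. The following sketch, which is only an illustration, prints the ASCII codes of the three control characters above and shows that an escape sequence counts as a single character inside a string literal.

```d
import std.stdio;

void main() {
    // Control characters are ordinary characters with integer codes
    writeln(cast(int)'\n');   // new line
    writeln(cast(int)'\r');   // carriage return
    writeln(cast(int)'\t');   // tab

    // Each escape sequence is a single character in a string literal
    writeln("first\nsecond".length);
}
```

The output should be 10, 13, 9, and 12 on separate lines: the last value counts the five letters of "first", the single new-line character, and the six letters of "second".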

Single quote and backslash

The single quote character itself cannot be written within single quotes because the compiler would take the second one as the closing character of the first one: '''. The first two would be the opening and closing quotes, and the third one would be left alone, causing a compilation error.

Similarly, since the backslash character has a special meaning in the control character and literal syntaxes, the compiler would take it as the start of such a syntax: '\'. The compiler would then look for a closing single quote character, fail to find one, and emit a compilation error.

For those reasons, the single quote and the backslash characters are escaped by a preceding backslash character:

Syntax   Name           Definition
\'       single quote   Allows specifying the single quote character: '\''
\\       backslash      Allows specifying the backslash character: '\\' or "\\"
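
A minimal sketch of both escapes in use follows; the variable names are chosen only for this illustration.

```d
import std.stdio;

void main() {
    char quote = '\'';
    char backslash = '\\';

    // Prints one single quote followed by one backslash
    writeln(quote, backslash);

    // The escape sequence is a single character in a string as well
    writeln("\\".length);
}
```

Although each literal is written with two characters between the quotes, each one represents a single character, so the second line of the output should be 1.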

The std.uni module

The std.uni module includes functions that are useful with Unicode characters. The module is described in its documentation.

The functions that start with is, such as isLower, isUpper, isAlpha, and isWhite, answer certain questions about characters. The result is false or true depending on whether the answer is no or yes, respectively. These functions are useful in logical expressions.

The functions that start with to, such as toLower and toUpper, produce new characters from existing ones.

Here is a program that uses all those functions:

import std.stdio;
import std.uni;

void main() {
    writeln("Is ğ lowercase? ", isLower('ğ'));
    writeln("Is Ş lowercase? ", isLower('Ş'));

    writeln("Is İ uppercase? ", isUpper('İ'));
    writeln("Is ç uppercase? ", isUpper('ç'));

    writeln("Is z alphanumeric? ",       isAlpha('z'));
    writeln("Is \&euro; alphanumeric? ", isAlpha('\&euro;'));

    writeln("Is new-line whitespace? ",  isWhite('\n'));
    writeln("Is underline whitespace? ", isWhite('_'));

    writeln("The lowercase of Ğ: ", toLower('Ğ'));
    writeln("The lowercase of İ: ", toLower('İ'));

    writeln("The uppercase of ş: ", toUpper('ş'));
    writeln("The uppercase of ı: ", toUpper('ı'));
}

The output:

Is ğ lowercase? true
Is Ş lowercase? false
Is İ uppercase? true
Is ç uppercase? false
Is z alphanumeric? true
Is € alphanumeric? false
Is new-line whitespace? true
Is underline whitespace? false
The lowercase of Ğ: ğ
The lowercase of İ: i
The uppercase of ş: Ş
The uppercase of ı: I

Limited support for ı and i

The lowercase and uppercase versions of the letters 'ı' and 'i' are consistently dotted or undotted in some alphabets (e.g. the Turkish alphabet). Most other alphabets are inconsistent in this regard: the uppercase of the dotted 'i' is undotted 'I'.

Because computer systems started with the ASCII table, traditionally the uppercase of 'i' is 'I' and the lowercase of 'I' is 'i'. For that reason, these two letters may need special attention. The following program demonstrates this problem:

import std.stdio;
import std.uni;

void main() {
    writeln("The uppercase of i: ", toUpper('i'));
    writeln("The lowercase of I: ", toLower('I'));
}

The output is according to the basic Latin alphabet:

The uppercase of i: I
The lowercase of I: i

Characters are converted between their uppercase and lowercase versions normally by their Unicode character codes. This method is problematic for many alphabets. For example, the Azeri and Celt alphabets are subject to the same problem of producing the lowercase of 'I' as 'i'.
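
One way of giving these letters special attention is to handle them before deferring to std.uni. The following is only a sketch under that assumption: toUpperTurkish and toLowerTurkish are hypothetical helpers written for this illustration, not functions of std.uni, and they cover only the Turkish rules described above.

```d
import std.stdio;
import std.uni : toUpper, toLower;

// Hypothetical helpers, not part of std.uni: they special-case
// the dotted and dotless i of the Turkish alphabet and defer to
// std.uni for every other character.
dchar toUpperTurkish(dchar c) {
    switch (c) {
        case 'i': return 'İ';
        case 'ı': return 'I';
        default:  return toUpper(c);
    }
}

dchar toLowerTurkish(dchar c) {
    switch (c) {
        case 'I': return 'ı';
        case 'İ': return 'i';
        default:  return toLower(c);
    }
}

void main() {
    writeln(toUpperTurkish('i'), toLowerTurkish('I'));
    writeln(toUpperTurkish('ş'), toLowerTurkish('Ğ'));
}
```

With these helpers the dotted and dotless letters keep their dots consistently: the first line of the output should be İı and the second Şğ.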

There are similar problems with sorting: Many letters like 'ğ' and 'á' may be sorted after 'z' even for the basic Latin alphabet.
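
The sorting problem can be observed with a short sketch. It assumes the sort function of std.algorithm with its default comparison, which orders dchar values by their Unicode codes.

```d
import std.stdio;
import std.algorithm : sort;

void main() {
    dchar[] letters = [ 'z', 'ğ', 'b', 'á', 'a' ];

    // The default comparison orders by Unicode code value, so
    // 'á' (225) and 'ğ' (287) end up after 'z' (122)
    sort(letters);
    writeln(letters);
}
```

The output should be abzáğ, where both accented letters follow 'z' instead of appearing next to 'a' and 'g'.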

Problems with reading characters

The flexibility and power of D's Unicode characters may cause unexpected results when reading characters from an input stream. This contradiction is due to the multiple meanings of the term character. Before expanding on this further, let's look at a program that exhibits this problem:

import std.stdio;

void main() {
    char letter;
    write("Please enter a letter: ");
    readf(" %s", &letter);
    writeln("The letter that has been read: ", letter);
}

If you try that program in an environment that does not use Unicode, you may see that even the non-ASCII characters are read and printed correctly.

On the other hand, if you start the same program in a Unicode environment (e.g. a Linux terminal), you may see that the character printed on the output is not the same character that has been entered. To see this, let's enter a non-ASCII character in a terminal that uses the UTF-8 encoding (like most Linux terminals):

Please enter a letter: ğ
The letter that has been read:   ← no letter on the output

The reason for this problem is that the non-ASCII characters like 'ğ' are represented by two codes, and reading a char from the input reads only the first one of those codes. Since that single char is not sufficient to represent the whole Unicode character, the program does not have a complete character to display.

To show that the UTF-8 codes that make up a character are indeed read one char at a time, let's read two char variables and print them one after the other:

import std.stdio;

void main() {
    char firstCode;
    char secondCode;

    write("Please enter a letter: ");
    readf(" %s", &firstCode);
    readf(" %s", &secondCode);

    writeln("The letter that has been read: ",
            firstCode, secondCode);
}

The program reads two char variables from the input and prints them in the same order that they are read. When those codes are sent to the terminal in that same order, they complete the UTF-8 encoding of the Unicode character on the terminal and this time the Unicode character is printed correctly:

Please enter a letter: ğ
The letter that has been read: ğ

These results are also related to the fact that the standard inputs and outputs of programs are char streams.

We will see later in the Strings chapter that it is easier to read characters as strings, instead of dealing with UTF codes individually.
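
The two UTF-8 codes of 'ğ' can also be observed without involving the terminal. The following sketch assumes the decode function of the std.utf module, which combines the code units that start at a given index into a single dchar.

```d
import std.stdio;
import std.utf : decode;

void main() {
    string input = "ğ";

    // The single letter occupies two char code units in UTF-8
    writeln(input.length);
    writeln(cast(int)input[0], " ", cast(int)input[1]);

    // decode() combines the code units starting at 'index' into
    // one dchar and advances 'index' past them
    size_t index = 0;
    dchar letter = decode(input, index);
    writeln(letter, " ", index);
}
```

The output should be 2, then 196 159, then ğ 2: neither code unit is a complete character on its own, but decoding both of them together yields the letter that was entered.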

D's Unicode support

Unicode is a large and complicated standard. D supports a very useful subset of it.

A Unicode-encoded document consists of the following levels of concepts, from the lowermost to the uppermost:

Summary