Programming in D

Strings

We have used strings in many programs that we have seen so far. Strings are a combination of the two features that we have covered in the last three chapters: characters and arrays. In the simplest definition, strings are nothing but arrays of characters. For example, char[] is a type of string.

This simple definition may be misleading. As we have seen in the Characters chapter, D has three separate character types. Arrays of these character types lead to three separate string types, some of which may have surprising outcomes in some string operations.

`readln` and `strip`, instead of `readf`

There are surprises even when reading strings from the terminal.

Being character arrays, strings can contain control characters like '\n' as well. When reading strings from the input, the control character that corresponds to the Enter key that is pressed at the end of the input becomes a part of the string as well. Further, because there is no way to tell readf() how many characters to read, it continues to read until the end of the entire input. For these reasons, readf() does not work as intended when reading strings:

import std.stdio;

void main() {
    char[] name;

    write("What is your name? ");
    readf(" %s", &name);

    writeln("Hello ", name, "!");
}

The Enter key that the user presses after the name does not terminate the input. readf() continues to wait for more characters to add to the string:

What is your name? Mert
   ← The input is not terminated although Enter has been pressed
   ← (Let's assume that Enter is pressed a second time here)

One way of terminating the standard input stream in a terminal is pressing Ctrl-D under Unix-based systems and Ctrl-Z under Windows systems. If the user eventually terminates the input that way, we see that the new-line characters have been read as parts of the string as well:

Hello Mert
   ← new-line character after the name
!  ← (one more before the exclamation mark)

The exclamation mark appears after those characters instead of being printed right after the name.

readln() is more suitable when reading strings. Short for "read line", readln() reads until the end of the line. It is used differently because the " %s" format string and the & operator are not needed:

import std.stdio;

void main() {
    char[] name;

    write("What is your name? ");
    readln(name);

    writeln("Hello ", name, "!");
}

readln() stores the new-line character as well. This is so that the program has a way of determining whether the input consisted of a complete line or whether the end of input has been reached:

What is your name? Mert
Hello Mert
!  ← new-line character before the exclamation mark
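The trailing new-line character can be used as the signal described above. The following is a minimal sketch rather than a part of the original program: the helper name isCompleteLine is hypothetical, and it assumes that readln() leaves the array empty when the end of input has been reached. (We will see how functions are defined in a later chapter.)

```d
import std.stdio;

// Hypothetical helper: reports whether a line that has been read by
// readln() ends with the new-line character, i.e. whether it is a
// complete line.
bool isCompleteLine(const(char)[] line) {
    return line.length > 0 && line[$ - 1] == '\n';
}

void main() {
    char[] line;

    write("What is your name? ");
    readln(line);

    if (line.length == 0) {
        writeln("No input: the end of input has been reached.");

    } else if (isCompleteLine(line)) {
        writeln("A complete line has been read.");

    } else {
        writeln("The input ended before the end of the line.");
    }
}
```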

Such control characters as well as all whitespace characters at both ends of strings can be removed by std.string.strip:

import std.stdio;
import std.string;

void main() {
    char[] name;

    write("What is your name? ");
    readln(name);
    name = strip(name);

    writeln("Hello ", name, "!");
}

The strip() expression above returns the string without the whitespace characters at both ends, which includes the trailing new-line character. Assigning that return value back to name produces the intended output:

What is your name? Mert
Hello Mert!    ← no new-line character

readln() can be used without a parameter. In that case it returns the line that it has just read. Chaining the result of readln() to strip() enables a shorter and more readable syntax:

    string name = strip(readln());

I will start using that form after introducing the string type below.

`formattedRead` for parsing strings

Once a line is read from the input or from any other source, it is possible to parse and convert the separate data items that it may contain with formattedRead() from the std.format module. Its first parameter is the line that contains the data, and the rest of the parameters are used exactly like those of readf():

import std.stdio;
import std.string;
import std.format;

void main() {
    write("Please enter your name and age," ~
          " separated with a space: ");

    string line = strip(readln());

    string name;
    int age;
    formattedRead(line, " %s %s", name, age);

    writeln("Your name is ", name,
            ", and your age is ", age, '.');
}

Please enter your name and age, separated with a space: Mert 30
Your name is Mert, and your age is 30.

Both readf() and formattedRead() return the number of items that they could parse and convert successfully. That value can be compared against the expected number of data items so that the input can be validated. For example, as the formattedRead() call above expects to read two items (a string as name and an int as age), the following check ensures that it really is the case:

    uint items = formattedRead(line, " %s %s", name, age);

    if (items != 2) {
        writeln("Error: Unexpected line.");

    } else {
        writeln("Your name is ", name,
                ", and your age is ", age, '.');
    }

When the input cannot be converted to name and age, the program prints an error:

Please enter your name and age, separated with a space: Mert
Error: Unexpected line.
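The check above can be wrapped in a small function so that it can be reused. This is only a sketch: the name parseNameAndAge is hypothetical, and the try-catch block is there on the assumption that some malformed inputs (for example, letters where the age is expected) are reported by an exception instead of a smaller item count. Functions and exceptions are covered in later chapters.

```d
import std.stdio;
import std.string;
import std.format;

// Hypothetical helper: returns true only if both items could be
// parsed from the line. The results are stored in name and age.
bool parseNameAndAge(string line, out string name, out int age) {
    try {
        return formattedRead(line, " %s %s", name, age) == 2;

    } catch (Exception e) {
        // Assumption: a failed conversion (e.g. "abc" as an int)
        // is reported by an exception
        return false;
    }
}

void main() {
    write("Please enter your name and age," ~
          " separated with a space: ");

    string name;
    int age;

    if (parseNameAndAge(strip(readln()), name, age)) {
        writeln("Your name is ", name,
                ", and your age is ", age, '.');

    } else {
        writeln("Error: Unexpected line.");
    }
}
```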

Double quotes, not single quotes

We have seen that single quotes are used to define character literals. String literals are defined with double quotes. 'a' is a character; "a" is a string that contains a single character.
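The difference can be observed by asking the compiler for the types of the two literals. The short program below is only an illustration. It uses typeof and .stringof, which have not been covered yet, and the string type that is introduced in the next section:

```d
import std.stdio;

void main() {
    writeln(typeof('a').stringof);    // prints char
    writeln(typeof("a").stringof);    // prints string
    writeln("a".length);              // prints 1
}
```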

`string`, `wstring`, and `dstring` are immutable

There are three string types that correspond to the three character types: char[], wchar[], and dchar[].

There are three aliases of the immutable versions of those types: string, wstring, and dstring. The characters of the variables that are defined by these aliases cannot be modified. For example, the characters of a wchar[] can be modified but the characters of a wstring cannot be modified. (We will see D's immutability concept in later chapters.)

For example, the following code that tries to capitalize the first letter of a string would cause a compilation error:

    string cannotBeMutated = "hello";
    cannotBeMutated[0] = 'H';             // ← compilation ERROR

We may think of defining the variable as a char[] instead of the string alias but that cannot be compiled either:

    char[] a_slice = "hello";  // ← compilation ERROR

This time the compilation error is due to the combination of two factors:

The type of string literals like "hello" is string, not char[], so they are immutable.
The char[] on the left-hand side is a slice, which, if the code compiled, would provide access to all of the characters of the right-hand side.

Since char[] is mutable and string is not, there is a mismatch. The compiler does not allow accessing characters of an immutable array through a mutable slice.

The solution here is to take a copy of the immutable string by using the .dup property:

import std.stdio;

void main() {
    char[] s = "hello".dup;
    s[0] = 'H';
    writeln(s);
}

The program can now be compiled and will print the modified string:

Hello

Similarly, char[] cannot be used where a string is needed. In such cases, the .idup property can be used to produce an immutable string variable from a mutable char[] variable. For example, if s is a variable of type char[], the following line will fail to compile:

    string result = s ~ '.';          // ← compilation ERROR

When the type of s is char[], the type of the expression on the right-hand side of the assignment above is char[] as well. .idup is used for producing immutable strings from existing strings:

    string result = (s ~ '.').idup;   // ← now compiles
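The two properties can be used together: .dup to obtain a copy that can be modified and .idup to turn the result back into a string. The following is a minimal sketch. The function name withFirstLetter is hypothetical, and the code assumes that the first character of the string occupies a single code unit (e.g. an ASCII letter):

```d
import std.stdio;

// Hypothetical helper: returns a copy of text whose first code unit
// is replaced with letter. Assumes that the first character of text
// is a single code unit (e.g. an ASCII letter).
string withFirstLetter(string text, char letter) {
    if (text.length == 0) {
        return text;
    }

    char[] copy = text.dup;    // mutable copy
    copy[0] = letter;
    return copy.idup;          // immutable copy
}

void main() {
    string greeting = "hello";

    writeln(withFirstLetter(greeting, 'H'));    // prints Hello
    writeln(greeting);                          // prints hello
}
```

The original string is not affected because the modification is made on the copy.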

Potentially confusing length of strings

We have seen that some Unicode characters are represented by more than one byte. For example, the character 'é' (the Latin letter 'e' combined with an acute accent) is represented by Unicode encodings using at least two bytes. This fact is reflected in the .length property of strings:

    writeln("résumé".length);

Although "résumé" contains six letters, the length of the string is the number of UTF-8 code units that it contains:

8

The element type of string literals like "hello" is char, and each char value represents a UTF-8 code unit. This causes a problem when we try to replace a two-code-unit character with a single-code-unit character by indexing:

    char[] s = "résumé".dup;
    writeln("Before: ", s);
    s[1] = 'e';
    s[5] = 'e';
    writeln("After : ", s);

The two 'e' characters do not replace the two 'é' characters; they replace single code units, resulting in an invalid UTF-8 encoding:

Before: résumé
After : re�sueé    ← INCORRECT

When dealing with letters, symbols, and other Unicode characters directly, as in the code above, the correct type to use is dchar:

    dchar[] s = "résumé"d.dup;
    writeln("Before: ", s);
    s[1] = 'e';
    s[5] = 'e';
    writeln("After : ", s);

The output:

Before: résumé
After : resume

Please note the two differences in the new code:

The type of the string is dchar[].
There is a d at the end of the literal "résumé"d, specifying its type as an array of dchars.

In any case, keep in mind that the use of dchar[] and dstring does not solve all of the problems of manipulating Unicode characters. For instance, if the user inputs the text "résumé", your program cannot assume that the string length will be 6 even for dchar strings. It might be greater if, for example, at least one of the 'é' characters is not encoded as a single code point but as the combination of an 'e' and a combining acute accent. To avoid dealing with this and many other Unicode issues, consider using a Unicode-aware text manipulation library in your programs.
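The difference between code units, code points, and what the user perceives as characters can be observed with walkLength from std.range and byGrapheme from std.uni. The following is only a sketch. The helper names codePointCount and graphemeCount are hypothetical, and the two spellings of the word are written with escape sequences so that the encoding of each 'é' is unambiguous:

```d
import std.stdio;
import std.range : walkLength;
import std.uni : byGrapheme;

// Hypothetical helpers, named for this sketch only
size_t codePointCount(string text) {
    return text.walkLength;
}

size_t graphemeCount(string text) {
    return text.byGrapheme.walkLength;
}

void main() {
    // 'é' as a single code point (U+00E9)
    string precomposed = "r\u00E9sum\u00E9";

    // 'é' as 'e' followed by the combining acute accent (U+0301)
    string combined = "re\u0301sume\u0301";

    writeln(precomposed.length);           // prints 8 (code units)
    writeln(codePointCount(precomposed));  // prints 6 (code points)

    writeln(combined.length);              // prints 10 (code units)
    writeln(codePointCount(combined));     // prints 8 (code points)
    writeln(graphemeCount(combined));      // prints 6
}
```

Only the last count matches the six letters that the user sees for the second spelling.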

String literals

The optional character that is specified after string literals determines the type of the elements of the string:

import std.stdio;

void main() {
     string s = "résumé"c;   // same as "résumé"
    wstring w = "résumé"w;
    dstring d = "résumé"d;

    writeln(s.length);
    writeln(w.length);
    writeln(d.length);
}

The output:

8
6
6

Because all of the Unicode characters of "résumé" can be represented by a single wchar or dchar, the last two lengths are equal to the number of characters.
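That is not the case for every character. As an illustration that is not a part of the program above, the following sketch uses the code point U+1F600, which is outside the range that a single wchar can represent:

```d
import std.stdio;

void main() {
     string s = "\U0001F600"c;
    wstring w = "\U0001F600"w;
    dstring d = "\U0001F600"d;

    writeln(s.length);    // prints 4 (UTF-8 code units)
    writeln(w.length);    // prints 2 (UTF-16 code units)
    writeln(d.length);    // prints 1 (UTF-32 code unit)
}
```

Only the dstring length is equal to the number of code points here.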

String concatenation

Since they are actually arrays, all of the array operations can be applied to strings as well. ~ concatenates two strings and ~= appends to an existing string:

import std.stdio;
import std.string;

void main() {
    write("What is your name? ");
    string name = strip(readln());

    // Concatenate:
    string greeting = "Hello " ~ name;

    // Append:
    greeting ~= "! Welcome...";

    writeln(greeting);
}

The output:

What is your name? Can
Hello Can! Welcome...

Comparing strings

Note: Unicode does not define how the characters are ordered other than their Unicode codes. For that reason, you may get results that don't match your expectations below.

We have used comparison operators <, >=, etc. with integer and floating point values before. The same operators can be used with strings as well, but with a different meaning: strings are ordered lexicographically. This ordering takes each character's Unicode code to be its place in a hypothetical grand Unicode alphabet. The concepts of less and greater are replaced with before and after in this hypothetical alphabet:

import std.stdio;
import std.string;

void main() {
    write("      Enter a string: ");
    string s1 = strip(readln());

    write("Enter another string: ");
    string s2 = strip(readln());

    if (s1 == s2) {
        writeln("They are the same!");

    } else {
        string former;
        string latter;

        if (s1 < s2) {
            former = s1;
            latter = s2;

        } else {
            former = s2;
            latter = s1;
        }

        writeln("'", former, "' comes before '", latter, "'.");
    }
}

Because Unicode adopts the letters of the basic Latin alphabet from the ASCII table, strings that contain only the letters of the ASCII table will be ordered correctly as long as they use the same letter case.
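The following short program illustrates both the expected and the potentially surprising results of this ordering. It is only a sketch with example words of my choosing. The 'é' is written as the escape sequence \u00E9 so that it is certain to be a single code point:

```d
import std.stdio;

void main() {
    // ASCII-only strings of the same case are ordered as expected
    writeln("apple" < "banana");       // prints true

    // 'é' has a greater code than 'z', so "école" comes after "zebra"
    writeln("zebra" < "\u00E9cole");   // prints true

    // A string that is the beginning of another one comes before it
    writeln("car" < "cart");           // prints true
}
```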

Lowercase and uppercase are different

Because each character has a unique code, every letter variant is different from the others. For example, 'A' and 'a' are different letters when Unicode strings are compared directly.

Additionally, as a consequence of their ASCII code values, all of the Latin uppercase letters are sorted before all of the lowercase letters. For example, 'B' comes before 'a'. The icmp() function of the std.string module can be used when strings need to be compared regardless of lowercase and uppercase. You can see the functions of this module in its online documentation.
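As a brief illustration with example words of my choosing, icmp() returns a negative value, zero, or a positive value depending on whether the first string comes before, is equal to, or comes after the second one when letter case is ignored:

```d
import std.stdio;
import std.string;

void main() {
    // Direct comparison: 'B' (66) comes before 'a' (97)
    writeln("apple" < "Banana");            // prints false

    // Case-insensitive comparison
    writeln(icmp("apple", "Banana") < 0);   // prints true
    writeln(icmp("HELLO", "hello") == 0);   // prints true
}
```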

Because strings are arrays (and as a corollary, ranges), the functions of the std.array, std.algorithm, and std.range modules are very useful with strings as well.
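The following sketch shows a few functions from those modules applied to a string. The selection is mine and is meant only as a starting point for browsing the documentation:

```d
import std.stdio;
import std.algorithm : canFind, count, equal;
import std.array : split;
import std.range : retro;

void main() {
    string line = "this line has five words";

    writeln(line.canFind("five"));       // prints true
    writeln(line.count('e'));            // prints 2
    writeln(line.split(" ").length);     // prints 5

    // retro() provides the characters in reverse order
    writeln(equal("abc".retro, "cba"));  // prints true
}
```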

Exercises

Browse the documentation of the std.string, std.array, std.algorithm, and std.range modules.
Write a program that makes use of the ~ operator: The user enters the first name and the last name, all in lowercase letters. Produce the full name that contains the proper capitalization of the first and last names. For example, when the strings are "ebru" and "domates" the program should print "Ebru Domates".
Read a line from the input and print the part between the first and last 'e' letters of the line. For example, when the line is "this line has five words" the program should print "e has five".
You may find the indexOf() and lastIndexOf() functions useful to get the two indexes needed to produce a slice.

As indicated in their documentation, the return types of indexOf() and lastIndexOf() are neither int nor size_t, but ptrdiff_t. You may have to define variables of that exact type:
```
    ptrdiff_t first_e = indexOf(line, 'e');
```
It is possible to define variables with the auto keyword, which we will see in a later chapter:
```
    auto first_e = indexOf(line, 'e');
```
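A signed type like ptrdiff_t is needed because both functions return -1 when the search fails. The following sketch uses a different word than the exercise so as not to give the solution away:

```d
import std.stdio;
import std.string;

void main() {
    string word = "hello";

    ptrdiff_t first_l = indexOf(word, 'l');
    ptrdiff_t last_l = lastIndexOf(word, 'l');
    ptrdiff_t first_z = indexOf(word, 'z');

    writeln(first_l);    // prints 2
    writeln(last_l);     // prints 3
    writeln(first_z);    // prints -1 (not found)
}
```

It may be a good idea to check for -1 before using such values as slice indexes.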

