Programming in D – Tutorial and Reference
Ali Çehreli

Strings

We have used strings in many programs that we have seen so far. Strings are a combination of the two features that we have covered in the last three chapters: characters and arrays. In the simplest definition, strings are nothing but arrays of characters. For example, char[] is a type of string.

This simple definition may be misleading. As we have seen in the Characters chapter, D has three separate character types. Arrays of these character types lead to three separate string types, some of which may have surprising outcomes in some string operations.

readln and chomp, instead of readf

There are surprises even when reading strings from the console.

Being character arrays, strings can contain control characters like '\n' as well. When reading strings from the input, the control character that corresponds to the Enter key that is pressed at the end of console input becomes a part of the string as well. Further, because there is no way to tell readf() how many characters to read, it continues to read until the end of the entire input. For these reasons, readf() does not work as intended when reading strings:

import std.stdio;

void main()
{
    char[] name;

    write("What is your name? ");
    readf(" %s", &name);

    writeln("Hello ", name, "!");
}

The Enter key that the user presses after the name does not terminate the input. readf() continues to wait for more characters to add to the string:

What is your name? Mert
   ← The input is not terminated although Enter has been pressed
   ← (Let's assume that Enter is pressed a second time here)

One way of terminating the standard input stream in a console is pressing Ctrl-D under Unix-based systems and Ctrl-Z under Windows systems. If the user eventually terminates the input that way, we see that the new-line characters have been read as parts of the string as well:

Hello Mert
   ← new-line character after the name
!  ← (one more before the exclamation mark)

The exclamation mark appears after those characters instead of being printed right after the name.

readln() is more suitable when reading strings. Short for "read line", readln() reads until the end of the line. It is used differently because the " %s" format string and the & operator are not needed:

import std.stdio;

void main()
{
    char[] name;

    write("What is your name? ");
    readln(name);

    writeln("Hello ", name, "!");
}

readln() stores the new-line character as well. This is so that the program has a way of determining whether the input consisted of a complete line or whether the end of input has been reached:

What is your name? Mert
Hello Mert
!  ← new-line character before the exclamation mark

Such control characters that are at the ends of strings can be removed by std.string.chomp:

import std.stdio;
import std.string;

void main()
{
    char[] name;

    write("What is your name? ");
    readln(name);
    name = chomp(name);

    writeln("Hello ", name, "!");
}

The chomp() expression above returns a new string that does not contain the trailing control characters. Assigning that return value back to name produces the intended output:

What is your name? Mert
Hello Mert!    ← no new-line character

readln() can be used without a parameter. In that case it returns the line that it has just read. Chaining the result of readln() to chomp() enables a shorter and more readable syntax:

    string name = chomp(readln());

I will start using that form after introducing the string type below.
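
The effect of chomp() can also be observed without typing anything at the console. The following is a minimal sketch that passes string literals to chomp() directly; the literals are made up for this illustration and stand in for lines that readln() would have read:

```d
import std.stdio;
import std.string;

void main()
{
    // A complete line ends with the new-line character
    writeln("Hello ", chomp("Mert\n"), "!");

    // The last line of the input may lack the new-line character;
    // chomp() returns such a string unchanged
    writeln("Hello ", chomp("Mert"), "!");

    // Only one trailing line terminator is removed
    writeln(chomp("Mert\n\n").length);
}
```

The first two lines of the output are identical greetings. The third line is the length of the remaining string, which still contains one new-line character.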

Double quotes, not single quotes

We have seen that single quotes are used to define character literals. String literals are defined with double quotes. 'a' is a character; "a" is a string that contains a single character.

string, wstring, and dstring are immutable

There are three string types that correspond to the three character types: char[], wchar[], and dchar[].

There are three aliases of the immutable versions of those types: string, wstring, and dstring. The characters of the variables that are defined by these aliases cannot be modified. For example, the characters of a wchar[] can be modified but the characters of a wstring cannot be modified. (We will see D's immutability concept in later chapters.)
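
The relationship between the aliases and the array types can be checked by the compiler itself. The following sketch relies on static assert and the is expression, two features that are not covered in this chapter; it is meant only as a demonstration that the three aliases name arrays of immutable characters:

```d
import std.stdio;

void main()
{
    // Each alias is an array of immutable characters
    static assert(is(string  == immutable(char)[]));
    static assert(is(wstring == immutable(wchar)[]));
    static assert(is(dstring == immutable(dchar)[]));

    // The mutable array types are different types
    static assert(!is(string == char[]));

    writeln("All of the checks passed at compile time.");
}
```

If any of the conditions were false, the program would not compile.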

For example, the following code that tries to capitalize the first letter of a string would cause a compilation error:

    string canNotBeMutated = "hello";
    canNotBeMutated[0] = 'H';             // ← compilation ERROR

We may think of defining the variable as a char[] instead of the string alias but that cannot be compiled either:

    char[] a_slice = "hello";  // ← compilation ERROR

This time the compilation error is due to the combination of two factors:

  1. The type of string literals like "hello" is string, not char[], so they are immutable.
  2. The char[] on the left-hand side is a slice, which, if the code compiled, would provide access to all of the characters of the right-hand side.

Since char[] is mutable and string is not, there is a mismatch. The compiler does not allow accessing characters of an immutable array by a mutable slice.

The solution here is to take a copy of the immutable string by the .dup property:

import std.stdio;

void main()
{
    char[] s = "hello".dup;
    s[0] = 'H';
    writeln(s);
}

The program now compiles and prints the modified string:

Hello

Similarly, char[] cannot be used where a string is needed. In such cases, the .idup property can be used for producing an immutable string variable from a mutable char[] variable. For example, if s is a variable of type char[], the following line cannot be compiled:

    string result = s ~ '.';          // ← compilation ERROR

When the type of s is char[], the type of the expression on the right-hand side of the assignment above is char[] as well. .idup is used for producing immutable strings from existing strings:

    string result = (s ~ '.').idup;   // ← now compiles
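
The two properties can be seen working together in a complete program. The following is a minimal sketch; the helper function endSentence() is hypothetical and exists only for this illustration (functions are covered in a later chapter):

```d
import std.stdio;

// Hypothetical helper: appends a period and returns an immutable copy
string endSentence(char[] s)
{
    // s ~ '.' is a char[]; .idup makes an immutable copy of it
    return (s ~ '.').idup;
}

void main()
{
    char[] s = "hello".dup;    // mutable copy of an immutable literal
    s[0] = 'H';

    string result = endSentence(s);
    writeln(result);           // prints "Hello."

    // Because result is a copy, modifying s later does not affect it
    s[0] = 'J';
    writeln(s);                // prints "Jello"
    writeln(result);           // still prints "Hello."
}
```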

Potentially confusing length of strings

We have seen that some Unicode characters are represented by more than one byte. For example, the letter é is represented by two bytes in UTF-8. This fact is reflected in the .length property of strings:

    writeln("résumé".length);

Although "résumé" contains six letters, the length of the string is the number of UTF-8 code units that it contains:

8

The type of the elements of string literals like "hello" is char, and char represents a UTF-8 code unit. This can cause a problem when we try to replace a two-code-unit letter with a single-code-unit letter:

    char[] s = "résumé".dup;
    writeln("Before: ", s);
    s[1] = 'e';
    s[5] = 'e';
    writeln("After : ", s);

The two 'e' characters do not replace the two letters é; they replace single code units, resulting in an incorrect UTF-8 encoding:

Before: résumé
After : re�sueé    ← INCORRECT

When dealing with letters, symbols, and other Unicode characters directly as in the code above, the correct type to use is dchar:

    dchar[] s = "résumé"d.dup;
    writeln("Before: ", s);
    s[1] = 'e';
    s[5] = 'e';
    writeln("After : ", s);

The output:

Before: résumé
After : resume

Please note the two differences in the new code:

  1. The type of the string is dchar[].
  2. There is a d at the end of the literal "résumé"d, specifying its type as an array of dchars.
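
When only the number of letters is needed, converting to dchar[] is not the only option. The following sketch uses walkLength() from the std.range module, which is not introduced in this chapter; it steps through a string one Unicode character at a time and counts them. The sketch assumes that each é is stored as the single character U+00E9, as in the examples above:

```d
import std.stdio;
import std.range : walkLength;

void main()
{
    string s = "résumé";

    writeln(s.length);        // number of UTF-8 code units: 8
    writeln(s.walkLength);    // number of Unicode characters: 6
}
```

Note that even this count is of Unicode code points; a letter that is written as a base letter followed by a separate combining accent would still be counted as two.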

String literals

The optional character that is specified after string literals determines the type of the elements of the string:

import std.stdio;

void main()
{
     string s = "résumé"c;   // same as "résumé"
    wstring w = "résumé"w;
    dstring d = "résumé"d;

    writeln(s.length);
    writeln(w.length);
    writeln(d.length);
}

The output:

8
6
6

Because all of the letters of "résumé" can be represented by a single wchar or dchar, the last two lengths are equal to the number of letters.
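
The lengths differ for all three types when a character cannot be represented by a single wchar. The following sketch uses the character U+1D11E (the musical G clef symbol), which was chosen for this illustration and does not appear elsewhere in this chapter. It is written with the \U escape sequence so that the program does not depend on how the source file is saved:

```d
import std.stdio;

void main()
{
    writeln("\U0001D11E"c.length);    // 4 UTF-8 code units
    writeln("\U0001D11E"w.length);    // 2 UTF-16 code units
    writeln("\U0001D11E"d.length);    // 1 UTF-32 code unit
}
```

Only the dstring length is guaranteed to match the number of Unicode characters.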

String concatenation

Since they are actually arrays, all of the array operations can be applied to strings as well. ~ concatenates two strings and ~= appends to an existing string:

import std.stdio;
import std.string;

void main()
{
    write("What is your name? ");
    string name = chomp(readln());

    // Concatenate:
    string greeting = "Hello " ~ name;

    // Append:
    greeting ~= "! Welcome...";

    writeln(greeting);
}

The output:

What is your name? Can
Hello Can! Welcome...

Comparing strings

Note: Unicode does not define how the characters are ordered other than by their Unicode codes. For that reason, the results below may not match your expectations.

We have used comparison operators <, >=, etc. with integer and floating point values before. The same operators can be used with strings as well, but with a different meaning: strings are ordered lexicographically. This ordering takes each character's Unicode code to be its place in a hypothetical grand Unicode alphabet. The concepts of less and greater are replaced with before and after in this hypothetical alphabet:

import std.stdio;
import std.string;

void main()
{
    write("      Enter a string: ");
    string s1 = chomp(readln());

    write("Enter another string: ");
    string s2 = chomp(readln());

    if (s1 == s2) {
        writeln("They are the same!");

    } else {
        string former;
        string latter;

        if (s1 < s2) {
            former = s1;
            latter = s2;

        } else {
            former = s2;
            latter = s1;
        }

        writeln("'", former, "' comes before '", latter, "'.");
    }
}

Because Unicode adopts the letters of the basic Latin alphabet from the ASCII table, the strings that contain only the letters of the ASCII table appear to be ordered correctly.
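
The difference shows as soon as a letter outside of the ASCII table is involved. The following is a small sketch with words made up for this illustration; it assumes that é is stored as the single character U+00E9, whose code is greater than the codes of all of the ASCII letters:

```d
import std.stdio;

void main()
{
    // Only ASCII letters: ordered as in a dictionary
    writeln("apple" < "banana");    // prints "true"

    // é has a greater code than z, so "éclair" comes after "zebra"
    writeln("zebra" < "éclair");    // prints "true"
}
```

A dictionary would place "éclair" near the words that start with e, not after the ones that start with z.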

Lowercase and uppercase are different

Because each letter has a unique code, every letter is different from every other. For example, 'A' and 'a' are different letters.

Additionally, as a consequence of their ASCII code values, all of the uppercase letters are sorted before all of the lowercase letters. For example, 'B' is before 'a'. The icmp() function of the std.string module can be used when strings need to be compared regardless of lowercase and uppercase. You can see the functions of this module in its online documentation.
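
The following sketch contrasts the comparison operators with icmp(). The city names are made up for this illustration. icmp() returns a negative value, zero, or a positive value depending on whether the first string comes before, is equal to, or comes after the second one when case is ignored:

```d
import std.stdio;
import std.string;

void main()
{
    // Uppercase letters are before lowercase letters
    writeln("B" < "a");                        // prints "true"

    // Strings that differ only in case are not equal...
    writeln("Ankara" == "ankara");             // prints "false"

    // ... but icmp() considers them equal
    writeln(icmp("Ankara", "ankara") == 0);    // prints "true"

    // Ignoring case, "Bursa" is after "ankara"
    writeln(icmp("Bursa", "ankara") > 0);      // prints "true"
}
```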

Because strings are arrays (and as a corollary, ranges), the functions of the std.array, std.algorithm, and std.range modules are very useful with strings as well.
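
As a small taste of those modules, the following sketch applies three of their functions to string literals. The selection of count(), canFind(), and replace() is arbitrary and the strings are made up for this illustration; the modules contain many more functions:

```d
import std.stdio;
import std.algorithm : canFind, count;
import std.array : replace;

void main()
{
    // How many times does 'a' appear?
    writeln(count("banana", 'a'));               // prints "3"

    // Does the string contain "world"?
    writeln(canFind("hello world", "world"));    // prints "true"

    // A new string where every "l" is replaced with "L"
    writeln(replace("hello", "l", "L"));         // prints "heLLo"
}
```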

Exercises

  1. Browse the documentation of the std.string, std.array, std.algorithm, and std.range modules.
  2. Write a program that makes use of the ~ operator: The user enters the first name and the last name, all in lowercase letters. Produce the full name that contains the proper capitalization of the first and last names. For example, when the strings are "ebru" and "domates" the program should print "Ebru Domates".
  3. Read a line from the input and print the part between the first and last 'e' letters of the line. For example, when the line is "this line has five words" the program should print "e has five".

    You may find the indexOf() and lastIndexOf() functions useful to get two indexes to produce a slice.

    As indicated in their documentation, the return type of indexOf() and lastIndexOf() is neither int nor size_t, but ptrdiff_t. You may have to define variables of that exact type:

        ptrdiff_t first_e = indexOf(line, 'e');

    Variables can be defined more concisely with the auto keyword, which we will see in a later chapter:

        auto first_e = indexOf(line, 'e');