Strings
We have used strings in many programs that we have seen so far. Strings are a combination of the two features that we have covered in the last three chapters: characters and arrays. In the simplest definition, strings are nothing but arrays of characters. For example, char[] is a type of string.
This simple definition may be misleading. As we have seen in the Characters chapter, D has three separate character types. Arrays of these character types lead to three separate string types, some of which may have surprising outcomes in some string operations.
readln and strip, instead of readf
There are surprises even when reading strings from the terminal.
Being character arrays, strings can contain control characters like '\n' as well. When reading strings from the input, the control character that corresponds to the Enter key that is pressed at the end of the input becomes a part of the string as well. Further, because there is no way to tell readf() how many characters to read, it continues to read until the end of the entire input. For these reasons, readf() does not work as intended when reading strings:
import std.stdio;

void main() {
    char[] name;

    write("What is your name? ");
    readf(" %s", &name);

    writeln("Hello ", name, "!");
}
The Enter key that the user presses after the name does not terminate the input. readf() continues to wait for more characters to add to the string:
What is your name? Mert
   ← The input is not terminated although Enter has been pressed
   ← (Let's assume that Enter is pressed a second time here)
One way of terminating the standard input stream in a terminal is pressing Ctrl-D under Unix-based systems and Ctrl-Z under Windows systems. If the user eventually terminates the input that way, we see that the new-line characters have been read as parts of the string as well:
Hello Mert
   ← new-line character after the name
!  ← (one more before the exclamation mark)
The exclamation mark appears after those characters instead of being printed right after the name.
readln() is more suitable when reading strings. Short for "read line", readln() reads until the end of the line. It is used differently because the " %s" format string and the & operator are not needed:
import std.stdio;

void main() {
    char[] name;

    write("What is your name? ");
    readln(name);

    writeln("Hello ", name, "!");
}
readln() stores the new-line character as well. This is so that the program has a way of determining whether the input consisted of a complete line or whether the end of input has been reached:
What is your name? Mert
Hello Mert
! ← new-line character before the exclamation mark
Such control characters as well as all whitespace characters at both ends of strings can be removed by std.string.strip:
import std.stdio;
import std.string;

void main() {
    char[] name;

    write("What is your name? ");
    readln(name);
    name = strip(name);

    writeln("Hello ", name, "!");
}
The strip() expression above returns the string without the whitespace at either end, which includes the trailing new-line character. Assigning that return value back to name produces the intended output:
What is your name? Mert
Hello Mert! ← no new-line character
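The following is a small sketch of what strip() removes compared to its relatives in std.string. It assumes a literal in place of a line returned by readln(), so that it can run without any input. stripLeft(), stripRight(), and chomp() are other std.string functions that are not used elsewhere in this chapter:

```d
import std.stdio;
import std.string;

void main() {
    // A literal stands in for a line that readln() might return
    string line = "  Mert \n";

    // strip() removes whitespace from both ends
    assert(strip(line) == "Mert");

    // stripLeft() and stripRight() each work on one end only
    assert(stripLeft(line) == "Mert \n");
    assert(stripRight(line) == "  Mert");

    // chomp() removes only the trailing line terminator
    assert(chomp(line) == "  Mert ");

    writeln("All checks passed.");
}
```

chomp() may be the better choice when the spaces that the user typed are significant and only the new-line character should go away.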
readln() can be used without a parameter. In that case it returns the line that it has just read. Chaining the result of readln() to strip() enables a shorter and more readable syntax:
string name = strip(readln());
I will start using that form after introducing the string type below.
formattedRead for parsing strings
Once a line is read from the input or from any other source, it is possible to parse and convert separate data that it may contain with formattedRead() from the std.format module. Its first parameter is the line that contains the data, and the rest of the parameters are used exactly like readf():
import std.stdio;
import std.string;
import std.format;

void main() {
    write("Please enter your name and age,"
          ~ " separated with a space: ");

    string line = strip(readln());

    string name;
    int age;
    formattedRead(line, " %s %s", name, age);

    writeln("Your name is ", name,
            ", and your age is ", age, '.');
}
Please enter your name and age, separated with a space: Mert 30
Your name is Mert, and your age is 30.
Both readf() and formattedRead() return the number of items that they could parse and convert successfully. That value can be compared against the expected number of data items so that the input can be validated. For example, as the formattedRead() call above expects to read two items (a string as name and an int as age), the following check ensures that it really is the case:
uint items = formattedRead(line, " %s %s", name, age);

if (items != 2) {
    writeln("Error: Unexpected line.");

} else {
    writeln("Your name is ", name,
            ", and your age is ", age, '.');
}
When the input cannot be converted to name and age, the program prints an error:
Please enter your name and age, separated with a space: Mert
Error: Unexpected line.
Double quotes, not single quotes
We have seen that single quotes are used to define character literals. String literals are defined with double quotes. 'a' is a character; "a" is a string that contains a single character.
string, wstring, and dstring are immutable
There are three string types that correspond to the three character types: char[], wchar[], and dchar[].
There are three aliases of the immutable versions of those types: string, wstring, and dstring. The characters of the variables that are defined by these aliases cannot be modified. For example, the characters of a wchar[] can be modified but the characters of a wstring cannot be modified. (We will see D's immutability concept in later chapters.)
For example, the following code that tries to capitalize the first letter of a string would cause a compilation error:
string cannotBeMutated = "hello";
cannotBeMutated[0] = 'H';    // ← compilation ERROR
We may think of defining the variable as a char[] instead of the string alias but that cannot be compiled either:
char[] a_slice = "hello"; // ← compilation ERROR
This time the compilation error is due to the combination of two factors:
- The type of string literals like "hello" is string, not char[], so they are immutable.
- The char[] on the left-hand side is a slice, which, if the code compiled, would provide access to all of the characters of the right-hand side.
Since char[] is mutable and string is not, there is a mismatch. The compiler does not allow accessing characters of an immutable array through a mutable slice.
The solution here is to take a copy of the immutable string by using the .dup property:
import std.stdio;

void main() {
    char[] s = "hello".dup;
    s[0] = 'H';
    writeln(s);
}
The program can now be compiled and will print the modified string:
Hello
Similarly, char[] cannot be used where a string is needed. In such cases, the .idup property can be used to produce an immutable string variable from a mutable char[] variable. For example, if s is a variable of type char[], the following line will fail to compile:
string result = s ~ '.'; // ← compilation ERROR
When the type of s is char[], the type of the expression on the right-hand side of the assignment above is char[] as well. .idup is used for producing immutable strings from existing strings:
string result = (s ~ '.').idup; // ← now compiles
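As a minimal sketch, the two properties can be seen together in one round trip from string to char[] and back. The variable names original and copy are only illustrative:

```d
import std.stdio;

void main() {
    string original = "hello";           // immutable characters

    char[] copy = original.dup;          // mutable copy
    copy[0] = 'H';                       // modifying the copy is fine

    string result = (copy ~ '.').idup;   // immutable once more

    writeln(original);    // prints "hello": the original is unchanged
    writeln(result);      // prints "Hello."
}
```

Because .dup and .idup both produce copies, later changes to copy would not affect result.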
Potentially confusing length of strings
We have seen that some Unicode characters are represented by more than one byte. For example, the character 'é' (the Latin letter 'e' combined with an acute accent) is represented by Unicode encodings using at least two bytes. This fact is reflected in the .length property of strings:
writeln("résumé".length);
Although "résumé" contains six letters, the length of the string is the number of UTF-8 code units that it contains:
8
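When the number of characters is needed rather than the number of code units, the string can be decoded while counting. The sketch below assumes std.range.walkLength and std.utf.count for that purpose; neither is used elsewhere in this chapter. The literal is written with \u escapes so that the result does not depend on how the source file itself is encoded:

```d
import std.stdio;
import std.range : walkLength;
import std.utf : count;

void main() {
    // \u00E9 is 'é' as a single code point
    string s = "r\u00E9sum\u00E9";

    writeln(s.length);        // 8: UTF-8 code units
    writeln(s.walkLength);    // 6: code points, decoded one by one
    writeln(count(s));        // 6: same count through std.utf
}
```

Both alternatives walk the entire string, so unlike .length they take time proportional to the length of the string.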
The type of the elements of string literals like "hello" is char and each char value represents a UTF-8 code unit. A problem that this may cause is when we try to replace a two-code-unit character with a single-code-unit character:
char[] s = "résumé".dup;
writeln("Before: ", s);
s[1] = 'e';
s[5] = 'e';
writeln("After : ", s);
The two 'e' characters do not replace the two 'é' characters; they replace single code units, resulting in an invalid UTF-8 encoding:
Before: résumé
After : re�sueé ← INCORRECT
When dealing with letters, symbols, and other Unicode characters directly, as in the code above, the correct type to use is dchar:
dchar[] s = "résumé"d.dup;
writeln("Before: ", s);
s[1] = 'e';
s[5] = 'e';
writeln("After : ", s);
The output:
Before: résumé
After : resume
Please note the two differences in the new code:
- The type of the string is dchar[].
- There is a d at the end of the literal "résumé"d, specifying its type as an array of dchars.
In any case, keep in mind that the use of dchar[] and dstring does not solve all of the problems of manipulating Unicode characters. For instance, if the user inputs the text "résumé", your program cannot assume that the string length will be 6 even for dchar strings. It might be greater if, for example, at least one of the 'é' characters is not encoded as a single code point but as the combination of an 'e' and a combining acute accent. To avoid dealing with this and many other Unicode issues, consider using a Unicode-aware text manipulation library in your programs.
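The following sketch illustrates that caveat. It assumes byGrapheme() and normalize() from the std.uni module as one possible Unicode-aware tool; the chapter itself does not prescribe a particular library. U+0301 is the combining acute accent:

```d
import std.stdio;
import std.range : walkLength;
import std.uni : byGrapheme, normalize;

void main() {
    // 'é' as a single code point (U+00E9)
    dstring composed = "r\u00E9sum\u00E9"d;

    // 'é' as 'e' followed by the combining acute accent (U+0301)
    dstring decomposed = "re\u0301sume\u0301"d;

    writeln(composed.length);      // 6
    writeln(decomposed.length);    // 8, although it displays the same

    // Counting user-perceived characters instead of code points
    writeln(decomposed.byGrapheme.walkLength);    // 6

    // normalize() uses the NFC form by default, which combines
    // 'e' and the accent back into a single code point
    writeln(normalize(decomposed) == composed);   // true
}
```

Normalizing the input once, right after reading it, is one way of making later comparisons and length checks behave consistently.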
String literals
The optional character that is specified after string literals determines the type of the elements of the string:
import std.stdio;

void main() {
    string s = "résumé"c;    // same as "résumé"
    wstring w = "résumé"w;
    dstring d = "résumé"d;

    writeln(s.length);
    writeln(w.length);
    writeln(d.length);
}
The output:
8
6
6
Because all of the Unicode characters of "résumé" can be represented by a single wchar or dchar, the last two lengths are equal to the number of characters.
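That equality does not hold for every character. As a small sketch, the character U+1F600 (chosen here only as an example of a code point that does not fit in a single wchar) needs more than one code unit in both UTF-8 and UTF-16:

```d
import std.stdio;

void main() {
    // The same single character with each of the three suffixes
    writeln("\U0001F600"c.length);    // 4 UTF-8 code units
    writeln("\U0001F600"w.length);    // 2 UTF-16 code units
    writeln("\U0001F600"d.length);    // 1 UTF-32 code unit
}
```

Only the dstring length matches the number of code points for such characters.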
String concatenation
Since they are actually arrays, all of the array operations can be applied to strings as well. ~ concatenates two strings and ~= appends to an existing string:
import std.stdio;
import std.string;

void main() {
    write("What is your name? ");
    string name = strip(readln());

    // Concatenate:
    string greeting = "Hello " ~ name;

    // Append:
    greeting ~= "! Welcome...";

    writeln(greeting);
}
The output:
What is your name? Can
Hello Can! Welcome...
Comparing strings
Note: Unicode does not define how the characters are ordered other than their Unicode codes. For that reason, you may get results that don't match your expectations below.
We have used comparison operators <, >=, etc. with integer and floating point values before. The same operators can be used with strings as well, but with a different meaning: strings are ordered lexicographically. This ordering takes each character's Unicode code to be its place in a hypothetical grand Unicode alphabet. The concepts of less and greater are replaced with before and after in this hypothetical alphabet:
import std.stdio;
import std.string;

void main() {
    write("      Enter a string: ");
    string s1 = strip(readln());

    write("Enter another string: ");
    string s2 = strip(readln());

    if (s1 == s2) {
        writeln("They are the same!");

    } else {
        string former;
        string latter;

        if (s1 < s2) {
            former = s1;
            latter = s2;

        } else {
            former = s2;
            latter = s1;
        }

        writeln("'", former, "' comes before '", latter, "'.");
    }
}
Because Unicode adopts the letters of the basic Latin alphabet from the ASCII table, the strings that contain only the letters of the ASCII table will always be ordered correctly.
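Outside of the ASCII letters the result may differ from what a dictionary would suggest. In the following sketch, the example words are arbitrary and \u00E9 is the letter 'é':

```d
import std.stdio;

void main() {
    // Both strings consist of lowercase ASCII letters
    writeln("apple" < "banana");          // true

    // The code of 'z' (U+007A) is less than the code of 'é' (U+00E9),
    // so "zoo" is ordered before "été"
    writeln("zoo" < "\u00E9t\u00E9");     // true
}
```

An ordering that follows the rules of a particular written language requires a library that implements those rules; the comparison operators alone cannot provide it.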
Lowercase and uppercase are different
Because each character has a unique code, every letter variant is different from the others. For example, 'A' and 'a' are different letters when Unicode strings are compared directly.
Additionally, as a consequence of their ASCII code values, all of the Latin uppercase letters are sorted before all of the lowercase letters. For example, 'B' comes before 'a'. The icmp() function of the std.string module can be used when strings need to be compared regardless of lowercase and uppercase. You can see the functions of this module at its online documentation.
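As a brief sketch of the difference, icmp() returns a negative value, zero, or a positive value depending on how its two arguments are ordered when letter case is ignored:

```d
import std.stdio;
import std.string : icmp;

void main() {
    // Direct comparison uses the character codes: 'B' is 66, 'a' is 97
    writeln("B" < "a");                      // true

    // Ignoring case, "B" comes after "a"
    writeln(icmp("B", "a") > 0);             // true

    // Strings that differ only in case are equal for icmp()
    writeln("Hello" == "hello");             // false
    writeln(icmp("Hello", "hello") == 0);    // true
}
```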
Because strings are arrays (and as a corollary, ranges), the functions of the std.array, std.algorithm, and std.range modules are very useful with strings as well.
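A few such functions are sketched below. The selection is arbitrary and only meant to give a taste of each module: canFind() and equal() are from std.algorithm, split() and replace() are from std.array, and retro() is from std.range:

```d
import std.stdio;
import std.algorithm : canFind, equal;
import std.array : replace, split;
import std.range : retro;

void main() {
    string s = "hello world";

    // Searching for a substring
    writeln(s.canFind("wor"));                 // true

    // Splitting at a separator
    writeln(s.split(" "));                     // ["hello", "world"]

    // Replacing every occurrence of a substring
    writeln(s.replace("l", "L"));              // heLLo worLd

    // Visiting the characters in reverse order
    writeln(equal(s.retro, "dlrow olleh"));    // true
}
```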
Exercises
- Browse the documentation of the std.string, std.array, std.algorithm, and std.range modules.
- Write a program that makes use of the ~ operator: The user enters the first name and the last name, all in lowercase letters. Produce the full name that contains the proper capitalization of the first and last names. For example, when the strings are "ebru" and "domates" the program should print "Ebru Domates".
- Read a line from the input and print the part between the first and last 'e' letters of the line. For example, when the line is "this line has five words" the program should print "e has five".

  You may find the indexOf() and lastIndexOf() functions useful to get the two indexes needed to produce a slice.

  As it is indicated in their documentation, the return types of indexOf() and lastIndexOf() are not int nor size_t, but ptrdiff_t. You may have to define variables of that exact type:

  ptrdiff_t first_e = indexOf(line, 'e');

  It is possible to define variables with the auto keyword, which we will see in a later chapter:

  auto first_e = indexOf(line, 'e');