Chapter 9: Strings and Regular Expressions

9.1

The C++ standard library offers a string type to save most users from C-style manipulation of arrays of characters through pointers. A string_view type allows us to manipulate sequences of characters however they may be stored.

9.2

string is a Regular type for owning and manipulating a sequence of characters of various character types. It supports operations such as:

concatenate: "string"s + '\n' + "abc";
substr: string s = name.substr(6, 10);
replace: name.replace(0, 5, "nicholas");
toupper: name[0] = toupper(name[0]);
assignment(=), subscripting([]), comparison(==, !=), lexicographical ordering(<=>), iteration, input, streaming.
to C-style string: auto c_style_str = s.c_str();
A string literal is an array of const char that converts implicitly to const char*
"Cat"s is a std::string literal

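The operations listed above can be combined as in the following minimal sketch. The function name string_ops_demo and the sample name are made up for illustration; only the standard string members shown in the list are used.

```cpp
#include <cctype>
#include <string>

using namespace std::string_literals;  // enables the "..."s suffix

// Hypothetical demo function: exercises replace, subscripting, substr,
// c_str, and concatenation on a sample name.
std::string string_ops_demo() {
    std::string name = "niels stroustrup";
    name.replace(0, 5, "nicholas");            // name is now "nicholas stroustrup"
    // The cast to unsigned char avoids undefined behavior in toupper
    // when char is signed and holds a negative value.
    name[0] = static_cast<char>(std::toupper(static_cast<unsigned char>(name[0])));
    std::string family = name.substr(9, 10);   // 10 characters starting at index 9
    const char* c_style = family.c_str();      // zero-terminated, owned by family
    return "Hello, "s + name + " (" + c_style + ")" + '\n';
}
```

Note that the pointer returned by c_str() stays valid only while the string it came from is alive and unmodified.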
9.2.1

These days, string is usually implemented using the short-string optimization. That is, short string values are kept in the string object itself and only longer strings are placed on the free store.

To handle multiple character sets, string is really an alias for a general template basic_string with the character type char:

template<typename Char>
class basic_string {
  // ... string of Char
};
using string = basic_string<char>;

A user can define strings of arbitrary character types. For example,

using Jstring = basic_string<Jchar>; // Japanese characters.

Now we can do all the usual string operations on Jstring.
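Since Jchar is not defined in these notes, the sketch below uses the standard char32_t character type instead to show the same idea. The alias U32string and the function make_word are hypothetical names; the characters are written as universal character names so the example does not depend on the source file encoding.

```cpp
#include <string>
#include <type_traits>

// Hypothetical alias in the spirit of Jstring, built on a standard character type.
using U32string = std::basic_string<char32_t>;

// The standard library already provides this instantiation as std::u32string.
static_assert(std::is_same_v<U32string, std::u32string>);

// Hypothetical helper: the usual string operations work on the new string type.
U32string make_word() {
    U32string s = U"\u65E5\u672C";  // two Japanese characters
    s += U'\u8A9E';                 // append a third one
    return s;
}
```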

9.3

The most common use of a sequence of characters is to pass it to some function to read. This can be achieved by passing a string by value, a reference to a string, or a C-style string. In many systems, there are further alternatives, such as string types not offered by the standard.

The standard library offers string_view; a string_view is basically a (pointer, length) (or { begin(), size() }) pair denoting a sequence of characters.

A string_view gives access to a contiguous sequence of characters. A string_view is like a pointer or a reference in that it does not own the characters it points to. In that, it resembles an STL pair of iterators.

// GOOD
string cat(string_view sv1, string_view sv2){
  string res(sv1.length() + sv2.length(), '\0');  // allocate once; res owns the result
  char* p = copy(sv1.begin(), sv1.end(), res.data());  // copy returns the end of what it wrote
  copy(sv2.begin(), sv2.end(), p);  // place sv2 right after sv1
  return res;
}

// NOT GOOD. Passing a C-style string for s1 and/or s2 creates a temporary string argument, whereas `cat` above creates no temporary string.
string compose(const string& s1, const string& s2);

// different ways of calling `cat`
auto s1 = cat("Hello", "World");
auto s2 = cat("Hello"s, "World");
auto s3 = cat("Hello"s, "World"sv);  // sv is the string_view literal suffix.

What is the difference between sending "World" and "World"sv to cat()? When we pass "World", a string_view must be constructed from a const char*, and that requires counting the characters. For "World"sv the length is computed at compile time.
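A minimal sketch of that difference, using only std::string_view and its literal suffix. The helper name length_via_pointer is made up for illustration. A literal with an embedded '\0' makes the two ways of obtaining the length visible.

```cpp
#include <cstddef>
#include <string_view>

using namespace std::string_view_literals;  // enables the "..."sv suffix

// With the sv suffix, the length is part of the literal and known at compile time.
static_assert("World"sv.size() == 5);

// Hypothetical helper: constructing a string_view from a const char* has to
// count characters up to the first '\0'.
std::size_t length_via_pointer(const char* s) {
    return std::string_view{s}.size();
}
```

For an ordinary literal such as "World" both routes give the same length. For "a\0b" the sv literal keeps all three characters, while the pointer route stops counting at the embedded '\0'.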

When returning a string_view, remember that it is very much like a pointer. It is very easy to point to an object that is destroyed before we use it, such as a local variable inside a function.
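The sketch below contrasts the two situations. The risky version is left commented out on purpose, and first_word is a hypothetical helper that returns a view into characters owned by the caller.

```cpp
#include <string>
#include <string_view>

// Risky (commented out): the returned view would refer to a local string
// that is destroyed when the function returns.
// std::string_view bad() { std::string s = "temporary"; return s; }

// Safer: the result refers to the caller's characters, so it stays valid
// for as long as the caller keeps those characters alive.
std::string_view first_word(std::string_view text) {
    // find returns npos if there is no space; substr then yields the whole text.
    return text.substr(0, text.find(' '));
}
```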

One significant restriction of string_view is that it is a read-only view of its characters. For example, you cannot use a string_view to pass characters to a function that converts its argument to lowercase in place. For that, consider gsl::span or gsl::string_span.

The behavior of out-of-range access to a string_view through [] is undefined (UB!). If you want guaranteed range checking, use at(), which throws out_of_range for an out-of-range index.
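A small sketch of checked access with at(). The helper try_get is a hypothetical name; it turns the exception into a boolean result.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string_view>

// Hypothetical helper: checked element access that reports failure
// through its return value instead of letting the exception escape.
bool try_get(std::string_view sv, std::size_t i, char& out) {
    try {
        out = sv.at(i);  // throws std::out_of_range if i >= sv.size()
        return true;
    } catch (const std::out_of_range&) {
        return false;
    }
}
```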

9.4

In <regex>, the standard library provides support for regular expressions in the form of the std::regex class and its supporting functions.

regex pat{R"(\w{2}\s*\d{5}(-\d{4})?)"};  // XXddddd-dddd and variants. US postal code

If you are not familiar with regular expressions, this may be a good time to learn about them (Maddock, 2009; Friedl, 1997; Stroustrup, 2009).

R"()" is a raw string literal. Raw strings are particularly suitable for regular expressions because they tend to contain a lot of backslashes.

In <regex>, the standard library provides:

regex_match(): match a regular expression against a string (of known size)
regex_search(): search for a string that matches a regular expression in an (arbitrarily long) stream of data
regex_replace(): search for strings that match a regular expression in a stream of data and replace them
regex_iterator: iterate over matches and submatches
regex_token_iterator: iterate over non-matches
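The three functions can be compared on the postal code pattern from above. This is a sketch; the helper names is_postal_code, contains_postal_code and redact_postal_codes are hypothetical.

```cpp
#include <regex>
#include <string>

// Pattern from above: two word characters, optional whitespace, five digits,
// and an optional "-dddd" extension.
const std::regex pat{R"(\w{2}\s*\d{5}(-\d{4})?)"};

// regex_match: the whole string must match the pattern.
bool is_postal_code(const std::string& s) { return std::regex_match(s, pat); }

// regex_search: some part of the string must match the pattern.
bool contains_postal_code(const std::string& s) { return std::regex_search(s, pat); }

// regex_replace: every match is replaced by the given text.
std::string redact_postal_codes(const std::string& s) {
    return std::regex_replace(s, pat, "<code>");
}
```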

9.4.1

regex_search(line, matches, pattern) searches line for anything that matches the regular expression in pattern and, if it finds a match, stores it in matches. If no match is found, it returns false. The matches variable is of type std::smatch. The "s" stands for "sub" or "string", and an smatch is a vector of submatches of type string. The first element, here matches[0], is the complete match; the others are submatches.
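A minimal sketch of reading the complete match and a submatch out of an smatch. The struct PostalMatch and the function find_postal_code are hypothetical names introduced for this example.

```cpp
#include <regex>
#include <string>

// Hypothetical result type for the example.
struct PostalMatch {
    bool found;
    std::string whole;      // matches[0]: the complete match
    std::string extension;  // matches[1]: the optional "-dddd" part
};

PostalMatch find_postal_code(const std::string& line) {
    static const std::regex pat{R"(\w{2}\s*\d{5}(-\d{4})?)"};
    std::smatch matches;
    if (!std::regex_search(line, matches, pat))
        return {false, "", ""};
    // str() on a submatch that did not participate yields an empty string.
    return {true, matches[0].str(), matches[1].str()};
}
```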

9.4.2

The regex library can recognize several variants of the notation for regular expressions. The default notation is a variant of the ECMA standard used for ECMAScript.

Regular Expression Special Characters

character	meaning
`.`	Any single character
`[`	Begin character class
`]`	End character class
`{`	Begin count
`}`	End count
`(`	Begin grouping
`)`	End grouping
`\`	Next character has a special meaning
`*`	as a suffix: Zero or more
`+`	as a suffix: One or more
`?`	as a suffix: Optional (zero or one)
`|`	Alternative ("or")
`^`	Start of line; negation
`$`	End of line.
`{n}`	Exact `n` times. e.g. `A{3}`
`{n,}`	At least `n` times. e.g. `A{3,}`
`{n,m}`	Between `n` and `m` times, inclusive. e.g. `A{3,5}`

A suffix ? after any of the repetition notations makes the pattern matcher "lazy" or "non-greedy". That is, when looking for a pattern, it will look for the shortest match rather than the longest. By default, the pattern matcher always looks for the longest match; this is known as the Max Munch rule. So, for abababab, the pattern (ab)+ will match all of abababab, but (ab)+? will match only the first ab.
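The greedy and lazy behavior described above can be checked with a small sketch. The helper first_match is a hypothetical name; it returns the text of the first match found by regex_search.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: the text of the first match of `pattern` in `text`,
// or an empty string if there is no match.
std::string first_match(const std::string& text, const std::string& pattern) {
    const std::regex re{pattern};
    std::smatch m;
    return std::regex_search(text, m, re) ? m[0].str() : std::string{};
}
```

Under the default ECMAScript grammar, first_match("abababab", "(ab)+") yields the whole input, while first_match("abababab", "(ab)+?") yields just ab. The count notations behave the same way: A{3,5} takes as many as five A characters when they are available.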

Regular Expression Character Classes

token	meaning
`alnum`	Any alphanumeric character
`alpha`	Any alphabetic character
`blank`	Any whitespace character that is not a line separator
`cntrl`	Any control character
`d`	Any decimal digit
`digit`	Any decimal digit
`graph`	Any graphical character
`lower`	Any lower case character
`print`	Any printable character
`punct`	Any punctuation character
`s`	Any whitespace character
`space`	Any whitespace character
`upper`	Any upper case character
`w`	Any word character (a-z, A-Z, 0-9, plus the underscore)
`xdigit`	Any hexadecimal digit character

In a regular expression, a character class name must be bracketed by [: :], e.g. [:digit:]. Furthermore, it must be used within a [] pair defining a character class, so ultimately it looks like [[:digit:]]. This is quite a mouthful, so there are short-hand notations:

short-hand	original	meaning
`\d`	`[[:digit:]]`	A decimal digit
`\s`	`[[:space:]]`	A space (space, tab, etc)
`\w`	`[_[:alnum:]]`	A letter (a-z, A-Z) or digit or underscore(`_`)
`\D`	`[^[:digit:]]`	Not `\d`
`\S`	`[^[:space:]]`	Not `\s`
`\W`	`[^_[:alnum:]]`	Not `\w`
`\l`	`[[:lower:]]`	A lower case character
`\u`	`[[:upper:]]`	An upper case character
`\L`	`[^[:lower:]]`	Not `\l`
`\U`	`[^[:upper:]]`	Not `\u`

For full portability, use the character class names rather than short-hands.
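A sketch comparing a named class with its short-hand. The names digits_named, digits_short and is_word are hypothetical.

```cpp
#include <regex>
#include <string>

// Two spellings of "one or more decimal digits".
const std::regex digits_named{"[[:digit:]]+"};
const std::regex digits_short{R"(\d+)"};

// Hypothetical helper: true if all of `s` consists of word characters,
// spelled with the class name rather than the \w short-hand.
bool is_word(const std::string& s) {
    static const std::regex word{"[_[:alnum:]]+"};
    return std::regex_match(s, word);
}
```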

A group (a subpattern) potentially to be represented by a sub_match is delimited by parentheses. If you need parentheses that should not define a subpattern, use (?: rather than plain (. Compare

(\s|:|,)*(\d*)      # two subpatterns, both stored
(?:\s|:|,)*(\d*)    # only one subpattern, the second part.

The second pattern saves the regular expression engine from having to store the characters matched by the first group.
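The difference can be observed through std::regex::mark_count(), which reports the number of capturing groups. In this sketch the names capturing, non_capturing and number_after_separators are hypothetical.

```cpp
#include <regex>
#include <string>

const std::regex capturing{R"((\s|:|,)*(\d*))"};        // two capturing groups
const std::regex non_capturing{R"((?:\s|:|,)*(\d*))"};  // one capturing group

// Hypothetical helper: with the non-capturing version, the digits are
// submatch 1 because the separator group is not stored.
std::string number_after_separators(const std::string& s) {
    std::smatch m;
    return std::regex_match(s, m, non_capturing) ? m[1].str() : std::string{};
}
```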

Here is another example, for parsing XML:

<(.*?)>(.*?)</\1>   # \1 means "the same as group 1". lazy match <tag>...</tag> pattern.
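A sketch of that pattern in use. The helper first_element_content is a hypothetical name, and the approach only handles simple, non-nested tags on a single line; it is not a general XML parser.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: the text between the first matching <tag>...</tag>
// pair in `text`, or an empty string if no such pair is found.
std::string first_element_content(const std::string& text) {
    static const std::regex pat{R"(<(.*?)>(.*?)</\1>)"};
    std::smatch m;
    // m[1] is the tag name, m[2] is the text between the tags.
    return std::regex_search(text, m, pat) ? m[2].str() : std::string{};
}
```

Because of the back-reference \1, a closing tag with a different name does not complete the match.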

9.4.3

We can use an sregex_iterator (a regex_iterator<string::const_iterator>) to output all whitespace-separated words in a string. A regex_iterator needs bidirectional iterators into the text it searches, so we cannot directly iterate over an istream (which offers only input iterators). Also, we cannot write through a regex_iterator, and the default regex_iterator{} is the only possible end-of-sequence.
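A minimal sketch of that loop. Instead of printing, it collects the words in a vector; the function name split_words and the pattern \S+ (a run of non-whitespace characters) are choices made for this example.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collect all whitespace-separated words in `input`.
std::vector<std::string> split_words(const std::string& input) {
    static const std::regex word{R"(\S+)"};
    std::vector<std::string> out;
    // A default-constructed sregex_iterator marks the end of the sequence.
    for (std::sregex_iterator p(input.begin(), input.end(), word);
         p != std::sregex_iterator{}; ++p)
        out.push_back(p->str());  // *p is an smatch; str() is the complete match
    return out;
}
```

The input string must outlive the loop, because each smatch refers to positions inside it.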