Having fun with string literal suffixes in C++

The C++14 standard introduced the standard string suffix s. E.g., the expression "abcde\0\0fgh"s is an std::string with 10 characters, embedded null characters included. Since C++11, you also have user-defined string suffixes: you can create your own.
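
To make the character count concrete, here is a small sketch comparing the s suffix with plain construction from a const char*. The two function names are made up for this illustration.

```cpp
#include <cstddef>
#include <string>

// Size of the std::string produced by the standard s suffix (C++14).
inline std::size_t size_with_suffix() {
  using namespace std::string_literals; // brings the s suffix into scope
  return "abcde\0\0fgh"s.size();        // the literal's full length is kept
}

// Size of a std::string built from the same literal through a const char*.
inline std::size_t size_without_suffix() {
  return std::string("abcde\0\0fgh").size(); // stops at the first '\0'
}
```

The first function returns 10 and the second returns 5, because the const char* constructor reads only up to the first null character.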

C++11 also added regular expressions to the C++ standard library (the <regex> header). I wanted to have fun and see whether we could combine these features.

Regular expressions are useful to check whether a given string matches a pattern. For example, the expression \d+ checks that the string is made of one or more digits. Unfortunately, the backslash character needs to be escaped in a C++ string literal, so the pattern \d+ must be written as "\\d+". Alternatively, you may use a raw string literal, which starts with R"( and ends with )", so you can write R"(\d+)". For complicated expressions, a raw string might be better.
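
As a quick check that both spellings denote the same pattern, here is a minimal sketch. The helper names same_pattern and all_digits are assumptions of this example, not part of any library.

```cpp
#include <regex>
#include <string>

// Both spellings produce the same three characters: backslash, d, plus.
inline bool same_pattern() {
  return std::string("\\d+") == std::string(R"(\d+)");
}

// Returns true when the whole input consists of one or more digits.
inline bool all_digits(const std::string &input) {
  static const std::regex digits(R"(\d+)"); // compiled once, then reused
  return std::regex_match(input, digits);
}
```

Note that std::regex_match requires the entire string to match, so all_digits("a23") is false even though part of the input is made of digits.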

A user-defined string literal is a way to specialize a string literal according to your own needs. It is effectively a convenient way to design your own “string types”. You can code it up as:

myclass operator"" _mysuffix(const char *str, size_t len) {
  return myclass(str, len);
}

And once it is defined, instead of writing myclass("mystring", 8), you can write "mystring"_mysuffix.
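
Here is a self-contained sketch of that idea. The type myclass is a hypothetical stand-in that simply stores the characters it receives, and demo_length is a made-up helper. The operator is spelled without a space between "" and the suffix; current compilers accept both spellings.

```cpp
#include <cstddef>
#include <string>

// Hypothetical type standing in for "myclass" in the text.
struct myclass {
  myclass(const char *str, std::size_t len) : value(str, len) {}
  std::string value; // copy of the literal's characters
};

// The compiler passes a pointer to the literal's characters and their
// count, which excludes the terminating null character.
inline myclass operator""_mysuffix(const char *str, std::size_t len) {
  return myclass(str, len);
}

// Length seen by the literal operator for "mystring".
inline std::size_t demo_length() {
  return "mystring"_mysuffix.value.size();
}
```

Because the length is passed explicitly, a literal with an embedded null character such as "ab\0cd"_mysuffix keeps all 5 characters.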

In any case, we would like to have a syntax such as this:

bool is_digit = "\\d+"_re("123");

I can start with a user-defined string suffix:

convenience_matcher operator "" _re(const char *str, size_t) {
  return convenience_matcher(str);
}

I want my convenience_matcher to construct a regular expression instance, and to call the matching function whenever an argument is passed in parentheses. The following class might work:

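#include <regex>
#include <string>
struct convenience_matcher {
  // Compile the pattern once, when the literal is evaluated.
  convenience_matcher(const char *str) : re(str) {}
  // True only if the entire string matches the pattern.
  bool match(const std::string &s) const {
    return std::regex_match(s, re);
  }
  // Lets us write "pattern"_re("input").
  bool operator()(const std::string &s) const { return match(s); }
  std::regex re;
};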

And that is all. The following expressions will then return a Boolean value indicating whether we have the required pattern:

 "\\d+"_re("123") // true
 "\\d+"_re("a23") // false
 R"(\d+)"_re("123") // true
 R"(\d+)"_re("a23") // false
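
For readers who want everything in one place, here is a compact sketch that puts the pieces together. It is one possible arrangement and makes a couple of assumptions of its own: it forwards the literal's length to the std::regex constructor, and it adds a hypothetical helper named is_number to show the raw-string form in use.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Wraps a compiled regular expression and matches whole strings against it.
struct convenience_matcher {
  convenience_matcher(const char *str, std::size_t len) : re(str, len) {}
  // True only if the entire string matches the pattern.
  bool operator()(const std::string &s) const {
    return std::regex_match(s, re);
  }
  std::regex re;
};

// User-defined suffix: "pattern"_re yields a matcher for that pattern.
inline convenience_matcher operator""_re(const char *str, std::size_t len) {
  return convenience_matcher(str, len);
}

// Hypothetical helper: true when s is made of one or more digits.
inline bool is_number(const std::string &s) {
  return R"(\d+)"_re(s); // the regex is rebuilt on every call
}
```

Each evaluation of a _re literal constructs a new std::regex, which is comparatively expensive. If the same pattern is used repeatedly, it is probably better to store the matcher in a variable and reuse it.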

I have posted a complete example. It is just for illustration and I do not recommend using this code for anything serious. I am sure that you can do better!

Daniel Lemire, "Having fun with string literal suffixes in C++," in Daniel Lemire's blog, July 5, 2023.

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

7 thoughts on “Having fun with string literal suffixes in C++”

  1. The very first sentence contains a typo ‘used-defined’ which should be changed to ‘user-defined’. The second paragraph contains another typo ‘so you can write R”(\d+)” ‘ which should be changed to ‘so you can write R”(\d+)” ‘

  2. Thanks Daniel! Do you know of any practical applications for operator""? Operator overloading has become frowned upon for its obscurity, but this one has the potential to carry meaningful names.

    1. Operator overloading has become frowned upon

      I think that’s different from standard operator overloading. You have to specify a suffix, so you can’t use it by accident.

      Do you know of any practical applications for operator""?

      We use it in the simdjson library.

  3. Off topic, but what’s your opinion of adding operator overloading to C, in a way that the user would name the function that implements that operator, and therefore doesn’t require name mangling?

    It’s an idea I’ve been kicking around.

      1. In regular, old-fashioned C.

        I’m thinking something like: _Operator = UTF8String_Init;

        Where UTF8String_Init is a previously declared function.

        Implementation details could still be hidden, the _Operator declaration could be in a header too, and no name mangling necessary since the function has already been named by the programmer.

        No need for a class to contain it either, since the types could be deduced from the parameters of the named function, e.g.: UTF8String UTF8String_Init(char8_t *Characters);

        And for strings, I don’t see why there couldn’t be multiple variants for the same operator.

        Like:

        UTF8String UTF8String_InitFromChars(char8_t *Chars);

        UTF8String UTF8String_InitFromChar(char8_t Char);

        _Overload = UTF8String_InitFromChar;

        _Overload = UTF8String_InitFromChars;

        Basically, soft function overloading, but with better names.
