Generating arrays at compile-time in C++ with lambdas

Suppose that you want to check whether a character in C++ belongs to a fixed set, such as ‘\0’, ‘\x09’, ‘\x0a’,’\x0d’, ‘ ‘, ‘#’, ‘/’, ‘:’, ‘<‘, ‘>’, ‘?’, ‘@’, ‘[‘, ‘\\’, ‘]’, ‘^’, ‘|’. A simple way is to generate a 256-byte array of Boolean values and lookup the value. This approach is sometimes called memoization (and not memorization!!!). You might do it as follows:

constexpr static bool is_forbidden_host_code_point_table[] = {
  1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
bool is_forbidden_host_code_point(char c) {
  return is_forbidden_host_code_point_table[uint8_t(c)];
}

It is reasonably efficient in practice. Some people might object to how the table is generated. Can you have the C++ compiler generate the array at compile-time from a function?

Using C++17, you might do it with an std::array as follows:

constexpr static std::array<uint8_t, 256> is_forbidden_array = []() {
  std::array<uint8_t, 256> result{};
  for (uint8_t c : {'\0', '\x09', '\x0a','\x0d', ' ', '#', '/', ':',
    '<', '>', '?', '@', '[', '\\', ']', '^', '|'}) {
   result[c] = true;
  }
  return result;
}();

bool is_forbidden_host_code_point_array(char c) {
  return is_forbidden_array[uint8_t(c)];
}

These two approaches should be equivalent in practice. This might compile down to a single lookup instruction, or the equivalent.

You may want to compare it against the default (without memoization) which might be…

bool is_forbidden_host_code_point_default(char c) noexcept {
  return c == '\0' || c == '\x09' || c == '\x0a'
   || c == '\x0d' || c == ' ' || c == '#'
   || c == '/'|| c == ':' || c == '<'|| c == '>'
   || c == '?' || c == '@' || c == '['|| c == '\\'
  || c == ']'|| c == '^' || c == '|';
}

A compiler like GCC might compile this routine to a bitset approach such as …

 cmp dil, 62
ja .L2
movabs rax, 6052978675329017345
bt rax, rdi
setc al
ret
.L2:
sub edi, 63
cmp dil, 61
ja .L4
movabs rax, 2305843013240225795
bt rax, rdi
setc al
ret

The default approach certainly generates more instructions, and might be less efficient in some cases.

Further reading. The evolution of constexpr: compile-time lookup tables in C++

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

2 thoughts on “Generating arrays at compile-time in C++ with lambdas”

  1. This is really cool!
    It doesn’t sound like memoization to me, though. I think this is just an ordinary lookup vocabulary an immutable table (albeit an optimized one in the case of a bit-test instruction). Memoization is about caching values as they are discovered and requires the use of a mutable table to cache the newly discovered values over time.

Leave a Reply

Your email address will not be published.

To create code blocks or other preformatted text, indent by four spaces:

    This will be displayed in a monospaced font. The first four 
    spaces will be stripped off, but all other whitespace
    will be preserved.
    
    Markdown is turned off in code blocks:
     [This is not a link](http://example.com)

To create not a block, but an inline code span, use backticks:

Here is some inline `code`.

For more help see http://daringfireball.net/projects/markdown/syntax

You may subscribe to this blog by email.