Quickly checking that a string belongs to a small set

Suppose that I give you a set of reference strings (“ftp”, “file”, “http”, “https”, “ws”, “wss”). Given a new string, you want to quickly tell whether it is part of this set.

You might use a regular expression but it is unlikely to be fast in general:

#include <regex>

const std::regex txt_regex("https?|ftp|file|wss?");
// later...
bool match = std::regex_match(v.begin(), v.end(), txt_regex);

A sensible solution might be to create a set and then to ask whether the string is in the set. In C++, the default set type is std::unordered_set, so your code might look as follows:

#include <string_view>
#include <unordered_set>

static const std::unordered_set<std::string_view> special_set = {
    "ftp", "file", "http", "https", "ws", "wss"};

bool hash_is_special(std::string_view input) {
  return special_set.find(input) != special_set.end();
}

You can also use a tool like gperf which constructs a perfect-hash function for you.
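For reference, a minimal gperf input file for this set might look as follows (a sketch based on the gperf manual; the file name `special.gperf` is my own choice, and I have not benchmarked this exact invocation):

```
%language=C++
%define class-name SpecialChecker
%%
ftp
file
http
https
ws
wss
%%
```

Running `gperf special.gperf > special.h` then emits a perfect-hash function and an `in_word_set` lookup routine that you call with the string and its length.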

You might also be more direct about it, and just do several comparisons:

bool direct_is_special(std::string_view input) {
  return (input == "https") || (input == "http") || (input == "ftp") ||
         (input == "file") || (input == "ws") || (input == "wss");
}

I call this a branchy version because the ‘||’ operator tells the compiler to evaluate the comparisons one by one, exiting as soon as one of them is true. If you replace the ‘||’ operator with the ‘|’ operator, the result is ‘branchless’ in the sense that you entice the compiler to evaluate all of the comparisons.
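For illustration, the branchless variant differs only in the operator (the function name is my own; the benchmark below labels it “direct (branchless)”):

```cpp
#include <string_view>

// Same comparisons as direct_is_special, but with bitwise '|':
// all six comparisons are evaluated, with no early exit.
bool branchless_is_special(std::string_view input) {
  return (input == "https") | (input == "http") | (input == "ftp") |
         (input == "file") | (input == "ws") | (input == "wss");
}
```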

If you look at how the code gets compiled, you may notice that the compiler is forced to do comparisons and jumps, because it is not allowed to read in the provided string beyond its reported size.

You might be able to do slightly better if you can tell your compiler that the string you receive is ‘padded’ so that you can read eight bytes safely from it. I could not find a very elegant way to do it, but the following code works:

#include <cstdint>
#include <cstring>  // std::memcpy
#include <string_view>

static inline uint64_t string_to_uint64(std::string_view view) {
  uint64_t val;
  std::memcpy(&val, view.data(), sizeof(uint64_t));
  return val;
}

static inline uint32_t string_to_uint32(const char *data) {
  uint32_t val;
  std::memcpy(&val, data, sizeof(uint32_t));
  return val;
}


// Note: this requires that eight bytes can safely be read from input.data().
bool fast_is_special(std::string_view input) {
  uint64_t inputu = string_to_uint64(input);
  // Mask down to the candidate's length, compare, then confirm the length.
  if ((inputu & 0xffffffffff) == string_to_uint64("https\0\0\0")) {
    return input.size() == 5;
  }
  if ((inputu & 0xffffffff) == string_to_uint64("http\0\0\0\0")) {
    return input.size() == 4;
  }
  if (uint32_t(inputu) == string_to_uint32("file")) {
    return input.size() == 4;
  }
  if ((inputu & 0xffffff) == string_to_uint32("ftp\0")) {
    return input.size() == 3;
  }
  if ((inputu & 0xffffff) == string_to_uint32("wss\0")) {
    return input.size() == 3;
  }
  if ((inputu & 0xffff) == string_to_uint32("ws\0\0")) {
    return input.size() == 2;
  }
  return false;
}
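To call such a routine safely, the caller must guarantee the padding. One way to do it (my own sketch, not code from my benchmark) is to stage the candidate in a zero-filled eight-byte buffer before the lookup:

```cpp
#include <cstring>
#include <string_view>

// fast_is_special requires 8 readable bytes; Padded8 copies the candidate
// into a zero-filled buffer so that an 8-byte load cannot overrun.
struct Padded8 {
  char buf[8] = {0, 0, 0, 0, 0, 0, 0, 0};
  std::string_view view;
  explicit Padded8(std::string_view s) {
    size_t n = s.size() < 8 ? s.size() : 8;
    std::memcpy(buf, s.data(), n);
    view = std::string_view(buf, n);
  }
};
```

Then `fast_is_special(Padded8(candidate).view)` is safe for any candidate, at the cost of one copy.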

Though I did not do it, you can extend the comparison so that it is case-insensitive (simply AND the input with the bytes 0xdf instead of the bytes 0xff).
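As a sketch of that idea (my illustration, not code from the benchmark): clearing bit 5 of an ASCII letter maps lowercase to uppercase, so masking the input with 0xdf bytes and comparing against an all-uppercase constant yields a case-insensitive match. This is only valid when every target byte is a letter:

```cpp
#include <cstdint>
#include <cstring>
#include <string_view>

// Case-insensitive test for "http": 'h' (0x68) & 0xdf == 'H' (0x48).
bool is_http_nocase(std::string_view input) {
  if (input.size() != 4) { return false; }
  uint32_t v;
  std::memcpy(&v, input.data(), sizeof(v));
  uint32_t target;
  std::memcpy(&target, "HTTP", sizeof(target));
  return (v & 0xdfdfdfdf) == target;
}
```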

You can use a faster approach if you can assume that the input string has been padded with zeros:

bool faster_is_special(std::string_view input) {
  uint64_t inputu = string_to_uint64(input);
  uint64_t https = string_to_uint64("https\0\0\0");
  uint64_t http = string_to_uint64("http\0\0\0\0");
  uint64_t file = string_to_uint64("file\0\0\0\0");
  uint64_t ftp = string_to_uint64("ftp\0\0\0\0\0");
  uint64_t wss = string_to_uint64("wss\0\0\0\0\0");
  uint64_t ws = string_to_uint64("ws\0\0\0\0\0\0");
  if ((inputu == https) | (inputu == http)) {
    return true;
  }
  return (inputu == file) | (inputu == ftp)
         | (inputu == wss) | (inputu == ws);
}

Observe how I have selected what I believe are the two most common cases (among URL protocols).

Finally, we can use a hash function to solve the problem with a single comparison:

static const char shiftxor_table[128] = {
    'w', 's', 0,   0,   0,   0,   0,   0,   // slot 0: "ws"
    0,   0,   0,   0,   0,   0,   0,   0,
    0,   0,   0,   0,   0,   0,   0,   0,
    0,   0,   0,   0,   0,   0,   0,   0,
    0,   0,   0,   0,   0,   0,   0,   0,
    0,   0,   0,   0,   0,   0,   0,   0,
    'f', 'i', 'l', 'e', 0,   0,   0,   0,   // slot 48: "file"
    0,   0,   0,   0,   0,   0,   0,   0,
    'f', 't', 'p', 0,   0,   0,   0,   0,   // slot 64: "ftp"
    'w', 's', 's', 0,   0,   0,   0,   0,   // slot 72: "wss"
    'h', 't', 't', 'p', 0,   0,   0,   0,   // slot 80: "http"
    0,   0,   0,   0,   0,   0,   0,   0,
    'h', 't', 't', 'p', 's', 0,   0,   0,   // slot 96: "https"
    0,   0,   0,   0,   0,   0,   0,   0,
    0,   0,   0,   0,   0,   0,   0,   0,
    0,   0,   0,   0,   0,   0,   0,   0};

bool shiftxor_is_special(std::string_view input) {
  uint64_t inputu = string_to_uint64(input);
  uint64_t index = ((inputu >> 28) ^ (inputu >> 14)) & 0x78;
  return string_to_uint64(std::string_view(
             reinterpret_cast<const char *>(shiftxor_table) + index, 8)) ==
         inputu;
}
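To see why this works, you can check (with a throwaway harness like the one below, my own code) that the shift-xor expression is a perfect hash over the six padded strings: each one lands on its own eight-byte-aligned slot, which is exactly where the table above stores it.

```cpp
#include <cstdint>
#include <cstring>

// Computes the table offset used by shiftxor_is_special for a string
// that has been padded with zeros to at least 8 readable bytes.
static uint64_t shiftxor_slot(const char *padded) {
  uint64_t inputu;
  std::memcpy(&inputu, padded, sizeof(inputu));
  return ((inputu >> 28) ^ (inputu >> 14)) & 0x78;
}
```

On a little-endian machine the slots come out as ws → 0, file → 48, ftp → 64, wss → 72, http → 80 and https → 96: six distinct multiples of 8, matching the non-zero rows of shiftxor_table.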

If you do not want to assume that the strings are padded (i.e., no cheating), then you can do it the “gperf way” (the following is my own code, but it is modeled on gperf):

std::string_view table_hashnocheat_is_special[] = {"http", "", "https", 
  "ws", "ftp", "wss", "file", ""};

bool hashnocheat_is_special(std::string_view input) {
  if(input.empty()) { return false; }
  int hash_value = (2*input.size() + (unsigned)(input[0])) & 7;
  const std::string_view target = table_hashnocheat_is_special[hash_value];
  return (!target.empty()) && (target[0] == input[0]) &&
         (target.substr(1) == input.substr(1));
}
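A quick way to convince yourself (again, a harness of mine) that the bucket formula is collision-free on the six keys:

```cpp
#include <string_view>

// The gperf-style bucket: twice the length plus the first byte, modulo 8.
int hashnocheat_bucket(std::string_view input) {
  return (int)((2 * input.size() + (unsigned char)input[0]) & 7);
}
```

The six keys occupy buckets 0, 2, 3, 4, 5 and 6; buckets 1 and 7 of the table hold empty strings that can never match.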

I am sure that there are faster and more clever alternatives!

In any case, how fast are my alternatives? Using GCC 11 on an Intel Ice Lake server, I get the following results:

regex                          360 ns/string
std::unordered_set             19 ns/string
gperf                          7.1 ns/string
direct                         16 ns/string
direct (branchy)               13 ns/string
direct (branchless)            18 ns/string
hashnocheat_is_special         7.0 ns/string
fast                           2.6 ns/string
faster                         1.9 ns/string
hashing (shiftxor_is_special)  1.1 ns/string

On an Apple M2 with LLVM 12, I get similar (but better) results:

regex                          450 ns/string
std::unordered_set             15 ns/string
gperf                          8.0 ns/string
direct (branchy)               7.5 ns/string
direct (branchless)            4.8 ns/string
hashnocheat_is_special         7.2 ns/string
fast                           1.1 ns/string
faster                         0.8 ns/string
hashing (shiftxor_is_special)  0.4 ns/string

Care is needed when optimizing such small functions: whether and how the function gets inlined can be critical to the good performance. The results will depend also on the data source and on the compiler.

The hashing method (e.g., shiftxor_is_special) has the benefit of being essentially branch-free, which makes its performance largely independent of the distribution of the input. It is also the fastest in these tests.

If you cannot safely pad your strings, I recommend the hashnocheat_is_special function. Having to deal with variable-length strings is significantly slower, but it is sometimes necessary.

My source code is available.

Further reading: I was later informed that my friend Wojciech Muła had looked at the problem earlier in 2022. He did not look at string padding, but for the version with no cheating (variable-length strings), he reaches the same conclusion that I do.

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

36 thoughts on “Quickly checking that a string belongs to a small set”

    1. Seems like a good place for a bloom filter.

      Presumably the overwhelming majority of words seen don’t match, so you would win by rejecting non-matches quickly.

      1. A Bloom filter will require hashing the string six or seven times with different hash functions. And it has false positives, so you still have to do a real set-membership test if the filter returns true.

        I didn’t see anything in the post stating that matches will be rare, and in the benchmark code 60% of the strings match, so this doesn’t seem like a good trade-off.

        1. There is no need for different hash functions with a bloom filter. You just mix in a different salt for each iteration.

          But if the list of strings is known in advance as in this example, then gperf would be even simpler.

          1. I have added gperf to the benchmark. It is competitive but it is not a match for the fast approaches I propose. This being said, it could maybe be made competitive with some extra engineering.

  1. The ‘direct’ example uses bitwise | instead of logical ||, so no short-circuit evaluation? I guess that will affect the timing to 0.5x what you have now (amortised, if the needle is uniformly distributed)

  2. It is a bit unfair to expect the input strings to be padded to 8 or 4 bytes. In practice, the parameter to memcpy should be input.size() instead of sizeof(uint..), but that seems to tank the performance.
    I played a bit with the example (on godbolt: zd9c9YM8b), and the “fast” implementation was the most stable (similar duration over multiple runs). The branchless implementation is consistently the fastest, and the hash version is the sweet spot between clarity and performance.
    I also tried a lame implementation of a letter graph, and it seems to be somewhere in between “branchless” and “fast”.

    1. It is a bit unfair to expect the input strings to be padded to 8 or 4 bytes.

      For a general-purpose function, I agree, but if it is part of your own system where you control where the strings come from, then I disagree. Requiring padding to your strings is quite doable if you have fine control over your data structures.

    1. In my functional programming language Fexl I’m using an “index” technique which identifies maximal common prefixes among keys, identifying the optimal sequences of individual three-way character comparisons to execute before doing a final definitive full key comparison.

      Here are the operations index_get, index_put, index_del, index_pairs:

      https://github.com/chkoreff/Fexl/blob/master/src/test/lib/index/index.fxl

      Here is a test suite:

      https://github.com/chkoreff/Fexl/blob/master/src/test/b15.fxl

      My next step is to write a Fexl program which, when given a set of key-value pairs, automatically writes an optimal C function which looks up the value associated with a key. By “optimal” I mean using the index technique, not using a hash function. It would write the C function to be given the length of the key up front, not relying on trailing NUL.

      1. I may also write the index_get, index_put, and other operations in native C instead of Fexl, but it’s not a bottleneck for me so there’s no hurry. I just think auto-generating C code for a particular index value will be interesting. I may even use the result to do the lookup for the set of standard Fexl functions already written directly in C.

      2. OK, I’ve written some code that “compiles” a list of key-value pairs into C code. The keys and values must be strings, and I don’t yet do anything to detect special characters which must be escaped, or multi-line strings.

        Here’s a test suite that tries a few different lists:

        https://github.com/chkoreff/Fexl/blob/master/src/test/index_C.fxl

        Here’s the reference output for that:

        https://github.com/chkoreff/Fexl/blob/master/src/out/index_C

        Here’s the Fexl code which does the actual code generation work:

        https://github.com/chkoreff/Fexl/blob/master/src/test/lib/index/render_C.fxl

        This generates a “lookup” function which takes a pointer to the key chars and a separate length. If you have NUL terminated data you can call strlen before calling lookup.

        As far as I can tell the generated code is “optimal” in the sense that it does the fewest number of individual character comparisons needed to reach a point at which strncmp yields a definitive answer.

        1. By the way, I could probably use “switch” instead of successive comparisons of the same key character, but I figure the optimizer will already have that covered.

      1. Not recently, but I did when I built one into my game engine a few years back; once assembled, a critbit trie does (on average) one bit compare per log2 strings interned (which can trivially be made branchless), and a single string compare when the appropriate leaf node is found.
        It’s not faster for a set containing one or perhaps two strings, but for even five strings it was the fastest method I found, and I did compare against various other methods.

  3. > You might be able to do slightly better if you can tell your compiler that the string you receive is ‘padded’ so that you can read eight bytes safely from it. I could not find a very elegant way to do it

    There’s the [[assume]] attribute in C++23 (https://en.cppreference.com/w/cpp/language/attributes/assume), but currently the only way is to use the compiler builtin functions, e.g.: https://godbolt.org/z/xxK939vca
    I’ve also taken some liberties in rewriting your approach, but the generated assembly should be similar.

    1. Transcribing your interesting code:

      #include <algorithm>
      #include <array>
      #include <bit>
      #include <ranges>
      #include <string_view>
      
      static constexpr std::array protocols = {
          "https",
          "http",
          "file",
          "ftp",
          "wss",
          "ws"
      };
      
      static constexpr auto string_to_uint64 = [](std::string_view s) constexpr {
          using CharT = decltype(s)::value_type;
          static_assert(sizeof(CharT) == 1);
          std::array<CharT, 8> bytes{};
          const auto copy_size = std::min(s.size(), 8ul);
          std::copy_n(s.cbegin(), copy_size, bytes.begin());
          return std::bit_cast<uint64_t>(bytes);
      };
      
      static constexpr auto protocol_set = [] {
          std::array<uint64_t, protocols.size()> ps{};
          std::ranges::transform(protocols, ps.begin(), string_to_uint64);
          return ps;
      }();
      
      bool is_special(std::string_view s) {
          __builtin_assume(s.size() == 8ul);
          const auto as_uint64{ string_to_uint64(s) };
          return std::ranges::count(protocol_set, as_uint64);
      }
      

  4. I’ve worked on Chromium’s HTML and CSS parsers, and in general, this is an unsolved problem. I’ve seen all of these, depending on number of strings in the set, whether length is available and other circumstances:

    Switch on first letter, test with strcmp/memcmp
    Switch on length, test with strcmp/memcmp
    Test first + last letter of word as a quick filter, test with strcmp/memcmp
    Perfect hashing (through gperf)
    Hash table lookup
    Check four or eight bytes at a time (as in this post)

    Probably others I forgot as well. One notable thing I haven’t seen is special SIMD instructions. I know Intel wanted to people to use PCMPESTRI and such in the past, but it’s a really slow instruction and SSE4.2 is a bit too new (alas!).
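    As an illustration of the “first + last letter” filter from that list (my sketch, not actual Chromium code), applied to the six protocols in the post:

```cpp
#include <string_view>

// A cheap check on the first and last byte rejects most non-members
// before the cost of a full comparison is paid.
bool filtered_is_special(std::string_view in) {
  if (in.size() < 2 || in.size() > 5) { return false; }
  switch (in.front()) {
    case 'f': return (in.back() == 'p' && in == "ftp") ||
                     (in.back() == 'e' && in == "file");
    case 'h': return (in.back() == 'p' && in == "http") ||
                     (in.back() == 's' && in == "https");
    case 'w': return (in.back() == 's') && (in == "ws" || in == "wss");
    default:  return false;
  }
}
```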

  5. Here’s a simple perfect hashing solution which seems to beat all variants mentioned on the blog post on my M1. If one would have vectored 64-bit multiply high and variable vector shuffle instructions, this could be even massaged to work on vectors of padded strings:

    const uint8_t table[64] = {
        'w', 's', 0,   0,   0,   0,   0,   0,
        'f', 't', 'p', 0,   0,   0,   0,   0,
        'w', 's', 's', 0,   0,   0,   0,   0,
        'f', 'i', 'l', 'e', 0,   0,   0,   0,
        'h', 't', 't', 'p', 0,   0,   0,   0,
        'h', 't', 't', 'p', 's', 0,   0,   0,
        0,   0,   0,   0,   0,   0,   0,   0,  /* comparison always fails */
        0,   0,   0,   0,   0,   0,   0,   0   /* comparison always fails */
    };

    bool mulhi_is_special(std::string_view input) {
      uint64_t inputu = string_to_uint64(input);

      return *(uint64_t *)(table +
                 (((inputu * (unsigned __int128)39114211812342) >> 64) & 0x38)) ==
             inputu;
    }

    1. The constant above is found by random brute force search, and is actually not very easy to create on ARM assembler. I tried finding constants which would be nicer to load and found them, but interestingly enough this made practically no difference, at least on M1.

    2. By the way, one could also use Intel PEXT (parallel bit extract) to construct a comparison table lookup address in this specific case. There are 285 four-bit bit position subsets for these strings which uniquely identify each string. On those microarchitectures where PEXT is implemented efficiently this might provide even faster checking, but I suspect applicability of the approach doesn’t scale as well as multiplication, not to mention that ARM lacks the corresponding instruction.

    3. A perfect hash index for a (16-entry) lookup table can also be constructed from a masked xor of two right shifts of inputu, but strangely enough it is slower than the mulhi variant on M1 when inlined (and equivalent in speed when not). It is curious how efficient the wide multipliers have become for these sort of tricks; even replacing a wide multiplication with a two-instruction sequence of constant shifts and a xor may be counterproductive.

  6. The problem of converting an eight-byte NUL-terminated or partial string into an uint64_t where data after NUL is zeroed should be pretty straight-forward with modern vectorisation. The code below vectorises, at least as a lone function, pretty well on Clang 14 on my M1 Mac and with Intel ICX compiler (for Skylake, for instance):


    uint64_t string_to_uint64(const char *str)
    {
      const uint64_t ps = 4096; /* minimum common page size */

      uint8_t outputv[8];
      uint64_t v;
      uint64_t shift;

      if (__builtin_expect(((intptr_t)str & (ps - 1)) > ps - 8, 0))
      {
        uint64_t buf = 0;

        strncpy((char *)&buf, str, sizeof(buf));
        return buf;
      }

      for (size_t i = 0; i < 8; i++)
      {
        outputv[i] = str[i] != 0 ? 0 : ~0;
      }

      v = *(uint64_t *)&outputv;
      shift = v ? __builtin_ctzll(v) : 64;

      return *(uint64_t *)str & ~(~0ULL << shift);
    }

    It generates on both about a dozen instruction fast path, of which about four instructions are just checking for the low likelihood of need to use the slow path, which is for strings which might cross the page boundary. (That part could be still optimized, but it’s not really the low-hanging fruit.)

    Problems arise when one attempts to use other compilers or to inline this into existing code. Suddenly autovectorisation becomes an effectively impossible task for the compilers to handle gracefully, which is really quite appalling.

    Nonetheless, even if a compiler would work like a charm, this sort of padding is going to take more time than, for instance, my multiply-mask-load check acting on its output, which takes effectively three instructions (with one multiply) to perform the check…

  7. The regular expression versions can be sped up a lot (Though still a lot slower than the other approaches) by using the pattern https?|ftp|file|wss?, eliminating unused capture groups and reducing backtracking. And using a faster engine like RE2 gives an even bigger speedup. I tested a state machine compiled by re2c, too:

    regex 214.626912 ns/string, matches = 6823
    my regex 146.426463 ns/string, matches = 6823
    boost regex 149.997056 ns/string, matches = 6823
    my boost regex 123.283810 ns/string, matches = 6823
    RE2 regex 75.450312 ns/string, matches = 6823
    my RE2 regex 49.995401 ns/string, matches = 6823
    re2c 9.409947 ns/string, matches = 6823

    1. As long as state space explosion is not an issue one can translate regular expressions to DFAs. (Also, it is very hard to get more performant than this on regular CPUs and general-purpose pattern matching which can’t be easily reduced to hashing.)

      On long strings the bottleneck will be the load latency (but one can theoretically interleave processing of multiple strings). On short ones with unpredictable length branch misprediction is likely to be an issue.

      It might actually be beneficial to write automata which are always run a fixed amount of rounds for the buffer, looping back to the same state on every byte after seeing the NUL termination. This would allow much more instructions to be in flight.

    2. I implemented a simple hand-written DFA walker with 14 states. Branchless versions of it run at 1.85 to 2.42 ns per string on my M1 Mac (when regex takes around 400 ns similarly to the original blog post).

      Branchy version is predictably slower if branch predictor doesn’t learn the benchmark pattern, but modern branch predictors can remember surprisingly long sequences…

  8. I think “fast” and “faster” test results are highly dependent on the benchmarking working set size as a result of the capacity of the branch predictor learning the test set, at least on my M1 Mac. When the “simulation” size is increased from 8192 to 65536 their runtimes triple, unlike some other variants in the current repo. See a comparison chart: https://i.imgur.com/KFhuI27.png

    Truly branchless solutions scale in this regard much better, but of course if input is highly skewed and these functions are sufficiently hot in the predictor, “branchy” alternatives are quite competitive. It’s just that even a single branch can make a function “branchy” in this regard.
