Suppose that I give you a set of reference strings (“ftp”, “file”, “http”, “https”, “ws”, “wss”). Given a new string, you want to quickly tell whether it is part of this set.
You might use a regular expression but it is unlikely to be fast in general:
const std::regex txt_regex("https?|ftp|file|wss?"); // later... bool match = std::regex_match(v.begin(), v.end(), txt_regex);
A sensible solution might be to create a set and then to ask whether the string is in the set. In C++, a default set type is the unordered_set thus your code might look as follows:
static const std::unordered_set<std::string_view> special_set = { "ftp", "file", "http", "https", "ws", "wss"}; bool hash_is_special(std::string_view input) { return special_set.find(input) != special_set.end(); }
You can also use a tool like gperf which constructs a perfect-hash function for you.
You might also be more direct about it, and just do several comparisons:
bool direct_is_special(std::string_view input) { return (input == "https") || (input == "http") || (input == "ftp") || (input == "file") || (input == "ws") || (input == "wss"); }
I call it a branchy version because of the ‘||’ operator which suggests to the compiler that you want to evaluate the comparisons one by one, exiting once one is true; if you replace the ‘||’ operator with the operator ‘|’ then you the result is ‘branchless’ in the sense that you entice the compiler to evaluate all comparisons.
If you look at how the code gets compiled, you may notice that the compiler is forced to do comparisons and jumps, because it is not allowed to read in the provided string beyond its reported size.
You might be able to do slightly better if you can tell your compiler that the string you receive is ‘padded’ so that you can read eight bytes safely from it. I could not find a very elegant way to do it, but the following code works:
static inline uint64_t string_to_uint64(std::string_view view) { uint64_t val; std::memcpy(&val, view.data(), sizeof(uint64_t)); return val; } uint32_t string_to_uint32(const char *data) { uint32_t val; std::memcpy(&val, data, sizeof(uint32_t)); return val; } bool fast_is_special(std::string_view input) { uint64_t inputu = string_to_uint64(input); if ((inputu & 0xffffffffff) == string_to_uint64("https\0\0\0")) { return input.size() == 5; } if ((inputu & 0xffffffff) == string_to_uint64("http\0\0\0\0")) { return input.size() == 4; } if (uint32_t(inputu) == string_to_uint32("file")) { return input.size() == 4; } if ((inputu & 0xffffff) == string_to_uint32("ftp\0")) { return input.size() == 3; } if ((inputu & 0xffffff) == string_to_uint32("wss\0")) { return input.size() == 3; } if ((inputu & 0xffff) == string_to_uint32("ws\0\0")) { return input.size() == 2; } return false; }
Though I did not do it, you can extend the comparison so that it is case-insensitive (simply AND the input with the bytes 0xdf instead of the bytes 0xff).
You can use a faster approach if you can assume that the input string has been padded with zeros:
uint64_t inputu = string_to_uint64(input); uint64_t https = string_to_uint64("https\0\0\0"); uint64_t http = string_to_uint64("http\0\0\0\0"); uint64_t file = string_to_uint64("file\0\0\0\0"); uint64_t ftp = string_to_uint64("ftp\0\0\0\0\0"); uint64_t wss = string_to_uint64("wss\0\0\0\0\0"); uint64_t ws = string_to_uint64("ws\0\0\0\0\0\0"); if((inputu == https) | (inputu == http)) { return true; } return ((inputu == file) | (inputu == ftp) | (inputu == wss) | (inputu == ws));
Observe how I have selected what I believe are the two most common cases (among URL protocols).
Finally, we must use a hash function to solve the problem with a single comparison:
static const uint8_t shiftxor_table[128] = { 'w', 's', 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 'f', 'i', 'l', 'e', 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 'f', 't', 'p', 0, 0, 0, 0, 0, 'w', 's', 's', 0, 0, 0, 0, 0, 'h', 't', 't', 'p', 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 'h', 't', 't', 'p', 's', 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}; bool shiftxor_is_special(std::string_view input) { uint64_t inputu = string_to_uint64(input); return string_to_uint64( shiftxor_table + (((inputu >> 28) ^ (inputu >> 14)) & 0x78)) == inputu; }
If you do not want to assume that the strings are padded (i.e., no cheating), then you can do it the “gperf way” (it is my own code, but it is based on gpref):
std::string_view table_hashnocheat_is_special[] = {"http", "", "https", "ws", "ftp", "wss", "file", ""}; bool hashnocheat_is_special(std::string_view input) { if(input.empty()) { return false; } int hash_value = (2*input.size() + (unsigned)(input[0])) & 7; const std::string_view target = table_craftedhash_is_special[hash_value]; return (target[0] == input[0]) && (target.substr(1) == input.substr(1)); }
I am sure that there are faster and more clever alternatives!
In any case, how fast are my alternatives? Using GCC 11 on an Intel Ice Lake server, I get the following results:
gperf7.1 ns/string
regex | 360 ns/string |
std::unordered_map | 19 ns/string |
direct | 16 ns/string |
direct (branchy) | 13 ns/string |
direct (branchless) | 18 ns/string |
hashnocheat_is_special | 7.0 ns/string |
fast | 2.6 ns/string |
faster | 1.9 ns/string |
hashing (shiftxor_is_special) | 1.1 ns/string |
On an Apple M2 with LLVM 12, I get similar (but better) results:
regex | 450 ns/string |
std::unordered_map | 15 ns/string |
direct (branchy) | 7.5 ns/string |
direct (branchless) | 4.8 ns/string |
gperf | 8.0 ns/string |
hashnocheat_is_special | 7.2 ns/string |
fast | 1.1 ns/string |
faster | 0.8 ns/string |
hashing (shiftxor_is_special) | 0.4 ns/string |
Care is needed when optimizing such small functions: whether and how the function gets inlined can be critical to the good performance. The results will depend also on the data source and on the compiler.
The hashing method (e.g., shiftxor_is_special) has the benefit of being essentially branch-free which makes its performance having little dependency on the distribution of the input. It is also fastest in these tests.
If you cannot safely pad your strings, I recommend the hashnocheat_is_special function. Having to deal with variable-length strings is significantly slower, but it is sometimes necessary.
Further reading: I was later informed that my friend Wojciech Muła had looked at the problem earlier in 2022. He did not look at string padding, but for the version with no-cheating (variable-length strings), he comes up with the same conclusion that I do.
Just out of interest, have you tried to measure a regex-based solution as well?
In my tests, regex is 100x slower. Of course, results will vary based on the underlying implementation.
Seems like a good place for a bloom filter.
Presumably the overwhelming majority of words seen don’t match, so you would win by rejecting non-matches quickly.
A Bloom filter will require hashing the string six or seven times with different hash functions. And it has false positives, so you still have to do a real set-membership test if the filter returns true.
I didn’t see anything in the post stating that matches will be rare, and in the benchmark code 60% of the strings match, so this doesn’t seem like a good trade-off.
A conventional Bloom filter would be overkill here, but a hashing based technique could work.
There is no need for different hash functions with a bloom filter. You just mix in a different salt for each iteration.
But if the list of strings is known in advance as in this example, then gperf would be even simpler.
I have added gperf to the benchmark. It is competitive but it is not a match for the fast approaches I propose. This being said, it could maybe be made competitive with some extra engineering.
The ‘direct’ example uses bitwise | instead of logical ||, so no short-circuit evaluation? I guess that will affect the timing to 0.5x what you have now (amortised, if the needle is uniformly distributed)
The code has both functions. Results are sensitive to the compiler.
I have updated the blog post to include both versions.
It is a bit unfair to expect the input strings to be padded to 8 or 4 bytes. In practice, the parameter to memcpy should be input.size() instead of sizeof(uint..), but that seems to tank the performance.
I played a bit with the example (on godbolt: zd9c9YM8b), and the “fast” implementation was the most stable (similar duration over multiple runs). The branchless implementation is consistently the fastest, and the hash version is the sweet spot between clarity and performance.
I also tried a lame implementation of a letter graph, and it seems to be somewhere in between “branchless” and “fast”.
It is a bit unfair to expect the input strings to be padded to 8 or 4 bytes.
For a general-purpose function, I agree, but if it is part of your own system where you control where the strings come from, then I disagree. Requiring padding to your strings is quite doable if you have fine control over your data structures.
I wonder when perfect hashing becomes the winner.
Wojciech Mula has written a note about a similar problem in the past. He found that using a perfect hash function was the fastest way, and that switching character by character was faster than SWAR techniques like the one you mention in this post.
I.e. see
http://0x80.pl/notesen/2022-01-29-http-verb-parse.html
Damn it. Wojciech is always one step ahead of me.
(For people reading this, Wojciech is a collaborator of mine.)
In my functional programming language Fexl I’m using an “index” technique which identifies maximal common prefixes among keys, identifying the optimal sequences of individual three-way character comparisons to execute before doing a final definitive full key comparison.
Here are the operations index_get, index_put, index_del, index_pairs:
https://github.com/chkoreff/Fexl/blob/master/src/test/lib/index/index.fxl
Here is a test suite:
https://github.com/chkoreff/Fexl/blob/master/src/test/b15.fxl
My next step is to write a Fexl program which, when given a set of key-value pairs, automatically writes an optimal C function which looks up the value associated with a key. By “optimal” I mean using the index technique, not using a hash function. It would write the C function to be given the length of the key up front, not relying on trailing NUL.
I may also write the index_get, index_put, and other operations in native C instead of Fexl, but it’s not a bottleneck for me so there’s no hurry. I just think auto-generating C code for a particular index value will be interesting. I may even use the result to do the lookup for the set of standard Fexl functions already written directly in C.
OK, I’ve written some code that “compiles” a list of key-value pairs into C code. The keys and values must be strings, and I don’t yet do anything to detect special characters which must be escaped, or multi-line strings.
Here’s a test suite that tries a few different lists:
https://github.com/chkoreff/Fexl/blob/master/src/test/index_C.fxl
Here’s the reference output for that:
https://github.com/chkoreff/Fexl/blob/master/src/out/index_C
Here’s the Fexl code which does the actual code generation work:
https://github.com/chkoreff/Fexl/blob/master/src/test/lib/index/render_C.fxl
This generates a “lookup” function which takes a pointer to the key chars and a separate length. If you have NUL terminated data you can call strlen before calling lookup.
As far as I can tell the generated code is “optimal” in the sense that it does the fewest number of individual character comparisons needed to reach a point at which strncmp yields a definitive answer.
By the way, I could probably use “switch” instead of successive comparisons of the same key character, but I figure the optimizer will already have that covered.
I restructured the code a little, and here’s the new file where “compile_pairs” is defined:
https://github.com/chkoreff/Fexl/blob/master/src/test/lib/index_C/context.fxl
This seems like exactly the sort of thing a critbit trie is for.
https://github.com/agl/critbit
Did you run a benchmark?
Not recently, but I did when I built one into my game engine a few years back; once assembled, a critbit trie does (on average) one bit compare per log2 strings interned (which can trivially be made branchless), and a single string compare when the appropriate leaf node is found.
It’s not faster for a set containing one or perhaps two strings, but for even five strings it was the fastest method I found, and I did compare against various other methods.
> You might be able to do slightly better if you can tell your compiler that the string you receive is ‘padded’ so that you can read eight bytes safely from it. I could not find a very elegant way to do it
There’s std::assume in C++23 (https://en.cppreference.com/w/cpp/language/attributes/assume) but currently the only way is to use the compiler builtin functions, e.g.: https://godbolt.org/z/xxK939vca
I’ve also taken some liberties at rewriting your approach, but the generate assembly should be similar
Transcribing your interesting code:
I’ve worked on Chromium’s HTML and CSS parsers, and in general, this is an unsolved problem. I’ve seen all of these, depending on number of strings in the set, whether length is available and other circumstances:
Switch on first letter, test with strcmp/memcmp
Switch on length, test with strcmp/memcmp
Test first + last letter of word as a quick filter, test with strcmp/memcmp
Perfect hashing (through gperf)
Hash table lookup
Check four or eight bytes at a time (as in this post)
Probably others I forgot as well. One notable thing I haven’t seen is special SIMD instructions. I know Intel wanted to people to use PCMPESTRI and such in the past, but it’s a really slow instruction and SSE4.2 is a bit too new (alas!).
We took a similar approach in our JSON decoder. We found that 8 bytes was too small, but 16 bytes was just right.
See:
– https://github.com/segmentio/asm/pull/57 (AMD64)
– https://github.com/segmentio/asm/pull/65 (ARM64)
– https://github.com/segmentio/encoding/pull/101
Here’s a simple perfect hashing solution which seems to beat all variants mentioned on the blog post on my M1. If one would have vectored 64-bit multiply high and variable vector shuffle instructions, this could be even massaged to work on vectors of padded strings:
const uint8_t table[64] =
{
‘w’, ‘s’, 0, 0, 0, 0, 0, 0,
‘f’, ‘t’, ‘p’, 0, 0, 0, 0, 0,
‘w’, ‘s’, ‘s’, 0, 0, 0, 0, 0,
‘f’, ‘i’, ‘l’, ‘e’, 0, 0, 0, 0,
‘h’, ‘t’, ‘t’, ‘p’, 0, 0, 0, 0,
‘h’, ‘t’, ‘t’, ‘p’, ‘s’, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, /* comparison always fails */
0, 0, 0, 0, 0, 0, 0, 0 /* comparison always fails */
};
bool mulhi_is_special(std::string_view input) {
uint64_t inputu = string_to_uint64(input);
return *(uint64_t *)
(table +
(((inputu *
(unsigned __int128)39114211812342) >> 64) & 0x38)) ==
inputu;
}
The constant above is found by random brute force search, and is actually not very easy to create on ARM assembler. I tried finding constants which would be nicer to load and found them, but interestingly enough this made practically no difference, at least on M1.
By the way, one could also use Intel PEXT (parallel bit extract) to construct a comparison table lookup address in this specific case. There are 285 four-bit bit position subsets for these strings which uniquely identify each string. On those microarchitectures where PEXT is implemented efficiently this might provide even faster checking, but I suspect applicability of the approach doesn’t scale as well as multiplication, not to mention that ARM lacks the corresponding instruction.
A perfect hash index for a (16-entry) lookup table can also be constructed from a masked xor of two right shifts of inputu, but strangely enough it is slower than the mulhi variant on M1 when inlined (and equivalent in speed when not). It is curious how efficient the wide multipliers have become for these sort of tricks; even replacing a wide multiplication with a two-instruction sequence of constant shifts and a xor may be counterproductive.
When I encountered this very same problem, I’ve generated code to do binary search on the padded strings as numbers. And compared it to various perfect hash generators.
https://blogs.perl.org/users/rurban/2014/08/perfect-hashes-and-faster-than-memcmp.html
The problem of converting an eight-byte NUL-terminated or partial string into an uint64_t where data after NUL is zeroed should be pretty straight-forward with modern vectorisation. The code below vectorises, at least as a lone function, pretty well on Clang 14 on my M1 Mac and with Intel ICX compiler (for Skylake, for instance):
uint64_t string_to_uint64(const char *str)
{
const uint64_t ps = 4096; /* minimum common page size */
uint8_t outputv[8];
uint64_t v;
uint64_t shift;
if (__builtin_expect(((intptr_t)str & (ps - 1)) > ps - 8, 0))
{
uint64_t buf = 0;
strncpy((char *)&buf, str, sizeof(buf));
return buf;
}
for (size_t i = 0; i < 8; i++)
{
outputv[i] = str[i] != 0 ? 0 : ~0;
}
v = *(uint64_t *)&outputv;
shift = v ? __builtin_ctzll(v) : 64;
return *(uint64_t *)str & ~(~0ULL << shift);
}
It generates on both about a dozen instruction fast path, of which about four instructions are just checking for the low likelihood of need to use the slow path, which is for strings which might cross the page boundary. (That part could be still optimized, but it’s not really the low-hanging fruit.)
Problems arise when one attempts to use other compilers or inlining this to existing code. Suddenly autovectorisation becomes effectively impossible task for the compilers to handle gracefully, which is really quite appalling.
Nonetheless, even if a compiler would work like a charm, this sort of padding is going to take more time than, for instance, my multiply-mask-load check acting on its output, which takes effectively three instructions (with one multiply) to perform the check…
The regular expression versions can be sped up a lot (Though still a lot slower than the other approaches) by using the pattern
https?|ftp|file|wss?
, eliminating unused capture groups and reducing backtracking. And using a faster engine like RE2 gives an even bigger speedup. I tested a state machine compiled by re2c, too:As long as state space explosion is not an issue one can translate regular expressions to DFAs. (Also, it is very hard to get more performant than this on regular CPUs and general-purpose pattern matching which can’t be easily reduced to hashing.)
On long strings the bottleneck will be the load latency (but one can theoretically interleave processing of multiple strings). On short ones with unpredictable length branch misprediction is likely to be an issue.
It might actually be beneficial to write automata which are always run a fixed amount of rounds for the buffer, looping back to the same state on every byte after seeing the NUL termination. This would allow much more instructions to be in flight.
I implemented a simple hand-written DFA walker with 14 states. Branchless versions of it run at 1.85 to 2.42 ns per string on my M1 Mac (when regex takes around 400 ns similarly to the original blog post).
Branchy version is predictably slower if branch predictor doesn’t learn the benchmark pattern, but modern branch predictors can remember surprisingly long sequences…
I think “fast” and “faster” test results are highly dependent of the benchmarking working set size as a result of the capacity of the branch predictor learning the test set, at least on my M1 Mac. When the “simulation” size is increased from 8192 to 65536 their runtimes triple, unlike some other variants in the current repo. See a comparison chart: https://i.imgur.com/KFhuI27.png
Truly branchless solutions scale in this regard much better, but of course if input is highly skewed and these functions are sufficiently hot in the predictor, “branchy” alternatives are quite competitive. It’s just that even a single branch can make a function “branchy” in this regard.