Recognizing string prefixes with SIMD instructions

Suppose that I give you a long list of string tokens (e.g., “A”, “A6”, “AAAA”, “AFSDB”, “APL”, “CAA”, “CDS”, “CDNSKEY”, “CERT”, “CH”, “CNAME”, “CS”, “CSYNC”, “DHC”, etc.). I give you a pointer inside a much larger string and I ask you whether you are pointing at one of these tokens, and if so, which one. To make things slightly more complicated, you want the token to be followed by a valid separator (e.g., a space, a semi-colon, etc.) and you want to ignore the case (so that “aaaa” matches “AAAA”).

How might you solve this efficiently?

Comparing the input against each token in turn might work well if you have few tokens, but it is clearly a bad idea if you have many (say 70).

A reasonable approach is to do a binary search through the sorted list of tokens. The C function bsearch is well suited to the task. You need a comparison function as part of the implementation. You may use the C function strncasecmp to compare the strings while ignoring the case, and add a check so that you only return a match (value 0) when the input has a terminating character at the right position.
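A sketch of such a comparator might look as follows (the `is_separator` helper and its accepted characters are assumptions for illustration; the real code may differ):

```cpp
#include <cstring>
#include <string>
#include <strings.h> // strncasecmp (POSIX)

// Hypothetical helper: true for bytes allowed to follow a token.
static bool is_separator(char c) {
  return c == ' ' || c == ';' || c == '\t' || c == '\0';
}

// bsearch-style comparator: the key is the input pointer, the element
// is a std::string from the sorted token list.
static int compare(const void *key, const void *element) {
  const char *input = (const char *)key;
  const std::string *token = (const std::string *)element;
  int r = strncasecmp(input, token->c_str(), token->size());
  if (r != 0) return r;
  // The token text matches: report a match (0) only when the input
  // continues with a valid separator.
  return is_separator(input[token->size()]) ? 0 : 1;
}
```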

Then the logarithmic-time (O(log n)) implementation looks like this…

std::string *lookup_symbol(const char *input) {
  return (std::string *)bsearch(input, strings.data(), strings.size(),
                                sizeof(std::string), compare);
}

Simple enough.

Another approach is to use a trie. You implement a tree where the first level checks for the first character, the second level for the second character and so forth.

It gets a little bit lengthy, but you can use a script or a library to generate the code. You can use a series of switch/case like so…

switch (s[0]) {
  case 'A': case 'a':
    switch (s[1]) {
      case '6':
        return is_separator(s[2]) ? 1 : -1;
      case 'A': case 'a':
        switch (s[2]) {
          case 'A': case 'a':
            switch (s[3]) {
              case 'A': case 'a':
                return is_separator(s[4]) ? 2 : -1;
              default:
                return -1;
...

The running-time complexity depends on the data, but it is at most proportional to the length of the longest string in your set. The trie is a tempting solution, but it is branchy: if the processor can predict the upcoming content, it should do well, but if the input is random, you might be unlucky and get poor performance.

We can also use a finite-state machine, which requires a relatively large table, but has really simple execution:

int s = 71; // start state
while (*str && s >= 0) {
  uint8_t tok = char2token[uint8_t(*str)]; // map the byte to one of 32 classes
  str++;
  s = statetable[32 * s + tok]; // one row of 32 entries per state
}
*token = (uint8_t)s;
return s != 0;
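As a concrete illustration, here is a complete toy version of such a machine. All tables and values here are made up: it recognizes only two tokens ("A" and "CH", case-insensitive) with a space as the only separator, using five character classes instead of 32.

```cpp
#include <cstdint>

// Character classes: 0 other, 1 a/A, 2 c/C, 3 h/H, 4 separator.
static uint8_t char2token(uint8_t c) {
  switch (c | 0x20) { // fold ASCII letters to lower case
    case 'a': return 1;
    case 'c': return 2;
    case 'h': return 3;
  }
  return c == ' ' ? 4 : 0;
}

// statetable[state][class]; negative means done.
// State 0: start, 1: saw "a", 2: saw "c", 3: saw "ch".
// Final values: -1 no match, -2 matched "A", -3 matched "CH".
static const int8_t statetable[4][5] = {
  // other  a   c   h  sep
  {  -1,    1,  2, -1, -1 },  // start
  {  -1,   -1, -1, -1, -2 },  // "a"
  {  -1,   -1, -1,  3, -1 },  // "c"
  {  -1,   -1, -1, -1, -3 },  // "ch"
};

// Returns 0 for no match, 1 for "A", 2 for "CH".
static int match(const char *str) {
  int s = 0;
  while (*str && s >= 0)
    s = statetable[s][char2token((uint8_t)*str++)];
  return s <= -2 ? -s - 1 : 0;
}
```

Note that a token at the very end of the input (no separator) is rejected, matching the requirement that a valid separator must follow.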

With SIMD instructions, you can write a tight implementation that is effectively branchless: its execution does not depend on the input data.

The code works in this manner:

  1. We load 16 bytes unconditionally into a register.
  2. We find the location of the first separator, if any. We can do this with vectorized classification. It is a significant cost.
  3. We set to zero all bytes in the register starting from this first separator. We also switch all characters in A-Z to lower case.
  4. We use a hash function to map the processed bytes to a table containing our tokens in 16-byte blocks. The hash function is designed in such a way that if the input matches one of our tokens, then it is mapped to an identical value. We can derive the hash function empirically (by trial and error). Computing the hash function is a significant cost, so we have to be careful. In this instance, I use a simple function:
    ((((val >> 32) ^ val) & 0xffffffff) * (uint64_t)3523216699) >> 32.
  5. We compare the processed input with the token loaded from the table at the hashed index.
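Spelled out as code, the hash function from step 4 might look like this (with the parentheses balanced; the constant is the one given above):

```cpp
#include <cstdint>

// Maps the first 8 processed bytes to an 8-bit table index.
// The multiplier 3523216699 was found empirically (see step 4).
static inline uint8_t hash(uint64_t val) {
  return (uint8_t)(((((val >> 32) ^ val) & 0xffffffff) *
                   (uint64_t)3523216699) >> 32);
}
```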

The C function written using Intel intrinsic functions is as follows:

bool sse_type(const char *type_string, uint8_t *type) {
  // Step 1: load 16 bytes unconditionally.
  __m128i input = _mm_loadu_si128((__m128i *)type_string);
  // Step 2: vectorized classification to locate the first separator.
  __m128i delimiters =
    _mm_setr_epi8(0x00, 0x00, 0x22, 0x00, 0x00, 0x00,
                  0x00, 0x00, 0x28, 0x09, 0x0a, 0x3b,
                  0x00, 0x0d, 0x00, 0x00);
  __m128i mask = _mm_setr_epi8(-33, -1, -1, -1, -1,
                  -1, -1, -1, -1, -33, -1, -1,
                  -1, -1, -1, -1);
  __m128i pattern = _mm_shuffle_epi8(delimiters, input);
  __m128i inputc = _mm_and_si128(input,
      _mm_shuffle_epi8(mask, input));
  int bitmask = _mm_movemask_epi8(
      _mm_cmpeq_epi8(inputc, pattern));
  uint16_t length = __builtin_ctz(bitmask);
  // Step 3: zero the bytes from the first separator onward and
  // fold the ASCII letters to lower case.
  __m128i zero_mask = _mm_loadu_si128(
       (__m128i *)(zero_masks + 16 - length));
  __m128i inputlc = _mm_or_si128(input, _mm_set1_epi8(0x20));
  input = _mm_andnot_si128(zero_mask, inputlc);
  // Step 4: hash the first 8 processed bytes to an index.
  uint8_t idx = hash((uint64_t)_mm_cvtsi128_si64(input));
  *type = idx;
  // Step 5: match only if the processed input equals the table entry.
  __m128i compar = _mm_loadu_si128((__m128i *)buffers[idx]);
  __m128i xorthem = _mm_xor_si128(compar, input);
  return _mm_test_all_zeros(xorthem, xorthem);
}
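The zero_masks table is not shown above. A layout consistent with the load at zero_masks + 16 - length (an assumption derived from the andnot that follows it) is sixteen zero bytes followed by sixteen 0xFF bytes:

```cpp
#include <cstdint>

// Loading 16 bytes at offset (16 - length) yields 'length' zero bytes
// followed by 0xFF bytes, so _mm_andnot_si128 keeps exactly the first
// 'length' bytes of the input and zeroes the rest.
static const uint8_t zero_masks[32] = {
    0,    0,    0,    0,    0,    0,    0,    0,
    0,    0,    0,    0,    0,    0,    0,    0,
    0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
    0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
};
```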

We expect it to compile to branch-free code, as follows:

sse_type(char const*, unsigned char*):
  movdqu xmm1, XMMWORD PTR [rdi]
  mov edx, 16
  movdqa xmm2, XMMWORD PTR .LC0[rip]
  movdqa xmm0, XMMWORD PTR .LC1[rip]
  pshufb xmm2, xmm1
  pshufb xmm0, xmm1
  pand xmm0, xmm1
  pcmpeqb xmm0, xmm2
  por xmm1, XMMWORD PTR .LC2[rip]
  pmovmskb eax, xmm0
  bsf eax, eax
  cdqe
  sub rdx, rax
  movdqu xmm0, XMMWORD PTR zero_masks[rdx]
  pandn xmm0, xmm1
  movq rax, xmm0
  movq rdx, xmm0
  shr rax, 32
  xor eax, edx
  mov edx, 3523216699
  imul rax, rdx
  shr rax, 32
  mov BYTE PTR [rsi], al
  movzx eax, al
  sal rax, 4
  pxor xmm0, XMMWORD PTR buffers[rax]
  ptest xmm0, xmm0
  sete al
  ret

I wrote a benchmark where we repeatedly check for matching tokens, using many thousands of random tokens (enough to prevent the processor from having trivial branch prediction). I ran it on an Ice Lake server, with the code compiled by GCC 11 (targeting an old processor, Westmere).

technique       CPU cycles/string   instructions/string
binary search   236                 335
trie            71                  39
finite state    42                  61
SIMD            15                  39

In this particular test, the SIMD-based approach is more than four times faster than the trie despite retiring just as many instructions. The trie struggles with branch mispredictions. The SIMD-based approach also sustains a relatively high number of instructions retired per cycle (2.5). The binary search is disastrous in this case, being more than ten times slower than the SIMD-based approach. The finite-state approach is interesting: it is only about three times slower than the SIMD-based approach and significantly faster than the other non-SIMD approaches. It retires more instructions than the trie (61 versus 39), yet it is nearly twice as fast. However, the finite-state approach requires a relatively large table, larger than the alternatives.

The trie can match the SIMD-based approach when the input is predictable. I can simulate this scenario by repeatedly trying to match a small number of tokens (say 100) always in the same order. I find that the trie can then be just as fast in this easy case.

My code is available. It could be easily ported to ARM NEON or to AVX-512.

Credit: The problem was suggested to me by Jeroen Koekkoek from NLnet Labs. He sketched part of the solution. I want to thank GitHub user @aqrit for their comments.

Note: The problem considered in this blog post is not the recognition of strings from a set. I have a blog post on that other topic: Quickly checking that a string belongs to a small set.

Further reading: Is Prefix Of String In Table? by Nelson and Modern perfect hashing for strings by Muła.

Daniel Lemire, "Recognizing string prefixes with SIMD instructions," in Daniel Lemire's blog, July 14, 2023.


16 thoughts on “Recognizing string prefixes with SIMD instructions”

  1. You could also just build a DFA:

    const int num_tokens = ...;
    const int num_states = ...;
    int char2token[256] = { ... };
    int statetable[num_states][num_tokens] = { ... };

    bool match(const char *str)
    {
      int s = 0;
      while (*str && s >= 0) {
        int tok = char2token[(unsigned char)*str++];
        s = statetable[s][tok];
      }
      return (s == -1);
    }

    A state of -2 might be the non-matching condition.

    The single branch here is way more predictable than your switch tables.

        1. Note that you can’t just process the whole input string. In my model, the input string could be infinite for all we know. So you need more branching to terminate the processing.

          1. No, my code sample will stop when it hits a null or an exit state that indicates the starting place couldn’t possibly be one of your N prefixes. The state machine indicates you are still matching one or more of the prefixes. We can even have N negative states to return which prefix was matched and an additional state for “no match”.

            1. There are several ways to still reduce branch misprediction penalties on DFA code: one can stop advancing the input pointer using branchless code after seeing a specific terminator, or one can use “don’t care” DFA states on a longer buffer, which doesn’t even need to be zero-padded from the end. This way one can effectively run the loop a fixed number of iterations without unpredictable branches – which is practical if the length of possible input strings doesn’t vary much, but if it does this is not such a great idea.

              Of course the issue with general-purpose DFAs is that even the L1 load-to-load latency tends to be like three cycles, and you can’t really parallelise an individual DFA.

              There’s earlier code for a compact DFA with variations in the repo: https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/blob/b067aa68e3810de200c2b575e9130d54aeeb118d/2022/12/29/protocol.cpp#L98-L361 – it does only a match instead of separating between different correct matches but it could be easily extended to also return an index of recognised input…

    1. There’s another method which doesn’t demand state-table lookup dependency chains. One can create a bit vector where every bit represents a possible matching string position. In a table of masks, the bits which correspond to a matching character at a specific position for a specific string are set – single-character multiple choices/wildcards/don’t-cares are possible by setting more bits. Now a match can be sought simply by looking up the mask value for every character position in the input, ANDing these together, and checking which bits are still set in the result.

      Since the benchmark on this blog post has over 64 candidate matches and compiles gracefully only on x86 I didn’t experiment with this approach in this case, but it should be possible to at least match typical DFA performance for small number of potential string matches.
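      A minimal sketch of this position-mask scheme (the patterns and all names here are made up for illustration):

```cpp
#include <cstdint>
#include <cstring>

// Three hypothetical patterns of length 3, one bit per pattern.
static uint8_t masks[3][256]; // masks[pos][byte]: bit i set if
                              // pattern i has that byte at position pos

static void build_masks() {
  const char *patterns[3] = {"cat", "car", "dog"};
  memset(masks, 0, sizeof(masks));
  for (int i = 0; i < 3; i++)
    for (int pos = 0; pos < 3; pos++)
      masks[pos][(uint8_t)patterns[i][pos]] |= (uint8_t)(1 << i);
}

// AND the per-position masks together: any bit still set identifies
// a pattern matching the first three input bytes.
static uint8_t match3(const char *input) {
  uint8_t m = 0xFF;
  for (int pos = 0; pos < 3; pos++)
    m &= masks[pos][(uint8_t)input[pos]];
  return m;
}
```

      Any surviving bit would still need confirmation against the original string if the masks allow collisions.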

      1. Because we have lots of strings, with various lengths, it would be hard to avoid doing a bunch of branching in the approach you propose?

        compiles gracefully only on x86

        Yes. It is specific to x86 at this time (old Westmere processors), but can be ported with relative ease to more recent (e.g., AVX-2, AVX-512, NEON) systems.

        1. Well, it depends. One might have only a couple of unpredictable branches as a compromise. It might help, especially if there were a practical cut-off for this purpose. I do doubt there is much of a benefit, though, as long as efficient hashing works.

          No problem with the fact this code is x86 only – just that formulating my code in a benchmarkable fashion is a bit tedious today. 😐

  2. I fiddled with this problem back then, and I came up with a method of building byte-sized running hashes. The property of the running hash is that it will emit the same bytes for identical prefixes, and it doesn’t need to know where the string ends, so it can be built for a fixed-size chunk of the input string and compared bytewise for matches.

    The matching step is to compare the prefix hash byte at the end of each string in the table (preprocessed) to the corresponding byte in the hashed input string, extracted with pshufb. Any hits then have to be compared against the original of course.

    An easy 8-byte running hash is (little-endian) multiplication by a constant.
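    A minimal sketch of the running hash described above (the constant and names are made up; byte k of the product depends only on bytes 0..k of the little-endian value, so equal prefixes produce equal low hash bytes):

```cpp
#include <cstdint>
#include <cstring>

// Running hash by multiplication with a constant.
static uint64_t running_hash(const char *s) {
  uint64_t v = 0;
  memcpy(&v, s, 8); // assumes 8 readable bytes and a little-endian CPU
  return v * 0x9E3779B97F4A7C15ULL; // arbitrary odd constant (made up)
}
```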

  3. I tested out doing this with vectorscan (hyperscan) 5.4.9 on an AMD 7950x3d with the following results:

    sse_type : 1.82 GB/s 481.4 Ma/s 2.08 ns/d 5.48 GHz 11.39 c/d 44.00 i/d 3.0 c/b 11.67 i/b 3.86 i/c
    sse_table : 1.61 GB/s 426.5 Ma/s 2.34 ns/d 5.46 GHz 12.80 c/d 50.00 i/d 3.4 c/b 13.26 i/b 3.91 i/c
    simple_trie : 0.37 GB/s 97.6 Ma/s 10.25 ns/d 5.66 GHz 57.99 c/d 44.58 i/d 15.4 c/b 11.82 i/b 0.77 i/c
    bsearch : 0.09 GB/s 24.8 Ma/s 40.33 ns/d 5.56 GHz 224.30 c/d 332.53 i/d 59.5 c/b 88.19 i/b 1.48 i/c
    sse_length : 2.70 GB/s 716.9 Ma/s 1.39 ns/d 5.23 GHz 7.30 c/d 19.00 i/d 1.9 c/b 5.04 i/b 2.60 i/c
    finite_match : 0.62 GB/s 165.4 Ma/s 6.05 ns/d 5.43 GHz 32.85 c/d 63.02 i/d 8.7 c/b 16.71 i/b 1.92 i/c
    std::lower_bound : 0.05 GB/s 13.6 Ma/s 73.45 ns/d 5.50 GHz 403.78 c/d 665.70 i/d 107.1 c/b 176.55 i/b 1.65 i/c
    vectorscan : 0.10 GB/s 27.6 Ma/s 36.26 ns/d 5.51 GHz 199.84 c/d 537.14 i/d 53.0 c/b 142.46 i/b 2.69 i/c

    I was expecting vectorscan to do better than that and found this result somewhat surprising. I think case insensitive matching is having a significant impact on vectorscan’s performance on this task.

  4. I had a similar problem back then, but my code already knew the position of the separator character, and thus the length of the string to match.
    So I could just skip when the length was smaller than the shortest keyword, or longer than the longest keyword.
    Also, I had only 1 or 2 keywords with the same length, so using the length as preselection criteria came naturally.
