Checking for the absence of a string, naive AVX-512 edition

Suppose you would like to check that a string is not present in a large document. In C, you might do the following using the standard function strstr:

bool is_present = strstr(mydocument, needle);

It is simple and likely very fast. The implementation is algorithmically sophisticated.

Can you do better?

Recent Intel and AMD processors have instructions that operate on 512-bit registers. So we can compare 64 bytes using a single instruction. The simplest algorithm to search for a string might look as follows…

  1. Load 64 bytes from our input document, compare them against 64 copies of the first character of the target string.
  2. If we find a match, load the second character of the target string, copy it 64 times within a register. Load 64 bytes from our input document, with an offset of one byte.
  3. Repeat as needed for the second, third, and so forth characters…
  4. Then advance in the input by 64 bytes and repeat.

Using Intel intrinsic functions, the algorithm looks as follows:

  
  for (size_t i = 0; ...; i += 64) {
    __m512i comparator = _mm512_set1_epi8(needle[0]);
    __m512i input = _mm512_loadu_si512(in + i);
    __mmask64 matches = _mm512_cmpeq_epi8_mask(comparator, input);
    for (size_t char_index = 1; matches && char_index < needle_len; 
         char_index++) {
      comparator = _mm512_set1_epi8(needle[char_index]);
      input = _mm512_loadu_si512(in + i + char_index);
      matches =
          _kand_mask64(matches, _mm512_cmpeq_epi8_mask(comparator, input));
    }
    if(matches) { return true; }
  }
  return false;

It is about the simplest algorithm I could think of.

We are now going to benchmark it against strstr. To make it interesting, I will pick a string that it designed to be repeatedly ‘almost’ in the input document, except for its last character. It is a worst-case scenario.

I use GCC 11 on an Intel Icelake processor. The source code of my benchmark is available.

number of characters in the string AVX-512 (naive) strstr
2 33 GB/s 13 GB/s
5 22 GB/s 9 GB/s
10 13 GB/s 8 GB/s
14 10 GB/s 7 GB/s

Unsuprisingly, my naive AVX-512 approach scales poorly in this benchmark with the string length. However, it is somewhat pessimistic, I would expect better results with a more realistic use case.

It should be possible to do much better with some more sophistication. However, for short strings, we are already twice as fast as strstr which is encouraging.

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

8 thoughts on “Checking for the absence of a string, naive AVX-512 edition”

  1. We do something similar in Hyperscan – entry point for the most heavily used single string search is at:

    https://github.com/intel/hyperscan/blob/64a995bf445d86b74eb0f375624ffc85682eadfe/src/hwlm/noodle_engine.c#L221

    The pragmatic approach is to look for the first 2 “distinct” characters in the string (i.e. if the string is “aabb” use “ab” for a contiguous 2-character search, not “aa” or “bb”) and follow up with a simple AND/CMP type check on a hit. This runs ‘fast enough’ – single-string search is rarely a bottleneck in Hyperscan.

    1. It has been a while since I read it, but https://blog.burntsushi.net/ripgrep/ had a few neat tricks.

      Branches are expensive, so comparing two bytes instead of one before the first branch makes sense. Selecting the most uncommon bytes from the needle reduces the branches taken even further.

    2. Agreed on looking for two chars. Tried various schemes of choosing them (including the above) and performance varied little; memory bandwidth and cache line boundaries seemed to rule. Settled on the two chars whose first occurrences in the pattern string were the closest to the end.

  2. In my functional programming language Fexl I’ve been using the following naive version. Note that I use a “string” structure which is a length followed by the actual bytes, so I could have a string of 1000 NULs for example.

    /* Search x for the first occurrence of y, starting at offset. Return the
    position within x where y was found. If the position returned is past the end
    of x, then y was not found. */
    unsigned long str_search(string x, string y, unsigned long offset)
    {
    unsigned long xn = x->len;
    unsigned long yn = y->len;

    /* Avoid unnecessary work if a match is impossible based on lengths. */
    if (xn < yn || offset > xn – yn) return xn;

    {
    char *xs = x->data;
    char *ys = y->data;
    unsigned long xi = offset;
    unsigned long yi = 0;
    while (1)
    {
    if (yi >= yn) return xi – yi; /* found */
    if (xi >= xn) return xn; /* not found */

    if (xs[xi] == ys[yi])
    yi++;
    else
    {
    xi -= yi;
    yi = 0;
    }

    xi++;
    }
    }
    }

    (Tip of the hat for tohtml.com by the way.)

    I perused the source code for strstr and it’s pretty amazing. I was surprised to see actual calls to memcmp in it, though I suppose they’re leveraging on some known efficiencies in there.

  3. your avx2/avx256 implementation requires the CPU to implement avx-512, which my machine did not have. Hacked around with the following solution for avx2. Note that masking don’t play well with avx2 and chars. What I did was to fall back to memcmp for the remaining few comparisons. If anyone could figure out a way to not fall back to memcmp using only avx2 instructions, lmk!

    inline const char *find_avx256(const char *in, size_t len, const char *needle,
    size_t needle_len) {
    size_t i = 0;
    for (; i + 32 + needle_len – 1 < len; i += 32) {
    __m256i comparator = _mm256_set1_epi8(needle[0]);
    __m256i input = _mm256_load_si256(reinterpret_cast(in + i));
    int matches = _mm256_movemask_epi8(_mm256_cmpeq_epi8(comparator, input));
    for (size_t char_index = 1; matches && char_index < needle_len; char_index++) {
    comparator = _mm256_set1_epi8(needle[char_index]);
    input = _mm256_load_si256(reinterpret_cast(in + i + char_index));
    matches &= _mm256_movemask_epi8(_mm256_cmpeq_epi8(comparator, input));
    }
    if(matches) {
    return in + i + __builtin_ctz(matches);
    }
    }
    while (i + needle_len – 1 < len) {
    if (memcmp(in + i, needle, needle_len) == 0) {
    return in + i;
    }
    ++i;
    }
    return nullptr;
    }

Leave a Reply

Your email address will not be published.

To create code blocks or other preformatted text, indent by four spaces:

    This will be displayed in a monospaced font. The first four 
    spaces will be stripped off, but all other whitespace
    will be preserved.
    
    Markdown is turned off in code blocks:
     [This is not a link](http://example.com)

To create not a block, but an inline code span, use backticks:

Here is some inline `code`.

For more help see http://daringfireball.net/projects/markdown/syntax

You may subscribe to this blog by email.