Suppose you would like to check that a string is not present in a large document. In C, you might do the following using the standard function strstr:
bool is_present = strstr(mydocument, needle);
It is simple and likely very fast. The implementation is algorithmically sophisticated.
Can you do better?
Recent Intel and AMD processors have instructions that operate on 512-bit registers. So we can compare 64 bytes using a single instruction. The simplest algorithm to search for a string might look as follows…
- Load 64 bytes from our input document, compare them against 64 copies of the first character of the target string.
- If we find a match, load the second character of the target string, copy it 64 times within a register. Load 64 bytes from our input document, with an offset of one byte.
- Repeat as needed for the second, third, and so forth characters…
- Then advance in the input by 64 bytes and repeat.
Using Intel intrinsic functions, the algorithm looks as follows:
for (size_t i = 0; ...; i += 64) { __m512i comparator = _mm512_set1_epi8(needle[0]); __m512i input = _mm512_loadu_si512(in + i); __mmask64 matches = _mm512_cmpeq_epi8_mask(comparator, input); for (size_t char_index = 1; matches && char_index < needle_len; char_index++) { comparator = _mm512_set1_epi8(needle[char_index]); input = _mm512_loadu_si512(in + i + char_index); matches = _kand_mask64(matches, _mm512_cmpeq_epi8_mask(comparator, input)); } if(matches) { return true; } } return false;
It is about the simplest algorithm I could think of.
We are now going to benchmark it against strstr. To make it interesting, I will pick a string that it designed to be repeatedly ‘almost’ in the input document, except for its last character. It is a worst-case scenario.
I use GCC 11 on an Intel Icelake processor. The source code of my benchmark is available.
number of characters in the string | AVX-512 (naive) | strstr |
---|---|---|
2 | 33 GB/s | 13 GB/s |
5 | 22 GB/s | 9 GB/s |
10 | 13 GB/s | 8 GB/s |
14 | 10 GB/s | 7 GB/s |
Unsuprisingly, my naive AVX-512 approach scales poorly in this benchmark with the string length. However, it is somewhat pessimistic, I would expect better results with a more realistic use case.
It should be possible to do much better with some more sophistication. However, for short strings, we are already twice as fast as strstr which is encouraging.
We do something similar in Hyperscan – entry point for the most heavily used single string search is at:
https://github.com/intel/hyperscan/blob/64a995bf445d86b74eb0f375624ffc85682eadfe/src/hwlm/noodle_engine.c#L221
The pragmatic approach is to look for the first 2 “distinct” characters in the string (i.e. if the string is “aabb” use “ab” for a contiguous 2-character search, not “aa” or “bb”) and follow up with a simple AND/CMP type check on a hit. This runs ‘fast enough’ – single-string search is rarely a bottleneck in Hyperscan.
It has been a while since I read it, but https://blog.burntsushi.net/ripgrep/ had a few neat tricks.
Branches are expensive, so comparing two bytes instead of one before the first branch makes sense. Selecting the most uncommon bytes from the needle reduces the branches taken even further.
Agreed on looking for two chars. Tried various schemes of choosing them (including the above) and performance varied little; memory bandwidth and cache line boundaries seemed to rule. Settled on the two chars whose first occurrences in the pattern string were the closest to the end.
See the EPSM search algorithm for a very fast SIMD based search.
https://www.sciencedirect.com/science/article/pii/S1570866714000471
In my functional programming language Fexl I’ve been using the following naive version. Note that I use a “string” structure which is a length followed by the actual bytes, so I could have a string of 1000 NULs for example.
/* Search x for the first occurrence of y, starting at offset. Return the
position within x where y was found. If the position returned is past the end
of x, then y was not found. */
unsigned long str_search(string x, string y, unsigned long offset)
{
unsigned long xn = x->len;
unsigned long yn = y->len;
/* Avoid unnecessary work if a match is impossible based on lengths. */
if (xn < yn || offset > xn – yn) return xn;
{
char *xs = x->data;
char *ys = y->data;
unsigned long xi = offset;
unsigned long yi = 0;
while (1)
{
if (yi >= yn) return xi – yi; /* found */
if (xi >= xn) return xn; /* not found */
if (xs[xi] == ys[yi])
yi++;
else
{
xi -= yi;
yi = 0;
}
xi++;
}
}
}
(Tip of the hat for tohtml.com by the way.)
I perused the source code for strstr and it’s pretty amazing. I was surprised to see actual calls to memcmp in it, though I suppose they’re leveraging on some known efficiencies in there.
your avx2/avx256 implementation requires the CPU to implement avx-512, which my machine did not have. Hacked around with the following solution for avx2. Note that masking don’t play well with avx2 and chars. What I did was to fall back to memcmp for the remaining few comparisons. If anyone could figure out a way to not fall back to memcmp using only avx2 instructions, lmk!
inline const char *find_avx256(const char *in, size_t len, const char *needle,
size_t needle_len) {
size_t i = 0;
for (; i + 32 + needle_len – 1 < len; i += 32) {
__m256i comparator = _mm256_set1_epi8(needle[0]);
__m256i input = _mm256_load_si256(reinterpret_cast(in + i));
int matches = _mm256_movemask_epi8(_mm256_cmpeq_epi8(comparator, input));
for (size_t char_index = 1; matches && char_index < needle_len; char_index++) {
comparator = _mm256_set1_epi8(needle[char_index]);
input = _mm256_load_si256(reinterpret_cast(in + i + char_index));
matches &= _mm256_movemask_epi8(_mm256_cmpeq_epi8(comparator, input));
}
if(matches) {
return in + i + __builtin_ctz(matches);
}
}
while (i + needle_len – 1 < len) {
if (memcmp(in + i, needle, needle_len) == 0) {
return in + i;
}
++i;
}
return nullptr;
}
Not all implementations of strstr() are going to be the same, but certainly a naieve “advance one character if match fails at this position” is far from optimal.
A much smarter string search algorithm is Boyer-Moore which skips past positions where a match cannot exist.
https://www.cs.utexas.edu/users/moore/best-ideas/string-searching/
The strstr function in glibc uses an optimized Boyer-Moore-Horspool algorithm.