In software, it is a common problem to want to remove specific characters from a string. To make the problem precise, let us consider the removal of all ASCII control characters and spaces: in practice, this means removing all byte values smaller than or equal to 32.
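The conventional approach is a simple scalar loop. A minimal sketch (the function name `despace_scalar` is mine, not taken from the despacer library):

```c
#include <stddef.h>

// Remove all bytes with value <= 32 (ASCII control characters and the space),
// compacting the string in place. Returns the new length.
size_t despace_scalar(char *bytes, size_t howmany) {
  size_t pos = 0;
  for (size_t i = 0; i < howmany; i++) {
    char c = bytes[i];
    if ((unsigned char)c > 32) {
      bytes[pos++] = c; // keep only bytes strictly greater than 32
    }
  }
  return pos;
}
```

Each input byte triggers a hard-to-predict branch, which is one reason this version is slow on mixed data.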
I covered a related problem before: the removal of all spaces from strings. At the time, I concluded that the fastest approach might be to use SIMD instructions coupled with a large lookup table. SIMD instructions operate on many words at once: most commodity processors have instructions able to operate on 16 bytes at a time. Thus, using a single instruction, you can compare 16 consecutive bytes and identify the location of all spaces, for example. Once that is done, you must somehow remove the unwanted characters. Most instruction sets do not have instructions for that purpose; however, x64 processors have an instruction that can move bytes around as long as you have a precomputed shuffle mask (pshufb). ARM NEON has similar instructions as well. Thus you proceed in the following manner:
- Identify all unwanted characters in a block (e.g., 16 bytes).
- Lookup a shuffle mask in a large table.
- Move the unwanted bytes using the shuffle mask.
Such an approach is fast, but it requires possibly large tables. Indeed, if you load 16 bytes, there are 2^16 = 65536 possible patterns of unwanted characters, so you need a table with 65536 shuffle masks (a full megabyte at 16 bytes per mask). Storing such large tables is not very practical.
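To make the lookup table concrete, here is how one entry could be precomputed: given a 16-bit mask where a set bit marks a byte to keep, the shuffle mask lists the source indexes of the kept bytes in order. A portable sketch (the function name and layout are my own, not the despacer library's):

```c
#include <stddef.h>
#include <stdint.h>

// Build the 16-byte pshufb shuffle mask for a given 16-bit keep-mask:
// bit i set means byte i survives. Surviving indexes are packed to the
// front; remaining slots are filled with 0x80, which pshufb maps to zero.
void build_shuffle_mask(uint16_t keep, uint8_t mask[16]) {
  size_t pos = 0;
  for (size_t i = 0; i < 16; i++) {
    if (keep & (1u << i)) {
      mask[pos++] = (uint8_t)i; // source index of a kept byte
    }
  }
  while (pos < 16) {
    mask[pos++] = 0x80; // high bit set: pshufb writes a zero byte
  }
}
```

A full table applies this to all 65536 mask values, which is why it occupies a megabyte.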
Recent Intel processors have handy new instructions that do exactly what we want: they prune out unwanted bytes (vpcompressb). This requires a recent processor with AVX-512 VBMI2, such as an Ice Lake, Rocket Lake, Alder Lake, or Tiger Lake processor. Intel makes it difficult to figure out which features are available on which processor, so you need to do some research to find out whether your favorite Intel processor supports the desired instructions. AMD processors do not support VBMI2.
On top of the new instructions, AVX-512 also allows you to process the data in larger blocks (64 bytes). Using Intel intrinsics, the code is almost readable. I create a register containing only the space byte, and I then iterate over my data, each time loading 64 bytes. I compare it with the space: I only want to keep values that are larger (as byte values) than the space. I then call the compress instruction, which takes out the unwanted bytes. I read at regular intervals (every 64 bytes) but I write a variable number of bytes, so I advance the write pointer by the number of set bits in my mask: I count those using a fast instruction (popcnt).
```c
__m512i spaces = _mm512_set1_epi8(' ');
size_t i = 0;
for (; i + 63 < howmany; i += 64) {
  __m512i x = _mm512_loadu_si512(bytes + i);
  __mmask64 notwhite = _mm512_cmpgt_epi8_mask(x, spaces);
  _mm512_mask_compressstoreu_epi8(bytes + pos, notwhite, x);
  pos += _popcnt64(notwhite);
}
```
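For readers without VBMI2 hardware, the compare-and-compress step can be modeled in scalar C. This sketch is my own (not part of despacer) and mirrors what one 64-byte iteration of the loop does:

```c
#include <stddef.h>
#include <stdint.h>

// Scalar model of one 64-byte iteration: build the keep-mask the way
// _mm512_cmpgt_epi8_mask does (signed byte compare against ' '), then
// emulate the compress store; the return value equals popcount(notwhite),
// which is how far the write pointer advances.
size_t compress_block(const int8_t block[64], int8_t *out) {
  uint64_t notwhite = 0;
  for (size_t j = 0; j < 64; j++) {
    if (block[j] > (int8_t)' ') { // signed compare, as in the vector code
      notwhite |= (uint64_t)1 << j;
    }
  }
  size_t pos = 0;
  for (size_t j = 0; j < 64; j++) { // compress: keep flagged bytes, in order
    if (notwhite & ((uint64_t)1 << j)) {
      out[pos++] = block[j];
    }
  }
  return pos;
}
```

Note that the signed comparison treats bytes above 127 as negative; that is faithful to the intrinsic, and harmless for ASCII input.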
I have updated the despacer library and its benchmark. With a Tiger Lake processor (3.3 GHz) and GCC 9 (Linux), I get the following results:
| function | speed |
|---|---|
| conventional (despace32) | 0.4 GB/s |
| SIMD with large lookup (sse42_despace_branchless) | 2.0 GB/s |
| AVX-512 (vbmi2_despace) | 8.5 GB/s |
Your results will differ. Nevertheless, we find that AVX-512 is highly useful for this task, and the corresponding function surpasses all the alternatives. It is not merely the raw speed: we also avoid the lookup table, and the code does not rely on branch prediction, so there are no hard-to-predict branches that may harm your speed in practice.
The result should not surprise us since, for the first time, we almost have direct hardware support for the operation (“pruning unwanted bytes”). The downside is that few processors support the desired instruction set. And it is not clear whether AMD will ever support these fancy instructions.
I should conclude with Linus Torvalds' take regarding AVX-512:
> I hope AVX-512 dies a painful death, and that Intel starts fixing real problems instead of trying to create magic instructions to then create benchmarks that they can look good on
I cannot predict what will happen to Intel or AVX-512, but if the past is any indication, specialized and powerful instructions have a bright future.
Seeing SSE2 used for string ops caused one colleague to remark, about that and AVX, “Why didn’t Intel just make REP SCASB (et al) fast?” Torvalds might be right; meanwhile, we can be opportunistic about odd cases for using odd instruction sets. Or use them for fast APL functions 🙂
I mean… they did, for a few things in that family. ERMS, or enhanced rep movsb, can be used for fast memsets and memmoves. Glibc is even aware of which CPUs have that capability and of when using that microcode trick is helpful. I think that family of “rep” semantics has some limitations in how much it can express. That, and the micro-op sequence it compiles to is probably a bit more complicated.
Thanks for pointing out ERMS. A benchmark on Stack Overflow wasn’t flattering (vs. AVX), and the Intel Optimization Reference Manual says ERMS’s advantage is smaller code:
https://stackoverflow.com/questions/43343231/enhanced-rep-movsb-for-memcpy
For SSE2 memcpy of blocks > 192 bytes, I saw an advantage from prefetchnta. Would that threshold be the same for AVX?
I believe Linus Torvalds’ complaint was more about the size of the AVX-512 registers, which take a huge amount of data during a context switch, and also about overheating, which causes the CPU to downclock if the instructions are used too much. The same could be achieved if Intel extended 32-byte AVX2, and it would likely have similar speed. I wrote similar code for matrix multiplication and the performance gain is far from linear, even ignoring downclocking.
AVX-512 also extends the ISA for 32-byte registers (AVX-512VL). In fact, the code I describe in my blog post is easily adapted to 32-byte registers.
You can discover which specialized instructions your Intel CPU supports using /proc/cpuinfo.
Great article! I’m confused why you’re using _mm512_mask_compressstoreu_epi16 rather than _mm512_mask_compressstoreu_epi8 – don’t you want to write/not write at an 8-bit granularity rather than a 16-bit one?