When programming, we often have to ‘escape’ strings. A standard way to do it is to insert the backslash character (\) before some characters such as the double quote. For example, the string
my title is "La vie"
becomes
my title is \"La vie\"
A simple routine in C++ to escape a string might look as follows:
for (...) { if ((*in == '\\') || (*in == '"')) { *out++ = '\\'; } *out++ = *in; }
Such a character-by-character approach is unlikely to provide the best possible performance on modern hardware.
Recent Intel processors have fast instructions (AVX-512) that are well suited for such problems. I decided to sketch a solution using Intel intrinsic functions. The routine goes as follows:
- I use two constant registers containing 64 copies of the backslash character and 64 copies of the quote characters.
- I start a loop by loading 32 bytes from the input.
- I expands these 32 bytes into a 64 byte register, interleaving zero bytes.
- I compare these bytes with the quotes and backslash characters.
- From the resulting mask, I then construct (by shifting and blending) escaped characters.
- I ‘compress’ the result, removing the zero bytes that appear before the unescaped characters.
- I advance the output pointer by the number of written bytes and I continue the loop.
The C++ code roughly looks like this…
__m512i solidus = _mm512_set1_epi8('\\'); __m512i quote = _mm512_set1_epi8('"'); for (; in + 32 <= finalin; in += 32) { __m256i input = _mm256_loadu_si256(in); __m512i input1 = _mm512_cvtepu8_epi16(input); __mmask64 is_solidus = _mm512_cmpeq_epi8_mask(input1, solidus); __mmask64 is_quote = _mm512_cmpeq_epi8_mask(input1, quote); __mmask64 is_quote_or_solidus = _kor_mask64(is_solidus, is_quote); __mmask64 to_keep = _kor_mask64(is_quote_or_solidus, 0xaaaaaaaaaaaaaaaa); __m512i shifted_input1 = _mm512_bslli_epi128(input1, 1); __m512i escaped = _mm512_mask_blend_epi8(is_quote_or_solidus, shifted_input1, solidus); _mm512_mask_compressstoreu_epi8(out, to_keep, escaped); out += _mm_popcnt_u64(_cvtmask64_u64(to_keep)); }
This code can be greatly improved. Nevertheless, it is a good first step. What are the results an Intel icelake processor using GCC 11 (Linux) ? A simple benchmark indicates a 5x performance boost compared to a naive implementation:
regular code | 0.6 ns/character |
AVX-512 code | 0.1 ns/character |
It looks quite encouraging ! My source code is available. I require a recent x64 processor with AVX-512 VBMI2 support.
How would this be extended to quote *all* characters that need quoting? Tabs, backslashes and whatnot.
It gets trickier, evidently. I might do it in a future blog post.
This whole post is outdated. Why do coders fail to keep up with hardware development? The two go hand in hand. Intel have removed avx512 instructions. But the good news is the upcoming 7000 series from AMD have introduced them.
By “removed” you mean willingly installed a BIOS update that removes the AVX512 support? I’ve go an Alder Lake that has AVX512, and it’ll stay that way, because I can read BIOS changelogs.
If you expect Users to change bios settings in order to run your code on their systems, you will not have any users. Of course if you are writing software for private use this is not a concern.
Hi Daniel,
I made a very similar discovery quite a few years ago in one of my university courses. It was a course on string search algorithms in which we all had to implement one of the algorithms. I decided to to a simple sliding window algorithm. Which one exactly I don’t remember but the catch was that I implemented it in x64 Assembler with the use of AVX instructions.
At the end of the course we did a comparision of the different implementations on one machine and needless to say mine blew everything away by far.
More modern implementations with dictionaries and other improvements stood no chance because of the overhead introduced by the programming languages used.
Your approach using C++ is for sure way easier to implement as I spent weeks learning and debugging assembler in NASM. After all that I wondered if there is any company that’s so much in need of performance-optimised code.
`VPEXPANDB` can alternatively be used to do this. You may need to use a PDEP/PEXT combo on the mask to get it to the right place though.
I wanted to use `VPEXPANDB initially but I found it easier to go in the other direction.
Would it still be 5x faster? I am just worrying a bit that someone inexperienced goes and says “yay, five times faster!” and implements a quote without realizing that it does not quote everything. And things like that tend to turn into vulnerabilities.
I wouldn’t be surprised if it’s actually faster, due to reducing shuffle port pressure, but I have no benchmarks. And yes, it fully works. See the following for a bit of a writeup (aim is different, but concept is same):
https://www.reddit.com/r/simd/comments/x10e3x/avx512vbmi2_doubling_space/
Does it go any faster if you use pshufb then cmpeq_mask instead of two cmpeq_mask and an or?
How would that work? Can you provide a code sample?
Looks something like
is_quote_or_solidus = _mm512_cmpeq_epi8_mask(
input1,
_mm512_shuffle_epi8(input1, _mm512_broadcast_i32x4(_mm_set_epi8(
0, 0, 0, ‘\\’, 0, 0, 0, 0, 0, 0, 0, 0, 0, ‘”‘, 0, -1
)))
);
Works better if there’s more characters to match, as long as the bottom 4 bits are unique between them (if not, you can prod the data to make it so).
I think I step 4 you mean “compare”, not “copy these bytes with…’
Thanks.
The blend __m512i escaped = _mm512_mask_blend_epi8(is_quote_or_solidus, shifted_input1, solidus); can be an unconditional bitwise-OR escaped = _mm512_or_si512(shifted_input1, _mm512_set1_epi16(‘\\’)). Later on, the compress applies the mask.
Correct.
Thanks for interesting insights! I (an mechatronics engineer working mainly with Python) spent whole evening reading about SIMD/AVX512 and NEON.
I tried to compile the espace.cpp – but onfortunately it does not build with g++ 9.4 (on my ubuntu20.04-wsl2 windows machine).
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95483
I think you need min. gcc-11.1.0 for your benchmark.
You definitively need a compiler with full support for VBMI2.
This whole post is outdated. Why do coders fail to keep up with hardware development? The two go hand in hand. Intel have removed avx512 instructions. But the good news is the upcoming 7000 series from AMD have introduced them.
Intel has not removed AVX-512 instructions. Some of its processors support AVX-512 instructions, others do not.
Nice article.
Wondering if similar approach can be used implementing KISS frame escapes. I KISS frame every 0xC0 must be replaced into 0xDB and 0xDC and similarly for 0xDB – to another two bytes .