Escaping strings faster with AVX-512

When programming, we often have to ‘escape’ strings. A standard way to do it is to insert the backslash character (\) before some characters such as the double quote. For example, the string

my title is "La vie"


my title is \"La vie\"

A simple routine in C++ to escape a string might look as follows:

  for (...) {
    if ((*in == '\\') || (*in == '"')) {
      *out++ = '\\';
    *out++ = *in;

Such a character-by-character approach is unlikely to provide the best possible performance on modern hardware.

Recent Intel processors have fast instructions (AVX-512) that are well suited for such problems. I decided to sketch a solution using Intel intrinsic functions. The routine goes as follows:

  1. I use two constant registers containing 64 copies of the backslash character and 64 copies of the quote characters.
  2. I start a loop by loading 32 bytes from the input.
  3. I expands these 32 bytes into a 64 byte register, interleaving zero bytes.
  4. I compare these bytes with the quotes and backslash characters.
  5. From the resulting mask, I then construct (by shifting and blending) escaped characters.
  6. I ‘compress’ the result, removing the zero bytes that appear before the unescaped characters.
  7. I advance the output pointer by the number of written bytes and I continue the loop.

The C++ code roughly looks like this…

  __m512i solidus = _mm512_set1_epi8('\\');
  __m512i quote = _mm512_set1_epi8('"');
  for (; in + 32 <= finalin; in += 32) {
    __m256i input = _mm256_loadu_si256(in);
    __m512i input1 = _mm512_cvtepu8_epi16(input);
    __mmask64 is_solidus = _mm512_cmpeq_epi8_mask(input1, solidus);
    __mmask64 is_quote = _mm512_cmpeq_epi8_mask(input1, quote);
    __mmask64 is_quote_or_solidus = _kor_mask64(is_solidus, is_quote);
    __mmask64 to_keep = _kor_mask64(is_quote_or_solidus, 0xaaaaaaaaaaaaaaaa);
    __m512i shifted_input1 = _mm512_bslli_epi128(input1, 1);
    __m512i escaped =
        _mm512_mask_blend_epi8(is_quote_or_solidus, shifted_input1, solidus);
    _mm512_mask_compressstoreu_epi8(out, to_keep, escaped);
    out += _mm_popcnt_u64(_cvtmask64_u64(to_keep));

This code can be greatly improved. Nevertheless, it is a good first step. What are the results an Intel icelake processor using GCC 11 (Linux) ? A simple benchmark indicates a 5x performance boost compared to a naive implementation:

regular code 0.6 ns/character
AVX-512 code 0.1 ns/character

It looks quite encouraging ! My source code is available. I require a recent x64 processor with AVX-512 VBMI2 support.

Daniel Lemire, "Escaping strings faster with AVX-512," in Daniel Lemire's blog, September 14, 2022.

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

22 thoughts on “Escaping strings faster with AVX-512”

      1. This whole post is outdated. Why do coders fail to keep up with hardware development? The two go hand in hand. Intel have removed avx512 instructions. But the good news is the upcoming 7000 series from AMD have introduced them.

        1. By “removed” you mean willingly installed a BIOS update that removes the AVX512 support? I’ve go an Alder Lake that has AVX512, and it’ll stay that way, because I can read BIOS changelogs.

          1. If you expect Users to change bios settings in order to run your code on their systems, you will not have any users. Of course if you are writing software for private use this is not a concern.

  1. Hi Daniel,

    I made a very similar discovery quite a few years ago in one of my university courses. It was a course on string search algorithms in which we all had to implement one of the algorithms. I decided to to a simple sliding window algorithm. Which one exactly I don’t remember but the catch was that I implemented it in x64 Assembler with the use of AVX instructions.

    At the end of the course we did a comparision of the different implementations on one machine and needless to say mine blew everything away by far.
    More modern implementations with dictionaries and other improvements stood no chance because of the overhead introduced by the programming languages used.

    Your approach using C++ is for sure way easier to implement as I spent weeks learning and debugging assembler in NASM. After all that I wondered if there is any company that’s so much in need of performance-optimised code.

  2. `VPEXPANDB` can alternatively be used to do this. You may need to use a PDEP/PEXT combo on the mask to get it to the right place though.

    1. Would it still be 5x faster? I am just worrying a bit that someone inexperienced goes and says “yay, five times faster!” and implements a quote without realizing that it does not quote everything. And things like that tend to turn into vulnerabilities.

      1. Looks something like

        is_quote_or_solidus = _mm512_cmpeq_epi8_mask(
        _mm512_shuffle_epi8(input1, _mm512_broadcast_i32x4(_mm_set_epi8(
        0, 0, 0, ‘\\’, 0, 0, 0, 0, 0, 0, 0, 0, 0, ‘”‘, 0, -1

        Works better if there’s more characters to match, as long as the bottom 4 bits are unique between them (if not, you can prod the data to make it so).

  3. The blend __m512i escaped = _mm512_mask_blend_epi8(is_quote_or_solidus, shifted_input1, solidus); can be an unconditional bitwise-OR escaped = _mm512_or_si512(shifted_input1, _mm512_set1_epi16(‘\\’)). Later on, the compress applies the mask.

  4. This whole post is outdated. Why do coders fail to keep up with hardware development? The two go hand in hand. Intel have removed avx512 instructions. But the good news is the upcoming 7000 series from AMD have introduced them.

  5. Nice article.

    Wondering if similar approach can be used implementing KISS frame escapes. I KISS frame every 0xC0 must be replaced into 0xDB and 0xDC and similarly for 0xDB – to another two bytes .

Leave a Reply

Your email address will not be published.

You may subscribe to this blog by email.