Quickly checking whether a string needs escaping

In software, we often represent strings by surrounding them with quotes ("). What happens if the string itself contains quotes? We then need to escape the string. For example, the quote character (") or the backslash character (\) should be replaced by \" or \\. Most programmers are familiar with this process.

Most strings do not need to be escaped. It is thus useful to quickly check whether a string requires escaping.

In the popular interchange format JSON, strings must be escaped if they contain the quote character, the backslash character or any ASCII control character (i.e., with a code point less than 32).

How might we check such a string? Let us assume that we are using C++. A reasonable function might look as follows.

bool simple_needs_escaping(std::string_view v) {
  for (char c : v) {
    if ((uint8_t(c) < 32) | (c == '"') | (c == '\\')) {
      return true;
    }
  }
  return false;
}

The function takes a std::string_view named v as its argument. It iterates over each character c in the input string v. Inside the loop, we first use the comparison
(uint8_t(c) < 32) which checks if the ASCII value of the character is less than 32. This condition covers control characters (such as newline, tab, etc.). Then the comparison (c == '"') checks if the character is a double quote (") and (c == '\\') checks if the character is a backslash (\). If any of the above conditions are true for a character, the function returns true, indicating that the character needs escaping. If none of the conditions are met for any character, the function returns false, indicating that no escaping is needed.

Importantly, this function should exit as soon as a character needing escaping is found. If we expect that no such character will be found, we might try a different approach where we always scan the whole input. This allows the compiler to try other optimizations. In particular, the compiler is more likely to use autovectorization: the ability of our optimizing compiler to compiler our code using fancy instructions like Single Instruction, Multiple Data (SIMD) instructions. I call such a function “branchless” as a reference to the fact that it does not branch out. The result might look as follows:

bool branchless_needs_escaping(std::string_view v) {
  bool b = false;
  for (char c : v) {
    b |= ((uint8_t(c) < 32) | (c == '"') | (c == '\\'));
  }
  return b;
}

We might also try a variation with table lookups. Instead of doing arithmetic and comparisons, we build a single table containing for all possible byte value whether it requires escaping or not.

It takes bit more effort but the result can be quite fast because each character is checked with a single load from a table, along with maybe or two additional instructions.

static constexpr std::array<uint8_t, 256> json_quotable_character =
    []() constexpr {
  std::array<uint8_t, 256> result{};
  for (int i = 0; i < 32; i++) {
    result[i] = 1;
  }
  for (int i : {'"', '\\'}) {
    result[i] = 1;
  }
  return result;
}
();

bool table_needs_escaping(std::string_view view) {
  uint8_t needs = 0;
  for (uint8_t c : view) {
    needs |= json_quotable_character[c];
  }
  return needs;
}

Can we do better? We might if we use SIMD instructions deliberately such as NEON or SSE2. For the most part, your computer is likely either an ARM machine, supporting at least NEON instructions or an x64 machine supporting at least SSE2 instructions. It is easy to distinguish at compile time between these two scenarios. Of course, your processor might support even fancier instructions, but NEON and SSE2 should be good enough to get a good speedup, especially if the strings are not very long.

A good general strategy is to load the data in blocks of 16 bytes and do a few comparisons over these 16 bytes. The magic of SIMD instructions is that it can do 16 comparisons at once, so it can be much faster than character by character processing. What about the case where we have fewer than 16 characters? If you do not want to read past the string, you can simply fall back on one of our more conventional functions.

The NEON code might look as follows:

bool simd_needs_escaping(std::string_view view) {
  if (view.size() < 16) {
    return simple_needs_escaping(view);
  }
  size_t i = 0;
  static uint8_t rnt_array[16] = {1, 0, 34, 0, 0,  0, 0, 0,
                                  0, 0, 0,  0, 92, 0, 0, 0};
  const uint8x16_t rnt = vld1q_u8(rnt_array);
  uint8x16_t running{0};
  for (; i + 15 < view.size(); i += 16) {
    uint8x16_t word = vld1q_u8((const uint8_t *)view.data() + i);
    running = vorrq_u8(
        running,
        vceqq_u8(vqtbl1q_u8(rnt, vandq_u8(word, vdupq_n_u8(15))), word));
    running = vorrq_u8(running, vcltq_u8(word, vdupq_n_u8(32)));
  }
  if (i < view.size()) {
    uint8x16_t word =
        vld1q_u8((const uint8_t *)view.data() + view.length() - 16);
    running = vorrq_u8(
        running,
        vceqq_u8(vqtbl1q_u8(rnt, vandq_u8(word, vdupq_n_u8(15))), word));
    running = vorrq_u8(running, vcltq_u8(word, vdupq_n_u8(32)));
  }
  return vmaxvq_u32(vreinterpretq_u32_u8(running)) != 0;
}

The SSE2 code might look at follows:

inline bool simd_needs_escaping(std::string_view view) {
  if (view.size() < 16) {
    return simple_needs_escaping(view);
  }
  size_t i = 0;
  __m128i running = _mm_setzero_si128();
  for (; i + 15 < view.size(); i += 16) {
    __m128i word = _mm_loadu_si128((const __m128i *)(view.data() + i));
    running = _mm_or_si128(running, _mm_cmpeq_epi8(word, _mm_set1_epi8(34)));
    running = _mm_or_si128(running, _mm_cmpeq_epi8(word, _mm_set1_epi8(92)));
    running = _mm_or_si128(
        running, _mm_cmpeq_epi8(_mm_subs_epu8(word, _mm_set1_epi8(31)),
                                _mm_setzero_si128()));
  }
  if (i < view.size()) {
    __m128i word =
        _mm_loadu_si128((const __m128i *)(view.data() + view.length() - 16));
    running = _mm_or_si128(running, _mm_cmpeq_epi8(word, _mm_set1_epi8(34)));
    running = _mm_or_si128(running, _mm_cmpeq_epi8(word, _mm_set1_epi8(92)));
    running = _mm_or_si128(
        running, _mm_cmpeq_epi8(_mm_subs_epu8(word, _mm_set1_epi8(31)),
                                _mm_setzero_si128()));
  }
  return _mm_movemask_epi8(running) != 0;
}

You can optimize further the SIMD-based functions to reduce the number of instructions, but they already use far fewer when processing  blocks of 16 bytes than conventional functions.

It can be tricky to benchmark such functions. You will find much difference depending on your compiler and your processor. And the results are sensitive to the data, obviously. However my experience is that the SIMD approaches usually win, by a lot. To test it out, I wrote a small benchmark. In my benchmark, I use a few strings, of different lengths. Some of my strings have only a handful of characters, and some are short sentences. I have dozens of strings. None of the strings require escaping: I believe that this is common scenario.

system simple branchless table SIMD
Linux GCC 12, Intel Ice Lake (3.2 GHz) 0.65 GB/s 0.91 GB/s 1.9 GB/s 5.2 GB/s
Linux LLVM 16, Intel Ice Lake (3.2 GHz) 0.91 GB/s 2.6 GB/s 2.5 GB/s 5.7 GB/s

In these results, the table-based approach is quite competitive.  However, it can be beaten by the branchless  approach when using LLVM/clang due to its good ability to autovectorize the code.

Yet, in all instances, the hand-coded SIMD functions are at least twice as fast. As usual, my source code is available and I invite you to run your own benchmarks.

Note: The character with code point value 127 is also a control character in ASCII. Furthermore, Unicode has many control characters.

Daniel Lemire, "Quickly checking whether a string needs escaping," in Daniel Lemire's blog, May 31, 2024.

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

11 thoughts on “Quickly checking whether a string needs escaping”

  1. I don’t understand how the Neon version of this code is intended to work. It does the right thing for control characters of course, but I don’t see how vqtbl1q_u8 as used will do anything useful to detect values (like 34 and 92) that are outwith the range 0..15.

    Handling the excess at the end of the string by checking a few characters twice to make the last block back up to a block of 16 is a very nice trick.

      1. I had added a simple main() routine and run it. In my testing, it returned 0 for input containing ” or \.

        1. Oh, I realise now why 34 and 92 are in the particular rnt_array slots that they are in; the code produces the expected results when all input characters are masked into the table’s domain with e.g.

          … vqtbl1q_u8(rnt, vandq_u8(word, vdupq_n_u8(15))) …

  2. You could probably get even faster results by pipelining onto multiple accumulators or “running” totals, instead of a single accumulator.

    1. Seems to be worth it for the scalar implementations, not for SIMD, given the tests have relatively short strings.

      I also noticed that the NEON version can save an operation via the following, and adjusting the table accordingly:

      running = vorrq_u8(
      running,
      vceqq_u8(
      vqtbl1q_u8(rnt, vshrq_n_u8(word, 5)),
      vmaxq_u8(word, vdupq_n_u8(31))
      ));

  3. Since ” is 34, you can replace (c < 32) | (c == '"') with (c^2) < 33. I can't get consistent results though whether that's any faster, seems to depend on the compiler.

Leave a Reply

Your email address will not be published.

You may subscribe to this blog by email.