Scan HTML faster with SIMD instructions: Chrome edition

Modern processors have instructions to process several bytes at once. Effectively all processors have the capability of processing 16 bytes one once. These instructions are called SIMD, for single instruction, multiple data.

It was once an open question whether these instructions could be useful to accelerate common tasks such as parsing HTML or JSON. However, the work on JSON parsing, as in the simdjson parser, has shown rather decisively that SIMD instructions could, indeed, be helpful in breaking speed records.

Inspired by such work, the engine under the Google Chrome browser (Chromium) has adopted SIMD parsing of the HTML inputs. It is the result of the excellent work by a Google engineer, Anton Bikineev.

The approach is used to quickly jump to four specific characters: <, &, \r and \0. You can implement something that looks a lot like it using regular C++ code as follows:

void NaiveAdvanceString(const char *&start, const char *end) {
  for (;start < end; start++) {
    if(*start == '<' || *start == '&' 
        || *start == '\r' || *start == '\0') {
      return;
    }
  }
}

A ‘naive’ approach using the SIMD instructions available on ARM processors looks as follows. Basically, you just do more or less the same thing as the naive regular/scalar approach, except that instead of taking one character at a time, you take 16 characters at a time.

void AdvanceString(const char *&start, const char *end) {
  uint8x16_t quote_mask = vmovq_n_u8('<');
  uint8x16_t escape_mask = vmovq_n_u8('&');
  uint8x16_t newline_mask = vmovq_n_u8('\r');
  uint8x16_t zero_mask = vmovq_n_u8('\0');
  uint8x16_t bit_mask = {16, 15, 14, 13, 12, 11, 10, 9, 8,
                            7, 6, 5, 4, 3, 2, 1};
  static constexpr auto stride = 16;
  for (; start + (stride - 1) < end; start += stride) {
    uint8x16_t data = vld1q_u8(reinterpret_cast<const uint8_t *>(start));
    uint8x16_t quotes = vceqq_u8(data, quote_mask);
    uint8x16_t escapes = vceqq_u8(data, escape_mask);
    uint8x16_t newlines = vceqq_u8(data, newline_mask);
    uint8x16_t zeros = vceqq_u8(data, zero_mask);
    uint8x16_t mask = vorrq_u8(vorrq_u8(quotes,zeros), 
           vorrq_u8(escapes, newlines));
    uint8x16_t matches = vandq_u8(bit_mask, mask);
    int m = vmaxvq_u8(matches);
    if(m != 0) {
      start += 16 - m;
      return;
    }
  }  
  for (;start < end; start++) {
    if(*start == '<' || *start == '&' 
       || *start == '\r' || *start == '\0') {
      return;
    }
  }
}

If you have a PC with an Intel or AMD processor, you can do the equivalent using SSE2 instructions:

void AdvanceString(const char*& start, const char* end) {
    const __m128i quote_mask = _mm_set1_epi8('<');
    const __m128i escape_mask = _mm_set1_epi8('&');
    const __m128i newline_mask = _mm_set1_epi8('\r');
    const __m128i zero_mask = _mm_set1_epi8('\0');

    static constexpr auto stride = 16;
    for (; start + (stride - 1) < end; start += stride) {
        __m128i data = _mm_loadu_si128(
           reinterpret_cast<const __m128i*>(start));
        __m128i quotes = _mm_cmpeq_epi8(data, quote_mask);
        __m128i escapes = _mm_cmpeq_epi8(data, escape_mask);
        __m128i newlines = _mm_cmpeq_epi8(data, newline_mask);
        __m128i zeros = _mm_cmpeq_epi8(data, zero_mask);
        __m128i mask = _mm_or_si128(_mm_or_si128(quotes, zeros),                   
             _mm_or_si128(escapes, newlines));
        int m = _mm_movemask_epi8(mask);
        if (m != 0) {
            start += __builtin_ctz(m);
            return;
        }
    }

    // Process any remaining bytes (less than 16)
    while (start < end) {
        if (*start == '<' || *start == '&' 
             || *start == '\r' || *start == '\0') {
            return;
        }
        start++;
    }
}

You can do slightly better if you use an approach we call vectorize classification (see Langdale and Lemire, 2019). Basically, you combine a SIMD approach with vectorized table lookups to classify the characters. The ARM NEON version using two table lookups looks as follows:

void AdvanceStringTable(const char *&start, const char *end) {
  uint8x16_t low_nibble_mask = {0b0001, 0, 0, 0, 0, 0, 0b0100, 
          0, 0, 0, 0, 0, 0b0010, 0b1000, 0, 0};
  uint8x16_t high_nibble_mask = {0b1001, 0, 0b0100, 0b0010, 
          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
  uint8x16_t v0f = vmovq_n_u8(0xf);
  uint8x16_t bit_mask = {16, 15, 14, 13, 12, 11, 10, 9, 8,
                            7, 6, 5, 4, 3, 2, 1};
  static constexpr auto stride = 16;
  for (; start + (stride - 1) < end; start += stride) {
    uint8x16_t data = vld1q_u8(reinterpret_cast<const uint8_t *>(start));
    uint8x16_t lowpart = vqtbl1q_u8(low_nibble_mask, vandq_u8(data, v0f));
    uint8x16_t highpart = vqtbl1q_u8(high_nibble_mask,  
           vshrq_n_u8(data, 4));
    uint8x16_t classify = vandq_u8(lowpart, highpart);
    uint8x16_t matchesones = vtstq_u8(classify, vdupq_n_u8(0xFF));
    uint8x16_t matches = vandq_u8(bit_mask, matchesones);
    int m = vmaxvq_u8(matches);
    if(m != 0) {
      start += 16 - m;
      return;
    }
  }  
  for (;start < end; start++) {
    if(*start == '<' || *start == '&' || *start == '\r' 
     || *start == '\0') {
      return;
    }
  }
}

This version is close to Bikineev’s code as it appears in the Google Chrome engine, except that I use standard instrinsics while Google engineers prefer to use the excellent highway SIMD library by Jan Wassenberg.

We can do slightly better in this instance because Bikineev only needs to distinguish between four characters. A single table lookup is needed. We did not elaborate in Langdale and Lemire (2019) but vectorized classification works using one, two, three or more table lookups, depending on the complexity of the target set. The simpler version might look as follows:

void AdvanceStringTableSimpler(const char *&start, const char *end) {
  uint8x16_t low_nibble_mask = {0, 0, 0, 0, 0, 0, 0x26, 0, 0, 
                            0, 0, 0, 0x3c, 0xd, 0, 0};
  uint8x16_t v0f = vmovq_n_u8(0xf);
  uint8x16_t bit_mask = {16, 15, 14, 13, 12, 11, 10, 9, 8,
                            7, 6, 5, 4, 3, 2, 1};
  static constexpr auto stride = 16;
  for (; start + (stride - 1) < end; start += stride) {
    uint8x16_t data = vld1q_u8(reinterpret_cast<const uint8_t *>(start));
    uint8x16_t lowpart = vqtbl1q_u8(low_nibble_mask, vandq_u8(data, v0f));
    uint8x16_t matchesones = vceqq_u8(lowpart, data);
    uint8x16_t matches = vandq_u8(bit_mask, matchesones);
    int m = vmaxvq_u8(matches);
    if(m != 0) {
      start += 16 - m;
      return;
    }
  }  
  for (;start < end; start++) {
    if(*start == '<' || *start == '&' 
     || *start == '\r' || *start == '\0') {
      return;
    }
  }
}

How do these three techniques compare? I wrote a small benchmark where I scan the HTML of the Google home page. I ran the benchmark on my Apple M2 laptop (LLVM 15).

method speed instructions/byte
naive (scalar) 2.0 GB/s 9.8 instructions/byte
naive (SIMD) 5.8 GB/s 2.1 instructions/byte
vectorized classification (2 lookups) 6.0 GB/s 2.0 instructions/byte
vectorized classification (1 lookup) 6.8 GB/s 1.8 instructions/byte

The results follow my expectations: the simplest vectorized classification routine has the best performance. However, you may observe that even a rather naive SIMD approach can be quite fast in this instance.

If you have an old SSE2 PC, only the simple SIMD approach is available. My results suggest that it might be good enough to get good results.

Daniel Lemire, "Scan HTML faster with SIMD instructions: Chrome edition," in Daniel Lemire's blog, June 8, 2024.

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

Leave a Reply

Your email address will not be published.

You may subscribe to this blog by email.