Scan HTML faster with SIMD instructions: .NET/C# Edition

Recently, the two major Web engines (WebKit and Chromium) adopted fast SIMD routines to scan HTML content. The key insight is to use vectorized classification (Langdale and Lemire, 2019): you load blocks of characters and identify the characters you seek using a few instructions. In particular, we use ‘SIMD instructions’, special instructions that are available on practically all modern processors and can process 16 bytes or more at once.

The problem that WebKit and Chromium solve is to jump to the next relevant character: one of <, &, \r and \0. Thus we must quickly identify whether one of these characters appears in a block. On my Apple MacBook, a fast SIMD-based approach can scan an HTML page at about 7 GB/s, with code written in C/C++.

But what about C#? The recent C# runtime (.NET 8) supports fast SIMD instructions.
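
These intrinsics are processor-specific, so code that relies on them typically selects a path at run time; a minimal sketch (the branch bodies are placeholders):

using System.Runtime.Intrinsics.Arm;
using System.Runtime.Intrinsics.X86;

// Pick a code path based on what the current processor supports.
if (AdvSimd.Arm64.IsSupported)
{
    // ARM NEON path, as in the function below
}
else if (Avx2.IsSupported)
{
    // x64 AVX2 path
}
else
{
    // scalar fallback
}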

Let us first consider a simple version of the function:

// Scan byte by byte; stop with start pointing at the first
// '<', '&', '\r' or '\0' (or at end if none is found).
public unsafe static void NaiveAdvanceString(ref byte* start, 
                                             byte* end)
{
  while (start < end)
  {
    if (*start == '<' || *start == '&' 
         || *start == '\r' || *start == '\0')
    {
       return;
    }
    start++;
  }
}

This function just visits each character, one by one, and compares it against the target characters. If a target character is found, we return.

Let us consider a SIMD version of the same function. It is slightly more complicated.

public unsafe static void SIMDAdvanceString(ref byte* start, 
                                            byte* end)
{
    // Lookup table indexed by the low nibble of each byte. The four
    // target characters (\0 = 0x00, \r = 0x0d, & = 0x26, < = 0x3c)
    // have distinct low nibbles, so each one sits at the index given
    // by its own low nibble.
    Vector128<byte> low_nibble_mask = Vector128.Create(0, 0, 0, 0, 0, 
                  0, (byte)0x26, 0, 0, 0, 0, 0, (byte)0x3c, 
                 (byte)0xd, 0, 0);
    Vector128<byte> v0f = Vector128.Create((byte)0x0F);
    // Lane i holds 16 - i; combined with MaxAcross, this recovers
    // the index of the first matching byte in a block.
    Vector128<byte> bit_mask = Vector128.Create((byte)16, 15, 14, 13, 
                        12, 11, 10, 9, 8,
                        7, 6, 5, 4, 3, 2, 1);

    const int stride = 16;
    while (start + (stride - 1) < end)
    {
        Vector128<byte> data = AdvSimd.LoadVector128(start);
        // A byte b matches when table[b & 0x0F] == b.
        Vector128<byte> lowpart 
           = AdvSimd.Arm64.VectorTableLookup(low_nibble_mask, data & v0f);
        Vector128<byte> matchesones = AdvSimd.CompareEqual(lowpart, 
                                            data);
        if (matchesones != Vector128<byte>.Zero)
        {
            Vector128<byte> matches = AdvSimd.And(bit_mask, 
                                          matchesones);
            // The maximum across lanes is 16 - (index of first match).
            int m = AdvSimd.Arm64.MaxAcross(matches).ToScalar();
            start += 16 - m;
            return;
        }
        start += stride;
    }

    // Scalar fallback for the final (fewer than 16) bytes.
    while (start < end)
    {
        if (*start == '<' || *start == '&' || *start == '\r' 
                || *start == '\0')
        {
            return;
        }
        start++;
    }
}

The function takes two pointers (ref byte* start and byte* end) that mark the beginning and the end of the byte array. The main loop runs as long as start is at least 16 bytes away from end: this ensures there is enough data for the vectorized operations. We load into the variable data 16 bytes from the memory pointed to by start. A vectorized table lookup followed by a comparison quickly identifies the target characters. The code then checks whether any element in matchesones is non-zero. If there is a match, we locate the first one (out of 16 characters): bit_mask assigns the value 16 - i to lane i, so the maximum across the matching lanes is 16 minus the index of the first match. We advance start to that position and return. If no match is found, we advance by 16 characters and repeat. We conclude with a fallback loop that processes the leftover data (fewer than 16 bytes).
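
To see why the table lookup classifies characters correctly, consider a scalar sketch of the same idea (the table contents come from the function above; the helper name is mine). The four target bytes happen to have distinct low nibbles, so each can be stored at the index given by its own low nibble, and a byte b is a target exactly when the table entry at index b & 0x0F equals b:

// Scalar equivalent of the vectorized classification above.
// Each target byte sits at the index given by its own low nibble:
// \0 = 0x00 at index 0, & = 0x26 at index 6, < = 0x3C at index 12,
// \r = 0x0D at index 13.
static readonly byte[] Table =
{
    0x00, 0, 0, 0, 0, 0, 0x26, 0, 0, 0, 0, 0, 0x3C, 0x0D, 0, 0
};

// '<' = 0x3C has low nibble 12 and Table[12] == 0x3C: a match.
// 'a' = 0x61 has low nibble 1 and Table[1] == 0x00: not a match.
static bool IsTarget(byte b) => Table[b & 0x0F] == b;

VectorTableLookup performs sixteen such lookups in a single instruction.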

As an optimization, it is helpful to use a local variable instead of the reference parameter. Doing so improves the performance substantially: C# is not happy when we repeatedly modify a reference. Thus, at the start of the function, you may set byte* mystart = start, use mystart throughout, and then, just before returning, set start = mystart.
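
In outline, the restructured function might look as follows (a sketch: only the pointer handling changes, the scanning logic is as before):

public unsafe static void SIMDAdvanceString(ref byte* start, byte* end)
{
    byte* mystart = start; // local copy: avoids repeated writes through the ref
    while (mystart + 15 < end)
    {
        // ... same vectorized block as above, operating on mystart ...
        // on a match: mystart += 16 - m; start = mystart; return;
        mystart += 16;
    }
    // ... scalar fallback, also operating on mystart ...
    start = mystart; // write back through the reference once
}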

The .NET runtime library also has a fast SearchValues class to help search for characters.

SearchValues<byte> searchValues = SearchValues.Create(
   stackalloc byte[] { 0, 13, 38, 60 });

ReadOnlySpan<byte> data = allLinesUtf8;
while (!data.IsEmpty)
{
    int first = data.IndexOfAny(searchValues);
    // Skip past each match; stop once no target character remains.
    data = first >= 0 ? data.Slice(first + 1) : default;
}

I wrote a benchmark (in C#) that you can run if you have an ARM-based processor.

Conventional 1.4 GB/s
SearchValues 4.2 GB/s
SIMD (ARM NEON) 6.3 GB/s

Incredibly, the SIMD-based function is over 4 times faster than the conventional function in these tests, and the accelerated C# function is about 15% slower than the C++ version. The non-SIMD C# version is also slightly slower than the C++ version.

Harold Aptroot provided support for x64 processors (up to AVX2), so I extended my benchmark to an Intel Ice Lake system:

Conventional 1.0 GB/s
SearchValues 3.8 GB/s
SIMD (x64 AVX2) 7.6 GB/s

This time, the SIMD version is over 7 times faster than the scalar version. In fact, it matches the performance numbers that I get with C/C++.

It is also important for performance to write the code in such a way that the C# compiler tends to inline the scanning function, since it is called repeatedly. Initially, I had written the benchmark with some abstraction, using a delegate function, but it limited the best possible speed.
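
To illustrate (a hypothetical sketch, not the actual benchmark code): a call through a delegate is an indirect call that the JIT will not inline, whereas a direct call to a method marked for aggressive inlining can be folded into the hot loop:

using System.Runtime.CompilerServices;

// Indirect call through a delegate: the JIT cannot inline the target.
unsafe delegate void Scanner(ref byte* start, byte* end);

// Direct call, with a hint encouraging the JIT to inline the method.
[MethodImpl(MethodImplOptions.AggressiveInlining)]
static unsafe void Scan(ref byte* start, byte* end)
{
    // ... scanning logic ...
}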

In other words, .NET/C# allows you to write fast code using SIMD instructions. It may be well worth the effort.

Daniel Lemire, "Scan HTML faster with SIMD instructions: .NET/C# Edition," in Daniel Lemire's blog, July 5, 2024.


4 thoughts on “Scan HTML faster with SIMD instructions: .NET/C# Edition”

  1. I realise that this is illustrating a general point about SIMD, but it might be worth pointing out that SearchValues is available in the BCL, and makes vectorized string search super-accessible to ordinary mortals.

    https://learn.microsoft.com/en-us/dotnet/api/system.buffers.searchvalues-1?view=net-8.0

    A colleague of mine has written a post about it and the kind of perf improvements you get.

    https://endjin.com/blog/2024/01/dotnet-8-searchvalues-string-search-performance-boost

  2. I’m the colleague Matthew mentions in the earlier comment. I thought I’d try running your benchmark and pitting it against the .NET runtime class library’s SearchValues implementation.

    SearchValues is necessarily more general, and in .NET 8.0 at least, it doesn’t spot the optimization that your hand-coded implementation uses. If I’ve understood your code correctly, you’ve exploited the fact that the values you’re looking for all happen to have distinct values in their 4 least significant bits. This has enabled you to use a table lookup (‘shuffle’ as AVX2 calls it) indexing off the lower 4 bits. There are searches that would break this. For example, if you wanted to search for either “ ” (space) or “0”, that would defeat the optimization you’ve used.

    This 4-bit lookup optimization is evidently effective on my system (an Intel i9-9900K which supports AVX2) because I see these results:

    SIMD: 7.65 GB/s, Naive: 0.66 GB/s, SearchValues: 5.06 GB/s

    Here’s my code:

    private static readonly SearchValues<byte> searchValues
        = SearchValues.Create("<&\r\0"u8);

    [Benchmark]
    [BenchmarkCategory("default")]
    public unsafe void SearchValuesScan()
    {
        if (allLinesUtf8 != null)
        {
            ReadOnlySpan<byte> data = allLinesUtf8;
            while (!data.IsEmpty)
            {
                int first = data.IndexOfAny(searchValues);
                data = data.Slice(first >= 0 ? first + 1 : 1);
            }
        }
    }

    So the optimization for this ‘unique lower nibble’ case is clearly offering a significant boost on my system, at least. It would be interesting to know how this SearchValues implementation fares on the systems you tested on, because your results look quite different from mine: the naive implementation is *much* slower on my (fairly old) system, but your SIMD one is fairly similar.

    If the .NET runtime hadn’t provided its own vectorised solution (SearchValues) to this problem, the hand-crafted SIMD version would look extremely attractive. But given that there’s a built-in implementation that gets you 66% of the perf in this case, and which handles the more general cases (in particular, those where this unique lower nibble characteristic does not hold), direct use of the SIMD instructions looks like a much more rarefied technique: evidently scenarios exist where it offers a non-trivial improvement, but it’s nothing like the 5-10x speedup that’s possible with the naive implementation as your baseline. This is well into the realms of “do we care enough about this 1.5x speedup to expend the engineering effort to develop and maintain the more complex and fussier implementation, and does the special case the optimization depends on hold in enough of the cases we care about?”

    The answer might still be yes for sufficiently performance sensitive work. But I think it’s really useful to know that the .NET runtime libraries already have a built-in implementation that gets you the majority of the benefit for minimal effort.
