Validating gigabytes of Unicode strings per second… in C#?

We have been working on a fast library to validate and transcode Unicode and other formats such as base64 in C++: simdutf. We wondered: could we achieve the same good results in C#?

Microsoft’s .NET framework has made great strides in leveraging advanced instructions. For instance, if your processor supports AVX-512, you can instantiate 512-bit registers right in C#! The standard .NET runtime library effectively utilizes these features, demonstrating that they practice what they preach.

Most strings on the Internet are Unicode strings stored in UTF-8. When you ingest such strings (from disk or from the network), you need to validate them. To test the waters, we set our eyes on UTF-8 validation. With John Keiser, I helped design a fast UTF-8 validation algorithm for modern-day CPUs. We call the algorithm ‘lookup’. It may require less than one instruction per byte to validate even challenging input. The lookup validation algorithm is used in Oracle GraalVM, Google Fuchsia, the Node.js and Bun JavaScript runtimes, and so forth.

The .NET library has its own fast UTF-8 validation function: Utf8Utility.GetPointerToFirstInvalidByte. It is highly optimized. As the name implies, it finds the location of the first byte where an error occurs, if any. It also computes some parameters from which you can tell how the input could be transcoded. It is an internal function, but we can expose it by copying it into our own code.
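To make the task concrete, here is a simplified scalar sketch of what a "find the first invalid byte" routine must check; this is an illustrative C++ reconstruction of the textbook rules, not the .NET implementation or our SIMD algorithm.

```cpp
#include <cstddef>
#include <cstdint>

// Simplified scalar routine in the spirit of GetPointerToFirstInvalidByte:
// returns a pointer to the first byte that cannot start or continue a valid
// UTF-8 sequence, or buf + len if the whole buffer is valid.
const uint8_t* first_invalid_byte(const uint8_t* buf, size_t len) {
  size_t i = 0;
  while (i < len) {
    uint8_t b = buf[i];
    if (b < 0x80) { i++; continue; }                      // ASCII fast path
    size_t trailing; uint32_t min_cp, cp;
    if ((b & 0xE0) == 0xC0)      { trailing = 1; min_cp = 0x80;    cp = b & 0x1Fu; }
    else if ((b & 0xF0) == 0xE0) { trailing = 2; min_cp = 0x800;   cp = b & 0x0Fu; }
    else if ((b & 0xF8) == 0xF0) { trailing = 3; min_cp = 0x10000; cp = b & 0x07u; }
    else return buf + i;                                  // stray continuation or invalid lead
    if (i + trailing >= len) return buf + i;              // truncated sequence
    for (size_t k = 1; k <= trailing; k++) {
      uint8_t c = buf[i + k];
      if ((c & 0xC0) != 0x80) return buf + i;             // bad continuation byte
      cp = (cp << 6) | (c & 0x3Fu);
    }
    if (cp < min_cp) return buf + i;                      // overlong encoding
    if (cp >= 0xD800 && cp <= 0xDFFF) return buf + i;     // UTF-16 surrogate
    if (cp > 0x10FFFF) return buf + i;                    // beyond Unicode range
    i += trailing + 1;
  }
  return buf + len;                                       // fully valid
}
```

A branchy loop like this is exactly what SIMD algorithms such as lookup avoid: they apply the same checks to dozens of bytes at once.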

Could we beat the .NET runtime, at least some of the time? It seems that we can!

Maybe you want to know what our code looks like? Here is a simplified example where we load 64 bytes and check whether they are all ASCII.

// Load 64 bytes from the input buffer into a 512-bit register.
Vector512<byte> currentBlock = Avx512F.LoadVector512(pInputBuffer + processedLength);
// Collect the most significant bit of each byte: a set bit marks a non-ASCII byte.
ulong mask = currentBlock.ExtractMostSignificantBits();
if (mask == 0) {
  // got 64 ASCII bytes: nothing further to check
} else {
  // oh oh, got non-ASCII bytes: we need full UTF-8 validation
}
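For readers without AVX-512, the same idea can be expressed portably. Here is a hypothetical C++ analogue that ORs 8 bytes at a time into a 64-bit word and then inspects the high bit of each byte, a scalar stand-in for ExtractMostSignificantBits:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Portable sketch of the ASCII check above: instead of a 512-bit register,
// accumulate 8 bytes at a time into a 64-bit word. The input is all ASCII
// exactly when no byte has its most significant bit set.
bool is_ascii(const uint8_t* buf, size_t len) {
  uint64_t acc = 0;
  size_t i = 0;
  for (; i + 8 <= len; i += 8) {
    uint64_t word;
    std::memcpy(&word, buf + i, 8);            // safe unaligned load
    acc |= word;
  }
  for (; i < len; i++) acc |= buf[i];          // leftover bytes
  return (acc & 0x8080808080808080ULL) == 0;   // any high bit => non-ASCII
}
```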

Of course, the whole algorithm is more complicated, but not that much more… It is maybe 30 lines of code. We implement various versions of the algorithm, one for ARM processors, one for older x64 processors, and so forth.

For benchmarking, we use valid strings. The first one we check is twitter.json, a JSON file that is mostly ASCII with some non-trivial Unicode content within strings. We also use various synthetic strings representative of various languages.
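Our throughput numbers are of the usual kind: bytes processed divided by elapsed time. A minimal C++ harness sketch of that methodology, with `validate` standing in for any validation routine under test, might look like this:

```cpp
#include <chrono>
#include <cstddef>
#include <string>

// Illustrative benchmark harness: validate the same buffer many times and
// report gigabytes of input processed per second of wall-clock time.
template <typename F>
double gigabytes_per_second(F validate, const std::string& data, int rounds) {
  auto start = std::chrono::steady_clock::now();
  for (int r = 0; r < rounds; r++) {
    validate(data.data(), data.size());
  }
  std::chrono::duration<double> elapsed =
      std::chrono::steady_clock::now() - start;
  double bytes = double(data.size()) * rounds;
  return bytes / elapsed.count() / 1e9;  // GB/s
}
```

Real benchmarks also need warm-up rounds and repeated measurements; the numbers below come from a proper harness, not this sketch.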

On an Intel Ice Lake system, our validation function is up to 13 times faster than the standard library. On twitter.json, we are 2.4 times faster.

data set         SimdUnicode AVX-512 (GB/s)   .NET speed (GB/s)   speed up
Twitter.json     29                           12                  2.4 x
Arabic-Lipsum    12                           2.3                 5.2 x
Chinese-Lipsum   12                           3.9                 3.0 x
Emoji-Lipsum     12                           0.9                 13 x
Hebrew-Lipsum    12                           2.3                 5.2 x
Hindi-Lipsum     12                           2.1                 5.7 x
Japanese-Lipsum  10                           3.5                 2.9 x
Korean-Lipsum    10                           1.3                 7.7 x
Latin-Lipsum     76                           76                  1.0 x
Russian-Lipsum   12                           1.2                 10 x

On an Apple M2 system, our validation function is 1.5 to 4.1 times faster than the standard library.

data set         SimdUnicode speed (GB/s)     .NET speed (GB/s)   speed up
Twitter.json     25                           14                  1.8 x
Arabic-Lipsum    7.4                          3.5                 2.1 x
Chinese-Lipsum   7.4                          4.8                 1.5 x
Emoji-Lipsum     7.4                          2.5                 3.0 x
Hebrew-Lipsum    7.4                          3.5                 2.1 x
Hindi-Lipsum     7.3                          3.0                 2.4 x
Japanese-Lipsum  7.3                          4.6                 1.6 x
Korean-Lipsum    7.4                          1.8                 4.1 x
Latin-Lipsum     87                           38                  2.3 x
Russian-Lipsum   7.4                          2.7                 2.7 x

Observe that the standard library already provides a function that is quite fast: it runs at gigabytes per second. We are several times faster still; evidently, C# makes it possible to write highly optimized functions.

You can run your own benchmarks by grabbing our code from https://github.com/simdutf/SimdUnicode/.

It is a pleasure doing this performance-oriented work in C#. It is definitely one of my favorite programming languages right now.

One difficulty with ARM processors is that they have varied SIMD/NEON performance. For example, Neoverse N1 processors, not to be confused with the Neoverse V1 design used by AWS Graviton 3, have weak SIMD performance. Of course, one can pick and choose which approach is best, and it is not necessary to apply SimdUnicode in all cases. I expect good performance on recent ARM-based Qualcomm processors.
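Picking the best approach per CPU is typically done with one-time runtime dispatch. Here is a hypothetical C++ sketch, with a placeholder feature probe where a real library would query CPUID on x64 or HWCAP on ARM:

```cpp
#include <cstddef>
#include <cstdint>

using validator_fn = bool (*)(const uint8_t*, size_t);

// Baseline kernel that works everywhere.
static bool ascii_scalar(const uint8_t* buf, size_t len) {
  uint8_t acc = 0;
  for (size_t i = 0; i < len; i++) acc |= buf[i];
  return (acc & 0x80) == 0;
}

// Hypothetical "fast" kernel; in a real build this would be the SIMD version.
static bool ascii_fast(const uint8_t* buf, size_t len) {
  return ascii_scalar(buf, len);  // placeholder body for illustration
}

// Placeholder feature probe: a real one would check the CPU's actual
// SIMD capabilities (and, ideally, whether SIMD is fast on this core).
static bool cpu_has_fast_simd() { return false; }

// Choose the kernel once, at startup, then always call through the pointer.
validator_fn choose_validator() {
  return cpu_has_fast_simd() ? ascii_fast : ascii_scalar;
}
```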

The SimdUnicode library is joint work with Nick Nuon.

Daniel Lemire, "Validating gigabytes of Unicode strings per second… in C#?," in Daniel Lemire's blog, June 20, 2024.


2 thoughts on “Validating gigabytes of Unicode strings per second… in C#?”

  1. Great post! If you have ideas on how the standard libraries can be improved, we’d love to hear about it. As you’ve noted, the standard libraries are constantly evolving, in large part due to the ideas and contributions of the community. We’ve had successful engagements with academics and academic-adjacent folks in the past.

    https://github.com/dotnet/runtime is the place to start such a conversation.

    I’m familiar with your work. It is great to hear that you are enjoying using C# and fun to see you sending it through its paces. We very much welcome and appreciate folks demonstrating C# and the accompanying libraries at their limits.
