Most systems today rely on Unicode strings. However, we have two popular Unicode formats: UTF-8 and UTF-16. We often need to convert from one format to the other. For example, you might have a database formatted with UTF-16, but you need to produce JSON documents using UTF-8. This conversion is often called ‘transcoding’.
In the last few years, we wrote a specialized library that processes Unicode strings, with a focus on performance: the simdutf library. The library is used by JavaScript runtimes such as Node.js and Bun.
The simdutf library is able to benefit from the latest and most powerful instructions on your processors. In particular, it does well with recent processors with AVX-512 instructions (Intel Ice Lake, Rocket Lake, as well as AMD Zen 4).
I do not yet have a Zen 4 processor, but Velu Erwan was kind enough to benchmark it for me. A reasonable task is to transcode an Arabic file from UTF-8 to UTF-16: it is typically a non-trivial task because Arabic UTF-8 is a mix of one-byte and two-byte characters that we must convert to two-byte UTF-16 characters (with validation). The steps required (under Linux) are as follows:
git clone https://github.com/simdutf/simdutf && cd simdutf && cmake -B build && cmake --build build && wget --content-disposition https://cutt.ly/d2cIxRx && ./build/benchmarks/benchmark -F Arabic-Lipsum.utf8.txt -P convert_utf8
(Ideally, run the last command with privileged access to the performance counters.)
Like Intel, AMD has its own compiler. I did not have access to the Intel compiler for my tests, but Velu had access to the AMD compiler.
A sensible reference point is the iconv function, as provided by the runtime library. Note that the AMD processor runs at a much higher frequency than the Intel one (5.4 GHz vs. 3.4 GHz). We use GCC 12 unless otherwise specified.
transcoder | Intel Ice Lake (GCC) | AMD Zen 4 (GCC) | AMD Zen 4 (AMD compiler) |
---|---|---|---|
iconv | 0.70 GB/s | 0.97 GB/s | 0.98 GB/s |
simdutf | 7.8 GB/s | 11 GB/s | 12 GB/s |
At a glance, the Zen 4 processor is slightly less efficient on a per-cycle basis when running the simdutf AVX-512 code (2.8 instructions/cycle on AMD versus 3.1 instructions/cycle on Intel). Keep in mind, however, that we did not have access to a Zen 4 processor when tuning our code. The efficiency difference is small enough that we can consider the processors roughly on par, pending further investigation.
The big difference is that the AMD Zen 4 runs at a much higher frequency. If I rely on Wikipedia, no Ice Lake processor matches 5.4 GHz, though some Rocket Lake processors come close.
In our benchmarks, we track the CPU frequency, and we measure the same frequency when running the AVX-512 code as when running conventional code (iconv). Thus AVX-512 can be a real advantage.
These results suggest that AMD Zen 4 is matching Intel Ice Lake in AVX-512 performance. Given that the Zen 4 microarchitecture is the first AMD attempt at supporting AVX-512 commercially, it is a remarkable feat.
Further reading: AMD Zen 4 performance while parsing JSON (phoronix.com).
Note: Raw AMD results are available: GCC 12 and AOCC.
Credit: Velu Erwan got the processor from AMD France. The exact specification is AMD 7950X, 2x16GB DDR5 4800MT reconfig as 5600MT. The UTF-8 to UTF-16 transcoder is largely based on the work of Robert Clausecker.
It’s worth noting that Zen 4 implements AVX-512 by splitting execution into two 256-bit stages, so instructions take twice as many cycles (at least those that are 1-cycle on Intel; for complex instructions the difference is less than 2x, and in fact Zen 4 has powerful shuffling units, IIRC).
Evidently, this does not seem to affect the performance negatively in a significant manner, at least in these tests. Note that we make extensive use of AVX-512.
That is not true. It uses two 256-bit units (if available), but the instructions take only the same number of cycles.
Intel made their compilers available at no cost some time back. They can be found at https://www.intel.com/content/www/us/en/developer/articles/tool/oneapi-standalone-components.html .
That’s interesting. I gave up on Intel compilers some time ago because it was tiring to manage the licensing. It seems like great news that they have simplified the process.
Intel switched to LLVM two years ago for their main C/C++ compiler, but they still release and update their Compiler Classic, based on their own compiler internals: https://en.wikipedia.org/wiki/Intel_C%2B%2B_Compiler#Release_history
They seem to support more optimizations and acceleration features than competitors. They might be the only compiler to support the new matrix instructions (AMX), they have extensive support for OpenMP up through 5.x, and there are their libraries like Threading Building Blocks and SYCL/OpenCL support.
But it’s hard to keep track of their branding, products, and features. Much of it falls under “oneAPI” now. I think oneAPI is meant to include their compiler, but I’m not sure.
Have you looked at the matrix instructions?
I have not yet had access to a processor with AMX instructions.
Only a tiny adjustment to the instructions.
# will download the file as d2cIxRx
wget https://cutt.ly/d2cIxRx
# will download the file as Arabic-Lipsum.utf8.txt
wget --content-disposition https://cutt.ly/d2cIxRx
Thanks!
Is there any reason for maintaining “a database formatted with UTF-16”? I had thought that the only use for UTF-16 in the modern age is for legacy operating system interfaces.
Last I checked, SQL Server defaulted to UTF-16. It is possible to use UTF-8 with recent versions, but it wasn’t the default when I last looked into it.
Results from a Xeon W-1370P (Rocket Lake); I don’t know which ones you used, so I provide all those that are UTF-8 → UTF-16 with icelake or iconv:
convert_utf8_to_utf16+icelake, input size: 81685, iterations: 3000, dataset: Arabic-Lipsum.utf8.txt
1.403 ins/byte, 0.440 cycle/byte, 11.871 GB/s (0.3 %), 5.224 GHz, 3.189 ins/cycle
2.505 ins/char, 0.785 cycle/char, 6.651 Gc/s (0.3 %) 1.78 byte/char
convert_utf8_to_utf16+iconv, input size: 81685, iterations: 3000, dataset: Arabic-Lipsum.utf8.txt
32.378 ins/byte, 5.294 cycle/byte, 0.983 GB/s (0.2 %), 5.202 GHz, 6.115 ins/cycle
57.791 ins/char, 9.450 cycle/char, 0.550 Gc/s (0.2 %) 1.78 byte/char
convert_utf8_to_utf16_with_dynamic_allocation+icelake, input size: 81685, iterations: 3000, dataset: Arabic-Lipsum.utf8.txt
1.660 ins/byte, 0.526 cycle/byte, 9.919 GB/s (0.6 %), 5.220 GHz, 3.155 ins/cycle
2.964 ins/char, 0.939 cycle/char, 5.557 Gc/s (0.6 %) 1.78 byte/char
convert_utf8_to_utf16_with_errors+icelake, input size: 81685, iterations: 3000, dataset: Arabic-Lipsum.utf8.txt
1.403 ins/byte, 0.435 cycle/byte, 12.009 GB/s (0.5 %), 5.225 GHz, 3.225 ins/cycle
2.505 ins/char, 0.777 cycle/char, 6.728 Gc/s (0.5 %) 1.78 byte/char