Most strings online are Unicode strings in the UTF-8 format. Other systems (e.g., Java, Microsoft) might prefer UTF-16. However, Latin 1 is still a common encoding (e.g., within JavaScript runtimes). Its relationship with Unicode is simple: Latin 1 comprises the first 256 Unicode characters. It is rich enough to represent most of the standard Western European languages. Anything stored in Latin 1 can be encoded using Unicode; the reverse is obviously false. Nevertheless, let us assume that you have a Unicode string in UTF-8 that you want to quickly transcode to Latin 1.
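To make the relationship concrete, the character 'é' (code point U+00E9) is a single byte in Latin 1 but two bytes in UTF-8. The buffers below are purely illustrative:

const unsigned char e_acute_latin1[] = { 0xE9 };     // 'é' in Latin 1
const unsigned char e_acute_utf8[] = { 0xC3, 0xA9 }; // 'é' in UTF-8: 0b11000011 0b10101001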
The transcoding can be done with a short routine in C:
uint8_t leading_byte = data[pos]; // leading byte
if (leading_byte < 0b10000000) {
  // ASCII: one byte, one character
  *latin_output++ = leading_byte;
  pos++;
} else if ((leading_byte & 0b11100000) == 0b11000000) {
  // two-byte character: combine the low bit of the leading byte
  // with the continuation byte
  *latin_output++ = (leading_byte & 1) << 6 | (data[pos + 1]);
  pos += 2;
}
It processes the data one byte at a time. There are two cases: ASCII bytes (one byte, one character) and two-byte characters (one leading byte, one continuation byte).
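For context, here is how the snippet might be wrapped into a complete function. This is a minimal sketch: the function name, the bounds checks and the validation details are my own additions, not necessarily the benchmarked code.

#include <stddef.h>
#include <stdint.h>

// Sketch of a full scalar transcoder. Returns the number of Latin 1
// bytes written, or 0 if the input is not valid UTF-8 that fits in Latin 1.
size_t utf8_to_latin1_scalar(const uint8_t *data, size_t len, uint8_t *latin_output) {
  uint8_t *start = latin_output;
  size_t pos = 0;
  while (pos < len) {
    uint8_t leading_byte = data[pos];
    if (leading_byte < 0b10000000) { // ASCII: one byte, one character
      *latin_output++ = leading_byte;
      pos++;
    } else if ((leading_byte & 0b11100000) == 0b11000000) { // two-byte character
      // only 0xC2 and 0xC3 lead sequences whose code points fit in Latin 1
      if (leading_byte < 0xC2 || leading_byte > 0xC3) { return 0; }
      // there must be exactly one well-formed continuation byte
      if (pos + 1 == len || (data[pos + 1] & 0b11000000) != 0b10000000) { return 0; }
      // the continuation byte already carries bit 7, so OR-ing in the low bit
      // of the leading byte (shifted up by six) produces the code point
      *latin_output++ = (leading_byte & 1) << 6 | data[pos + 1];
      pos += 2;
    } else { // three-byte or four-byte sequences, or a stray continuation byte
      return 0;
    }
  }
  return (size_t)(latin_output - start);
}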
Can we do better?
We can use single-instruction, multiple-data (SIMD) instructions, specifically the advanced SIMD instructions available on recent AMD Zen 4 and Intel Ice Lake processors: AVX-512 with VBMI2.
We can then process the data 64 bytes at a time. Using AVX-512, we load 64 bytes. We identify the locations of the ASCII bytes, the leading bytes and the continuation bytes using masks. We then modify the leading bytes, keeping one bit (just the least significant one) and shifting it up by six positions. We then do two compressions: one where we omit the continuation bytes, and one where we omit the newly transformed leading bytes. We then combine the two results with a byte-wise logical OR, and we are done. Using Intel intrinsics, the code might look as follows:
__m512i input = _mm512_loadu_si512((__m512i *)(buf + pos));
// ASCII bytes are below 0b10000000 (unsigned comparison)
__mmask64 ascii = _mm512_cmplt_epu8_mask(input, mask_80808080);
// continuation bytes (0b10xxxxxx) are below -64 as signed values
__mmask64 continuation = _mm512_cmplt_epi8_mask(input, _mm512_set1_epi8(-64));
// leading bytes are whatever remains
__mmask64 leading = _kxnor_mask64(ascii, continuation);
// adding 62 to 0xC2/0xC3 wraps around, leaving only the least significant bit
__m512i highbits = _mm512_maskz_add_epi8(leading, input, _mm512_set1_epi8(62));
highbits = _mm512_slli_epi16(highbits, 6); // shift in position
input = _mm512_mask_blend_epi8(leading, input, highbits);
// compress twice: once omitting the leading bytes, once omitting the continuation bytes
__m512i ascii_continuation = _mm512_maskz_compress_epi8(ascii | continuation, input);
__m512i ascii_leading = _mm512_maskz_compress_epi8(ascii | leading, input);
__m512i output = _mm512_or_si512(ascii_continuation, ascii_leading);
_mm512_storeu_epi8((__m512i *)latin_output, output);
Of course, we must also validate the input, and it adds some complexity, but not too much.
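To give a rough idea of what the validation might involve, the sketch below (my names, not the benchmarked code) checks, ignoring the block boundary that a full implementation must carry across iterations, that no byte is an invalid leading byte and that leading and continuation bytes pair up:

// bytes above 0xC3 can never occur in UTF-8 that fits in Latin 1
__mmask64 too_large = _mm512_cmpgt_epu8_mask(input, _mm512_set1_epi8((char)0xC3));
// 0xC0 and 0xC1 are overlong encodings, invalid in any UTF-8 input
__mmask64 overlong = _mm512_cmpeq_epi8_mask(
    _mm512_and_si512(input, _mm512_set1_epi8((char)0xFE)),
    _mm512_set1_epi8((char)0xC0));
// each continuation byte must directly follow a leading byte, and vice versa
int looks_valid = (too_large == 0) && (overlong == 0) &&
                  ((leading << 1) == continuation);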
An anonymous reader points out a significantly faster and simpler approach:
- Load 64 bytes
- Identify the leading bytes (non-continuation non-ASCII bytes) with a single comparison.
- Test whether the leading bytes have their least significant bit set, and construct a mask from it.
- Shift the mask by one position and set the second-to-last bit to 1 in the continuation bytes that are preceded by such a leading byte. We can do it by subtracting 0b11000000: in eight-bit arithmetic, 0b10000000 − 0b11000000 wraps around to 0b11000000, so a continuation byte 0b10yyyyyy becomes 0b11yyyyyy.
- Prune all the leading bytes (non-continuation non-ASCII bytes) and write the result out.
Using Intel intrinsics, the core implementation might be like so:
__m512i input = _mm512_loadu_si512((__m512i *)(buf + pos));
// leading bytes are at least 0b11000000 (unsigned comparison)
__mmask64 leading = _mm512_cmpge_epu8_mask(input, _mm512_set1_epi8(-64));
// which leading bytes have their least significant bit set?
__mmask64 bit6 = _mm512_mask_test_epi8_mask(leading, input, _mm512_set1_epi8(1));
// subtract 0b11000000 from the continuation bytes that follow them
input = _mm512_mask_sub_epi8(input, (bit6 << 1) | next_bit6, input, _mm512_set1_epi8(-64));
// carry a match at position 63 over to the next 64-byte block
next_bit6 = bit6 >> 63;
// prune the leading bytes and write out what remains
__mmask64 retain = ~leading;
__m512i output = _mm512_maskz_compress_epi8(retain, input);
int64_t written_out = _popcnt64(retain);
__mmask64 store_mask = (1ULL << written_out) - 1;
_mm512_mask_storeu_epi8((__m512i *)latin_output, store_mask, output);
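Note that next_bit6 is loop-carried: a leading byte at position 63 has its continuation byte at position 0 of the next block, so the corresponding mask bit is saved and applied on the next iteration (the variable presumably starts at zero). Also, if a block contains no leading byte at all, written_out is 64 and 1ULL << written_out is undefined behavior in C, so a complete routine must handle that case separately (for instance, with an ASCII fast path).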
I use GCC 11 on an Ice Lake server. My source code is available. For benchmarking, I use a French version of the Mars Wikipedia entry that has been modified to fit in Latin 1.
| technique | CPU cycles/byte | instructions/byte |
|---|---|---|
| conventional | 1.9 | 5.4 |
| AVX-512 | 0.3 | 0.75 |
| faster AVX-512 | 0.2 | 0.43 |
On a 3.2 GHz processor, the AVX-512 routine reaches 12 GB/s: it is about 6 times faster than the conventional routine and uses 7 times fewer instructions. The faster AVX-512 routine is 10 times faster than the conventional routine while using 11 times fewer instructions, reaching 18 GB/s. That is likely faster than my RAM bandwidth, which I estimate to be at least 15 GB/s, though the CPU cache bandwidth is several times higher. Keep in mind that there are disks with a 14 GB/s bandwidth.
This problem illustrates that AVX-512 can really do well on non-trivial string transformations without excessive cleverness.
Remark. Whenever I mention Latin 1, some people are prone to remark that browsers treat HTML pages declared as Latin 1 and ASCII as windows-1252. That is because modern web browsers do not support Latin 1 and ASCII in HTML. Even so, you should not use Latin 1, ASCII or even windows-1252 for your web pages. I recommend using Unicode (UTF-8). However, if you code in Python, Go or Node.js, and you declare a string as Latin 1, it should be Latin 1, not windows-1252. It is a bug to confuse Latin 1, ASCII and windows-1252. They are different formats: for example, the byte 0x80 is the euro sign (€) in windows-1252 but a control character in Latin 1.
I wrote an RVV version:
size_t convert_rvv(uint8_t *utf8, size_t len, uint8_t *latin1)
{
    uint8_t *beg = latin1;
    uint8_t last = 0;
    vuint8m4_t v, s;
    vbool2_t ascii, cont, leading, sleading;
    for (size_t vl, VL; len > 1; ) {
        VL = vl = __riscv_vsetvl_e8m4(len);
        v = __riscv_vle8_v_u8m4(utf8, vl);
        ascii = __riscv_vmsgtu_vx_u8m4_b2(v, 0x80-1, vl);
        if (__riscv_vfirst_m_b2(ascii, vl) < 0)
            goto skip;
        s = __riscv_vslide1up_vx_u8m4(v, last, vl);
        leading = __riscv_vmsltu_vx_u8m4_b2(__riscv_vadd_vx_u8m4(v, 0b11000010, vl), 2, vl);
        sleading = __riscv_vmsltu_vx_u8m4_b2(__riscv_vadd_vx_u8m4(s, 0b11000010, vl), 2, vl);
        cont = __riscv_vmsne_vx_u8m4_b2(__riscv_vsrl_vx_u8m4(v, 6, vl), 0b10, vl);
        if (__riscv_vcpop_m_b2(__riscv_vmand_mm_b2(sleading, cont, vl), vl) != __riscv_vcpop_m_b2(sleading, vl) ||
            __riscv_vfirst_m_b2(__riscv_vmnor_mm_b2(ascii, __riscv_vmor_mm_b2(leading, cont, vl), vl), vl) >= 0)
            return 0;
        s = __riscv_vor_vv_u8m4(__riscv_vsll_vx_u8m4(__riscv_vand_vx_u8m4(v, 1, vl), 6, vl), s, vl);
        s = __riscv_vmerge_vvm_u8m4(v, s, ascii, vl);
        v = __riscv_vcompress_vm_u8m4(s, cont, vl);
        vl = __riscv_vcpop_m_b2(cont, vl);
skip:
        __riscv_vse8_v_u8m4(latin1, v, vl);
        latin1 += vl; utf8 += VL; len -= VL;
        last = utf8[-1];
    }
    return (latin1 - (uint8_t*)beg);
}
Results from a 2 GHz C920:
SWAR: 0.419896 GiB/s
RVV: 3.707396 GiB/s
(I hope this isn’t a duplicate, I couldn’t tell if the previous posting got through)
Looks like the formatting got a bit messed up, here is a godbolt link: https://godbolt.org/z/6M7T938aE
Thanks for sharing!
Isn’t this just a case of moving the bottom bit of the leading byte to the 6th bit of the following byte, then stripping out all leading bytes?
__m512i input = _mm512_loadu_si512((__m512i *)(buf + pos));
__mmask64 leading = _mm512_cmpge_epu8_mask(input, _mm512_set1_epi8(-64));
__mmask64 bit6 = _mm512_mask_test_epi8_mask(leading, input, _mm512_set1_epi8(1));
input = _mm512_mask_sub_epi8(input, (bit6<<1) | next_bit6, input, _mm512_set1_epi8(-64));
next_bit6 = bit6 >> 63;
_mm512_mask_compressstoreu_epi8((__m512i*)latin_output, ~leading, input); // WARNING: bad on Zen4
I tried putting it into the full code, and it appears to work: https://pastebin.com/Jbzm16pF
I’m not sure if the test cases can pick up all possible errors though.
Except that my full code validates the input. I don’t think your code does…
Oh. I see that you have completed it. I will run benchmarks.
Blog post updated. Great results.
Thanks for trying it out!
I was estimating 0.5 instructions per byte for an optimized routine, but your approach is a tad better, which is amazing.
oke intel fanboy. Are you done? Can we shelve complicated avx512 indoctrination completely now? Good heavens…
AMD Zen 4 has superb AVX-512 support.
If you have a competitive alternative, you’re welcome to post it.