Computing the UTF-8 size of a Latin 1 string quickly (ARM NEON edition)

While most of our software relies on Unicode strings, we often still encounter legacy encodings such as Latin 1. JavaScript strings are often stored as either Latin 1, UTF-8 or UTF-16 internally. Before we convert Latin 1 strings to Unicode (e.g., UTF-8), we must compute the size of the UTF-8 string. It is fairly easy: all ASCII characters map 1 byte to 1 byte, while other characters (with code point values from 128 to 256) map 1 Latin byte to 2 Unicode bytes (in UTF-8).

Computers represent strings using bytes. Most often, we use the Unicode standard to represent characters in bytes. The universal format to exchange strings online is called UTF-8. It can represent over a million characters while retaining compatibility with the ancient ASCII format.

Though most of our software stack has moved to Unicode strings, there are still older standards like Latin 1 used for European strings. Thus we often need to convert Latin 1 strings to UTF-8. It is useful to first compute the size (in bytes) of the eventual UTF-8 strings. You can code a simple C function to compute the UTF-8 size from the Latin 1 input as follow:

size_t scalar_utf8_length(const char * c, size_t len) {
  size_t answer = 0;
  for(size_t i = 0; i<len; i++) {
    if((c[i]>>7)) { answer++;}
  }
  return answer + len;
}

In Computing the UTF-8 size of a Latin 1 string quickly (AVX edition), I reviewed faster techniques to solve this problem on x64 processors.

What about ARM processors (as in your recent MacBook)?

Keyhan Vakil came up with a nice solution with relies on the availability for “narrowing instructions” in ARM processors. Basically you can take a 16-byte vector registers and create a 8-byte register (virtually) by truncating or rounding the results. Conveniently, you can also combine bit shifting with narrowing.

Consider pairs of successive 8-bit words as a 16-bit word. E.g., if the 16 bits start out as aaaaaaaabbbbbbbb then a shift-by-four-and-narrow creates the byte value aaaabbbb. Indeed, if you shift a 16-bit word by 4 bits and keep only the least significant 8 bits of the result, then

  1. the most significant 4 bits from the second 8-bit word become the least significant 4 bits in the result
  2. and the least significant 4 bits from the first 8-bit word become the most significant 4 bits.

This is convenient because vectorized comparison functions often generate filled bytes when the comparison is true (all 1s). The final algorithm in C looks as follows:

uint64_t utf8_length_kvakil(const uint8_t *data, uint32_t length) {
  uint64_t result = 0;
  const int lanes = sizeof(uint8x16_t);
  uint8_t rem = length % lanes;
  const uint8_t *simd_end = data + (length / lanes) * lanes;
  const uint8x16_t threshold = vdupq_n_u8(0x80);
  for (; data < simd_end; data += lanes) {
    // load 16 bytes
    uint8x16_t input = vld1q_u8(data);
    // compare to threshold (0x80)
    uint8x16_t withhighbit = vcgeq_u8(input, threshold);
    // shift and narrow
    uint8x8_t highbits = vshrn_n_u16(vreinterpretq_u16_u8(withhighbit), 4);
    // we have 0, 4 or 8 bits per byte
    uint8x8_t bitsperbyte = vcnt_u8(highbits);
    // sum the bytes vertically to uint16_t
   result += vaddlv_u8(bitsperbyte);
  }
  result /= 4; // we overcounted by a factor of 4
  // scalar tail
  for (uint8_t j = 0; j < rem; j++) {
    result += (simd_end[j] >> 7);
  }
  return result + length;
}

Can you beat Vakil? You can surely reduce the instruction count but once you reach speeds like 20 GB/s, it becomes difficult to go much faster without hitting memory and cache speed limits.

Pete Cawley proposed a simpler algorithm which avoids the narrowing shifts, and does a vertical addition instead:

uint64_t utf8_length_simpler(const uint8_t *data, uint32_t length) {
  uint64_t result = 0;
  const int lanes = sizeof(uint8x16_t);
  uint8_t rem = length % lanes;
  const uint8_t *simd_end = data + (length / lanes) * lanes;
  const uint8x16_t threshold = vdupq_n_u8(0x80);
  for (; data < simd_end; data += lanes) {
    // load 16 bytes
    uint8x16_t input = vld1q_u8(data);
    // compare to threshold (0x80)
    uint8x16_t withhighbit = vcgeq_u8(input, threshold);
    // vertical addition
    result -= vaddvq_s8(withhighbit);
  }
  // scalar tail
  for (uint8_t j = 0; j < rem; j++) {
    result += (simd_end[j] >> 7);
  }
  return result + length;
}

Are the hand-tuned NEON functions fast?

On my Apple M2, they are three times as fast as what the compiler produces from the scalar code on large enough inputs. Observe that the compiler already relies on vector instructions even when compiling scalar code.

scalar code ~6 GB/s
NEON code (both functions) ~20 GB/s

On my Apple laptop, both NEON functions are equally fast. Using Graviton 3 processors on AWS (with GCC 11), I can tell them apart:

scalar code ~7 GB/s
NEON code (Vakil) ~27 GB/s
NEON code (Cawley) ~30 GB/s

The Cawley function is slightly better. My source code is available. Your results will vary.

Update: if you check out my source code, you will find two new versions that are quite fast. It seems that there is a lot of room for optimization on this problem.

Remark. Whenever I mention Latin 1, some people are prone to remark that browsers treat HTML pages declared as Latin 1 and ASCII as windows-1252. That is because modern web browsers do not support Latin 1 and ASCII in HTML. Even so, you should not use Latin 1, ASCII or even windows-1252 for your web pages. I recommend using Unicode (UTF-8). However, if you code in Python, Go or Node.js, and you declare a string as Latin 1, it should be Latin 1, not windows-1252. It is bug to confuse Latin 1, ASCII and windows-1252. They are different formats.

Daniel Lemire, "Computing the UTF-8 size of a Latin 1 string quickly (ARM NEON edition)," in Daniel Lemire's blog, May 15, 2023.

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

7 thoughts on “Computing the UTF-8 size of a Latin 1 string quickly (ARM NEON edition)”

  1. but once you reach speeds like 20 GB/s, it becomes difficult to go much faster without hitting memory and cache speed limits

    Cache is often faster than 20GB/s, but all that depends on the CPU, RAM and size of data. You tested on an M2, but a Cortex A55 likely gives very different results, so one shouldn’t overly assume general performance with just one test (particularly with ARM).

    In terms of optimisation, I suspect a loop of accum = vsraq_n_u8(accum, input, 7) to be simpler/faster. You’ll want to unroll it a bit, with multiple accumulators, to get around latency limitations of the instruction, and then will need to aggregate results every 255*unroll iterations.

      1. Well that goes to show that your initial assumption of 20GB/s being hard to exceed, can be achieved if you try. =)

        Interested to see what you get with a vsraq_n_u8 strategy!

  2. Your ‘simpler’ variant may not compile without adding a vreinterpretq_s8_u8 to the vaddvq_s8 line.

    Tested on a Neoverse N1:

    scalar (with autovec)
    ns/bytes 0.169661
    GB/s 5.89412
    ns/bytes 0.177977
    GB/s 5.61869
    ns/bytes 0.174651
    GB/s 5.72571

    kvakil
    ns/bytes 0.0565536
    GB/s 17.6824
    ns/bytes 0.0565536
    GB/s 17.6824
    ns/bytes 0.0582169
    GB/s 17.1771

    faster
    ns/bytes 0.0465735
    GB/s 21.4714
    ns/bytes 0.0482369
    GB/s 20.731
    ns/bytes 0.0482369
    GB/s 20.731

    shift
    ns/bytes 0.0149701
    GB/s 66.8
    ns/bytes 0.0166334
    GB/s 60.12
    ns/bytes 0.0166334
    GB/s 60.12

    Shift version: https://godbolt.org/z/azaqfP5ox

    1. Yeah, USRA is the perfect instruction for this task. On M2, with 3-cycle latency and 4-per-cycle throughput, you may want to split it across 12 seperate accumulator registers. However, you’d be memory bound. M2 only has 3-per-cycle load throughput, so it should max-out at 9 separate accumulator registers. (48-bytes/cycle at 3.5GHz gives a theoretical 168GB/s, but that’s just the maximum speed for loads.)

      1. I agree that to maximise throughput on Apple Firestorm it seems like at least 9 accumulators in the innermost loop, with 3 of them being written to by a USRA instruction per cycle is the best you can do. Dougall, I really appreciate https://dougallj.github.io/applecpu/ !

        For Neoverse N1/N2, V1/V2 this is not quite optimal though; only half of their ASIMD pipelines support USRA, so for N1/N2 the maximum throughput is 16B/cycle, and for V1/V2 the maximum throughput is 32B/cycle when using USRA alone. See the software optimization guides

        For these microarchitectures I think it’s better to have 2/3s of the accumulators written with USRA, and 1/3 of the accumulators written with 2 instructions: CMLT (zero) and SUB.

        If scheduled correctly, this should give a 50% throughput improvement vs. using the pure USRA approach (i.e. 24B/cycle and 48B/cycle respectively).
        You need at least 8 USRA accumulator registers to make full use of USRA on V1/V2 (4 cycles latency x 2 pipes throughput). Then you can have 4 accumulators for CMLT+SUB approach.

        As Firestorm is bound by loads (3 ASIMD loads per cycle vs. 4 ASIMD execution units), this variant should run as fast on those uarch as the fully unrolled pure USRA approach.

        (Untested!) example code here:
        https://godbolt.org/z/1M8411fGf

        Note it looks like GCC does what I intended, but clang unfortunately optimizes the 2 instructions back to USRA.

Leave a Reply

Your email address will not be published.

You may subscribe to this blog by email.