Computing the UTF-8 size of a Latin 1 string quickly (AVX edition)

Computers represent strings using bytes. Most often, we use the Unicode standard to represent characters in bytes. The universal format to exchange strings online is called UTF-8. It can represent over a million characters while retaining compatibility with the ancient ASCII format.

Though most of our software stack has moved to Unicode strings, there are still older standards like Latin 1 used for European strings. Thus we often need to convert Latin 1 strings to UTF-8. It is useful to first compute the size (in bytes) of the eventual UTF-8 strings.

Thanfully, it is an easy problem: it suffices to count two output bytes for each input byte having the most significant bit set, and one output byte for other bytes. A relatively simple C++ function suffices:

size_t scalar_utf8_length(const char * c, size_t len) {
  size_t answer = 0;
  for(size_t i = 0; i<len; i++) {
    if((c[i]>>7)) { answer++;}
  return answer + len;

To go faster, we may want to use fancy “SIMD” instructions: instructions that process several bytes at once. Your compiler is likely to already do some autovectorization with the simple C function. At compile-time, it will use some SIMD instructions. However, you can try to do hand-code your own version.

We have several instruction sets to choose from, but let us pick the AVX2 instruction sets, available on most x64 processors today. AVX2 has a fast mask function that can extract all the most significant bits, and then another fast function that can count them (popcount). Thus the following routine should do well (credit to Anna Henningsen):

size_t avx2_utf8_length_basic(const uint8_t *str, size_t len) {
  size_t answer = len / sizeof(__m256i) * sizeof(__m256i);
  size_t i;
  for (i = 0; i + sizeof(__m256i) <= len; i += 32) {
    __m256i input = _mm256_loadu_si256((const __m256i *)(str + i));
   answer += __builtin_popcount(_mm256_movemask_epi8(input));
  return answer + scalar_utf8_length(str + i, len - i);

Can you do better? On Intel processors, both the ‘movemask’ and population count instructions are fast, but they have some latency: it takes several cycles for them to execute. They may also have additional execution constraints. Part of the latency and constraints is not going to get better with instruction sets like AVX-512 because it requires moving from SIMD registers to general registers at every iterations. It will similarly be a bit of challenge to port this routine to ARM processors.

Thus we would like to rely on cheaper instructions, and stay in SIMD registers until the end. Even if it does not improve the speed of the AVX code, it may work better algorithmically with other instruction sets.

For this purpose, we can borrow a recipe from the paper Faster Population Counts Using AVX2 Instructions (Computer Journal 61 (1), 2018). The idea is to quickly extract the bits, and add them to a counter that is within a SIMD register, and only exact the values at the end.

The code is slightly more complicated because we have an inner loop. Within the inner loop, we use 8-bit counters, only moving to 64-bit counters at the end of the inner loop. To ensure that there is no overflow, the inner loop can only run for 255 iterations. The code looks as follows…

size_t avx2_utf8_length_mkl(const uint8_t *str, size_t len) {
  size_t answer = len / sizeof(__m256i) * sizeof(__m256i);
  size_t i = 0;
  __m256i four_64bits = _mm256_setzero_si256();
  while (i + sizeof(__m256i) <= len) {
    __m256i runner = _mm256_setzero_si256();
    size_t iterations = (len - i) / sizeof(__m256i);
    if (iterations > 255) { iterations = 255; }
    size_t max_i = i + iterations * sizeof(__m256i) - sizeof(__m256i);
    for (; i <= max_i; i += sizeof(__m256i)) {
      __m256i input = _mm256_loadu_si256((const __m256i *)(str + i));
      runner = _mm256_sub_epi8(
        runner, _mm256_cmpgt_epi8(_mm256_setzero_si256(), input));
    four_64bits = _mm256_add_epi64(four_64bits, 
      _mm256_sad_epu8(runner, _mm256_setzero_si256()));
  answer += _mm256_extract_epi64(four_64bits, 0) +
    _mm256_extract_epi64(four_64bits, 1) +
    _mm256_extract_epi64(four_64bits, 2) +
    _mm256_extract_epi64(four_64bits, 3);
    return answer + scalar_utf8_length(str + i, len - i);

We can also further unroll the inner loop to save a bit on the number of instructions.

I wrote a small benchmark with 8kB random inputs, with an AMD Rome (Zen2) server and GCC 11 (-O3 -march=native). The results will vary depending on your input, your compiler and your processor.

function cycles/byte instructions/byte instructions/cycle
scalar (no autovec) 0.89 3.3 3.8
scalar (w. autovec) 0.56 0.71 1.27
AVX2 (movemask) 0.055 0.15 2.60
AVX2 (in-SIMD) 0.039 0.15 3.90
AVX2 (in-SIMD/unrolled) 0.028 0.07 2.40

So the fastest method in my test is over 30 times faster than the purely scalar version. If I allow the scalar vector to be ‘autovectorized’ by the compiler, it gets about 50% faster.

It is an interesting scenario because the number of instructions retired by cycle varies so much. The slightly more complex “in-SIMD” function does better than the ‘movemask’ function because it manages to retire more instructions per cycle, in these tests. The unrolled version is fast and requires few instructions per cycle, but it has a ‘relatively’ low number of instructions retired per cycle.

My code is available. It should be possible to further tune this code. You need privileged access to run the benchmark because I rely on performance counters.

Further work: It is not necessary to use SIMD instructions explicitly, you can use SWAR instead for a compromise between portability and performance.

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

7 thoughts on “Computing the UTF-8 size of a Latin 1 string quickly (AVX edition)”

  1. “it suffices to count two output bytes for each input byte having the most significant bit set, and one output byte for other bytes.”

    This works until your HTML5 compliant WWW browser (and derived applications) treats ISO-8859-1 as Windows-1252.

  2. Windows-1252 contains a number of characters whose codes exceed the 11-bit limit for two-octet UTF-8. Common example: €, the Euro sign. (There are others.)

    ISO-8859-1 only contains codepoints with one- and two-octet UTF-8 sequences, but you’ll find pages labelled as Latin1 which are actually Windows-1252 or even Latin15 (also with €).

    1. I stand corrected.

      Even so, if you are incorrectly identifying Windows-1252 as Latin 1 then you simply cannot convert it to Unicode properly, and there is no sure way to correct and identify a mislabel of the sort.

      1. You can’t and MUST NO trust a designation “charset=ISO-8859-1”; if your code is used to determine the size of buffer allocations, this can result in buffer overruns. Who will assign a CVE then?

        About 28 years ago, when adding MIME support to Windows, some imbeciles at Microsoft assigned the Codepage “CP_1252” to “charset=ISO-8859-1”:


        From then, every piece of (not just) their crap, like FrontPage or Word, which composes mail or HTML pages AND relies on Windows MIME database, labels text encoded with CP_1252 as ISO-8859-1 (the other “charset=ISO-8859-” have also wrong codepages assigned).
        Years later, Microsoft registered several “charset=Windows-
        ” with IANA, but never fixed their wrong database — a REAL shame.
        When the W3C standardized HTML5, to avoid a misrepresentation of HTML pages generated with Microsoft crap, they specified that ISO-8859-1 must be handled as Windows-1252.

  3. Do you know what performance will those functions have with 64BM strings (when cpu cache is much smaller then the string length) ?

  4. Great read as usual!

    Out of curiosity, I went ahead and wrote a SWAR version using uint_fast32_t. Going by cycles/byte – it’s about 3.5x faster than the autovectorized scalar version on my zen2 machine.

    size_t swar_utf8_len(const uint8_t *s, size_t len) {
    typedef uint_fast32_t V; static_assert(sizeof(V) <= sizeof(uint64_t));
    size_t mask = V{0x80808080} | (V{0x80808080} << 16 << 16);
    size_t answer = 0;
    size_t swar_rounds = len / sizeof(V);
    size_t swar_bytes = swar_rounds * sizeof(V);
    for (size_t i = 0; i < swar_bytes; i += sizeof(V)) {
    V v;
    memcpy(&v, s + i, sizeof v);
    v &= mask;
    answer += std::popcount(v);
    return len + answer + scalar_utf8_length(s + swar_bytes, len - swar_bytes);

    (Note that std::popcount requires c++20.)

    3.5x speed up from just dozen lines of (mostly) portable code seems like a good deal!

Leave a Reply

Your email address will not be published.

To create code blocks or other preformatted text, indent by four spaces:

    This will be displayed in a monospaced font. The first four 
    spaces will be stripped off, but all other whitespace
    will be preserved.
    Markdown is turned off in code blocks:
     [This is not a link](

To create not a block, but an inline code span, use backticks:

Here is some inline `code`.

For more help see

You may subscribe to this blog by email.