Validating UTF-8 strings using as little as 0.7 cycles per byte

Most strings found on the Internet are encoded using a particular Unicode format called UTF-8. However, not all sequences of bytes are valid UTF-8. The rules as to what constitutes a valid UTF-8 string are somewhat arcane. Yet it seems important to validate these strings quickly, before you consume them.
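
For reference, the core rules are short even if the corner cases are not. This recap follows RFC 3629 and is not specific to any of the code below:

// UTF-8 byte patterns (RFC 3629):
//   1 byte : 0xxxxxxx                             U+0000..U+007F (ASCII)
//   2 bytes: 110xxxxx 10xxxxxx                    U+0080..U+07FF
//   3 bytes: 1110xxxx 10xxxxxx 10xxxxxx           U+0800..U+FFFF, excluding surrogates U+D800..U+DFFF
//   4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx  U+10000..U+10FFFF
// Overlong encodings (using more bytes than necessary) are also invalid.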

In a previous post, I pointed out that it takes about 8 cycles per byte to validate them using a fast finite-state machine. After hacking code found online, I showed that using SIMD instructions, we could bring this down to about 3 cycles per input byte.

Is that the best one can do? Not even close.

Many strings are just ASCII, which is a subset of UTF-8. They are easily recognized because they use just 7 bits per byte: the most significant bit of each byte is set to zero. Yet if you check each and every byte with silly scalar code, it is going to take over a cycle per byte to verify that a string is ASCII. For much better speed, you can vectorize the problem in this manner:

__m128i mask = _mm_setzero_si128();
for (size_t i = 0; i + 16 <= len; i += 16) {
    __m128i current_bytes = _mm_loadu_si128((const __m128i *)(src + i));
    mask = _mm_or_si128(mask, current_bytes); // accumulate all bytes
}
// a byte with its most significant bit set compares as negative
__m128i has_error = _mm_cmpgt_epi8(
         _mm_setzero_si128(), mask);
return _mm_testz_si128(has_error, has_error); // 1 (true) when every byte was ASCII

Essentially, we are loading vector registers and computing a bitwise OR with a running mask. Whenever a byte outside the ASCII range is present, the most significant bit of the corresponding byte in the running mask gets set. We continue until the very end no matter what, and only then do we examine the mask.

We can use the same general idea to validate UTF-8 strings. My code is available. Here are the speeds I measure:

finite-state machine (is UTF-8?): 8 to 8.5 cycles per input byte
determining if the string is ASCII: 0.07 to 0.1 cycles per input byte
vectorized code (is UTF-8?): 0.7 to 0.9 cycles per input byte

If you are almost certain that most of your strings are ASCII, then it makes sense to first test whether the string is ASCII, and only then fall back on the more expensive UTF-8 test.
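
For example, a minimal wrapper might look as follows. Treat this as a sketch: validate_ascii_fast is the function outlined above, while validate_utf8_fast stands in for whatever full UTF-8 validator you use (the wrapper name is made up).

#include <stdbool.h>
#include <stddef.h>

static bool validate_ascii_fast(const char *src, size_t len); // sketched above
static bool validate_utf8_fast(const char *src, size_t len);  // assumed: a full UTF-8 validator

// Fast path: accept pure-ASCII inputs without running the full UTF-8 check.
static bool validate_utf8_with_ascii_path(const char *src, size_t len) {
    if (validate_ascii_fast(src, len))   // cheap test (~0.1 cycles per byte)
        return true;                     // ASCII is always valid UTF-8
    return validate_utf8_fast(src, len); // full check (~0.7 to 0.9 cycles per byte)
}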

So we are roughly ten times faster than a reasonable scalar implementation. I doubt this scalar implementation is as fast as it can be… but it is not naive… And my own code is not nearly optimal. It is not using AVX, to say nothing of AVX-512. Furthermore, it was written in a few hours. I would not be surprised if one could double the speed using clever optimizations.

The exact results will depend on your machine and its configuration. But you can try the code.

I have created a C library out of this little project as it seems useful. Contributions are invited. For reliable results, please configure your server accordingly: benchmarking on a laptop is hazardous.

Credit: Kendall Willets made a key contribution by showing that you could “roll” counters using saturated subtractions.

Update: This work led to a research paper, Validating UTF-8 In Less Than One Instruction Per Byte, Software: Practice & Experience (to appear).

Update: For a production-ready UTF-8 validation function, please see the simdjson library.

Further reading: Validating UTF-8 bytes using only 0.45 cycles per byte (AVX edition)

Published by Daniel Lemire, a computer science professor at the University of Quebec (TELUQ).

16 thoughts on “Validating UTF-8 strings using as little as 0.7 cycles per byte”

  1. The counter-rolling can actually be done logarithmically by shifting by 1, 2, 4, etc. bytes, e.g. (a sketch follows at the end of this comment):

    [4,0,0,0] + ([0,4,0,0]-[1,1,1,1]) = [4,3,0,0]

    [4,3,0,0] + ([0,0,4,3]-[2,2,2,2]) = [4,3,2,1]

    but in this case the distances didn’t seem big enough to beat the linear method.

    The distances can even be larger than the register size I believe if the last value in the register is carried over to the first element of the next. It’s a good way to delineate inline variable-length encodings.
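
    For illustration, a minimal SSE sketch of this logarithmic rolling (the helper name is invented and this is not the repository's code; it assumes `counts` holds the code-point length at each lead byte and zero elsewhere):

        #include <emmintrin.h> // SSE2
        // e.g. [4,0,0,0,...] becomes [4,3,2,1,...] after two shift-subtract-add steps
        static __m128i roll_counters_log(__m128i counts) {
            // move each count forward by 1 byte, subtract 1 with unsigned saturation, add back
            __m128i s1 = _mm_subs_epu8(_mm_slli_si128(counts, 1), _mm_set1_epi8(1));
            counts = _mm_add_epi8(counts, s1);
            // move forward by 2 bytes, subtract 2: two steps cover code points up to 4 bytes long
            __m128i s2 = _mm_subs_epu8(_mm_slli_si128(counts, 2), _mm_set1_epi8(2));
            return _mm_add_epi8(counts, s2);
        }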

  2. For more fun, combine unaligned loads with the _mm_movemask_epi8 / _mm256_movemask_epi8 intrinsic, which lets you rapidly seek to the next non-ASCII character when there is one, or validate that there aren’t any. (A sketch appears at the end of this thread.)

    1. (on second thought, might be better to stick to aligned loads, less special-case code for the end of the string may be a bigger deal than special-case code for a UTF-8 code point that crosses a vector boundary.)

    2. I have something similar to that in my version — I look for counters that go off the end, and do the next unaligned load at the start of that character instead of carrying intermediate state over. Lemire found an elegant way of shifting in the needed bytes from the previous block with _mm_alignr_epi8 which is likely faster and keeps the 16-byte stride.

      There’s also a slight rewrite I did to examine all 5 bits in the initials; it turns out that splitting into ascii/non-ascii on the high bit, and then doing the mapping on the next 4 bits of the non-ascii chars allows us to cover all 5 bits correctly and do the all-ascii shortcut based on movemask. I’ll see if I can check this into the repo.
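
    A minimal sketch of the movemask idea from this thread, for illustration (the function name is made up; __builtin_ctz assumes GCC or Clang):

        #include <emmintrin.h>
        #include <stddef.h>
        // Return the index of the first byte with the high bit set, or len if there is none.
        static size_t first_non_ascii(const unsigned char *src, size_t len) {
            size_t i = 0;
            for (; i + 16 <= len; i += 16) {
                __m128i chunk = _mm_loadu_si128((const __m128i *)(src + i));
                int bits = _mm_movemask_epi8(chunk); // one bit per byte: the sign bit
                if (bits != 0)
                    return i + (size_t)__builtin_ctz((unsigned)bits);
            }
            for (; i < len; i++)                     // scalar tail
                if (src[i] >= 0x80) return i;
            return len;
        }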

  3. And why not:

    size_t fast_validate_ascii(unsigned char* src, long len) {
        __m128i minusbyte = _mm_set1_epi8(0x80);
        __m128i current_bytes = _mm_setzero_si128();

        size_t i = 0;
        for (; i + 15 < len; i += 16) {
            // we load our section, the length should be larger than 16
            current_bytes = _mm_loadu_si128((const __m128i *)(src + i));
            if (!_mm_testz_si128(minusbyte, current_bytes))
                return i;
        }

        // last part
        if (i < len) {
            char buffer[16];
            memset(buffer, 0, 16);
            memcpy(buffer, src + i, len - i);
            current_bytes = _mm_loadu_si128((const __m128i *)buffer);
            if (!_mm_testz_si128(minusbyte, current_bytes))
                return i;
        }

        return -1;
    }
    We test the sign bit of each byte directly. If all the characters are ASCII, we return -1; otherwise, we return the position from which we know that a UTF-8 byte may be hiding within a 16-character window.

  4. Oops… I forgot that size_t is unsigned…

    long fast_validate_ascii(unsigned char* src, long len) {
        __m128i minusbyte = _mm_set1_epi8(0x80);
        __m128i current_bytes = _mm_setzero_si128();

        long i = 0;
        for (; i + 15 < len; i += 16) {
            // we load our section, the length should be larger than 16
            current_bytes = _mm_loadu_si128((const __m128i *)(src + i));
            if (!_mm_testz_si128(minusbyte, current_bytes))
                return i;
        }

        // last part
        if (i < len) {
            char buffer[16];
            memset(buffer, 0, 16);
            memcpy(buffer, src + i, len - i);
            current_bytes = _mm_loadu_si128((const __m128i *)buffer);
            if (!_mm_testz_si128(minusbyte, current_bytes))
                return i;
        }

        return -1;
    }

    1. Your code is fine, but it is slower if you expect the content to be valid ASCII…

      validate_ascii_fast(data, N)                                    :  0.086 cycles per operation (best)    0.087 cycles per operation (avg)
      clauderoux_validate_ascii(data, N)                              :  0.106 cycles per operation (best)    0.106 cycles per operation (avg)
      

      So it becomes a data-dependent engineering issue.

      1. I see… That’s quite fascinating. It means that _mm_testz_si128 is much slower than doing a “gt” comparison and a bitwise OR…

        Which is quite surprising, since all this instruction does is a bitwise AND and then a check against 0.

        Still more than 10 times slower is very weird…

        However, I implemented this routine in my own code, and I get a 25% improvement when checking whether a string is UTF-8 or converting it from UTF-8 (or various Latin character tables) to Unicode…
        It also works on macOS, my main development platform. I haven’t tested on Windows yet, but I guess I should expect the same results…

        1. According to Agner Fog, the relevant instruction (ptest) counts for two μops and has a latency of 3 cycles. Note that you are also adding an extra branch (one that depends on a high-latency instruction).

          Still more than 10 times slower is very weird…

          What is 10 times slower than what?

  5. This performs pretty badly if the first character of a large string is invalid UTF-8, because it still scans the whole string to the end.

    1. This performs pretty badly if the first character of a large string is invalid UTF-8, because it still scans the whole string to the end.

      The blog post assumes that most of your data is valid UTF-8 and that it is an exceptional case when it is invalid.

      If you expect the data to be frequently invalid UTF-8, then you need to proceed differently.

  6. SIMD does not seem required for the ASCII validation function. The following non-SIMD version is almost as fast here (~0.08 cycles per operation on a Ryzen 7 with gcc 7.3.0):

    static bool validate_ascii_fast(const char *src, size_t len) {
        const char* end = src + len;
        uint64_t mask1 = 0, mask2 = 0, mask3 = 0, mask4 = 0;

        for (; src < end - 32; src += 32) {
            const uint64_t* p = (const uint64_t*) src;
            mask1 |= p[0];
            mask2 |= p[1];
            mask3 |= p[2];
            mask4 |= p[3];
        }
        for (; src < end - 8; src += 8) {
            const uint64_t* p = (const uint64_t*) src;
            mask1 |= p[0];
        }
        uint8_t tail_mask = 0;
        for (; src < end; src++) {
            tail_mask |= * (const uint8_t*) src;
        }
        uint64_t final_mask = mask1 | mask2 | mask3 | mask4 | tail_mask;
        return !(final_mask & 0x8080808080808080);
    }

  7. Thank you very much!

    Brilliant work. The internet didn’t fail me today when I went looking for fast UTF-8 string validation.

    And poor guy, some of the comments on your other UTF-8 blog posts are funny. You express (and prove) your answers logically/mathematically, yet people repeatedly counter with the least thought-out counter-arguments 😛

    Much appreciated.
