How quickly can you check that a string is valid unicode (UTF-8)?

Though character strings are represented as bytes (values in [0,255]), not all sequences of bytes are valid strings. By far the most popular character encoding today is UTF-8, part of the Unicode standard. How quickly can we check whether a sequence of bytes is valid UTF-8?

Any ASCII string is a valid UTF-8 string. An ASCII character is simply a byte value in [0,127] or [0x00, 0x7F] in hexadecimal. That is, the most significant bit is always zero.

You can check that a string is made of ASCII characters easily in C:

#include <stdbool.h>
#include <stddef.h>

// Returns true if every byte has its most significant bit set to zero.
// With signed chars, any byte value in [128,255] shows up as negative.
bool is_ascii(const signed char *c, size_t len) {
  for (size_t i = 0; i < len; i++) {
    if (c[i] < 0) return false;
  }
  return true;
}

However, there are many more Unicode characters than can be represented using a single byte. For other characters, outside the ASCII set, we need to use two or more bytes. All of these “fancier” characters are made of sequences of bytes, each having its most significant bit set to 1. However, there are somewhat esoteric restrictions:

  • All of the two-byte characters are made of a byte in [0xC2,0xDF] followed by a byte in [0x80,0xBF].
  • There are four types of characters made of three bytes. For example, if the first byte is 0xE0, then the next byte must be in [0xA0,0xBF], followed by a byte in [0x80,0xBF].

It is all quite boring, but it can be summarized by the following table:

First Byte    Second Byte   Third Byte    Fourth Byte
[0x00,0x7F]
[0xC2,0xDF]   [0x80,0xBF]
0xE0          [0xA0,0xBF]   [0x80,0xBF]
[0xE1,0xEC]   [0x80,0xBF]   [0x80,0xBF]
0xED          [0x80,0x9F]   [0x80,0xBF]
[0xEE,0xEF]   [0x80,0xBF]   [0x80,0xBF]
0xF0          [0x90,0xBF]   [0x80,0xBF]   [0x80,0xBF]
[0xF1,0xF3]   [0x80,0xBF]   [0x80,0xBF]   [0x80,0xBF]
0xF4          [0x80,0x8F]   [0x80,0xBF]   [0x80,0xBF]
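
To make the table concrete, here is a scalar validator of my own (a sketch for illustration, not code from any of the libraries discussed below) that implements these ranges directly:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Walk the string one sequence at a time: the first byte determines the
// length of the sequence and the allowed range of the second byte, and
// all remaining bytes must fall in [0x80,0xBF].
bool is_utf8(const uint8_t *s, size_t len) {
  size_t i = 0;
  while (i < len) {
    uint8_t first = s[i];
    size_t seqlen;
    uint8_t lo = 0x80, hi = 0xBF; // bounds for the second byte
    if (first <= 0x7F) { i++; continue; }             // ASCII
    else if (first >= 0xC2 && first <= 0xDF) seqlen = 2;
    else if (first == 0xE0) { seqlen = 3; lo = 0xA0; }
    else if (first >= 0xE1 && first <= 0xEC) seqlen = 3;
    else if (first == 0xED) { seqlen = 3; hi = 0x9F; }
    else if (first >= 0xEE && first <= 0xEF) seqlen = 3;
    else if (first == 0xF0) { seqlen = 4; lo = 0x90; }
    else if (first >= 0xF1 && first <= 0xF3) seqlen = 4;
    else if (first == 0xF4) { seqlen = 4; hi = 0x8F; }
    else return false; // 0x80-0xC1 and 0xF5-0xFF never start a sequence
    if (i + seqlen > len) return false;               // truncated sequence
    if (s[i + 1] < lo || s[i + 1] > hi) return false; // second byte
    for (size_t j = 2; j < seqlen; j++)               // remaining bytes
      if (s[i + j] < 0x80 || s[i + j] > 0xBF) return false;
    i += seqlen;
  }
  return true;
}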

So, how quickly can we check whether a string satisfies these conditions?

I went looking for handy C/C++ code. I did not want to use a framework or a command-line tool.

The first thing I found is Björn Höhrmann’s finite-state machine. It looks quite fast. Without getting into the details: given a small table that includes character classes and state transitions, the gist of Höhrmann’s code consists in repeatedly calling this small function:

// One step of Höhrmann’s state machine: utf8d is his 400-entry table,
// holding the character class of each byte (first 256 entries) followed
// by the state transitions.
bool is_ok(uint32_t* state, uint32_t byte) {
  uint32_t type = utf8d[byte];
  *state = utf8d[256 + *state * 16 + type];
  return (*state != 1); // state 1 is the reject (error) state
}
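
To validate a whole string with this function, one loops over the bytes, something like the following sketch (assuming Höhrmann’s utf8d table and the function above are in scope, and that state 0 is the accepting state, as in his code):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Run the state machine over every byte. We must avoid the reject state
// along the way and also finish in the accept state, so that a truncated
// multi-byte sequence at the end of the buffer is reported as invalid.
bool is_utf8_hoehrmann(const uint8_t *s, size_t len) {
  uint32_t state = 0; // UTF8_ACCEPT in Höhrmann's convention
  for (size_t i = 0; i < len; i++) {
    if (!is_ok(&state, s[i])) return false;
  }
  return state == 0;
}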
      

Then I went looking for a fancier, vectorized solution. That is, I want a version that uses advanced vector registers.

I found something sensible by Olivier Goffart. Goffart’s original code translates UTF-8 into UTF-16, which is more than I want done. So I modified his code slightly, mostly by removing the unneeded part. His code will only run on x64 processors.

To test these functions, I wanted to generate some random strings quickly, but to measure accurately, I need the strings to be valid UTF-8. So I simply generated ASCII strings. This makes the problem easier, so I probably underestimate the difficulty of the problem. The results obviously depend on the type of data, and lots of interesting inputs are mostly just ASCII anyhow.
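
For reference, such benchmark inputs can be produced with something as simple as this sketch (mine, not the actual benchmark code):

#include <stdint.h>
#include <stdlib.h>

// Fill a buffer with random ASCII bytes (values in [0,127]); the result
// is always valid UTF-8, so the validators are measured on valid input.
void fill_random_ascii(uint8_t *buf, size_t len) {
  for (size_t i = 0; i < len; i++) {
    buf[i] = (uint8_t)(rand() & 0x7F);
  }
}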

Olivier Goffart’s code “cheats” and short-circuits the processing when it detects ASCII code. That’s fine, but I created two versions of his function, one with and one without the “cheat”.

So, how quickly can these functions check that strings are valid UTF-8?

All figures are approximate, in cycles per byte.

string size   is ASCII?   Höhrmann’s finite-state machine   Goffart’s (with ASCII cheat)   Goffart’s (no ASCII cheat)
32            ~2.5        ~8                                ~5                             ~6
80            ~2          ~8                                ~1.7                           ~4
512           ~1.5        ~8                                ~0.7                           ~3

My source code is available.

The vectorized code gives us a nice boost… Sadly, in many applications, a lot of the strings can be quite small. In such cases, it seems that we need to spend something close to 8 cycles per byte just to check that the string is valid.

In many cases, you could short-circuit the check and just verify that the string is an ASCII string, but even that is not cheap, at about 2 cycles per input byte.
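
The ASCII check itself vectorizes nicely. Here is a sketch of mine (not the benchmarked code) that tests 16 bytes at a time with SSE2; the pmovmskb instruction (_mm_movemask_epi8) gathers the most significant bit of every byte into a 16-bit mask:

#include <emmintrin.h> // SSE2
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Returns true if every byte is in [0x00,0x7F]. The vector loop rejects
// a 16-byte block as soon as any byte has its most significant bit set;
// the scalar loop handles the last few bytes.
bool is_ascii_sse2(const uint8_t *s, size_t len) {
  size_t i = 0;
  for (; i + 16 <= len; i += 16) {
    __m128i chunk = _mm_loadu_si128((const __m128i *)(s + i));
    if (_mm_movemask_epi8(chunk) != 0) return false;
  }
  for (; i < len; i++) {
    if (s[i] >= 0x80) return false;
  }
  return true;
}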

I would not consider any of the code that I have used to be “highly optimized”, so it is likely that we can do much better. How much better remains an open question to me.

Update: Daniel Dunbar wrote on Twitter:

    I would expect that in practice best version would be highly optimized ascii only check for short segments, with fallback to full check if any in the segment fail

That is close to Goffart’s approach.

19 thoughts on “How quickly can you check that a string is valid unicode (UTF-8)?”

  1. The multibyte sequences basically start (in the high bits) with a unary count of the bytes to follow, and the following bytes all start with 10. So if each byte knows the 3 bytes to its right, it can figure out what it should look like if it’s the initial byte of a sequence. A simple bitmap is sufficient to build this branchlessly (the sequence must be contiguous, ie the first non-10 byte ends the frame, so a bitmap is needed).

    A rough idea:

    mask in all 10xxx bytes, map them to 0x01, 0x00 otherwise.
    shift left 2x and concatenate low bits, eg 0x0101 => 0x03. (not easy in SSE)
    mask off the 3-bit maps and map them to correct unary, eg 101 => 100.
    In byte positions which really are initial, compare constructed unary with original input.

    There are a few other parts to look for ascii and check ranges, but the multibyte structure seems solvable with this method.

      1. Another approach is sort of the reverse of that: translate the upper 4 bits to a following byte count and roll it right with saturating subtract. Output is nonzero where 10xxxxxx bytes should be.

          1. I emailed you some code, but for others here, I implemented this method and got about 16 SSE instructions to validate that the right types of bytes (ascii, continuation, 2-3-4 byte initials) are in the right places, including the lengths of continuations.

            The remaining parts are overlong sequences (basically numbers that should be encoded in fewer bytes, eg a 7-bit number encoded in a 2-byte sequence) and certain excluded ranges.

            I haven’t benchmarked yet, but the work so far looks like about 1 cycle/byte. The remaining checks may push that up quite a bit however.
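
            For concreteness, here is a rough single-block reconstruction of the “roll right with saturating subtract” structural check (my own sketch, not the code mentioned above). It assumes SSSE3 and ignores sequences that straddle 16-byte block boundaries, which a real implementation must carry across blocks:

            #include <tmmintrin.h> // SSSE3, for _mm_shuffle_epi8
            #include <stdbool.h>

            // Within one 16-byte block, check that continuation bytes
            // (10xxxxxx) appear exactly where the lead bytes require them.
            bool check_continuations_16(__m128i input) {
              // high nibble -> number of continuation bytes that must follow:
              // 0x0-0x7 (ASCII) and 0x8-0xB (continuations) -> 0,
              // 0xC-0xD -> 1, 0xE -> 2, 0xF -> 3
              const __m128i nibble_to_count = _mm_setr_epi8(
                  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 2, 3);
              __m128i high_nibbles =
                  _mm_and_si128(_mm_srli_epi16(input, 4), _mm_set1_epi8(0x0F));
              __m128i counts = _mm_shuffle_epi8(nibble_to_count, high_nibbles);
              // Mark the positions 1, 2 and 3 bytes after each lead byte,
              // whenever the lead byte's count reaches that far.
              __m128i mark = _mm_slli_si128(counts, 1);
              mark = _mm_or_si128(mark,
                  _mm_subs_epu8(_mm_slli_si128(counts, 2), _mm_set1_epi8(1)));
              mark = _mm_or_si128(mark,
                  _mm_subs_epu8(_mm_slli_si128(counts, 3), _mm_set1_epi8(2)));
              __m128i required = _mm_cmpgt_epi8(mark, _mm_setzero_si128());
              // A byte actually is a continuation byte iff its top bits are 10.
              __m128i actual = _mm_cmpeq_epi8(
                  _mm_and_si128(input, _mm_set1_epi8((char)0xC0)),
                  _mm_set1_epi8((char)0x80));
              // Valid iff required and actual agree at every position.
              return _mm_movemask_epi8(_mm_xor_si128(required, actual)) == 0;
            }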

    1. My code uses a different convention (signed char) than my prose and it can be confusing.

      If you rewrite the function so that it takes unsigned char inputs, as is assumed in the text, then you need to change the check to something like “is the value greater than or equal to 128”.

      Otherwise it is all equivalent from a practical point of view.

      1. This isn’t about rewriting the function to take unsigned char inputs, this is about the fact that the plain “char” inputs in your code snippet may be unsigned already (whether “char” is the same as “signed char” or “unsigned char” is not guaranteed by the standard; it is left up to the implementation in question).

        So your function only works some of the time (i.e. on systems+compilers where “char” is “signed char”) and fails on others (i.e. on systems+compilers where “char” is “unsigned char”).

        1. If you want to be pedantic, then there is no reason to believe that the system is using two’s complement, which I assume… and, of course, my code assumes support for SSE 4.2 instructions, which are in no way provided by the standard. I also use inline assembly specific to recent x86 processors. I also assume that chars are octets, which is not something specified in the standard.

          1. Don’t be so defensive, there’s no need. Someone asked a question (“will your code work where characters are unsigned”) and you misinterpreted the question and answered some other question. I clarified the question for you; no one’s trying to pick your code apart and be pedantic…

            And there are plenty of systems in use where chars are unsigned. PowerPC is one of them. Also gcc has an “-funsigned-char” option that makes chars unsigned everywhere.

            But that’s not the point…

  2. To do an initial check to see if the entire string is ascii, couldn’t you do a bitwise operation to check if all bytes match the pattern 0xxxxxxx? Like run one operation across all of them at once, so they are checking themselves (so to speak) as well as the provided mask? Or am I talking magic talk? (Which isn’t uncommon for when my limited understanding of logic meets my even more limited understanding of lower level programming.)

    1. Goffart’s code uses pmovmskb to move a mask of the high bits of each char into an int for this reason. Since it’s already keyed off of the high bit, it’s one instruction.
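
       That bitwise idea also works without vector instructions. Here is a SWAR sketch of mine (for illustration): OR eight bytes at a time into a 64-bit word, then test the most significant bit of every accumulated byte at once:

       #include <stdbool.h>
       #include <stdint.h>
       #include <string.h>

       // Returns true if no byte has its most significant bit set, i.e.,
       // if every byte matches the 0xxxxxxx pattern.
       bool is_ascii_swar(const uint8_t *s, size_t len) {
         uint64_t acc = 0;
         size_t i = 0;
         for (; i + 8 <= len; i += 8) {
           uint64_t w;
           memcpy(&w, s + i, sizeof(w)); // safe unaligned load
           acc |= w;
         }
         for (; i < len; i++) acc |= s[i];
         return (acc & 0x8080808080808080ULL) == 0;
       }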

  3. I haven’t tried running your code, but checking for valid UTF-8 isn’t simply a matter of checking the byte ranges as you said above, because there are more rules, such as the requirement that the bytes for a code point form the shortest possible encoding.

    All of the two-byte characters are made of a byte in [0xC2,0xDF] followed by a byte in [0x80,0xBF].
    There are four types of characters made of three bytes. For example, if the first byte is 0xE0, then the next byte must be in [0xA0,0xBF], followed by a byte in [0x80,0xBF].

    For example, although C0 9F is a correct 2-byte sequence according to the above statement, it’s not actually valid UTF-8, because that code point should be represented as the single byte 1F, which is the shortest form. Encoding surrogate code points is also prohibited. For example, ED A0 81 is invalid because it decodes into D801, which is a surrogate.

    1. 0xC0 < 0xC2, so the table above already excludes that sequence.

      I didn’t check too closely, but the overlong sequences are all defined as having 0’s in their high bits down to the next smaller bit width, eg bits [7:10] of the 11-bit 2-byte encoding should have at least one 1 bit, and that translates to a minimum of 0xC2 in the 2-byte initial. So minimum thresholds (sometimes carried across 2 bytes) are sufficient to prevent overlongs.

      The UTF-8 wiki page has a clearer table of bit patterns and forbidden sequences. https://en.wikipedia.org/wiki/UTF-8. One thing they note is that the encoded sequences sort lexicographically to the same order as the integers they encode, so ranges, maxima, etc. translate across as well.
