Ridiculously fast unicode (UTF-8) validation

One of the most common “data types” in programming is the text string. When programmers think of a string, they imagine that they are dealing with a list or an array of characters. It is often a “good enough” approximation, but reality is more complex.

The characters must be encoded into bits in some way. Most strings on the Internet, including this blog post, are encoded using a standard called UTF-8. The UTF-8 format represents “characters” using 1, 2, 3 or 4 bytes. It is a generalization of the ASCII standard which uses just one byte per character. That is, an ASCII string is also a UTF-8 string.

It is slightly more complicated because, technically, what UTF-8 describes are code points, and a visible character, like an emoji, can be made of several code points… but that is a pedantic distinction for most programmers.

There are other standards. Some older programming languages like C# and Java rely on UTF-16. In UTF-16, you use two or four bytes per character. It seemed like a good idea at the time, but I believe that the consensus is increasingly moving toward using UTF-8 all the time, everywhere.

What most character encodings have in common is that they are subject to constraints and that these constraints must be enforced. To put it another way, not just any random sequence of bits is valid UTF-8. Thus you must validate that the strings you receive are valid UTF-8.

Does it matter? It does. For example, Microsoft’s web server had a security vulnerability whereby one could send URIs that appeared to the security checks to be valid and safe, but that, once interpreted by the server, allowed an attacker to navigate the disk of the server. Even if security is not a concern, you almost surely want to reject invalid strings before you store them in your database, as it is a form of corruption.

So your programming languages, your web servers, your browsers, your database engines, all validate UTF-8 all of the time.

If your strings are mostly just ASCII strings, then checks are quite fast and UTF-8 validation is no issue. However, the days when all of your strings were reliably ASCII strings are gone. We live in the world of emojis and international characters.

Back in 2018, I started wondering… How fast can you validate UTF-8 strings? The answer I got back then is a few CPU cycles per character. That may seem satisfying, but I was not happy.

It took years, but I believe we have now arrived at what might be close to the best one can do: the lookup algorithm. It can be more than ten times faster than common fast alternatives. We wrote a research paper about it: Validating UTF-8 In Less Than One Instruction Per Byte (to appear at Software: Practice and Experience). We have also published our benchmarking software.

Because we have a whole research paper to explain it, I will not go into the details, but the core insight is quite neat. Most of the UTF-8 validation can be done by looking at pairs of successive bytes. Once you have identified all violations that you can detect by looking at all pairs of successive bytes, there is relatively little left to do (per byte).

Our processors all have fast SIMD instructions. They are instructions that operate on wide registers (128 bits, 256 bits, and so forth). Most of them have a “vectorized lookup” instruction that can take, say, 16 byte values (in the range 0 to 15) and look them up in a 16-byte table. Intel and AMD processors have the pshufb instruction that matches this description. A value in the range 0 to 15 is sometimes called a nibble; it spans 4 bits. A byte is made of two nibbles (the low and high nibble).

In the lookup algorithm, we call a vectorized lookup instruction three times: once on the low nibble, once on the high nibble and once on the high nibble of the next byte. We have three corresponding 16-byte lookup tables. By choosing them just right, the bitwise AND of the three lookups will allow us to spot any error.

Refer to the paper for more details, but the net result is that you can almost entirely validate a UTF-8 string using roughly 5 lines of fast C++ code without any branching… and these 5 lines validate blocks as large as 32 bytes at a time…

simd8 classify(simd8 input, simd8 previous_input) {
  auto prev1 = input.prev<1>(previous_input);
  auto byte_1_high = prev1.shift_right<4>().lookup_16(table1);
  auto byte_1_low = (prev1 & 0x0F).lookup_16(table2);
  auto byte_2_high = input.shift_right<4>().lookup_16(table3);
  return (byte_1_high & byte_1_low & byte_2_high);
}
It is not immediately obvious that this would be sufficient and 100% safe. But it is. You only need a few inexpensive additional technical steps.

The net result is that on recent Intel/AMD processors, you need just under an instruction per byte to validate even the worst random inputs, and because of how streamlined the code is, you can retire more than three such instructions per cycle. That is, we use a small fraction of a CPU cycle (less than 1/3) per input byte in the worst case on a recent CPU. Thus we consistently achieve speeds of over 12 GB/s.

The lesson is that while lookup tables are useful, vectorized lookup tables are fundamental building blocks for high-speed algorithms.

If you need to use the fast lookup UTF-8 validation function in a production setting, we recommend that you go through the simdjson library (version 0.5 or better). It is well tested and has features to make your life easier, like runtime dispatching. Though the simdjson library is motivated by JSON parsing, you can use it to just validate UTF-8 even when there is no JSON in sight. The simdjson library supports 64-bit ARM and x64 processors, with fallback functions for other systems. We package it as a single header file along with a single source file, so you can pretty much just drop it into your C++ project.

Credit: Muła popularized, more than anyone, the vectorized classification technique that is key to the lookup algorithm. To my knowledge, Keiser first came up with the three-lookup strategy, and the first practical (non-hacked) SIMD-based UTF-8 validation algorithm was crafted by K. Willets. Several people, including Z. Wegner, showed that you could do better. Travis Downs also provided clever insights on how to accelerate conventional algorithms.

Further reading: If you like this work, you may like Base64 encoding and decoding at almost the speed of a memory copy (Software: Practice and Experience 50 (2), 2020) and Parsing Gigabytes of JSON per Second (VLDB Journal 28 (6), 2019).

Daniel Lemire, "Ridiculously fast unicode (UTF-8) validation," in Daniel Lemire's blog, October 20, 2020.

Published by Daniel Lemire, a computer science professor at the University of Quebec (TELUQ).

24 thoughts on “Ridiculously fast unicode (UTF-8) validation”

    1. This seems to require SSE3. What is the best that can be done using only SSE2? Thanks!

      On Intel processors, you need SSSE3 which came out in 2006. Unfortunately, I do not have access to hardware that old. I think that the oldest x64 processor I have is the Jaguar in my PS4 and it has SSSE3. Even old embedded Intel Atom processors have SSSE3.

      My expectation is that a processor so old as to not support SSSE3 would not be supported by Windows 10.

      What types of computers are you working on? Pentium 4 systems maybe? From a legacy architecture?

      1. Windows 10 only requires SSE2, FWIW, and plenty of consumer software still targets SSE2 as its baseline.

        The Steam Hardware Survey shows 98.88% support for SSSE3 (versus 100% for SSE2, which I believe is inevitable given Steam requires SSE2), so it’s not totally universal even today. (And Steam users possibly have quicker hardware than average?)

        1. Windows 10 only requires SSE2 (…) The Steam Hardware Survey shows 98.88% support for SSSE3

          The same survey shows that Windows 10 represents 89% of all OSes with somewhere between 5% to 10% of PCs running older versions of Windows. Out of the 1.2% of folks with PCs that fail to support SSSE3, what fraction of them run on Windows 10? I’d guess a small fraction, certainly less than 1%. My understanding is that Microsoft only commits to supporting Windows 10 on hardware that falls within an OEM support agreement. It is almost certainly the case that a system that does not have SSSE3 is not part of a supported licence agreement with Microsoft. No matter how you look at it, we are talking about the legacy tail end.

          (And Steam users possibly have quicker hardware than on average?)

          I am sure that’s true. My sons’ school runs on Windows 2000. And a University I will not name runs its infrastructure on Windows NT. It is still legacy. Do not expect big exciting contracts building high speed software for Windows 2000. These customers using Windows 2000 are not shopping for new software to improve the performance of their machines.

          You are absolutely not going to get 100% SSE2 out in the real world. In fact, you have many 16-bit x86 software systems out there. There are people running companies on Apple II computers or the like.

          There are many people who will laugh at the idea of using anything so modern as C++, and let us not even get into recent standards. C99 is esoteric. They are coding in C with a vendor-supplied compiler that has not seen any update in 20 years.

          There is no end to the tail end… it goes far back and always will.

          But do not for a minute think that you are going to spin up a 20-year-old system and run Doom in your browser.

          1. Can you provide some evidence? Some of these examples are flirting with a gossipy tone that suggests campfire stories, while others are simply irrelevant and would never be considered in scope of the “100%” target.

            1. Can you provide some evidence?

              Evidence of what?

              That some people run 16-bit software to this day? Come on. Live a little… go into the underbelly of large and old organizations. Just google how one can run 16-bit applications under Windows 10. There are plenty of discussions on the topic.

              Pascal code written in the 1980s is still the backbone of large organizations.

              But even Pascal is too modern for some. Look at this 2020 article about unemployment systems and ATMs still relying on COBOL code from people long since retired.

              That people use really old computers to run businesses? See this 2016 article about a business relying on a Commodore 64.

              Schools that still run Windows 2000 or Windows NT? Here is an answer on Quora from 2018:

              they where still running 5 NT 4 servers on machines with P2
              processors, 2 IBM Thinkcentres running NT4 workstation and about 10
              PC’s running XP. Unbelievable. Looking at the dates on the system, I
              think some of the NT4 servers where installed in 2004.

              That you are not going to get 100% SSE2 in the real world on x86 processors? There are people maintaining Firefox forks for CPUs without SSE.

              Look at this quote from the COBOL article…

              Some COBOL experts bristle at state leaders’ blaming current backlogs
              of unemployment claims on a lack of coders. (…) The explanation,
              they say, is that governments have not updated the bulky and sluggish
              computers that run on COBOL, according to Derek Britton, product
              director at England-based Micro Focus, which helps clients navigate
              such updates.

              Do you think that the bulky and sluggish computers they refer to have SSE2 support? No, they do not.

              If you doubt that there are really old systems still running today, you have simply never worked for an old, conservative organization. Period.

              The legacy tail end is very long.

            2. One of Paris’ airports had to shut down for several hours in 2015 because a computer crashed. They were running Windows 3.1.

      2. AMD K10 (Phenom, Phenom II and similar) was the last uArch to not support SSSE3. I think they were first released in 2007 and the last ones were released around 2010 (the 6 core Phenom IIs, and probably some APUs).

        The CPUs were fairly popular in their day, as they were good value, and their successor, AMD’s Bulldozer line, often performed worse, so many people kept buying K10s even after BD’s release. As such, there’s still quite a number of them around (in fact, I have one right here – mostly for testing SSE2 fallback performance).

        Note that while the early Atoms (and AMD Bobcat) supported SSSE3, PSHUFB performance on them was fairly poor (5-6 cycle recip. throughput). Still, PSHUFB is difficult to emulate on SSE2, and you may be better off falling back to scalar code there.

        1. Thanks for the informative comment.

          Still, PSHUFB is difficult to emulate on SSE2, and you may be better off falling back to scalar code there.

          Right. If someone follows my advice regarding production use, that is what is going to happen. If you are using 10-year old processors from AMD, you will get a fallback code routine that should serve you well, but will be nowhere close to what a Zen 2 can do.

    2. To answer your question more directly: if someone has legacy hardware, I would not bother with any software optimization. It seems misplaced to try to optimize software for hardware that cannot be purchased anymore.

      Software and hardware are in a closed loop of self-improvements. We upgrade the hardware, then upgrade the software to benefit from the hardware which further motivates new hardware, and so forth.

      People are willing to pay a large premium for relatively small gains. A 20% boost in hardware performance often translates into twice the price or more. To these people, who pay extra for small performance gains, the added software performance is valuable.

      The counterpart to this observation is that for applications where the extra cost of upgrading your server every decade is too much, we can infer that the value of software performance is low.

      If you are in cost-minimization mode, that is, you are willing to trade hardware performance for low cost, then you probably want to do as few software upgrades as you can. My view is that if you have legacy hardware, you are probably well served with older software.

  1. Travis Downs’s tests indicate that simply touching the upper 128 bits of an AVX register will trigger a (latency inducing) voltage transition. Did you observe that? This kind of fear has kept me from chasing SIMD acceleration of UTF-8 validation that’s in the middle of a bunch of other work.

    1. Well, this is a one-time cost, so it is hard to measure in any type of “typical” benchmark, which executes the code under test millions of times. E.g., if your benchmark runs for at least 10 milliseconds (most run much longer), a transition period of 20 us (at the high end of what I measured) would add only 0.2% to the benchmark and be totally lost in the noise.

      Furthermore, any benchmark which does any type of “warmup” iteration and doesn’t include this in the timing would never see the effect at all.

      Overall, I don’t expect the transition timing to make much of a difference in most use cases that care about “milliseconds” (rather than nano or microseconds) or about throughput. The max throughput loss is small and the max latency penalty is small. Only people who might lose a bunch of money for occasional latency spikes in the order of 10s of microseconds are likely to care.

      This is different from (but related to) frequency downclocking, where the CPU runs at a lower frequency for a significant period of time: this is a more significant effect and more people would care about this, but you aren’t likely to experience it just by using 256-bit instructions.

      Finally, you might not care about this at all in the context of the algorithms presented here because you expect the rest of your binary to already be full of 256-bit instructions, as it will be if you compile with the CPU architecture set to something modern like -march=skylake or -march=native (the latter depending on where you compile it).

  2. I wonder why Intel, who likes creating new instructions all the time, didn’t create a couple of instructions for vectorized UTF8 parsing and validation. It should be cheap in hardware resources, and UTF8 is here to stay.

    1. They did create string-comparison instructions and it did not end well.

      In the broader scheme of things, you are right that hardware support for UTF-8 is not at all silly. In fact, I bet it already exists… just not in mainstream Intel processors.

  3. A value in the range 0 to 16 is sometimes called a nibble

    Did you mean “0 to 15”? Or is range meant to be right-open (“[0, 16)”)?

  4. The benefit of SSE2 is that you can assume it is available. The problem with SSSE3 is that you have to write a bit of application prologue code which checks, at runtime, whether or not SSSE3 is available. And then you need to provide a non-SSSE3 implementation if necessary. This adds an additional run-time check. And it makes designing the library more difficult because you have a global variable (“static bool isSSE3Available;”). This is done with a “magic static” which adds costs.

  5. It really just shows that UTF-16 is a much simpler encoding to deal with. Only two different encoding lengths, and only mismatched surrogates are errors, not four different encoding lengths and all those pesky combinations of two- and three- and four-byte mismatches, overlong sequences to reject, etc., a total of 13 different error cases in UTF-8. It is also easier to parse a minus sign by checking for 0x2212 than by checking three consecutive bytes. UTF-8 is the bloatware text encoding.
