Avoid character-by-character processing when performance matters

When processing strings, it is tempting to view them as arrays of characters (or bytes) and to process them as such.

Suppose that you would like to determine whether a string is ASCII. In ASCII, every character must be a byte value smaller than 128. A fine C++17 approach to check that a string is ASCII might look as follows.

bool is_ascii_branchy(const std::string_view v) {
  for (size_t i = 0; i < v.size(); i++) {
    if (uint8_t(v[i]) >= 128) {
      return false;
    }
  }
  return true;
}

It is important to consider the logic of this code. What you are telling the compiler is to access each character in sequence, check whether it is an ASCII character, and bail out if it is not. Thus if the string starts with a non-ASCII character, only the first character should be read.

It might be high-performance code if you are expecting the strings to mostly start with non-ASCII characters. But if you are expecting the strings to be almost always ASCII, then this code is not going to be optimal.

You might complain that the compiler should be able to optimize it for you, and it will, but only within the constraints of the code you provided. Compilers are typically not in the business of redesigning your algorithms.

If you are expecting ASCII inputs, then you should just run through the string using as few steps as possible. The following code relies on the fact that our processors can process 64-bit blocks using single instructions:

bool is_ascii_branchless(const std::string_view v) {
  uint64_t running = 0;
  size_t i = 0;
  for (; i + 8 <= v.size(); i += 8) {
    uint64_t payload;
    memcpy(&payload, v.data() + i, 8);
    running |= payload;
  }
  for (; i < v.size(); i++) {
    running |= v[i];
  }
  return (running & 0x8080808080808080) == 0;
}

It is an optimistic function: if you encounter a non-ASCII character early on, you will end up doing a lot of needless work if the string is long.

You could try a hybrid between the two. You read 8 characters, check whether they are ASCII, and bail out if they aren’t.

bool is_ascii_hybrid(const std::string_view v) {
  size_t i = 0;
  for (; i + 8 <= v.size(); i += 8) {
    uint64_t payload;
    memcpy(&payload, v.data() + i, 8);
    if ((payload & 0x8080808080808080) != 0) { return false; }
  }
  for (; i < v.size(); i++) {
    if ((v[i] & 0x80) != 0) { return false; }
  }
  return true;
}

How do these functions compare? I wrote a quick benchmark with short ASCII strings (fewer than 128 characters). I find that the character-by-character version runs at about half the speed. Your results may vary; feel free to run my benchmark on your machine with your compiler.

character-by-character 2.0 GB/s
optimistic 3.5 GB/s
hybrid 3.4 GB/s

With some work, you can probably go much faster, but be mindful of the fact that I deliberately chose a benchmark with small, fragmented strings.

Daniel Lemire, "Avoid character-by-character processing when performance matters," in Daniel Lemire's blog, July 21, 2020.

Published by Daniel Lemire, a computer science professor at the University of Quebec (TELUQ).

20 thoughts on “Avoid character-by-character processing when performance matters”

  1. Can eliminate the second loop: check if i > 0 and, if so, slice the last 8 bytes and OR them into the result.

      1. For the left-over string (fewer than 8 characters but more than 0), still do a memcpy into a zeroed 64-bit unsigned integer, just use the left-over length as the length parameter. Your check against the 64-bit mask should still be right, as the memcpy won’t touch the zeroed bits of the payload that are outside of that length.

        1. A variable length memcpy isn’t much better than this short loop and internally may use a loop or some sort of indirect branch dispatching to a copy routine specialized for the length.

          I think a better approach, if you expect the total input length to usually be at least 8, is to do a final length 8 memcpy, but aligned to the end of the string. If the length of the string wasn’t a multiple of 8, this checks some characters “twice” but this check is very fast in any case.

          With this change you can also change the primary loop condition to i + 8 < v.size(), since the final 1 to 8 bytes can be handled by this last check, and you’d rather do it unconditionally (at least for the v.size() >= 8 case) to avoid an extra branch misprediction.
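          The overlapped-tail idea above might be sketched as follows (a hypothetical is_ascii_tail_overlap, not code from the post or thread): process 8-byte blocks, then do one unconditional 8-byte read aligned to the end of the string, re-checking up to 7 bytes; strings shorter than 8 bytes fall back to a plain byte loop.

```cpp
#include <cstdint>
#include <cstring>
#include <string_view>

bool is_ascii_tail_overlap(std::string_view v) {
  if (v.size() < 8) {  // fallback for tiny strings
    uint8_t running = 0;
    for (char c : v) { running |= uint8_t(c); }
    return (running & 0x80) == 0;
  }
  uint64_t running = 0;
  size_t i = 0;
  // stop one block early: the tail read below covers the last 1 to 8 bytes
  for (; i + 8 < v.size(); i += 8) {
    uint64_t payload;
    memcpy(&payload, v.data() + i, 8);
    running |= payload;
  }
  // final 8 bytes, aligned to the end; may overlap bytes already checked
  uint64_t tail;
  memcpy(&tail, v.data() + v.size() - 8, 8);
  running |= tail;
  return (running & 0x8080808080808080ULL) == 0;
}
```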

  2. Something like this, I’d presume…

    if (v.size() > 8) {
      memcpy(&payload, v.data() + v.size() - 8, 8);
      running |= payload;
    }

    But then you’re still left with cases where v.size() < 8, where you’d need the second loop anyway…

    1. I wonder if it would be an improvement to switch on the number of remaining characters rather than use a second loop (assuming that string length is evenly distributed).

      1. The way Stuart suggests I think is almost strictly better than the loop or the switch: both involve an unpredictable branch for unpredictable sizes. Between the two, the switch is probably somewhat faster at the cost of larger code size.

        Stuart’s approach only involves an unpredictable branch in the case the size varies between less than 8 and greater than 8 unpredictably, which seems much rarer and as a binary predicate would be somewhat predictable in almost any non-contrived case. If you really want to get rid of this final source of unpredictability, you can use a masked load.

  3. This is a fun exercise, but does this ever occur in real life? I can’t say I’ve ever needed to test a large block of arbitrary bytes to determine if they were ASCII.

    1. I can’t say I’ve ever needed to test a large block of arbitrary bytes to determine if they were ASCII.

      I have. I do / did on a regular basis. You could consider it a ‘large block of arbitrary bytes’, but also lots and lots of small blocks (strings) of arbitrary bytes. They promise some field should be ASCII (it’s actually called “AsciiName”) for old(er) systems that don’t support anything else but ASCII, but non-ASCII values keep returning in their exports. Up to this day. That is exactly why I based my .NET benchmarks from this post on that exact data.

      1. It sounds like the real problem you’re facing is that you’re trying to maintain a ‘database’ which has no schema enforcement (giant binary file) and no access control (unknown users can edit it at any time).

        Given those circumstances, writing a program to parse hundreds of megabytes to check for the high-bit being set can never be a complete or reliable solution. (No matter how fast you can check it, there will always be a race condition, as you note.) Wouldn’t a better solution be to create an interface to this data which enforced the desired schema during edits? You only need to check the diffs.

        No other system I use — and certainly none which cares about performance or security — uses the approach of letting anyone make arbitrary edits to raw data at any time, and then trying to verify the entire database after the fact. I don’t run a script every night to make sure all my JPEGs are still JPEGs, and an unknown user didn’t log in and put bogus data in them. The closest I can think of is Wikipedia, and even they don’t analyze hundreds of megabytes on every edit.

        I’d say: “Avoid doing O(n) work unnecessarily when performance matters”. You can justify all sorts of craziness if you begin by using the wrong data structure. Unfixable legacy system designs are one possible use of this, true, but I wouldn’t say they’re common.

        1. Except that I have no control over the data (3rd party data). And I have told them on several occasions that their “ASCII” input/validation is broken. But there’s nothing I can do to fix it. All you can do is download dumps.

          And of course you can create (or download) diffs, and of course you can do all kinds of smart optimisations. That’s all beside the point. If you have to, for -whatever- occasion, do this, then it’s nice to know what works. Hell, even if it’s just for fun. Because we can. Just because it’s beyond your imagination doesn’t mean it’s not real.

    2. Yes, it does. If you want to validate UTF-8, and your strings are almost always ASCII, it’s much faster to check for ASCII first and skip validation if so.
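      That pattern might be sketched as follows (hypothetical function names, not from the post; the fallback validator checks only lead/continuation byte structure and is not fully strict, e.g. it does not reject overlong encodings or surrogate code points):

```cpp
#include <cstdint>
#include <cstring>
#include <string_view>

// fast path: a pure-ASCII string is always valid UTF-8
bool is_ascii(std::string_view v) {
  uint64_t running = 0;
  size_t i = 0;
  for (; i + 8 <= v.size(); i += 8) {
    uint64_t payload;
    memcpy(&payload, v.data() + i, 8);
    running |= payload;
  }
  for (; i < v.size(); i++) { running |= uint8_t(v[i]); }
  return (running & 0x8080808080808080ULL) == 0;
}

// slow path sketch: verify lead bytes and continuation bytes only
bool validate_utf8_slow(std::string_view v) {
  size_t i = 0;
  while (i < v.size()) {
    uint8_t b = uint8_t(v[i]);
    size_t len;
    if (b < 0x80) { len = 1; }
    else if ((b & 0xE0) == 0xC0) { len = 2; }
    else if ((b & 0xF0) == 0xE0) { len = 3; }
    else if ((b & 0xF8) == 0xF0) { len = 4; }
    else { return false; }  // stray continuation or invalid lead byte
    if (i + len > v.size()) { return false; }  // truncated sequence
    for (size_t j = 1; j < len; j++) {
      if ((uint8_t(v[i + j]) & 0xC0) != 0x80) { return false; }
    }
    i += len;
  }
  return true;
}

bool validate_utf8(std::string_view v) {
  if (is_ascii(v)) { return true; }  // skip full validation for ASCII input
  return validate_utf8_slow(v);
}
```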

  4. Yes, treating the last 8 and overlaying a few bytes is quicker in my tests. Here is a version that also handles the < 8 bytes case.

    inline bool is_ascii_branchless2(const std::string_view v) {
      uint64_t less_than_8 = 0;
      uint64_t* payload = &less_than_8;
      uint8_t r = (v.size() & 7);
      size_t d = (v.size() >> 3);
      uint64_t running = 0;

      if (v.size() <= 8) { // Equal to 8 to skip checking the block twice.
        memcpy(&less_than_8, v.data(), r);
      } else {
        payload = (uint64_t*)( v.data() );
      }

      for (; d; payload++, d--) {
        running |= *payload;
      }

      uint8_t* remaining = (uint8_t*)payload;
      remaining += r;
      running |= *(uint64_t*)remaining;

      return ((running & 0x8080808080808080) == 0);
    }

  5. That line where character data is memcpy-ed

    memcpy(&payload, v.data() + i, 8);

    ‘payload’ variable might contain whatever data comes after the buffer if the memcpy transfers 8 bytes every time. In your test examples you used an input of 1000000 chars, which % 8 == 0.

    1. That’s why it’s inside a loop with condition “i + 8 <= v.size()”, to protect from ever over-reading the buffer.

      1. Yep, my bad. I was thinking two things at once – over-reading and under-reading – that’s why I misunderstood it at first. Thanks.

  6. Thanks for this post, I enjoy the examples of optimisations on this blog.

    If you try a mix of non-ascii strings, the branchless approach still performs well when a high percentage of strings are ascii. I found the crossover between the branchless and nearly branchless approaches to be at about 83% ascii strings for this benchmark.

    To test that, I replaced the following on line 107:

    buffer[i] = (rand() % 128);

    With the following to get a mix of nonascii characters created within the strings:

    if (rand() % 160 == 0)
      buffer[i] = (rand() % 256);
    else
      buffer[i] = (rand() % 128);

    I get the following output on an i7-3770 with gcc 9.3.0:

    Creating 1000000 strings
    branchy:1.56017 GB/s
    branchless:3.32739 GB/s
    nearly branchless:3.32863 GB/s
