Quickly identifying a sequence of digits in a string of characters

Suppose that you want to quickly determine a sequence of eight characters are made of digits (e.g., ‘9434324134’). How fast can you go?

In software, characters are mapped to integer values called the code points. The ASCII and UTF-8 code points for the digits 0, 1,…, 9 are the consecutive integers 0x30, 0x31, …, 0x39 in hexadecimal notation.

Thus you can check whether a character is a digit by comparing with 0x30 and 0x39: ((c < 0x30) || (c > 0x39)). It is even cheaper than it looks because optimizing compilers simply take the code point, subtract 0x30 from it and compare the result with 9. So there is a single comparison!

Given a stream of characters, the conventional approach in C or C-like language is to loop over the sequence of characters and check that each one is a digit.

bool is_made_of_eight_digits(unsigned char * chars) {
    for(size_t j = 0; j < 8; j++) {
          if((chars[j] < 0x30) || (chars[j] > 0x39)) {
              return false;
          }
    }
    return true;
}

Can we do better? Instead of doing eight comparisons (one per character), we would like to do only one or two. For this, we can use SIMD within a register (SWAR): load the eight characters into a 64-bit integer and do some operations on the result integers.

Here is a simple “branchless” approach first… it does a single comparison:

bool is_made_of_eight_digits_branchless(const unsigned char  * chars) {
    uint64_t val;
    memcpy(&val, chars, 8);
    return ( val & (val + 0x0606060606060606) &0xF0F0F0F0F0F0F0F0) 
             == 0x3030303030303030;
}

(Credit: Travis Downs simplified and improved the version I had.)

In some instances, it might be faster to do two comparisons instead of one, especially if the results are predictible.

bool is_made_of_eight_digits_branchy(unsigned char  * chars) {
    uint64_t val;
    memcpy(&val, chars, 8);
    return (( val & 0xF0F0F0F0F0F0F0F0 ) == 0x3030303030303030)
      && (( (val + 0x0606060606060606) & 0xF0F0F0F0F0F0F0F0 )
      == 0x3030303030303030); 
}

How do they work? One the one hand, they check that the most significant 4 bits of each character is the value 0x3. Once this is done, you know that the characters value must range in 0x30 to 0x3F, but you want to exclude the values from 0x3a to 0x3f. If you add 0x06 to the integers from 0x30 to 0x39 you get the integers 0x36 to 0x3f, but adding 0x06 to 03a gives you 0x40. So you can add 0x06 to each byte and check again that the most significant 4 bits of each character is the value 0x3.

It is crazily hard to benchmark such routines because their performance is highly sensitive to the data inputs. You really want to benchmark them on your actual data. And compilers matter a lot. Still, we can throw some synthetic data at it and see how well they fare (on a skylake processor).

compiler conventional SWAR 1 comparison SWAR 2 comparisons
gcc 8 (-O2) 11.4 cycles 3.1 cycles 2.5 cycles
gcc 8 (-O3) 5.2 cycles 3.1 cycles 3 cycles
clang 6 (-O2) 5.3 cycles 2.4 cycles 2.2 cycles
clang 6 (-O3) 5.3 cycles 2.4 cycles 2.1 cycles

The table reports the average time (throughput) to check that a sequence of eight characters is made of digits. In my tests, the branchless approach is not the fastest. Yet it might be a good bet in practice because it is going to run at the same speed, irrespective of the data.

Let us some consider less regular data where the processors cannot easily guess the result of the comparisons:

compiler conventional SWAR 1 comparison SWAR 2 comparisons
gcc 8 (-O2) 12.1 cycles 3.1 cycles 5.2 cycles
gcc 8 (-O3) 7.2 cycles 3.1 cycles 5.1 cycles
clang 6 (-O2) 7.3 cycles 2.4 cycles 4.3 cycles
clang 6 (-O3) 7.1 cycles 2.4 cycles 4.5 cycles

My source code is available.

Further reading. After working on this problem a bit, and finding a workable approach, I went on the Internet to check whether someone had done better, and I found an article by my friend Wojciech Muła who has an article on the exact same problem. It is a small world. His approach is similar although he has no equivalent to my single-comparison function.

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

10 thoughts on “Quickly identifying a sequence of digits in a string of characters”

  1. It would be interesting to know why there’s an almost factor of two difference between gcc and clang in the conventional case!

    1. Loop unrolling. Newer gcc doesn’t unroll at all at -O2, but clang unrolls aggressively. The -O2 loop on gcc contains only a single check and is about half loop overhead. If you use -O3 they end up equivalent.

      This observation applies to other types of optimizations as well: -O2 in clang is in no way equivalent to -O2 in gcc. For example gcc never vectorizes at -O2 but clang vectorizes all the time, etc.

      The above doesn’t apply if you use PGO.

      Definitely makes it hard to do apples to apples comparison.

  2. How about:

    ( val & (val + 0x0606060606060606) & 0xF0F0F0F0F0F0F0F0 ) == 0x3030303030303030

    ?

    Cuts out two operations. It is usually faster than the existing branchless version, although nothing ever beat “branchy” for the compilers I tried.

  3. I posted another comment with a possibly improved “branchless” solution, but it never showed up.

    Anyways, is either test supposed to lead to unpredictable results? As far as I can tell the generatefloats and generatevarfloats routines both generate floats with random values but predictable length (16 vs 12) but don’t otherwise differ?

    The branch solution is favored here because the results follow a predictable pattern, and because the failure (the non-digit results) shortcut most of the work. If you create variable length floats, with this one line change:

    pos += sprintf((char *)stroutput + pos, "%.*f,", rand() % 20 + 1, output);

    Then branchy starts to lose.

    It is also possible that the compiler compiles branchy to branchless code: but all the ones I checked went half-way: they do the first check branchy and the second check branchless. Since the first check usually fails for any non-digit character (except for 6 characters right below 0), this works well in the predictable case!

  4. Here’s a relevant link – the approach there is general, but the condition is slightly different (it looks for any byte in the range, rather than all bytes in the range, but you could transform one to the other in a straightforward way.

    The idea is more or less the same (exploit carries to do a specific range check), but a bit slower since it’s more general.

    The “Determine if a word has a byte greater than n” is interesting since its only three op, if you add one op to subtract ‘\0’ then it’s 4 ops, just either tried or one more than my suggestion above, depending on if you treat the implicit == 0 which is not counted on that page as an op or not. On x86 it is probably “not an op” since compare against zero is automatic, unlike comparison against other values: you just use the ZF after your last arithmetic op.

  5. I replied to this on Twitter, but for the record… I think there is a version of this that works with a add/compare on SIMD. Specifically one would add a magic number (70, IIRC) that pushes ‘0’ to 118 in the result and ‘9’ to 127.

    +127 is the maximal signed byte, so signed-gt comparison against 117 should do the trick.

    This doesn’t necessarily seem faster than the best SWAR variant but might be handy if you are looking for more than 8 chars.

  6. i was wondering if a similar approach could be used to validate hexadecimal strings. I tried today, but I’m afraid I failed. Any help, tips or trick appreciated!

    1. It seems more difficult to do it efficiently in portable C code. In our routine, we use the fact that the digits are consecutive (0x30 to 0x39) in ASCII. It is no longer true if you want to consider hexadecimal ‘digits’ which include a,b, A,B… etc. It suggests that the check for hexadecimal strings could be more expensive.

Leave a Reply

Your email address will not be published.

To create code blocks or other preformatted text, indent by four spaces:

    This will be displayed in a monospaced font. The first four 
    spaces will be stripped off, but all other whitespace
    will be preserved.
    
    Markdown is turned off in code blocks:
     [This is not a link](http://example.com)

To create not a block, but an inline code span, use backticks:

Here is some inline `code`.

For more help see http://daringfireball.net/projects/markdown/syntax

You may subscribe to this blog by email.