Counting the number of matching characters in two ASCII strings

Suppose that you give me two ASCII strings having the same number of characters. I wish to compute efficiently the number of matching characters (same position, same character). E.g., the strings ‘012c’ and ‘021c’ have two matching characters (‘0’ and ‘c’).

The conventional approach in C would look as follow:

uint64_t standard_matching_bytes(char * c1, char * c2, size_t n) {
    size_t count = 0;
    size_t i = 0;
    for(; i < n; i++) {
        if(c1[i]  == c2[i]) { count++; }
    }
    return count;
}

There is nothing wrong with this code. An optimizing compiler can auto-vectorize this code so that it will do far fewer than one instruction per byte, given long enough strings.

However, it does appear that the routine looks at every character, one by one. So it looks like you are loading two values, then you are comparing and then incrementing a counter, for each character. So it might compile to over 5 instructions per character (prior to auto-vectorization).

What you can do instead is load the data in blocks of 8 bytes, into 64-bit integers as in the following code. Do not be mislead by the apparently expensive memcpy calls: an optimizing compiler will turn these function calls into a single load instruction.

uint64_t matching_bytes(char * c1, char * c2, size_t n) {
    size_t count = 0;
    size_t i = 0;
    uint64_t x, y;
    for(; i + sizeof(uint64_t) <= n; i+= sizeof(uint64_t)) {
      memcpy(&x, c1 + i, sizeof(uint64_t) );
      memcpy(&y, c2 + i, sizeof(uint64_t) );
      count += matching_bytes_in_word(x,y);
    }
    for(; i < n; i++) {
        if(c1[i]  == c2[i]) { count++; }
    }
    return count;
}

So we just need a function that can compare two 64-bit integers and find how many matching bytes there are. Thankfully there are fairly standard techniques to do so such as the following. (I borrowed part of the routine from Wojciech Muła.)

uint64_t matching_bytes_in_word(uint64_t x, uint64_t y) {
  uint64_t xor_xy = x ^ y;
  const uint64_t t0 = (~xor_xy & 0x7f7f7f7f7f7f7f7fllu) + 0x0101010101010101llu;
  const uint64_t t1 = (~xor_xy & 0x8080808080808080llu);
  uint64_t zeros = t0 & t1;
  return ((zeros >> 7) * 0x0101010101010101ULL) >> 56;
}

With this routine, you can bring down the instruction count to about 2 per character, including all the overhead and the data loading. It is strictly better than what you could with character-by-character processing by a factor of two (for long strings).

Though I seem to restrict the problem to ASCII inputs, my code actually counts the number of matching bytes. If you know that the input is ASCII, you can further optimize the routine.

I leave it as an exercise for the reader to write a function that counts the number of matching characters within a range, or to determine whether all characters in a given range match.

The proper way to solve this problem is with SIMD instructions, and most optimizing compilers should do that for you starting from a simple loop. However, if it is not possible and you have relatively long strings, then the approach I described could be beneficial.

My source code is available.

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

14 thoughts on “Counting the number of matching characters in two ASCII strings”

  1. An optimizing compiler will turn these function calls into a single
    load instruction.

    Mmmh, I would not rely on that. The strinsg have to be aligned to a multiple of 8 byte to do so. How should the compiler know? Maybe in cases where this function is inlined (which would require it to be qualified with static visibility) but not on a general basis I guess.

    1. The strinsg have to be aligned to a multiple of 8 byte to do so.

      Compilers are hardware-aware are most hardware you might care about (x64, 64-bit ARM) can load data from any address, aligned or not. You should expect these memcpy function call to be compiled to a single instruction. And it is a fast instruction. See my post Data alignment for speed: myth or reality? for more details.

      1. You are right. I’m in a different domain with mostly 32-Bit ARMs, PPC, MIPS, AVR, where the compiler would generated particular instructons for byte, word and dword access or you would at least get a significant perfomance penalty in the silicon from unaligned access in case the chip is capable of it. Has to be considered and measured for a specific hardware to be really sure here. Thanks for the article. Good to know modern CPUs seem to be so much more merciful here.

        1. You are correct, on some old or small processors, there might be a significant performance penalty to unaligned loads, but even then, you’d expect the compiler to basically get the job done. If the platform lacks an unaligned load instruction, it will have to do some computations to figure out the alignment, do two loads, and fuse the result. It will be harmful.

          But most people do not program for such small processors.

          It used to be that “mobile” computing meant “small and limited”, but even an old iPhone can do unaligned loads and more.

          1. it will have to do some computations to figure out the alignment

            Well, it can’t. That is what I was referring to initially.
            The memcpy implementation will have to respect the aligment dynamically during runtime.

            The addresses of s1 and s2 are not known since the function is exported. LTO may catch this, though. Unsure.

            1. The memcpy implementation will have to respect the aligment dynamically during runtime.

              Well, it cannot know the alignment at compile-time, but it could figure it out at runtime.

              However, thinking twice about the problem, I think that what an optimizing compiler might do is load 8 bytes (using 8 instructions) and then combine them using shifts. It is going to be expensive at least in terms of instruction count.

              What I described at first would not work because it would entails reading more bytes than necessary.

  2. Pardon my ignorance, but couldn’t the memcpy be avoided if we instead type x and y and uint64_t* and assign the address instead of copying the bytes? For example: x = (uint64_t*)c1 + i; Or would this be the same or worse after being put through an optimizing compiler?

    1. couldn’t the memcpy be avoided if we instead type x and y and uint64_t* and assign the address instead of copying the bytes?

      You have to load the data into a register one way or another. A load cannot avoided. Some x64 instructions can take an address as a parameter, but there is effectively a load happening in any case…

      I use memcpy to avoid undefined behaviours. Doing a cast to a wider type is not good C code. You have to use a memcpy or the equivalent (in C, C++).

      Or would this be the same or worse after being put through an optimizing compiler?

      If you are using undefined behaviour, the risk is that the compiler could do something wrong.

      Otherwise, I have no way to think that avoid a memcpy could ever speed things up.

        1. I do not understand the comment.

          Yes, my code above effectively compares bytes so as-is it assumes ASCII. You can rather easily extend it to UTF-16 if you check for surrogate pairs. Matching UTF-8 characters would be a fun challenge. But then matching UTF-8 would be tricky no matter what.

          If you decode to UTF-32 first, then matching the characters fast is the last of your concerns.

  3. Note that you specified ASCII, so the high bit will always be clear, so t1 will always be 0x8080808080808080LL.

    But (1) your version works for Latin1, and (2) it may be too hard to come up with a new name for matching_bytes_in_word. 😉

    (I also don’t know if compilers can convert the last line of that function into a popcnt instruction on their own?)

    1. You are correct. My code is more general than just ASCII. I actually did not try to optimize for ASCII.

      I also don’t know if compilers can convert the last line of that function into a popcnt instruction on their own?

      I am skeptical.

      I only wanted to rely on standard C for this post. You could indeed just call popcnt which, on recent x64 hardware, is probably much cheaper than a shift followed by a multiplication followed by a shift. But I’d be impressed if a compiler could figure that out.

      The trouble is that if I can assume popcnt, I may as well assume SIMD instructions and the whole problem takes a different turn.

  4. return ((zeros >> 7) * 0x0101010101010101ULL) >> 56;

    That one line is quite a gem. I haven’t seen that trick before.

    Thanks for this post Daniel.

  5. zeros in matching_bytes_in_word can be calculated with one instruction less (although it’s unlikely to affect speed much):

    uint64_t matching_bytes_in_word(uint64_t x, uint64_t y) {
    uint64_t xor_xy = x ^ y;
    uint64_t zeros = (((xor_xy >> 1) | 0x8080808080808080) - xor_xy) & 0x8080808080808080;
    return ((zeros >> 7) * 0x0101010101010101ULL) >> 56;
    }

Leave a Reply to Daniel Lemire Cancel reply

Your email address will not be published.

To create code blocks or other preformatted text, indent by four spaces:

    This will be displayed in a monospaced font. The first four 
    spaces will be stripped off, but all other whitespace
    will be preserved.
    
    Markdown is turned off in code blocks:
     [This is not a link](http://example.com)

To create not a block, but an inline code span, use backticks:

Here is some inline `code`.

For more help see http://daringfireball.net/projects/markdown/syntax

You may subscribe to this blog by email.