Avoid lexicographical comparisons when testing for string equality?

By default, programmers like to compare their bytes and strings using a lexicographical order. “Lexicographical” is a fancy word for “dictionary order”. That is, you compare the first two elements, check if they differ, if they do you report which string is largest, if not you repeat with the next two elements and so forth.

In C and C++, there is a super fast function for this purpose: memcmp. Derrick Stolee reported to me a performance regression in Git (a well-known tool among programmers). The problem has to do with memcmp.

Let us examine the problematic function in Git:

static inline int hashcmp(const unsigned char *sha1, const unsigned char *sha2)
{
	return memcmp(sha1, sha2, the_hash_algo->rawsz);
}

This returns a lexicographical comparison between two hash values, return a negative value when the first is smallest, zero if they are equal, and a positive value otherwise. As it is written, I do not know how to make this faster in general. It seems that we can often assume that the_hash_algo->rawsz will be 20 or 32, but that is not terribly useful.

However, let us look at an instance of how the function is used:

if (!hashcmp(sha1, pdata->objects[pos].idx.oid.hash)) {
	*found = 1;
	return i;
}

Do you see what is happening?

In this particular usage (and in others), we only check whether the two strings of bytes are identical. We do not need a lexicographical comparison.

It can be easier to decide whether two strings of bytes are identical than to compare them lexicographically. Lexicographical sort critically depends on the order of the bytes whereas byte comparisons is order oblivious. Even if you just have 8 bytes to compare lexicographically on an x64 processor, the compiler will need three instructions because it needs to reorder the bytes:

bswap   rcx
bswap   rdx
cmp     rcx, rdx

In contrast, a single instruction (cmp) is needed to determine whether the two 8-byte words are identical. It is even worse than that because x64 processors can compare a register against a memory value, thus potentially saving a load operation.

There used to be a standard C function for this purpose (bcmp) but it has been deprecated, and it is probably not highly optimized.

It is possible that your compiler is smart enough to figure out that checking that the returned value of memcmp is zero is equivalent to checking for equality. And your particular compiler might, indeed, be that smart. It is also possible the the overhead of the lexicographical order is irrelevant. But should you risk it?

So let me write something silly, assuming that we have exactly 20 bytes to compare:

bool memeq20(const char * s1, const char * s2) {
    uint64_t w1, w2;
    memcpy(&w1, s1, sizeof(w1));
    memcpy(&w2, s2, sizeof(w2));
    if(w1 != w2) return false;
    memcpy(&w1, s1 + sizeof(w1), sizeof(w1));
    memcpy(&w2, s2 + sizeof(w2), sizeof(w2));
    if(w1 != w2) return false;
    uint32_t ww1, ww2;
    memcpy(&ww1, s1 + 2 * sizeof(w1), sizeof(ww1));
    memcpy(&ww2, s2 + 2 * sizeof(w1), sizeof(ww2));
    return (ww1 == ww2);
}

That should be safe and portable. I am sure that good hackers can make it faster.

How fast is it already? Quite fast:

memcmp10.5 cycles
hand-rolled memcmp12.8 cycles
bcmp10.5 cycles
check equal only5.2 cycles

My version is twice as fast as memcmp. So while I probably couldn’t roll my own super fast memcmp function in 5 minutes, I certainly can beat memcmp with some basic code if I ask a different question instead: are the two strings of bytes identical?

I am using GCC 5.5. Your results will vary quite a bit depending on the compiler. In some settings, it will not be possible to beat memcmp at all, if the compiler is sufficiently smart. Also, there might be branching involved, so the results will depend on the statistics of your data.

Nate Lawson points out another reason to shy away from unnecessary lexicographical comparison: security. He writes:

The most important concern is if this will encourage unsafe designs. I can’t come up with a crypto design that requires ordering of secret data that isn’t also a terrible idea. Sorting your AES keys? Why? Don’t do that. (…) In any scenario that involves the need for ordering of secret data, much larger architectural issues need to be addressed than a comparison function.

My code is available.

37 thoughts on “Avoid lexicographical comparisons when testing for string equality?”

  1. Interesting. I’d think a naive copy of memcmp would take two operations per word, testing each for (a .lt b) and than if that failed, (a .eq. b), which would make it twice as slow as your memeq in the worst case of identical strings. And indeed, your version appears to be twice as fast.

    But a better memcmp would be to walk down the string and just test test each word for equality, and than only test (a .lt. b) when the equality test failed. In that case, you’d think both would be more or less as fast, as memcmp would only have one more operation then memeq. So I’m kind of surprised by this result.

    1. You are right … “in the long run”. Certainly most memcmp implementations will sequentially check one large word from each input and when they differ do “a bit more” to return the lexicographic result. They do not do the naive thing of comparing < and > for each pair.

      It’s not necessarily just “one more operation” though, especially on little-endian machines: if you know that the two inputs differ for some eight byte word, how do you return the right 3-way result? The easy way is to subtract, but this gives the right answer only on big-endian. On little endian you can swap the byte order and then subtract, but 64-bit bswap is not super fast even on recent Intel.

      There are perhaps other tricks possible, but current compilers prefer bswap.

      Vectorization of memcmp is another challenge: if you compare a 256-bit “word” and find it unequal, you can’t subtract them since there are no 256-bit wide subtractions. So again it’s more complex. Asymptotically they are the same though.

      Finally, memeq gives you another advantage: you could check the bytes in any order you want. One could imagine an adaptive or application-specific memeq that knows strings usually differ at the start or the end so checks them in an order that takes advantage of that. memcmp can’t.

      That said, the tests here show these effects: the memcmp slowdown shown in the numbers above is more or less entirely because gcc decided to call into libc memcpy which is much slower.

      1. Finding the distinguishing bit in little-endian int’s is about 3 instructions:

        z = x^y
        bool x_is_greater = x & (z ^ z - 1)

        1. I don’t think it works.

          z ^ (z – 1) sets all bits up to the lowest set bit, so the whole thing returns “does x have any bits set from bit 0 through the lowest bit that differs with y”? which is clearly wrong. As an example, x = 1, y = 2:

          z = 3
          z ^ (z-1) = 1
          x & (z ^ (z-1)) = 1

          So you get the wrong answer (x is not greater in the only byte x and y differ).

          Maybe you wanted something like (z & -z) which isolates the lowest set bit, then you can do your & to see which of x or y had the differing bit set. It still doesn’t work: you want to look at the top bit in the differing byte.

          That’s the problem with all the bit-tricks: there are two totally different directions you need to look: first you need to scan from the LSB(yte) to the MSB to find the lowest differing byte, but then you need to scan in the opposite direction within that byte from MSB(bit) to LSB(it) to find the distinguishing bit.

          Since most bit tricks don’t do anything special wrt bytes, this is hard, and why bswap is used. You can probably still do it, maybe along the line of the strlen bithack that finds a zero byte, but I don’t think it will be 3 instructions.

          It’s a shame bswap is 2 uops for 64-bit values on Intel, otherwise the bswap approach would be just 3 uops (2 bswap, subtract). On Zen, it’s one op and you can do 4 per cycle! Even better on Zen you can do MOVBE from memory in a single op, versus 3 on Intel.

              1. Yes that’s the problem. Note that even ignoring that issue, the original code didn’t return a meaningful result: it returns true for aost any two values. It returns true for 1,3 and it also returns true for 3,1!

          1. Yup, I didn’t see a better alternative to bswap when I looked into this a few months ago, either. Just make sure to only bswap after you’ve found the unequal words; the standard library implementation I saw made the mistake of bswapping before the equality check.

            1. Yeah I saw that too. On modern Intel it’s definitely better to do the bswap in the finishing up code, but one could argue that on AMD Zen you could do two MOVBE every iteration for free in place of regular mov, since it is only only uop so about the same cost as a plain load “in principle” (I haven’t tested it). Hopefully Intel will up their BSWAP game in the future.

              A big limitation of bswap is that it doesn’t obviously place nice with vectorised comparison loops, since the differing word is greater than 64-bits. Some movmaskb and trailing zero count can give you the different byte and you can load it alone and you can do the subtraction. Reasonable throughput but bad latency. I guess you could do the comparison in the vector domain, too.

  2. Coincidentally I corresponded with Derrick on another possible optimization at the beginning of the year which ran into similar issues with memcmp.

    I tried doing git index search with Manber and Myers’s String Binary Search algorithm, but it turned out that whatever byte comparisons we saved by skipping the longest common prefix were eaten up by using bytewise comparison, since we couldn’t use memcmp to get LCP.

    My recollection is that memcmp is heavily tuned, with versions for each length if known at compile time.

  3. The poor performance of memcmp here has more or less nothing to do with “lexicographic comparison” and almost everything to do with the fact that memcmp doens’t get inline so (a) has to do a function call through the PLT and then has to (b) implement a totally generic algorithm without the benefit of knowing that the string has a fixed length.

    That’s apples and oranges!

    Lexicographic comparison does have some cost, but it is quite small and, importantly is a fixed cost per call: the guts of a typical memcmp does an search for non-equal bytes and then subtracts the non-equal bytes to get the required ternary result. An equality-only memcmp is nearly exactly the same.

    1. Also in this case, a fast path optimization of comparing first word of the hash values (?) and going to slow path only if they match would skip a lot of potential generic logic inside memcmp.

    2. I think reasonably up-to-date glibc source for amd64 generic memcmp can be found at Sourceware repo. By looking at that code it is pretty obvious that generic memcmp is pretty slow for many inputs. For instance deciding that buffers shorter than 32 bytes differ on the first byte take 19 assembler instructions to execute, and more than 32 bytes requires more even on the best case! Good statically generated fast path (which compilers at godbolt.org don’t really seem to generate that eagerly) would be a well-behaving three-instruction sequence…

      1. The trick is to ask for memcmp(..,20) == 0 — it will generate wordwise compares inline instead of looking for the distinguishing prefix.

        Unfortunately I don’t know how this relates to the code you linked — does that hidden builtin macro help with the inlining, or is there some other code generation going on (in gcc)?

        1. At least on gcc, this doesn’t seem to be true. All the gcc versions from about 4.6 onwards I checked on x86 seem to always generate an actual call to memcmp (and before 4.6 they did inline code but did a bad job of it, even with static lengths and when comparing to zero – they always used rep cmp type stuff rather than just a few explicit loads).

          Perhaps this is as a consequence of this bug where someone claims “library memcmp is really fast, just always call the function” where this is plainly not true for small fixed lengths. I’m kind of surprised gcc fails badly here: it does much better on inlining fixed length memcpy for example (and Daniel’s code relies on that by doing several 8-byte memcpy that just get transformed into single 64-bit loads by the compiler).

          clang does better: if I compile the benchmark with clang 5.0, the results reverse: I get that memcmp is faster than memeq20. Note that this test is a bit specific in that the compared strings are randomly generated, so 99.6%+ of the time they will differ at the first byte, so it is really testing the fast path how the first byte or word or whatever is handled.

          The clang generated code isn’t all that great either, but it gets some things right: if you compare against 0, it removes both the special final logic to calculate the ternary result, and also bswap calls in the loop relating to the ternary result (but the bswap calls arguably shouldn’t have been in the loop in the first place).

          1. Two nicely aligned fixed-size buffers in a struct and a fixed comparison length of 64 bytes resulted rather interesting code: it started for gcc with two 64-bit reads from both buffers, XORing corresponding pairs and testing those results with OR. Well, that’s correct and reasonably efficient start (unless you use AVX, which clang would use if architecture is specified correctly), but in the case of hashes that usually differ, just comparing leading words – or probably even bytes, to avoid misalignment penalties – would have been faster.

            So, for specific use cases, hand-written comparison routine beats those reasonably good compiler-generated variants. It’s hard to tell to the compiler that bits on buffers are random and different 99% of the time, but 1% of the time probably all of them match between them! (If this would be the opposite, it would probably be beneficial to write a XOR-OR chain without any conditional branches, at least for reasonably short inputs.)

            1. Is the code you wrote for memcmp or bcmp though? That is, does it return a 3-way or 2-way (boolean) result? How did you do the 3-way part if three way?

              I agree that in this test the first word will almost always differ (always if its 64-bits) so the key is make the fastest possible code for that case. For clang, this means that comparing 24 bytes is actually faster than 16, since doing the “odd” word before vectorizing happens to line up with the fast path you’d want. See my comment here.

                1. That’s really interesting. It seems like I was (somewhat) incorrectly slandering gcc’s ability to inline constant length memcmp() as your example shows.

                  Luckily struct isn’t necessary: it works for plain pointers too. The key seems to be that the result has to be used non-lexicographically: if the calling code differentiates between the < 0 and > 0 cases, it will call memcmp. So this reinforces the idea of this blog post in a way: but partly due to a quirk of gcc: it also implies that the original code that saw the slowdown probably would have been fine on a more recent version of gcc.

                  clang behaves somewhat differently – it can either either style, but with quite different code: it vectorizes the boolean versions, and uses scalar code with movbe for when the full ternary result is used. The clang code seems suboptimal: it could just use plain moves and do the bswap in the tail part, and it should just subtract the differing words rather than the conditional stuff it is doing.

                  1. Note that this is exactly what KWillets said originally, so my objection was out of line. It does work when you compare memcmp against zero, at least for most recent gcc versions (seems like the inlining started around 7.0 or 7.1).

        2. To clarify, the code that foobar linked is an assembly code (.S) file that is used to generate the library version of the memcmp call. It isn’t used by the compiler as a “builtin” or anything to inline memcmp.

            1. So the way gcc (and maybe clang) handles this is specifically by recognizing memcmp and checking whether a only a 2-way result is needed and then essentially replacing it with a memcmp_eq call.

              So it’s hardcoded with knowledge of the semantics of memcmp. This raises the interesting question of if you wrote a function like memcmp, whether the generic optimizer could do this same optimization – this is a harder problem since it has to recognize that the output of your memcmp function is compared to zero and eliminate code that differentiates between < 0 and > 0.

              Here is a test.

              There are three variants of memcmp{0,1,2} which vary only in the return statement when a difference is found.

              memcmp0 is the normal way you’d write it: simply bswaps the mismatched words, and subtracts them.

              memcmp1 is the same, but it adds a __builtin_unreachable “assertion” that ret is non-zero (we know this is true, since if a != b, bswap(a) != bswap(b), but maybe the compiler needs help).

              memcmp2 returns (bswap(a) - bswap(b)) | 1 which doesn’t change the sign of the answer, but is another way to help compiler know the answer is non-zero.

              Then we call each of these functions in my_bcmp{0,1,2} which returns a bool rather than int, and see what happens.

              The score was gcc 2, clang 1, MSVC 0 (it is hopeless here: it doesn’t even inline bswap). Nobody could optimize the memcmp0 case: they all left the bswap calls in and compared the result. gcc was able to optimize either of the other two cases, and clang only memcmp2 | 1. When the optimization worked, it could generate exactly or almost exactly the same code as the explicit bcmp code. Cool.

              Note that the | 1 version isn’t free: when you actually use the 3-way result, you get a useless extra or instruction in the result path.

              1. The optimization that allows is called “range analysis” or something like that: where the compiler keeps track of what it knows about a variable, such as what bits are set, where it is definitely zero or non-zero, or the possible range of values.

                It’s the bswap here that “breaks” range analysis from working in the plain memcmp0 case: the compiler knows that left != right, and it knows that implies left - right != 0, but it doesn’t know that it equally implies bswap(left) - bswap(right) != 0. If you take out the bswap, both gcc and clang are able to optimize my_bcmp0.

                Evidently neither compiler has a

    3. An equality-only memcmp is nearly exactly the same.

      I think that my tests, and yours, show that there is a difference, at least as far as some compilers and data distributions are concerned. An equality comparison is an easier problem (for the compiler, for the programmer, for the hardware).

      The difference might be small, for some definition of small which goes down to zero in some cases. But a small difference multiplied by enough cases ends up making a difference.

      To prove me wrong, someone simply has to propose a piece of lexicographical comparison C code that consistently beats an optimized equality comparison, across a wide range of compilers and data distribution. If someone has that, then I will gladly retract my blog post.

      1. I think that my tests, … show that there is a difference, at least
        as far as some compilers and data distributions are concerned.

        Not at all. They show that gcc is deciding to call library memcmp through the PLT boundary and that a totally generic library call that doesn’t know the fixed length length of the comparison does terribly against an inlined function written for a hardcoded length.

        If you flip the situation around, and write an inlined lexicographic compare and call a library generic memeq function (bcmp could do) the results will approximate reverse themselves.

        Yes, lexicographic comparison is somewhat more difficult than plain equality comparison, but this test doesn’t show that at all except by coincidence.

        To prove me wrong, someone simply has to propose a piece of
        lexicographical comparison C code that consistently beats an optimized
        equality comparison, across a wide range of compilers and data
        distribution. If someone has that, then I will gladly retract my blog
        post.

        This is also wrong.

        No one that I can see is claiming that lexicographic comparison is faster than plain equality comparison (outside of implementation quality), and it is fairly obvious that if you had a super fast lexicographic comparison you can turn it a super fast equality comparison for with approximately zero work, so lexicographic comparison is never really going to be faster.

        Your claim here is that equality search is faster than lexicographic search when only equality is needed. The negation of that is that equality search is slower or equal to lexicographic search. We can toss out the slower part, but all someone has to do to prove you wrong is hit the bar for “equals”.

        You might not get to exactly equals, and the data distribution really matters here, but it will be close and much closer than your misleading test suggests.

        There is also the little issue of the interface between the 3-way result and the ultimate use as a 2-way result: clearly if you needed a 3-way result in the first place, there is no comparison: you need to use a 3-way function like memcmp. The motivating example, however, only needed a 2-way result and that was expressed in the source as !memcmp() so that information is available to the compiler, and indeed better compilers than the one git was using at the time can use this info to produce better code.

        Your test stores all the results into an array and actually uses the 3-wayness of the result so that will never happen in this case, but it isn’t clear if that a more likely scenario than memcmp() == 0 type scenarios.

        IMO when an error is pointed out in the test methodology, you should correct it rather than demanding that someone else write the code that perhaps you should have written in the first place!

  4. Why does your memeq20 function copy the data and do the comparison on the copies? You aren’t modifying the data, so why not just compare it in place?


    uint64_t * w1 = (uint64_t )s1;
    uint64_t * w2 = (uint64_t *)s2;
    bool answer1 = (w1[0] == w2[0]);
    bool answer2 = (w1[1] == w2[1]);
    uint32_t * ww1 = (uint32_t *)(s1 + sizeof(uint64_t) * 2);
    uint32_t * ww2 = (uint32_t *)(s2 + sizeof(uint64_t) * 2);
    bool answer3 = (
    ww1 == *ww2);

    Or something like that.

  5. I was wondering why you were using memcpy instead of a simple pointer arithmetic. With pointer arithmetic, you can get the benefit simpler code and the possibility to get a small speedup if checking previous answer.

    1. To make the code legal C. Although it works in practice, it is not legal to access an array of one type (char *) with a pointer to another type (uint64_t *), except in special circumstances.

      Yes, this is fine in assembly, but C is not assembly (although in practice current compilers often generates the assembly you expect – but who knows about tomorrow!).

      It doesn’t cause a problem here beacuse the memcpy is transformed back into a plain load by most compilers.

Leave a Reply

Your email address will not be published. Required fields are marked *

To create code blocks or other preformatted text, indent by four spaces:

    This will be displayed in a monospaced font. The first four 
    spaces will be stripped off, but all other whitespace
    will be preserved.
    
    Markdown is turned off in code blocks:
     [This is not a link](http://example.com)

To create not a block, but an inline code span, use backticks:

Here is some inline `code`.

For more help see http://daringfireball.net/projects/markdown/syntax