For case-insensitive string comparisons, avoid char-by-char functions

Sometimes we need to compare strings in a case-insensitive manner. For example, you might want ‘abc’ and ‘ABC’ to be considered. It is a well-defined problem for ASCII strings. In C/C++, there are basically two common approaches. You can do whole string comparisons:

bool isequal = (strncasecmp(string1, string2, N) == 0);

Or you can do character-by-character comparisons, mapping each and every character to a lower-case version and comparing that:

bool isequal{true};
for (size_t i = 0; i < N; i++) {
      if (tolower(string1[i]) != tolower(string2[i])) {
        is_the_same = false;
        break;
      }
}

Intuitively, the second version is worse because it requires more code. We might also expect it to be slower. How much slower? I wrote a quick benchmark to test it out:

strncasecmp Linux/GNU GCC macOS/LLVM
strncasecmp 0.15 ns/byte 1 ns/byte
tolower 4.5 ns/byte 4.0 ns/byte

I got these results with GNU GCC under Linux. And on a different machine running macOS.

So for sizeable strings, the character-by-character approach might be 4 to 40 times slower! Results will vary depending on your standard library and of the time of the day. However, in all my tests, strncasecmp is always substantially faster. (Note: Microsoft provides similar functions under different names, see _strnicmp for example.)

Could you go faster? I tried rolling my own and it runs at about 0.3 ns/byte. So it is faster than the competition under macOS, but slower under Linux. I suspect that the standard library under Linux must rely on a vectorized implementation which might explain how it beats me by a factor of two.

I bet that if we use vectorization, we can beat the standard librairies.

 

My code is available.

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

13 thoughts on “For case-insensitive string comparisons, avoid char-by-char functions”

  1. Long ago, in the hot path of a performance critical application, I found a case-insensitive string compare that worked by calling strdup on both strings, then case-smashing them, then comparing the results, then freeing the strings. The typical use case was iterating through a large list of strings comparing a reference string to each of them as a table lookup.

    At the time, I thought per-character comparison would be better. (There wasn’t, yet, a library function for this in that environment.) I think it might have been, given the allocation overhead. These days, I’d have had the objects store a smashed-case version of their strings and compare those.

    1. Note that my post starts with the ASCII context. None of these approaches will work with UTF-8, I think.

      If you assume ASCII, I am not too sure why you’d need to do memory allocations.

      If you need to support unicode, you have other problems that this blog post does not cover. Obviously.

      1. Indeed, what someone explained to me once is that this task (with unicode) is as a matter of principle impossible. There are ‘situations’ where the only way to know what ‘the’ minor is, is to have more context. Let’s just stick to ascii and English.

      2. The Unicode Comparison Algorithm is anything but — it seems to be a list of problems that an algorithm should solve, but not a prescriptive method.

        One approach that seems relevant here is to bet that characters will be binary equal more often than they are case-insensitive equal; you use a strcmp-type loop to find the common prefix and then do the expensive collation only on the first code points that mismatch. If you’ve bet correctly, the first mismatch will collate into inequality; otherwise you continue (with binary, not collated comparisons).

        That’s a performance optimization that ICU only belatedly applied to strcoll, and the UCA unfortunately says almost nothing about it.

        With strcasecmp, you could try a loop or vector comparison to find the longest common binary prefix and then apply case conversion only to the next character. Performance is entirely dependent on how often you get mixed-case equality, eg “Daniel” vs. “DANIEL” would have one fast and five slow comparisons, while “Daniel” vs. “daniel” would have one slow and five fast.

        In this instance I doubt it would be faster (it’s certainly branchy), but with more expensive collations it seems like the way to go.

  2. glibc on my system uses an AVX specialization if the locale is “compatible with ASCII for single bytes” (basically it seems like that means that it does something reasonable if it just uses the ASCII A-Z:a-z mapping and treats all other values without changing the case.

    It uses a 128-bit wide vector algorithm:

    https://github.com/bminor/glibc/blob/master/sysdeps/x86_64/multiarch/strcmp-sse42.S#L201

    You could get 2x as fast with 256-bits, almost certainly. You don’t need the SSE4.2 stuff to detect the null character.

  3. A major practical issue here is that case-insensitive comparisons involve short strings far more often than long ones, so low constant overhead is usually more important for this function than great length->infinity asymptotic behavior.

      1. I do: the glibc versions won’t exhibit their large advantage for short strings because the vectorization advantage goes to zero (and possibly even becomes negative) as the string length goes to zero.

        1. I have updated the C program so that it benchmarks both small and long strings, and though, obviously, the gap closes on small strings, the strncasecmp function is still faster.

  4. Any intelligent library will use a faster string-matching algorithm like Boyer-Moore or similar, doing far fewer than N compare operations for strings of length N.

    By lowercasing the strings first, you guarantee you do at least N*2 operations even if the strings don’t match right at the beginning!

Leave a Reply

Your email address will not be published. Required fields are marked *

To create code blocks or other preformatted text, indent by four spaces:

    This will be displayed in a monospaced font. The first four 
    spaces will be stripped off, but all other whitespace
    will be preserved.
    
    Markdown is turned off in code blocks:
     [This is not a link](http://example.com)

To create not a block, but an inline code span, use backticks:

Here is some inline `code`.

For more help see http://daringfireball.net/projects/markdown/syntax