Sorting strings properly is stupidly hard

Programming languages make it hard to sort arrays properly. Look at how JavaScript sorts arrays of integers:

> v = [1,3,2,10]
[ 1, 3, 2, 10 ]
> v.sort()
[ 1, 10, 2, 3 ]

You need a magical incantation to get the right result:

> v.sort((a,b)=>a-b)
[ 1, 2, 3, 10 ]
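Why does the plain v.sort() misbehave? JavaScript's default comparator converts the elements to strings before comparing them. The same effect can be simulated in Python (used here only for illustration) by sorting numbers by their string form:

```python
# JavaScript's default sort compares elements as strings; sorting the numbers
# by their string representation reproduces the surprising order.
v = [1, 3, 2, 10]
as_strings = sorted(v, key=str)
print(as_strings)  # [1, 10, 2, 3] because "10" < "2" lexicographically
```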

Though this bad default can create bugs, it is probably not the source of much frustration. However, sorting strings alphabetically is a real problem. Let us see how various languages handle it.

JavaScript:

> var v = ["e", "a", "é","f"]
> v.sort()
[ 'a', 'e', 'f', 'é' ]
> v = ["a","b","A","B"]
> v.sort()
[ 'A', 'B', 'a', 'b' ]

Python:

>>> x=["e","a","é","f"]
>>> x.sort()
>>> x
['a', 'e', 'f', 'é']
>>> x=["a","A","b","B"]
>>> x.sort()
>>> x
['A', 'B', 'a', 'b']

Swift:

  1> var x = ["e","a","é","f"] 
  2> x.sorted()
$R0: [String] = 4 values {
  [0] = "a"
  [1] = "e"
  [2] = "f"
  [3] = "é"
}
  1> var x = ["a","b","A","B"]
  2> x.sorted() 
$R1: [String] = 4 values {
  [0] = "A"
  [1] = "B"
  [2] = "a"
  [3] = "b"
}

As far as I can tell, these languages apply crude code-point sorting by default. Human beings understand that the characters e, é, E, É, è, ê, and so forth, should be considered as the same letter (e) with accents. There are exceptions to this rule, but a default that sorts accented characters after the letter ‘z’ is just not reasonable. The way case is handled is patently stupid. You might prefer A to come before a, or vice versa, but no human being would ever sort the letters as A,B,a,b or a,b,A,B.
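To see what “code-point sorting” means concretely, here is a small Python illustration (Python is used only because its ord function makes the code points visible):

```python
# The default order is plain Unicode code-point order. 'é' (U+00E9) has a
# larger code point than 'z' (U+007A), and every uppercase ASCII letter
# precedes every lowercase one, which explains both oddities above.
print(ord("z"), ord("é"))   # 122 233 -> 'é' sorts after 'z'
print(ord("Z"), ord("a"))   # 90 97  -> 'B' < 'a', hence A, B, a, b
```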

There are standards for sorting strings, such as the Unicode Collation Algorithm.

To get a sensible result, programming languages force you to write more complicated code. In JavaScript, it is burdensome but easy enough…

> v.sort(Intl.Collator().compare)
[ 'a', 'e', 'é', 'f' ]

However, I am not sure what the equivalent is in Python and Swift. It does not jump out at me in the documentation of the respective standard libraries. I did not even look into other popular programming languages like Go, C++, and so forth.
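For Python, one plausible route is the standard locale module, though it depends on the platform's locale data. A sketch, where the locale name is a platform-specific assumption:

```python
import locale

# Sketch only: locale names are platform specific (e.g. "fr_FR.UTF-8" on
# Linux/macOS, "French_France" on Windows), and the quality of the system's
# collation data varies.
try:
    locale.setlocale(locale.LC_COLLATE, "en_US.UTF-8")
except locale.Error:
    locale.setlocale(locale.LC_COLLATE, "")  # fall back to the environment locale

v = ["e", "a", "é", "f"]
result = sorted(v, key=locale.strxfrm)
print(result)  # ideally ['a', 'e', 'é', 'f'], depending on the platform's data
```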

It is unacceptably difficult to do the “right thing”. The net result is that many programmers do not sort strings properly. If you use a natural language with non-English characters, you see the effect in many applications. It looks bad.

Thankfully, most major software products get it right. Microsoft Office, Google Docs, Apple apps… all do the right thing. The problem crops up in small-budget applications. I have to use one in my daily life as an employee of a public university, and it annoys me.

We should do better.

Further reading: International Components for Unicode

Published by Daniel Lemire, a computer science professor at the University of Quebec (TELUQ).

39 thoughts on “Sorting strings properly is stupidly hard”

    1. IMHO “natural sort” solves a slightly different problem: the problem of sorting text that contains numbers. For example, “1000 apples” should be sorted after “200 apples”. I see the library you refer to supports both numbers and locale-specific sorting. Another related problem is sorting date values such as “Mar/29/2018” correctly. For log files, the easiest solution is to use the format “2018-03-29” (year, month, day), which avoids the problem. Trying to “detect” what is meant is quite hard.

  1. Here it is in R (in my default en_NZ.UTF-8 locale)

    > v<-c("e","a","é","f")
    > sort(v)
    [1] "a" "e" "é" "f"
    > v<-c("a","b","A","B")
    > sort(v)
    [1] "a" "A" "b" "B"

    I note that some languages do sort accented vowels at the end, eg Ø and Å in Danish. And Estonian not only sorts accented vowels to nearly the end of the alphabet, but also puts U after Z.

    As you say, it’s hard.

  2. The problem is, sorting letters and words depends on language. Until 2006 Swedish V letter was equivalent to W. Å, Ä, and Ö are at the end of the alphabet.

    Sorting Chinese is hard – some sort the characters by the number of brush strokes it’s needed to paint it.

      1. While you may be aware of this, it is worth pointing out that the Unicode Collation Algorithm is not an algorithm that takes two strings s1 and s2 and orders them. It takes s1, s2, and a locale, and orders s1 and s2 according to that locale.

        So the UCA is in no way a drop-in replacement for a plain s1 < s2 comparison, because it raises the complicated question: “what locale do you want to do this sort in?”

        I 100% agree that languages should provide locale sensitive sorts, and I would be surprised if most mainstream, modern ones don’t – but it’s hard to argue it should be the default.

        There is a “default” collation order available for UCA, but using this is unlikely to make everyone happy (the concept of a “default” locale is likely to raise some hackles) and you’ll get a slowdown for no good reason – still, using the default UCA locale as the language-level default sort is a better option than using the implicit environment locale, although worse than “binary”, IMO.

        1. I 100% agree that languages should provide locale sensitive sorts, and I would be surprised if most mainstream, modern ones don’t

          Have a look at the answers I got so far:

          1. “Go has x/text which may be included in the standard library (…)” Not exactly encouraging, is it?
          2. Regarding Python, a commenter pointed me to a non-standard library (natsort), and there are questions as to whether it solves the right problem. The locale package seems like a more concrete answer, but it fails on my machine (as the commenter anticipated) and requires non-obvious (to me) invocations like locale.setlocale(locale.LC_ALL, 'fr_FR'). Granted, programming is hard, but some things are more obvious than others.
          3. For Swift, one commenter refers to Objective-C. But the documentation regarding strings in the current Swift does not say anything about collation. The string class has no helpful attribute that I can spot.

          I submit to you that a tricky technical interview question would be “how do you sort unicode strings in Python”. Then “how do you sort unicode strings in Swift”. Even with access to the Internet, I suspect that even good engineers would fail to produce a working solution.

          This should be clearly documented and accessible. It is not.

          I don’t expect C to provide this sort of thing. One commenter points out that C++ does. I think Java and C# do. For Swift 4, the answer is unknown to me at this point. For Go, the answer seems to be “no, the language does not support it”. But even when languages do support it, why can’t it just be something straightforward and documented, like this: array.sort(ThisLocale.comparator)?

          1. Huh, the situation is worse than I expected (I don’t use those languages regularly, or at least I don’t care about collation order when I use them).

            That said, it seems at least Go and Swift do support this, even if the documentation is poor. There is an example for Swift in a Swedish locale, and as far as x/text in Go goes, it seems to me that the x/ libraries are indeed “part of Go” but not part of “Go core development”. That is, the x/ classification means a different development process and different compatibility requirements, but they are still “included”. Kind of weird for something as basic as collation.

            One reason might be the ubiquity of ICU. As far as I know, this is the widely available “go-to” library for full-on collation, and the one I’ve ended up using in any serious setting. So maybe people prefer to just use the consistent ICU library rather than language facilities for this?

            On Python, the answer seems to be the “locale” package, as you point out, but it seems it is not cross-platform (i.e., the locales are named differently on Windows and Linux). I guess it would at least work for the “implicit locale” case, though. Maybe there is yet another way to get standard locale names? Seems messy.

            1. The Swift example you offer is for old Swift versions. I am sure it is possible to do with the current Swift version, but the approach described there does not work.

              That is, it is not that Swift does not support it, it is that it is stupidly hard to find how to do it.

              1. Based on my search, I agree with you. It is poorly documented, and the Swift community is apparently not big enough, or doesn’t care enough (or isn’t indexed well enough with enough SEO), to provide an easy-to-find answer for newer versions.

  3. With Apple’s Objective-C (in the Foundation framework I guess) you’d use -localizedCompare: with NSArray’s -sortedArrayUsingSelector: (or -sortUsingSelector: for in-place sort of mutable arrays).

    The comparator uses the user’s current locale. There’s a bunch of string comparison methods that allow you to use specific locales and other options.

    I believe there are corresponding APIs in Swift.

        1. Your code sample assumes that the String struct in Swift has a method called localizedCompare. It does not:

          https://developer.apple.com/documentation/swift/string

          However, I have finally figured out how to do it:

          import Foundation

          ["a","e","é","f"].sorted {($0 as NSString).localizedCompare($1) == .orderedAscending}

          The trick is to “cast” the standard Swift strings to NSString. The standard Swift strings do not support localizedCompare.

          To be clear, this means that localized comparison is not part of the Swift standard library, you need “Foundation” and you need to cast your strings over to Foundation’s NSString. Importantly, this means that if you search the standard library, you will not find out how to do localized comparison.

          Thus localized comparisons are clearly viewed as outside of the core functionality of the language.

          I think that’s wrong.

          1. You should not have to bridge String to NSString in order to use localized comparison. As a special rule, methods on NSString magically become available on String when you import Foundation. This is because Foundation is a core library, and its string features are essential to the language.

  4. Go has x/text which may be included in the standard library in the future.
    That includes collation for many languages. I believe that the language.Und (default language) uses the default collator.

    package uca

    import (
        "testing"

        "golang.org/x/text/collate"
        "golang.org/x/text/language"
    )

    func TestUCASort(t *testing.T) {
        v := []string{"e", "a", "é", "f"}
        c := collate.New(language.Und)
        c.SortStrings(v) // [a e é f]
        t.Log(v)
        v = []string{"a", "b", "A", "B"}
        c.SortStrings(v) // [a A b B]
        t.Log(v)
    }

    Additionally, the Python standard library uses the locale module for localization and string collation.

    import locale

    locale.setlocale(locale.LC_ALL, '')
    x = ["e", "a", "é", "f"]
    x.sort(key=locale.strxfrm)
    print(x)

    However, I tried to test the sorting with the fr_CA locale and got an incorrect answer, which I found out was due to incorrect locale settings on Mac OS X/BSD. On my machine, the fr_FR.UTF-8 collation is linked to la_LN.US-ASCII:

    ls -l /usr/share/locale/fr_FR.UTF-8/
    lrwxr-xr-x 1 root wheel 28 Oct 31 10:37 LC_COLLATE -> ../la_LN.US-ASCII/LC_COLLATE
    lrwxr-xr-x 1 root wheel 17 Oct 31 10:37 LC_CTYPE -> ../UTF-8/LC_CTYPE
    drwxr-xr-x 3 root wheel 96 Oct 31 10:37 LC_MESSAGES
    lrwxr-xr-x 1 root wheel 30 Oct 31 10:37 LC_MONETARY -> ../fr_FR.ISO8859-1/LC_MONETARY
    lrwxr-xr-x 1 root wheel 29 Oct 31 10:37 LC_NUMERIC -> ../fr_FR.ISO8859-1/LC_NUMERIC
    -r--r--r-- 1 root wheel 364 Aug 17 15:58 LC_TIME

    So the issue with string sorting goes beyond programming languages to properly handling collation in the OS as well. Sorting strings is hard…

    1. Isn’t x/text already part of the standard library? At least, aren’t the x/ packages included with Go but simply “developed apart from Go core” and having some different compatibility requirements?

      Said another way, if I “install Go” does it come with x/text?

      (in case it’s not clear, I am not very familiar with Go)

        1. Well, I guess the best that can be said for Go, then, is that there is support for locale-aware collation “in the Go project” but not as part of the standard library. Maybe one day? At least they have the excuse of being a relatively young language (but I don’t really buy it, because being young also puts them in the “modern” category, which means they should have gotten this stuff right from day one). Perhaps the excuse is that Go is more minimalist than other languages.

          There is also an argument to be made for how much belongs in the stdlib and how much is left to outside libraries – but I would personally think that basic locale support, including collation should be in the stdlib.

          1. There is also an argument to be made for how much belongs in the stdlib and how much is left to outside libraries – but I would personally think that basic locale support, including collation should be in the stdlib.

            Sorting, comparing, and hashing strings in a natural-language way definitely belongs in the standard library. I can excuse C, for obvious reasons… but other languages should support this very well.

            I could live with a language that exports this to a well made external library, but it should be clearly documented and easy to find.

            It is the kind of thing that is almost impossible to build on your own but that everyone needs to get right from time to time. And when you do need to do it, you should not have to spend days on it.

            Note that the default sort does not even work for ASCII-only English. And English does use accents!

            1. Note that the default sort does not even work for ascii-only English.
              And English does use accents!

              Well, the crux of my original argument was that it does “work”, because using binary collation is totally reasonable for the “default sort” (i.e., also for default comparison, equality, and hashing).

              So there are two separate points of contention here:

              1) Is the default comparison (equality, hashing, etc) behavior of programming languages reasonable? (eg that AaBb is sorted as ABab)

              2) When locale-aware sorting is wanted, is it available and documented?

              I’m willing to concede that the answer certainly appears to be “no” for (2), at least for many mainstream languages, and I was wrong here, since I had assumed, based on my experience with other languages, that everyone was getting this right today.

              I don’t agree with you on (1) at all though! You appear to be arguing that in the absence of any global agreement on how to collate, the languages should at least try to get some bits right for English and I guess French speakers and also perhaps indirectly for languages that share some of those conventions, and perhaps also for other languages whose character sets don’t overlap?

              I don’t agree: there is no good default order, so at this point you might as well just use whatever lexicographic sort order is fastest which is what most languages do. Maybe there is an argument for making it “obviously wrong” for human consumption, by doing some xor on the letters or something, but that seems a bit extreme!

              I find this reasoning confusing:

              Human beings understand that the characters e, é, E, É, è, ê, and so
              forth, should be considered as the same letter (e) with accents. There
              are exceptions to this rule, but the default which consists in sorting
              accentuated characters after the letter ‘z’ is just not reasonable.

              Actually, human beings do not agree on that! You mention “there are exceptions”, but the exceptions are not some weird letters that everyone agrees should be sorted differently; rather, people do not even agree on how the same letters should be sorted. Just take a look at this list.

              You mention accents as an obvious case, but a quick scan of that list shows it may not be so obvious at all. There seem to be at least four different strategies that I can pick out: treating a character with a diacritic the same as the unadorned letter (except for tie-breaking?), treating it as coming after the unadorned letter, putting it somewhere else in the alphabet entirely (usually after z for Latin scripts), or the “French way”, which is apparently to treat accented characters the same as unadorned ones, but to tie-break on the accents starting from the right end of the string and working backwards (this is news to me, and I have spent my fair share of time looking up words in a French dictionary).
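That “French way” can be sketched as a toy Python key function; this is only an illustration of backward accent tie-breaking, not the real Unicode Collation Algorithm:

```python
import unicodedata

# Toy illustration: compare the unaccented letters first, then tie-break on
# the accents read from the END of the word, as in a traditional French
# dictionary. Real collation (UCA) is far more involved.
def french_key(word):
    bases, accents = [], []
    for ch in word:
        decomposed = unicodedata.normalize("NFD", ch)
        bases.append(decomposed[0])     # base letter
        accents.append(decomposed[1:])  # combining marks, "" if none
    return ("".join(bases), tuple(reversed(accents)))

words = ["côté", "cote", "coté", "côte"]
ordered = sorted(words, key=french_key)
print(ordered)  # ['cote', 'côte', 'coté', 'côté']
```

It reproduces the classic dictionary order cote < côte < coté < côté, which a forward (left-to-right) accent tie-break would not.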

              You do not want this mess in your default string ordering (where it can also infect equality, hashing, etc)!

              You also do not want to be the one choosing between German and French ways of collating, or Russian and Ukrainian, or any other list of hard choices for your “best-effort global collation to rule them all”. You do not want to be the one who got it working “alright” for French but left out every Asian script, etc.

              Finally, and something I forgot before: proper collation breaks all sorts of relationships you would expect of a “normal” lexicographic sort. For example, if for two strings a, b you have a < b (and a is not a prefix of b), you would expect the relation to remain true if you append additional (perhaps identical) characters to a and b, since after all that is how lexicographic comparison works. Proper collation does not follow this rule, nor many similar expected relationships (at least it is transitive, when properly implemented, or all hell would really break loose).
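The property being described can be checked directly for plain comparison (Python used only for illustration):

```python
# Plain lexicographic (code-point) comparison is decided at the first
# differing position, so appending suffixes can never flip the order:
a, b = "ab", "ac"
assert a < b
assert a + "zzz" < b + "aaa"  # 'b' < 'c' still decides
# Locale-aware collation offers no such guarantee: accents, case, and even
# punctuation can be weighed at different levels across the whole string.
```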

  5. I disagree.

    While it’s true that “default” sorts don’t make sense in a lot of ways, there is no single sort that makes sense for everyone, because different languages have different ways of sorting.

    So while you say:

    You might prefer A to come before a, or vice versa, but no human being would ever sort the letters as A,B,a,b or a,b,A,B.

    This claim may not be true, although a few minutes of Googling didn’t turn up a counter-example; most other “obvious” orderings, however, are broken by one language or another. Even if it is true for A and B, it is definitely not true if you replace A and B with other letters that you think sort in an obvious way, because someone from another culture has the opposite opinion.

    So for things displayed to people, you need locale-aware collation. This isn’t a single collation algorithm but a family, one per locale, and it is a well-studied problem solved, among other ways, by the “Unicode Collation Algorithm”.

    So then the question is: should the default sort in a language try to be locale-aware, implicitly pulling in a locale from “somewhere” and using that? I’d strongly argue “no”. For one thing, there is a huge advantage to having library functions that behave identically on every platform and system, and pulling the widely varying locale implicitly into core comparison functions breaks this.

    Second, it’s slow – and the slowness may vary by locale.

    Third, a probably overwhelming amount of the time the purpose of a sort is simply to put some strings in an ordered collection or something like that: not for end-user display. So you would be optimizing for an uncommon case.

    Fourth, if you make the sort locale-aware, do you also make equality locale-aware? In general, you would like a <= b && b <= a to imply a == b, which means your equality should be locale-aware too. But that raises all the same issues as above for equality, and in that case they argue even more strongly for “binary compare”, I think.
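To make the fourth point concrete, here is a minimal Python sketch (the comparator compare_ci is hypothetical) showing how a case-insensitive comparator yields “ties” between strings that are not equal:

```python
# A case-insensitive, "locale-ish" comparator breaks the expectation that
# a <= b and b <= a together imply a == b:
def compare_ci(a, b):
    ka, kb = a.casefold(), b.casefold()
    return (ka > kb) - (ka < kb)  # -1, 0, or 1, like a classic comparator

tie = compare_ci("A", "a")
print(tie)          # 0: neither string orders before the other...
print("A" == "a")   # False: ...yet they are not equal
```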

    Fifth, even when locale-awareness is the goal, just pulling whatever locale the process is running under is often the wrong choice. If it’s a web server, you don’t care about the locale of the running machine, but about that of the remote user, and so on. So locale-awareness isn’t something that can be swept under the covers in many cases. It’s something you have to deal with explicitly, so it is more or less equally hard whether or not sort implicitly takes locale into consideration.

    So in my opinion, the ideal scenario is that the language provides “binary” equality and comparison operators by default, and robust locale-sensitive methods as well with configurable locale. That’s exactly what most languages do.

    1. Perhaps I should amend:

      That’s exactly what most languages do.

      to

      That’s exactly what some languages do, and the rest should step up their game.

    2. “Third, a probably overwhelming amount of the time the purpose of a sort is simply to put some strings in an ordered collection or something like that: not for end-user display. So you would be optimizing for an uncommon case.”

      I agree with most of your points, but this one is tricky.
      You have two pools of clients. One is the guys writing linkers and parsers and suchlike, with the set of concerns you’re listing (“I don’t care about the precise details, but I want fast and stable sorts”); the other is developers writing user-facing code who just want the right thing to happen.

      Probably the fault is language minimalism that tries to force two very different functions into the same keyword/library function.
      Perhaps languages should just bite the bullet and accept the necessity for both
      ByteSort(string) and TextSort(string, locale[=by default from OS])
      ?
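A rough Python sketch of that proposed split (the names byte_sort and text_sort are hypothetical, echoing the comment's ByteSort/TextSort):

```python
# Hypothetical two-function API: byte_sort is the fast, locale-free default;
# text_sort requires the caller to supply a collation key explicitly.
def byte_sort(strings):
    return sorted(strings)  # plain code-point order, platform independent

def text_sort(strings, collation_key):
    return sorted(strings, key=collation_key)  # caller chooses the locale rules

mixed = ["a", "b", "A", "B"]
print(byte_sort(mixed))                # ['A', 'B', 'a', 'b']
print(text_sort(mixed, str.casefold))  # ['a', 'A', 'b', 'B']
```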

      1. Note that I probably should have said “comparisons” not “sorts” – since the string comparison behavior is at issue here and sorts are just one (obvious) way the comparison order is surfaced.

        I considered user-facing code when I wrote that. Look at any application that is user facing and see how many string comparisons are in there, and then how many of those end up displayed to the user as part of a sort.

        There are many where nearly all the comparisons are internal, especially if the language uses ordering as a core component of its containers, as C++ does (where default containers like std::map and friends use ordering as their key relation).

        I’ll grant you that there might be some applications, especially small ones, where comparisons are primarily for sorted display to the user, but I think most will be “internal” uses.

        There is an even stronger case for equality: I think most would agree that the overwhelming use of equality in user-facing programs is for internal mechanics that won’t be displayed to the user, and even where it affects user-visible behavior, “binary equality” (an exact code-point by code-point comparison) works fine. So if you agree that binary equality is desirable, but you want locale-sensitive comparison, you are left with the awkward choice of comparison operators that are inconsistent with equals. Yuck… stay away from that idea!

        1. Aren’t the sorts of examples you’re giving a sort of “premature optimization”? You’re asserting that user facing apps “need” to utilize high-performance maps and hashes and equates and suchlike, and I’m not sure that’s at all true.

          Sure, I can believe that there are some backends like Twitter and Google that need to do whatever they can to run strings fast in their data centers. But they’re also capable of using code correctly. For the default programmer, and the default app, there seems no reason whatsoever to optimize for speed over “correctness”, however defined.

          Maybe, alternatively, the flaw is in ever allowing a sort() routine that doesn’t FORCE the use of a locale (and thus thinking about the issue)?
          So you can have
          sort(strings, kDefaultLocale[=what the OS provides])
          sort(strings, kBytesLocale)
          sort(strings, explicit locale)
          but just sort(strings) does not exist as an option?

          1. No, I don’t think it’s premature to have reasonably fast collections by default in a language – it’s what most people want.

            However, fast or slow collections are a red herring. It’s actually about fast or slow default string comparisons. I think people want fast string operations in general, yes. The concept of “premature optimization” doesn’t apply in the same way to library or base-language features as it does to a finished application. In a finished application you probably understand the input domain and user-facing use cases, so you can tell which operations need to be sped up to improve user experience, and the rest can be left alone.

            For a standard library or built-in language feature, you generally have no idea, or, more precisely – you have to support a huge spectrum of uses from people who don’t care about performance to cases where the cost of the operation will dominate runtime. So the tradeoffs are a bit different: you should make it as fast as possible within the constraints of a reasonable API, and if you have to make your API worse to make it faster, you might consider it depending on the goals of your language, your typical user, availability of alternate APIs etc.

            All that said, this isn’t mostly about performance anyway, but largely about correctness and adhering to the principle of least surprise. I would expect all the basic types to compare identically everywhere. Certainly 0 < 1 everywhere, etc. If strings silently start pulling in the locale for comparisons, you are often making an invalid assumption (that whatever default locale is in use is at all appropriate, even if the user wanted locale-based collation in the first place), and you have just unleashed a bunch of bugs in the many locales that aren’t “en_US”. I would never want my default string comparisons to change behavior based on what user the process was running under, what locale was set by some code I wasn’t aware of, what environment variables were set, etc.

            So comparing by Unicode code-point value seems an entirely reasonable default to me, and this is exactly what almost every modern language, and many ancient ones, do.

            Now there is a time and place for locale-aware display at the UI layer, and languages should better support this use case – but I definitely disagree you want to involve that with the comparator for the core string type.

  6. I’m surprised how little this and other globalization questions come up. I find it fascinating. If I didn’t have experience in the industry, I would assume it isn’t mentioned in blog posts for the same reason unexpected integer division results aren’t mentioned: you get bit once and learn your lesson.

    Instead, I’m sure that while everybody knows that the C locale (sorting ASCII-betically, as they used to say in Perl) isn’t good enough, very few people know what is. And, if the ICU library’s API is anything to go by, the people who do know how to solve the problem apparently have no taste in API design.

    1. I have some experience with this, and in my experience it definitely “comes up”, but apparently not in a way that results in languages having first-class support for it.

      In any enterprise product localization is a big deal and everyone mostly seems to use third party libraries (read this as “ICU”) to support locale aware everything.

  7. Mathematica (as usual) seems to have put some thought into this:
    Defaults work the way you want:
    Sort[{"e", "a", "é", "f"}]
    {"a", "e", "é", "f"}

    Sort[{"a", "b", "A", "B"}]
    {"a", "A", "b", "B"}

    But thought has also been given (and rules exist) to the ordering of “semantically significant” format/font variants, so

    Sort[{"e", "a", "é", "f", "\[ExponentialE]"}]
    {"a", "e", "é", "f", "\[ExponentialE]"}

    That latter element (\[ExponentialE]) would display in Mathematica as a typographic variant of e, but it is the “e” of mathematics, and it is sorted as a different class that comes after “text”.
