Where are all the search trees?

After arrays and linked lists, one of the first data structures computer-science students learn is the search tree. It usually starts with the binary search tree, and then students move on to B-trees for greater scalability.

Search trees are a common mechanism used to implement key-value maps (like a dictionary). Almost all databases have some form of B-tree underneath. In C++, up until recently, default map objects were search trees. In Java, you have the TreeMap class.

In contrast to the search tree, we have the hash map or hash table. Hash maps have faster single look-ups, but because the keys are not ordered physically, traversing the keys in sorted order can be much slower. And it might require fully sorting the keys as part of the iteration process, if you want to go through the keys in order.

In any case, if technical interviews and computer-science classes make a big deal of search trees, you’d think they were ubiquitous. And yet, they are not. Hash maps are what is ubiquitous.

  • JavaScript, maybe the most widely used language in the world, does not have search trees part of the language. The language provides an Object type that can be used as a key-value store, but the keys are not sorted in natural order. Because it is somewhat bug prone to rely on the Object type to provide a map functionality, the language recently acquired a Map type, but it is again a wrapper around what must be a hash map. Maps are “sorted” in insertion order, probably through a linked list so that, at least, the key order is not random.
  • Python, another popular language, is like JavaScript. It provides an all-purpose dict type that is effectively a map, but if you were to store the keys ‘a’, ‘b’, ‘c’, it might give them back to you as ‘a’, ‘c’, ‘b’. (Try {'a': 0, 'b': 0, 'c': 0} for example.) That is, a dict is a hash map. Python has an OrderedDict class, but it merely remembers the order in which the keys were added (like JavaScript’s Map). So there is no search tree to be found!
  • The Go language (golang) provides a map class, but we can be sure there is no underlying search tree, since Go randomizes the key order by design! (To prevent users from relying on key order.)

What does it mean? It means that millions of programmers program all day long without ever using a search tree, except maybe when they access their database.

Though key-value stores are essential to programmers, the functions offered by a search tree are much less important to them. This suggests that programmers access their data mostly in random order, or in order of insertion, not in natural order.

Further reading: Scott Meyers has an article showing that hash maps essentially outperform all other look-ups except for tiny ones where a sequential search is best.

Update: It is commonly stated that a hash map uses more memory. That might not be generally true however. In a test of hash maps against tree maps in Java, I found both to have comparable memory usage.

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

21 thoughts on “Where are all the search trees?”

  1. It’s a little surprising to look at textbook algorithms and data structures and see which ones have retained popularity in the real world. A number of sorting algorithms seem also to have fallen by the wayside, to be mentioned in programming interviews and then never again.

    With hashes I’m not sure if it’s a big deal or just a lot of usage of the small case, say an associative array with 5-10 members.

  2. I think it is obvious why search trees not seen often. For a search tree you need a total ordering on all keys which implies either that all keys are of the same type or you extend your ordering to an ordering of types which is only a half-order in most (all?) type systems. So for a mapping between arbitrary keys to values there is no obvious way to implement that in search tree.

    Whereas a hasmap “only” needs a hashing function (that’s unique and uniformly distributed and so on). So besides having that function, no restrictions are placed on the set of all keys.

    In a search tree, the set of all keys need to have a total ordering.

    1. That’s a good point but how often do you have maps where keys have varied types? In JavaScript, all keys to an Object must be a string, and there is a natural (lexicographical) ordering for strings. JavaScript’s Set type supports varied types however. In Go, you have to specify the type, and the type has to implement comparable but may not be ordered. Python natively supports hash maps with varied types.

      1. Not only varied types but compound types also. If you have struct/tuple of an int and a string as keys you have to implement your own compare function that compares first by the int and then by the string (or the other way around) were both ways to sort might not be “natural”. You still need total ordering.

        Using hash maps, just use the hash functions of your elements, bitshift and xor them. If there is no natural way to sort your items then there is no need to traverse them in some order and then you don’t have to write some sorting functions.

        In databases where b-trees are used as indexes, even for compound keys you need to define an order and most other Indexable database types are naturally orderable.

        Of course many things can go wrong in writing a hash function but given that most environments (thinking of java and python here) give good hash functions for simple types and tools to combine them, this might be preferred to every key needing to implement the Comparable Interface.

        1. Yes. I like your answer very much. Clearly the answer has to do with the relevance of having an order to your keys.

          But consider this. If you want to build a search tree in JavaScript, Go or Python, you are basically on your own. There is no easy solution.

          Yet, there are certainly lots of cases where there is a natural ordering. Strings and integers are ordered. Their order is probably relevant in some way to the application.

          Still, there is apparently very little need to go through these keys in sorted order.

  3. Of course in C++, you can use the boost property_tree, for which each node provides an ordered list of its subnodes.

  4. A couple thoughts:

    As already mentioned by another commenter, trees are central to persistent data structures. While persistent data structures are not ubiquitous in mainstream software, my sense is that they are becoming more popular because they help somewhat with making it easier to write parallel and multitasking code.

    And relatively recently (in data structure terms; about 15 years again) it was shown that hash-based structures can be made tree-shaped: hash array mapped tries. For any data structure nerds out there who don’t know it already, learning about the HAMT is a treat.

    In my experience it’s reasonably common to build search trees layered on top of some other data structure, rather than as a core container library. I agree with the sentiment that for sets and maps, hash-based is often more convenient than comparison-based. However, I don’t think search trees as a concept are going the way of the dodo any time soon.

  5. On modern CPUs then navigating a larger tree means navigating many more potentially uncached lines of memory. This can make larger tree structures orders of magnitude slower than hash maps.

    Example, I recently replaced a tree structure holding all IPv4 ranges with a hash map approach. The tree usually needed 32 branches traversed to get an answer. However, the hash map only touched 3 cache lines and is 20 times faster. However, the reason to switch is because the hash map is 30 times smaller.

    1. 1. How could a hash map be 30 times smaller? Surely, there is more to your story…

      2. A red-black tree can scale poorly, because of the log n data accesses, but a B-tree ought to fix that…

      1. There is more to the story. The first 24 bits of each range are used as an array index or perfect hash. The last 8 bits of each range are stored in a sequential blob which has to be brute force searched. However, an interesting side effect of uncached line fetch speeds is that it’s often faster to brute force search (no memory expensive meta data like pointers needed!) within cached lines than to fetch a single uncached line. This brute force approach is unintuitive for most developers. You have to think about the CPU as having a split personality; operations on cached memory are like walking across the street to the store, while operations on uncached memory are like driving to the store and having to deal with parking… You can do a whole lot of walking if you don’t have to drive 🙂

        This leads us to the memory prefetch instruction which most data structures ignore. But that’s another story…

    2. Cache complexity has indeed surpassed instruction count for in-memory sorting and searching performance. Essentially we’ve moved to an external memory model for anything larger than a few megabytes.

      Also, for long keys, such as urls or file paths, with long distinguishing prefixes, structures such as tries do poorly. I suspect there’s been an inflation in key length as memory has become cheaper.

  6. I think this is a merely a symptom of a larger issue:
    In academia, algorithms are evaluated using computer models from the 80s (or earlier). However, actual computers work like @Simon said, which is fundamentally different from the old models.

    Developers care about actual performance. Academia cares about theoretical performance and rarely compares algorithms using more recent models or actual software. If they do, it is often unoptimized (and write-only) and therefore do not suffice to compare properly.

    There are some exceptions (I think you, Daniel, are a particularly good one) but they are quite rare.

    1. I agree with you, Reinhard. However, I think that most developers are not aware of the cost of caching lines of memory. Which might be okay if their Java or C++ built-in library functions were aware. But sadly most of them are not.

      As a high profile developer blog, I would love to see @Daniel blogging about real world experiments illustrating how slow fetching memory is, and how the memory prefetch instruction might be used to make some algorithms ~ 10 times faster.

      1. I would love to see @Daniel blogging about real world experiments illustrating how slow fetching memory is, and how the memory prefetch instruction might be used to make some algorithms ~ 10 times faster.

        This is not a bad idea.

          1. Thanks for posting the link.

            Page 16 talks about explicitly prefetching using the prefetch instruction. I have found this instruction very difficult to use in a regular algorithm and see any big benefit. Why? It’s very difficult to predict how far ahead of time you need to execute it.

            However, I have had great results using the instruction in what I’ll call “batch mode”. Let’s say you have a hash table and have a batch of 20 keys to look up. You hash each key and then in a loop prefetch the random location in memory that you want to access. By the time the last prefetch is done then the code can continue to actually access the memory for the 1st key without any delay because it just got cached. This gives a huge performance boost to the hash table because so much stuff is happening in parallel and not sequentially. So ~ 10 x performance but only when doing batched operations.

          2. It is a very interesting paper. They do seem to make a good use of prefetch in this instance. Note however that regarding binary search, they could have gotten pretty much the same performance characteristics by simply switching from the branchless to the branchy algorithm, depending on data size… and that would have the benefit of using only straight C code.

            So we are not talking about making algorithms 10 times faster through prefetching…

  7. >>> So we are not talking about making algorithms 10 times faster through prefetching… <<<

    Not in that particular paper. But had they implemented a batch lookup version which does e.g. 20 lookups concurrently then I believe they could have achieved such a significant speed-up. Unfortunately, they didn't make optimal use of the prefetch instruction IMO.

  8. Map is a type that specifically represents key-value pairs. The stored values are not inherited from the prototype. The keys can be of any type, not necessarily strings.

    The documentation on MDN ( Map ) lists the differences as follows:
    An Object has a prototype, so there are default keys in the map. However, this can be bypassed using map = Object.create(null).
    The keys of an Object are Strings, where they can be any value for a Map.
    You can get the size of a Map easily while you have to manually keep track of size for an Object.
    Check this out: https://hackr.io/blog/javascript-map

Leave a Reply to Daniel Lemire Cancel reply

Your email address will not be published.

To create code blocks or other preformatted text, indent by four spaces:

    This will be displayed in a monospaced font. The first four 
    spaces will be stripped off, but all other whitespace
    will be preserved.
    
    Markdown is turned off in code blocks:
     [This is not a link](http://example.com)

To create not a block, but an inline code span, use backticks:

Here is some inline `code`.

For more help see http://daringfireball.net/projects/markdown/syntax

You may subscribe to this blog by email.