How fast does interpolation search converge?

When searching in a sorted array, the standard approach is to rely on a binary search. If the input array contains N elements, then about log(N) + 1 comparisons (base-2 logarithm) suffice to find the value you are looking for. The algorithm is well known, even by kids. You first guess that the value is in the middle: you check the value in the middle, you compare it against your target, and you go either to the upper half or the lower half of the array based on the result of the comparison.
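
In code, a binary search that locates the first position at which the target could appear might look as follows (a minimal sketch, not a tuned implementation):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Returns the index of the first element >= target, or data.size() if there is none.
    size_t binary_search_lower_bound(const std::vector<uint64_t>& data, uint64_t target) {
        size_t lo = 0, hi = data.size();
        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;   // guess the middle
            if (data[mid] < target) {
                lo = mid + 1;                  // target is in the upper half
            } else {
                hi = mid;                      // target is in the lower half (or at mid)
            }
        }
        return lo;
    }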

Binary search only requires that the values be sorted. What if the values are not only sorted, but also follow a regular distribution? Maybe you are generating random values, uniformly distributed. Maybe you are using hash values.

In a classical paper, Perl et al. described a potentially more effective approach called interpolation search. It is applicable when you know the distribution of your data. The intuition is simple: instead of guessing that the target value is in the middle of your range, you adjust your guess based on the value itself. If the value is smaller than average, you aim near the beginning of the array. If the value is much larger than average, you guess that the index should be near the end.
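
A minimal sketch of the idea in C++ might look as follows (assuming 64-bit keys that are roughly uniform over the interval spanned by the array):

    #include <cstdint>
    #include <vector>

    // Returns the index of target in the sorted array data, or data.size() if absent.
    size_t interpolation_search(const std::vector<uint64_t>& data, uint64_t target) {
        if (data.empty()) { return data.size(); }
        size_t lo = 0, hi = data.size() - 1;
        while (lo <= hi && target >= data[lo] && target <= data[hi]) {
            if (data[hi] == data[lo]) {        // avoid dividing by zero
                return (data[lo] == target) ? lo : data.size();
            }
            // Guess where the target sits between data[lo] and data[hi].
            double fraction = double(target - data[lo]) / double(data[hi] - data[lo]);
            size_t guess = lo + size_t(fraction * double(hi - lo));
            if (data[guess] == target) {
                return guess;
            } else if (data[guess] < target) {
                lo = guess + 1;                // aim higher
            } else {
                hi = guess - 1;                // aim lower
            }
        }
        return data.size();
    }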

The expected search time is then much better: log(log(N)). To gain some intuition, I quickly implemented interpolation search in C++ and ran a little experiment, generating large arrays and searching in them using interpolation search. As you can see, as you multiply the size of the array by 10, the number of hits (comparisons) grows only very slowly. Furthermore, interpolation search tends to get very close to the target quickly, so the results are even better than they look when memory locality is a factor.

N            hits
100          2.9
1000         3.5
10000        3.8
100000       4.0
1000000      4.5
10000000     4.6
100000000    4.9

You might object that such a result is inferior to a hash table, and I do expect well implemented hash tables to perform better, but you should be mindful that many hash table implementations gain performance at the expense of higher memory usage, and that they often lose the ability to visit the values in sorted order at high speed. It is also easier to merge two sorted arrays than to merge two hash tables.
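
To illustrate the merging point, merging two sorted arrays is a single linear pass over both inputs, while merging two hash tables amounts to rehashing and reinserting every element of one of them. A minimal sketch might look as follows:

    #include <algorithm>
    #include <cstdint>
    #include <iterator>
    #include <unordered_set>
    #include <vector>

    // Merging two sorted arrays: one sequential, cache-friendly pass.
    std::vector<uint64_t> merge_sorted(const std::vector<uint64_t>& a,
                                       const std::vector<uint64_t>& b) {
        std::vector<uint64_t> out;
        out.reserve(a.size() + b.size());
        std::merge(a.begin(), a.end(), b.begin(), b.end(), std::back_inserter(out));
        return out;
    }

    // Merging two hash tables: every element of b must be hashed again and
    // inserted into a, with scattered memory accesses along the way.
    std::unordered_set<uint64_t> merge_hashed(std::unordered_set<uint64_t> a,
                                              const std::unordered_set<uint64_t>& b) {
        a.insert(b.begin(), b.end());
        return a;
    }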

This being said, I am not aware of interpolation search being actually used productively in software today. If you have a reference to such an artefact, please share!

 

Update: Some readers suggest that Bigtable relies on a form of interpolation search.

Update: It appears that interpolation search was tested out in git (1, 2). Credit: Jeff King.

Further reading: Interpolation search revisited by Muła

Published by Daniel Lemire, a computer science professor at the University of Quebec (TELUQ).

17 thoughts on “How fast does interpolation search converge?”

  1. Thanks for the article – indeed, interpolation search is not something I usually think of when searching in sorted lists. I should probably change that.

    However, there is one major downside that you did not mention: If the data that you search in is in some way generated by a second party (say, a user), an adversary can easily construct worst-case input data, and I think you can force the algorithm to visit every single element in the array. Of course, in this case your initial assumption that you know the distribution of the values is not true anymore.

  2. Howdy, I came up with a variant of interpolation search for when the data set is immutable—use the expected slot as the midpoint to binary search but then choose the low and high guards based off the precomputed maximum error of any item in the set (i.e., distance from its expected slot). In digital forensics/information security, we often have large sets of cryptographic hash values (MD5/SHA-1/SHA2-256) from previously examined files and simply want to know whether a provided hash value is in the set. Since these sets are largely immutable, we just write the hash values as sorted binary, with a simple file header that stores the determined error. The first hash value then begins at offset 4096 and the file can be memory mapped for binary search.

At least twice as many lookups per second can be done this way as with a naive binary search. With a real-world hash set that’s about a GB in size, the maximum error only has an ~8KB radius, so we only need to hit a couple of pages right next to each other. And… the implementation is pleasingly simple.
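
    A rough sketch of this bounded-error lookup might look as follows (a simplification, not the actual implementation; it assumes the hash values are read as uniformly distributed 64-bit integers and that max_error was precomputed when the file was written):

        #include <algorithm>
        #include <cstdint>
        #include <vector>

        // Is target present? Only the slice [expected - max_error, expected + max_error]
        // needs to be searched, where max_error is the precomputed maximum distance of
        // any key from its expected slot.
        bool contains(const std::vector<uint64_t>& keys, uint64_t target, size_t max_error) {
            if (keys.empty()) { return false; }
            // Expected slot of a uniformly distributed 64-bit key
            // (__uint128_t is a GCC/Clang extension).
            size_t expected = size_t((__uint128_t(target) * keys.size()) >> 64);
            size_t lo = (expected > max_error) ? expected - max_error : 0;
            size_t hi = std::min(keys.size(), expected + max_error + 1);
            return std::binary_search(keys.begin() + lo, keys.begin() + hi, target);
        }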

    1. use the expected slot as the midpoint to binary search but then choose the low and high guards based off the precomputed maximum error of any item in the set (i.e., distance from its expected slot).

      That’s an interesting proposal.

  3. I’d also recommend reading “Efficiently Searching In-Memory Sorted Arrays: Revenge of the Interpolation Search”

    http://pages.cs.wisc.edu/~chronis/files/efficiently_searching_sorted_arrays.pdf

    It summarizes a few practical tricks for running the search quickly, including a 3-point interpolation technique which fits non-linear distributions better.

    3-point interpolation can be useful as a search acceleration method when you’re trying to approximate a complex function; my library, meshoptimizer, uses this in one of the algorithms instead of a binary search to find the optimal parameter in the optimization algorithm to get faster convergence.
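
    As a rough sketch (not the code from the paper or from meshoptimizer), the core of a 3-point estimate can be written as an inverse quadratic (Lagrange) interpolation of the index as a function of the key:

        // Estimate the index at which a sorted array reaches `target`, given three
        // probes (i0, k0), (i1, k1), (i2, k2) with pairwise distinct keys.
        double three_point_estimate(double i0, double k0, double i1, double k1,
                                    double i2, double k2, double target) {
            double w0 = (target - k1) * (target - k2) / ((k0 - k1) * (k0 - k2));
            double w1 = (target - k0) * (target - k2) / ((k1 - k0) * (k1 - k2));
            double w2 = (target - k0) * (target - k1) / ((k2 - k0) * (k2 - k1));
            return i0 * w0 + i1 * w1 + i2 * w2;   // clamp to the current range before probing
        }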

    1. Thanks John.

      I used to teach numerical analysis so I am familiar with such techniques, at least at a high level, but I was focusing on the search problem within arrays, as a data structure.

  4. Adrian Colyer recently featured a learned sorting algorithm that is fancy interpolation. It doesn’t actually learn the algorithm, just the distribution of the data.

    The authors don’t seem to understand strings though — they use some kind of vector representation instead of the natural embedding of binary strings as numbers in [0,1).

    1. On second reading, they do encode strings as numbers, but they also appear to do a sample sort to smooth out the distribution, so it’s not clear how important the numeric model is.

      FWIW many data warehouse products use some form of block range index that amounts to a piecewise specification of the distribution. It seems more robust against clumpy distributions.

  5. The comparison you make between a hash table and a sorted array seems flawed to me. I agree that the array is an easier way to traverse the data in order, and I would need to think about the merging distinction more before drawing a conclusion.

    Regarding hash table implementations using more memory to gain performance, what is this in regard to? I can think of a few meanings:

    - If you are referring to storing the hash to quicken the comparison, that can be eliminated.
    - If you are referring to leaving slots open for new items, that implies that the dataset is mutable. I assume that mutable sorted array implementations have much worse write performance if they don’t leave slots open: copying the entire array to add an item is linear. Linear-complexity writes, although most likely much slower, could also be achieved in an exact-size array hash table. Given that the extra space is typically included to improve write performance (and to slightly improve scan performance), I suppose that an implementation could eliminate the empty slots if memory usage is more of a concern.
    - If you are referring to an immutable “perfect” hash table, given enough time, is it not possible to find a hash algorithm that uniquely associates each item with a slot in an exact-size array? The only memory overhead would be the constant-sized inputs into the hash algorithm.

    I’m interested in understanding which distinction you mean and, if it’s as substantial as you originally meant, how so.

    Thanks!

    1. The comparison you make between a hash table and a sorted array seems flawed to me. I agree that the array is an easier way to traverse the data in order, and I would need to think about the merging distinction more before drawing a conclusion.

      Have you tried implementing code that merges two sorted arrays, and then benchmarking it against code that merges two distinct hash tables?

      Regarding hash table implementations using more memory to gain performance, what is this in regard to?

      My statement is ‘you should be mindful that many hash table implementations gain performance at the expense of higher memory usage’. If you disagree then you have to demonstrate that not many hash table implementations use a lot of memory. You can use, say, an unordered_set in C++ or a HashSet in Java. To store integers, you will need tens of bytes per entry.

      1. You can use, say, an unordered_set in C++ or a HashSet in Java. To store integers, you will need tens of bytes per entry.

        It is true that specifically unordered_set and HashSet use a lot of memory per entry and their practical performance often suffers because of it, but I would argue that it is not “fundamental” to hash tables and it is not really related to the memory/performance tradeoff that hash tables make.

        Rather, in the case of Java it is related to a type-erased generics implementation which makes all collection types inefficient for primitives. Only primitive arrays like int[] escape this penalty. If you care, you can use per-primitive-type specialized hash sets which use the underlying primitive array type. I would consider this a Java quirk (it also applies in some other languages) and nothing to do with hash sets, per se.

        Similarly, C++ made the unfortunate decision to give its standard hash sets an effectively node-based design: a dynamically allocated node per element. Very unfortunate, but not an intrinsic property of hash tables. Plenty of good C++ hash sets/maps out there without this problem.

        Now, hash sets/maps do make a space/speed tradeoff: having a load factor < 1, which wastes some space. However, this penalty is usually less than 2x; that is, it should take less than 8 bytes to store a 4-byte integer in most implementations, which is much lower than the unfortunate implementations above would suggest.

  6. I think the typical stumbling point for the interpolation search approach is that the stored distribution isn’t known or easy to approximate (with sufficient precision). Hash-like keys might make sense in some cases, I guess. In theory one can do something like what is explained at the end of section 3.4 in http://v.cunat.cz/theses/master.pdf but perhaps getting the (space) constants low could be difficult, though full dynamicity might be worth that cost.

    Maybe if the rate of changes were low, some dynamization of the static array would be worthwhile. Perhaps just keeping spaces (with copies of some of the two adjacent keys), as there’s a simple way [1] of keeping even worst-case amortized moves to a log-squared factor – which might be OK-ish in practice. Moreover, if the distribution is not changing (much) over time, note that all insertions are expected to be uniformly distributed within the array (say… the array index is roughly uniformly distributed, at least if I disregard any spaces or if those are uniform).

    [1] Bender et al.: Two Simplified Algorithms for Maintaining Order in a List

    My text in the previous link also contains lots of other theoretical stuff related to interpolation search (for anyone interested). My conclusion overall was that these techniques don’t pay off in practice. Perhaps some variation on that special case above, but otherwise… I expect it would have to be some veeery special use case.
