Maps and sets can have quadratic-time performance

Swift is a new programming language launched by Apple slightly over two years ago. Like C and C++, it offers ahead-of-time compilation to native code, along with many modern features. It is available on Linux and macOS.

Like C++, Swift comes complete with its own data structures like dictionaries (key-value or associative maps) and sets. It seems that Apple is simply reusing its Objective-C data structures.

How do you implement dictionaries and sets? We typically use hash tables. The general idea is simple:

  • We map values to “random” integers, a process called hashing. We call the random integer a “hash value”. The key fact is that whenever we have identical values, they should generate the same hash value. This is usually achieved using a funny-looking function that is carefully crafted to generate numbers that look random.
  • We store the values at the location indicated by the hash value.

This latter step is critical. At a high level, you can just build a large array and store the values with hash value x at index x. This does not quite work because different values might share the same hash value. You get “collisions”.
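In code, the mapping from a value to an array index might look like the following rough sketch (my own illustrative function, not anything from the standard library), assuming the array capacity is a power of two:

// A rough illustration (not Swift's actual implementation): map a value to a
// slot in an array whose capacity is a power of two.
func slot<T: Hashable>(of value: T, capacity: Int) -> Int {
  // Masking with capacity - 1 keeps the index in range; two distinct values
  // can end up with the same index, and that is a collision.
  return value.hashValue & (capacity - 1)
}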

How do we resolve this problem? There are two popular techniques. One of them is called chaining: at each location in the big array, we keep a pointer to a container where the values are actually stored. Many popular languages use chaining (Java, Go…). The theory of chaining is simple and the implementation is not difficult.
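Here is a simplified sketch of chaining (my own illustrative code, not the implementation used by any particular language), where each slot of the big array holds a small array of values:

// A simplified chaining sketch (illustrative only). The capacity must be a
// power of two so that masking with capacity - 1 yields a valid index.
struct ChainedSet<T: Hashable> {
  var buckets: [[T]]

  init(capacity: Int) {
    buckets = Array(repeating: [], count: capacity)
  }

  mutating func insert(_ value: T) {
    let index = value.hashValue & (buckets.count - 1)
    // We must scan the bucket to avoid duplicates: a long bucket makes
    // repeated insertions into it slow.
    if !buckets[index].contains(value) {
      buckets[index].append(value)
    }
  }

  func contains(_ value: T) -> Bool {
    let index = value.hashValue & (buckets.count - 1)
    return buckets[index].contains(value)
  }
}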

In Swift, they use linear probing. Linear probing is also simple to implement: if you try to store a value at an occupied slot, you just move to the next available slot. When you seek a value, you start from the index indicated by the hash value, and move forward until you either find your value or an empty slot.
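A bare-bones linear-probing sketch (again my own code, not Swift's standard library, and without any resizing logic) might look like this:

// A bare-bones linear-probing sketch (illustrative only). The capacity must
// be a power of two and the table must never become completely full.
struct ProbedSet<T: Hashable> {
  var slots: [T?]

  init(capacity: Int) {
    slots = Array(repeating: nil, count: capacity)
  }

  mutating func insert(_ value: T) {
    var index = value.hashValue & (slots.count - 1)
    // Walk forward until we find the value or an empty slot.
    while let occupant = slots[index] {
      if occupant == value { return } // already present
      index = (index + 1) & (slots.count - 1)
    }
    slots[index] = value
  }

  func contains(_ value: T) -> Bool {
    var index = value.hashValue & (slots.count - 1)
    while let occupant = slots[index] {
      if occupant == value { return true }
      index = (index + 1) & (slots.count - 1)
    }
    return false
  }
}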

Suppose that you want to insert a new element, and you get a collision. With chaining, this means that you need to append the value to a container (typically an array). Before you do so, you need to check whether your value is already present in the container… This means that if you repeatedly insert into the same container, you will have quadratic-time complexity. That is, it gets very slow very fast. You have to be quite unlucky for this to happen.

Things get somewhat worse with linear probing. With linear probing, not only can you collide with values that have the same hash value, but you also tend to collide with other values that have nearby hash values! This means that the performance of linear probing can be quite a bit worse than the performance of chaining, in the worst case.

In several languages, such as PHP and JavaScript, maps preserve the “insertion order”, which means that when iterating through the values stored, you do so in the order in which they were inserted. So if you first inserted a key having to do with “John” and then one having to do with “Mary”, the map will remember and, by default, enumerate them in this order. However, Swift does not appear to keep track of the insertion order by default. This creates a performance trap.
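As a tiny illustration (the output will vary between runs and Swift versions), you can check that a Swift Set does not enumerate its elements in insertion order:

var names = Set<String>()
names.insert("John")
names.insert("Mary")
names.insert("Anna")
// The enumeration order depends on the hash values and the table size,
// not on the order of insertion.
for name in names {
  print(name)
}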

Let me illustrate it with code. Suppose that I have a set that stores all values between 1 and some large number. My point does not rely on the data taking this exact form, but it is the simplest example I could think of… (It does not matter that I am using integers!)

let size = 1_000_000 // size of the big set; any large value will do

var d = Set<Int>()
for i in 1...size {
  d.insert(i)
}

Next, maybe I have another set with a few values in it…

var dcopy = Set<Int>()
dcopy.insert(1000)

… and then I want to dump all of the big set into the small set like so…

for i in d {
    dcopy.insert(i)
}

Ah! You have fallen into a performance trap!

You think: “Surely not! I was taught that insertion in a hash table takes constant time…” To this, I object that you were taught no such thing. Or, rather, that you failed to read the fine print!

Basically, in both instances, I am building a hash table starting from nothing. The only difference is the order in which I build the hash table. It turns out that the order in which you insert elements matters.

Let us look at the numbers. Let us start with a small set made of a million elements… I time the results on my laptop…

  • Build time: 130 ns/element
  • Reinsertion: 670 ns/element

Ok. So reinserting the data from the big set to the small set takes 5 times longer. “No big deal”, you think, “I have CPU cycles to spare.”

Let us grow the set a bit more, to 16 million elements:

  • Build time: 180 ns/element
  • Reinsertion: 3400 ns/element

You see where I am going with this? Roughly speaking, the reinsertion time has quadratic complexity whereas the build time has linear complexity.

We can run a simulation and count the number of collisions generated on average using either a random insertion order or an adversarial insertion order, as in my code example. When we do so, we observe that the number of collisions grows very fast in the adversarial setting.
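Such a simulation can be written in a few dozen lines. The following sketch is mine (it is not the code I benchmarked above): it counts how many occupied slots are visited (“probes”) by a linear-probing table that doubles in size when it is three-quarters full, first when inserting keys in numeric order, then when reinserting them in the slot order of the table that was just built. The mixing function is an arbitrary stand-in for a hash function.

// A simulation sketch (my own illustrative code): count how many occupied
// slots we visit ("probes") while filling a linear-probing table that
// doubles when it is three-quarters full.

// An arbitrary 64-bit mixing function standing in for a hash function.
func mix(_ x: Int) -> UInt64 {
  var z = UInt64(bitPattern: Int64(x)) &+ 0x9E3779B97F4A7C15
  z = (z ^ (z >> 30)) &* 0xBF58476D1CE4E5B9
  z = (z ^ (z >> 27)) &* 0x94D049BB133111EB
  return z ^ (z >> 31)
}

struct ProbeCountingTable {
  var slots: [Int?]
  var count = 0
  var probes = 0

  init(capacity: Int) {
    slots = Array(repeating: nil, count: capacity)
  }

  mutating func insert(_ value: Int) {
    if 4 * (count + 1) > 3 * slots.count {
      grow()
    }
    var index = Int(mix(value) & UInt64(slots.count - 1))
    while let occupant = slots[index] {
      probes += 1
      if occupant == value { return }
      index = (index + 1) & (slots.count - 1)
    }
    slots[index] = value
    count += 1
  }

  // Doubling the table reinserts every element, which also costs probes.
  mutating func grow() {
    let old = slots
    slots = Array(repeating: nil, count: 2 * old.count)
    count = 0
    for case let value? in old {
      insert(value)
    }
  }
}

let n = 1_000_000

// Build: insert 1...n in numeric order.
var built = ProbeCountingTable(capacity: 8)
for i in 1...n {
  built.insert(i)
}
print("build probes: \(built.probes)")

// Adversarial order: reinsert the same keys in the slot order of the table
// we just built, mimicking iteration over a set.
var redone = ProbeCountingTable(capacity: 8)
for case let value? in built.slots {
  redone.insert(value)
}
print("reinsertion probes: \(redone.probes)")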

So the performance issue can be explained entirely by the fast-rising number of collisions with adversarial insertions on larger and larger sets.

Where does the problem come from? When we resize the hash table, we have to reinsert the values into a larger array. Normally, the values would spread out nicely, but our adversarial insertion order is not at all random, and it facilitates expensive collisions.

(As Tim points out in the comments, you can get around this problem by creating Sets having a large capacity. Or you can use a different data structure.)
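For instance, the workaround Tim describes amounts to something like the following sketch, replacing the earlier initialization of dcopy (the factor of two is a guess meant to keep the load factor low; the exact capacity you need depends on the implementation):

// Workaround sketch: reserve enough capacity up front so that the
// destination set never needs to grow (and re-hash) during the copy.
var dcopy = Set<Int>(minimumCapacity: 2 * d.count)
dcopy.insert(1000)
for i in d {
  dcopy.insert(i)
}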

My source code is available.

Further reading: Rust hash iteration+reinsertion (same problem as I describe, but in Rust)

Published by Daniel Lemire, a computer science professor at the University of Quebec (TELUQ).

9 thoughts on “Maps and sets can have quadratic-time performance”

  1. In CoreFoundation, there were a tiny number of all-singing-all-dancing collections which tried to intuit what you were using them for, and pick the correct data structure internally: http://ridiculousfish.com/blog/posts/array.html

    Swift isn’t like this. Swift has a much larger number of collections, and it’s up to the programmer to pick the right one for the job.

    In Swift, if you want a set of contiguous ints, the right data structure is IndexSet. It’s super-fast at this test: way less than 1ns/el for all operations, and a build/copy ratio of pretty close to 1 (between 0.4 and 1.5, on my workstation).

    The next question is: why is Set especially slow for this case? Why does it run into linear probing issues at all? Ints in Swift return different hashValues so it’s not obvious why they would collide at all.

    A Set created with no initial value or minimumCapacity will be as small as possible. (The source isn’t available, but one place I found says 1 bucket, and another says 0.) It’s hitting collisions all the time because it’s creating a tiny hash table, then as you add more elements it needs to copy that to a bigger hash table, and so on, ad infinitum. When you add a million items, one at a time, it doesn’t know until the last one that you intended to add a million items all along.

    Sure enough, if I change the Set() inits to Set(minimumCapacity: 2*size), the code runs drastically faster, and the copy/build ratio is under 1.0 in all cases.

    No hash table implementation can grow indefinitely with no performance hit. Either it needs to start out big (and waste memory, if you only needed a few elements), or it needs to copy itself as it grows (and waste time, if you know you need a lot). You can solve this by telling it to start out big.

    You’re seeing quadratic time not because of linear probing, but because of bucket copying.

    1. I used an Int data type just to keep things simple, but the problem is general.

      Ints in Swift return different hashValues so it’s not obvious why they would collide at all.

It does not matter that they are integers, except for the fact that integers make the benchmark a lot easier (since, at the core, they are much faster to work with).

      Modify my code, it will take you seconds… just cast the Ints to Strings. The ratio blows up just the same.

      A Set created with no initial value or minimumCapacity will be as small as possible. (…) You’re seeing quadratic time not because of linear probing, but because of bucket copying.

I agree that if you set the capacity high enough, the problem will go away. It is one of several ways you can solve this problem.

      But in my code, I build two sets starting from an empty set. It is the same problem. The only difference in the second case is the order in which I build it.

      1. “But in my code, I build two sets starting from an empty set. It is the same problem. The only difference in the second case is the order in which I build it.”

        Then why, according to your theory, would linear probing affect one, but not the other? Shouldn’t inserting values in either order be just as susceptible to the linear probing penalty? What’s special about the second ordering that makes it slow? Inserting them backwards is still fast, and so is inserting odds-then-evens.

        The “1000” is a red herring. It doesn’t affect performance at all. Even the cost of iterating over the first Set is negligible. As you say, this is purely about the order in which items are inserted into a Set.

        “just cast the Ints to Strings. The ratio blows up just the same”

        I assume you mean “init”, not “cast” (which would throw an exception and abort the process), and no, the ratios don’t blow up in the same way for me. The first few are 5.17, 7.09, 7.07, 24.59, 18.41. It’s generally going up, but it’s not quadratic, or even monotonic. Sometimes when the data size doubles, it goes the same speed.

        For that matter, the Int version doesn’t show quadratic behavior, either. Why is the n=4M case consistently so much faster than the n=2M case? Why is the n=32M case almost exactly the same speed as the n=16M case? This doesn’t look like any quadratic algorithm I’ve ever seen. I graphed the powers-of-two from n=1M to n=64M and they appear to level off.

        I’m seeing a lot of interesting hypotheses, but no smoking gun yet. You extrapolated from 2 data points to “quadratic”, and “linear probing is the culprit”. But when I ran the same test, incrementing |size| by 10% each time, rather than 100%, I got this:

        https://imgur.com/a/AmA3r

        There’s clearly a lot more going on than just “quadratic complexity”, or that you might guess from looking at 2 points. Half the time, the “rebuild” is slower, but half the time it’s the same speed (or slightly faster). When n=14.4M, for example, the rebuild rate is over 1000 times *faster* than when n=8.95M, despite having over 60% more data.

        It’s true that order of insertion matters, but I don’t think there’s enough information yet to pinpoint the cause of this. This is a fascinating puzzle you’ve stumbled on!

        1. Then why, according to your theory, would linear probing affect one, but not the other?

          Because in the second case, we get an adversarial insertion order.

          Shouldn’t inserting values in either order be just as susceptible to the linear probing penalty?

          I did not refer to a linear probing penalty. Linear probing is fast and efficient, except when it fails.

          What’s special about the second ordering that makes it slow?

          It is special because it is aware of the hash values.

          The “1000” is a red herring.

          It is not necessary to get this effect, you are right, but without this step, I was sure to get a comment to the effect that it is trivial to copy a set quickly without a for-loop.

          I assume you mean “init”, not “cast” (which would throw an exception and abort the process)

          I used the term “cast” as short for “type casting”:

          In computer science, type conversion, type casting, and type coercion are different ways of changing an entity of one data type into another. An example would be the conversion of an integer value into a floating point value or its textual representation as a string, and vice versa. https://en.wikipedia.org/wiki/Type_conversion

          You must be thinking of something else.

          The first few are 5.17, 7.09, 7.07, 24.59, 18.41. It’s generally going up, but it’s not quadratic, or even monotonic.

          I did not get to the notion of quadratic complexity by looking at the numbers. It is very hard to establish the computational complexity of a problem by looking at the actual speed.

  2. This has to do with cache locality. Normally Build and Reinsertion should _not_ have different performance characteristics. In both cases, you are inserting elements. In Build you start with the empty set. In Reinsertion you start with a set with a few elements. Theoretically, if you assume that iterating/reading a set is fast, there would be no difference, since they are doing the same thing (“insert”).

The performance difference arises because reading from another set before each insertion results in more cache misses. With such large data sets, the cache is being completely utilized, and reading from another region in memory before each write means more stuff must be cached. Thus, more cache misses. That's what's going on here.

    1. In Build you start with the empty set. In Reinsertion you start with a set with a few elements.

      As pointed out in other comments, this is not actually relevant. I started the second phase with an element to avoid comments to the effect that I am copying the set in a stupid manner.

      You can start with empty sets in both instances, it won’t change the result.

      Theoretically, if you assume that iterating/reading a set is fast, there would be no difference, since they are doing the same thing (“insert”).

There is no theoretical result that says that insertion order is irrelevant.

The performance difference arises because reading from another set before each insertion results in more cache misses. With such large data sets, the cache is being completely utilized, and reading from another region in memory before each write means more stuff must be cached. Thus, more cache misses. That's what's going on here.

      If you look at the actual benchmarking code, on GitHub, you will see that both read their data from arrays. The only difference is the insertion order.
