Fast sets of integers

Maintaining a set of integers is a common problem in programming. It can also be implemented in many different ways.

Perhaps the most common implementation uses hashing (henceforth hashset): it provides optimal expected-time complexity. That is, we expect adding or removing an integer to take constant time (O(1)), and iterating through the elements to take time proportional to the cardinality of the set. For this purpose, Java provides the HashSet class.
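As a minimal sketch of the hashset approach, the following uses the standard java.util.HashSet. The class name HashSetSketch and its helper methods are hypothetical names chosen for illustration; they are not part of the benchmark code.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper class, for illustration only.
class HashSetSketch {
    // Builds a set from the given values; duplicates are silently ignored.
    static Set<Integer> build(int[] values) {
        Set<Integer> set = new HashSet<>();
        for (int v : values) {
            set.add(v); // expected constant time per insertion
        }
        return set;
    }

    // Iterates over the set once; the work is proportional to the
    // number of stored elements, not to the size of the universe.
    static long sum(Set<Integer> set) {
        long total = 0;
        for (int v : set) {
            total += v;
        }
        return total;
    }
}
```

Note that each int is boxed into an Integer object here, which is one reason a hashset tends to use more memory per element than a bitset.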

An alternative implementation is the bitset: essentially, it is an array of boolean values. It ensures that adding and removing integers takes a constant time. However, iterating through the elements could be less than optimal: if your universe contains N integers, it may take a time proportional to N to enumerate the integers irrespective of the number of integers in the set.
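To make the trade-off concrete, here is a minimal bitset sketch that packs 64 boolean values into each long. The class SimpleBitSet is a hypothetical illustration and should not be read as the StaticBitSet used in the test; it assumes all indexes are non-negative and smaller than the universe size given to the constructor.

```java
// Hypothetical minimal bitset, for illustration only.
// Assumes 0 <= i < universeSize for every index passed in.
class SimpleBitSet {
    private final long[] words;

    SimpleBitSet(int universeSize) {
        words = new long[(universeSize + 63) >>> 6]; // 64 bits per word
    }

    // Java masks the shift count of a long to its low 6 bits,
    // so (1L << i) selects bit (i mod 64) within the word.
    void add(int i)         { words[i >>> 6] |= 1L << i; }
    void remove(int i)      { words[i >>> 6] &= ~(1L << i); }
    boolean contains(int i) { return (words[i >>> 6] & (1L << i)) != 0; }

    int cardinality() {
        int count = 0;
        for (long w : words) count += Long.bitCount(w);
        return count;
    }

    // Returns the smallest element >= from, or -1 if there is none.
    // Empty words are still visited, so a full iteration costs time
    // proportional to the universe size (divided by 64).
    int nextSetBit(int from) {
        int w = from >>> 6;
        if (w >= words.length) return -1;
        long word = words[w] & (-1L << from); // drop bits below 'from'
        while (true) {
            if (word != 0) return (w << 6) + Long.numberOfTrailingZeros(word);
            if (++w == words.length) return -1;
            word = words[w];
        }
    }
}
```

Iteration then follows the usual pattern: for (int i = s.nextSetBit(0); i >= 0; i = s.nextSetBit(i + 1)) { ... }. Skipping a whole empty word with a single comparison is what keeps the scan cheap even when the set is fairly sparse.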

So, suppose you expect to have 1000 integers in the range from 0 to N. Which data structure is best? The hashset or the bitset? Clearly, if N is sufficiently large, the hashset will be best. But how large must N be?

I decided to implement a quick test to determine the answer. Instead of using the standard Java BitSet, I wrote my own bitset (henceforth StaticBitSet) that is faster in my tests. For the hashset, I compared both the standard HashSet and TIntHashSet and found little difference in performance in my tests, so I report just the results with the standard HashSet (from OpenJDK 7).
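For readers who want to try something similar, the following is a rough sketch of such a test. It is not the code behind the reported numbers: it uses the standard java.util.BitSet rather than StaticBitSet, and the class name SetTimingSketch, the method names, the seed and the values of N are all assumptions made for illustration. It also does no JIT warm-up, so the timings it prints are crude.

```java
import java.util.BitSet;
import java.util.HashSet;
import java.util.Random;

// Hypothetical timing sketch, for illustration only.
class SetTimingSketch {
    // Draws 'count' distinct integers from [0, n); assumes count <= n.
    static int[] randomDistinct(int count, int n, long seed) {
        Random rnd = new Random(seed);
        HashSet<Integer> seen = new HashSet<>();
        int[] out = new int[count];
        int k = 0;
        while (k < count) {
            int v = rnd.nextInt(n);
            if (seen.add(v)) out[k++] = v;
        }
        return out;
    }

    // Adds, iterates, then removes; returns how many elements were iterated.
    static int cycleBitSet(int[] values, int n) {
        BitSet bs = new BitSet(n);
        for (int v : values) bs.set(v);
        int count = 0;
        for (int i = bs.nextSetBit(0); i >= 0; i = bs.nextSetBit(i + 1)) count++;
        for (int v : values) bs.clear(v);
        return count;
    }

    static int cycleHashSet(int[] values) {
        HashSet<Integer> hs = new HashSet<>();
        for (int v : values) hs.add(v);
        int count = 0;
        for (int v : hs) count++;
        for (int v : values) hs.remove(v);
        return count;
    }

    // Converts a count and a duration in nanoseconds to millions per second.
    static double millionsPerSecond(int elements, long nanos) {
        return elements * 1e3 / nanos;
    }

    public static void main(String[] args) {
        for (int n = 10_000; n <= 10_000_000; n *= 10) {
            int[] values = randomDistinct(1000, n, 42L);
            long t0 = System.nanoTime();
            cycleBitSet(values, n);
            long t1 = System.nanoTime();
            cycleHashSet(values);
            long t2 = System.nanoTime();
            System.out.println(n + " " + millionsPerSecond(values.length, t1 - t0)
                    + " " + millionsPerSecond(values.length, t2 - t1));
        }
    }
}
```

A more careful measurement would repeat each cycle many times and discard the first runs, or use a benchmarking harness.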

The following table reports the speed in millions of elements per second for adding, removing and iterating through 1000 elements in the range from 0 to N.

  N    bitset    hashset  

These numbers are consistent with the theory. The speed of the hashset data structure is relatively independent of N, whereas the performance of the bitset degrades as N increases. However, what might be surprising is how large N needs to be before the bitset is beaten. The bitset only starts failing you (in this particular test) when the ratio of the size of the universe to the size of the set exceeds 1,000.

The bitset data structure is more generally applicable than you might think.

Source: My Java source code is available, as usual.

Further reading: In Sorting is fast and useful, I showed that binary search over a sorted array of integers can be a competitive way to test whether a value belongs to a set.
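As a small sketch of that idea, membership in a static set can be tested with the standard Arrays.binarySearch over a sorted copy of the values. The class name SortedArraySet is hypothetical and is not taken from the linked post.

```java
import java.util.Arrays;

// Hypothetical read-only set backed by a sorted int array.
class SortedArraySet {
    private final int[] sorted;

    SortedArraySet(int[] values) {
        int[] copy = values.clone(); // leave the caller's array untouched
        Arrays.sort(copy);
        this.sorted = copy;
    }

    // Arrays.binarySearch returns a non-negative index when the key is found,
    // so membership takes O(log n) comparisons.
    boolean contains(int v) {
        return Arrays.binarySearch(sorted, v) >= 0;
    }
}
```

This structure suits sets that are built once and queried many times, since adding or removing an element would require shifting or re-sorting the array.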

17 thoughts on “Fast sets of integers”

  1. Daniel, this is entirely consistent with my experience. A well-implemented bit set is a nearly optimal data structure for modern CPU architectures. It is brute force but in a manner CPUs are highly optimized for. If you are saturating all the ALUs then the number of entry evaluations per clock cycle is very high. Hash sets have to overcome their cache line overhead and relatively low number of evaluations per clock cycle.

    As you undoubtedly know, when bit sets become large/sparse enough to become inefficient, a very good alternative is compressed bit sets.

  2. Of course. I’m just wondering since I have some extremely performance-sensitive code which uses java.util.BitSet — it’ll be nice to benchmark it there too, given those numbers….

  3. nice results :)!

    this seems to be a bit offtopic but 🙂

    does somebody know of a sparse hashset implementation in Java which is more memory efficient than the THashSet?

    I need some memory efficient datastructure for the case that there are areas of consecutive integer values … or how would you implement that?

  4. Hi,
    would be nice to see some b search + sorted vector in comparison, like in your linked blog post. Even more interesting would be some cache aware or even (if they exist) cache oblivious “kind of b search + sorted vector”. I guess binary tree “flattened” to array would behave nicely with regards to cache perf.

  5. @Daniel

    Sorry for the confusion! I should have thought about my problem a bit more 🙂

    I need a hashmap or a ‘compressed’ integer array which efficiently ‘maps’ ints to ints (or longs to longs)

  6. I found the SparseArray/LongSparseArray of the android project. Still a *full* re-allocation is necessary if the space is not sufficient but the key/values are stored very compact. Access is done via binary search, so not O(1) …
