Fast sets of integers

Maintaining a set of integers is a common problem in programming. It can also be implemented in many different ways.

Maybe the most common implementation uses hashing (henceforth hashset): it provides optimal expected-time complexity. That is, we expect that adding or removing an integer takes constant time (O(1)), and that iterating through the elements takes time proportional to the cardinality of the set. For this purpose, Java provides the HashSet class.

An alternative implementation is the bitset: essentially, it is an array of boolean values. It ensures that adding and removing integers takes a constant time. However, iterating through the elements could be less than optimal: if your universe contains N integers, it may take a time proportional to N to enumerate the integers irrespective of the number of integers in the set.
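To make the trade-off concrete, here is a minimal bitset sketch. The class name SimpleBitSet and its methods are hypothetical and chosen for illustration; this is not the StaticBitSet used in the tests, and it does no bounds checking.

```java
// Minimal bitset sketch (hypothetical class, not the author's StaticBitSet).
final class SimpleBitSet {
    // One bit per integer in [0, universeSize), packed 64 per long.
    private final long[] words;

    SimpleBitSet(int universeSize) {
        words = new long[(universeSize + 63) >>> 6];
    }

    // Java uses only the low 6 bits of the shift count for long shifts,
    // so (1L << i) selects bit (i mod 64) inside the word.
    void add(int i) { words[i >>> 6] |= 1L << i; }

    void remove(int i) { words[i >>> 6] &= ~(1L << i); }

    boolean contains(int i) { return (words[i >>> 6] & (1L << i)) != 0; }

    // Enumeration visits every word, so its cost grows with the universe
    // size even when few bits are set.
    int[] toArray() {
        int count = 0;
        for (long w : words) count += Long.bitCount(w);
        int[] out = new int[count];
        int pos = 0;
        for (int k = 0; k < words.length; k++) {
            long w = words[k];
            while (w != 0) {
                out[pos++] = (k << 6) + Long.numberOfTrailingZeros(w);
                w &= w - 1; // clear the lowest set bit
            }
        }
        return out;
    }
}
```

Adding and removing touch a single word, whereas toArray scans the whole array: that scan is the cost that grows with N.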

So, suppose you expect to have 1000 integers in the range from 0 to N. Which data structure is best? The hashset or the bitset? Clearly, if N is sufficiently large, the hashset will be best. But how large must N be?

I decided to implement a quick test to determine the answer. Instead of using the standard Java BitSet, I wrote my own bitset (henceforth StaticBitSet), which is faster in my tests. For the hashset, I compared the standard HashSet and TIntHashSet and found little difference in performance in my tests, so I report only the results with the standard HashSet (from OpenJDK 7).
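The shape of such a test might look like the sketch below. It is not the actual benchmark: it uses java.util.BitSet in place of StaticBitSet, and the class and method names are hypothetical. Each run adds the values, iterates through the set, removes the values, and returns a checksum so that the work cannot be optimized away.

```java
import java.util.BitSet;
import java.util.HashSet;
import java.util.Random;

// Hypothetical benchmark skeleton; not the author's test code.
final class SetBenchSketch {
    // n pseudo-random values in [0, universe); duplicates are possible.
    static int[] randomValues(int n, int universe, long seed) {
        Random r = new Random(seed);
        int[] v = new int[n];
        for (int i = 0; i < n; i++) v[i] = r.nextInt(universe);
        return v;
    }

    // Add, iterate, remove with a HashSet; returns a checksum.
    static long runHashSet(int[] values) {
        HashSet<Integer> set = new HashSet<>();
        for (int v : values) set.add(v);
        long sum = 0;
        for (int v : set) sum += v;
        for (int v : values) set.remove(v);
        return sum + set.size();
    }

    // Same workload with java.util.BitSet (standing in for StaticBitSet).
    static long runBitSet(int[] values, int universe) {
        BitSet set = new BitSet(universe);
        for (int v : values) set.set(v);
        long sum = 0;
        for (int i = set.nextSetBit(0); i >= 0; i = set.nextSetBit(i + 1)) sum += i;
        for (int v : values) set.clear(v);
        return sum + set.cardinality();
    }

    // Converts an element count and an elapsed time to millions of elements per second.
    static double millionsPerSecond(int elements, long nanos) {
        return elements * 1e3 / nanos;
    }
}
```

To time a run, one could wrap a call between two System.nanoTime() readings and pass the difference to millionsPerSecond. A careful measurement would also need warm-up iterations and many repetitions, which this sketch leaves out.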

The following table reports the speed in millions of elements per second for adding, removing and iterating through 1000 elements in the range from 0 to N.

N            bitset   hashset
100,000      77       18
1,000,000    45       19
10,000,000   11       18

These numbers are consistent with the theory. The speed of the hashset is relatively independent of N, whereas the performance of the bitset degrades as N increases. What might be surprising, however, is how large N needs to be before the bitset is beaten. The bitset only starts failing you (in this particular test) when the ratio of the size of the universe to the size of the set exceeds 1,000.

The bitset data structure is more generally applicable than you might think.

Source: My Java source code is available, as usual.

Further reading: In Sorting is fast and useful, I showed that binary search over a sorted array of integers could be a competitive way to test whether a value belongs to a set.
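As a reminder of what that approach looks like, here is a minimal sketch of membership testing by binary search. The class SortedIntSet is a hypothetical name for illustration; it relies only on Arrays.sort and Arrays.binarySearch from the standard library.

```java
import java.util.Arrays;

// Hypothetical immutable set backed by a sorted array.
final class SortedIntSet {
    private final int[] sorted;

    SortedIntSet(int[] values) {
        sorted = values.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
    }

    // Arrays.binarySearch returns a non-negative index when the key is found.
    boolean contains(int x) {
        return Arrays.binarySearch(sorted, x) >= 0;
    }
}
```

Lookups take logarithmic time and the storage is compact, but adding or removing a value would require shifting or rebuilding the array.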

19 thoughts on “Fast sets of integers”

  1. Daniel, this is entirely consistent with my experience. A well-implemented bit set is a nearly optimal data structure for modern CPU architectures. It is brute force but in a manner CPUs are highly optimized for. If you are saturating all the ALUs then the number of entry evaluations per clock cycle is very high. Hash sets have to overcome their cache line overhead and relatively low number of evaluations per clock cycle.

    As you undoubtedly know, when bit sets become large/sparse enough to become inefficient, a very good alternative is compressed bit sets.

  2. Of course. I’m just wondering since I have some extremely performance-sensitive code which uses java.util.BitSet — it’ll be nice to benchmark it there too, given those numbers….

  3. nice results :)!

    this seems to be a bit offtopic but 🙂

    does somebody know of a sparse hashset implementation in Java which is more memory efficient than the THashSet?

    I need some memory efficient datastructure for the case that there are areas of consecutive integer values … or how would you implement that?

  4. Hi,
    would be nice to see some b search + sorted vector in comparison, like in your linked blog post. Even more interesting would be some cache aware or even (if they exist) cache oblivious “kind of b search + sorted vector”. I guess binary tree “flattened” to array would behave nicely with regards to cache perf.

  5. @Daniel

    Sorry for the confusion! I should have thought about my problem a bit more 🙂

    I need a hashmap or a ‘compressed’ integer array which efficiently ‘maps’ ints to ints (or longs to longs)

  6. I found the SparseArray/LongSparseArray of the android project. Still a *full* re-allocation is necessary if the space is not sufficient but the key/values are stored very compact. Access is done via binary search, so not O(1) …

  7. Here’s another approach: randomize the integers using a pseudorandom permutation (aka block cipher), and divide them into buckets indexed by the MSB of their randomized value. (The bucket size should be chosen to fit into a cache line in expectation.) Then to look up an integer, encode it and scan its bucket (linear search should suffice if the bucket fits in a cache line). To intersect two of these sets, intersect each corresponding bucket (sets of different sizes will have different bucket counts, so this requires intersecting each bucket of the smaller set with all buckets in the larger set of which that bucket’s index is a prefix). Again, this is just a linear scan on cache-line-sized buckets (if encoded values are stored in order within a bucket then you can pick up the scan where you left off from the last value you looked up). Note that no decoding is required for intersections if both sets share the same permutation (and you don’t want to enumerate the result). You can enumerate the whole set simply by scanning and decoding each bucket. I’ve been able to get around 60% compression this way with a few million 32-bit integers, and around 78% with a more complex version of the structure (using Elias-Fano coding in the buckets). For the permutation, any 2-round balanced Feistel network with a fast round function works fine; a couple of rounds of RC5 (too weak for crypto!) seems fast enough in practice.

  8. Heh, I forgot to mention the detail that achieves compression in this scheme: each bucket only stores the suffixes of the encoded value (the prefix is implicitly stored in the bucket index). I don’t have Knuth handy but I believe he refers to this technique as “quotienting”. Of course since buckets are variable-sized, an index storing bucket counts or offsets is necessary, but that requires very little space relative to the compression gain. (The distribution of bucket sizes is well described by the Poisson approximation to the binomial distribution.)
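The quotienting idea described in the two comments above can be sketched as follows, under simplifying assumptions that are mine rather than the commenter's: the pseudorandom permutation is skipped, the bucket index is fixed at the top 16 bits of the value, and each bucket stores the low 16 bits unsorted. The class name QuotientedIntSet is hypothetical.

```java
import java.util.Arrays;

// Hypothetical sketch of quotienting: the bucket index holds the prefix,
// so each bucket only needs to store 16-bit suffixes.
final class QuotientedIntSet {
    private static final int SUFFIX_BITS = 16; // assumption: fixed 16/16 split

    private final int[] offsets;    // bucket b occupies suffixes[offsets[b] .. offsets[b+1])
    private final short[] suffixes; // low 16 bits of each stored value

    QuotientedIntSet(int[] values) {
        int buckets = 1 << (32 - SUFFIX_BITS);
        offsets = new int[buckets + 1];
        // Count the values per bucket, then turn counts into start offsets.
        for (int v : values) offsets[(v >>> SUFFIX_BITS) + 1]++;
        for (int b = 0; b < buckets; b++) offsets[b + 1] += offsets[b];
        suffixes = new short[values.length];
        int[] next = Arrays.copyOf(offsets, buckets); // write cursor per bucket
        for (int v : values) suffixes[next[v >>> SUFFIX_BITS]++] = (short) v;
    }

    // Linear scan of a single bucket, comparing suffixes only.
    boolean contains(int x) {
        int b = x >>> SUFFIX_BITS;
        short s = (short) x;
        for (int i = offsets[b]; i < offsets[b + 1]; i++) {
            if (suffixes[i] == s) return true;
        }
        return false;
    }
}
```

Each value costs 2 bytes instead of 4, plus a fixed offsets array of 65,537 ints, so the saving only shows for sets that are large relative to that index. Without the permutation step, clustered inputs would produce very uneven buckets, which is the problem the randomization in the comment is meant to avoid.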
