If you are using a modern C++ (C++11 or better), you have access to set data structures (unordered_set) which have the characteristics of a hash set. The standard does not provide us with built-in functions to compute the union and the intersection of such sets, but we can make our own. For example, the union of two sets A and B can be computed as follow:
out.insert(A.begin(), A.end()); out.insert(B.begin(), B.end());
where out is an initially empty set. Because insertion in a set has expected constant-time performance, the computational complexity of this operation is O(size(A) + size(B)) which is optimal.
If you are a computer scientist who does not care about real-world performance, your work is done and you are happy.
But what if you want to build fast software for the real world? How fast are these C++ sets?
I decided to populate two sets with one million integers each, and compute how how many cycles it takes to compute the intersection and the union, and then I divide by 2,000,000 to get the time spent per input element.
|intersection (unordered_set)||100 cycles/element|
|union (unordered_set)||250 cycles/element|
How good or bad is this? Well, we can also take these integers and put them in sorted arrays. Then we can invoke the set_intersection and set_union methods that STL offers.
|set_intersection (std::vector)||3 cycles/element|
|set_union (std::vector)||5 cycles/element|
That’s an order of magnitude better!
So while convenient, C++’s unordered_set can also suffer from a significant performance overhead.
What about std::set which has the performance characteristics of a tree? Let us use code as follows where out is an initially empty set.
std::set_union(A.begin(), A.end(), B.begin(), B.end(), std::inserter(out,out.begin()));
|set_intersection (std::set)||150 cycles/element|
|set_union (std::set)||750 cycles/element|
As we can see, results are considerably worse.
The lesson is that a simple data structure like that of std::vector can serve us well.