When designing an index, a database or a search engine, you frequently need to compute the union of two sorted sets. When I am not using fancy low-level instructions, I have most commonly computed the union of two sorted sets using the following approach:
v1 = first value in input 1 v2 = first value in input 2 while(....) { if(v1 < v2) { output v1 advance v1 } else if (v1 > v2) { output v2 advance v2 } else { output v1 == v2 advance v1 and v2 } }
I wrote this code while trying to minimize the load instructions: each input value is loaded exactly once (it is optimal). It is not that load instructions themselves are expensive, but they introduce some latency. It is not clear whether having fewer loads should help, but there is a chance that having more loads could harm the speed if they cannot be scheduled optimally.
One defect with this algorithm is that it requires many branches. Each mispredicted branch comes with a severe penalty on modern superscalar processors with deep pipelines. By the nature of the problem, it is difficult to avoid the mispredictions since the data might be random.
Branches are not necessarily bad. When we try to load data at an unknown address, speculating might be the right strategy: when we get it right, we have our data without any latency! Suppose that I am merging values from [0,1000] with values from [2000,3000], then the branches are perfectly predictable and they will serve us well. But too many mispredictions and we might be on the losing end. You will get a lot of mispredictions if you are trying this algorithm with random data.
Inspired by Andrey Pechkurov, I decided to revisit the problem. Can we use fewer branches?
Mispredicted branches in the above routine will tend to occur when we conditionally jump to a new address in the program. We can try to entice the compiler to favour ‘conditional move’ instructions. Such instructions change the value of a register based on some condition. They avoid the jump and they reduce the penalties due to mispredictions. Given sorted arrays, with no duplicated element, we consider the following code:
while ((pos1 < size1) & (pos2 < size2)) { v1 = input1[pos1]; v2 = input2[pos2]; output_buffer[pos++] = (v1 <= v2) ? v1 : v2; pos1 = (v1 <= v2) ? pos1 + 1 : pos1; pos2 = (v1 >= v2) ? pos2 + 1 : pos2; }
You can verify by using the assembly output that compilers are good at using conditional-move instructions with this sort of code. In particular, LLVM (clang) does what I would expect. There are still branches, but they are only related to the ‘while’ loop and they are not going to cause a significant number of mispredictions.
Of course, the processor still needs to load the right data. The address only becomes available in a definitive form just as you need to load the value. Yet we need several cycles to complete the load. It is likely to be a bottleneck, even more so in the absence of branches that can be speculated.
Our second algorithm has fewer branches, but it has more loads. Twice as many loads in fact! Modern processors can sustain more than one load per cycle, so it should not be a bottleneck if it can be scheduled well.
Testing this code in the abstract is a bit tricky. Ideally, you’d want code that stresses all code paths. In practice, if you just use random data, you will often have that the intersection between sets are small. Thus the branches are more predictable than they could be. Still, it is maybe good enough for a first benchmarking attempt.
I wrote a benchmark and ran it on the recent Apple processors as well as on an AMD Rome (Zen2) Linux box. I report the average number of nanoseconds per produced element so smaller values are better. With LLVM, there is a sizeable benefit (over 10%) on both the Apple (ARM) processor and the Zen 2 processor. However, GCC fails to produce efficient code in the branchless mode. Thus if you plan to use the branchless version, you definitively should try compiling with LLVM. If you are a clever programmer, you might find a way to get GCC to produce code like LLVM does: if you do, please share.
system | conventional union | ‘branchless’ union |
---|---|---|
Apple M1, LLVM 12 | 2.5 | 2.0 |
AMD Zen 2, GCC 10 | 3.4 | 3.7 |
AMD Zen 2, LLVM 11 | 3.4 | 3.0 |
I expect that this code retires relatively few instructions per cycle. It means that you can probably add extra functionality for free, such as bound checking, because you have cycles to spare. You should be careful not to introduce extra work that gets in the way of the critical path, however.
As usual, your results will vary depending on your compiler and processor. Importantly, I do not claim that the branchless version will always be faster, or even that it is preferable in the real world. For real-world usage, we would like to test on actual data. My C++ code is available: you can check how it works out on your system. You should be able to modify my code to run on your data.
You should expect such a branchless approach to work well when you had lots of mispredicted branches to begin with. If your data is so regular that you a union is effectively trivial, or nearly so, then a conventional approach (with branches) will work better. In my benchmark, I merge ‘random’ data, hence the good results for the branchless approach under the LLVM compiler.
Further reading: For high speed, one would like to use SIMD instructions. If it is interesting to you, please see section 4.3 (Vectorized Unions Between Arrays) in Roaring Bitmaps: Implementation of an Optimized Software Library, Software: Practice and Experience 48 (4), 2018. Unfortunately, SIMD instructions are not always readily available.
You can generate marginally better scalar ARM64 Clang code by switching the
pos2 = (v2 <= v1) ? pos2 + 1 : pos2;
to
pos2 = (v1 >= v2) ? pos2 + 1 : pos2;
which allows reuse of the initial cmp.
(Yes, it’s dumb that clang doesn’t catch this. Sigh…)
By my rough reckoning the M1’s critical path through the code is the use of the three integer units that handle flags/branches, and the above modification removes one instruction that requires these units.
(The extent to which that rough calculation represents the true gating depends on just how much time is spent in that loop, allowing all the possible instruction unrolling overlap to occur.)
The clang codegen still sucks, however. You can remove the increment of pos, via the horrible but traditional
*output_buffer++
(Store the incoming value of output_buffer at the start, and subtract it away to get the return value. Compiler should do that!)
The other clang’ism that looks suboptimal is it insists on dumb scheduling, splitting the cmp from the subsequent b.hs. No decent ARM64 core will care about the scheduling of one instruction between the cmp and the b.hs; but most will lose the opportunity to fuse the cmp and the b.hs into a single executable op.
Since this removes a second instruction from our critical path of “instructions that use the flags/branch units” it’s probably worth forcing in assembly if this were production code. (Of course then you’d go vector, I assume…)
[Also, just on style grounds, I think it’s better to have
while ((pos1 < size1) && (pos2 < size2)) {
ie replace the & with an && even though that doesn’t affect code gen.]
Thanks.
Yes. Before using assembly, I would definitively use SIMD instructions.
I use…
to indicate that both A and B can be evaluated as opposed to
where I expect only A to be evaluated if it is false.
Sort the input arrays such that input1 will always be exhausted first. This removes a comparison from the loop and deletes one of the trailing loops.
Use
pos1 += v1 <= v2; pos2 += v2 <= v1;
to force branchless execution.I wonder if
pos1 = (v1 <= v2) ? pos1 + 1 : pos1;
could be changed to
pos1 += !!(v1 <= v2);
?
You don’t need the “!!”; <= already returns a boolean (exclusively 0 or 1) result.
I do like to wrap such conditional expressions in () just to be clear, i.e.
pos1 += (v1 <= v2);
But, yes, such an expression makes clear to readers that I expect branch-free code. When the increment amount isn’t 1, then it’s more of a toss-up between:
pos += (condition) ? 42 : 0;
pos += -(condition) & 42;
The former is more readable, but the latter is closer to what I hope most processors (other than ARM) will generate.
Storing comparison result in variable makes GCC to use conditional set/move: https://godbolt.org/z/zs3WzqjqP
Well, looks like size_t variable and difference is better for GCC https://godbolt.org/z/4n4963sKj
I have added a corrected version of your function to the benchmark on github.
There is a bug in both union2by2_branchless_variable and union2by2_branchless_ptr:
!(v1 <= v2) != (v2 <= v1)
, ie merge procedure drops one of two equal elements.and union2by2_branchless too.
Ahh, I see: task were to union sorted sets ie arrays with all unique elements. I missed “unique” part of a task.
in llvm , I bench it, https://quick-bench.com/q/tNdjEe-55p6zPjDGF-jBIgDYJAQ
bm_union2by2_branchless_ptr version is not fast as bm_union2by2_branchless_variable and bm_union2by2, why? only because it’s pointer?
I do not know the answer in this particular instance, but it is not uncommon for code written directly in terms of pointers to be slower due to the overhead of doing arithmetic in address space. Compilers often do a better job when you use indexes. For speed, I recommend working with indexes by default unless you have a specific reason to prefer pointer arithmetic.
I included the pointer approach in my code because someone said it was faster.
There are two relatively simple ways to extract more instruction-level parallelism from a branchless merge routine since it’s so latency limited; I estimate the carried dependency latency as 6 cycles on Zen 3. The simplest is to merge from the front and back at the same time. The next level beyond that is to take the middle element (median) of the larger array and split the smaller array by it using one binary search. That splits the problem into two subproblems to which you can apply bidirectional merging. Splitting by the median like this is the idea behind parallel merge sort, but it also works well for extracting ILP.
Here’s a simple benchmark I just whipped up. For large arrays with millions of random uint32 elements, on my Zen 3 laptop the unidirectional merge1 function runs in 7 cycles/elem and the bidirectional merge2 function runs in 3.5 cycles/elem. The binary search median splitting is the same idea but with 4 merge frontiers instead of 2 (you can add more frontiers but register pressure becomes an issue).
https://gist.github.com/pervognsen/fa2a8cba0dc852094067348a57492ecb
And just as I say posted that code, I realized this doesn’t apply directly to sorted set union/intersection/difference (at least not without intermediate buffers or a final compaction pass) since you don’t know the cardinality of the final result ahead of time. (I wonder if that is an argument in favor of representing sorted sets as a list or tree of array chunks rather than forcing a flat array.)
If nothing else, hopefully these ideas for extracting more parallelism from branchless merging (as used merge in sort) will be helpful to people who haven’t seen it before.
Some quick thoughts having only explored this a little bit:
1) I think Veedrac has a good suggestion above to remove a comparison from the main loop. Instead of checking to see if you have reached the end of either list on each iteration, if you “peek” at the end of each list at the beginning you only have to check the list that you know will end first. That is, add a wrapper function that compares the last element of each list, and swaps input1/2 and size1/2 before starting the merge. This removes a comparison per iteration.
2) While assigning the result of the comparison to a variable does convince GCC to use a fully branchless approach, it still produces longer code than Clang. The difference (as you and others have probably already realized) is that Clang realizes that only a single comparison is necessary if it uses “sbb” (subtract with borrow). Since a comparison is internally just a subtraction, this might not save much, but it does simplify the inner loop.
3) In union2by2_branchless_variable, the use of “d” and “d2” to store the comparison results seems like a parody of bad computer science naming conventions. Given how much you value code clarity, I’m surprised by your use of such unhelpful variable names. Maybe “use_v1” and “use_v2”? I ended up with this: https://godbolt.org/z/53rhaoebc (with a wrapper function needed to possibly reorder the inputs, and a tail needed to copy the remaining elements).
Good points.
Note that the first optimization applies no matter how you compute the union.
I have actually implemented it (see GitHub).
Sadly it is an optimization that will be made useless when bound checks are present (I expect).
The galloping mode of Timsort comes to my mind. It can reduce the number of comparisons. The problem might be slightly different however (union vs union all).
There are vectorized merges, see “Efficient Implementation of Sorting on Multi-Core SIMD CPU Architecture” Intel (2010), 4.2.3 Merging Two Sorted Lists