Suppose I give you a word and you need to determine the location of the 1-bits. For example, given the word 0b100011001, you would like to get 0,3,4,8.
You could check the value of each bit, but that would take too long. A better approach is to use the fact that modern processors have fast instructions to count the number of “trailing zeros” (on x64 processors, you have tzcnt). Given 0b100011001, this instruction would give you 0. Then if you set this first bit to zero (getting 0b100011000), the trailing-zero instruction gives you 3, and so forth. Conveniently enough, many processors can set the least significant 1-bit to zero using a single instruction (blsr); in most programming languages, C included, you can implement the desired operation as a bitwise AND: word & (word - 1).
Thus, the following loop should suffice and it is quite efficient…
while (word != 0) {
  result[i] = trailingzeroes(word);
  word = word & (word - 1);
  i++;
}
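For reference, here is a self-contained sketch of that loop, assuming trailingzeroes maps to a count-trailing-zeros builtin (GCC/Clang shown; the corresponding intrinsics work equally well):

#include <stdint.h>
#include <stddef.h>

// Sketch: write the positions of the 1-bits of word into out, return how many.
// __builtin_ctzll is undefined for a zero argument, but the loop guard
// guarantees word != 0 whenever it is called.
static inline size_t basic_decoder(uint32_t *out, uint64_t word) {
  size_t i = 0;
  while (word != 0) {
    out[i++] = (uint32_t)__builtin_ctzll(word); // position of the lowest set bit
    word = word & (word - 1);                   // clear that bit (blsr)
  }
  return i;
}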
How efficient is it exactly?
To answer this question, we first need to better define the problem. If the words you are receiving have few 1-bits (say, less than one 1-bit per 64-bit word), then you are in the sparse regime, and it becomes important, for example, to detect zero inputs quickly. If half of your bits are set, you are in the dense regime, and it is best handled using vectorization and lookup tables.
But what do you do when your input data is neither really sparse (that is, you almost never have zero inputs) nor really dense (that is, most of your bits are set to zero)? In such cases, the fact that the instructions in your loop are efficient does not help you as much as you’d like, because you have another problem: almost every word will result in at least one mispredicted branch. That is, your processor has a hard time predicting when the loop will stop. This prevents your processor from doing a good job retiring instructions.
You can try to have fewer branches at the expense of more instructions:
while (word != 0) {
  result[i] = trailingzeroes(word);
  word = word & (word - 1);
  result[i + 1] = trailingzeroes(word);
  word = word & (word - 1);
  result[i + 2] = trailingzeroes(word);
  word = word & (word - 1);
  result[i + 3] = trailingzeroes(word);
  word = word & (word - 1);
  i += 4;
}
The downside of this approach is that you need an extra step to count how many 1-bits there are in your word. Thankfully, it is a cheap operation that can be resolved with a single instruction on x64 processors (popcnt).
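To make the idea concrete, here is a hedged sketch of how the unrolled loop and the population count might fit together. It assumes x64’s _tzcnt_u64 (which returns 64 for a zero input, so the surplus iterations write harmless junk) and an output buffer with at least three slots of slack past the true count; compile with BMI1 and POPCNT enabled.

#include <stdint.h>
#include <stddef.h>
#include <x86intrin.h> // _tzcnt_u64, _mm_popcnt_u64

// Sketch of the unrolled ("fast") decoder: the popcount tells us how many
// entries are valid; the loop itself may write up to 3 extra junk entries.
static inline size_t fast_decoder(uint32_t *out, uint64_t word) {
  size_t count = (size_t)_mm_popcnt_u64(word);
  size_t i = 0;
  while (word != 0) {
    out[i + 0] = (uint32_t)_tzcnt_u64(word); word = word & (word - 1);
    out[i + 1] = (uint32_t)_tzcnt_u64(word); word = word & (word - 1);
    out[i + 2] = (uint32_t)_tzcnt_u64(word); word = word & (word - 1);
    out[i + 3] = (uint32_t)_tzcnt_u64(word); word = word & (word - 1);
    i += 4;
  }
  return count; // only the first count entries are meaningful
}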
This ‘unrolled’ approach can avoid a large fraction of the mispredicted branches, at the expense of a few fast instructions. It results in a substantial reduction in the number of CPU cycles elapsed (GNU GCC 8, Skylake processor):
| | cycles / 1-bit | instructions / 1-bit | branch misses / word |
|---|---|---|---|
| conventional | 4.7 | 8.2 | 0.68 |
| fast | 3.4 | 8.2 | 0.41 |
So we save about 1.3 cycles per 1-bit with the fast approach. Can the mispredicted branches explain this gain? There are about 6 bits set per input word, so the number of mispredicted branches per 1-bit is either 0.15 or 0.065. If you multiply these fractions by 15 cycles (on the assumption that each mispredicted branch costs 15 cycles), you get 2.25 cycles and 1 cycle, a difference of about 1.25 cycles. It does seem credible that the mispredicted branches are an important factor.
My source code is available; it runs under Linux.
We use this decoding approach in simdjson.
How close are we to the optimal scenario? Per 1-bit, we are using one instruction to count the number of trailing zeros, one instruction to zero the least significant 1-bit, one instruction to advance the pointer where we write, and one store instruction. Let us say about 5 instructions. We are getting 9.8 instructions. So we probably cannot reduce the instruction count by more than a factor of two without using a different algorithmic approach.
Still, I expect that further gains are possible; maybe you can go faster by a factor of two or so.
Further reading: Parsing Gigabytes of JSON per Second and Bits to indexes in BMI2 and AVX-512.
Credit: Joint work with Geoff Langdale. He has a blog.
You can make the code a bit faster on GCC 8 by using an alternative construction. If you introduce an auxiliary pointer “uint32_t* val = base_ptr + base” and then write each update as “*val++ = static_cast<uint32_t>(idx) + trailingzeroes(bits);”, the compiler emits slightly better machine code.
Before: 2.57 instructions per cycle, 3.797 cycles per value set
After: 2.45 instructions per cycle, 3.428 cycles per value set
10% for free, not that bad 🙂
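Concretely, the rewrite might look like this (a sketch following the benchmark’s parameter names base_ptr, base, idx, bits; __builtin_ctzll stands in for trailingzeroes):

#include <cstdint>

// Sketch of the pointer-bump rewrite: a running pointer replaces the
// indexed stores through base_ptr[base + k].
static inline void decode_block(uint32_t *base_ptr, uint32_t &base,
                                uint32_t idx, uint64_t bits) {
  uint32_t *val = base_ptr + base; // compiler can keep this in a register
  while (bits != 0) {
    *val++ = static_cast<uint32_t>(idx) +
             static_cast<uint32_t>(__builtin_ctzll(bits));
    bits = bits & (bits - 1);
  }
  base = static_cast<uint32_t>(val - base_ptr);
}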
You could do better by matching the number of bits processed to typical input data. E.g., unconditionally do 6 bits, do 4 more if there are still bits left, and then loop 2 bits at a time. This reduces branch misses as well as unnecessary work when there are no more set bits.
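One way this suggestion might look in code (a sketch on x64, reusing _tzcnt_u64 which returns 64 for a zero input; the name split_decoder is mine, and the output buffer needs a few slots of slack for the junk writes):

#include <stdint.h>
#include <stddef.h>
#include <x86intrin.h> // _tzcnt_u64, _mm_popcnt_u64

// Sketch of the 6/4/2 split: always decode 6 bits, 4 more only if any bits
// remain, then 2 at a time. The popcount gives the number of valid entries.
static size_t split_decoder(uint32_t *out, uint64_t word) {
  size_t count = (size_t)_mm_popcnt_u64(word);
  size_t i = 0;
  for (int k = 0; k < 6; k++) {
    out[i + k] = (uint32_t)_tzcnt_u64(word);
    word = word & (word - 1);
  }
  i += 6;
  if (word != 0) {
    for (int k = 0; k < 4; k++) {
      out[i + k] = (uint32_t)_tzcnt_u64(word);
      word = word & (word - 1);
    }
    i += 4;
    while (word != 0) {
      out[i] = (uint32_t)_tzcnt_u64(word);
      word = word & (word - 1);
      out[i + 1] = (uint32_t)_tzcnt_u64(word);
      word = word & (word - 1);
      i += 2;
    }
  }
  return count; // only the first count entries are meaningful
}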
However, the most obvious way to speed this up further is to delay the decoding until you actually need it. This avoids the overhead of expanding the bitmap into a much larger data structure (the maximum expansion is 64 times) and all the associated cache overheads.
Also, the latency of computing the next bit will be completely hidden, unlike in these examples where the 2-cycle latency of x = x & (x - 1) is going to dominate (processing 2 64-bit masks in parallel may avoid this latency chain, but that makes things even more complex).
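For illustration only, here is one way “processing 2 64-bit masks in parallel” might look (a sketch with made-up names; whether it actually helps depends on the microarchitecture):

#include <stdint.h>
#include <stddef.h>

// Sketch: decode two independent words at once so their word &= word - 1
// dependency chains can overlap; leftovers are finished with the plain loop.
static size_t decode_two(uint32_t *out0, uint64_t word0,
                         uint32_t *out1, uint64_t word1) {
  size_t i0 = 0, i1 = 0;
  while (word0 != 0 && word1 != 0) {
    out0[i0++] = (uint32_t)__builtin_ctzll(word0);
    word0 = word0 & (word0 - 1);
    out1[i1++] = (uint32_t)__builtin_ctzll(word1);
    word1 = word1 & (word1 - 1);
  }
  while (word0 != 0) { out0[i0++] = (uint32_t)__builtin_ctzll(word0); word0 &= word0 - 1; }
  while (word1 != 0) { out1[i1++] = (uint32_t)__builtin_ctzll(word1); word1 &= word1 - 1; }
  return i0 + i1;
}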
Note that on a modern AArch64 core the basic decoder is fastest by default since it has the minimum number of instructions per bit set. Doing extra work doesn’t offset the reduction in branch misses (which are cheap on Arm cores due to shallow pipelines). Changing faster_decoder to do 6 bits first, then 4 bits and finally loop on 1 bit is ~10% faster than fast_decoder and 20% faster than simdjson_decoder.
Btw would it be reasonable to have a #if around #include <x86intrin.h> and evts.push_back(PERF_COUNT_HW_REF_CPU_CYCLES) so the code becomes more portable? That event fails on AArch64 kernels but this causes all other events to fail too…
Thanks… I did not know that PERF_COUNT_HW_REF_CPU_CYCLES would fail on AArch64, thanks for pointing this out.
Yes, evidently, the x86intrin header needs to be guarded with appropriate checks.
Note on a modern AArch64 core the basic decoder is fastest by default since it has the minimum number of instructions per bit set.
Sorry if I took some time to get back to you on this. I verified this result initially, on an older compiler, but after reading that the latest GNU GCC has improved code generation, I tried GNU GCC 9… The results appear to be equivalent to what I get on x64, thus apparently contradicting your statement…
I posted my results as a Markdown file…
https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/blob/master/2019/05/03/RESULTS_AARCH64_AMPERE.md
Furthermore, you will find a Dockerfile to ease reproduction.
Well you found a GCC9 bug! It incorrectly adds a useless popcount after the basic_decoder loop:
mov x0, x4
.L2758:
rbit x1, x0
clz x1, x1
add w1, w3, w1
str w1, [x19, w2, uxtw 2]
sub x1, x0, #1
add w2, w2, 1
ands x0, x0, x1
bne .L2758
fmov d0, x4
cnt v0.8b, v0.8b
addv b0, v0.8b
umov w0, v0.b[0]
add w20, w20, w0
That adds extra latency and 6 instructions, slowing basic_decoder.
Interesting.
So I did additional testing.
I have updated the results in a Markdown file in my repo (with the AMPERE string in the name). To ensure reproducibility, I have posted my scripts and Dockerfiles.
It seems that LLVM 8 compiles basic_decoder to the following…
As for the basic_decoder function under GCC-9, it compiles to…
It is hard for me to understand your concern with the 2-cycle dependency. We are using, in the best case, 8.5 cycles per set bit… far more than 2 cycles.
Note that even in the best scenario, we are a full 2x the number of cycles that an Intel processor needs. That’s not good.
What is troubling is that the basic_decoder runs at 1 instruction per cycle or less. So there is contention for something and it is not strictly just data dependency…
Is this the best aarch64 can do for a population count?
https://godbolt.org/z/7KtXbC
This looks expensive… recent AMD x64 processors do the same with one instruction that can be executed 4 times per cycle.
Yes, that’s the right sequence given that there is no integer popcount instruction. So it’s important not to use popcount on AArch64 as if it were really cheap.
I get less than 30 cycles per word on basic_decoder, which is faster than the x86 core you used. It may well be that branch prediction is the main performance limiter for sparse bitsets like this. So I still believe it’s fastest to decode on demand rather than like this.
On a different processor, I presume? Because I’d be pretty happy with 30 cycles per word … but I clearly do not get close to this (all my results are posted in my repo).
Yes, definitely. It seems Cortex cores can predict this better: for example, the Cortex-A72 has fewer than 13K mispredictions across all tests, while your results show over 20K mispredicts on x86 and Ampere.
@Wilco
It took me some time, but I built up an ARM box of my own.
I confirm that the Cortex-A72 is close to Intel Skylake as far as cycles and mispredicted branches go. Sometimes Intel does better, sometimes the A72 does better. The differences are not large from what I see (that’s what you’d expect from competitive technologies).
The Cortex-A72 is definitely superior to Ampere’s Skylark on this benchmark.
It is still inferior to Intel in the end because it does not appear to be able to cram as many instructions per cycle…
So you can reproduce the slowdown of basic_decoder. Note you’re quoting the non-inline version of basic_decoder rather than the main loop where it is inlined, which is where GCC9 adds the redundant popcount.
@Wilco
I have run the test on Apple’s A12 using the current version of Xcode…
https://github.com/lemire/iosbitmapdecoding
The numbers do not make much sense: they would imply that Apple’s A12 is highly inefficient.
The opposite is true: the A12 is highly effective on a per cycle basis: https://lemire.me/blog/2019/05/15/bitset-decoding-on-apples-a12/
Reported on GCC bugzilla: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90594
Excellent. We need good code generation.
A similar problem comes up when looking for certain bytes or patterns. The best approach is to use vector comparisons and movemask, which yields bitmaps and, essentially, your problem.
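A minimal sketch of that front end, assuming SSE2 and a 16-byte chunk (the helper name matches16 is mine):

#include <stdint.h>
#include <emmintrin.h> // SSE2: _mm_cmpeq_epi8, _mm_movemask_epi8

// Compare 16 bytes against a target byte and turn the result into a bitmap:
// bit i of the return value is set iff chunk[i] == target.
static inline uint32_t matches16(const uint8_t *chunk, uint8_t target) {
  __m128i data = _mm_loadu_si128((const __m128i *)chunk);
  __m128i eq = _mm_cmpeq_epi8(data, _mm_set1_epi8((char)target));
  return (uint32_t)_mm_movemask_epi8(eq);
}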
One idea I have yet to try is to sort bitmap-words based on popcount. Processing the sorted bitmap-words can completely eliminate the mispredicted branches. You have to do more memory operations, which will partially offset the misprediction savings. But more importantly the algorithm gets rather complicated, so you are trading your own time and sanity for runtime performance. So far I couldn’t bring myself to make that trade.
I have tried it and at least for my use cases it didn’t turn out faster. The sorting ate up my savings (I used a type of radix sort which I think was basically ideal for this scenario).
However, as I recall, my baseline was quite a bit faster than the ~4 cycles shown here: I think it was closer to 1.5 cycles.
@Travis @Jörn
I did implement what you describe. It is in the code but not covered by the blog post…
https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/blob/master/2019/05/03/bitmapdecoding.cpp#L304
This can be done branchlessly with AVX2+BMI2 if you use the technique described in https://stackoverflow.com/questions/36932240/avx2-what-is-the-most-efficient-way-to-pack-left-based-on-a-mask/36951611#36951611, using an array of sequential numbers as the src to the “compress” function.
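A sketch of how that AVX2+BMI2 left-pack could be applied here, processing 8 bits of the word at a time (the helper name decode8 and the 8-bits-at-a-time framing are mine; the output buffer needs 7 slots of slack because all 8 lanes are stored):

#include <stdint.h>
#include <immintrin.h> // AVX2 + BMI2 + POPCNT

static inline uint32_t *decode8(uint32_t *out, uint32_t idx, uint8_t mask) {
  // Spread the 8 mask bits to one per byte, then widen each bit to a full byte.
  uint64_t expanded = _pdep_u64(mask, 0x0101010101010101ULL) * 0xFF;
  // Extract the byte indices 0..7 of the selected lanes, packed to the bottom.
  uint64_t wanted = _pext_u64(0x0706050403020100ULL, expanded);
  __m256i shuffle = _mm256_cvtepu8_epi32(_mm_cvtsi64_si128((long long)wanted));
  __m256i indices = _mm256_add_epi32(_mm256_set1_epi32((int32_t)idx),
                                     _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7));
  // Move the indices of the set bits to the front and store all 8 lanes;
  // only the first popcount(mask) of them are meaningful.
  _mm256_storeu_si256((__m256i *)out,
                      _mm256_permutevar8x32_epi32(indices, shuffle));
  return out + _mm_popcnt_u32(mask);
}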
With AVX-512 you can do an entire 64-bit integer at once with a single compress-store instruction: _mm512_mask_compressstoreu_epi8(dest, bitset, rangeOf0Through64). (With AVX512_VBMI2, not AVX512F.)
And now I see something like this mentioned in one of the further reading links :).
I think that I allude to the vectorization option in my post, with the important caveat:
So I think vector instructions will have trouble coping with the kind of data I am considering in this post.
This is not to say that you cannot do it “branchlessly”: you can. You can bring down branch misses to almost zero. The trick is to do that while also not increasing the instruction count too much.
How well does a switch statement work? e.g.
int count = popcnt(word);
int* resultEnd = result + count;
switch(count) {
case 64: case 63: // ...etc
    for(int i = -count + 4; i; i++) {
        resultEnd[i - 4] = trailingzeroes(word);
        word &= word - 1;
    }
case 4: // add more cases if desired
    resultEnd[-4] = trailingzeroes(word);
    word &= word - 1;
case 3:
    resultEnd[-3] = trailingzeroes(word);
    word &= word - 1;
case 2:
    resultEnd[-2] = trailingzeroes(word);
    word &= word - 1;
case 1:
    resultEnd[-1] = trailingzeroes(word);
case 0:
    ; // nothing left to do
}
This could reduce “wasted” actions. If the exact ordering doesn’t matter, you could also skip the resultEnd calculation.
Also, the initial branch may be expensive in this case if the number of bits set isn’t always the same. You could probably do some hybrid approach where you do a more coarse-grained switch if the number of bits set happens to often fall within a particular range, e.g.:
int count4 = popcnt(word) & -4;
// example for granularity = 4, change values if more appropriate
int* resultEnd = result + count4;
switch(count4) {
case 60: case 56: // ...etc
    // do some loop
case 8:
    resultEnd[-8] = trailingzeroes(word);
    word &= word - 1;
    resultEnd[-7] = trailingzeroes(word);
    word &= word - 1;
    resultEnd[-6] = trailingzeroes(word);
    word &= word - 1;
    resultEnd[-5] = trailingzeroes(word);
    word &= word - 1;
case 4:
    resultEnd[-4] = trailingzeroes(word);
    word &= word - 1;
    resultEnd[-3] = trailingzeroes(word);
    word &= word - 1;
    resultEnd[-2] = trailingzeroes(word);
    word &= word - 1;
    resultEnd[-1] = trailingzeroes(word);
    word &= word - 1;
case 0:
    resultEnd[0] = trailingzeroes(word);
    word &= word - 1;
    resultEnd[1] = trailingzeroes(word);
    word &= word - 1;
    resultEnd[2] = trailingzeroes(word);
}
Although this approach does make it much closer to what you already have.
Depending on the bits set, a fast SIMD+lookup approach may still be reasonable even if there’s a smallish number of bits typically set, e.g.
int wordPart16 = (word & 0xffff) << 4;
_mm_storeu_si128((__m128i*)result, _mm_load_si128((const __m128i*)((char*)table + wordPart16)));
result += _mm_popcnt_u32(wordPart16);
wordPart16 = (word >> 12) & 0xffff0;
_mm_storeu_si128((__m128i*)result, _mm_load_si128((const __m128i*)((char*)table + wordPart16)));
result += _mm_popcnt_u32(wordPart16);
wordPart16 = (word >> 28) & 0xffff0;
_mm_storeu_si128((__m128i*)result, _mm_load_si128((const __m128i*)((char*)table + wordPart16)));
result += _mm_popcnt_u32(wordPart16);
wordPart16 = (word >> 44) & 0xffff0;
_mm_storeu_si128((__m128i*)result, _mm_load_si128((const __m128i*)((char*)table + wordPart16)));
//result += _mm_popcnt_u32(wordPart16); // if needed
This may still be competitive if there’s only 6 bits per 64-bit word.
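For reference, one hypothetical layout for the table assumed in that sketch: 65536 entries of 16 bytes, where entry m lists the positions of the set bits of m (positions are local to the 16-bit chunk, so chunk offsets and idx would still need to be added somewhere):

#include <stdint.h>

// 1 MiB table; entry m holds the bit positions of m, one byte each, padded
// with zeros. The alignment keeps the aligned loads above legal.
static uint8_t table[65536][16] __attribute__((aligned(16)));

static void build_table(void) {
  for (uint32_t m = 0; m < 65536; m++) {
    int n = 0;
    for (int b = 0; b < 16; b++) {
      if (m & (1u << b)) table[m][n++] = (uint8_t)b;
    }
  }
}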
I have tried both of these techniques. The vectorized decoding is already part of CRoaring https://github.com/RoaringBitmap/CRoaring/blob/master/src/bitset_util.c#L553
It works and this library is in production in some large corporations.
However, I have not managed to make it competitive in the average-density scenario.
I tried the switch case approach but I could not make it run faster though I must admit that I did not do much more than just code some C and record timings.
My benchmark is rather simple: have you tried modifying it to test your ideas? Get in touch if you get promising results.
Ah, you’re using 32-bit indices; that will seriously hurt a SIMD approach.
On the flipside, if AVX512 is usable, you can use the VCOMPRESSD technique instead, which should perform very well.
I suspect the naive switch statement performs better than the naive loop approach. The performance of the jump could be problematic if there’s any unpredictable variation in the number of bits set.
The coarse grained switch works better with variation of bits set, but is more similar to the simdjson decoder. Performance ultimately depends on the density characteristics I think; simdjson is probably slightly better if <8 bits are usually set, as it can avoid a jump.
If it’s rare for more than 1 bit to be set per 4 bits, one could break the word into nibbles and use PSHUFB to obtain indices, which could perform okay. Looking at your sample data, though, this isn’t the case; in fact, there seems to be a mix of densities, though they repeat regularly.
It looks like your benchmark requires Linux perf counters, so must be run on a Linux baremetal install (doesn’t work in a Linux VM), which makes things a little tricky for me…
Well, it wasn’t too far off what I had expected (I haven’t checked these for accuracy, so these are just rough indications). Figures are cycles per word:
simdjson: 22.867
basic: 33.117
switch: 31.389
switch4: 30.199
switch8: 24.309
switch16: 23.028
vcompressd: 15.322
Regarding the switch, I’d want to have a close look at how the compiler implements it. The compiler is free to turn a switch case into branching. I am surprised by the vcompressd results.
Would you share your code? Maybe as a GitHub gist?
Yes, I confirmed that the switch is compiling to a jump, not a series of branches.
The vcompressd method is fairly straightforward:
#ifdef __AVX512F__
static inline void vcompressd(uint32_t *base_ptr, uint32_t &base,
                              uint32_t idx, uint64_t bits) {
    base_ptr += base;
    base += hamming(bits);
    __m512i indicies = _mm512_add_epi32(_mm512_set1_epi32(idx), _mm512_setr_epi32(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15));
    uint32_t* upper_ptr = base_ptr + _mm_popcnt_u32(bits);
    _mm512_mask_compressstoreu_epi32(base_ptr, bits, indicies);
    indicies = _mm512_add_epi32(indicies, _mm512_set1_epi32(16));
    // _mm_popcnt_u32 causes GCC to insert unnecessary MOVSX instructions (Clang not affected), so _mm_popcnt_u64 may be faster there
    base_ptr += _mm_popcnt_u32(bits & 0xffff);
    bits >>= 16;
    _mm512_mask_compressstoreu_epi32(base_ptr, bits, indicies);
    indicies = _mm512_add_epi32(indicies, _mm512_set1_epi32(16));
    bits >>= 16;
    _mm512_mask_compressstoreu_epi32(upper_ptr, bits, indicies);
    indicies = _mm512_add_epi32(indicies, _mm512_set1_epi32(16));
    upper_ptr += _mm_popcnt_u32(bits & 0xffff);
    bits >>= 16;
    _mm512_mask_compressstoreu_epi32(upper_ptr, bits, indicies);
    //_mm256_zeroupper(); // automatically inserted by compiler; otherwise problematic if inlined and unrolled
}
#endif
Thank you.
Just stumbled across this – I’ll try and do some experimenting with your code because it is an intriguing problem.
One small observation for targeting AArch64 would be that, given there is neither a count-trailing-zeros instruction nor an instruction to clear the lowest set bit, repeating the construction:
base_ptr[X] = static_cast<uint32_t>(idx) + trailingzeroes(bits);
bits = bits & (bits - 1);
requires at least 5 data-processing instructions:
rbit, clz, add (idx + leading_zero_count), sub (the subtraction of 1), and and (the bitwise AND).
We probably should instead reverse the bits once upfront.
Then we can explicitly count leading zeroes and clear the most significant set bit with 4 instructions:
clz, add (idx + leading_zero_count), asr (arithmetic shift right of int64_t min by leading_zero_count), bic (clear bits up to and including the most significant set bit)
corresponding to C code something like:
lz = leadingzeroes(rev_bits);
base_ptr[X] = static_cast<uint32_t>(idx) + lz;
rev_bits = rev_bits & ~(int64_t(0x8000000000000000) >> lz);
Not tested this myself yet.
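Putting the pieces together, a minimal sketch of the whole reversed loop, assuming ACLE’s __rbitll for the bit reversal and GCC/Clang’s __builtin_clzll (untested, like the construction above):

#include <stdint.h>
#include <arm_acle.h> // __rbitll (AArch64 rbit)

static inline int reversed_decoder(uint32_t *out, uint32_t idx, uint64_t bits) {
  int i = 0;
  uint64_t rev = __rbitll(bits); // one rbit up front
  while (rev != 0) {
    int lz = __builtin_clzll(rev); // equals the trailing-zero count of bits
    out[i++] = idx + (uint32_t)lz;
    // asr of int64_t min by lz, then bic: clears the most significant set bit
    rev = rev & ~(uint64_t)(INT64_MIN >> lz);
  }
  return i;
}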
Thank you.
I implemented it; the code is linked below.
It does save quite a few instructions. Whether it is faster is less clear. On at least one ARM server I have, it is actually slower despite the lower instruction count.
See my code and benchmark at https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/blob/master/2019/05/03/bitmapdecoding.cpp