In my post Really fast bitset decoding for “average” densities, I reported on our work accelerating the decoding of bitsets. E.g., given a 64-bit register, you want to find the location of every 1-bit. So given 0b110011, you would want to get 0, 1, 4, 5. We want to do this operation with many such registers. When the content of the register is hard to predict, you can be almost certain to get a mispredicted branch with every new register to be decoded. On modern processors with deep and wide pipelines, these mispredictions can become a bottleneck.
On recent x64 processors, we find that it is beneficial to decode in bulk: e.g., assume that there are at least 4 set bits and decode them without checking whether there are four of them. The code might look as follows:
while (word != 0) {
  result[i] = trailingzeroes(word);
  word = word & (word - 1);
  result[i + 1] = trailingzeroes(word);
  word = word & (word - 1);
  result[i + 2] = trailingzeroes(word);
  word = word & (word - 1);
  result[i + 3] = trailingzeroes(word);
  word = word & (word - 1);
  i += 4;
}
So we are trading branches for a few more instructions. If the branches are easy to predict, that is a bad trade, but if the branches are hard to predict, it can be beneficial.
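For comparison, the conventional decoder extracts one set bit per loop iteration, with a branch after every bit; a minimal sketch of that baseline, using the same trailingzeroes helper and result array:

while (word != 0) {
  result[i] = trailingzeroes(word);
  word = word & (word - 1); // clear the lowest set bit
  i++;
}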
We consider the scenario where the input data contains neither very many nor very few 1-bits, and where their distribution is hard to predict. With our approach, we can get it down to less than 3.5 cycles per 1-bit decoded on recent Intel processors. To achieve this kind of speed, we retire nearly 2.5 instructions per cycle.
What about ARM processors? Sadly, I have not yet been able to reliably measure the same kind of high speed. The Linux-based ARM systems I have seem to be limited to a quasi-scalar execution mode, never retiring much more than 1.5 instructions per cycle. Either these ARM processors are not powerful enough, or else I am not benchmarking properly.
An additional difficulty is that ARM processors do not have a fast 64-bit population-count instruction (to determine the number of 1-bits per register). Instead, you must use an instruction which counts the 1-bits in each byte, and then sum up the byte counts with another instruction. So while one instruction suffices on an Intel processor, at least two (or more) instructions are necessary, and the cost and total latency are higher. Similarly, ARM processors lack a “trailing zero” instruction: you have to reverse the bit order and use a “leading zero” instruction. So maybe ARM processors are just fundamentally at a disadvantage on this task compared to Intel processors. But I am not convinced that this is the case. If I look at the instruction counts, they seem to be similar between the ARM and Intel code. That is, while ARM makes you work harder to compute some operations, Intel has its own limitations. It may all average out.
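As a concrete illustration, here is one way to express both operations in C on AArch64. This is only a sketch using NEON intrinsics and a compiler builtin; the popcount64 name is mine, while trailingzeroes matches the helper used in the code above. Compilers typically lower __builtin_ctzll to an rbit/clz pair on this architecture.

#include <arm_neon.h>
#include <stdint.h>

// Population count of a 64-bit word on AArch64: there is no scalar popcount,
// so the value is moved to a vector register, counted per byte (CNT),
// and the eight byte counts are summed (ADDV).
static inline int popcount64(uint64_t x) {
  return vaddv_u8(vcnt_u8(vcreate_u8(x)));
}

// Trailing-zero count on AArch64: with no direct instruction, the compiler
// typically emits a bit reversal (RBIT) followed by count-leading-zeros (CLZ).
static inline int trailingzeroes(uint64_t x) {
  return __builtin_ctzll(x);
}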
So I’d like to be certain that I have a powerful ARM processor to give ARM a fighting chance. Thankfully I do have many powerful ARM processors… I have one in my iPhone, for example. Trouble is, I cannot instrument it or install Linux on it. I cannot easily use any compiler I’d like. Still, I can run benchmarks and record the time elapsed. All I need to do is write a little mobile application. I record the nanoseconds per set bit. It seems that the Apple A12 in my iPhone is limited to 2.5 GHz, so I multiply the result by 2.5 to get the number of cycles.
decoder | time per set bit | cycles per set bit (at 2.5 GHz)
conventional | 1.7 ns | 4.125 cycles
fast | 1.2 ns | 3 cycles
If these numbers can be trusted, then the Apple A12 might well be more efficient than an Intel Skylake processor (3 cycles vs. 3.5 cycles). Given that Apple’s A12 is reputed to have a really wide pipeline, this sounds credible.
Using Apple’s Instruments tool, I got that the fast decoding approach runs at 3.7 instructions per cycle.
My code is available. If you have a Mac and an iPhone, you should be able to reproduce my results.
Update: The latest version of my code measures the clock speed as part of the benchmark.
Would it be possible to isolate the lowest set bit?
x & (-x)
The lzcnt of a “1-hot” should be just as good as a tzcnt?
One could then xor/sub the isolated bit with the source word to clear that bit.
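In code, that suggestion might look like the following sketch (the variable names follow the code in the post; on ARM, the 63 - clz form stands in for a trailing-zero count):

while (word != 0) {
  uint64_t lowest = word & -word;             // isolate the lowest set bit
  result[i] = 63 - __builtin_clzll(lowest);   // clz of a one-hot value gives 63 - position
  word ^= lowest;                             // clear that bit (subtracting works too)
  i++;
}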
True: you do not need to reverse the bits, you can skin this cat some other way, but it is harder to do it while saving instructions.
Using tmp = (bits - tmp) & (tmp - bits); bits = bits - tmp; for finding and clearing the next set bit should be faster (2-cycle latency) than the most obvious sequence.
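Spelled out as a rolled-up loop, the recurrence might look like the following sketch; the loop framing and the termination test are an illustration rather than something taken from the comment:

// lowest holds the previously isolated bit (0 before the first iteration);
// the part of the word still to be decoded is always word - lowest.
uint64_t lowest = 0;
while (word != lowest) {
  uint64_t isolated = (word - lowest) & (lowest - word); // lowest set bit of (word - lowest)
  result[i] = trailingzeroes(isolated);
  i++;
  word -= lowest;   // commit the previous clear; runs in parallel with the next isolation
  lowest = isolated;
}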
The isolation expression (two subtractions and an and) seems to generate three instructions…
So that’s 3 in latency, no?
Update: as pointed out by Travis, this has a total of 2 cycles of latency due to parallelism… but if we have anything else in the loop that updates bits, then we get to three cycles.
The two subtractions are independent so can execute in parallel, so total latency 2.
But it is followed by bits = bits – lowest (at least in how Wilco described it) and that depends on lowest.
I guess that the idea is that the compiler should be able to merge all of this into 3 instructions in total?
I don’t think Wilco is imagining any merging (although x86 BMI does have a BLSI instruction which does the x & -x in one instruction, latency 1 on Intel, but 2 on AMD). Yes, the chain from bits as input to bits as output is 3 cycles here (assuming no merging):

lowest = (bits - lowest) & (lowest - bits);
bits = bits - lowest;
However the unrolled code in question is something like:
lowest = (bits - lowest) & (lowest - bits);
result[i] = tz(lowest);
bits2 = bits - lowest;
lowest = (bits2) & (lowest - bits);
result[i+1] = tz(lowest);
bits = bits2 - lowest;
lowest = (bits) & (lowest - bits2);
...
The dependency chain is only 2 cycles for each result. Essentially the bits = bits - lowest is both the end of one result and the first part of the next.

That works. I think you can code it like so…
It works, but at least on my Apple M1, it is not particularly fast.
Something like that. I am surprised it performs poorly. You may have to check the assembly to ensure the generated code is as you expect.
I did not write that it performed poorly.
I posted the numbers at https://github.com/simdjson/simdjson/pull/1546
It seems to exactly match the “rbit/clz for every bit” routine in terms of performance.
I have an isolated benchmark with instrumentation at…
https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/tree/master/2019/05/03
It is the same instruction count (roughly). But they all seem to max out at 4 instructions/cycle on the M1 with clang 12.
I think at this point it is generally accepted that A12 has higher IPC than Skylake on most scalar code. High end Skylake models may still have higher overall single threaded performance than any A12 because they run at a higher frequency than the fastest A12s.
Processor architects will tell you (even if you didn’t ask) that you can’t compare designs on the basis of IPC alone, because higher frequency designs may sacrifice IPC to get there, so you can’t really talk about an “A12 running at 5 GHz” because such a thing may not be possible (it would require a different design).
A typical approach is to calibrate based on a measurement with a known timing. I usually use a series of dependent adds or a series of stores. Dependent adds take 1 cycle each, so by timing a long string of those you can calculate the CPU frequency. Stores are similar: they run at one per cycle (in L1) on recent Intel CPUs, but I think the A12 can do 2 per cycle! In any case, adds are probably easier: you’ll immediately see if you get a reasonable result or not. In my experience this technique, done carefully, gives you a very accurate calibration (far better than 1%), able to even measure temperature-related drift in clock timing!
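A minimal sketch of the dependent-add calibration in C, under the assumption of a GCC or clang compiler (the empty inline assembly only keeps the optimizer from collapsing the chain; a careful measurement would repeat this and take the minimum):

#include <stdint.h>
#include <stdio.h>
#include <time.h>

// Estimate the CPU frequency by timing a long chain of dependent additions:
// each add needs the previous result, so the chain retires at about one add
// per cycle, and iterations / elapsed_seconds approximates the clock rate.
int main(void) {
  const uint64_t n = 1000 * 1000 * 1000;
  uint64_t acc = 0;
  struct timespec start, stop;
  clock_gettime(CLOCK_MONOTONIC, &start);
  for (uint64_t j = 0; j < n; j++) {
    acc += j;
    __asm__ volatile("" : "+r"(acc)); // opaque to the optimizer: keeps the adds dependent
  }
  clock_gettime(CLOCK_MONOTONIC, &stop);
  double secs = (double)(stop.tv_sec - start.tv_sec) +
                (double)(stop.tv_nsec - start.tv_nsec) * 1e-9;
  printf("~%.2f GHz (accumulator %llu)\n", (double)n / secs * 1e-9,
         (unsigned long long)acc);
  return 0;
}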
An iPad is a better target than an iPhone since it suffers less thermal throttling. Maybe put it in the freezer first :).
It would also be an interesting exercise, but my impression is that the A12 is way ahead of the pack here.
Well… I wasn’t planning on building servers out of A12 chips… yet.
Yes. I’ll do so in a later revision of this test.
I doubt that my benchmark is intensive enough for heat to become a problem, let alone to warrant a freezer. It lasts a very short time (but long enough to trigger the big cores).