Bitset decoding on Apple’s A12

In my post Really fast bitset decoding for “average” densities, I reported on our work accelerating the decoding of bitsets. E.g., given a 64-bit register, you want to find the location of every 1-bit. So given 0b110011, you would want to get 0, 1, 4, 5. We want to do this operation with many such registers. When the content of the register is hard to predict, you can be almost certain to get a mispredicted branch with every new register to be decoded. On modern processors with deep and wide pipelines, these mispredictions can become a bottleneck.
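
A conventional decoder extracts one position per loop iteration, with a hard-to-predict branch at each step; it might look like this (a sketch, assuming the same word, result, and i variables as the code below):

  while (word != 0) {
    result[i] = trailingzeroes(word); // index of the lowest set bit
    word = word & (word - 1);         // clear the lowest set bit
    i++;
  }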

On recent x64 processors, we find that it is beneficial to decode in bulk: e.g., assume that there are at least 4 set bits and decode four of them without first checking that four remain. The code might look as follows:

  while (word != 0) {
    // decode four positions per iteration, clearing the lowest set bit each time
    result[i] = trailingzeroes(word);
    word = word & (word - 1);
    result[i+1] = trailingzeroes(word);
    word = word & (word - 1);
    result[i+2] = trailingzeroes(word);
    word = word & (word - 1);
    result[i+3] = trailingzeroes(word);
    word = word & (word - 1);
    i+=4;
  }

So we are trading branches for a few more instructions. If the branches are easy to predict, that is a bad trade, but if the branches are hard to predict, it can be beneficial.

We consider the scenario where the input data contains neither very many nor very few 1-bits, and where their distribution is hard to predict. With our approach, we can get it down to less than 3.5 cycles per 1-bit decoded on recent Intel processors. To achieve this kind of speed, we retire nearly 2.5 instructions per cycle.

What about ARM processors? Sadly, I have not yet been able to reliably measure the same kind of high speed. The Linux-based ARM systems I have seem to be limited to a quasi-scalar execution mode, never retiring much more than 1.5 instructions per cycle. Either these ARM processors are not powerful enough, or else I am not benchmarking properly.

An additional difficulty is that ARM processors do not have a fast 64-bit population-count instruction (to determine the number of 1-bits per register). Instead, you must use an instruction that finds the number of 1-bits per byte, and then sum these counts with another instruction. So while one instruction suffices on an Intel processor, at least two (or more) instructions are necessary, and so the cost and total latency are higher. Similarly, ARM processors lack a “trailing zero” instruction: you have to reverse the bit order and use a “leading zero” instruction. So maybe ARM processors are just fundamentally at a disadvantage on this task compared to Intel processors. But I am not convinced that this is the case. If I look at the instruction counts, they seem to be similar between the ARM and Intel code. That is, while ARM makes you work harder to compute some operations, Intel has its own limitations. It may all average out.
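
To make this concrete, here is how these two operations might be written for 64-bit ARM using compiler builtins (a sketch; the exact instruction selection is up to the compiler):

  #include <stdint.h>

  // On 64-bit ARM, __builtin_popcountll is typically lowered to a NEON cnt
  // instruction (per-byte counts) followed by addv (horizontal sum),
  // whereas x64 has a single popcnt instruction.
  static inline int popcount64(uint64_t x) {
    return __builtin_popcountll(x);
  }

  // There is no trailing-zero-count instruction on 64-bit ARM, so
  // __builtin_ctzll is typically compiled to rbit (reverse the bit order)
  // followed by clz (count leading zeros).
  static inline int trailingzeroes(uint64_t x) {
    return __builtin_ctzll(x);
  }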

So I’d like to be certain that I have a powerful ARM processor to give ARM a fighting chance. Thankfully I do have many powerful ARM processors… I have one in my iPhone, for example. Trouble is, I cannot instrument it or install Linux on it. I cannot easily use any compiler I’d like. Still, I can run benchmarks and record the time elapsed. All I need to do is write a little mobile application. I record the nanoseconds per set bit. It seems that the Apple A12 in my iPhone is limited to 2.5 GHz, so I multiply the result by 2.5 to get the number of cycles.
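
The conversion itself is simple arithmetic; a hypothetical helper (the 2.5 GHz figure is assumed rather than measured here):

  // Convert a measured cost in nanoseconds per set bit into cycles per set
  // bit, assuming a fixed 2.5 GHz clock (2.5 cycles per nanosecond).
  static inline double cycles_per_bit(double ns_per_bit) {
    return ns_per_bit * 2.5;
  }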

conventional: 1.7 ns per set bit (4.125 cycles)
fast: 1.2 ns per set bit (3 cycles)

If these numbers can be trusted, then the Apple A12 might possibly be more efficient than an Intel Skylake processor (3.5 cycles vs. 3 cycles). Given that Apple’s A12 is reputed to have a really wide pipeline, this sounds credible.

Using Apple’s Instruments tool, I measured that the fast decoding approach runs at 3.7 instructions per cycle.

My code is available. If you have a Mac and an iPhone, you should be able to reproduce my results.

Update: The latest version of my code measures the clock speed as part of the benchmark.

Published by Daniel Lemire, a computer science professor at the University of Quebec (TELUQ).

13 thoughts on “Bitset decoding on Apple’s A12”

  1. “you have to reverse the bit order and use a “leading zero” instruction”

    Would it be possible to isolate the lowest set bit? x & (-x)
    The lzcnt of a “1-hot” should be just as good as a tzcnt?
    One could then xor/sub the isolated bit with the source word to clear that bit.
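
    One way this variant might look (a rough sketch, assuming the same word, result array, and index i as in the post):

        while (word != 0) {
          uint64_t bit = word & (0 - word);        // isolate the lowest set bit (x & -x)
          result[i++] = 63 - __builtin_clzll(bit); // position of a one-hot value via clz
          word ^= bit;                             // clear that bit
        }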

    1. Using tmp = (bits - tmp) & (tmp - bits); bits = bits - tmp; for finding and clearing the next set bit should be faster (2-cycle latency) than the most obvious sequence.

      1. This seems to generate three instructions…

            lowest = (bits - lowest) & (lowest - bits);

        So that’s a latency of 3, no?

        Update: as pointed out by Travis, this has a total of 2 cycles of latency due to parallelism… but if we have anything else in the loop that updates bits, then we get to three cycles.

              1. I don’t think Wilco is imagining any merging (although x86 BMI does have a BLSI which does the x & -x in one instruction, latency 1 on Intel, but 2 on AMD).

                Yes, the chain from bits as input, to bits as output is 3 cycles here (assuming no merging):

                lowest = (bits - lowest) & (lowest - bits);
                bits = bits - lowest;

                However the unrolled code in question is something like:

                lowest = (bits - lowest) & (lowest - bits);
                result[i] = tz(lowest);
                bits2 = bits - lowest;
                lowest = (bits2) & (lowest - bits);
                result[i+1] = tz(lowest);
                bits = bits2 - lowest;
                lowest = (bits) & (lowest - bits2);
                ...

                The dependency chain is only 2 cycles for each result. Essentially the bits = bits - lowest is both the end of one result and the first part of the next.

                1. That works. I think you can code it like so…

                    lowest = 0
                    for (...) {
                        // we make a 'copy' of lowest, but it should not be compiled as a copy
                        uint64_t tmp = lowest;
                        // the next two lines can execute at the same time
                        lowest = (lowest - bits);
                        bits = (bits - tmp);
                        // then we finish updating 'lowest', in a second cycle
                        lowest &= bits;
                        ... then use lowest to identify the bit location (with clz)
                    }
                  

                  It works but at least on my Apple M1, it is not particularly fast.

                  1. Something like that. I am surprised it performs poorly. You may have to check the assembly to ensure the generated code is as you expect.

                    1. I did not write that it performed poorly.

                      I posted the numbers at https://github.com/simdjson/simdjson/pull/1546

                      It seems to exactly match the “rbit/clz for every bit” routine in terms of performance.

                      I have an isolated benchmark with instrumentation at…
                      https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/tree/master/2019/05/03

                      It is the same instruction count (roughly). But they all seem to max out at 4 instructions/cycle on the M1/clang 12.

  2. I think at this point it is generally accepted that A12 has higher IPC than Skylake on most scalar code. High end Skylake models may still have higher overall single threaded performance than any A12 because they run at a higher frequency than the fastest A12s.

    Processor architects will tell you (even if you didn’t ask) that you can’t compare designs on the basis of IPC alone, because higher frequency designs may sacrifice IPC to get there, so you can’t really talk about an “A12 running at 5 GHz” because such a thing may not be possible (it would require a different design).

    “To be certain, I would need to be able to measure the number of cycles elapsed directly in my code.”

    A typical approach is to calibrate based on a measurement with a known timing. I usually use a series of dependent adds or a series of stores. Dependent adds take 1 cycle each, so by timing a long string of those you can calculate CPU frequency. Stores are similar: running at one per cycle (in L1) on recent Intel CPUs, but I think the A12 can do 2 per cycle! In any case, adds are probably easier: you’ll immediately see if you get a reasonable result or not. In my experience this technique done carefully gives you a very accurate calibration (i.e., far better than 1%) – able to even measure temperature related drift in clock timing!
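
    A minimal sketch of the dependent-add calibration (illustrative only; real calibration needs warm-up, pinning to a big core, and a check of the generated assembly):

        #include <stdint.h>
        #include <time.h>

        // Each loop iteration carries two serially dependent additions, so on a
        // core that executes dependent adds at one per cycle the loop takes
        // about two cycles per iteration. Frequency ~= 2*n / elapsed seconds.
        double estimate_ghz(uint64_t n) {
          volatile uint64_t seed = 12345;   // opaque to the optimizer: no closed form
          uint64_t a = seed, b = seed + 1;
          struct timespec t0, t1;
          clock_gettime(CLOCK_MONOTONIC, &t0);
          for (uint64_t j = 0; j < n; j++) {
            a += b;                         // depends on the previous a
            b += a;                         // depends on the new a
          }
          clock_gettime(CLOCK_MONOTONIC, &t1);
          volatile uint64_t sink = b;       // keep the computation live
          (void)sink;
          double secs = (double)(t1.tv_sec - t0.tv_sec)
                      + (double)(t1.tv_nsec - t0.tv_nsec) * 1e-9;
          return 2.0 * (double)n / secs / 1e9; // GHz
        }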

    An iPad is a better target than an iPhone since it suffers less thermal throttling. Maybe put it in the freezer first :).

    “Of course, I could get myself a top-notch Qualcomm or Samsung processor and install Linux on it, and I may end up doing just that.”

    It would also be an interesting exercise, but my impression is the A12 is way ahead of the pack here.

    1. “Processor architects will tell you (even if you didn’t ask) that you can’t compare designs on the basis of IPC alone, because higher frequency designs may sacrifice IPC to get there, so you can’t really talk about an ‘A12 running at 5 GHz’ because such a thing may not be possible (it would require a different design).”

      Well… I wasn’t planning on building servers out of A12 chips… yet.

      “A typical approach is to calibrate based on a measurement with a known timing.”

      Yes. I’ll do so in a later revision of this test.

      “An iPad is a better target than an iPhone since it suffers less thermal throttling. Maybe put it in the freezer first :).”

      I doubt that my benchmark is intensive enough to run into thermal trouble, let alone to the point where a freezer is warranted. It lasts a very short time (but long enough to trigger the big cores).
