Apple’s M1 processor and the full 128-bit integer product

If I multiply two 64-bit integers (having values in [0, 264)), the product requires 128 bits. Intel and AMD processors (x64) can compute the full (128-bit) product of two 64-bit integers using a single instruction (mul). ARM processors, such as those found in your mobile phone, require two instructions to achieve the same result: mul computes the least significant 64 bits while mulh computes the most significant 64 bits.

I believe that it has often meant that computing the full 128-bit product was more expensive, everything else being equal, on ARM processors than on x64 (Intel) processors. However, the instruction set does not have to determine the performance. For example, ARM processors can recognize that I am calling both instructions (mul  and mulh) and process them more efficiently. Or they may simply have very inexpensive multipliers.

To explore the problem, let us pick two pseudo-random number generators, splitmix and wyhash:

uint64_t splitmix() {
    uint64_t z = (state += UINT64_C(0x9E3779B97F4A7C15));
    z = (z ^ (z >> 30)) * UINT64_C(0xBF58476D1CE4E5B9);
    z = (z ^ (z >> 27)) * UINT64_C(0x94D049BB133111EB);
    return z ^ (z >> 31);
}
uint64_t wyhash() {
    state += 0x60bee2bee120fc15ull;
    __uint128_t tmp = (__uint128_t)(state)*0xa3b195354a39b70dull;
    uint64_t m1 = (tmp >> 64) ^ tmp;
    tmp = (__uint128_t)m1 * 0x1b03738712fad5c9ull;
    return (tmp >> 64) ^ tmp;
 }

As I reported earlier, wyhash should almost always be faster on an Intel or AMD processor as it is only two multiplications with an addition whereas the splitmix function is made of two multiplications with several other operations. However, wyhash requires two full multiplications whereas splitmix requires only two 64-bit products. If the two full multiplications in wyhash are equivalent two four multiplications, then wyhash becomes more expensive.

I wrote a small C++ benchmark to measure the time (in nanoseconds) that it takes to compute a random value using Apple’s new M1 processor (ARM, 3.2 GHz). The compiler is Apple clang version 12 which comes by default on the new Apple Silicon laptops.

Apple M1
wyhash 0.60 ns/value
splitmix 0.85 ns/value

My test suggests that it takes a bit under 3 cycles to generate a number with splitmix and a bit under 2 cycles to generate a number with wyhash. The wyhash generator is much faster than splitmix on the Apple M1 processor (by about 40% to 50%) which suggests that Apple Silicon is efficient at computing the full 128-bit product of two 64-bit integers. It would be nicer to be able to report the number of instructions and cycles, but I do not know how to instrument code under macOS for such measures.

Further reading: Apple M1 Microarchitecture Research by Dougall Johnson

Credit: Maynard Handley suggested this blog post.

Update: The numbers were updated since they were off by a factor of two due to a typographical error in the code.

Update 2: An interested reader provided me with the means to instrument the code. The precise number of cycles per value is a bit over 2.8 for splitmix and exactly 2 for wyhash. Please see my repository for the corresponding code.

Appendix. Some readers are demanding assembly outputs. The splitmix function compiles to 9 assembly instructions

LBB7_2:                                 ; =>This Inner Loop Header: Depth=1
	eor	x12, x9, x9, lsr #30
	mul	x12, x12, x10
	eor	x12, x12, x12, lsr #27
	mul	x12, x12, x11
	eor	x12, x12, x12, lsr #31
	str	x12, [x0], #8
	add	x9, x9, x8
	subs	x1, x1, #1              ; =1
	b.ne	LBB7_2

while the wyhash function compiles to 10 assembly instructions

LBB8_2:                                 ; =>This Inner Loop Header: Depth=1
	mul	x12, x9, x10
	umulh	x13, x9, x10
	eor	x12, x13, x12
	mul	x13, x12, x11
	umulh	x12, x12, x11
	eor	x12, x12, x13
	str	x12, [x0], #8
	add	x9, x9, x8
	subs	x1, x1, #1              ; =1
	b.ne	LBB8_2

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

17 thoughts on “Apple’s M1 processor and the full 128-bit integer product”

  1. Apple M1 chip is a warehouse/workshop model

    Copyright © 2018 Lin Pengcheng. All rights reserved.

    Introduction

    Computer hardware is also a factory that produces data, so it can also apply the “warehouse/workshop model”, The model uses memory as the core, not the CPU. Finally, we can achieve the grand unification of all IT fields such as hardware, software, Internet, and Internet of Things.

    Warehouse: Memory
    Workshop: CPU, graphics card, sound card, etc.
    Standardized data: data transmitted between hardware that conforms to industry standard interfaces
    Acceptance: Motherboard with standardized interfaces such as PCI, SATA, USB, etc.
    External standardized data: hard disk, flash drive, etc.

    The out-of-order execution technology of modern CPUs is a mistake (February 16, 2021)

    Out-of-order execution is a product of wrong programming methodology, wrong computer architecture and weak compiler conditions.

    In the “warehouse/workshop model”, the workshop is an orderly and high-speed ray (pipeline). The warehouse scheduling function performs dynamic planning and unified scheduling for all workshops and resources, without conflict and competition, and runs in the optimal order and efficiency.

    Follower Case

    My computer hardware architecture design was published on February 06, 2019. One or two years later, the Apple M1 chip adopted the “warehouse/workshop model” design and was released on November 11, 2020.

    Warehouse: unified memory
    Workshop: CPU, GPU and other cores
    Products (raw materials): information, data

    there’s also a new unified memory architecture that lets the CPU, GPU, and other cores exchange information between one another, and with unified memory, the CPU and GPU can access memory simultaneously rather than copying data between one area and another. Accessing the same pool of memory without the need for copying speeds up information exchange for faster overall performance. reference: Developer Delves Into Reasons Why Apple’s M1 Chip is So Fast

    From the introduction

    Apple M1 has not done global optimization of various core (workshop) scheduling.
    Apple M1 only optimizes the access to memory data (materials and products in the warehouse).
    Apple needs to further improve the programming language and compiler to support and promote my programming methodology.
    My architecture supports a wider range of workshop types than Apple M1, with greater efficiency, scalability and flexibility.

    Conclusion

    Apple M1 chip still needs a lot of optimization work, now its optimization level is still very simple, after all, it is only the first generation of works, released in stages.

    Forecast(2021-01-19): I think Intel, AMD, ARM, supercomputer, etc. will adopt the “warehouse/workshop model”

    In the past, the performance of the CPU played a decisive role in the performance of the computer. There were few CPU cores and the number and types of peripherals. Therefore, the CPU became the center of the computer hardware architecture.

    Now, with more and more CPU and GPU cores, and the number and types of peripherals, the communication, coordination, and management of cores (or components, peripherals) have become more and more important, They become a key factor in computer performance.

    The core views of management science and computer science are the same: Use all available resources to complete the goal with the highest efficiency. It is the best field of management science to accomplish production goals through communication, coordination, and management of various available resources. The most effective, reliable, and absolutely mainstream way is the “warehouse/workshop model”.

    Only changing the architecture, not changing or only expanding the CPU instruction set, not only will not affect the CPU compatibility, but also bring huge optimization space.

    So I think Intel, AMD, ARM, supercomputing, etc. will adopt the “warehouse/workshop model”, which is an inevitable trend in the development of computer hardware. My unified architecture and programming methodology will be vigorously promoted by these CPU companies, sweeping the world from the bottom up.

    Finally, the “warehouse/workshop model” will surely replace the “von Neumann architecture” and become the first architecture in the computer field, and it is the first architecture to achieve a unified software and hardware.

    link: The Grand Unified Programming Theory: The Pure Function Pipeline Data Flow with principle-based Warehouse/Workshop Model

  2. For people who are interested in the details: what suggested to me that the pair UMULH+MUL might be fused is the following patent:

    https://patents.google.com/patent/US9223577B2/

    This patent is interesting because traditional fused instructions were restricted to zero or no output register (think eg the traditional cmp+branch pair, or the ARM crypto fused pairs). The reason for this is that the traditional pipeline has a destination register allocation stage (usually called Rename) which (for various good reasons) is set up to allocate one destination register per instruction.

    What the patent describes is a very neat scheme whereby an instruction can be split at decode into multiple sub-instructions that pass through rename separately but then (and this is the novel part) they are recombined for the purposes of scheduling and execution.

    This is actually a remarkably nice idea. It allows Apple to treat the common ARM instruction Load Pair as a single unit for most purposes, even though that overwrites two instructions. And it allows various interesting instruction pairs (like UMULH and MUL) to be fused into one operation where that makes sense, even if that combined operation generates two destination registers.

    It would be interesting if anyone reading this can think of further instruction pairs that are likely fused.
    Dougall Johnson’s initial M1 explorations at

    https://dougallj.github.io/applecpu/firestorm.html

    lists the known fused instructions (some crypto, and what are essentially compare+branch).
    His list omits

    the obvious case of ADR+ADRP (done since the first ARMv8 cores)
    arithmetic followed by a branch (but without setting a flag, eg ADD+CBZ)
    the probable case (but not yet tested) of constructing large immediates via MOV+MOVK

    All those are obvious. Not obvious (but the obvious next interesting case to test for the purposes of this blog!) is the reverse of multiplication, namely division.
    The problem is suppose you want to perform both a divide and a remainder, another situation that’s generally tricky to handle optimally because once again there are two result registers.
    The ARMv8 solution is a UDIV (to generate the divid result) followed by MSUB (to generate the remainder). This is another obvious fusion candidate if you have the ability for one fused instruction to overwrite two registers.

  3. Apple M1 has two fully pipelined integer pipes that can do MUL or MULH. This means it can produce one full 64-bit x 64-bit -> 128-bit per cycle. (More information is available on Dougall Johnson site.)

    GMP lib assembly loop results prove how well it does on such code. For instance the multi-precision multiplication loop mul_1 produces 64-bit each cycle, while the fastest x86 chip needs 1.5 cycle.

    Of course the number of cycles is not the full story, but that M1 is quite good at crunching integer numbers 🙂

  4. Why dont you ever post/look at the generated assembly in such small
    benchmarks? Especially in profiler tools like Intel Vtune and Amd uProf, which would provide instruction level insight on modern CPUs and give potentially true answers to architecture questions. Eg if a bottleneck is the number of ALUs or pipeline width or dependencies , etc.

    For example I looked at Lehmers rng in godbold Clang/Gcc and they generate slightly different order of imul/mul/add.

    [imul mul add] OR [mul imul add] – where imul and add have the dependency and are in direct sequence, while imul mul add have 1 instruction between them. Still one cant really tell if that would be helpful on modern out-of-order machines. A profiler would tell.

    Are there profiler tools like that for the Apple M1?

    I think posting asm/profiles would be more enlightening (and entertaining…) than just counting the number of potentially expensive instructions. Though more work involved…

    1. Why dont you ever post/look at the generated assembly in such small
      benchmarks?

      I recommend using godbolt.org.

      Especially in profiler tools like Intel Vtune and Amd uProf, which
      would provide instruction level insight on modern CPUs and give
      potentially true answers to architecture questions. Eg if a bottleneck
      is the number of ALUs or pipeline width or dependencies , etc.

      I do not expect Intel Vtune and Amd uProf to work on Apple M1 macbooks. One can use Apple’s Instruments, however.

  5. Or simply, the M1 has enough execution units and reordering capacity to compute two or three 64-bit multiplications at once?

    Note that the way wyhash() is implemented, computing the next state is just a trivial addition from the current state, so a modern CPU would be able to overlap several consecutive calls to wyhash(). You don’t need a 128-bit multiplier to explain the timings, IMHO (which doesn’t mean the 128-bit multiplier doesn’t exist, of course :-)).

    1. Uh, by the way, this page (at the time I’m writing this) and the RSS feed show different numbers. The RSS feed says 0.30ns vs 0.45ns for wyhash vs splitmix, this page says 0.60ns vs 0.85ns. What are the right values?

      1. The blog post was updated because I was dividing by half the number of integers. This is explained in the blog post (see at the bottom).

        Both the page and the RSS feed are updated in sync.

    2. Or simply, the M1 has enough execution units and reordering capacity
      to compute two or three 64-bit multiplications at once?

      My tests are not thorough enough to enable me to conclude one way or the other about the underlying technology, so my conclusion is merely stated as “Apple Silicon is efficient at computing the full 128-bit product of two 64-bit integers”.

    3. Note that the way wyhash() is implemented, computing the next state is just a trivial addition from the current state, so a modern CPU would be able to overlap several consecutive calls to wyhash().

      Er, both functions are of this type. splitmix64 uses an additive constant of 0x9E3779B97F4A7C15, while wyhash uses an additive constant of 0x60bee2bee120fc15. Thus, there should be enough ILP in either function to saturate the processor’s functional units.

      64×64 multipliers are large and expensive, and integer multiply isn’t that common an operation. It’s hard to imagine equipping a non-DSP processor with more than one.

  6. This is not M1 specific and happened somewhere around Apple A10 SOC. Unfortunately I do not have device with A10 to test but here are results for devices form A9 to A14. And referring to our discussion it would expect the similar performance results from Cortex A75.

    A14
    splitmix 0.45 ns/value (5.65 %)
    wyrng 0.3 ns/value (13.9 %)

    A13
    splitmix 0.5 ns/value (3.76 %)
    wyrng 0.35 ns/value (11.6 %)

    A12X
    splitmix 0.55 ns/value (8.7 %)
    wyrng 0.4 ns/value (3.67 %)

    A11
    splitmix 0.6 ns/value (4.24 %)
    wyrng 0.4 ns/value (7.36 %)

    A9
    splitmix 1.2 ns/value (29.7 %)
    wyrng 1.5 ns/value (31.1 %)

Leave a Reply

Your email address will not be published. Required fields are marked *

To create code blocks or other preformatted text, indent by four spaces:

    This will be displayed in a monospaced font. The first four 
    spaces will be stripped off, but all other whitespace
    will be preserved.
    
    Markdown is turned off in code blocks:
     [This is not a link](http://example.com)

To create not a block, but an inline code span, use backticks:

Here is some inline `code`.

For more help see http://daringfireball.net/projects/markdown/syntax