Careful with Pair-of-Registers instructions on Apple Silicon

Egor Bogatov is an engineer working on C# compiler technology at Microsoft. He made an intriguing remark about a performance regression on Apple hardware following what appears to be an optimization. The .NET 9.0 runtime introduced an optimization where two loads (ldr) can be combined into a single load (ldp). It is a typical peephole optimization. Yet it made things much slower in some cases.

Under ARM, the ldr instruction loads a single value from memory into a single register. Its assembly syntax is straightforward: ldr Rd, [Rn, #offset]. The ldp instruction (Load Pair of Registers) loads two consecutive values from memory into two registers with one instruction. Its assembly syntax is similar but there are two destination registers: ldp Rd1, Rd2, [Rn, #offset]. The ldp instruction loads two 32-bit words or two 64-bit words from memory, and writes them to two registers.
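To make the difference concrete, here is a minimal portable sketch in C that models what the two forms compute. It is not the instructions themselves: the type pair32 and the helpers load_two_singles and load_pair are hypothetical names introduced only for illustration, and memcpy stands in for the actual memory accesses.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical type for illustration: the two destination registers. */
typedef struct { uint32_t first, second; } pair32;

/* Models:  ldr w0, [base, #offset]
 *          ldr w1, [base, #offset + 4]
 * Two separate 4-byte reads, one per destination register. */
static pair32 load_two_singles(const unsigned char *base, size_t offset) {
    pair32 p;
    assert(base != NULL);
    memcpy(&p.first, base + offset, sizeof p.first);
    memcpy(&p.second, base + offset + 4, sizeof p.second);
    return p;
}

/* Models:  ldp w0, w1, [base, #offset]
 * One 8-byte read covering two consecutive 32-bit words. */
static pair32 load_pair(const unsigned char *base, size_t offset) {
    uint32_t words[2];
    assert(base != NULL);
    memcpy(words, base + offset, sizeof words);
    pair32 p = { words[0], words[1] };
    return p;
}
```

Both helpers return the same values for the same base and offset. The only difference in this model is whether memory is read as two 4-byte accesses or as one 8-byte access, which is the difference that matters in the rest of this post.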

Given a choice, it seems that you should prefer the ldp instruction. After all, it is a single instruction. But there is a catch on Apple silicon: if you are loading data from memory that was just written to, there might be a significant penalty for using ldp.

To illustrate, let us consider the case where we write and load two values repeatedly using two loads and two stores:

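// ptr points to at least two consecutive 32-bit integers
for (int i = 0; i < 1000000000; i++) {
  int tmp1, tmp2;
  // two single loads, then two single stores to the same addresses
  __asm__ volatile("ldr %w0, [%2]\n"
                   "ldr %w1, [%2, #4]\n"
                   "str %w0, [%2]\n"
                   "str %w1, [%2, #4]\n"
    : "=&r"(tmp1), "=&r"(tmp2) : "r"(ptr) : "memory");
}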

Next, let us consider an optimized approach where we combine the two loads into a single one:

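// ptr points to at least two consecutive 32-bit integers
for (int i = 0; i < 1000000000; i++) {
  int tmp1, tmp2;
  // one pair load, then two single stores to the same addresses
  __asm__ volatile("ldp %w0, %w1, [%2]\n"
                   "str %w0, [%2]\n"
                   "str %w1, [%2, #4]\n"
    : "=&r"(tmp1), "=&r"(tmp2) : "r"(ptr) : "memory");
}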

It would be surprising if this new version were slower, but it can be. The code for the benchmark is available. I benchmarked both on AWS using Amazon’s Graviton 3 processors, and on Apple M2. Your results will vary.

| function          | Graviton 3  | Apple M2     |
|-------------------|-------------|--------------|
| 2 loads, 2 stores | 2.2 ms/loop | 0.68 ms/loop |
| 1 load, 2 stores  | 1.6 ms/loop | 1.6 ms/loop  |
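If you want to try a comparison like this on your own machine, the following is a rough sketch of a timing harness. It is not the benchmark code referred to above. The names two_loads_two_stores, one_pair_load_two_stores and ns_per_iteration are hypothetical, the timer is the standard clock() function (CPU time), and the plain C fallback used on non-ARM targets cannot distinguish the two variants.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* One iteration of the "2 loads, 2 stores" pattern (hypothetical helper). */
static void two_loads_two_stores(int32_t *ptr) {
#if defined(__aarch64__)
    int tmp1, tmp2;
    __asm__ volatile("ldr %w0, [%2]\n"
                     "ldr %w1, [%2, #4]\n"
                     "str %w0, [%2]\n"
                     "str %w1, [%2, #4]\n"
                     : "=&r"(tmp1), "=&r"(tmp2) : "r"(ptr) : "memory");
#else
    /* Portable fallback so the sketch compiles elsewhere; not a real test. */
    volatile int32_t *p = ptr;
    int32_t tmp1 = p[0], tmp2 = p[1];
    p[0] = tmp1;
    p[1] = tmp2;
#endif
}

/* One iteration of the "1 load, 2 stores" pattern (hypothetical helper). */
static void one_pair_load_two_stores(int32_t *ptr) {
#if defined(__aarch64__)
    int tmp1, tmp2;
    __asm__ volatile("ldp %w0, %w1, [%2]\n"
                     "str %w0, [%2]\n"
                     "str %w1, [%2, #4]\n"
                     : "=&r"(tmp1), "=&r"(tmp2) : "r"(ptr) : "memory");
#else
    /* Same fallback as above: plain C has no notion of a pair load. */
    volatile int32_t *p = ptr;
    int32_t tmp1 = p[0], tmp2 = p[1];
    p[0] = tmp1;
    p[1] = tmp2;
#endif
}

/* Runs the kernel `iterations` times and returns the average CPU time
 * per iteration in nanoseconds, as measured by clock(). */
static double ns_per_iteration(void (*kernel)(int32_t *), int32_t *ptr,
                               long iterations) {
    assert(iterations > 0);
    clock_t start = clock();
    for (long i = 0; i < iterations; i++) {
        kernel(ptr);
    }
    clock_t end = clock();
    return 1e9 * (double)(end - start) / CLOCKS_PER_SEC / (double)iterations;
}
```

Calling ns_per_iteration(two_loads_two_stores, data, n) and ns_per_iteration(one_pair_load_two_stores, data, n) on the same two-element array gives two numbers to compare. If the compiler does not inline the call through the function pointer, both measurements carry the same extra call overhead, so the difference between them is more meaningful than either absolute value.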

I have no particular insight as to why this might be, but my guess is that Apple Silicon has a store-to-load forwarding optimization that does not work with pair-of-registers loads and stores.

There is an Apple Silicon CPU Optimization Guide which might provide better insight.

Daniel Lemire, "Careful with Pair-of-Registers instructions on Apple Silicon," in Daniel Lemire's blog, April 29, 2024.

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

3 thoughts on “Careful with Pair-of-Registers instructions on Apple Silicon”

  1. I remember dealing with a memory hazard like this fifteen years ago when using NEON SIMD instructions on an iPhone.

  2. Apple Silicon CPU Optimization Guide:

    > • Paired load operations: These common cracked instructions have two destination registers and are cracked before renaming. However, unless the operands are Q-sized, the processor will re-fuse them back into a single μop before sending them to the Load and Store Execution Units. Use these instructions wherever possible.

  3. It might be interesting to check what happens with 64-bit registers.

    Writes to 32-bit registers require zero extension as per the architecture so this might create some complications depending on how the micro-architecture works.

    Note I did not check any available documentation about Apple Silicon so I might be completely off.
