Modern processors can execute several instructions per cycle. Because processors cannot easily run faster (in terms of clock speed), vendors try to get their processors to do more work per cycle.
Apple processors are wide in the sense that they can retire many more instructions per cycle than comparable Intel or AMD processors. However, some people argue that the comparison is unfair, because ARM instructions are less powerful and do less work than x64 (Intel/AMD) instructions, so that we have performance parity.
Let us verify.
I have a number parsing benchmark that records the average number of cycles, instructions and nanoseconds spent parsing each number. The instruction and cycle counts come from performance counters, as reported by Linux. I parse a standard dataset of numbers (canada.txt) and keep the fast_float results (ASCII mode).
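The parsing core looks roughly like the following sketch (my illustration, not the exact benchmark harness; the cycle and instruction counts come from Linux performance counters wrapped around the parsing loop, and that plumbing is omitted):

```cpp
// Minimal sketch (not the exact benchmark): parse every line of canada.txt
// with the fast_float library and count how many doubles were read.
#include <fstream>
#include <iostream>
#include <string>
#include <vector>
#include "fast_float/fast_float.h"  // https://github.com/fastfloat/fast_float

int main() {
  std::ifstream in("canada.txt");  // one floating-point number per line
  std::string line;
  std::vector<double> values;
  while (std::getline(in, line)) {
    double x;
    auto answer =
        fast_float::from_chars(line.data(), line.data() + line.size(), x);
    if (answer.ec != std::errc()) continue;  // skip lines that fail to parse
    values.push_back(x);
  }
  std::cout << "parsed " << values.size() << " numbers\n";
}
```

The results are as follows: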
| system | instructions per float | cycles per float | instructions per cycle |
|---|---|---|---|
| Intel Ice Lake, GCC 11 | 302 | 64 | 4.7 |
| Apple M1, LLVM 14 | 299 | 45 | 6.6 |
Of course, that’s a single task, but number parsing is fairly generic as a computing task.
Looking at the assembly output often does not reveal a massive benefit for x64. Consider the following simple routine:
```c
// parses an integer of length 'l'
// into an int starting with value x
for (int i = 0; i < l; i++) {
  x = 10 * x + (c[i] - '0');
}
return x;
```
LLVM 16 compiles this to the following optimized ARM assembly:
```asm
start:
  ldrb  w10, [x1], #1
  subs  x8, x8, #1
  madd  w10, w0, w9, w10
  sub   w0, w10, #48
  b.ne  start
```
Or the following x64 assembly…
```asm
start:
  lea    eax, [rax + 4*rax]
  movsx  edi, byte ptr [rsi + rdx]
  lea    eax, [rdi + 2*rax]
  add    eax, -48
  inc    rdx
  cmp    rcx, rdx
  jne    start
```
Though your mileage will vary, I find that for the tasks I benchmark, I often see about as many ARM instructions retired as x64 instructions. There are differences, but they are small.
For example, in a URL parsing benchmark, I find that ARM requires 2444 instructions to parse a URL on average, against 2162 instructions for x64: a 13% benefit for x64. That’s not zero but it is not a massive benefit that overrides other concerns.
However, Apple processors definitely retire more instructions per cycle than Intel processors.
That looks like the difference between micro-ops and macro-ops: an x86 instruction can include a load or a store, and such instructions are broken up into micro-ops that are scheduled separately.
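To illustrate (my own example, not from the benchmark above): a plain summation loop compiles to an x64 add that reads directly from memory, a single instruction that the processor splits into separate load and add micro-ops, whereas aarch64 expresses the load and the add as two instructions to begin with.

```cpp
// Sketch: one x64 instruction can both load and add, while aarch64 uses two.
// The assembly in the comments is typical unvectorized code generation; the
// exact output depends on the compiler and flags.
#include <cstddef>
#include <cstdint>

uint64_t sum(const uint64_t *p, size_t n) {
  uint64_t s = 0;
  for (size_t i = 0; i < n; i++) {
    s += p[i];
    // x64 (one macro-op, two micro-ops):  add rax, qword ptr [rdi + 8*rcx]
    // aarch64 (two instructions):         ldr x9, [x8], #8
    //                                     add x0, x0, x9
  }
  return s;
}
```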
The key is that the x86 instructions that are most commonly used by compilers are relatively simple instructions and those are the instructions x86 vendors optimize their designs for.
The rarely used instructions are handled by microcode anyway.
So in the end, real x86 programs are pretty RISC anyway.
There are some cases where ARM's ability to fold barrel-shifter logic into other instructions can really shine in microbenchmarks (think highly optimised inner loops of a dozen instructions or fewer); at the same time, the availability of specific instructions such as parallel bit extract and deposit (PEXT/PDEP) on x86, and their absence on ARM (at least on Apple silicon for now), can give x86 a significant benefit. I wonder how things compare on the vector instruction sets.
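As a hedged sketch of the second point (my example): with BMI2 on x86, parallel bit extraction is a single instruction reachable through the `_pext_u64` intrinsic, while ARM (including Apple silicon) has no direct equivalent and falls back to a loop or a multiply-based trick.

```cpp
// Sketch: parallel bit extract with BMI2 (x86) versus a portable fallback.
// Compile the x86 path with -mbmi2; on ARM there is no single-instruction
// equivalent, so the fallback loop is used instead.
#include <cstdint>

#if defined(__BMI2__)
#include <immintrin.h>
uint64_t extract_bits(uint64_t value, uint64_t mask) {
  return _pext_u64(value, mask);  // one instruction: PEXT
}
#else
uint64_t extract_bits(uint64_t value, uint64_t mask) {
  uint64_t result = 0;
  for (uint64_t bit = 1; mask != 0; mask &= mask - 1, bit <<= 1) {
    // mask & -mask isolates the lowest remaining set bit of the mask;
    // copy the corresponding bit of 'value' into the next result position.
    if (value & mask & -mask) result |= bit;
  }
  return result;
}
#endif
```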
“because ARM instructions are less powerful and do less work than x64 (Intel/AMD) instructions so that we have performance parity.”
Anyone who thinks this has not looked at the ARM ISA in much detail and assumes aarch64 is classic RISC. It is not. If CISC is simply the state of not being RISC, then aarch64 is CISC. It has common instructions like load pair with autoincrement (which updates 3 GPRs; such instructions are rare even in x86), ALU operations with built-in shifts, and so on. There are some NEON instructions that have to be (sanely) implemented as a long sequence of ops.
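As an illustration of the load-pair point (my example; the exact code generation depends on the compiler and flags, and vectorization can change it): summing 16-byte chunks typically lets an aarch64 compiler emit a single post-indexed `ldp`, which loads two registers and updates the base register in one instruction.

```cpp
#include <cstddef>
#include <cstdint>

// aarch64 compilers commonly turn the two loads plus the pointer advance into
// one instruction, e.g.  ldp x9, x10, [x8], #16  (three GPRs written at once),
// whereas x64 needs separate mov/mov/add instructions for the same work.
uint64_t sum_pairs(const uint64_t *p, size_t n) {
  uint64_t s = 0;
  for (size_t i = 0; i + 1 < n; i += 2) {
    s += p[i] + p[i + 1];
  }
  return s;
}
```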
There may be a case for this sort of argument against RISC-V, which sometimes needs 3 instructions to do what one aarch64/x86 instruction does, like a load with base + scaled index + displacement. Maybe it is an issue there.