Programmers often need to write integers as characters. Thus given the 32-bit value 1234, you might need a function that writes the characters `1234`. We can use the fact that the ASCII numeral characters are in sequence in the ASCII table: ‘0’+0 is ‘0’, ‘0’+1 is ‘1’ and so forth. With this optimization in mind, the standard integer-to-string algorithm looks as follow:

while(n >= 10) p = n / 10 r = n % 10 write '0' + r n = p write '0' + n

This algorithm writes the digits in reverse. So actual C/C++ code will write a pointer that you decrement (and not increment):

while (n >= 10) { const p = n / 10; r = n % 10; n = p; *c-- = '0' + r; } *c-- = '0' + n;

You can bound the size of the string (10 characters for 32-bit integers, 20 characters for 64-bit integers). If you have signed integers, you can detect the sign initially and make the integer value non-negative, write out the digits and finish with the sign character if needed. If you know that your strings are long, you can do better by writing out the characters two at a time using lookup tables.

How fast is this function ? It is going to take dozens of instructions and CPU cycles. But where is the bottleneck?

If you look at the main loop, and pay only attention to the critical data dependency, you divide your numerator by 10, then you check its value, and so forth. So your performance is bounded by the speed at which you can divide the numerator by 10.

The division instruction is relatively slow, but most compilers will convert it into a multiplication and a shift. It implies that the whole loop has a latency of about 5 cycles if you count three cycles for the multiplication and one cycle for the shift, with one cycle for the loop overhead. Of course, the function must also compute the remainder and write out the result, but their cost is maybe less important. It is not that these operations are themselves free: computing the remainder is more expensive than computing the quotient. However, we may get them almost for free because they are on a critical data dependency path.

How correct is this analysis? How likely is it that you are just bounded by the division by 10? The wider your processor, the more instructions it can retire per cycle, the more true you’d expect this analysis to be. Our commodity processors are already quite wide. Conventional Intel/AMD processors can retire about 4 instructions per cycle. The Apple M1 processor can retire up to 8 instructions per cycle.

To test it out, let us add a function which only writes out the most significant digit.

while (n >= 10) { n /= 10; c--; } c--; *c = '0' + char(n);

Here is the number of nanoseconds required per integer on average according to a benchmark I wrote. The benchmark is designed to measure the latency.

function | Apple M1 clang 12 | AMD Zen2 gcc 10 |
---|---|---|

fake itoa | 11.6 ns/int | 10.9 ns/int |

real itoa | 12.1 ns/int | 12.0 ns/int |

According to these numbers, my analysis seems correct on both processors. The numbers are a bit closer in the case of the Apple M1 processor, but my analysis is not sufficiently fine to ensure that this difference is significant.

Hence, at least in this instance, your best chance of speeding up this function is either by dividing by 10 faster (in latency) or else by reducing the number of iterations (by processing the data in large chunks). The latter is already found in production code.

In the comments, Travis Downs remarks that you can also try to break the chain of dependencies (e.g., by dividing the task in two).

**Further reading**: Faster Remainder by Direct Computation: Applications to Compilers and Software Libraries, Software: Practice and Experience 49 (6), 2019

It is not necessary to do a division for each step: a single multiplication by 10 suffices.

For example, you can interpret a number like 1235 as 0.1235 by setting the decimal point to the left of the highest digit (so-called “fixed point” representation). Depending on the size of the involved integers, this can be a simple reinterpretation, not requiring any code.

Then, you can extract each digit one at a time by multiplying by 10:

`0.1235 * 10 = 1.2345`

and extracting the integer part (the part to the left of the fixed point). You can further optimize this by multiplying by 5 rather than 10, since the final missing factor of two can be rolled into the shift. This is a big advantage on Intel`x86`

since * 5 can be done with a single 1-cycle latency`lea`

instruction.The approach is covered quite well here.

Now the approach you describe only

nominallyuses a division, but the compiler transforms it into a multiplication: but using the multiplication directly results in fewer instructions and opens up other opportunities for optimization like the`lea`

trick.There’s a third way: try for increased ILP within a single conversion. E.g., you could split the number into two and extra one digit from each half per iteration. This does more total work but has two latency chains of half the length (plus some overhead for the splitting and combining). There are several other tricks which decrease the latency at the cost of a bit more work.

You are entirely correct, of course, a third way is to try to break down the problem into two (or more) tasks. Thanks for the great link.

“The Apple M1 processor can retire up to 8 instructions per cycle.”

M1 can decode, map, and rename 8 instructions per cycle.

It can retire (as in clear and free) 16 History File entries per cycle (ie free 16 register per cycle)

It can (I kid you not) retire up to 56(!) instructions from the ROB per cycle.

Can we verify this with performance counters?

In any case I think your original statement was correct in a sense: the M1 can

sustainretirement of 8 instructions per cycle. Technically, on any given cycle it might retire more, but this is kind of uarch trivia in the sense thatsustainedretirement will be limited by earlier bottlenecks such as decode.Well, if it is able to retire 10, 20, 30 instructions per cycle even just some of the time, then it suggests that we should be able to see higher-than-8 retired instructions per cycle in some cases using performance counters… does it not?

Yes, you can measure it: see here for example.

I just mean that as far as I can tell when you said “The Apple M1 processor can retire up to 8 instructions per cycle.” you are just using retire as a standin for “process” or said in another way: “The M1 is 8 wide” or “The M1 has a maximum IPC of 8”.

As Maynard points out, the statement isn’t strictly true when you refer to “retire” specifically, because it happens that this process has a larger retirement bandwidth than its width (unlike many x86 processors where retire was the limiting or tied-for-limiting factor). So the spirit of what you said is correct: it can handle up to 8 instructions per cycle, and even though some parts of the pipeline may momentarily exceed that, they will have be preceded by periods of < 8 throughput so the average never exceeds 8.

To add to Travis’ points:

(a) The big “strategic” point is that one of the design principles for creating a truly fast CPU is that you want to split the machine into as many independent parts as possible, each connected by queues, and things are working well if those queue are as dynamic as possible — ideally they should

constantlybe alternating between almost full (just absorbed a large block of activity) and almost empty (ready to absorb the next block).Or to put it differently, designing for mean behavior (mean rate of integer instructions, mean rate of loads, etc) is table stakes; to do substantially better means designing beyond the mean to the standard deviation. Many, dynamic, queues are part of that design principle.

(b) So while the M1 can only sustain 8 instructions per cycle, you want queues that are substantially more ambitious than this.

You want your Fetch queue (sitting between the L1I$ and Decode) to be asynchronous, and able to pull in substantially more than 8 instructions per cycle, to provide a buffer against bad luck (either a jump to the end of a cache line from which you can only pull a few instructions; or a jump that misses L1I completely).

You provide buffers between Rename and (int/fp/loadstore) Scheduling that can accept 8 instructions, even though int scheduling is only 6-wide, and fp and loadstore scheduling are each 4-wide — so that even in the case of a long burst of int/fp/loadstore you can clear eight instructions out of Decode per cycle and move onto the next eight — your 8-wide front-end is not throttled by the 6 or 4-wide capacity of your execution core, at least not for shortish (~50 or so) burst lengths.

And you try to clear out the ROB machinery as rapidly as possible after a blocking head -of-ROB clears.

Naively you might consider this unimportant – – unlike the previous cases, there is apparently no compelling reason for the back-end to have to clear faster than it can be fed by the front end.

The point that is missed by this naive analysis is heterogeneity of resources, and the fact that resources are cleared in order after a blocking head of ROB clears.

So, suppose the machine has stalled in Rename because it can no longer allocate a physical int register. When the head of ROB clears, registers will start to be freed — but in-order. And if all the registers cleared in the first cycle are FP registers, that doesn’t help the stall in Decode, which doesn’t make progress until an int register is freed. That’s why we want registers freed at faster than they can be consumed, to deal with this heterogeneity.

The same holds for the ROB proper. Recall that the M1 ROB consists of ~330 rows, each of which can hold up to seven instructions of which one (and only one) can be a failable instruction (ie an instruction that can misspeculate — a load, store, or branch).

The throughput constraint is, if you want to be able to process 7 instructions per cycle of exactly the right mix (4 load/store + 3 branches) you have to be able to clear 7 failables from the ROB per cycle.

In fact Apple clear 8 rows per cycle, of which failables are the important part (you have to test that each one does not in fact result in a flush); the rest are just unimportant detritus at this point.

Given that the History File (for tracking register dependencies and freeing registers) is a separate structure that operates on its own schedule (of 16 free’s per cycle), it’s in fact fairly easy to free up lots of the ROB in a single cycle!

(Experts might ask about that three branches claim above — M1 only has two branch units! True — but simple branches [non conditional, don’t use the link register] “execute” at Decode, not in a branch unit…)