Measuring the system clock frequency using loops (Intel and ARM)

In my previous post, Bitset decoding on Apple’s A12, I showed that Apple’s latest ARM-based processor can decode set bits out of a stream of words using 3 cycles per set bit. This compares favourably to Intel processors, since I could never get one of them to do better than 3.5 cycles per set bit on the same benchmark. It is also more than twice as efficient as an ARM-based server on a per-cycle basis.

In my initial blog post on the Apple A12, I complained that I was merely speculating regarding the clock speed. That is, I read on Wikipedia that the A12 could run at 2.5GHz and I assumed that this was the actual speed.

It was immediately pointed out to me by Vlad Krasnov of Cloudflare, and later by Travis Downs, that I can measure the clock speed, if only indirectly, by writing a small loop in assembly.

There are many ways to achieve the desired goal. On an Intel x64 processor, you can write a tight loop that decrements a counter by one with each iteration. At best, this can run at one subtraction per cycle, and Intel processors are good enough to achieve this rate:

; initialize 'counter' with the desired number
label:
dec counter ; decrement counter
jnz label ; goes to label if counter is not zero

In fact, Intel processors fuse the two instructions (dec and jnz) into a single micro-operation, and they have special optimizations for tight loops.

You can thus write a function that runs in almost exactly “x” cycles for any “x” large enough. Of course, a given run could take a bit longer (the processor can be interrupted), but if “x” is small enough that interruptions are unlikely, and you try enough times, the fastest run gives you a good result.
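Here is a minimal sketch of that approach (my code, assuming an x64 processor and GCC or Clang inline assembly): time the dec/jnz loop with a monotonic clock, repeat a few times, and keep the fastest run.

#include <stdint.h>
#include <stdio.h>
#include <time.h>

// At best, this loop retires one iteration per cycle.
static void decrement_loop(uint64_t counter) {
  __asm__ volatile("1: dec %0 \n jnz 1b" : "+r"(counter) : : "cc");
}

int main(void) {
  const uint64_t count = 1000000000; // about one billion cycles at best
  double best = 1e300;
  for (int trial = 0; trial < 10; trial++) {
    struct timespec start, stop;
    clock_gettime(CLOCK_MONOTONIC, &start);
    decrement_loop(count);
    clock_gettime(CLOCK_MONOTONIC, &stop);
    double secs = (stop.tv_sec - start.tv_sec) +
                  (stop.tv_nsec - start.tv_nsec) * 1e-9;
    if (secs < best) best = secs; // keep the fastest run
  }
  printf("estimated frequency: %.2f GHz\n", count / best * 1e-9);
  return 0;
}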

What about ARM processors? You can apply the same recipe, though you have to use different instructions.

; initialize 'counter' with the desired number
label:
subs counter, counter, #1 ; decrement counter
bne label ; goes to label if counter is not zero
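The same timing harness works on 64-bit ARM; only the inline assembly changes (again a sketch, assuming GCC or Clang):

// At best, one subs/bne pair per cycle: the Apple A12 achieves this,
// while some other ARM cores take two cycles per iteration (see below).
static void decrement_loop(uint64_t counter) {
  __asm__ volatile("1: subs %0, %0, #1 \n bne 1b" : "+r"(counter) : : "cc");
}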

In my initial version, I had unrolled these loops. The intuition was that unrolling would bring the loop overhead down to nearly zero. However, the code just looked ugly. Yet Travis Downs and Steve Canon insisted that the unrolled loop was the way to go. So I do it both ways: as an unrolled loop and as a tight loop.

My code is available. I also provide a script which can run the executable several times. Keep in mind that the frequency of your cores can change, and it takes some time before an idle processor ramps back up to its highest frequency.

The tight loop is pretty, but the price I pay is that I must account for the cost of the loop. Empirically, I found that the ARM Skylark as well as the ARM Cortex A72 take two cycles per iteration. The Apple A12 takes a single cycle per iteration. It appears that the other ARM processors might not be able to take a branch every cycle, so they may be unable to support a one-cycle-per-iteration loop.
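Continuing the sketch above, accounting for the loop cost is then a one-line computation (cycles_per_iteration is a value you supply: 1 on the Apple A12, 2 on the Skylark and Cortex A72, per my measurements):

// frequency in Hz: iterations times cycles per iteration, over elapsed seconds
double frequency = (double)count * cycles_per_iteration / best;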

Of course, you can measure the CPU frequency with performance counters, so my little program is probably useless on systems like Linux. However, you can also grab my updated iOS application; it will tell you about your processor speed on your iPhone. On the Apple A12, I do not need to do anything sophisticated to measure the frequency, as I consistently get 2.5 GHz with a simple tight loop after I have stressed the processor enough.

Credit: I thank Vlad Krasnov (Cloudflare), Travis Downs and Steve Canon (Apple) for their valuable inputs.

Published by Daniel Lemire, a computer science professor at the University of Quebec (TELUQ).

24 thoughts on “Measuring the system clock frequency using loops (Intel and ARM)”

  1. I don’t think the 1 vs 2 cycle throughput is related to fusion, but rather to a limit on taken branches per cycle. Probably those CPUs cannot take a branch more than once every two cycles, or have some similar limitation.

    You could test this by unrolling the loop and keeping the branch instruction in each unrolled step, but making it not-taken for all but the last (make it beq). I think you would get 1 cycle per sub/branch pair even on those CPUs once the loop no longer needs a taken branch every cycle; see the sketch at the end of this comment.

    Note that Intel CPUs fuse such pairs but also cannot do a taken branch every cycle except in special cases of very small displacement backwards branches, so fusion is no guarantee of one branch per cycle.
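    A sketch of that experiment (my code, not the commenter’s; assuming GCC or Clang inline assembly on AArch64): every unrolled step keeps its branch, but only the final bne is ever taken.

    static void unrolled_loop(uint64_t counter) {
      // 'counter' should be a multiple of 4 for this 4x unroll
      __asm__ volatile(
          "1: subs %0, %0, #1 \n"
          "   beq 2f \n" // never taken when counter is a multiple of 4
          "   subs %0, %0, #1 \n"
          "   beq 2f \n"
          "   subs %0, %0, #1 \n"
          "   beq 2f \n"
          "   subs %0, %0, #1 \n"
          "   bne 1b \n" // the only taken branch: once per four iterations
          "2:"
          : "+r"(counter) : : "cc");
    }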

  2. Another way is macros. You can generate a pretty big block of assembly while maintaining readable code. I’ve got versions with and without the bne. Sketch of the no bne approach:

    #define TEN(x) x x x x x x x x x x
    #define HUNDRED(x) TEN(TEN(x))
    #define THOUSAND(x) TEN(HUNDRED(x))

    void getFrequency(void) {
      // start timer here
      for (int i = 0; i < 1000000; i++) {
        // <prolog here>
        THOUSAND(asm volatile("add x0, x0, x0" : : : "x0");)
      }
      // stop timer here
    }

  3. Instruments provides access to a set of performance counters, including ones such as INST_A64 and FIXED_CYCLES: perhaps these could be useful?

      1. This is for a physical device (I tried this on my iPad, which has an A10 “Fusion” processor, but I don’t see why it wouldn’t work on an A12); out of the events I tried, I was able to pull information out of FIXED_INSTRUCTIONS and FIXED_CYCLES. To use this, you can connect your device to a Mac and launch Instruments, selecting the “Counters” template and your device (and profiling app). Then go to File > Recording Options and click the + in the “Events and Formulas” section, and pick the events you want to measure from there. You should then be able to record your app: in mine, for example, I set it to run a three-instruction loop a billion times and I ended up with a little bit over 3 billion instructions executed total. I’m sure it’s possible to get more accurate results, but I was having issues getting the recording to work correctly if I didn’t call UIApplicationMain, which added overhead. Maybe you can rig up something better?

        1. That’s interesting, but these two metrics are not very helpful if you are looking for the frequency, since I expect neither of them tells you much about the actual frequency. FIXED_CYCLES, I would guess, is just a measure of time elapsed. The instruction count is just… well, the instruction count.

          Now, if I had the real number of cycles, that’d be great. Combined with the instruction count, that gives me something useful. I need to look into it.

          1. FIXED_CYCLES seems to vary with time in a similar manner to FIXED_INSTRUCTIONS, for what it’s worth, and telling Instruments to plot cycles per instruction seems to give something that looks reasonable.

          2. I think FIXED_CYCLES is CPU cycles, not a fixed real-time counter.

            Here “FIXED” refers to the fact that the event can be counted by a dedicated fixed-function counter, rather than a programmable one, and not that the cycle period (measured in time) is “fixed” or anything like that.

              1. Ok. So FIXED_CYCLES varies over time and it seems to be highly correlated, visually, with the number of instructions per unit of time.

                Anyhow, FIXED_INSTRUCTIONS goes up to 29346274130 whereas FIXED_CYCLES is 9408002514, so that is 3.12 instructions per cycle. That’s for the whole program. It is much higher than on x64, where the highest you reach for part of the benchmark is 2.6 instructions per cycle.

                1. You may have found this already, but I thought I’d mention it anyway: you can create your “formulas” in the view where you add the event counters by clicking on the gear instead of the + button. You can just type in Instructions / Cycles if that’s what you were trying to measure, which lets the computer do the work for you instead of you having to do the calculation manually 🙂

                    1. Uh, I think you can copy/paste things out of Instruments and get tab-separated data. As for getting the raw data out, I’m not sure: you might be able to write a tool that links against some of the frameworks inside of the Instruments app bundle to extract these.

        2. For future reference, your instructions work, but, importantly, you have to indicate that you want to sample by time. Otherwise, if you sample by event, where event is 1,000,000 cycles, then you just record nothing (which I am sure makes sense, but is not explained anywhere).

          1. Yeah, I saw that too where sampling by event didn’t seem to produce anything. Mine defaulted to time though so I forgot to mention it.

  4. On Intel processors, there is the TSC register and instruction … which is pretty damned important. It is not clear that current ARM CPUs have an equivalent. Worth spending a bit of thought on how the TSC can be (very) useful.

    1. AArch64 has cntvct_el0 which is a fixed frequency counter (typically 50-100MHz) which is useful for accurate timing. Counters that vary with the rapid changing clock frequency are less useful to software.
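      For reference, a minimal sketch of reading these registers from user space (my code, assuming GCC or Clang on AArch64 and a kernel that enables EL0 access, as Linux does):

      #include <stdint.h>
      #include <stdio.h>

      int main(void) {
        uint64_t ticks, freq;
        __asm__ volatile("mrs %0, cntvct_el0" : "=r"(ticks)); // current count
        __asm__ volatile("mrs %0, cntfrq_el0" : "=r"(freq));  // ticks per second
        printf("counter: %llu at %llu Hz\n",
               (unsigned long long)ticks, (unsigned long long)freq);
        return 0;
      }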

        1. I can verify this. I used the TSC to collect ultra-precise timing measurements from a custom Linux device driver. The tick rate is CPU-specific (1600 MHz on the target box) and very steady. It has proved extremely useful.

      1. That depends a lot on what you’re actually measuring, and for what purpose. If you’re benchmarking an inner compute kernel for tuning with no dependencies on L2 or beyond, you want to isolate that measurement from thermal variation, frequency transients, memory/cache contention, etc. Cycles is great for that purpose.

        For tuning or comparing bigger systems, wall clock time (like these fixed-frequency counters provide) is often more meaningful (but beware the coherency of such measurements between cores if a process migrates or has multiple threads; what you can count on differs across platforms).

        1. Indeed, cycles are useful when you’re optimizing small kernels. I typically use performance counters to get more detail when trying to figure out what is limiting performance.

          But at the end of the day the goal is to reduce the total time taken by a complete application.

    2. Yeah, but the TSC (and ARM equivalents, I think) all count in “real time”, not “CPU clock cycles”; that makes them directly useful for what most people want (real-time measurement, time-stamping, etc.), but not directly useful for counting cycles. Still, they can serve as the real-time clock part of the calibration.

      I usually don’t bother because things like the std::chrono clocks and clock_gettime tend to use rdtsc under the covers so I just use the portable alternatives and get most of the rdtsc advantage.

      If you want to measure CPU clock cycles directly, you can on Intel, but it takes rdpmc and you have to program the performance counters, so it’s definitely a level of difficulty up, and less portable (e.g., on Windows I still haven’t seen a way to access the performance counters without a kernel driver).
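      As an illustration of using the TSC as the real-time side of such a calibration, a sketch (my code, assuming x64 with GCC or Clang and the __rdtsc intrinsic): it estimates the TSC tick rate, which on recent Intel processors is invariant rather than tied to the current core clock.

      #include <stdint.h>
      #include <stdio.h>
      #include <time.h>
      #include <x86intrin.h> // __rdtsc

      int main(void) {
        struct timespec start, stop;
        clock_gettime(CLOCK_MONOTONIC, &start);
        uint64_t tsc0 = __rdtsc();
        do { // busy-wait for roughly 100 milliseconds
          clock_gettime(CLOCK_MONOTONIC, &stop);
        } while ((stop.tv_sec - start.tv_sec) * 1000000000LL +
                 (stop.tv_nsec - start.tv_nsec) < 100000000LL);
        uint64_t tsc1 = __rdtsc();
        double secs = (stop.tv_sec - start.tv_sec) +
                      (stop.tv_nsec - start.tv_nsec) * 1e-9;
        printf("TSC rate: %.3f GHz\n", (tsc1 - tsc0) / secs * 1e-9);
        return 0;
      }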
