The cost of runtime dispatch

High-performance software sometimes needs to use different functions depending on what the hardware supports. You might write several versions of the same function: some for advanced processors, others for legacy processors.

When you compile the code, the compiler does not yet know which code path will be taken. At runtime, when you start the program, the right function is chosen. This process is called runtime dispatch. Standard libraries will apply runtime dispatch without you having to do any work. However, if you write your own fancy code, you may need to apply runtime dispatching.

On Intel and AMD systems, you can do so by querying the processor and comparing its answer against the requirements of the various functions you have compiled. Under Visual Studio, you can use the __cpuid function, while GCC provides __get_cpuid.
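As a rough illustration of such a query, here is a minimal sketch that asks the processor whether it reports SSE4.2 support (bit 20 of ECX in CPUID leaf 1). The function name cpu_reports_sse42 is made up for this example, and the sketch simply returns false on non-x86 targets. A production check would typically look at more feature bits than this.

```cpp
#if defined(_MSC_VER)
#include <intrin.h>   // __cpuid (Visual Studio)
#elif defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>    // __get_cpuid (GCC, clang)
#endif

// Hypothetical helper: true if the processor reports SSE4.2 support.
// CPUID leaf 1 returns feature flags in ECX; SSE4.2 is bit 20.
bool cpu_reports_sse42() {
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
  int regs[4];                 // EAX, EBX, ECX, EDX in that order
  __cpuid(regs, 1);
  return (static_cast<unsigned int>(regs[2]) & (1u << 20)) != 0;
#elif defined(__x86_64__) || defined(__i386__)
  unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
  if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
    return false;              // leaf 1 is not available
  }
  return (ecx & (1u << 20)) != 0;
#else
  return false;                // not an Intel/AMD target: assume no support
#endif
}
```

The result would then be compared against what each compiled function requires, so that an SSE4.2 version is only selected when this returns true.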

How expensive is this step?

Of course, the answer depends on your exact system, but can we get some idea? I wrote a small C++ benchmark, and my estimate is between 100 ns and 150 ns. That is several hundred cycles.
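The benchmark itself is not reproduced here. As an assumption about how such a measurement could be set up, the sketch below times many calls to an arbitrary function and reports the average per call. The helper name average_ns is hypothetical, and the callable passed in should have a visible side effect so the compiler cannot remove it.

```cpp
#include <chrono>
#include <cstddef>

// Hypothetical helper: average wall-clock time of one call to f, in nanoseconds.
// It measures the whole loop and divides, so loop overhead is included.
template <typename F>
double average_ns(F&& f, std::size_t iterations) {
  auto start = std::chrono::steady_clock::now();
  for (std::size_t i = 0; i < iterations; i++) {
    f();
  }
  auto stop = std::chrono::steady_clock::now();
  std::chrono::duration<double, std::nano> elapsed = stop - start;
  return elapsed.count() / static_cast<double>(iterations);
}
```

You would pass a lambda that performs the processor query and stores the result in a volatile variable, with enough iterations to get a stable average.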

Though it is inexpensive, if you are repeatedly calling an inexpensive function, it may not be viable to pay this price each time. So you can simply do it once, and then point at the right function for all follow-up calls.
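One possible way to "do it once" is to route calls through a function pointer that initially targets a detection routine. A minimal single-threaded sketch follows. All names (sum_fallback, sum_fast, cpu_supports_fast_path, and so on) are invented for the example, and the hardware check is a placeholder standing in for a real processor query.

```cpp
#include <cstddef>
#include <cstdint>

// Two hypothetical implementations of the same function. In real code the
// fast one would use instructions that only newer processors support.
uint64_t sum_fallback(const uint32_t* data, std::size_t n) {
  uint64_t total = 0;
  for (std::size_t i = 0; i < n; i++) { total += data[i]; }
  return total;
}
uint64_t sum_fast(const uint32_t* data, std::size_t n) {
  return sum_fallback(data, n);  // stand-in for an optimized kernel
}

// Placeholder for the (relatively slow) processor query.
bool cpu_supports_fast_path() { return false; }

using sum_fn = uint64_t (*)(const uint32_t*, std::size_t);

uint64_t sum_detect(const uint32_t* data, std::size_t n);

// The pointer starts at the detector, so the first call pays for the query.
sum_fn sum_ptr = sum_detect;

uint64_t sum_detect(const uint32_t* data, std::size_t n) {
  sum_ptr = cpu_supports_fast_path() ? sum_fast : sum_fallback;
  return sum_ptr(data, n);       // follow-up calls skip the query entirely
}

// Public entry point: one indirect call, no hardware check.
uint64_t sum(const uint32_t* data, std::size_t n) { return sum_ptr(data, n); }
```

This version modifies a plain global pointer, so it is only appropriate when the first call cannot happen on several threads at once.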

Your only mild concern should be concurrency: what if several threads call the same function for the first time at once? In a language like C++, it is unsafe to have several threads modify the same variable without synchronization. Thankfully, it is a simple matter of requiring that the change and queries be done atomically. On Intel and AMD processors, atomic accesses are often effectively free.
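A sketch of the atomic variant, under the assumption that the selected function is stored in a std::atomic function pointer. The kernel names and cpu_supports_fast_path are hypothetical placeholders. If several threads make the first call at the same time, each may run the detection, but they all store the same value, so relaxed ordering should suffice here.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

using kernel_fn = uint32_t (*)(const uint8_t*, std::size_t);

// Hypothetical kernels: count the zero bytes in a buffer.
uint32_t count_zeros_fallback(const uint8_t* p, std::size_t n) {
  uint32_t count = 0;
  for (std::size_t i = 0; i < n; i++) { count += (p[i] == 0); }
  return count;
}
uint32_t count_zeros_fast(const uint8_t* p, std::size_t n) {
  return count_zeros_fallback(p, n);  // stand-in for an optimized kernel
}

// Placeholder for the processor query.
bool cpu_supports_fast_path() { return false; }

uint32_t count_zeros_detect(const uint8_t* p, std::size_t n);

// Atomic pointer to the active implementation; starts at the detector.
std::atomic<kernel_fn> active_kernel{count_zeros_detect};

uint32_t count_zeros_detect(const uint8_t* p, std::size_t n) {
  kernel_fn chosen =
      cpu_supports_fast_path() ? count_zeros_fast : count_zeros_fallback;
  // Concurrent first calls may all reach this store with the same value.
  active_kernel.store(chosen, std::memory_order_relaxed);
  return chosen(p, n);
}

// Public entry point: one atomic load, then an indirect call.
uint32_t count_zeros(const uint8_t* p, std::size_t n) {
  return active_kernel.load(std::memory_order_relaxed)(p, n);
}
```

On x86-64, a relaxed atomic load of a pointer generally compiles to an ordinary load, which is consistent with the remark that atomic accesses are often effectively free.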

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

9 thoughts on “The cost of runtime dispatch”

  1. FWIW, on ARM you generally can’t use the equivalent of the Intel CPUID instruction in unprivileged code. On Linux you’re supposed to use getauxval(AT_HWCAP) and/or getauxval(AT_HWCAP2) (depending on which feature(s) you want to check for), which is obviously going to be a lot slower. I have no idea what you’re supposed to do on Windows.

    I have some code in portable-snippets for both x86 and ARM (on Linux); it’s not my best work, but it is functional. If you want something a bit beefier Google’s cpu_features library is probably your best bet right now, but integrating it is a bit of a pain unless you are already using CMake (and it’s a bit of a pain if you are already using CMake because, well, you’re using CMake ;)).

    If you don’t have to worry about supporting multiple compilers, there are lots of interesting options out there. GCC has the target_clones attribute. clang has a cpu_dispatch (I think ICC does too, but I’m not certain). Unfortunately stuff like that doesn’t work if you’re using preprocessor directives to switch between different implementations, and AFAIK MSVC doesn’t have anything similar.

    I think the much more interesting, and important, question is where in the code to do the runtime dispatching. Doing it at too low of a level means you’re performing a lot of extra checks and hurting the compiler’s ability to optimize. Doing it at too high a level means you end up with a lot of bloat. In my experience, if the cost of the check is a concern you should probably move it up a bit.

    For example, one question I get about SIMDe pretty often is whether it does dynamic dispatch. It would be very convenient, but it would also be absolutely devastating for performance. I’d be interested to hear about your experience with where to put the dynamic dispatch code in simdjson and why.

    1. I think that simdjson has a different design issue with respect to runtime dispatching than SIMDe because we can easily hide away the runtime dispatching without effort. Our user-facing API has few entry points.

      1. Yes, I was referring to the CPUID calls required by runtime dispatch. If the load-time CPU dispatch afforded by the toolchain does the job, it seems like a more maintainable solution.

  2. CPUID is a serializing instruction, which makes benchmarking it rather useless and uninformative. That is, it waits until all previous instructions have fully completed, which can be a major performance issue on modern out-of-order CPUs.

  3. When using C++11 or higher, it’s sufficient to do something like

    static unsigned int cpuid = getcpuid();

    Doing this inside any function (as well as in global scope, although that can be prone to issues with static ctor order) is guaranteed to be thread-safe, so there’s no need to roll your own atomic-based construct for this.
