It is more complicated than I thought: -mtune, -march in GCC

My favourite C compilers are GNU GCC and LLVM’s Clang. In C, you compile for some architecture. Thus you have to tell the compiler what kind of machine you have.

In theory, you could recompile all the code for the exact machine you have, but that is slow and error-prone. So we typically rely on prebuilt binaries, and these are often not tuned specifically for our hardware. It is possible for a program to detect the hardware it is running on and adapt automatically (e.g., via runtime dispatch), but that is not often done in C.

So GCC and Clang have flags that allow you to tell them what kind of hardware you have. They are “-march” and “-mtune”. I thought I understood them, until now.

Let me run through the basics of “-march” and “-mtune” that most experienced programmers will know about.

  • The first flag (“-march”) tells the compiler about the minimal hardware your code should run on. That is, if you write “-march=haswell”, then your code should run on machines that have haswell-type processors and anything better or compatible (anything that has the same instruction sets). It may not run on other machines. The GCC documentation is clear:

    -march=cpu-type allows GCC to generate code that may not run at all on processors other than the one indicated.

  • The other flag (“-mtune”) is just an optimization hint, e.g., if you write “-mtune=haswell”, you tell the compiler to generate code that runs best on “haswell”-type processors. The GCC documentation is clear enough:

    While picking a specific cpu-type schedules things appropriately for that particular chip, the compiler does not generate any code that cannot run on the default machine type unless you use a -march=cpu-type option. For example, if GCC is configured for i686-pc-linux-gnu then -mtune=pentium4 generates code that is tuned for Pentium 4 but still runs on i686 machines.

    By default, when unspecified, “-mtune=generic” applies which means that the compiler will “produce code optimized for the most common processors”. This is somewhat ambiguous and will strictly depend on the compiler version you are using, as new processors being released might change this tuning.

Thankfully, your compiler can automatically detect your processor; it calls this automatically detected processor “native”. So I have been compiling my code with “-march=native” because I want the compiler to do the best it can do on the machine I am using. I assumed, until now, that if my processor is detected as having architecture X, doing “-march=native” implied “-march=X -mtune=X”. And that could almost be inferred from the documentation:

Specifying -march=cpu-type implies -mtune=cpu-type.

This has led me to believe that “-march” trumps “-mtune”, meaning that if you set “-march=native”, then the “-mtune” is effectively irrelevant.

I was wrong.

Let us check using funny command lines. I use a skylake processor with GNU GCC 5.5. It is important to note that this compiler predates skylake processors.

  1. I can type gcc -march=native -Q --help=target | grep -- '-march=' | cut -f3 to check which processor is automatically detected. On my favourite machine, I get “broadwell”. That is slightly wrong, but close enough given that the compiler does not know about skylake processors.
  2. One reading of the documentation is that “-march=native” implies “-mtune=native”, so let us check. I type gcc -march=native -Q --help=target | grep -- '-mtune=' | cut -f3 and I get “generic”. Ah! The compiler has detected “broadwell” but it is not tuning for “broadwell” or for “native”, rather it is tuning for “generic”.
  3. What if, instead of “-march=native”, I type “-march=broadwell”? Surely it should make no difference? I type gcc -march=broadwell -Q --help=target | grep -- '-mtune=' | cut -f3 and I get “broadwell”. So even if you have a broadwell processor that gets recognized as such, the flags “-march=native” and “-march=broadwell” differ in the sense that they affect the tuning differently.
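
The three checks above can be folded into one small helper. What follows is a minimal sketch, not part of the original experiment: the function names `target_value` and `resolved` are made up for illustration, and it assumes the `-Q --help=target` report lists each option name followed by whitespace and its value, as in the commands above. It uses awk instead of grep and cut so that any other line that merely mentions `-march=` (newer GCC versions may print some) is ignored.

```shell
#!/bin/sh
# Print the value reported for one option.
# Reads a `gcc -Q --help=target` report on stdin.
# $1 is the option name including the '=' sign, e.g. -march=
target_value() {
    awk -v opt="$1" '$1 == opt { print $2; exit }'
}

# Show what a set of flags resolves to (hypothetical helper).
resolved() {
    report=$(gcc "$@" -Q --help=target 2>/dev/null) || return 1
    printf '%s\tmarch=%s\tmtune=%s\n' "$*" \
        "$(printf '%s\n' "$report" | target_value -march=)" \
        "$(printf '%s\n' "$report" | target_value -mtune=)"
}

# Only query the compiler if one is installed.
if command -v gcc >/dev/null 2>&1; then
    for flags in "-march=native" "-march=broadwell" "-march=native -mtune=native"; do
        # $flags is left unquoted on purpose so that it splits into words.
        resolved $flags || echo "$flags: not accepted by this gcc"
    done
fi
```

Each output line shows the flags you passed next to the architecture and tuning the compiler settled on, which makes a mismatch between the two easy to spot.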

Let me repeat this: if you have a skylake processor that gets recognized as a broadwell processor, then “-march=broadwell” and “-march=native” are different flags having a different effect on your code.

What you care about is whether it produces different binaries. Does it? Unfortunately yes, it does. See my code sample.
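
If you want to check this on your own machine, one approach is to compile the same file to assembly with each flag and compare the results. The sketch below is an assumption-laden illustration rather than the code sample referred to above: the `sum` kernel and the helper name `same_or_different` are invented here, and whether the two outputs differ will depend on your compiler version, your processor and the code itself.

```shell
#!/bin/sh
# Write a small, hypothetical kernel to a scratch directory.
workdir=$(mktemp -d)
cat > "$workdir/kernel.c" <<'EOF'
#include <stddef.h>
#include <stdint.h>

uint64_t sum(const uint32_t *data, size_t n) {
    uint64_t total = 0;
    size_t i;
    for (i = 0; i < n; i++) total += data[i];
    return total;
}
EOF

# Report whether two files have identical contents.
same_or_different() {
    if cmp -s "$1" "$2"; then echo identical; else echo different; fi
}

# Compile to assembly (-S) with each flag, then compare.
if command -v gcc >/dev/null 2>&1; then
    gcc -O3 -march=native    -S -o "$workdir/native.s"    "$workdir/kernel.c" &&
    gcc -O3 -march=broadwell -S -o "$workdir/broadwell.s" "$workdir/kernel.c" &&
    same_or_different "$workdir/native.s" "$workdir/broadwell.s" ||
    echo "compilation failed with one of the flags"
    echo "assembly kept in $workdir"
fi
```

When the answer is “different”, running diff on the two `.s` files shows where the instruction selection or scheduling diverges.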

Does it matter in practice, as far as performance goes? Probably not in actual systems, but if you are doing microbenchmarking, studying a specific function, small differences might matter.

I will keep using “-march=native” as it is the expedient approach, but I would really like to know how to best tune specifically for my hardware without having to do messy command-line Kung Fu.
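
A possible way to reduce the manual work is to let a script read the compiler's report and propose an explicit “-mtune”. The following is only a sketch built on a heuristic I am assuming, not a documented rule: when the tuning falls back to “generic”, it reuses the detected “-march” name as the tuning target. That guess is plausible for a skylake processor detected as broadwell, but it may be a poor one for other processor families. The function name `suggest_flags` is hypothetical.

```shell
#!/bin/sh
# Read a `gcc -march=native -Q --help=target` report on stdin and
# print a suggested pair of flags.
# Heuristic (an assumption, not GCC behaviour): if tuning fell back to
# "generic", tune for the detected architecture instead.
suggest_flags() {
    awk '
        $1 == "-march=" { arch = $2 }
        $1 == "-mtune=" { tune = $2 }
        END {
            if (arch == "") exit 1            # report not understood
            if (tune == "" || tune == "generic") {
                # x86-64 is a baseline, not a specific processor,
                # so keep generic tuning in that case.
                tune = (arch == "x86-64") ? "generic" : arch
            }
            printf "-march=native -mtune=%s\n", tune
        }'
}

# Only query the compiler if one is installed.
if command -v gcc >/dev/null 2>&1; then
    flags=$(gcc -march=native -Q --help=target 2>/dev/null | suggest_flags) &&
        echo "suggested: gcc -O3 $flags"
fi
```

The suggestion should be treated as a starting point for benchmarking, since nothing guarantees that the older tuning model suits the newer processor.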

Credit: The example and the key observation are due to Travis Downs.

12 thoughts on “It is more complicated than I thought: -mtune, -march in GCC”

    1. It’s not a bug per se, because it happens when GCC is too old to know about the new arch. So it doesn’t happen (for Skylake) on newer GCC, but it would presumably still happen with a newer CPU uarch.

  1. Maybe it depends on your operating system and GCC version. On CentOS 7.5 with native GCC 4.8.5 and even with GCC 8.2 RC setting march=native also means mtune=native is set

    On Core i7 4790K cpu

    with GCC 4.8.5 native

    gcc -v
    Using built-in specs.
    COLLECT_GCC=gcc
    COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/4.8.5/lto-wrapper
    Target: x86_64-redhat-linux
    Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-linker-hash-style=gnu --enable-languages=c,c++,objc,obj-c++,java,fortran,ada,go,lto --enable-plugin --enable-initfini-array --disable-libgcj --with-isl=/builddir/build/BUILD/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/isl-install --with-cloog=/builddir/build/BUILD/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/cloog-install --enable-gnu-indirect-function --with-tune=generic --with-arch_32=x86-64 --build=x86_64-redhat-linux
    Thread model: posix
    gcc version 4.8.5 20150623 (Red Hat 4.8.5-28) (GCC)

    you get for march and mtune

    gcc -march=native -Q --help=target | egrep -- '-march=|-mtune' | cut -f3
    core-avx2
    core-avx2

    with GCC 8.2 RC snapshot reported as 8.1.1 right now

    gcc -v
    Using built-in specs.
    COLLECT_GCC=gcc
    COLLECT_LTO_WRAPPER=/opt/gcc-8.2.0-RC-20180719/libexec/gcc/x86_64-redhat-linux/8/lto-wrapper
    Target: x86_64-redhat-linux
    Configured with: ../configure --prefix=/opt/gcc-8.2.0-RC-20180719 --disable-multilib --enable-bootstrap --enable-plugin --with-gcc-major-version-only --enable-shared --disable-nls --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-install-libiberty --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-linker-hash-style=gnu --enable-languages=c,c++ --enable-initfini-array --disable-libgcj --enable-gnu-indirect-function --with-tune=generic --build=x86_64-redhat-linux --enable-lto --enable-gold
    Thread model: posix
    gcc version 8.1.1 20180719 (GCC)

    you get for march and mtune

    gcc -march=native -Q --help=target | egrep -- '-march=|-mtune' | cut -f3
    haswell
    haswell

    and specifically for haswell target you get for march and mtune

    gcc -march=haswell -Q --help=target | egrep -- '-march=|-mtune' | cut -f3
    haswell
    haswell

    1. You need to run the test with a compiler that doesn’t know about your arch to make this interesting. In particular, for gcc 8 your results are as expected: Haswell is known by gcc and you are running on Haswell, so you get march and mtune set to Haswell.

      For the gcc 4.8.5 test, it isn’t clear what it means: core-avx2 is no longer a supported option for gcc (at least according to the manual): it reminds me of the icc options? It doesn’t make sense to tune for “core-avx2” since that is not a micro-architecture, so it’s hard to say what gcc is doing internally. Perhaps this behavior changed in later versions of gcc.

      1. For the gcc 4.8.5 test, it isn’t clear what it means: core-avx2 is no
        longer a supported option for gcc (at least according to the manual):
        it reminds me of the icc options? It doesn’t make sense to tune for
        “core-avx2” since that is not a micro-architecture, so it’s hard to
        say what gcc is doing internally. Perhaps this behavior changed in
        later versions of gcc.

        Ah, I didn’t realise core-avx2 was no longer supported. That probably explains why I had issues compiling PHP 7.3 alphas: on a Skylake CPU it failed to compile with Zend Opcache on GCC 4.8.5 but compiled fine on GCC 7.3.1 🙂

  2. A note about the gcc documentation you mentioned:

    Specifying -march=cpu-type implies -mtune=cpu-type.

    It could be clearer: what it should say is that “Specifying -march=cpu-type implies -mtune=cpu-type if not otherwise explicitly specified.” I had always interpreted it that way, but probably because before reading it I had seen lots of examples where both are specified (indeed, the documentation hints at that usage).

    That is, it has always been the case that passing both -march and -mtune to the same compilation makes sense: you often want to target some fairly broad range of chips (say, since Sandy Bridge) but optimize for the chip you know will be the most common in your case in the immediate future (say Skylake).

    You can see some method to gcc’s madness here. You specify that gcc should use the instructions and tuning for your arch, but you run into a problem when the arch is newer than anything gcc knows about. In that case, what gcc does is different for the “march” side of things versus the “mtune” side.

    For the march, you are just talking about available instructions and instruction sets. Any version of GCC knows about some set of instruction sets, usually corresponding to the newest arch it knows about. It can also query the instruction sets supported by the current CPU. If the CPU is of an unknown type, it could match it against the arches it knows about and if there is an exact match or a “superset match” it could just use that – and so it does: it selects Broadwell since from an ISA point of view, Skylake is Broadwell (Skylake may support a few extra instructions such as MPX, but since gcc doesn’t know about them, it wouldn’t query for them and so this logic probably gets the same result whether it is using exact match or superset match).

    Another way of looking at it is that -march=broadwell is just a shortcut for specifying a long list of -m options like -mavx, -mavx2, -mpclmul, etc, and the same list can be generated for -march=native by querying the processor’s capabilities, which may then be compressed to something like -march=broadwell if it matches the list implied by Broadwell.
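
To look at the list that a given “-march” value stands for, one can filter the same `-Q --help=target` report for entries marked as enabled. This is a rough sketch under the assumption that the report marks each on/off option with `[enabled]` or `[disabled]` in its second column; the helper name `enabled_options` is made up for illustration.

```shell
#!/bin/sh
# Print the names of all options marked [enabled].
# Reads a `gcc -Q --help=target` report on stdin.
enabled_options() {
    awk '$2 == "[enabled]" { print $1 }'
}

# Only query the compiler if one is installed.
if command -v gcc >/dev/null 2>&1; then
    tmp=$(mktemp -d)
    gcc -march=native    -Q --help=target 2>/dev/null | enabled_options | sort > "$tmp/native"
    gcc -march=broadwell -Q --help=target 2>/dev/null | enabled_options | sort > "$tmp/broadwell"
    # Lines starting with < or > are enabled for only one of the two settings.
    diff "$tmp/native" "$tmp/broadwell" || true
    rm -rf "$tmp"
fi
```

An empty diff would suggest that, as far as this compiler can tell, the two settings enable the same instruction sets.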

    All this is good because it prevents a huge regression when using -march=native: if it didn’t do this, then when you upgraded your CPU you’d suddenly lose access to AVX2, AVX, any version of SSE greater than 2 and so on, since gcc would just be like “Oh, I don’t know about this CPU so I’ll use the base x86-64 profile”. So I think we can say gcc is doing a reasonable thing on the -march side of things.

    That leaves -mtune. The main problem as you put it is that -march=native implies (for example) -march=broadwell on Skylake chips when gcc doesn’t know about Skylake, but it does not imply -mtune=broadwell. In fact, in this particular case, -mtune=broadwell would be the best option: -mtune=generic is worse.

    We know that, however, only with the benefit of hindsight: Skylake performs very much like Broadwell (which performs essentially identically to Haswell before it), so Broadwell is a good tune for Skylake. That certainly hasn’t always been the case though: when the switch to the P4 uarch was made, the tune for the “previous” arch would have been a bad match for P4, and the same happened when P4 was in turn dropped in favor of a return to the PPro/PentiumM architecture.

    So the rule of “use the latest arch (from same manufacturer?)” would have worked well recently but not in the past. It would also have trouble when some manufacturer doesn’t have a linear list of architectures, but rather also has various secondary architectures, like Intel with Atom and the Phi/Knights* stuff.

    The rule of “use generic tune” seems like a reasonable compromise, and also has the advantage of being easier to implement: no need to implement an ordering of architectures or deal with the various families etc. So even though I originally thought this was really dumb, I can see the logic.

    Last note. You write:

    By default, when unspecified, “-mtune=generic” applies which means…

    I think you know this, but one should be clear that this only applies if you don’t also specify -march. Usually you want to specify -march since the difference there is huge (newer instruction sets), and -mtune comes along for the ride.

  3. I hate no editing capabilities, and this typo is too important: it should read:

    The main problem as you put it is that -march=native implies (for
    example) -march=broadwell on Skylake chips when gcc doesn’t know about
    Skylake, but it does not imply -mtune=broadwell

  4. Thanks. This is an appropriate and timely bit of information, given my upcoming exercise. 🙂

    I can somewhat understand the choice of compiler-default behaviors, but also expect it might wander a bit between versions. This should not matter for most folk, for most problems, but if you are working a problem targeted for a specific processor, this stuff matters.

  5. For the longest time, a codebase I worked on had -march=native -mtune=native. It was just easier to let GCC figure things out instead of specifying the actual values, and it worked, so why bother?

    But it does matter. And this article is a great link to share with people who don’t know that.

    The reason I had to change the code base was virtual machines. Some of the build was being done in a QEMU VM, so the CPU returned from procinfo was a QEMU. This broke the build entirely, since GCC couldn’t figure out what the CPU architecture was. But if it hadn’t been for that, I would not have been aware of the issues with -march=native -mtune=native. So thank you for writing the article to bring this to more people’s attention.

  6. If the compiler does not know the actual architecture – you mentioned that broadwell is not correct, just close enough – how is it going to know that tuning for broadwell is more appropriate than tuning generic? Because apparently it is not a broadwell.

    It seems consistent to me to apply generic tuning for a CPU about which the compiler does not (yet) have enough details. It cannot just assume that broadwell tuning is the best choice for all future broadwell successor CPUs.

    1. It seems consistent to me to apply generic tuning for a CPU about which the compiler does not (yet) have enough details.

      It is not wrong, but I would argue that it is not possible to infer this behaviour from the documentation. So the net result is a surprise, and surprises are not good.
