Modern processors have powerful vector instructions that let you load several values at once and operate on all of them with a single instruction. Similarly, they allow you to have vector constants. Thus, if you wanted to add some integer (say 10001) to all integers in a large array, you might first load a vector constant holding 8 copies of the value 10001; then you would load elements from your array, 8 at a time, add the vector constant (thus doing 8 additions at once), and store the result. Everything else being equal, this might be 8 times faster.
An optimizing compiler might even do this optimization for you (a process called ‘auto-vectorization’). However, for more complex code, you might need to do it manually using “intrinsic” functions (e.g., _mm256_loadu_si256, _mm256_add_epi32, and so forth).
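For reference, the scalar version of such a loop might look like the following minimal sketch (the function name is mine); a compiler such as GCC or LLVM will typically auto-vectorize it at -O3 when AVX2 is enabled (-mavx2):

#include <stddef.h>
#include <stdint.h>

// Scalar version of the loop described above: compiled with
// optimizations and AVX2 enabled, the compiler can turn this into
// the 8-integers-at-a-time form automatically.
void add_constant_scalar(uint32_t *in, size_t len) {
  for (size_t i = 0; i < len; i++) {
    in[i] += 10001;
  }
}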
Let us consider the simple case I describe, but where we process two arrays at once… using the same constant:
#include <x86intrin.h>
#include <stdint.h>
#include <stddef.h>

void process_avx2(uint32_t *in1, uint32_t *in2, size_t len) {
  // define the constant, 8 x 10001
  __m256i c = _mm256_set1_epi32(10001);
  const uint32_t *finalin1 = in1 + len;
  const uint32_t *finalin2 = in2 + len;
  for (; in1 + 8 <= finalin1; in1 += 8) {
    // load 8 integers into a 32-byte register
    __m256i x = _mm256_loadu_si256((const __m256i *)in1);
    // add the 8 integers just loaded to the 8 constant integers
    x = _mm256_add_epi32(c, x);
    // store the 8 modified integers
    _mm256_storeu_si256((__m256i *)in1, x);
  }
  for (; in2 + 8 <= finalin2; in2 += 8) {
    // load 8 integers into a 32-byte register
    __m256i x = _mm256_loadu_si256((const __m256i *)in2);
    // add the 8 integers just loaded to the 8 constant integers
    x = _mm256_add_epi32(c, x);
    // store the 8 modified integers
    _mm256_storeu_si256((__m256i *)in2, x);
  }
}
My expectation, until recently, was that optimizing compilers would keep the constant in a register, and never load it twice. Why would they?
Yet you can check that GCC loads the constant twice. You will recognize the assembly sequence:
mov eax, 10001          // load 10001 into a general-purpose register
vpbroadcastd ymm1, eax  // broadcast 10001 to all eight 32-bit elements
In this instance, other compilers (like LLVM) do better. However, in other instances, both LLVM and GCC happily load constants more than once. Only the Intel compiler (ICC) seems to be able to avoid this issue with some consistency.
The processor has more than enough vector registers, so it is not a register allocation issue. Of course, there are instances where it is best to avoid creating the constant, but you can check that even when the compiler ought to know that the constant is always needed, it may still create it twice. AVX-512 has introduced new mask types and they suffer from this effect as well.
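To illustrate, here is a sketch of an AVX-512 variant (the function and the mask value 0x00FF are mine, chosen for illustration; it requires AVX-512F) where both a vector constant and a mask constant are needed across two loops:

#include <x86intrin.h>
#include <stdint.h>
#include <stddef.h>

// Sketch only: an AVX-512 variant in which a vector constant and a
// mask constant (0x00FF, i.e., only the lower 8 of 16 lanes) are both
// needed in two successive loops.
void process_avx512(uint32_t *in1, uint32_t *in2, size_t len) {
  __m512i c = _mm512_set1_epi32(10001);
  __mmask16 k = 0x00FF; // update only the lower 8 lanes
  const uint32_t *finalin1 = in1 + len;
  const uint32_t *finalin2 = in2 + len;
  for (; in1 + 16 <= finalin1; in1 += 16) {
    __m512i x = _mm512_loadu_si512((const void *)in1);
    // masked add: lanes not selected by k keep their original value
    x = _mm512_mask_add_epi32(x, k, x, c);
    _mm512_storeu_si512((void *)in1, x);
  }
  for (; in2 + 16 <= finalin2; in2 += 16) {
    __m512i x = _mm512_loadu_si512((const void *)in2);
    x = _mm512_mask_add_epi32(x, k, x, c);
    _mm512_storeu_si512((void *)in2, x);
  }
}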
Does it matter? In most cases, this effect should have little performance impact. It is almost surely only a few instructions of overhead per function.
It would be interesting to be able to instruct the compiler not to reload the constants. You might think that the static keyword could help, but with LLVM, static vector variables may be protected by a lock, which probably makes your code even heavier.
There is one potential reason why GCC loads the constant twice: in the assembly, you can see the “jb .L2” -> “jb .L8” -> ret path that never loads the constant. At least from the assembly, you cannot say a priori that each loop is entered at least once, or even that if one is entered, the other is entered as well. If one loop is taken and the other is not, you would need the constant in the common parent of those blocks, and hoisting it there would pessimize the “no loop is taken” path from the beginning.
Some of the obvious optimizations like merging the two loops are also not really allowed because the ranges might overlap in memory. A simple __restrict doesn’t seem to help though.
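By __restrict I mean something like the following sketch, where the only change from the post's function is the qualifier on the two pointer parameters:

#include <x86intrin.h>
#include <stdint.h>
#include <stddef.h>

// Sketch: same function, but the parameters are __restrict-qualified,
// promising the compiler that the two arrays do not overlap.
void process_avx2_restrict(uint32_t *__restrict in1,
                           uint32_t *__restrict in2, size_t len) {
  __m256i c = _mm256_set1_epi32(10001);
  const uint32_t *finalin1 = in1 + len;
  const uint32_t *finalin2 = in2 + len;
  for (; in1 + 8 <= finalin1; in1 += 8) {
    __m256i x = _mm256_loadu_si256((const __m256i *)in1);
    _mm256_storeu_si256((__m256i *)in1, _mm256_add_epi32(c, x));
  }
  for (; in2 + 8 <= finalin2; in2 += 8) {
    __m256i x = _mm256_loadu_si256((const __m256i *)in2);
    _mm256_storeu_si256((__m256i *)in2, _mm256_add_epi32(c, x));
  }
}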
Even if you make sure that the constant is used at least twice… for sure (unconditionally), GCC still loads the constant twice…
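One way to arrange that (a sketch, not necessarily the exact variant I tried) is to require len >= 8, so that both loops provably run at least once and the constant is needed on every path:

#include <x86intrin.h>
#include <stdint.h>
#include <stddef.h>

// Sketch: the caller must guarantee len >= 8; both loops then execute
// at least once, so the constant is unconditionally needed twice.
void process_avx2_at_least_once(uint32_t *in1, uint32_t *in2, size_t len) {
  __m256i c = _mm256_set1_epi32(10001);
  const uint32_t *finalin1 = in1 + len;
  const uint32_t *finalin2 = in2 + len;
  do {
    __m256i x = _mm256_loadu_si256((const __m256i *)in1);
    _mm256_storeu_si256((__m256i *)in1, _mm256_add_epi32(c, x));
    in1 += 8;
  } while (in1 + 8 <= finalin1);
  do {
    __m256i x = _mm256_loadu_si256((const __m256i *)in2);
    _mm256_storeu_si256((__m256i *)in2, _mm256_add_epi32(c, x));
    in2 += 8;
  } while (in2 + 8 <= finalin2);
}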
I agree that you can eventually get GCC to stop doing that with enough coddling… but that does not help at scale…
I don’t think this would have any kind of measurable impact at scale; other kinds of issues, such as pushing/popping registers on each function call, play a stronger role here.
We know that, when doing higher-level programming in C or C++, we leave this kind of control to the compiler.
The issue is always the same: we trust the compiler to do a decent job, and if that is not enough, we dive into the assembly to squeeze out the last cycles we can.
That shouldn’t be an issue if you’re already proficient writing SIMD code and looking at the assembly output.
Always appreciate thoughts on that matter though 🙂
I agree that it is unlikely to have a measurable impact on performance.
There is definitely the “not sure if the loop executes” problem mentioned earlier, which causes the compiler to materialize the constant once per loop, because it thinks that guarantees the constant is computed the minimum number of times possible (given the CFG).
What is happening otherwise (i.e., if you make the loops constant-number-of-iterations for loops) is that constant propagation at a high level determines the vector is a constant and propagates it forward into both loops, which is fine. Note that it is expected that later, after lowering, etc., something with a machine cost model will commonize it if necessary.
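For concreteness, here is a sketch of the kind of constant-trip-count variant I mean (the trip count of 64 elements is arbitrary):

#include <x86intrin.h>
#include <stdint.h>

// Sketch: fixed trip counts, so both loops are known to execute;
// the broadcast of 10001 is still propagated into each loop.
void process_avx2_fixed(uint32_t *in1, uint32_t *in2) {
  __m256i c = _mm256_set1_epi32(10001);
  for (int i = 0; i < 64; i += 8) {
    __m256i x = _mm256_loadu_si256((const __m256i *)(in1 + i));
    _mm256_storeu_si256((__m256i *)(in1 + i), _mm256_add_epi32(c, x));
  }
  for (int i = 0; i < 64; i += 8) {
    __m256i x = _mm256_loadu_si256((const __m256i *)(in2 + i));
    _mm256_storeu_si256((__m256i *)(in2 + i), _mm256_add_epi32(c, x));
  }
}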
The low level definitely knows it is constant in both cases.
But it does not compute that commonizing it will save anything, from what I can tell (I haven’t looked at every single pass dump at the RTL level to verify this, only a few that I would have expected to eliminate it).
See https://godbolt.org/z/jxWKcnTT1
This will show you the ccp1 pass, which propagates the constant forward.
If you swap over to the final RTL pass, you can see it knows it is equivalent to a constant.
Nothing in between CSEs it, even if I turn on size optimization.
This is likely related to believing that constants are free in most cases (at the high level this is definitely the right view; as I said, at the low level, where there is a machine cost model, it is weirder that it does not eliminate the duplicate even though it knows it is a constant).
Is there any difference in declaring:
const register __m256i c = _mm256_set1_epi32(10001);
It seems the register keyword might be ignored.
The register keyword in such a context seems quite useless in my experience.