Vectorized shifts: are immediates faster?

Our processors are all equipped with vector instructions also called SIMD (single instruction multiple data). One common instruction is the “shift”. Roughly speaking, a shift is either a multiplication by a power of two or a division by a power of two, when considering the data as unsigned integers. Thus 1<<3 is 8.

Older Intel and AMD processors have shift instructions so that you can shift several integers at once. So given [1,2,3,4], you can compute [1<<3, 2<<3, 3<<3, 4<<3] using a single instruction. It is faster, obviously.

However, one form of these instructions can only shift by an “immediate” integer value, that is, the integer must be passed explicitly to the instructions (as opposed to being read from a register). That is an annoying constraint. It is good if you want to shift all your values by 3 bits, and 3 is a fixed quantity, but what if you want to shift by a quantity that is known only at runtime? You can’t. Thus older processors can’t vectorize the following code very well using immediate integers:

void scalarshift(uint32_t *array, size_t N,  int shift) {
  for (size_t k = 0; k < N; ++k) {
    array[k] = array[k] >> shift;

Thankfully, there is also a version of this instruction that takes a non-immediate value, but it must be passed as a vector register: you store the shift count in the first few bits of the vector register.

More recent processors can do variable shifts. Thus given the vectors [1,2,3,4] and [7,8,9,10], one could compute [1<<7, 2<<8, 3<<9, 4<<10] in one instruction. And the vector [7,8,9,10] does not need to be known at compile time.

For some reason, Intel has preserved the older form of the instructions. If you have an immediate integer, you can shift with it. This saves you from having to populate a shift vector and explicitly occupying a register.

But is using immediate integers when possible worth it from a performance point of view? Agner Fog’s instruction table shows little difference between the two forms of the instruction, but I wanted to check for myself. Thus I wrote two versions of the same code, shifting a whole array.

The version with immediate values looks like this…

void vectorshift(uint32_t *array, size_t N) {
  __m256i * a = (__m256i *) array;
  for (size_t k = 0; k  < N / 8 ; k ++, a++) {
    __m256i v = _mm256_loadu_si256(a);
    v = _mm256_srli_epi32(v,SHIFT);

It looks messy because I use intrinsic functions, but you should be able to figure out quickly what my code is doing.

The version with variable shifts is just a bit more complicated…

void vectorshift(uint32_t *array, size_t N, int shift) {
  __m256i * a = (__m256i *) array;
  __m256i s = _mm256_set1_epi32(shift);
  for (size_t k = 0; k  < N / 8 ; k ++, a++) {
    __m256i v = _mm256_loadu_si256(a);
    v = _mm256_srlv_epi32(v,s);

In practice, you want to generously unroll these loops for greater speed.

What is the verdict? On a skylake processor, both reach an identical speed, at 0.35 cycles per input integer. The instruction count is nearly identical.

Thus it is probably not worth bothering with shifts by immediate values for performance reasons.

My code is available.

Daniel Lemire, "Vectorized shifts: are immediates faster?," in Daniel Lemire's blog, February 28, 2018.

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

One thought on “Vectorized shifts: are immediates faster?”

  1. For some reason, Intel has preserved the older form of the instructions.

    That’s not really weird at all. Of course once Intel introduces an instruction they are pretty much bound to support it forever, lest they break binary compatibility for any code using it.

    As far as I know, Intel has never removed published x86 or x86-64 instructions once introduced (the same isn’t the same for AMD which due to market dynamics did backtrack on things like 3DNow! and the XOP instruction sets to align with the Intel extensions instead).

    It should be worth noting there are actually three levels of “variability” in the shift instruction:

    1) Compiled-in immediate (i.e,. the amount to shift is fixed at compile time and applies to all elements).
    2) Runtime amount but the same for all elements.
    3) Runtime amount and may be different for all elements.

    Conceptually there is a fourth possibility “Compile-time immediate, may be different per element” but it has never been supported (indeed, the immediate would huge).

    You compared (1) and (3) and indeed on Skylake they are documented to have identical performance. Oddly, the (2) variant is slower than either. It seems that on Skylake, the variant (2) is implement by one uop to broadcast the shift amount to every element in some temporary register and then uses the same uop as the fully-variable shift (3), slowing it down by the extra uop.

    In earlier uarches like Haswell, the situation was different: (1) and (2) had the same performance as Skylake (i.e., 2 slower than 1), but the fully variable shifts (3) were considerably slower, taking 3 uops each. On that platform using the immediate-operand shifts can help a lot.

Leave a Reply

Your email address will not be published.

You may subscribe to this blog by email.