Implementing the missing sign instruction in AVX-512

Intel and AMD have expanded the x64 instruction sets over time. In particular, the SIMD (single instruction, multiple data) instructions have become progressively wider and more general: from 64 bits to 128 bits (SSE2), to 256 bits (AVX/AVX2), to 512 bits (AVX-512). Interestingly, many instructions defined on 256-bit registers through AVX/AVX2 are not available on 512-bit registers.

With SSSE3, Intel introduced sign instructions, with the corresponding intrinsic functions (e.g., _mm_sign_epi8). There are 8-bit, 16-bit and 32-bit versions. They were extended to 256-bit registers in AVX2.

What these instructions do is apply the sign of one parameter to the other parameter. It is most easily explained as pseudocode:

function sign(a, b): # a and b are integers
   if b == 0 : return 0
   if b < 0 : return -a 
   if b > 0 : return a
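
As a minimal sketch, the pseudocode can be written as scalar C for a single 8-bit lane. The name sign_i8 is my own, not part of any library. I am assuming the usual two's-complement wraparound, so that negating -128 gives -128 again, which is how I understand the hardware instruction to behave.

```c
#include <stdint.h>

/* Scalar model of one 8-bit lane of the sign operation.
   sign_i8 is a hypothetical helper name used for illustration. */
int8_t sign_i8(int8_t a, int8_t b) {
    if (b == 0) return 0;
    /* Negate in unsigned arithmetic so that -128 wraps back to -128
       (assumes the usual two's-complement conversion to int8_t). */
    if (b < 0) return (int8_t)(0u - (uint8_t)a);
    return a;
}
```

For example, sign_i8(5, -3) is -5, sign_i8(5, 0) is 0 and sign_i8(-7, 9) is -7.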

The SIMD equivalent does the same operation but with many values at once. Thus, with SSSE3 and psignb, you can process sixteen signed 8-bit integers at once.
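
To make this concrete, here is a short sketch that applies _mm_sign_epi8 to sixteen bytes. The wrapper name sign16 is hypothetical. The target attribute is a GCC/Clang feature that I use so the function compiles without extra flags; I am assuming an x64 processor with SSSE3, which covers practically all of them.

```c
#include <immintrin.h>
#include <stdint.h>

/* Applies the sign of each b[i] to a[i] for sixteen 8-bit lanes.
   sign16 is a hypothetical wrapper name; assumes SSSE3 is available. */
__attribute__((target("ssse3")))
void sign16(const int8_t *a, const int8_t *b, int8_t *out) {
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    /* one psignb instruction handles all sixteen lanes */
    _mm_storeu_si128((__m128i *)out, _mm_sign_epi8(va, vb));
}
```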

You can view it as a generalization of the absolute-value function: abs(a) = sign(a,a). The sign instructions are very fast. They are used in numerical analysis and machine learning: e.g., they are used in llama.cpp, the open source LLM project.
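
The identity abs(a) = sign(a,a) can be checked directly with the SSSE3 intrinsics. The following is only a sketch: the function name abs_matches_sign is made up, and I again rely on the GCC/Clang target attribute and assume SSSE3 support.

```c
#include <immintrin.h>
#include <stdint.h>

/* Returns 1 when _mm_sign_epi8(a, a) equals _mm_abs_epi8(a) in all
   sixteen lanes. abs_matches_sign is a hypothetical name. */
__attribute__((target("ssse3")))
int abs_matches_sign(const int8_t *values) {
    __m128i a = _mm_loadu_si128((const __m128i *)values);
    __m128i via_sign = _mm_sign_epi8(a, a);
    __m128i via_abs = _mm_abs_epi8(a);
    /* every lane equal means all sixteen mask bits are set */
    return _mm_movemask_epi8(_mm_cmpeq_epi8(via_sign, via_abs)) == 0xFFFF;
}
```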

When Intel designed AVX-512, they decided to omit the sign instructions. So while we have the intrinsic function _mm256_sign_epi8, we don’t have _mm512_sign_epi8. The same instructions are missing for 16-bit and 32-bit integers (e.g., no _mm512_sign_epi16 is found).

You may implement it for AVX-512 with several instructions. I found this approach:

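#include <x86intrin.h>

__m512i _mm512_sign_epi8(__m512i a, __m512i b) {
  __m512i zero = _mm512_setzero_si512();
  // mask of the lanes where b < 0 (sign bit set)
  __mmask64 blt0 = _mm512_movepi8_mask(b);
  // mask of the lanes where b <= 0
  __mmask64 ble0 = _mm512_cmple_epi8_mask(b, zero);
  // a where b < 0, zero elsewhere
  __m512i a_blt0 = _mm512_mask_mov_epi8(zero, blt0, a);
  // where b <= 0, compute 0 - a_blt0 (-a if b < 0, 0 if b == 0); else keep a
  return _mm512_mask_sub_epi8(a, ble0, zero, a_blt0);
}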
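
Since there are only 256 × 256 input pairs per lane, the function can be checked exhaustively against the scalar definition. Below is one way to sketch such a check. The names sign_epi8_avx512, check_all_pairs and check_sign_epi8 are hypothetical; the target attribute and __builtin_cpu_supports are GCC/Clang features, and the check is skipped when the processor lacks AVX-512BW.

```c
#include <immintrin.h>
#include <stdint.h>

/* Same logic as the function above, under a hypothetical name. */
static __attribute__((target("avx512bw")))
__m512i sign_epi8_avx512(__m512i a, __m512i b) {
    __m512i zero = _mm512_setzero_si512();
    __mmask64 blt0 = _mm512_movepi8_mask(b);
    __mmask64 ble0 = _mm512_cmple_epi8_mask(b, zero);
    __m512i a_blt0 = _mm512_mask_mov_epi8(zero, blt0, a);
    return _mm512_mask_sub_epi8(a, ble0, zero, a_blt0);
}

/* Compares every (a, b) pair with the scalar definition. */
static __attribute__((target("avx512bw")))
int check_all_pairs(void) {
    int8_t av[64], bv[64], out[64];
    for (int a = -128; a <= 127; a++) {
        /* the 256 values of b are covered in four batches of 64 lanes */
        for (int base = -128; base <= 127; base += 64) {
            for (int i = 0; i < 64; i++) {
                av[i] = (int8_t)a;
                bv[i] = (int8_t)(base + i);
            }
            __m512i r = sign_epi8_avx512(_mm512_loadu_si512(av),
                                         _mm512_loadu_si512(bv));
            _mm512_storeu_si512(out, r);
            for (int i = 0; i < 64; i++) {
                int8_t b = bv[i];
                /* the cast wraps, so -(-128) is expected to stay -128 */
                int8_t expected = b == 0 ? 0 : (b < 0 ? (int8_t)-a : (int8_t)a);
                if (out[i] != expected) return 0;
            }
        }
    }
    return 1;
}

/* Returns 1 if all pairs match, 0 on a mismatch,
   and -1 if the CPU does not support AVX-512BW. */
int check_sign_epi8(void) {
    if (!__builtin_cpu_supports("avx512bw")) return -1;
    return check_all_pairs();
}
```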

It is disappointingly expensive. It might compile to four or five instructions:

vpmovb2m k2, zmm1
vpxor xmm2, xmm2, xmm2
vpcmpb k1, zmm1, zmm2, 2
vpblendmb zmm1{k2}, zmm2, zmm0
vpsubb zmm0{k1}, zmm2, zmm1

In practice, you may not need to pay such a high price. The reason the problem is difficult is that we have three cases to handle (three signs: b = 0, b > 0, b < 0). If you do not care about the case ‘b = 0’, then you can do it in two instructions, not counting the zeroing (one xor):

#include <x86intrin.h>

__m512i _mm512_sign_epi8_cheated(__m512i a, __m512i b) {
  __m512i zero = _mm512_setzero_si512();
  // mask of the lanes where b < 0 (sign bit set)
  __mmask64 blt0 = _mm512_movepi8_mask(b);
  // where b < 0, compute 0 - a; elsewhere keep a
  return _mm512_mask_sub_epi8(a, blt0, zero, a);
}

That is, we implemented:

function sign_cheated(a, b): # a and b are integers
   if b < 0 : return -a 
   if b ≥ 0 : return a
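
To see exactly where the shortcut departs from the full definition, one can compare the two over all 8-bit inputs with a scalar sketch. The function names below are made up for illustration, and the negation assumes two's-complement wraparound. The two definitions should disagree only when b is zero and a is not, which makes 255 pairs out of 65536.

```c
#include <stdint.h>

/* Full definition: returns 0 when b == 0 (hypothetical helper name). */
static int8_t sign_full_i8(int8_t a, int8_t b) {
    if (b == 0) return 0;
    return b < 0 ? (int8_t)(0u - (uint8_t)a) : a;
}

/* Shortcut: treats b == 0 like b > 0 (hypothetical helper name). */
static int8_t sign_cheated_i8(int8_t a, int8_t b) {
    return b < 0 ? (int8_t)(0u - (uint8_t)a) : a;
}

/* Counts the (a, b) pairs on which the two definitions disagree. */
int count_mismatches(void) {
    int mismatches = 0;
    for (int a = -128; a <= 127; a++) {
        for (int b = -128; b <= 127; b++) {
            if (sign_full_i8((int8_t)a, (int8_t)b) !=
                sign_cheated_i8((int8_t)a, (int8_t)b)) {
                mismatches++;
            }
        }
    }
    return mismatches;
}
```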

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

11 thoughts on “Implementing the missing sign instruction in AVX-512”

  1. you can do it in two instructions

    Three if you include the xor (though I think it’s fair to ignore it).

    Alternative _mm512_sign_epi8 which is one byte shorter due to avoiding vpcmpb =P

    // zero elements
    __mmask64 bne0 = _mm512_test_epi8_mask(b, b);
    a = _mm512_maskz_mov_epi8(bne0, a);
    // negate elements
    __mmask64 blt0 = _mm512_movepi8_mask(b);
    return _mm512_mask_sub_epi8(a, blt0, _mm512_setzero_si512(), a);

  2. A couple of typos here?

    You can view is as a generalization of the absolution function: abs(a)
    = sign(a,b).

    Should be “…view it as a…” and “…abs(a) = sign(a, a)…”, isn’t it?

    1. You could try, but it definitely generates more instructions and several of these instructions have long latencies.

      __m512i _mm512_sign_epi8_alt(__m512i a, __m512i b) {
        // split both inputs into 256-bit halves
        __m256i a1 = _mm512_extracti64x4_epi64(a,0);
        __m256i a2 = _mm512_extracti64x4_epi64(a,1);
        __m256i b1 = _mm512_castsi512_si256 (b);
        __m256i b2 = _mm512_extracti64x4_epi64(b,1);
        // apply the AVX2 sign instruction to each half
        a1 = _mm256_sign_epi8(a1,b1);
        a2 = _mm256_sign_epi8(a2,b2);
        // reassemble the 512-bit result
        __m512i r = _mm512_castsi256_si512(a1);
        r = _mm512_inserti64x4(r,a2, 1);
        return r;
      }

      1. The following 2 extract intrinsics:

        __m256i a1 = _mm512_extracti64x4_epi64(a,0);
        __m256i b1 = _mm512_extracti64x4_epi64(b,0);

        can be replaced with cast intrinsics:

        __m256i a1 = _mm512_castsi512_si256 (a);
        __m256i b1 = _mm512_castsi512_si256 (b);

        which do not generate any instructions.
