Intel and AMD have expanded the x64 instruction sets over time. In particular, the SIMD (single instruction, multiple data) instructions have become progressively wider and more general: from 64 bits to 128 bits (SSE2), to 256 bits (AVX/AVX2), to 512 bits (AVX-512). Interestingly, many instructions defined on 256-bit registers through AVX/AVX2 are not available on 512-bit registers.

With SSSE3, Intel introduced sign instructions, with the corresponding intrinsic functions (e.g., `_mm_sign_epi8`). There are 8-bit, 16-bit and 32-bit versions. They were extended to 256-bit registers in AVX2.

What these instructions do is apply the sign of one parameter to the other parameter. It is most easily explained as pseudocode:

function sign(a, b): # a and b are integers
    if b == 0 : return 0
    if b < 0 : return -a
    if b > 0 : return a

The SIMD equivalent does the same operation but with many values at once. Thus, with SSSE3 and psignb, you can generate sixteen signed 8-bit integers at once.

You can view it as a generalization of the absolute function: abs(a) = sign(a,a). The sign instructions are very fast. They are used in numerical analysis and machine learning: e.g., it is used in llama.cpp, the open source LLM project.

When Intel designed AVX-512, they decided to omit the sign instructions. So while we have the intrinsic function `_mm256_sign_epi8`, we don’t have `_mm512_sign_epi8`. The same instructions are missing for 16-bit and 32-bit integers (e.g., no `_mm512_sign_epi16` is found).

You may implement it for AVX-512 with several instructions. I found this approach:

#include <x86intrin.h>

__m512i _mm512_sign_epi8(__m512i a, __m512i b) {
  __m512i zero = _mm512_setzero_si512();
  __mmask64 blt0 = _mm512_movepi8_mask(b);
  __mmask64 ble0 = _mm512_cmple_epi8_mask(b, zero);
  __m512i a_blt0 = _mm512_mask_mov_epi8(zero, blt0, a);
  return _mm512_mask_sub_epi8(a, ble0, zero, a_blt0);
}

It is disappointingly expensive. It might compile to four or five instructions:

vpmovb2m k2, zmm1
vpxor xmm2, xmm2, xmm2
vpcmpb k1, zmm1, zmm2, 2
vpblendmb zmm1{k2}, zmm2, zmm0
vpsubb zmm0{k1}, zmm2, zmm1

In practice, you may not need to pay such a high price. The reason the problem is difficult is that we have three cases to handle (three signs: b = 0, b > 0, b < 0). If you do not care about the case ‘b = 0’, then you can do it in two instructions, not counting the zero (one xor):

#include <x86intrin.h>

__m512i _mm512_sign_epi8_cheated(__m512i a, __m512i b) {
  __m512i zero = _mm512_setzero_si512();
  __mmask64 blt0 = _mm512_movepi8_mask(b);
  return _mm512_mask_sub_epi8(a, blt0, zero, a);
}

In effect, we implemented…

function sign_cheated(a, b): # a and b are integers
    if b < 0 : return -a
    if b ≥ 0 : return a

Daniel Lemire, "Implementing the missing sign instruction in AVX-512," in *Daniel Lemire's blog*, January 11, 2024.

Three if you include the xor (though I think it’s fair to ignore it).

Alternative `_mm512_sign_epi8` which is one byte shorter due to avoiding `vpcmpb`:

// zero elements
__mmask64 bne0 = _mm512_test_epi8_mask(b, b);
a = _mm512_maskz_mov_epi8(bne0, a);
// negate elements
__mmask64 blt0 = _mm512_movepi8_mask(b);
return _mm512_mask_sub_epi8(a, blt0, _mm512_setzero_si512(), a);

Nit: it’s actually `b<0` (you’ll need to fix the other condition too).

What about the case when either value is NaN? Got to return something…

Integers cannot be NaN.

A couple of typos here? Should be “…view it as a…” and “…abs(a) = sign(a,a)…”, isn’t it?

Correct, thanks.

Not so “cheated”, reminds me of forward/backward dnnl_eltwise_abs from https://oneapi-src.github.io/oneDNN/dev_guide_eltwise.html

Is there a reason not to use _mm256_sign_epi8 twice?

You could try, but it definitely generates more instructions, and several of these instructions have long latencies.

The following two extract intrinsics:

__m256i a1 = _mm512_extracti64x4_epi64(a,0);

__m256i b1 = _mm512_extracti64x4_epi64(b,0);

can be replaced with cast intrinsics:

__m256i a1 = _mm512_castsi512_si256 (a);

__m256i b1 = _mm512_castsi512_si256 (b);

which do not generate any instructions.

There’s some room for improvement:

- use vptestmb to avoid the immediate for vpcmpb (already pointed out by an earlier commenter)
- don’t reference the original a again, so if it’s a load result, the compiler can just use a zero-masking load
- zero-mask instead of merge-mask into a zeroed vector
- don’t define function names in the reserved namespace _… (this can lead to silent bad effects if an intrinsic with that name exists in a later header version)

__m512i m512_sign_epi8(__m512i a, __m512i b)
{
    __mmask64 b_nonzero = _mm512_test_epi8_mask(b, b);
    __mmask64 b_neg = _mm512_movepi8_mask(b); // extract sign bits: b < 0
    __m512i a_zeroed = _mm512_maskz_mov_epi8(b_nonzero, a); // (b!=0) ? a : 0
    return _mm512_mask_sub_epi8(a_zeroed, b_neg, _mm512_setzero_si512(), a_zeroed); // b_neg ? 0-a_zeroed : a_zeroed
}

See my Stack Overflow answer for more detail on these. Clang does a surprisingly bad job with your version, using 64 bytes of zeros from static storage for the compare, even though it has to xor-zero a register for other stuff.

If the result is used for something like _mm512_add_epi8, ending with zero-masking can let a clever-enough compiler (clang 17 or later) fold that into merge-masking.

`_mm512_add_epi8(x, m512_sign_epi8_foldable(y,z));` can become

vpmovb2m k1, zmm2
vpxor xmm3, xmm3, xmm3
vptestmb k2, zmm2, zmm2
vpsubb zmm1 {k1}, zmm3, zmm1
# zero-masking of the vpsubb result optimized away,
# folded into merge-masking for the add
vpaddb zmm0 {k2}, zmm0, zmm1
ret

Again, see my SO answer for source for that version. It’s worse if the source is memory and the zero-masking can’t fold into the next use of the result, but it’s at least break-even with a memory source if the zero masking can be folded into merge-masking for a later op. And it’s a win in that case for a register source.