Filtering numbers faster with SVE on Graviton 3 processors

Processors come, roughly, in two large families x64 processors from Intel and AMD, and ARM processors from Apple, Samsung, and many other vendors. For a long time, ARM processors occupied mostly the market of embedded processors (the computer running your fridge at home) with the ‘big processors’ being almost exclusively the domain of x64 processors.

Reportedly, the Apple CEO (Steve Jobs) went to see Intel back when Apple was designing the iPhone to ask for a processor deal. Intel turned Apple down. So Apple went with ARM.

Today, we use ARM processors for everything: game consoles (Nintendo Switch), powerful servers (Amazon and Google), mobile phones, embedded devices, and so forth.

Amazon makes available its new ARM-based processors (Graviton 3). These processors have sophisticated SIMD instructions (SIMD stands for Single Instruction Multiple Data) called SVE (Scalable Vector Extensions). With these instructions, we can greatly accelerate software. It is a form of single-core parallelism, as opposed to the parallelism that one gets by using multiple cores for one task. The SIMD parallelism, when it is applicable, is often far more efficient than multicore parallelism.

Amazon’s Graviton 3 appears to have 32-byte registers, since it is based on the ARM Neoverse V1 design. You can fit eight 32-bit integers in one register. Mainstream ARM processors (e.g., the ones that Intel uses) have SIMD instructions too (NEON), but with shorter registers (16 bytes). Having wider registers and instructions capable of operating over these wide registers allows you reduce the total number of instructions. Executing fewer instructions is a very good way to accelerate code.

To investigate SVE, I looked at a simple problem where you want to remove all negative integers from an array. That is, you read and array containing signed random integers and you want to write out to an output array only the positive integers. Normal C code might look as follows:

```void remove_negatives_scalar(const int32_t *input,
int64_t count, int32_t *output) {
int64_t i = 0;
int64_t j = 0;
for(; i < count; i++) {
if(input[i] >= 0) {
output[j++] = input[i];
}
}
}
```

Replacing this code with new code that relies on special SVE functions made it go much faster (2.5 times faster). At the time, I suggested that my code was probably not nearly optimal. It processed 32 bytes per loop iteration, using 9 instructions. A sizeable fraction of these 9 instructions have to do with managing the loop, and few do the actual number crunching.  A reader, Samuel Lee, proposed to effectively unroll my loop. He predicted much better performance (at least when the array is large enough) due to lower loop overhead. I include his proposed code below.

Using a graviton 3 processor and GCC 11 on my benchmark, I get the following results:

cycles/int instr./int instr./cycle
scalar 9.0 6.000 0.7
branchless scalar 1.8 8.000 4.4
SVE 0.7 1.125 ~1.6
unrolled SVE 0.4385 0.71962 ~1.6

The new unrolled SVE code uses about 23 instructions to process 128 bytes (or 32 32-bit integers), hence about 0.71875 instructions per integer. That’s about 10 times fewer instructions than scalar code and roughly 4 times faster than scalar code in terms of CPU cycles.

The number of instructions retired per cycle is about the same for the two SVE functions, and it is relatively low, somewhat higher than 1.5 instructions retired per cycle.

Often the argument in favour of SVE is that it does not require special code to finish the tail of the processing. That is, you can process an entire array with SVE instructions, even if its length is not divisible by the register size (here 8 integers). I find Lee’s code interesting because it illustrates that you might actually need to handle the end of long array differently, for efficiency reasons.

Overall, I think that we can see that SVE works well for the problem at hand (filtering out 32-bit integers).

Appendix: Samuel Lee’s code.

```void remove_negatives(const int32_t *input, int64_t count, int32_t *output) {
int64_t j = 0;
const int32_t* endPtr = input + count;
const uint64_t vl_u32 = svcntw();

svbool_t all_mask = svptrue_b32();
while(input <= endPtr - (4*vl_u32))
{
svint32_t in0 = svld1_s32(all_mask, input + 0*vl_u32);
svint32_t in1 = svld1_s32(all_mask, input + 1*vl_u32);
svint32_t in2 = svld1_s32(all_mask, input + 2*vl_u32);
svint32_t in3 = svld1_s32(all_mask, input + 3*vl_u32);

svbool_t pos0 = svcmpge_n_s32(all_mask, in0, 0);
svbool_t pos1 = svcmpge_n_s32(all_mask, in1, 0);
svbool_t pos2 = svcmpge_n_s32(all_mask, in2, 0);
svbool_t pos3 = svcmpge_n_s32(all_mask, in3, 0);

in0 = svcompact_s32(pos0, in0);
in1 = svcompact_s32(pos1, in1);
in2 = svcompact_s32(pos2, in2);
in3 = svcompact_s32(pos3, in3);

svst1_s32(all_mask, output + j, in0);
j += svcntp_b32(all_mask, pos0);
svst1_s32(all_mask, output + j, in1);
j += svcntp_b32(all_mask, pos1);
svst1_s32(all_mask, output + j, in2);
j += svcntp_b32(all_mask, pos2);
svst1_s32(all_mask, output + j, in3);
j += svcntp_b32(all_mask, pos3);

input += 4*vl_u32;
}

int64_t i = 0;
count = endPtr - input;

svbool_t while_mask = svwhilelt_b32(i, count);
do {
svint32_t in = svld1_s32(while_mask, input + i);
svbool_t positive = svcmpge_n_s32(while_mask, in, 0);
svint32_t in_positive = svcompact_s32(positive, in);
svst1_s32(while_mask, output + j, in_positive);
i += svcntw();
j += svcntp_b32(while_mask, positive);
while_mask = svwhilelt_b32(i, count);
} while (svptest_any(svptrue_b32(), while_mask));
}
```

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

6 thoughts on “Filtering numbers faster with SVE on Graviton 3 processors”

1. Joern Engel says:

> since it is best on

I suspect you meant “based on”.

2. Joern Engel says:

Thank you both for doing this work!

I was rather impressed with SVE and thought of it as a clearly superior vector instruction set. Having to do manual loop unrolling for performance would negate much of what makes SVE such a nice instruction set.

It is notable, however, that the problem you benchmarked is basically two instructions, plus a load/store pair, plus loop overhead. Most actual code I deal with is much larger, reducing the relative cost of the loop overhead. That also reduces the benefit of loop unrolling and I would typically not bother.

Even if some loops are as simple as your example, they typically don’t dominate runtime and one should concentrate more on other code. So after my initial shock, I don’t think this is such a big problem for SVE.

Still, good to know how about such problems. Thank you for highlighting them!

1. My post is not meant to imply that there is a problem with SVE.

2. Travis Downs says:

I would disagree that small loops like this are uncommon.

Small loops like this are common, and form the basis for optimized versions of common library routines like memcpy, strlen, memchr, and routines in higher level languages, etc. They also form important primitives in applications like databases where you might wish to take the bitwise AND or OR of two bitmaps, etc.

Furthermore, small loops are the ones where you stand the best chance of getting a good auto-vectorization out of the compiler, further increasing their importance under vectorization.

In my experience there is a *huge* amount to gain from modest unrolls of 2-8 iterations of many real-world vectorized (and not vectorized) loops, even without SVE.

1. Samuel Lee says:

Indeed! Being able to handle small buffer tail of an otherwise unrolled loop with a few predicated SVE instructions is much cleaner than having to fall back to scalar code.

For large inputs, unrolling can definitely be beneficial; not only do you reduce the proportion of instructions that are doing the loop handling, in many cases you can reduce the dependencies between instructions (e.g. if instructions in the body of the loop depends on the loop counter, you can end up serializing on updates to the loop counter, reducing the ability to take advantage of instruction-level-parallelism).
These benefits are independent of the instruction set, provided you have enough registers to play with.

In this case it also a benefit because the throughput of predicate handling instructions appears to be limited for the V1, and in the unrolled loop we can make assumptions to reduce proportion of instructions that use this critical resource.

I think ideally compilers would be able to automatically do this sort of unrolling of SVE code in the future (whether autovectorized or intrinsics).

You may subscribe to this blog by email.