Suppose that I give you a relatively long string and you want to remove all spaces from it. In ASCII, we can define spaces as the space character (' ') and the line-ending characters ('\r' and '\n'). I am mostly interested in algorithmic and performance issues, so we can simplify the problem by removing all byte values less than or equal to 32.
In a previous post where I asked how quickly we could prune spaces, the best answer involved vectorization using 128-bit registers (SSSE3). It ends up being between 5 and 10 times faster than the naive approach.
Conveniently enough, ARM processors all have 128-bit vector registers, just like x64 processors. So can we make ARM processors go as fast as x64 processors?
Let us first consider a fast scalar implementation:
    size_t i = 0, pos = 0;
    while (i < howmany) {
      char c = bytes[i++];
      bytes[pos] = c;
      pos += (c > 32 ? 1 : 0);
    }
This prunes all character values less than or equal to 32, writing back the data in place. It is very fast.
Can we do better with vector instructions? Vector instructions, supported by virtually all modern processors, operate over wide registers (16 bytes or more).
On x64 processors, the winning strategy is to grab 16 bytes of data, quickly compare against white space characters, then extract a mask (or bitset) value made of 16 bits, one bit per character, where each bit indicates whether the value found is a white space. The construction of such a bitset is cheap on an x64 processor, as there is a dedicated instruction (movemask). There is no such instruction on ARM processors. You can emulate movemask using several instructions.
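For reference, here is one common way to emulate movemask with NEON. This is only a sketch (the helper name neon_movemask is mine), it assumes the input lanes are comparison results (0x00 or 0xFF), and it costs a handful of instructions rather than one:

    #include <arm_neon.h>
    #include <stdint.h>

    // Sketch: emulate x64's movemask on AArch64 when every lane is 0x00 or 0xFF.
    static inline uint16_t neon_movemask(uint8x16_t input) {
      static const uint8_t bits[16] = {0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80,
                                       0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80};
      uint8x16_t masked = vandq_u8(input, vld1q_u8(bits)); // keep one distinct bit per lane
      uint8_t low = vaddv_u8(vget_low_u8(masked));   // horizontal add of lanes 0-7
      uint8_t high = vaddv_u8(vget_high_u8(masked)); // horizontal add of lanes 8-15
      return (uint16_t)(low | (high << 8));
    }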
So we cannot proceed as we did on x64 processors. What can we do?
Just like with SSSE3, we can quickly check whether byte values are less or equal to 32, thus identifying white space characters:
    static inline uint8x16_t is_white(uint8x16_t data) {
      const uint8x16_t wchar = vdupq_n_u8(' ');
      uint8x16_t isw = vcleq_u8(data, wchar);
      return isw;
    }
Next we can quickly check whether any of the 16 characters is a white space, by using about two instructions:
    static inline uint64_t is_not_zero(uint8x16_t v) {
      // Narrow each 64-bit half to 32 bits with saturation: the result is
      // nonzero exactly when some byte of v is nonzero.
      uint64x2_t v64 = vreinterpretq_u64_u8(v);
      uint32x2_t v32 = vqmovn_u64(v64);
      uint64x1_t result = vreinterpret_u64_u32(v32);
      return result[0];
    }
This suggests a useful strategy. Instead of comparing characters one by one, compare 16 characters at once. If none of them is a white space character, just copy the 16 characters back to the input and move on. Otherwise, we fall back on the slow scalar approach, with the added benefit that we do not need to repeat the comparison:
    uint8x16_t justone = vdupq_n_u8(1); // constant; can be hoisted out of the loop
    uint8x16_t vecbytes = vld1q_u8((uint8_t *)bytes + i);
    uint8x16_t w = is_white(vecbytes);
    uint64_t haswhite = is_not_zero(w);
    w = vaddq_u8(justone, w); // 0xFF (white space) wraps to 0, 0x00 becomes 1
    if (!haswhite) {
      vst1q_u8((uint8_t *)bytes + pos, vecbytes);
      pos += 16;
      i += 16;
    } else {
      for (int k = 0; k < 16; k++) {
        bytes[pos] = bytes[i++];
        pos += w[k];
      }
    }
Most of the benefit from this approach would come if you can often expect streams of 16 bytes to contain no white space character. This seems like a good guess in many applications.
I wrote a benchmark where I try to estimate how long it takes to prune spaces, on a per-character basis, using input data where there are few white space characters, placed at random. My source code is available, but you need an ARM processor to run it. I ran the benchmark on a 64-bit ARM processor (made of A57 cores). John Regehr has a few more benchmarks on this same machine. I believe these are the same cores you find in the Nintendo Switch.
| | time per character |
|---|---|
| scalar | 1.40 ns |
| NEON | 0.92 ns |
The technical specification is sparse. However, the processor runs at 1.7 GHz as one can verify by using perf stat. Here is the number of cycles per character we need…
| | ARM | recent x64 |
|---|---|---|
| scalar | 2.4 cycles | 1.2 cycles |
| vectorized (NEON and SSSE3) | 1.6 cycles | 0.25 cycles |
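(For the ARM column, the cycle counts follow from the timings above and the 1.7 GHz clock: 1.40 ns × 1.7 GHz ≈ 2.4 cycles per character for the scalar version and 0.92 ns × 1.7 GHz ≈ 1.6 cycles for the NEON version.)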
(The source code for x64 is available on GitHub.)
In comparison, on an x64 processor, the scalar version uses something like 1.2 cycles per character, which would put the ARM machine at half the performance of a recent x64 processor on a per-cycle basis. That is to be expected, as the A57 cores are hardly meant to compete with recent x64 processors on a cycle-per-cycle basis. However, with SSSE3 on an x64 machine, I manage to use as little as 0.25 cycles per character, which is more than 5 times better than what I can do with ARM NEON.
This large difference comes from an algorithmic difference. On x64 processors, we are relying on the movemask/pshufb combo and we end up with a branchless algorithm involving very few instructions. Our ARM NEON version is much less powerful.
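To make the comparison concrete, here is a rough, self-contained sketch of that movemask/pshufb idea. It is my own illustration (the names shuffle_table, build_shuffle_table and despace_ssse3_sketch are hypothetical), not the exact code from the x64 repository, and it trades 1 MB of table space for simplicity:

    #include <stddef.h>
    #include <stdint.h>
    #include <tmmintrin.h> // SSSE3: _mm_shuffle_epi8 (pshufb)

    // For each 16-bit white-space mask, the positions of the bytes to keep,
    // packed to the front; unused slots hold 0x80 so that pshufb zeroes them.
    static uint8_t shuffle_table[1 << 16][16];

    static void build_shuffle_table(void) { // call once before despacing
      for (uint32_t mask = 0; mask < (1 << 16); mask++) {
        int n = 0;
        for (int k = 0; k < 16; k++)
          if (!(mask & (1u << k))) shuffle_table[mask][n++] = (uint8_t)k;
        while (n < 16) shuffle_table[mask][n++] = 0x80;
      }
    }

    static size_t despace_ssse3_sketch(char *bytes, size_t howmany) {
      size_t i = 0, pos = 0;
      while (i + 16 <= howmany) {
        __m128i x = _mm_loadu_si128((const __m128i *)(bytes + i));
        // byte <= 32 (unsigned) exactly when the saturating subtraction of 32 gives 0
        __m128i ws = _mm_cmpeq_epi8(_mm_subs_epu8(x, _mm_set1_epi8(32)),
                                    _mm_setzero_si128());
        uint32_t mask = (uint32_t)_mm_movemask_epi8(ws); // one bit per white-space byte
        __m128i shuf = _mm_loadu_si128((const __m128i *)shuffle_table[mask]);
        _mm_storeu_si128((__m128i *)(bytes + pos), _mm_shuffle_epi8(x, shuf));
        pos += 16 - (size_t)__builtin_popcount(mask); // GCC/Clang built-in
        i += 16;
      }
      while (i < howmany) { // scalar tail
        unsigned char c = (unsigned char)bytes[i++];
        bytes[pos] = (char)c;
        pos += (c > 32 ? 1 : 0);
      }
      return pos;
    }

Real implementations can avoid the 1 MB table by splitting the mask into two bytes and using two smaller shuffles, but the branchless structure is the same.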
There is a lot to like about ARM processors. The assembly code is much more elegant than the equivalent with x86/x64 processors. Even the ARM NEON instructions feel cleaner than the SSE/AVX instructions. However, for many problems, the total lack of a movemask instruction might limit the scope of what is possible with ARM NEON.
But maybe I underestimate ARM NEON… can you do better than I did?
Note: The post has been edited: it is possible on 64-bit ARM processors to reshuffle 16 bytes in one instruction, as one of the commenters observed.
Here’s a branchless NEON version using the VTBL1 instruction: https://gist.github.com/mmozeiko/be2e8afdf5a0a82b7dbdc8f013abfb5f
On my Raspberry Pi 3 (AArch64, Cortex-A53), the benchmark compiled with clang 4.0.1 shows that this variant is a tiny bit faster than neon_despace:
despace(buffer, N) : 6.81 ns per operation
neon_despace(buffer, N) : 4.15 ns per operation
neon_despace_branchless(buffer, N) : 4.09 ns per operation
Thanks. Currently, I get the following:
So it is not great.
Update: This was gcc. If I compile with clang, I get the following:
I don’t think that the difference between 1.04 and 1.01 is actually significant, but it is still a nice result.
I updated the code to do only one VTBL1 instruction on AArch64 (based on information from Sam’s comment).
Now the code is reasonably faster than the “neon_despace” variant:
despace(buffer, N) : 6.65 ns per operation
neon_despace(buffer, N) : 4.00 ns per operation
neon_despace_branchless(buffer, N) : 3.54 ns per operation
But the result is incorrect. The table for vqtbl1q should be calculated in a different way.
Wow, impressive work, that’s blazing fast even if it’s an ARM. How did you figure this out?
Interesting post – and good analysis on the # of cycles.
It would be very interesting to see a power comparison. After all, if a 95 W x86 takes 1.2 cycles to process something but a 20 W ARM takes 2.4 cycles, the x86 is eating huge amounts of power to achieve its result.
What would the table look like if we broke it down using the # of cycles and the performance per watt?
My guess is you’d find the ARM beating the x86 significantly.
There are 4.5W Kaby Lake CPUs out there which have SSE4, so I don’t think you’re going to get much joy by normalising for power.
I think this is just an instance where x86 has an instruction that ARM is missing. It’s not really fundamental to the architecture of either CPU.
is_not_zero can be implemented in one instruction on arm64.
static inline uint16_t is_not_zero(uint8x16_t v) {
return vaddlvq_u8(v);
}
Thanks. It seems that vaddlvq_u8 might be slower.
It’s useful when you need the count of 0xff bytes in a vector: it’s (vaddlvq_u8(v) >> 8) + 1.
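That formula works because, assuming the remaining lanes are zero (as with a comparison result), n lanes of 0xff sum to 255n = 256n - n under the widening add, so (vaddlvq_u8(v) >> 8) equals n - 1 for 1 ≤ n ≤ 16, and adding one recovers n.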
Yes. Important observation.
Wouldn’t you prefer vaddlvq_u8(vorrq_u8(v,vdupq_n_u8(1)))? It is the same number of instructions and the movi call in a tight loop could get optimized away.
Depends on context. For example here using vaddlvq_u8 improves performance a bit. https://gist.github.com/notorca/c731fc6a916849c3be4f4a8f55b3c583
Cortex A53 (pine64):
despace(buffer, N) : 3.48 ns per operation
neon_despace(buffer, N) : 2.11 ns per operation
neon_despace_branchless(buffer, N) : 2.04 ns per operation
neon_despace_branchless_my(buffer, N) : 1.83 ns per operation
iPhone SE:
pointer alignment = 32768 bytes
memcpy(tmpbuffer,buffer,N) : 0.093 ns per operation
despace(buffer, N) : 0.79 ns per operation
neon_despace(buffer, N) : 0.64 ns per operation
neon_despace_branchless(buffer, N) : 0.43 ns per operation
neon_despace_branchless_my(buffer, N) : 0.40 ns per operation
On the processor I am using, vqmovn_u64 (uqxtn) has a latency of 4 and a throughput of 1; meanwhile, uaddlv has a latency of 8 (over 8-bit values) and the same throughput. So vqmovn_u64 (uqxtn) is preferable, I think.
Most other operations, such as a shift by an immediate (ushr) or a bitwise OR (vorr), have a latency of three. So something like vaddlvq_u8(vshrq_n_u8(v,7)) ends up having a latency of 8 + 3 = 11 cycles.
Or vaddlvq_u8(vshrq_n_u8(v,7))?
Slight correction: In AArch64 the TBL instruction can shuffle 16B at once, this is exposed in the intrinsics with vqtblq.
Yes it does, e.g., vqtbl1q_u8. My mistake.
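As a small illustration (not from the linked gist; the wrapper name shuffle16 is mine), vqtbl1q_u8 takes a 16-byte table and 16 byte indices and compiles to a single TBL instruction:

    #include <arm_neon.h>

    // One TBL instruction shuffles all 16 bytes at once on AArch64.
    // Any index >= 16 produces a zero byte in that lane.
    static inline uint8x16_t shuffle16(uint8x16_t table, uint8x16_t indices) {
      return vqtbl1q_u8(table, indices);
    }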
Your GitHub code has a mistake. The `vld4q_u8` intrinsic will interleave the words. For instance, the first vector will have characters { 0, 1, 2, 3, 16, 17, 18, 19, 32, 33, 34, 35, 48, 49, 50, 51 }. The correct method is to use `vld1q_u8` four times, incrementing the address by 16 each time.
Good catch!
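For clarity, a minimal sketch of the fix described above (the helper name load64 and the pointer p are mine): four plain vld1q_u8 loads keep the bytes in order, whereas a single vld4q_u8 would de-interleave them.

    #include <arm_neon.h>
    #include <stdint.h>

    // Load 64 consecutive bytes as four non-interleaved 16-byte vectors.
    static inline void load64(const uint8_t *p, uint8x16_t out[4]) {
      out[0] = vld1q_u8(p);
      out[1] = vld1q_u8(p + 16);
      out[2] = vld1q_u8(p + 32);
      out[3] = vld1q_u8(p + 48);
    }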