Pruning spaces faster on ARM processors with Vector Table Lookups

Last week, I asked how fast one could remove spaces from a string using ARM processors. On a particular benchmark, I got 2.4 cycles per byte using regular (scalar) code and as little as 1.8 cycles per byte using ARM NEON instructions. These are “vectorized instructions” that you find in virtually all ARM processors. Vectorized instructions operate over wide registers (spanning at least 16 bytes), often executing the same operation (such as addition or multiplication) over several values at once. However, my trick using ARM NEON instructions relied on the fact that my input stream would contain few spaces. So it was not a very positive blog post for ARM processors.

But then I got feedback from several experts such as Martins Mozeiko, Cyril Lashkevich and Derek Ledbetter. This feedback made me realize that I had grossly underestimated the power of ARM NEON instructions. One reason for my mistake is that I had been looking at older ARM NEON instructions instead of the current AArch64 instructions, which are much more powerful.

To recap, on an x64 processor, you can remove spaces from strings very quickly using vectorized instructions in the following manner:

  • Compare 16 bytes of input characters with white space characters to determine where (if anywhere) there are white space characters.
  • The result of the comparison is itself a 16-byte register, where matching characters have the byte value 255 whereas non-matching characters have the byte value 0. Turn this vector register to a 16-bit integer value by “downsampling” the bits. This can be achieved by a “movemask” instruction present in all x64 processors since the introduction of the Pentium 4 a long time ago.
  • From this mask, compute the number of white space characters by counting the 1s. This can be done with the popcnt instruction.
  • From this mask also, load up a “shuffling register” that tells you how to reorder the bytes so that white space characters are omitted. Then use what Intel and AMD call a “shuffling instruction” (pshufb introduced with the SSSE3 instruction set many years ago) to quickly reorder the bytes.

I thought that the same could not be done with ARM NEON, but I was wrong. If you have access to recent AMD processors (supporting AArch64), then you can closely mimic the x64 processors and get good performance.

Let us review the various components.

To start, we can quickly compare 16 byte values with the byte value 33 to quickly identify common white space characters such as the space, the line ending, the carriage return and so forth.

uint8x16_t is_nonwhite(uint8x16_t data) {
  return vcgeq_u8(data, vdupq_n_u8(' '+1));

ARM NEON has convenient “reduce” instructions, so I can sum up the values of a vector. I can put this to go use to quickly compute how many matching characters I have:

uint8_t bytepopcount(uint8x16_t v) {
  return vaddvq_u8(vshrq_n_u8(v,7));

To compute a 16-bit mask, I also use such a reduce function after computing the bitwise AND of my comparison with some convenient vector (which allows me to distinguish which characters match)…

uint16_t neonmovemask_addv(uint8x16_t input8) {
  uint16x8_t input = vreinterpretq_u16_u8(input8);
  const uint16x8_t bitmask = { 0x0101 , 0x0202, 0x0404, 0x0808, 0x1010, 0x2020, 0x4040, 0x8080 };
  uint16x8_t minput = vandq_u16(input, bitmask);
  return vaddvq_u16(minput);

Finally, I call a Vector Table Lookup instruction which is pretty much equivalent to Intel/AMD’s shuffle instruction:

int mask16bits =  neonmovemask_addv(data);
uint8x16_t shuf = vld1q_u8(shufmask + 16 * mask16bits);
uint8x16_t reshuf = vqtbl1q_u8(data,shuf);

Of course, I am not explaining everything in detail. My full source code is available. All you need is access to a recent ARM processor with Linux running on it, and you are all set to run it.

It turns out that we can double my previous best score:

scalar 1.40 ns
NEON (old code) 0.92 ns
NEON (Vector Table Lookup) 0.52 ns

What is better is that my new code is effectively branchless: its performance is not very sensitive to the input data.

Using the fact that I know the clock speed of my processor, I can make a quick comparison in terms of CPU cycles per input byte…

scalar ARM recent x64
scalar 2.4 cycles 1.2 cycles
vectorized (NEON AArch64 and SSSE3) 0.88 cycles 0.25 cycles

(The source code for x64 processors is available on GitHub.)

What is interesting is that we are getting under one cycle per input byte which is a kind of performance that is difficult to achieve with scalar code that writes byte values one by one. It is still the case that the ARM NEON code is over three times slower than the equivalent on x64 processors, but I am using a relatively weak core (A57 on a Softiron Overdrive 1000) and my code might be subject to further optimization.

16 thoughts on “Pruning spaces faster on ARM processors with Vector Table Lookups”

  1. Great work! In the future ARM Scalable Vector Extension there is prefect instruction ‘COMPACT’ which “Read the active elements from the source vector and pack them into the lowest-numbered elements of the destination vector. Then set any remaining elements of the destination vector to zero.” This instruction will make shufmask unneeded.

    1. Great work!

      Thanks. I have not made any attempt to optimize the code, beyond writing something that I can understand and that is likely to be correct. So it seems likely we can do even better.

      This instruction will make shufmask unneeded.

      Are you sure? Some AVX-512 instruction sets have compress instructions that do something similar, but they compress 32-bit words, not bytes. So I’d be interested in verifying that the documentation refers to the application of COMPACT to bytes.

      1. You are right, COMPACT works with words and doublewords only. But it still can be used, expand 2 times, compact, than narrow 2 times.

        1. But it still can be used (…)

          Of course, the only way to know if it is practical is to write the code and test it out on actual hardware, but I don’t think I have any hardware for it… Do we know when that will be available?

          1. Yes, it would be interesting to experiment with such HW. I hope annual update of iPhones brings us it.

        2. Full disclosure, I’m a graduate at ARM (and I’m not commenting on behalf of ARM in any way)

          In SVE, the new SPLICE instruction will be able to act on bytes and should cover this benchmark nicely (again performance will be implementation dependent, so we shall see how that goes):
          “Splice two vectors under predicate control. Copy the first active to last active elements (inclusive) from the first source vector to the lowest-numbered elements of the result. Then set any remaining elements of the result to a copy of the lowest-numbered elements from the second source vector. The result is placed destructively in the first source vector.”

          So in SVE this should boil down to 5 instructions per vector (interleaved as appropriate to hide latencies):
          LD1B //load contiguous vector
          CMPGT //set a predicate to 1 where non-white and 0 where whitespace
          SPLICE //group non-white characters in bottom of vector (we don’t care what happens at the top)
          ST1B //store contiguous vector
          INCP //increment pointer by number of non-white characters (using predicate)

          (You can have a look at what’s coming in more detail if you check the XML files from the zip in the link Cyril pointed to)

          1. Sam, is there any ARM emulator that works like Intel Software Emulator? I mean one can run their compiled program using selected instruction set, and thanks to that would be able to test at least correctness of implementation for upcoming architectures.

  2. Btw size of table can be reduced 2 times, because row_n+1 == row_n <> 1));
    if (index & 1) {
    shuf0 = vextq_u8(vdupq_n_u8(0), shuf, 1);

    2. remove all even lines form shufmask, replace last unused values by zero and load shuf like this:
    uint16_t index = neonmovemask_addv(w0);
    uint8x16_t shuf0 = vld1q_u8(shufmask + 16 * (index >> 1) – index &1);

    In first case there is additional instruction and branch, in second access to unaligned memory. In fact indexes in shufmask are 4-bit, and the table can be compressed 2 times more, but unpacking will require 1 vector multiplication and 1 vector and.

    1. Seems parser ate part of my comment šŸ™ I have to use LSL for logical shift left
      row_n+1 == row_n LSL 8
      1. remove all even lines form shufmask, and calculate shuf like this.
      uint16_t index = neonmovemask_addv(w0);
      uint8x16_t shuf0 = vld1q_u8(shufmask + 16 * (index >> 1));
      if (index & 1) {
      shuf0 = vextq_u8(vdupq_n_u8(0), shuf, 1);

      1. I think you were clear enough.

        My guess is that adding a branch to save memory might often be a negative. My current benchmark leaves us with an “easy to predict” branch, so my guess is that if were to implement it, we would not see a performance difference… however, this could degenerate in other, harder benchmarks.

        Your other change is more likely to be beneficial generally speaking. Not that it will be faster, but it will cut in the size of the binary.

        We could do a lot better by replacing the 16-bit lookup with two 8-bit lookups, but it might double the number of instructions…

  3. Here’s my attempt. Like your newest method, I construct a bit mask recording whether each of the 8 characters in a block passed or failed the test, and then using vtbl to extract the correct characters and write them with a single instruction. But I didn’t want to use a lookup table.

    I couldn’t find a simple way to construct the vtbl indices all at once, so I decided to flip the problem around. I do 16 8-character blocks at a time, and I construct the vtbl indices by doing the same operation 8 times, and then I do three rounds of zipping to put them in the correct order.

    In each of the 8 steps, I find the location of the rightmost set bit by computing popcount((b – 1) & ~b), and then I clear that bit by doing b &= b – 1.

    But it turns out to be more than twice as slow as your giant look-up table. On an iPhone 5s, in ns per operation:
    despace: 1.28
    neon_despace: 1.04
    neon_despace_branchless: 0.64
    neontbl_despace: 0.24
    neon_interleaved_despace (my function): 0.58

    I also wrote a simple test app for iOS. I posted all of this at GitHub.

    I have a new idea for computing the vtbl indices, but it probably won’t beat the look-up table.

    1. I found a method for taking an integer and separating alternate set bits. I use NEON’s polynomial multiplication feature and multiply the 8-bit integer by 0xFF, then AND the original with the product to get the 1st, 3rd, 5th, ā€¦ set bits, and AND with the original with the complement of the product to get the 2nd, 4th, 6th, ā€¦ set bits. Then I do this once more, so now I have four bytes, each with at most two set bits. Then I count the leading and trailing zeroes to get the indices of the bits.

      Doing this cut the time from 0.58 to 0.49. Unrolling the loop, doing 256 bytes at once, reduces the time to 0.37, compared with 0.24 using the look-up table.

Leave a Reply

Your email address will not be published. If you leave an email, you will be notified when there are replies. The comment form expects plain text. If you need to format your text, you can use HTML tags such <strong> and <em>. For formatting code as HTML automatically, I recommend