Arbitrary byte-to-byte maps using ARM NEON?

Modern processors have fast instructions that can operate on wide registers (e.g., 128-bit). ARM processors, the kind of processors found in your phone, have such instructions, called “NEON”. Sebastian Pop pointed me to some of his work doing fast string transformations using NEON instructions. Sebastian has done some great work to accelerate the PHP interpreter on ARM processors. One of his recent optimizations is a way to change the case of strings quickly.

This suggested the following problem to me. Suppose that you have a stream of bytes and you want to transform the byte values in an arbitrary manner: maybe you want to map the byte value 1 to the byte value 12, the byte value 2 to the byte value 53… and so forth.

Here is how you might implement such a function in plain C:

for (size_t i = 0; i < volume; i++) {
  values[i] = map[values[i]];
}

For each byte, you need two loads (one to read values[i] and another to read map[values[i]]) and one store, assuming that the compiler does not do any magic.

To implement such a function on blocks of 16 bytes with NEON, we use the vqtbl4q_u8 intrinsic, which is essentially a way to do 16 independent look-ups in a 64-byte table. It uses the least significant 6 bits of each byte as a look-up index; if either of the two upper bits is non-zero, it outputs zero for that byte. Because there are 256 possible byte values, we need four distinct calls to vqtbl4q_u8. One of them gives non-zero results for byte values in [0,64), another one for byte values in [64,128), another one for byte values in [128,192), and a final one for byte values in [192,256). We select the right range by flipping the upper bits with a bitwise XOR (the veorq_u8 intrinsic), which brings the bytes of interest into [0,64) and pushes everything else out of range. Finally, we just need to glue the four results back together with bitwise ORs (the vorrq_u8 intrinsic).

uint8x16_t simd_transform16(uint8x16x4_t * table, uint8x16_t input) {
  // each lookup covers one 64-byte range; out-of-range indices yield zero
  uint8x16_t t1 = vqtbl4q_u8(table[0], input);                              // [0,64)
  uint8x16_t t2 = vqtbl4q_u8(table[1], veorq_u8(input, vdupq_n_u8(0x40)));  // [64,128)
  uint8x16_t t3 = vqtbl4q_u8(table[2], veorq_u8(input, vdupq_n_u8(0x80)));  // [128,192)
  uint8x16_t t4 = vqtbl4q_u8(table[3], veorq_u8(input, vdupq_n_u8(0xc0)));  // [192,256)
  return vorrq_u8(vorrq_u8(t1, t2), vorrq_u8(t3, t4));
}
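
For completeness, here is a rough sketch (with a hypothetical helper name) of how the four 64-byte tables can be filled from a 256-byte map and applied to a whole buffer; the loading scheme and the scalar tail are illustrative, not necessarily the fastest approach:

#include <arm_neon.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical driver: fill the four 64-byte tables from a 256-byte map,
// then transform the buffer 16 bytes at a time, with a scalar tail.
void neon_map_bytes(const uint8_t map[256], uint8_t *values, size_t volume) {
  uint8x16x4_t table[4];
  for (int t = 0; t < 4; t++) {
    for (int r = 0; r < 4; r++) {
      table[t].val[r] = vld1q_u8(map + 64 * t + 16 * r);
    }
  }
  size_t i = 0;
  for (; i + 16 <= volume; i += 16) {
    vst1q_u8(values + i, simd_transform16(table, vld1q_u8(values + i)));
  }
  for (; i < volume; i++) {
    values[i] = map[values[i]];  // scalar fallback for the last few bytes
  }
}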

In terms of loads and stores, assuming that you have enough registers to hold the tables, you only need one load and one store per block of 16 bytes.

A more practical scenario might be to assume that all my byte values fit in [0,128), as is the case with a stream of ASCII characters…

uint8x16_t simd_transform16_ascii(uint8x16x4_t * table, uint8x16_t input) {
  // only two lookups are needed when every byte is in [0,128)
  uint8x16_t t1 = vqtbl4q_u8(table[0], input);                              // [0,64)
  uint8x16_t t2 = vqtbl4q_u8(table[1], veorq_u8(input, vdupq_n_u8(0x40)));  // [64,128)
  return vorrq_u8(t1, t2);
}
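
As a concrete (hypothetical) usage example, here is how one might upper-case a block of 16 ASCII bytes with this two-lookup variant; in real code, the table would of course be built once rather than per call:

#include <arm_neon.h>
#include <stdint.h>

// Hypothetical usage: upper-case 16 ASCII bytes in place. Only table[0]
// and table[1] are ever consulted because every input byte is below 128.
void toupper16(uint8_t *text) {
  uint8_t map[256];
  for (int c = 0; c < 256; c++) {
    map[c] = (c >= 'a' && c <= 'z') ? (uint8_t)(c - 32) : (uint8_t)c;
  }
  uint8x16x4_t table[2];
  for (int t = 0; t < 2; t++) {
    for (int r = 0; r < 4; r++) {
      table[t].val[r] = vld1q_u8(map + 64 * t + 16 * r);
    }
  }
  vst1q_u8(text, simd_transform16_ascii(table, vld1q_u8(text)));
}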

To test it out, I wrote a benchmark which I ran on a Cortex A72 processor. My source code is available. I get a sizeable speed bump when I use NEON with an ASCII input, but the general NEON scenario is slower than a plain C version.

plain C       1.15 ns/byte
neon          1.35 ns/byte
neon (ascii)  0.71 ns/byte

What about Intel and AMD processors? Most of them do not have instructions for 64-byte lookup tables: with pshufb, they are limited to 16-byte tables. For wider vectorized lookup tables, we need to wait for AVX-512 instructions (specifically, the VBMI extension). Unfortunately, AVX-512 is only available on some Intel processors and it is unclear when it will appear on AMD processors.
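
For the curious, here is a rough, untested sketch of what a full 256-entry lookup could look like with the AVX-512 VBMI and BW extensions (vpermi2b indexes a 128-byte table); treat the function name and structure as an assumption rather than a benchmarked implementation:

#include <immintrin.h>
#include <stdint.h>

// Sketch only: requires AVX512-VBMI and AVX512-BW. vpermi2b indexes a
// 128-byte table, so two of them plus a blend on the top bit cover the
// full 256-byte map, transforming 64 bytes at a time.
__m512i avx512_transform64(const uint8_t map[256], __m512i input) {
  __m512i t0 = _mm512_loadu_si512(map);        // map[0..63]
  __m512i t1 = _mm512_loadu_si512(map + 64);   // map[64..127]
  __m512i t2 = _mm512_loadu_si512(map + 128);  // map[128..191]
  __m512i t3 = _mm512_loadu_si512(map + 192);  // map[192..255]
  __m512i lo = _mm512_permutex2var_epi8(t0, input, t1);  // uses index bits [6:0]
  __m512i hi = _mm512_permutex2var_epi8(t2, input, t3);
  __mmask64 top = _mm512_movepi8_mask(input);  // most significant bit of each byte
  return _mm512_mask_blend_epi8(top, lo, hi);  // pick hi where that bit is set
}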

Published by Daniel Lemire, a computer science professor at the University of Quebec (TELUQ).

16 thoughts on “Arbitrary byte-to-byte maps using ARM NEON?”

  1. The TBX instruction (as opposed to TBL) is designed for this purpose and saves you doing the merge operations.

    I suppose you could try a bunch of VPSHUFB’s on x86, but it’s not quite as efficient; might be enough to beat scalar code perhaps? TBL4/TBX4 isn’t exactly fast on ARM, so the shuffles on x86 may have a chance…

    Fast arbitrary 8-bit->8-bit mapping is nice, but I think only AVX512-VBMI can make it efficient.

    1. The TBX instruction (as opposed to TBL) is designed for this purpose
      and saves you doing the merge operations.

      In my tests, it is slower. Here is what I tried…

      uint8x16_t simd_transform16x(uint8x16x4_t * table, uint8x16_t input) {
        uint8x16_t  t1 = vqtbx4q_u8(input, table[0],  input);
        t1 = vqtbx4q_u8(t1, table[1],  veorq_u8(input, vdupq_n_u8(0x40)));
        t1 = vqtbx4q_u8(t1, table[2],  veorq_u8(input, vdupq_n_u8(0x80)));
        t1 = vqtbx4q_u8(t1, table[3],  veorq_u8(input, vdupq_n_u8(0xc0)));
        return t1;
      }
      

      Am I misusing TBX…?

      1. The first call should be TBL, not TBX. TBX’s destination register is read+modify, so there’d be a forced move since you still refer to ‘input’ later on.
        Other than that, it looks fine, and I don’t know why it’d be slower.

        I’m not familiar with the OoO behaviour on the A72. TBL and TBX are the same speed, but TBX does force a dependency chain. Usually doesn’t matter on OoO processors, because they can schedule it in parallel with the next loop iteration.
        According to the ARM optimization manual, TBL4/TBX4 have a latency of 15 cycles, but no throughput info is given, so you really do want to try to get the instructions to run in parallel.

        Maybe you could test with a combination of the two, e.g.:

        uint8x16_t simd_transform16x2(uint8x16x4_t * table, uint8x16_t input) {
          uint8x16_t t1 = vqtbl4q_u8(table[0], input);
          t1 = vqtbx4q_u8(t1, table[1], veorq_u8(input, vdupq_n_u8(0x40)));
          uint8x16_t t2 = vqtbl4q_u8(table[2], veorq_u8(input, vdupq_n_u8(0x80)));
          t2 = vqtbx4q_u8(t2, table[3], veorq_u8(input, vdupq_n_u8(0xc0)));
          return vorrq_u8(t1, t2);
        }

        1. Usually doesn’t matter on OoO processors, because they can schedule it
          in parallel with the next loop iteration.

          At 15 cycles of latency per TBX4, that’s a full 60 cycles of latency for the whole chain. I don’t know what the OoO window is on the A72, but it is hard to imagine that you can totally hide such latency.

          Even if you break it into two distinct parts… you are still left with 30 + 3 = 33 cycles of latency. It makes it really hard to go really fast unless you have a crazily long instruction window…

          (I’ll run more tests.)

        2. Your new approach is better but still apparently slower than what I describe in my blog post…

          transform(map, values,volume)                               :  4959 ns total,  1.21 ns per input key
          neon_transform(map, values,volume)                          :  5542 ns total,  1.35 ns per input key
          neon_transformx(map, values,volume)                         :  7583 ns total,  1.85 ns per input key
          neon_transformx2(map, values,volume)                        :  5833 ns total,  1.42 ns per input key
          
          1. On an Amazon instance, I get slightly better results (different compiler, however)… with a 2% gain for your approach… (it appears to be a genuine 2% gain).

            transform(map, values,volume)                               :  3612 ns total,  0.88 ns per input key
            neon_transform(map, values,volume)                          :  4344 ns total,  1.06 ns per input key
            neon_transformx(map, values,volume)                         :  6036 ns total,  1.47 ns per input key
            neon_transformx2(map, values,volume)                        :  4248 ns total,  1.04 ns per input key
            
            1. This page states that the A72 has a 128-entry ROB, so a 60-cycle latency chain might be problematic? This page suggests that the throughput for TBL4 is 2 per clock (no clue on accuracy), which seems to suggest that you’d need 30(!) lookups in parallel to maximise throughput.

              Your test results (thanks for posting them) seem to suggest that the latency is a problem. I suppose, since the compiler probably isn’t doing it, you could manually unroll the loop and interleave the TBL instructions to help the processor run stuff in parallel. Unrolling 4 times may be enough – it should achieve the same level of concurrency, but reduce the need for merging instructions.
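
              Something along these lines (an untested sketch reusing simd_transform16 and the table/values/volume names from the post):

              for (size_t i = 0; i + 64 <= volume; i += 64) {
                uint8x16_t b0 = vld1q_u8(values + i);
                uint8x16_t b1 = vld1q_u8(values + i + 16);
                uint8x16_t b2 = vld1q_u8(values + i + 32);
                uint8x16_t b3 = vld1q_u8(values + i + 48);
                // four independent TBL4 chains can now be in flight at once
                vst1q_u8(values + i,      simd_transform16(table, b0));
                vst1q_u8(values + i + 16, simd_transform16(table, b1));
                vst1q_u8(values + i + 32, simd_transform16(table, b2));
                vst1q_u8(values + i + 48, simd_transform16(table, b3));
              }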

              1. The version I present in my blog post with TBL4 does not have a 30 cycle latency. It can do four TBL4… somewhat in parallel… and then it needs to “OR” the results together (since there are four of them, this requires 3 ORs, but two of them can be done in parallel).

                TBX4 can save us one bitwise OR, so it should be faster, at least in theory, because it reduces the instruction count. However, it comes at a cost: longer dependency chains.

                Doing more lookups in parallel should help, but as you observe, a 60-cycle dependency chain is really hard to hide. So the question as to how useful TBX is compared to TBL remains open. The evidence so far suggests that TBX is only moderately useful.

              2. I find it a bit hard to believe that the TBL4 implementations all have 2 per clock throughput. A TBL4 would need to read from 5 source registers (the 4 table registers and the control), so 2 per cycle means 10 vector register reads per cycle, which is a huge amount – plus whatever other reads you do on other vector units at the same time. That’s more read ports than much bigger contemporary chips have.

                Someone should test it…

                1. The throughput definitely does seem weird. The official A72 optimization guide explicitly leaves them out under AArch64, though it does specify 2 per clock for AArch32 operation (4x 64-bit registers) *. My guess is that the post just went with the 2 per clock figure for all TBL/TBX instructions.

                  * Even if we interpret 4x 64-bit registers as 2x 128-bit, a VTBX4 in AArch32 would require 4 vector reads (2x source table + source indices + destination (I assume it needs to read the destination to blend in the bytes?)) per instruction, so doing 2x VTBX4 per clock would mean 8 reads/clock. Entirely possible that the guide is wrong, though.

                  Considering that ORR is a very fast operation and the cost of TBL4/TBX4 is so large, I don’t expect too much of a gain from TBX, but I imagine that there should be one if you can get it to parallelize well.

  2. I think you have enough registers in NEON (64-bit) for the full 256->256 lookup: 32 128-bit registers. So your 16 table registers fit easily, and only a handful of extras are needed for temporaries, etc.

    It’s slow just because it needs 2x as many tbl instructions, and those instructions dominate runtime. Indeed, it is approximately 2x as slow, as you’d expect.

        1. Yeah both 32-bit and 64-bit ARM have 32 NEON registers, but in the 32-bit case they are only 64-bit wide.

          1. So NEON is somewhat close to AVX in terms of total register space, at least if you just look at the ISA. Right?

            That is, I can virtually glue together pairs of NEON 128-bit registers and make myself sixteen 256-bit registers “à la AMD”.

            1. Yes, it is identical in the sense that they both have 512 bytes of register space: either 32 x 16 bytes or 16 x 32 bytes.

              AVX-512 quadruples that (!!) to 2048 bytes: 32 x 64 bytes.

              That is, I can virtually glue together pairs of NEON 128-bit registers
              and make myself sixteen 256-bit registers “à la AMD”.

              Well, logically, yes. You can glue together any number of registers “in software” to create a longer “register”. It’s quite a bit different from what AMD did, though, in the sense that they did it in hardware. In software, you need N instructions to execute one wider “meta instruction” when you glue together N registers. In hardware, you only need 1: the expansion to N operations happens internally.
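
              As a tiny illustration (a sketch with a made-up pair type, not an actual ISA feature): a “256-bit” add glued together in software is just two 128-bit NEON adds, one per glued register:

              typedef struct { uint8x16_t lo, hi; } u8x32;  // hypothetical glued "256-bit register"

              static inline u8x32 add_u8x32(u8x32 a, u8x32 b) {
                // N = 2 glued registers means N = 2 instructions for one "wide" add
                u8x32 r = { vaddq_u8(a.lo, b.lo), vaddq_u8(a.hi, b.hi) };
                return r;
              }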

              In many cases the hardware approach is much more efficient, since with so many instructions you can run into front-end limitations. This is a primary reason why CPU SIMD gets wider, rather than simply adding more execution units at the same width. That is, we got AVX-512 rather than simply 2x as many 256-bit units, even though 2x as many units is basically strictly more flexible: it is too hard to keep all those units fed at the front end; the CPU would need to be very wide.

              GPUs have taken the opposite approach, with ever-increasing numbers of smallish-width EUs, which now number in the 1000s on the fastest chips. They can do that because the whole execution model is quite different.

              It’s worth noting that although we are calling it “à la AMD”, Intel used the same strategy for SSE and AVX: AVX-512 was the first time since MMX that they didn’t release the initial chips after a width expansion with half-width EUs.
