Quickly pruning elements in SIMD vectors using the simdprune library

Modern processors have powerful vector instructions. However, some algorithms are tricky to implement using vector instructions.

I often need to prune selected values from a vector. On x64 processors, we can achieve this result using table lookups and an efficient shuffle instruction. Building up the table each time gets tiring, however.

Let us consider two of my recent blog posts: “Removing duplicates from lists quickly” and “How quickly can you remove spaces from a string?” They follow the same pattern: we take a vector, identify the values that we want to remove, build a corresponding bit mask, and then remove them. In one case, we want to remove repeated values; in the other, we want to remove spaces.

Building the bit mask is efficient and takes only a few instructions:

  • We can identify the values to remove using vectorized comparisons (e.g., using the intrinsics _mm_cmpeq_epi8 or _mm256_cmpeq_epi32). This is typically very inexpensive.
  • We then build the bit mask from the comparison vector using what Intel calls a movemask (e.g., using the intrinsics _mm_movemask_epi8 or _mm256_movemask_ps). The movemask is relatively cheap, though it can have a high latency.

The pruning itself is ugly, and it requires a table lookup. I decided to publish a software library called simdprune to make it easier.

The library is quite simple. If you need to remove every other value in a 256-bit vector, you can get this result with the function call prune256_epi32(x,0b10101010).

Link: https://github.com/lemire/simdprune

9 thoughts on “Quickly pruning elements in SIMD vectors using the simdprune library”

  1. In the example of the README of your project, shouldn’t the zero vector 0,0,0,0,0,0,0,0 be a one vector 1,…,1?

  2. There was a thread on Stack Exchange about packing left from a mask, and they recommended using PEXT to pull the bits. Would that work here?

    1. Yes. It can be made to work. It might be very useful for pruning bytes because the current solution, with a large table, is not ideal.

      How to make it all come together for high efficiency is the tricky part.

          1. I have added, for benchmarking purposes, the BMI approach and, in my tests, it is slower. The BMI instructions can be nice, but they often have high latency, so if you chain them together with data dependencies, the result is not always efficient.

            1. Ryzen instruction tables just came out on Agner’s site, and PEXT/PDEP have a latency of 18 cycles. 🙁
