Modern processors have powerful vector instructions. However, some algorithms are tricky to implement using vector instructions.
I often need to prune selected values from a vector. On x64 processors, we can achieve this result using table lookups and an efficient shuffle instruction. Building up the table each time gets tiring, however.
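To make this concrete, here is a minimal sketch (my own illustration, not production code) of the table-based approach for bytes: for every possible 16-bit delete mask, we precompute a shuffle control that moves the surviving bytes to the front, and then a single _mm_shuffle_epi8 (SSSE3) applies it.

```c
#include <stdint.h>
#include <x86intrin.h> // SSSE3: _mm_shuffle_epi8

// One 16-byte shuffle control per possible delete mask: a 1 MiB table.
static uint8_t shuffle_table[1 << 16][16];

static void build_shuffle_table(void) {
  for (int mask = 0; mask < (1 << 16); mask++) {
    int pos = 0;
    for (int i = 0; i < 16; i++)
      if (!(mask & (1 << i))) shuffle_table[mask][pos++] = (uint8_t)i; // keep byte i
    for (; pos < 16; pos++) shuffle_table[mask][pos] = 0x80; // zero the tail bytes
  }
}

// Remove the bytes flagged in 'delmask', packing the survivors to the front.
static __m128i prune_bytes(__m128i x, uint16_t delmask) {
  __m128i shuf = _mm_loadu_si128((const __m128i *)shuffle_table[delmask]);
  return _mm_shuffle_epi8(x, shuf);
}
```

The shuffle itself is cheap; the pain is the large table that has to be generated and kept in cache.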
Let us consider two of my recent blog posts: Removing duplicates from lists quickly and How quickly can you remove spaces from a string? They follow the same pattern: we take a vector, identify the values that we want to remove, build a corresponding bit mask, and then remove them. In one case, we want to remove repeated values; in the other, we want to remove spaces.
Building the bit mask is efficient and takes only a few instructions:
- We can identify the values to remove using vectorized comparisons (e.g., using the intrinsics _mm_cmpeq_epi8 or _mm256_cmpeq_epi32). This is typically very inexpensive.
- We then build the bit mask from the comparison vector using what Intel calls a movemask (e.g., using the intrinsics _mm_movemask_epi8 or _mm256_movemask_ps). The movemask is relatively cheap, though it can have a high latency.
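For instance, here is a small sketch of the mask-building step for the space-removal case (SSE2, 16 bytes at a time):

```c
#include <stdio.h>
#include <x86intrin.h> // SSE2: _mm_cmpeq_epi8, _mm_movemask_epi8

// Build a 16-bit mask with bit i set when byte i of 'input' is a space.
static int space_mask(__m128i input) {
  __m128i spaces = _mm_set1_epi8(' ');
  __m128i iseq = _mm_cmpeq_epi8(input, spaces); // 0xFF where the byte is a space
  return _mm_movemask_epi8(iseq);               // one bit per byte
}

int main(void) {
  char buf[16] = "a b c d e f g h"; // 15 characters plus a null terminator
  __m128i x = _mm_loadu_si128((const __m128i *)buf);
  printf("mask = 0x%04x\n", space_mask(x)); // prints mask = 0x2aaa
  return 0;
}
```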
The pruning itself is ugly, and it requires a table lookup. I decided to publish a software library called simdprune to make it easier.
The library is quite simple. If you need to remove every other value in a 256-bit vector, you can get this result with the function call prune256_epi32(x,0b10101010).
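Here is a minimal usage sketch; I am assuming the header name simdprune.h and the convention that a set bit in the mask marks a value to be discarded (check the README for the exact convention):

```c
#include <stdio.h>
#include <x86intrin.h>
#include "simdprune.h" // https://github.com/lemire/simdprune (header name assumed)

int main(void) {
  __m256i x = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
  // Assumption: a set bit marks a 32-bit value to discard, so 0b10101010
  // drops the values at the odd positions.
  __m256i pruned = prune256_epi32(x, 0b10101010);
  int out[8];
  _mm256_storeu_si256((__m256i *)out, pruned);
  for (int i = 0; i < 4; i++)  // four values survive; the tail lanes are not meaningful
    printf("%d ", out[i]);     // under the assumed convention, prints 0 2 4 6
  printf("\n");
  return 0;
}
```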
In the example in the README of your project, shouldn’t the zero vector 0,0,0,0,0,0,0,0 be a vector of ones, 1,…,1?
Yes. Fixed. Thank you.
There was a thread on Stack Overflow about packing left from a mask, and they recommended using PEXT to pull the bits. Would that work here?
Yes. It can be made to work. It might be very useful for pruning bytes because the current solution, with a large table, is not ideal.
How to make it all come together for high efficiency is the tricky part.
Here’s the thread: http://stackoverflow.com/questions/36932240/avx2-what-is-the-most-efficient-way-to-pack-left-based-on-a-mask
They also mention VCOMPRESSPS for 32-bit values under AVX512.
I am aware of vcompress, and it is mentioned in the README of the library. It is not super useful because almost none of us has access to a processor that supports it.
The BMI code is cool.
I have added the BMI approach for benchmarking purposes and, in my tests, it is slower. The BMI instructions can be nice, but they often have high latency, so if you chain them together with data dependencies, the result is not always efficient.
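To give the flavour of the approach, here is a rough sketch of how PEXT/PDEP can prune bytes from a 64-bit word; it is an illustration of the idea, not the code used in the benchmark.

```c
#include <stdint.h>
#include <immintrin.h> // BMI2: _pdep_u64, _pext_u64

// Remove the bytes of 'word' flagged in 'delmask' (bit i set means drop
// byte i), packing the surviving bytes toward the low end of the result.
uint64_t prune_bytes_bmi2(uint64_t word, uint8_t delmask) {
  // Deposit one bit per kept byte at that byte's low position, then
  // multiply by 0xFF to widen each bit into a full 0xFF byte.
  uint64_t keep = _pdep_u64((uint8_t)~delmask, 0x0101010101010101ULL) * 0xFF;
  // Extract the kept bytes contiguously; note the PDEP->PEXT dependency chain.
  return _pext_u64(word, keep);
}
```

With PDEP feeding PEXT, the latencies add up on the critical path.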
The Ryzen instruction tables just came out on Agner’s site, and PEXT/PDEP have a latency of 18 cycles. 🙁
@KWillets
That sounds bad. On the other hand, I am not sure that PEXT/PDEP are common in software, or even that they will become common.