Suppose that you have a long sequence of bits 10101011100000… and you want to visit all the bits set to 1. That is, given 10101011100000, you would like to get the indexes of all the bits set to one: 0, 2, 4, 6, 7, 8.
In a recent blog post, I reviewed fast techniques to iterate over the position of the bits set to 1 in such a bit stream. A fast function in C to solve the problem makes use of the trailing-zero instruction found in recent x64 processors and generated by the __builtin_ctzll intrinsic in several compilers such as LLVM’s clang and GNU gcc. The trailing-zero instruction gives us the number of consecutive zeroes present in a word starting from the least significant bit (so that odd integers have a number of trailing zeros equal to 0). The code is relatively elegant:
    size_t bitmap_decode_ctz(uint64_t *bitmap, size_t bitmapsize, uint32_t *out) {
      size_t pos = 0;
      uint64_t bitset;
      for (size_t k = 0; k < bitmapsize; ++k) {
        bitset = bitmap[k];
        while (bitset != 0) {
          uint64_t t = bitset & -bitset;   // isolate the least significant set bit
          int r = __builtin_ctzll(bitset); // index of that bit within the word
          out[pos++] = k * 64 + r;
          bitset ^= t;                     // clear the bit we just visited
        }
      }
      return pos;
    }
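For instance, here is a minimal usage sketch (a quick illustration, not part of the benchmark code) that decodes the example word from the introduction:

    // Decode the introductory example: the word 0x1D5 has its bits set at
    // indexes 0, 2, 4, 6, 7 and 8.
    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
      uint64_t bitmap[1] = { UINT64_C(0x1D5) }; // bits 0, 2, 4, 6, 7, 8 set
      uint32_t out[64];
      size_t count = bitmap_decode_ctz(bitmap, 1, out);
      for (size_t i = 0; i < count; i++) {
        printf("%u ", out[i]);                  // prints: 0 2 4 6 7 8
      }
      printf("\n");
      return 0;
    }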
In the comments, powturbo (it is a pseudonym) remarked that you could solve the same problem using SIMD instructions. SIMD instructions operate on vectors of words and are often faster. I have not looked at powturbo’s code, but I happen to have an implementation of my own, inherited from my friend Nathan Kurz. It basically processes bytes one by one, looking up the indexes in a table (8kB). Omitting the table, the code is just a bit longer than the trailing-zero version:
    int bitmap_decode_avx2(uint64_t *array, size_t sizeinwords, uint32_t *out) {
      uint32_t *initout = out;
      __m256i baseVec = _mm256_set1_epi32(-1);
      __m256i incVec = _mm256_set1_epi32(64);
      __m256i add8 = _mm256_set1_epi32(8);
      for (size_t i = 0; i < sizeinwords; ++i) {
        uint64_t w = array[i];
        if (w == 0) {
          baseVec = _mm256_add_epi32(baseVec, incVec); // skip an empty word
        } else {
          for (int k = 0; k < 4; ++k) { // process the word two bytes at a time
            uint8_t byteA = (uint8_t)w;
            uint8_t byteB = (uint8_t)(w >> 8);
            w >>= 16;
            __m256i vecA = _mm256_load_si256((const __m256i *)vt[byteA]);
            __m256i vecB = _mm256_load_si256((const __m256i *)vt[byteB]);
            uint8_t advanceA = lengthTable[byteA]; // number of set bits in byteA
            uint8_t advanceB = lengthTable[byteB];
            vecA = _mm256_add_epi32(baseVec, vecA);
            baseVec = _mm256_add_epi32(baseVec, add8);
            vecB = _mm256_add_epi32(baseVec, vecB);
            baseVec = _mm256_add_epi32(baseVec, add8);
            _mm256_storeu_si256((__m256i *)out, vecA);
            out += advanceA;
            _mm256_storeu_si256((__m256i *)out, vecB);
            out += advanceB;
          }
        }
      }
      return out - initout;
    }
(At the expense of some performance, you can use a smaller 2kB table.)
As usual, it looks like gibberish if you do not know about Intel intrinsics, but trust me: this code does nothing complicated. It is basically just memoization.
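To make the memoization concrete, here is a sketch of how the omitted tables could be built. The names vt and lengthTable come from the code above, but the exact layout and the 32-byte alignment are assumptions for illustration, not the repository's actual declarations: vt[b] stores the 1-based positions of the set bits of byte b (which is why baseVec starts at -1), and lengthTable[b] stores its popcount.

    // Sketch of table generation (layout assumed): vt[b] holds eight 32-bit lanes
    // with the 1-based positions of the set bits of byte b, padded with zeros;
    // lengthTable[b] is the number of set bits in b.
    #include <stdint.h>

    static uint32_t vt[256][8] __attribute__((aligned(32))); // 256 * 8 * 4 bytes = 8kB
    static uint8_t lengthTable[256];

    static void build_tables(void) {
      for (int b = 0; b < 256; b++) {
        int n = 0;
        for (int bit = 0; bit < 8; bit++) {
          if (b & (1 << bit)) {
            vt[b][n++] = (uint32_t)(bit + 1); // 1-based position within the byte
          }
        }
        lengthTable[b] = (uint8_t)n;          // how far to advance the output pointer
        while (n < 8) {
          vt[b][n++] = 0;                     // padding lanes; their stores get overwritten
        }
      }
    }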
So how does the vectorized version (using SIMD instructions) do against the trailing-zero version? Here are approximate timings on a Skylake processor:
| density | trailing-zero | SIMD |
|---|---|---|
| 0.0625 | ~6.5 cycles per int | ~6 cycles per int |
| 0.125 | ~5 cycles per int | ~3 cycles per int |
| 0.25 | ~4 cycles per int | ~2 cycles per int |
| 0.5 | ~3 cycles per int | ~1.25 cycles per int |
| 0.9 | ~3 cycles per int | ~0.4 cycles per int |
For dense bitstreams, my timings are accurate, but they grow more and more inaccurate for sparser bitstreams. Nevertheless, the pattern is clear: if you have dense bitstreams, the SIMD code is about twice as fast. The gains are higher when the bitmaps are very dense. However, the gains disappear when the bit stream is sparse…
I should point out that these timings are not directly comparable with those from my earlier blog post because I write the indexes to an array instead of calling a function for each index. Thus I am currently solving a slightly easier problem.
Note: The “compress” instructions in AVX-512 (e.g., vpcompressd) could make this problem easier. Sadly, AVX-512 instructions are not yet widely available on commodity processors.
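For the curious, here is a rough, untested sketch of how such a compress-based decoder might look; it is an assumption on my part, not code from the repository. Each 16-bit chunk of the word is used as a mask, and vpcompressd packs the matching lane indexes to the front of a vector.

    // Rough AVX-512 sketch (requires AVX-512F): for each 16-bit chunk, compress the
    // lane indexes whose bit is set to the front of the vector and store them.
    // Note that full 64-byte vectors are stored, so the output buffer needs slack.
    #include <immintrin.h>
    #include <stdint.h>

    size_t bitmap_decode_avx512(const uint64_t *bitmap, size_t bitmapsize, uint32_t *out) {
      uint32_t *initout = out;
      const __m512i lane = _mm512_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7,
                                             8, 9, 10, 11, 12, 13, 14, 15);
      for (size_t k = 0; k < bitmapsize; k++) {
        uint64_t w = bitmap[k];
        for (int chunk = 0; chunk < 4; chunk++) {          // four chunks of 16 bits
          __mmask16 mask = (__mmask16)(w >> (16 * chunk));
          __m512i base = _mm512_set1_epi32((int)(k * 64 + 16 * chunk));
          __m512i idx = _mm512_add_epi32(base, lane);
          _mm512_storeu_si512(out, _mm512_maskz_compress_epi32(mask, idx));
          out += __builtin_popcount(mask);                 // keep only the set positions
        }
      }
      return (size_t)(out - initout);
    }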
Congratulations! Your lookup tables are quite large for AVX2.
In TurboPFor:Integer Compression I’m using a 2k table instead of an 8k table and converting the mask with: _mm256_cvtepu8_epi32(_mm_cvtsi64_si128(*(uint64_t *)(vecDecodeTable[byteA]))).
I’m also using popcount instead of a length table for advanceA/B.
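For illustration, the per-byte step of the approach described above might look roughly like this (a sketch with an assumed table name and layout, not the actual code in TurboPFor or in the repository):

    // 2kB variant: each table entry holds eight bytes with the 1-based positions
    // of the set bits, widened to 32-bit lanes on the fly; popcount replaces the
    // length table.
    #include <immintrin.h>
    #include <stdint.h>

    static uint8_t smallDecodeTable[256][8]; // 256 * 8 bytes = 2kB

    static inline uint32_t *decode_byte(uint8_t byte, __m256i baseVec, uint32_t *out) {
      __m256i vec = _mm256_cvtepu8_epi32(
          _mm_cvtsi64_si128(*(const uint64_t *)smallDecodeTable[byte]));
      vec = _mm256_add_epi32(baseVec, vec);  // add the running base offset
      _mm256_storeu_si256((__m256i *)out, vec);
      return out + __builtin_popcount(byte); // advance by the number of set bits
    }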
For comparison purposes, I have added “turbo” versions of the functions to my repository which follow your description.
It’d be great if you checked performance for higher densities, like 0.75, 0.80, 0.99.
I have added density 0.9 to the blog post. The results are very positive.
Thanks, that’s really great.
Thanks for sharing, I love these types of articles.
I’m assuming the number of bitmap passes is large enough to justify pulling the 8kB lookup table into cache for the use case?
What would this be used for?
This comes up often enough. For example, I expect we could speed up Roaring bitmaps.
An application.
Assume you want to find the union of some index sets, and also assume the union should be sorted. This is relevant when doing a so-called symbolic Cholesky factorization.
Most likely you will have a bit vector that marks whether an element is in the union. If you have the bit vector, it might be more efficient to generate the sorted union from it than to do a quicksort.
You are correct.
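To make the idea concrete, here is a small sketch that builds the union through a bitmap and then decodes it with bitmap_decode_ctz from the post; the function name and parameters are made up for illustration.

    // Sketch: OR the index sets into a bitmap, then decode it; the decoded output
    // is the union in sorted order, with no explicit sort.
    #include <stdint.h>
    #include <string.h>

    size_t sorted_union(const uint32_t **sets, const size_t *sizes, size_t nsets,
                        size_t universe, uint32_t *out) {
      size_t words = (universe + 63) / 64;
      uint64_t bitmap[words];                          // variable-length array for brevity
      memset(bitmap, 0, words * sizeof(uint64_t));
      for (size_t s = 0; s < nsets; s++) {
        for (size_t i = 0; i < sizes[s]; i++) {
          uint32_t v = sets[s][i];
          bitmap[v / 64] |= UINT64_C(1) << (v % 64);   // mark membership
        }
      }
      return bitmap_decode_ctz(bitmap, words, out);    // yields the sorted union
    }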
I know this is an older post and I hope no-one minds the late comment. First of all: thank you (and powturbo) so much for posting this! I was in the process of optimising a genetic algorithm and identified a part of it as the bottleneck, where I was iterating over the set bits of a lot of 64-bit integers. My original approach was pretty much what you described in your previous post (taking advantage of tzcnt). Replacing that with a version similar to what you outlined here single-handedly brought down the runtime for evaluating a single generation from 3.5 s to 0.6 s. I had to run multiple tests, comparing the results against my previous implementation, because I couldn’t believe how fast it was all of a sudden. So again: thank you very much!
Having looked at the code you posted on GitHub, I have one (probably naive) question: it seems like there, you’re explicitly aligning the lookup tables to 16 bytes and then using aligned loads to read 256-bit vectors from the table. I was under the impression that aligned 256-bit loads required 32-byte alignment, but I’m really not that experienced in that area and was curious where my misunderstanding lies.
You are correct. The alignment should match the vector width.