Suppose that you wish to access values in an array of size n, but instead of having indexes in [0,n), you have arbitrary non-negative integers. This sort of problem arises when you build a hash table or another array-backed data structure.
The naive approach to this problem is to use the remainder of the division by n (sometimes called a modulo reduction):
uint32_t access(uint32_t * array, size_t n, size_t index) { return array[index % n]; }
However, if the compiler cannot inline this call and determine the value of n, then this code is likely to compile to a division instruction. Division instructions are among the slowest instructions on modern-day processors.
To avoid division, many people assume that n is a power of two. They can then use the mask trick: for such n, i & (n - 1) is equal to i % n.
uint32_t access(uint32_t * array, size_t n, size_t index) { return array[index & ( n - 1 )]; }
That’s typically much faster. You do pay a price, however: your array lengths must all be powers of two. It is a small price to pay, but a price nonetheless.
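To make that requirement concrete, here is a minimal sketch (my own illustration, not from the post) of how an array-backed structure might round a requested capacity up to the next power of two; in the worst case this nearly doubles the memory footprint:

```c
#include <stddef.h>

// Hypothetical helper: round x up to the next power of two.
// Assumes x does not exceed half of SIZE_MAX (otherwise the shift overflows).
static size_t next_power_of_two(size_t x) {
  size_t result = 1;
  while (result < x) {
    result <<= 1; // may waste close to half the allocated memory in the worst case
  }
  return result;
}
```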
Another approach I like is the multiply-shift technique, a fast alternative to the modulo reduction (see the fastrange library). It involves a multiplication followed by a shift: if index is a 32-bit value, then index * n is less than n * 2^32, so (index * n) >> 32 necessarily falls in [0, n).
uint32_t access(uint32_t * array, uint64_t n, uint32_t index) { return array[(index * n)>>32]; }
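As a quick sanity check (my own sketch, not part of the original post), one can verify over a sample of 32-bit indexes that the reduced value always lands in [0, n):

```c
#include <assert.h>
#include <stdint.h>

// Multiply-shift reduction of a 32-bit index to [0, n), fastrange-style.
static uint32_t reduce(uint32_t index, uint32_t n) {
  return (uint32_t)(((uint64_t)index * n) >> 32);
}

int main(void) {
  uint32_t n = 1000; // an arbitrary, non-power-of-two length
  for (uint64_t i = 0; i <= UINT32_MAX; i += 1000003) { // sample of indexes
    assert(reduce((uint32_t)i, n) < n);
  }
  return 0;
}
```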
Undeniably, the masked approach ought to be faster. You cannot get much faster than a bitwise AND. It is nearly a free instruction on modern processors. Multiplications and shifts are typically more expensive.
But let us measure the throughput of these operations. One thing to take into account is that if you have to do more than one such access, the compiler can vectorize the computation. That is, it can emit fast instructions that do several multiplications at once. On x64 processors, the vpmuludq instruction can do four full 32-bit by 32-bit multiplications at once.
Let us try it out with the GCC compiler (5.5):
|                | no-AVX2    | AVX2       |
| -------------- | ---------- | ---------- |
| modulo         | 8 cycles   | 8 cycles   |
| multiply-shift | 2.2 cycles | 1.5 cycles |
| mask           | 1.7 cycles | 1.7 cycles |
Clearly, my particular compiler does a poor job of optimizing the masked approach: it should be able to beat the multiply-shift approach. Yet what I think should be evident is that the masked approach is not massively more efficient than the multiply-shift approach, and it might even be slower (depending on your compiler).
In effect, it may not be warranted to force your array lengths to be a power of two. With careful engineering, you might get much of the same performance with any array length. It is especially likely to be true if you often access several values at once in your array, because you can rely on vectorization.
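For reference, a mask-based counterpart to the benchmarked sum might look like the following (my own sketch, not the code from the benchmark); a compiler can autovectorize this loop just as it can the multiply-shift version:

```c
// Sum array values at masked indexes; N must be a power of two.
uint32_t masksum(uint32_t * z, uint32_t N, uint32_t * accesses, uint32_t nmbr) {
  uint32_t sum = 0;
  uint32_t mask = N - 1;
  for (uint32_t j = 0; j < nmbr; ++j) {
    sum += z[accesses[j] & mask];
  }
  return sum;
}
```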
Relying on the compiler to do some vectorization magic is fine in most instances, but what if you want more control? My original code looks like this…
```c
uint32_t fastsum(uint32_t * z, uint32_t N, uint32_t * accesses, uint32_t nmbr) {
  uint32_t sum = 0;
  uint64_t N64 = (uint64_t) N;
  for (uint32_t j = 0; j < nmbr; ++j) {
    sum += z[(accesses[j] * N64) >> 32];
  }
  return sum;
}
```
Here is a version with Intel intrinsics:
```c
#include <immintrin.h> // AVX2 intrinsics

uint32_t vectorsum(uint32_t * z, uint32_t N, uint32_t * accesses, uint32_t nmbr) {
  __m256i Nvec = _mm256_set1_epi32(N);
  __m128i sum = _mm_setzero_si128();
  for (uint32_t j = 0; j < nmbr; j += 4) {
    // load four 32-bit indexes and zero-extend them into the low half of each 64-bit lane
    __m256i fourints = _mm256_cvtepu32_epi64(_mm_loadu_si128((__m128i const *)(accesses + j)));
    // four full 32-bit by 32-bit multiplications (vpmuludq)
    __m256i f4 = _mm256_mul_epu32(fourints, Nvec);
    // keep the high 32 bits of each product: the reduced indexes
    __m256i ft = _mm256_srli_epi64(f4, 32);
    // gather the four array values at those indexes
    __m128i fi = _mm256_i64gather_epi32((int const *) z, ft, 4);
    sum = _mm_add_epi32(sum, fi);
  }
  uint32_t buffer[4];
  _mm_storeu_si128((__m128i *) buffer, sum);
  return buffer[0] + buffer[1] + buffer[2] + buffer[3];
}
```
The catch is that you have to process values four at a time.
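If the number of accesses is not a multiple of four, one way to cope (my own sketch, not from the original code) is to handle the full groups with vectorsum and finish the remainder with the scalar multiply-shift loop:

```c
// Combine the vectorized sum with a scalar tail for leftover elements.
uint32_t sum_any_count(uint32_t * z, uint32_t N, uint32_t * accesses, uint32_t nmbr) {
  uint32_t groups = nmbr - (nmbr % 4); // largest multiple of four
  uint32_t sum = vectorsum(z, N, accesses, groups);
  uint64_t N64 = (uint64_t) N;
  for (uint32_t j = groups; j < nmbr; ++j) {
    sum += z[(accesses[j] * N64) >> 32];
  }
  return sum;
}
```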
It seems like your modulo-approximation trick is very useful. I am certainly going to switch to using it at some point. It should make all the hashing tricks much faster.
Unless my memory is flawed, I think gather instructions are not particularly efficient, microarchitecturally, on current Core CPUs. (Intel has hinted that this might improve in the future, but under current architectural assumptions it would be very hard to fetch data from more than two cache lines per cycle anyway.) Still, this should be offset by vectorisation of the other operations (and the resulting reduction in instruction count, which lets many iterations be reordered), if there are enough of them…
Gather is reasonably efficient on Skylake (and not terrible on Broadwell). It is definitely still limited by the 2 loads per cycle, but it is comparable to scalar loads now without much cost for getting all the elements into the vector register. So if you actually wanted the loaded values in a vector register, it is “ideal”.
There isn’t any optimization of overlapping or duplicate values or anything though, so if many of your indices are the same, you don’t get any speedup.
Interestingly, looking at Agner Fog’s instruction tables, Skylake client would seem to have (at least latency/uop-wise) gather instructions roughly as efficient as one might expect from a good implementation on that microarchitecture. At the same time, Skylake-X doesn’t perform as well, and earlier architectures also do significantly worse than Skylake client. Anyway, on Skylake Intel has fulfilled their promise! (Then again… on AMD Ryzen, gather instructions are pretty awful.)
Reducing the number of cache accesses, especially for the hash table use case, doesn’t seem like a particularly attractive proposition. The custom logic would be pretty heavy, and the benefit would materialise mostly for hash tables of a couple of kilobytes…
It got worse on Skylake-X? That’s weird…