For greater speed, try batching your out-of-cache data accesses

In software, we use hash tables to implement sets and maps. A hash table works by first mapping a key to a random-looking address in an array.
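
For instance, the addressing step might look like the following minimal sketch (collision handling omitted; it reuses the murmur32 mixer defined a little further down):

size_t slot_for_key(uint32_t key, size_t capacity) {
  uint32_t h = murmur32(key);  // random-looking 32-bit value
  return (size_t)h % capacity; // index into the backing array, in [0, capacity)
}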

In a recent series of blog posts (1, 2, 3), I have documented the fact that precomputing the hash values often accelerates hash tables.

Some people thought that I was merely making the trivial point that precomputing the hash values saved you the time to compute the hash values. That is true, but there is more to it.

On recent Intel processors, batching your load requests can be very helpful. Let me illustrate with some code.

I am going to use a simple hash function that takes integer values and returns “mixed up” values:

uint32_t murmur32(uint32_t h) {
  h ^= h >> 16;
  h *= UINT32_C(0x85ebca6b);
  h ^= h >> 13;
  h *= UINT32_C(0xc2b2ae35);
  h ^= h >> 16;
  return h;
}

This function is not very expensive, but it is efficient at generating random-looking outputs.

In a tight loop on a recent Intel processor, it takes between 7 and 8 cycles per value to compute these hash values (reduced modulo the array size) and store them.
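
The timed loop is presumably along these lines (a sketch; populaterandom is the routine that appears in the benchmark numbers quoted in the comments below):

void populaterandom(uint32_t *indexes, size_t size) {
  for (size_t k = 0; k < size; ++k) {
    indexes[k] = murmur32(k) % size; // compute, reduce modulo the array size, store
  }
}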

Let us put this function to good use: we will randomly pick values in a large array and sum them up:

uint64_t sumrandom(uint64_t *values, size_t size) {
  uint64_t sum = 0;
  for (size_t k = 0; k < size; ++k) {
    sum += values[murmur32(k) % size ];
  }
  return sum;
}

You would expect the bulk of the time needed to execute this code to be spent on data accesses (given a large array). And indeed, for arrays exceeding my cache, it takes about 46 cycles per value to compute the sum.

So it means that about 40 cycles are due to the random look-up of the values.

Is that right?

Let us do something more complicated, where we first compute the hash values and then we do the look-ups…

uint64_t sumrandomandindexes(uint64_t *values, uint32_t *indexes, size_t size) {
  uint64_t sum = 0;
  for (size_t k = 0; k < size; ++k) {
    indexes[k] = murmur32(k) % size ;
  }
  for (size_t k = 0; k < size; ++k) {
    sum += values[indexes[k]];
  }
  return sum;
}

This looks more expensive, but it is not: it runs in about 32 cycles per operation. That is, splitting the work into two separate passes, and doing more stores and loads overall, is significantly cheaper. It should not make sense, but the result is robust. (Note that simply unrolling the loop a few times might serve us well; I used an extreme example to make my point. A sketch of such a blocked variant follows the table below.)

operation                      cost
hash function                  ~8 cycles per value
sum over random values         ~46 cycles per value
hash function followed by sum  ~32 cycles per value
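
A middle ground between the one-pass and the two-pass versions is to process the data in small blocks: compute a handful of indexes, then issue the corresponding loads. Here is a sketch of such a blocked variant (not part of the measurements above; the block size of 16 is arbitrary):

uint64_t sumrandomblocked(uint64_t *values, size_t size) {
  enum { BLOCK = 16 }; // arbitrary batch size
  uint32_t idx[BLOCK];
  uint64_t sum = 0;
  size_t k = 0;
  for (; k + BLOCK <= size; k += BLOCK) {
    for (size_t i = 0; i < BLOCK; i++) { // first pass: compute a batch of indexes
      idx[i] = murmur32(k + i) % size;
    }
    for (size_t i = 0; i < BLOCK; i++) { // second pass: issue the loads
      sum += values[idx[i]];
    }
  }
  for (; k < size; ++k) { // leftover elements
    sum += values[murmur32(k) % size];
  }
  return sum;
}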

My code is available. I used the GNU GCC 6 compiler as it gives the best results on this test.

So what is happening? When I use the Linux perf command to count the number of cache misses (perf stat -B -e cache-misses), I find that the approach that computes the hash values separately from the data loads has about 50% fewer cache misses.

Thus Intel processors have an easier time avoiding cache misses when the data loads are batched.

I have not verified how other processors fare. Do you know?

14 thoughts on “For greater speed, try batching your out-of-cache data accesses”

  1. There must be something off here. At first I thought it could be a difference in the hardware prefetcher or the loop optimizer, but that didn’t make sense when debugging. So I dove into the generated code for the sumrandom function first:

    uint64_t sumrandom(uint64_t *values, size_t size) {
      uint64_t sum = 0;
      for (size_t k = 0; k < size; ++k) {
        sum += values[murmur32(k) % size]; // dependency!
      }
      return sum;
    }

    Checking the memory-access dependency, a nasty instruction pops up:

    .L25:
    movl %ecx, %eax
    shrl $16, %eax
    xorl %ecx, %eax
    addq $1, %rcx
    imull $-2048144789, %eax, %eax
    movl %eax, %edx
    shrl $13, %edx
    xorl %edx, %eax
    imull $-1028477387, %eax, %eax
    movl %eax, %edx
    shrl $16, %edx
    xorl %edx, %eax
    xorl %edx, %edx
    divq %rsi # rdx:rax /= rsi EXPENSIVE
    addq (%rdi,%rdx,8), %r8 # r8 += rdi[ rdx * 8 ]
    cmpq %rcx, %rsi
    jne .L25

    The latency cost of the hashing is mostly in the modulo for murmur32(k) % size, as it requires a 64-bit division. Agner reports 21-83 cycles. If we make the size of the hash table a power of 2, the compiler can extract the hash bits with a mask (andl) instead of the expensive divq (see the C sketch after the assembly below).

    If we change

    void demo() {
      size_t N = 1024 * 1024 * 16; // 16M instead of 13M

    the divq goes away:

    .L25:
    movl %ecx, %edx
    shrl $16, %edx
    xorl %ecx, %edx
    addq $1, %rcx
    imull $-2048144789, %edx, %edx
    movl %edx, %esi
    shrl $13, %esi
    xorl %esi, %edx
    imull $-1028477387, %edx, %edx
    movl %edx, %esi
    shrl $16, %esi
    xorl %esi, %edx
    andl $16777215, %edx # and to 0xFFFFFF, no divq
    addq (%rdi,%rdx,8), %rax
    cmpq $16777216, %rcx
    jne .L25
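
    In C terms, the transformation amounts to replacing the modulo with a mask, something like this sketch (valid only when size is a power of two):

    static inline uint32_t reduce_pow2(uint32_t hash, size_t size) {
      // size must be a power of two; then hash % size == hash & (size - 1),
      // which is what the andl above does, with no division at all.
      return hash & (uint32_t)(size - 1);
    }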

    Comparing the 13M table size bench:

    cc -march=native -O3 -o batchload batchload.c
    [demo] N= 13631488
    populaterandom(indexes, N) : 6.064 cycles per operation (best)
    sumrandom(values, N) : 52.113 cycles per operation (best) 52.556 cycles per operation (avg)
    sumrandomandindexes(values, indexes, N) : 37.835 cycles per operation (best) 38.245 cycles per operation (avg)
    sumrandomfromindexes(values, indexes, N) : 30.716 cycles per operation (best) 31.593 cycles per operation (avg)

    with the 16M table size bench:

    cc -march=native -O3 -o batchload batchload.c
    [demo] N= 16777216
    populaterandom(indexes, N) : 1.733 cycles per operation (best)
    sumrandom(values, N) : 37.639 cycles per operation (best) 38.143 cycles per operation (avg)
    sumrandomandindexes(values, indexes, N) : 34.819 cycles per operation (best) 35.196 cycles per operation (avg)
    sumrandomfromindexes(values, indexes, N) : 32.992 cycles per operation (best) 33.326 cycles per operation (avg)

    Performance is now practically the same. So the latency of the murmur hash plus the modulo operation must be 15+ cycles, which is quite good considering a division is involved. So the savings come from removing a dependency on the hash-table address calculation.

    Mystery solved 🙂

  2. Oh, sorry, I forgot to make it clear. In the post you mention that the cost of hashing is 7 cycles, yet removing the dependency showed savings of 15+ cycles. That didn’t add up.

    Another interesting question is why it is only 15 cycles, given that divq in theory takes 21+ cycles. Well, the compiler generated another version of the function, via constant propagation, that uses a multiplicative inverse:

    sumrandom.constprop.2:
    .LFB50:
    .cfi_startproc
    xorl %r8d, %r8d
    xorl %esi, %esi
    movabsq $5675921253449092805, %r9 # multiplicative inverse of 13
    .p2align 4,,10
    .p2align 3
    .L11:
    movl %esi, %ecx
    shrl $16, %ecx
    xorl %esi, %ecx
    addq $1, %rsi
    imull $-2048144789, %ecx, %ecx
    movl %ecx, %eax
    shrl $13, %eax
    xorl %eax, %ecx
    imull $-1028477387, %ecx, %ecx
    movl %ecx, %eax
    shrl $16, %eax
    xorl %eax, %ecx
    movq %rcx, %rax
    mulq %r9 # rdx:rax *= 13's mul inverse
    shrq $22, %rdx # rdx >>= 22 ...
    leaq (%rdx,%rdx,2), %rax
    leaq (%rdx,%rax,4), %rax
    salq $20, %rax
    subq %rax, %rcx
    addq (%rdi,%rcx,8), %r8
    cmpq $13631488, %rsi
    jne .L11
    movq %r8, %rax
    ret

    For people missing this trick, here is an article on multiplicative inverse trick for modulus. Compilers take advantage of address calculation with leaq for speed. Still this is much more expensive than a simple andl mask. Probably why it’s 15 cycles instead of 21+.
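
    For illustration, the multiply-and-shift sequence above corresponds roughly to this sketch (it relies on GCC/Clang's __uint128_t; the constant is the one from the listing, and the reduction is only claimed to be exact for the 32-bit hash values used here):

    static inline uint64_t mod_13631488(uint64_t x) {
      // q = x / 13631488 via multiply-high and shift, as in the assembly above;
      // the remainder is then reconstructed by subtraction.
      const uint64_t M = UINT64_C(5675921253449092805); // roughly 2^86 / 13631488, rounded up
      uint64_t q = (uint64_t)(((__uint128_t)x * M) >> 86);
      return x - q * UINT64_C(13631488);
    }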

    1. Still this is much more expensive than a simple andl mask. Probably why it’s 15 cycles instead of 21+.

      I meant to say

      Still this calculation is much more expensive than a simple andl mask. It probably raises the total cost of hashing to 15 cycles* instead of 21+. And it shows when there’s a direct dependency for the memory access.

  3. Avoid the division (in modulo).

    It kills pipelining: the addition depends on the division, so it has to wait for it.

    In the array-cached variant, you allow pipelining the divisions by removing the dependency; the subsequent lookups are probably much faster than the divisions. I can only speculate that the cache misses go down because of additional pipelining in the second loop.

  4. There’s a comment from Saturday awaiting moderation because it has external links. In that comment I show that the version of the function benchmarked is a different one, generated by constant propagation. So it doesn’t do divq; it does the division via 13’s multiplicative inverse, which is somewhat faster.

    So, to complete the analysis, here is again the original version:

    cc -march=native -O3 -o batchload batchload.c
    [demo] N= 13631488
    populaterandom(indexes, N) : 6.075 cycles per operation (best)
    sumrandom(values, N) : 51.834 cycles per operation (best) 51.904 cycles per operation (avg)
    sumrandomandindexes(values, indexes, N) : 36.707 cycles per operation (best) 36.962 cycles per operation (avg)
    sumrandomfromindexes(values, indexes, N) : 31.528 cycles per operation (best) 31.754 cycles per operation (avg)

    And now let’s compare with bringing N from outside so it can’t use a constant:

    void demo(const int m) {
      size_t N = 1024 * 1024 * m;

      printf("[demo] N= %zu \n", N);

      uint64_t *values = malloc(N * sizeof(uint64_t));
      uint32_t *indexes = malloc(N * sizeof(uint32_t));
      for (size_t i = 0; i < N; i++) {
        values[i] = i * 3 - 2;
      }
      uint64_t answer = sumrandom(values, N);

    #define repeat 5
      BEST_TIME_NOCHECK(populaterandom(indexes, N), , repeat, N, true);
      BEST_TIME(sumrandom(values, N), answer, , repeat, N, true);
      BEST_TIME(sumrandomandindexes(values, indexes, N), answer, , repeat, N, true);
      BEST_TIME(sumrandomfromindexes(values, indexes, N), answer, , repeat, N, true);

      free(values);
      free(indexes);
    }

    int main(int argc, char *argv[]) {
      if (argc < 2)
        return -1;
      int m = atoi(argv[1]);
      demo(m);
      return EXIT_SUCCESS;
    }

    and let’s benchmark this version:

    $ ./batchload 13
    [demo] N= 13631488
    populaterandom(indexes, N) : 26.553 cycles per operation (best)
    sumrandom(values, N) : 80.907 cycles per operation (best) 81.980 cycles per operation (avg)
    sumrandomandindexes(values, indexes, N) : 58.007 cycles per operation (best) 58.351 cycles per operation (avg)
    sumrandomfromindexes(values, indexes, N) : 31.712 cycles per operation (best) 32.118 cycles per operation (avg)

    The sumrandom.constprop.2 implementation goes away, we see much slower hashing for both versions, and the difference from removing the hash dependency is 23 cycles.

    This is an important point often neglected in papers and books: hash tables pay a cost to reduce the hash value to the table’s range. This cost is significant and much bigger than the hash function itself. Many papers use a table size of 2^x to remove this cost, but that may not hold in a production-level implementation, since we can’t keep growing tables by 2x forever.

    Now I think we are done. 🙂

  5. Last link is wrong.

    For 16M entries, GCC 7.3 and Clang 6.0 are practically the same, both using the andl mask, and the original difference reported in this post doesn’t appear at all (as in my first comment).
