Speeding up a random-access function?

A common problem in software performance is that you are essentially limited by memory accesses. Let us consider a function that writes at random locations in a big array.

 for (i = 0; i < N; i++) {
    // hash is a randomized hash function
    bigarray[hash(i)] = i;
 }

This is a good model for how one might construct a large hash table, for example.

It can be terribly slow if the big array is really large because each and every access is likely to be an expensive cache miss.

Can you speed up this function?

It is difficult, but you might be able to accelerate it a bit, or maybe more than a bit. However, it will involve doing extra work.

Here is a strategy which works, if you do it just right. Divide your big array into regions. For each region, create a stack. Instead of writing directly to the big array, when you are given a hash value, locate the corresponding stack and append the (location, value) pair to it. Then, later, go through the stacks and apply their contents to the big array.

 
 for (i = 0; i < N; i++) {
    loc = hash(i)
    add (loc, i) to buffer[loc / bucketsize]
 }
 for each buffer {
    for each (loc, i) in buffer {
      bigarray[loc] = i
    }
 }

It should be clear that this second strategy is likely to save some expensive cache misses. During the first phase, we append to only a few stacks: because there are few of them, the top of each stack is likely to remain in cache. Then, when we unwind the stacks, we write in random order, but within a small region of the big array.

This is a standard “external memory” algorithm: people used to design a lot of these algorithms when everything was on disk and when disks were really slow.

So how well do I do? Here are my results on a Skylake processor using GNU GCC 8 (with -O3 -march=native, THP set to madvise).

           cycles/access   instructions/access
standard   57              13
buffered   45              36

So while the buffered version I coded uses nearly three times as many instructions, and while it needs to allocate a large buffer, it still comes out on top.

My code is available.

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

6 thoughts on “Speeding up a random-access function?”

  1. I designed something similar but with the temporary buffers containing larger batches. Only the tail of the buffer stays in cache for appending. Once a buffer is above a water line, it is emptied into the main array with good locality. The hardware prefetcher helps with the sequential reads from the buffers.

    This doesn’t scale to very large arrays: at some point, either the bucket size or the buffer tails won’t fit in the cache anymore. It makes me wonder whether this algorithm can be efficiently layered, something like the funnelsort approach.

    I tried using non-temporal writes to the buffers, but for variable or sub-64B item sizes, you still need somewhere to do the write combining, such as the actual write-combining buffers or yet another temporary buffer in cache. This doesn’t work so well once the number of buckets gets very large.

    I will compare your code against my method once I get back to this project.

    I’m also curious to see how it compares with prefetches and a ring delay. This random-access situation might be the only valid use case for manual prefetches.

  2. I am confused: if you know that the THP setting matters here, why does the rest of the text focus purely on cache misses? With such a huge array (4GB), address translation misses with 4K pages will likely be the dominating factor, and indeed a quick test with hugepages allowed shows that they bring a 2x-3x speedup (on Sandybridge).

      1. As far as I can tell, page faults are not relevant for performance here, because they mostly occur on initial accesses to the array (i.e. in memset). Once address mapping is populated, subsequent accesses cause TLB misses that are resolved via page-table walks, without causing faults.

        TLB misses do not necessarily imply cache misses, it’s possible to have data in cache that needs a page walk because the relevant TLB entry has been evicted.

        1. I have the same comeback. A TLB is a cache. A page walk occurs following a cache miss (at the TLB level). I’d grant you that it is not how most people would understand the term, but maybe you can believe me that this is what I had in mind. To be clear, as your first comment implies, I am aware that page size is an important factor in this benchmark.

          In case there is any doubt: I am not claiming (at all) that my solution (with or without playing with the page size) is best. It is definitely the case that you can improve it with large/huge pages (thanks for pointing it out).
