Iterating in batches over data structures can be much faster…

We often need to iterate over the content of data structures. It is surprisingly often a performance bottleneck in big-data applications. Most iteration code works one value at a time…

for value in datastructure {
  do something with value
}

There is a request to the data structure for a new value at each iteration. Alternatively, we can query the data structure far less often by asking it to fill a buffer…

for blockofvalues in datastructure {
  for value in blockofvalues {
      do something with value
  }
}

It is not automatically faster: you have to store values to a buffer and then read them again, which means copying data from registers to memory and back. It is an extra step with some inherent latency.

However, if you make your buffer large enough but not too large (e.g., 1kB), the latency will not matter much and you will remain in CPU cache (fast memory). Thus you should, in the worst case, be only slightly slower. What do I mean by “slightly”? Basically, you are adding the equivalent of a memory copy over a small buffer.

When accessing data over a network, or even across processes on the same machine, it is worth it to process the data in batches because the cost of each transaction is high. When working with data structures that live in your own process, the transaction cost might be low: repeated function calls in a loop are cheap, and they can become free after inlining. To my knowledge, batched iteration is not typically available in standard libraries.
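If you want batched iteration over your own data structure, you typically have to write the batch-filling function yourself. Here is a minimal sketch of what such an interface could look like; the linked list and the fill_batch helper are hypothetical, just to illustrate the shape of the code:

#include <stddef.h>

// Hypothetical example: a batched interface over a plain singly linked list.
// fill_batch copies up to cap values into buf and advances the cursor.
typedef struct node { int value; struct node *next; } node;

size_t fill_batch(const node **cursor, int *buf, size_t cap) {
  size_t n = 0;
  while (*cursor != NULL && n < cap) {
    buf[n++] = (*cursor)->value;
    *cursor = (*cursor)->next;
  }
  return n;
}

void process(const node *head) {
  int buffer[256]; // 1kB buffer: small enough to stay in cache
  const node *cursor = head;
  size_t howmany;
  while ((howmany = fill_batch(&cursor, buffer, 256)) > 0) {
    for (size_t i = 0; i < howmany; i++) {
      // do something with buffer[i]
    }
  }
}

We query the list once per 256 values instead of once per value, and the inner loop touches only the small buffer.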

Thus, until recently, I did not pay much attention to the idea of iterating in batches over data structures. I could imagine some gains, but I expected them to be small.

In the Go implementation of Roaring bitmaps, Ben Shaw contributed a way to iterate over values in batches, recovering many values in a buffer with each function call. It helped the performance considerably (almost doubling the speed on some tests). Richard Startin then did the same in the Java implementation. It also helped a lot:

The batch iterator is always at least twice as fast as the standard iterator (…) Depending on the contents and size of the bitmap, the batch iterator can be 10x faster.

So I started to wonder… is this an underrated strategy?

I modified the popular Go bitset library, and in some iteration tests, the batched iteration was nearly twice as fast!

The batched code is more complex, but not so terrible:

buffer := make([]uint, 256) // reusable buffer: small enough to stay in cache
j := uint(0)
j, buffer = bitmap.NextSetMany(j, buffer)
for ; len(buffer) > 0; j, buffer = bitmap.NextSetMany(j, buffer) {
    for k := range buffer {
        // do something with buffer[k]
    }
    j += 1 // resume just past the last index returned
}

Then I modified the cbitset library and saw, again, almost a doubling of the speed. Once more, the code is a bit more complicated:

size_t buffer[256];
size_t howmany = 0;
for (size_t startfrom = 0;
     (howmany = nextSetBits(b1, buffer, 256, &startfrom)) > 0;
     startfrom++) {
    for (size_t i = 0; i < howmany; i++) {
        // do something with buffer[i]
    }
}

These good results depend on what kind of data you iterate over, how you use the data, and what kind of data structure you have. Obviously, it is useless to batch iterations over an array of values. Yet my few tests provide enough evidence to conclude that batch iteration is worth investigating when speed is a limitation.

On Twitter, Milosz Tanski explained the result as follows:

One thing to remember about CPUs and optimization in general is that almost all hardware is designed to operate at maximum speed when it’s doing similar work on similar data. Branch prediction, prefetch, caches, op code level parallelization all make this assumption.

13 thoughts on “Iterating in batches over data structures can be much faster…”

  1. I cannot agree with this more strongly. I’ve made improvements to InfluxDB cursors to move batches of time series data between stages, with measurable performance improvements 👍🏻

  2. Try doing batch operations with gcc and its __builtin_prefetch() command. Try doing a batch of n (e.g. 8 or so; experiment with different batch sizes for your particular CPU) prefetches for n different memory addresses, immediately (or sometimes you can do other work in between because prefetching memory can be SLOW…) followed by a batch of n actual operations on the memory just fetched. I’ve seen massive latency savings in the past doing this. Good luck!

    [1] https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html
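
    Concretely, the pattern I am describing might look roughly like this (just a sketch: gather_sum, the batch size of 8 and the locality hints are arbitrary illustrative choices, and you would have to measure whether it helps on your data):

    #include <stddef.h>
    #include <stdint.h>

    #define BATCH 8 // batch size is a guess; experiment for your CPU

    // Sum values gathered at scattered indices: issue a batch of prefetches,
    // then a batch of actual reads on the memory just fetched.
    uint64_t gather_sum(const uint64_t *data, const size_t *idx, size_t n) {
      uint64_t sum = 0;
      size_t i = 0;
      for (; i + BATCH <= n; i += BATCH) {
        for (size_t k = 0; k < BATCH; k++) // prefetch the whole batch first
          __builtin_prefetch(&data[idx[i + k]], 0, 0);
        for (size_t k = 0; k < BATCH; k++) // then do the actual work
          sum += data[idx[i + k]];
      }
      for (; i < n; i++) // leftover elements
        sum += data[idx[i]];
      return sum;
    }

    Whether the prefetches pay off depends on the data being much bigger than the cache and on the access pattern being unpredictable to the hardware prefetcher.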

    1. My experience with __builtin_prefetch() for performance has not been fruitful. I use it as part of benchmarks to help ensure that the cache is properly populated, but otherwise I rely on the processor to prefetch.

      1. It is difficult to get results using __builtin_prefetch() but don’t give up! I’ve found it easiest to get results using it in a batch. There are special circumstances where it’s going to help: (a) Don’t make your batch size too big or too small. The CPU can only prefetch so many cache lines at a time. (b) Only use __builtin_prefetch() if you’re expecting two or more cache line fetches not to be in the cache, otherwise it’ll just get in the way. Generally speaking, the data structures you’re working on need to be much bigger than the CPU cache memory and not cached. I wonder if you have not had success because you tend to experiment with smaller ‘demo sized’ data structures rather than ‘production sized’ data structures? An example of a situation where __builtin_prefetch() would be useful is looking up a batch of keys in a larger hash table. HTH and don’t give up on __builtin_prefetch()! 🙂

  3. Can you give me a short C program (say 30 lines of code) where __builtin_prefetch() helps tremendously, and such that the code without __builtin_prefetch() is not simply poorly designed so as to make __builtin_prefetch() look good.

    That is, give me an example where __builtin_prefetch() helps a lot and where I cannot, using straight portable C, get the same kind of performance.

    Feel free to use lots of RAM.

  4. Have a look at [1] which uses a 1 GB block of memory and then runs a series of experiments on it by looping over it with differing batch sizes (1, 2, 4, 8, 16, 32, 64, 128, and 256), differing increments between reads (1, 65, 513, and 524,225 bytes), and with and without __builtin_prefetch(). Each experiment reads from the block of memory the same number of times: 500 million times.

    It should be possible for you to easily run this on your Linux box with little or no changes. The results will likely be sensitive to the exact CPU type, and the speed of the RAM. Please let me know how you get on. Also, I created this example especially for you, so it’s possible there are flaws / bugs. If you find any then please let me know! 🙂

    The results show that the biggest gains are using __builtin_prefetch() where auto prefetching has little effect and where the cache also has little effect. Let’s look at the inc 524,225 results which represent the slowest, non-cached memory reads. The fastest batch size on my CPU without prefetch is batch size 2 with 84.5 million reads per second. And with prefetch this jumps up to 94.3 million reads per second, or 11.5% faster. However, with a batch size of 64 then using prefetch is 40% faster.

    I think the experiment is also interesting because it shows how effective the CPU cache is. For example, the best result without prefetching is 970 million per second with batch size 16 with inc 1. And the worst is 70.4 million per second with batch size 16 and inc 524,225. So the fastest with CPU cache acceleration (without prefetching) is 13.8 times faster… over an order of magnitude faster. This also shows that between the prefetch loop and the batch loop, it would be possible to do quite a lot of CPU work on cached memory while waiting for the prefetched RAM to be cached. Perhaps a further experiment could prove this?

    Also interesting is that, without prefetch, reads per second generally get consistently slower as the batch size gets bigger, except for batch size 1 or inc 1. However, that’s not the case with prefetch: as the batch size gets bigger, the reads per second do not necessarily get consistently slower.

    Looking forward to any comments.

    [1] https://gist.github.com/simonhf/caaa33ccb87c0bf0775a863c0d6843c2

  5. It’s interesting when you try this code on an Amazon box… which I did (but not for the results above). I found an unused Amazon c3.8xl box. Of course, it’s unused by me… but there may be other (‘noisy’?) neighbours on the box doing stuff. Although the virtualization layer gives each running OS a fair slice of CPU time… AFAIK there is no way to virtualize the CPU cache… which means that the fair slice of CPU time really depends upon how much memory is cached. We saw above that having memory cached can make an order of magnitude difference.

    So I ran the tests seven times on the Amazon box, and here is one set of results, which varied enormously from test run to test run:

    500000000 incs in 8.485089 seconds or 58926901 incs per second without prefetch using batch_size 8 and inc 524225
    500000000 incs in 23.625080 seconds or 21163949 incs per second without prefetch using batch_size 8 and inc 524225
    500000000 incs in 8.736761 seconds or 57229446 incs per second without prefetch using batch_size 8 and inc 524225
    500000000 incs in 35.973597 seconds or 13899083 incs per second without prefetch using batch_size 8 and inc 524225
    500000000 incs in 9.903223 seconds or 50488613 incs per second without prefetch using batch_size 8 and inc 524225
    500000000 incs in 8.794001 seconds or 56856941 incs per second without prefetch using batch_size 8 and inc 524225
    500000000 incs in 10.606903 seconds or 47139113 incs per second without prefetch using batch_size 8 and inc 524225

    Most test runs, this particular test took between 8.5 and 10.6 seconds. However, on two of the test runs it took 23.6 and 36 seconds! Surely this must be because of ‘noisy CPU cache neighbors’?

  6. I’ve been looking into this a bit more and discovered that when you run the same memory bandwidth benchmark on the same hardware but with different OSs, the results can differ in several ways:

    The benchmark can simply be generally faster for some OSs.
    The benchmark can be slower for hyperthreads for some OSs.
    The benchmark percentile at 90% and above can be abnormally slow.

    For example, I have included raw results below (because I cannot figure out how to attach a graph) of 6 sessions: 3 different types of ‘bare metal’ hardware from packet net [1] (ARM c2, Intel Xeon m1, and a different Intel Xeon m2) running the same benchmark under 2 different OSs. In each session the benchmark is run 100s of times and the percentile results are on the x axis.

    read 720 lines: max-packet-net-centos7/c2.medium/cache-line-example-2-b.log ;[1.03 p0][1.04 p5][1.04 p10][1.04 p15][1.04 p20][1.04 p25][1.04 p30][1.05 p35][1.05 p40][1.05 p45][1.06 p50][1.06 p55][1.06 p60][1.06 p65][1.06 p70][1.06 p75][1.06 p80][1.06 p85][1.06 p90][1.07 p95][1.07 p100]
    read 720 lines: max-packet-net-centos7/m1.xlarge/cache-line-example-2-b.log ;[1.02 p0][1.03 p5][1.05 p10][1.07 p15][1.08 p20][1.08 p25][1.09 p30][1.09 p35][1.09 p40][1.09 p45][1.09 p50][1.09 p55][1.09 p60][1.10 p65][1.10 p70][1.10 p75][1.10 p80][1.10 p85][1.11 p90][1.13 p95][1.90 p100]
    read 840 lines: max-packet-net-centos7/m2.xlarge/cache-line-example-2-b.log ;[1.21 p0][1.22 p5][1.22 p10][1.22 p15][1.22 p20][1.23 p25][1.23 p30][1.24 p35][1.24 p40][1.25 p45][1.26 p50][1.26 p55][1.26 p60][1.26 p65][1.27 p70][1.27 p75][1.27 p80][1.27 p85][1.28 p90][1.29 p95][1.99 p100]
    read 720 lines: max-packet-net-ubuntu1604/c2.medium/cache-line-example-2-b.log;[0.99 p0][0.99 p5][0.99 p10][0.99 p15][0.99 p20][0.99 p25][0.99 p30][1.00 p35][1.00 p40][1.00 p45][1.00 p50][1.00 p55][1.00 p60][1.00 p65][1.00 p70][1.01 p75][1.01 p80][1.01 p85][1.01 p90][1.02 p95][1.03 p100]
    read 720 lines: max-packet-net-ubuntu1604/m1.xlarge/cache-line-example-2-b.log;[1.01 p0][1.02 p5][1.04 p10][1.07 p15][1.08 p20][1.08 p25][1.08 p30][1.08 p35][1.09 p40][1.09 p45][1.20 p50][1.23 p55][1.24 p60][1.25 p65][1.25 p70][1.25 p75][1.25 p80][1.25 p85][1.25 p90][1.26 p95][1.57 p100]
    read 840 lines: max-packet-net-ubuntu1604/m2.xlarge/cache-line-example-2-b.log;[1.21 p0][1.21 p5][1.21 p10][1.22 p15][1.22 p20][1.22 p25][1.23 p30][1.23 p35][1.23 p40][1.23 p45][1.24 p50][1.31 p55][1.31 p60][1.31 p65][1.32 p70][1.32 p75][1.35 p80][1.35 p85][1.35 p90][1.36 p95][1.36 p100]

    Notes:

    c2 and m1 have 48 CPUs each including hyper threads, while m2 has 56 CPUs.
    Each benchmark was run 15 times on each CPU.

    I have also got similar results on Amazon testing different OSs. For some reason, Amazon Linux 2 performs consistently much worse than either CentOS 7 or Ubuntu 16.04.

    [1] https://www.packet.net/

  7. I don’t understand the comment that you’re adding a “memory copy over a small buffer”. Surely the net effect is to copy the same total amount of data regardless of the size of the buffer.

    The metric that will matter here is the ratio between the time needed to copy all the data and the amount of work being done with all that data. If the data is large and there’s very little work being done on it then copying it all is going to add a lot of extra time. If the data is small and there’s lots of work being done on each datum then the time to copy it all will be insignificant and the cache efficiency will dominate.

    This is especially true if the data structure itself requires lots of effort to look up individual data. If you’re descending a btree for each datum, then even if you don’t optimize the descent for batch processing, descending it again for the next datum will be much faster if all those addresses are still in cache.

    1. I don’t understand the comment that you’re adding a “memory copy over a small buffer”. Surely the net effect is to copy the same total amount of data regardless of the size of the buffer.

      I am not sure why we don’t understand each other. Yes. I agree that the size of the buffer, as long as it fits in L1 cache, is probably not super important. Did I write something different?
