Iterating in batches over data structures can be much faster…

We often need to iterate over the content of data structures. It is surprisingly often a performance bottleneck in big-data applications. Most iteration code works one value at a time…

for value in datastructure {
  do something with value
}

There is a request to the data structure for a new value at each iteration. Alternatively, we can query the data structure far less often by asking the data structure to fill a buffer…

for blockofvalues in datastructure {
  for value in blockofvalues {
      do something with value
  }
}

It is not automatically faster: you have to store values to a buffer and then read them again. It involves copying data from registers to memory and back. There is some inherent latency and it is an extra step.

However, if you make your buffer large enough but not too large (e.g., 1kB), the latency will not matter much and you will remain in CPU cache (fast memory). Thus you should, in the worst case, be only slightly slower. What do I mean by “slightly”? Basically, you are adding the equivalent of a memory copy over a small buffer.
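To make the double-loop pattern concrete, here is a runnable Go sketch. The names are illustrative only, and a plain slice would not actually benefit from batching; it merely stands in for a data structure that can fill a buffer on request:

```go
package main

import "fmt"

// sumOneAtATime is the conventional loop: one value per step.
func sumOneAtATime(data []int) int {
	sum := 0
	for _, v := range data {
		sum += v
	}
	return sum
}

// sumBatched reads the same values through a small reusable buffer,
// the way a batched iterator would: fill the buffer, then consume it.
func sumBatched(data []int) int {
	buf := make([]int, 128) // 1kB of ints: big enough to amortize, small enough to stay in cache
	sum := 0
	for start := 0; start < len(data); {
		n := copy(buf, data[start:]) // the "fill a buffer" step
		for _, v := range buf[:n] {
			sum += v // the "do something with value" step
		}
		start += n
	}
	return sum
}

func main() {
	data := make([]int, 1000)
	for i := range data {
		data[i] = i
	}
	fmt.Println(sumOneAtATime(data), sumBatched(data)) // both print 499500
}
```

The extra cost of the batched version is exactly the memory copy into `buf`; the potential gain comes when filling the buffer is much cheaper per value than asking the data structure for values one at a time.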

When accessing data over a network, or even across processes on the same machine, it is worth processing the data in batches because each transaction is expensive. When working with data structures inside your own process, however, the transaction cost might be low. Repeated function calls in a loop are cheap, and they can become free after inlining. To my knowledge, batched iteration is not typically available in standard libraries.
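As a sketch of what such an API could look like (a hypothetical interface, not from any standard library), a batched iterator hands out many values per call, amortizing the per-call overhead:

```go
package main

import "fmt"

// BatchIterator is a hypothetical interface: NextBatch fills buf and
// reports how many values it wrote; zero means iteration is over.
// One call can replace dozens of Next() calls.
type BatchIterator interface {
	NextBatch(buf []uint) int
}

// sliceIter is a toy implementation backed by a slice of values.
type sliceIter struct {
	values []uint
	pos    int
}

func (it *sliceIter) NextBatch(buf []uint) int {
	n := copy(buf, it.values[it.pos:])
	it.pos += n
	return n
}

// countBatched consumes an iterator batch by batch and counts values.
func countBatched(it BatchIterator) int {
	buf := make([]uint, 64)
	total := 0
	for {
		n := it.NextBatch(buf)
		if n == 0 {
			return total
		}
		total += n
	}
}

func main() {
	it := &sliceIter{values: make([]uint, 200)}
	fmt.Println(countBatched(it)) // prints 200
}
```

With a buffer of 64 values, the consumer above pays for four interface calls instead of two hundred.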

Thus, until recently, I did not pay much attention to the idea of iterating in batches over data structures. I could imagine some gains, but I expected them to be small.

In the Go implementation of Roaring bitmaps, Ben Shaw contributed a way to iterate over values in batches, recovering many values in a buffer with each function call. It helped the performance considerably (almost doubling the speed on some tests). Richard Startin then did the same in the Java implementation. It also helped a lot:

The batch iterator is always at least twice as fast as the standard iterator (…) Depending on the contents and size of the bitmap, the batch iterator can be 10x faster.

So I started to wonder… is this an underrated strategy?

I modified the popular Go bitset library, and in some iteration tests, the batched iteration was nearly twice as fast!

The batched code is more complex, but not so terrible:

buffer := make([]uint, 256)
j := uint(0)
j, buffer = bitmap.NextSetMany(j, buffer)
for ; len(buffer) > 0; j, buffer = bitmap.NextSetMany(j, buffer) {
     for k := range buffer {
        // do something with buffer[k]
     }
     j += 1
}

Then I modified the cbitset library. Again, I saw almost a doubling of the speed. The code is, once more, a bit more complicated:

size_t buffer[256];
size_t howmany = 0;
for (size_t startfrom = 0;
     (howmany = nextSetBits(b1, buffer, 256, &startfrom)) > 0;
     startfrom++) {
  for (size_t i = 0; i < howmany; i++) {
    // do something with buffer[i];
  }
}

These good results depend on what kind of data you iterate over, how you use the data, and what kind of data structure you have. Obviously, it is useless to batch iterations over an array of values. Yet my few tests provide enough evidence to conclude that batch iteration is worth investigating when speed is a limitation.

On Twitter, Milosz Tanski explained the result as follows:

One thing to remember about CPU and optimization in general is that almost all hardware is designed to operate at maximum speed when it’s doing similar work on similar data. Branch prediction, prefetch, caches, op code level parallelization all make this assumption.

7 thoughts on “Iterating in batches over data structures can be much faster…”

  1. I cannot agree with this more strongly. I’ve made improvements to InfluxDB cursors to move batches of time series data between stages, with measurable performance improvements 👍🏻

  2. Try doing batch operations with gcc and its __builtin_prefetch() command. Try doing a batch of n (e.g. 8 or so; experiment with different batch sizes for your particular CPU) prefetches for n different memory addresses, immediately (or sometimes you can do other work in between because prefetching memory can be SLOW…) followed by a batch of n actual operations on the memory just fetched. I’ve seen massive latency savings in the past doing this. Good luck!


    1. My experience with __builtin_prefetch() for performance has not been fruitful. I use it as part of benchmarks to help ensure that the cache is properly populated, but otherwise I rely on the processor to prefetch.

      1. It is difficult to get results using __builtin_prefetch() but don’t give up! I’ve found it easiest to get results using it in a batch. There are special circumstances where it’s going to help: (a) Don’t make your batch size too big or too small. The CPU can only prefetch so many cache lines at a time. (b) Only use __builtin_prefetch() if you’re expecting two or more cache-line fetches not to be in the cache, otherwise it’ll just get in the way. Generally speaking, the data structures you’re working on need to be much bigger than the CPU cache and not already cached. I wonder if you have not had success because you tend to experiment with smaller ‘demo sized’ data structures rather than ‘production sized’ data structures? An example of a situation where __builtin_prefetch() would be useful is looking up a batch of keys in a larger hash table. HTH and don’t give up on __builtin_prefetch()! 🙂

  3. Can you give me a short C program (say 30 lines of code) where __builtin_prefetch() helps tremendously, and such that the code without __builtin_prefetch() is not simply poorly designed so as to make __builtin_prefetch() look good.

    That is, give me an example where __builtin_prefetch() helps a lot and where I cannot, using straight portable C, get the same kind of performance.

    Feel free to use lots of RAM.

  4. Have a look at [1] which uses a 1 GB block of memory and then runs a series of experiments on it by looping over it in differing batch sizes (1, 2, 4, 8, 16, 32, 64, 128, and 256), differing increments between reads (1, 65, 513, and 525,225 bytes), and with and without __builtin_prefetch(). Each experiment reads from the block of memory the same number of times: 500 million.

    It should be possible for you to easily run this on your Linux box with little or no changes. The results will likely be sensitive to the exact CPU type, and the speed of the RAM. Please let me know how you get on. Also, I created this example especially for you, so it’s possible there are flaws / bugs. If you find any then please let me know! 🙂

    The results show that the biggest gains are using __builtin_prefetch() where auto prefetching has little effect and where the cache also has little effect. Let’s look at the inc 524,225 results which represent the slowest, non-cached memory reads. The fastest batch size on my CPU without prefetch is batch size 2 with 84.5 million reads per second. And with prefetch this jumps up to 94.3 million reads per second, or 11.5% faster. However, with a batch size of 64 then using prefetch is 40% faster.

    I think the experiment is also interesting because it shows how effective the CPU cache is. For example, the best result without prefetching is 970 million per second with batch size 16 with inc 1. And the worst is 70.4 million per second with batch size 16 and inc 524,225. So the fastest with CPU cache acceleration (without prefetching) is 13.8 times faster… over an order of magnitude faster. This also shows that between the prefetch loop and the batch loop, it would be possible to do quite a lot of CPU work on cached memory while waiting for the prefetched RAM to be cached. Perhaps a further experiment could prove this?

    Also, interesting is that without prefetch reads per second generally get consistently slower as the batch size gets bigger, except for batch size 1 or inc 1. However, that’s not the case with prefetch because as the batch size gets bigger the reads per second do not necessarily get consistently slower.

    Looking forwards to any comments.

