Bitmaps are surprisingly efficient

Imagine you have to copy an array, and update a few values in the process. What is the most efficient implementation?

Let us look at a concrete example. I am given this array:
I want to create a new array with these values:

Most programmers would follow the following algorithm:

  1. Copy the array first;
  2. Iterate over the list of positions and updated values ([3,40], [6,100]) and correct the values.

If there are very few values that need to be updated compared to the size of the array, this approach is probably optimal. The copy of the array itself is very efficient because it is entirely vectorizable: the processor does not need to copy values one at a time, it can copy two or four at a time.

But what if 10%, 20% or even 30% of the values need to be updated after the copy? Then storing the list of positions can become wasteful. For my toy problem, I have two positions to record (3 and 6): it can use 64 bits when using 32-bit integers. If I want to be more efficient, I can use 8-bit integers, thus using a total of 16 bits. (Most modern computers favor 32-bit integers, and it is generally not computationally efficient to use integers with anything other than 8, 16, 32 or 64 bits.)

A more memory-conscious approach is to use a bitmap. That is, I store the following value using a binary notation:
I put a 1 at the third and sixth position and elsewhere a 0. This makes up the integer 72. In this manner, I never need more than one bit per value. In this case, I use only 8 bits in total, a saving of 50% compared to the alternative where I store each position using an integer.

We need a different implementation however, one where you check the bits of the bitmap before copying.

  • for every position in the array
    1. if the bitmap value is 0 then copy the value from the source array;
    2. if the bitmap value is 1 then copy the next available updated value.

This new algorithm looks inefficient. There is a lot of branching inside a tight loop. Yet the bitmap approach can be faster when the density of updates is high enough (>2%), as the next table shows.

density (%)   time     time with bitmaps    straight copy  
17 48 26 24
9 47 26 24
6 45 26 24
5 43 26 24
4 41 26 24
3 38 26 24
2 35 26 24

My C++ code is online
. I used GNU GCC 4.6.2 with only the -Ofast flag. Hardware-wise, I am using a recent MacBook Air with an Intel Core i7. (I stress that using GCC 4.6 is important. Older compilers might give different results. Also, the Core i7 is a processors with aggressive superscalar execution: cheaper processors might give different results.)

As you can see, the bitmap approach is optimal: a copy with updated values indicated by a bitmap is just as fast as a simple copy (within 10%).

(I updated the numbers on Feb. 23rd 2012. Originally, my code processed the bitmaps 32 bits at a time. I found that it was much faster to process them 8 bits at a time, probably because it allows better loop unrolling. I updated the code again on Feb. 27th 2012 after a bug report by Martin Trenkmann, the numbers were slightly updated as well.)

Conclusion Indicating exceptions using a bitmap can save memory without any penalty to the running time.

Code: Source code posted on my blog is available from a github repository.

12 thoughts on “Bitmaps are surprisingly efficient”

  1. Actually, it’s not bitmap that are fast. It’s the algorithm: it seems it’s faster to update while copying instead of doing it afterwards, random accessing again the (big) data. Locality of reference does matter.

    If exceptionpos is sorted, i believe it would be even faster to compare to *pos. You would then save all these bit operations for a simple integer comparison. This way, bitmap is just another implementation for looking up inside the copy loop if the value is to be changed.

  2. @Anton

    This is clever, but it won’t quite work in the context I have described because we want to increment the pointer to the updated value if and only if we use it.

    What you describe would work if we had two same-length vectors. (And yes, it could be slightly faster to replace branching by multiplications. Branching is really bad on many processors.)

  3. Hi Daniel.

    Interesting problem. I tried hard to beat the bitmap solution, but finally I had to surrender.

    However, I found some errors in your original code: (Note by D. Lemire: these errors were corrected and my benchmark updated)
    1. In line 163, where you set the bits of your bitmap, the AND operation is not correct, it has to be an OR operation. Thus in your bitmapcopy functions there is no update operation at all.
    2. Also in both bitmapcopy functions there is an off-by-one error with the index i.

    I corrected the errors, made my own copy versions, and ran some performance tests. You find the updated version of the code under
    (Please click the RAW button at the top to see the unwrapped result tables at the end of the listing. The columns contain pairs of milliseconds and number of update operations)

    Furthermore, I think the overall coding style is not the best and very error-prone. In C++ one should prefer iterators over raw pointers. I personally decided to use operator[] for accessing the vectors, because your problem setup uses index values. I tested a lot and I found no performance differences in using pointers, iterators or operator[]. Indeed, for the compiler all these operations are equivalent.
    A second issue are the underscored qualifiers __attribute__ and __restrict__ in your code. These are compiler extension and shall be ommitted. The compiler knows best how to optimize. See also N3242 ยง17.2(2) of the C++ ISO standard.

    OK, let’s come to my implementations:

    memcopy: Nothing to say here. Copying a chunk of memory without any modifications is faster than copying element-wise using a loop.

    updatingcopy: The simplest solution for the given problem with just clean syntax. It outperforms the patchedcopy versions and do not use raw pointer arithmetics.

    updatingcopy2: Here I tried to group an index value together with its update value in a pair to take advantage of locality. However, the effect is negligible.

    updatingcopy3: Here I tried to avoid branching and made the iteration over the vector with the (index, update value)-pairs an outer loop. I suppose this is the same approach as in patchedcopy2, except that my version is human-readable.

    Conclusion: The bitmapcopy versions are not to beat for high densities only. At the other hand the resulting code is more error-prone and hard to read. I showed that just clean and simple code gives comparable performance results, at least for low densities. When trying to optimize do not hack your compiler, especially for C++ use its type system as much as possible and avoid ugly C-style casts. What helps is locality of data to minimize cache misses. Bitmaps are an good example for this, definition of variables as close as possible to their usage is another one. Finally, maintainability of code really matters.

    Have fun!

  4. @Martin

    Thanks Martin! I really appreciate the bug report and I fixed my code.

    Note that memcpy is *not* more efficient that loop copy using my compiler and my CPU.

    You are using an older GCC (4.4). Have you tried it with a recent GCC (4.6.2)? In my tests, this makes a huge difference.

    You are also using an inferior processor. My i7 has probably better pipelining than your Core 2.

    Anyhow, here are my results with your code:

    g++-4 -Ofast -o bitmaptrenkmann bitmaptrenkmann.cpp
    density(%) memcopy loopcopy bitmapcopy bitmapcopy8bits patchedcopy patchedcopy2 updatingcopy updatingcopy2 updatingcopy3
    100 25 25 43 0 26 0 65 32000000 59 32000000 49 32000000 42 32000000 48 32000000
    16.6667 25 25 50 0 25 0 47 5333334 50 5333334 45 5333334 37 5333334 39 5333334
    9.09091 29 30 48 0 26 0 43 2909091 40 2909091 38 2909091 31 2909091 30 2909091
    6.25 25 25 47 0 26 0 41 2000000 40 2000000 40 2000000 31 2000000 30 2000000
    4.7619 25 24 46 0 26 0 39 1523810 40 1523810 39 1523810 29 1523810 29 1523810
    3.84615 26 25 44 0 27 0 39 1230770 37 1230770 38 1230770 28 1230770 28 1230770
    3.22581 27 25 47 0 25 0 37 1032259 40 1032259 36 1032259 29 1032259 28 1032259
    2.77778 26 25 43 0 26 0 36 888889 42 888889 40 888889 28 888889 34 888889
    2.43902 27 25 46 0 25 0 35 780488 44 780488 42 780488 32 780488 34 780488
    2.17391 25 24 46 0 26 0 34 695653 44 695653 40 695653 32 695653 33 695653

    Summary: Basically, using your code I get the same performance and thus the same conclusion. I agree that using STL is nicer and better (I’m a big fan of STL) than using pointer arithmetic, especially when you get the same performance!

  5. are you sure you have a macbook air with an i7? did apple give you a not-yet-to-the-public-released mba? ๐Ÿ˜‰

Leave a Reply

Your email address will not be published. Required fields are marked *