How fast is bit packing?

Integer values are typically stored using 32 bits. Yet if you are given an array of integers between 0 and 131 072, you could store these numbers using as little as 17 bits each—a net saving of almost 50%.

Programmers nearly never store integers in this manner despite the obvious compression benefits. Indeed, bit packing and unpacking is expensive. How expensive? Intuitively, you might think that recovering 32-bit integers from a stream of packed integers must be at least as expensive as copying the 32-bit integers, and possibly much more expensive. If that is your intuition, then you might be wrong. It can be cheaper to recover 32-bit integers from packed 4-bit integers because you only need to load one 32-bit word to unpack 8 integers.

Clearly, packing integers in units of 17 bits is not especially convenient. Indeed, 17 and 32 are coprime. We expect that it would be much faster to pack and unpack integers in units of 4, 8 or 16 bits, than in units of 17 bits. Indeed it is but the difference is maybe not as large as you might think.

I have implemented efficient packing and unpacking routines in C++. To simplify the implementation, we pack and unpack integers in sets of 32 numbers. I have optimized the code using the GNU GCC 4.6.2 compiler.

On my macbook air (Intel core i7), I get that the unpacking speed is not very sensitive to the specific number of bits: generally, the smaller the bit width, the faster the unpacking. The packing speed is much faster when the bit width is 8 or 16. Even so, the difference is only by a factor of two or so. The results are presented in the next figure. On the y axis, you have the time (smaller is better). On the the x axis, we have the number of bits we packed to. For example, when bit is 1, we pack 32 integers into a single 32-bit word. When the number of bits is set to 32 bits, we have a regular copy.

I also provide the raw numbers behind the figure in the next table.

bits pack (ms) unpack (ms)
1 219 211
2 215 216
3 210 205
4 198 194
5 222 214
6 229 218
7 242 222
8 167 202
9 252 240
10 243 225
11 255 235
12 246 231
13 276 244
14 279 245
15 304 255
16 183 223
17 292 252
18 297 256
19 316 266
20 300 256
21 329 280
22 321 274
23 332 278
24 299 257
25 341 289
26 340 298
27 352 295
28 336 284
29 367 311
30 357 299
31 384 319
32 256 261

Conclusion: Bit packing and unpacking can be quite fast. In particular, it can be cheaper to unpack integers from a small number of bits to 32-bit integers than to copy the same 32-bit integers. Exact results will vary depending on your compiler and CPU.

Note: Strictly speaking my implementation packs the first bits of each integer: it is not assumed that the integers are between 0 and 2bit. By adding this assumption, you can improve the packing speed somewhat (at least when the number of bits is not 8 or 16).

Further reading : We have written recent research papers that survey related schemes for this problem. Please see:

They include an extensive experimental evaluation. You can find a complete implementation of all techniques in C++11 online:
FastPFor and SIMDCompressionAndIntersection. There are also C libraries: simdcomp and MaskedVByte. If you prefer Java, please see JavaFastPFOR.

Daniel Lemire, "How fast is bit packing?," in Daniel Lemire's blog, March 6, 2012.

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

18 thoughts on “How fast is bit packing?”

  1. I’m missing something… how is packing 32-bit integers into 17 bits a savings of 90%? It sounds closer to 50%.

  2. Please see my US patent no. 5,602,550, filed in 1995, granted in 1997, which describes a complete implementation of an adaptive compression utilizing bit packing, but also allowing for bit packing of deltas between successive values. This algorithm was built for speed.

  3. I’m missing something here. I was hoping to see a speed comparison between bit-packing and not bit-packing.

    Given an array of k-bit integers stored in 32-bit integers, how long does it take to copy that array? how long does it take to pack that array? how long does it take to unpack the packed array?

  4. @Patrick

    I’m missing something here. I was hoping to see a speed comparison between bit-packing and not bit-packing.

    You get the non-packed approach when bit is set to 32.

    Don’t forget that my source code is available (see link) so you can run your own tests if you want!

  5. It would be relevant to know how many numbers are in the data set being packed or unpacked, and compare that to no packing at all. Cache effects are likely to dominate above various sizes.

    @Jay Stein – The only proper response to that is: (rude language censored by D. Lemire) go crawl back under the rock you came from software patenter.

  6. What a shame, zav. Most compilers are sophisticated enough to continue parsing even in the presence of syntax errors.

    P.S. I think you meant “stopped,” not “stop.”

  7. @Marsh Ray – My compression algorithm was patented by the company where I was employed at the time. I did not think it was worth wasting anyone’s time explaining that detail. The patent application is a publicly available explanation of the algorithm, which is relevant to the current discussion, unlike your trolling.

  8. Jay, would love to check out your patent. I’ve been fascinated with the potential for this since 1995 while investigating systems and methods for storing quantized delta frames in video streams. None of my PAs are as fundamental.

    David, this is nice. Wish I had time to play with this at the moment. Thanks for the source and the correction. Cheers.

  9. Michele,

    It is not quite the opposite. The trend is the same:
    1) There is very little difference between unpacked and packed readings
    2) Some packed reads are more (though only slightly) efficient than unpacked ones.

  10. @itman @michele

    If you look closely at my code, you’ll notice that I use a lot of loops that can and should probably be unrolled. I actually leave them rolled when it makes sense so that the compiler has more options (compilers don’t typically “roll back” loops that were manually unrolled).

    Anyhow. I adjusted the code until it looked like I got optimal results with GCC 4.6 and my particular hardware. Because Michele is using GCC 4.2, I am not surprised that the results differ.

    However, even with GCC 4.2, it might be possible tweak the results with the proper optimization flags.

    As you say @itman, the results are not really all that different. But it is nice to see independent tests.

  11. Very nice post!

    I’m implementing the same idea, in C and in fewer lines. The code is here:

    Please take a look and if you want to help me I’ll appreciate that. 🙂

    I’m having errors.. eg packing at 32 int variable numbers less than 17 are fine but when its greater than 16 it doesn’t work well… I don’t know what’s the problem, will appreciate any help.

    frede dot sch at gmail dot com

Leave a Reply

Your email address will not be published.

You may subscribe to this blog by email.