Last year, we published a fast C++ library to compress and decompress arrays of integers. To get good compression, we use differential coding: the arrays of integers are sorted and, instead of storing the integers themselves, we store the differences between successive integers. The differences are typically small integers that can be compressed efficiently. Out of habit, I ported our code to Java and published it under the name JavaFastPFOR library.
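The differential-coding idea can be sketched in a few lines of Java. This is a simplified illustration of the transform, not the library's actual code:

```java
// Sketch of differential (delta) coding for a sorted int array.
// Illustration only, not the JavaFastPFOR implementation.
public class DeltaSketch {
    // Replace each value with its gap from the previous value.
    static int[] deltaEncode(int[] sorted) {
        int[] deltas = new int[sorted.length];
        int previous = 0;
        for (int i = 0; i < sorted.length; i++) {
            deltas[i] = sorted[i] - previous;
            previous = sorted[i];
        }
        return deltas;
    }

    // Invert the transform with a running (prefix) sum.
    static int[] deltaDecode(int[] deltas) {
        int[] sorted = new int[deltas.length];
        int running = 0;
        for (int i = 0; i < deltas.length; i++) {
            running += deltas[i];
            sorted[i] = running;
        }
        return sorted;
    }
}
```

Because the input is sorted, the gaps are nonnegative and usually small, so they fit in far fewer bits than the original values.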
Unlike generic compression techniques like gzip or Google Snappy, we only wish to compress and uncompress integers. We provide a less general solution, but we can also get much more impressive speeds in some cases (e.g., an order of magnitude faster).
Though you cannot reach the same kind of speed in Java as you can in C++, there are many good reasons to use Java instead of C++. How good is Java at this task? Direct comparisons between Java and C++ are difficult. I would estimate that the difference is a factor of 3 or more. But Java can still be more than fast enough.
I decided to compare our results with the popular Kamikaze PForDelta library. I used a fast Core i7 processor with synthetic data. We vary the compressibility of the data. For each test, I report the decompression speed in millions of integers per second as well as the storage cost in bits per integer (in parentheses).
| Binary Packing | Kamikaze PForDelta |
| --- | --- |
| 1200 (3.1) | 300 (3.3) |
| 1100 (7.4) | 300 (7.7) |
| 1000 (13) | 300 (14) |
The numbers show that the Binary Packing technique implemented in the JavaFastPFOR library can easily exceed a billion integers per second in decompression speed. This is likely fast enough for most applications.
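The gist of binary packing can be sketched as follows: store each value of a small block using just enough bits to represent the largest value in the block. This is a simplified illustration, not the library's optimized kernels (the class and method names are mine):

```java
// Sketch of binary packing: each value in a block is stored using b bits,
// where b is the bit width of the largest value. Illustration only.
public class BinPackSketch {
    // Smallest bit width that can represent every value in the block.
    static int bitWidth(int[] block) {
        int or = 0;
        for (int v : block) or |= v;
        return 32 - Integer.numberOfLeadingZeros(or | 1);
    }

    // Pack nonnegative values into a long buffer, b bits each (b <= 32).
    static long[] pack(int[] block, int b) {
        long[] out = new long[(block.length * b + 63) / 64];
        for (int i = 0; i < block.length; i++) {
            int bit = i * b;
            long v = block[i] & 0xFFFFFFFFL;
            out[bit / 64] |= v << (bit % 64);
            if (bit % 64 + b > 64) // value straddles a word boundary
                out[bit / 64 + 1] |= v >>> (64 - bit % 64);
        }
        return out;
    }

    static int[] unpack(long[] packed, int n, int b) {
        int[] out = new int[n];
        long mask = (1L << b) - 1;
        for (int i = 0; i < n; i++) {
            int bit = i * b;
            long v = packed[bit / 64] >>> (bit % 64);
            if (bit % 64 + b > 64) // pull the rest from the next word
                v |= packed[bit / 64 + 1] << (64 - bit % 64);
            out[i] = (int) (v & mask);
        }
        return out;
    }
}
```

The real library packs in fixed-size blocks with carefully unrolled loops, which is where the speed comes from; the bit arithmetic above only conveys the idea.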
I posted the raw results if you wish to examine them more closely.
The library is available under the Apache license and from the Maven repository. It includes many more schemes.
Warning: These results may not be representative of what you get in an actual application. They will vary depending on the machine you use and your data. I am only providing these numbers as a ballpark indication.
Credit: This work is the result of a fruitful collaboration with many super smart people including L. Boytsov, O. Kaser, N. Kurz. All mistakes are my fault.
I was recently contemplating whether I should use Java for algorithmic programming and I came to a conclusion. Java is not good. I mean, it is great if you consider the number of libraries available, and the performance is generally good. Yet, if you need top-notch performance, it is quite hard to get in Java. You would need to use awful things like manual memory management and parallel arrays… This is awful and counterproductive. You can do much better in C++. But if you don’t mind a 2-3x loss in performance, Java (Scala, or, say, OCaml) can be a nice choice.
“The assumption is that most (but not all) values in your array use less than 32 bits.”
I think this statement belongs in this post, otherwise nothing makes sense.
@Anonymous
Thanks for pointing out a shortcoming in my post. However, we do not make this assumption. Instead, we just assume that the integers have been sorted. I have edited the post accordingly.
@Itman
What is true is that there is a whole range of optimizations that are simply not possible in Java (by design).
Does it support streaming compression as integers keep appearing in the stream?
@Ishan
You can compress and uncompress data in blocks… See the “advanced” example in the example.java file:
https://github.com/lemire/JavaFastPFOR/blob/master/example.java
So, yes, I would say that you can build streaming applications using this library.
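The block-wise approach can be sketched like this: buffer incoming integers into fixed-size blocks and delta-encode each block relative to the last value of the previous block. This is a minimal illustration of the idea, not the library's actual block API (all names here are mine):

```java
// Sketch of streaming/block-wise delta compression: integers arrive one at a
// time, are buffered into small blocks, and each block is encoded as gaps.
// Illustration only; the library's real block interface differs.
import java.util.ArrayList;
import java.util.List;

public class StreamingDeltaSketch {
    static final int BLOCK = 4;           // tiny block size for illustration
    private final int[] buffer = new int[BLOCK];
    private int fill = 0;
    private int lastValue = 0;            // carry state across blocks
    final List<int[]> blocks = new ArrayList<>();

    // Feed one integer from the (sorted) stream.
    void add(int value) {
        buffer[fill++] = value;
        if (fill == BLOCK) flush();
    }

    // Encode the buffered values as gaps and start a fresh block.
    void flush() {
        int[] deltas = new int[fill];
        for (int i = 0; i < fill; i++) {
            deltas[i] = buffer[i] - lastValue;
            lastValue = buffer[i];
        }
        blocks.add(deltas);
        fill = 0;
    }

    // Rebuild the original stream from the stored delta blocks.
    List<Integer> decodeAll() {
        List<Integer> out = new ArrayList<>();
        int running = 0;
        for (int[] deltas : blocks)
            for (int d : deltas) out.add(running += d);
        return out;
    }
}
```

Each block can then be handed to a bit-packing codec independently, which is what makes the approach stream-friendly.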
Hi,
*noob here
I want to ask you for advice about compressing integers. I have integers in the range 0-255, so each of them is 8 bits long. Which compressor should I use? JavaFastPFOR is only for “in memory” compression, right?
The best strategy is to try various codecs and see which one works best for your data. There is no need to guess.
The library is low-level and won’t handle issues such as writing data to disk.
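For values known to fit in 0-255, note that simply storing them in a `byte[]` instead of an `int[]` already gives a 4x saving before any codec is applied. A minimal sketch of that round trip (the class and method names are mine):

```java
// Values in 0-255 fit in a byte, so a byte[] is 4x smaller than an int[].
// Sketch of the round trip; any codec can then run on top of this.
public class BytePackSketch {
    static byte[] toBytes(int[] values) {
        byte[] out = new byte[values.length];
        for (int i = 0; i < values.length; i++) {
            if (values[i] < 0 || values[i] > 255)
                throw new IllegalArgumentException("value out of range: " + values[i]);
            out[i] = (byte) values[i];  // keeps the low 8 bits
        }
        return out;
    }

    static int[] fromBytes(byte[] packed) {
        int[] out = new int[packed.length];
        for (int i = 0; i < packed.length; i++)
            out[i] = packed[i] & 0xFF;  // undo Java's sign extension
        return out;
    }
}
```

Whether a codec can shrink the data further depends on the distribution of the values, which is why trying several codecs on your own data is the right move.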