Generic number compression (zstd)

I have done a lot of work that involves compressing and uncompressing data. Most often, I work on data that has specific characteristics, e.g., sorted integers. In such cases, one can do much better than generic compression routines (e.g., zstd, gzip) both in compression ratios and performance.

But how well do these generic techniques do for random integers and floats?

  1. We generate 32-bit floats in the interval [0,1] and store then as double-precision (64-bit) floats. Roughly speaking, it should be possible to compress this data by a factor of two.
  2. We generate 64-bit integers in the range [-127,127]. We should be able to compress this data by a factor of eight (from eight bytes to one byte).

What are the results? I use zstd v1.5.2 (with default flags) and a couple of small programs.

source compression ratio
32-bit floats as 64-bit floats 2x
64-bit integers in the range [-127,127] 5x

The compression ratio is pretty good for the floating-point test, nearly optimal. For the 64-bit integers, the results are less exciting but you are within a factor of two of the ideal compression ratio.

Update: As reported in the comments, you can get much better compression ratios if you request more aggressive compression, although it also takes much more time.

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

16 thoughts on “Generic number compression (zstd)”

  1. Interesting. The default compression level of zstd seems to be only 3 of 22, though. And brotli can squeeze even more out of it, although at a high price (of CPU time).

    Here are some more results for all levels of zstd, brotli and gzip:
    https://github.com/evelance/generic-number-compression

    Somehow, brotli manages to get the file size down to 44% for the float32-in-double file:

    Versions: gzip 1.12, brotli 1.0.9, zstd 1.5.2
    Checking compression for file 'testfloat.dat'
    00000000: 0000 00c0 128d e63f 0000 00a0 f321 cf3f
    00000010: 0000 00a0 2580 eb3f 0000 00e0 012a ea3f
    gzip-1 0.14s 4497086 56.21%
    gzip-5 0.26s 4217599 52.72%
    gzip-9 5.50s 4093342 51.17%
    brotli-0 0.02s 4835457 60.44%
    brotli-5 0.31s 4045934 50.57%
    brotli-9 1.55s 3986579 49.83%
    brotli-11 10.15s 3517421 43.97%
    zstd-1 0.02s 4508213 56.35%
    zstd-3 0.04s 4190227 52.38%
    zstd-8 0.17s 3878348 48.48%
    zstd-16 1.56s 3754120 46.93%
    zstd-22 2.31s 3755501 46.94%
    Checking compression for file 'testint.dat'
    00000000: 7e00 0000 0000 0000 f1ff ffff ffff ffff
    00000010: 2200 0000 0000 0000 2100 0000 0000 0000
    gzip-1 0.06s 1896180 23.70%
    gzip-5 0.15s 1675779 20.95%
    gzip-9 7.20s 1519492 18.99%
    brotli-0 0.01s 1743049 21.79%
    brotli-5 0.15s 1523142 19.04%
    brotli-9 0.48s 1521837 19.02%
    brotli-11 9.44s 1234645 15.43%
    zstd-1 0.02s 1593200 19.91%
    zstd-3 0.02s 1656052 20.70%
    zstd-8 0.12s 1675177 20.94%
    zstd-16 1.56s 1323872 16.55%
    zstd-22 2.67s 1297221 16.22%

    I am currently pondering on the implementation of a time series database for tiny embedded devices and simply compressing a list of appropriately sized (delta) values yields pretty good results 🙂

    By the way, can you recommend a good compression algorithm for uint32 timestamp values that are increasing or strictly increasing? A pointer to the right direction would be greatly appreciated.

    1. Depends how extreme you wanna get, and what the requirements are. 🙂 Like, do you need random access? What language?

      sux4j, a Java package, has a large list of data structures for this kind of thing that provide close to the information-theoretical lower bound, like the Elias-Fano encoding. There’s a C++ implementation from Facebook (https://github.com/facebook/folly/blob/main/folly/experimental/EliasFanoCoding.h). You mentioned embedded, so that’s why I threw the C++ lib in there. I bet there’s a C implementation out there too.

      Also, check out https://pdal.io/en/stable/, a LIDAR compression software.

      1. Thanks a lot for the links!

        For the requirements, the code size is ~1MB, RAM ~16MB and the database is capped at 10MB. So, actually tiny 🙂

        Random indexing and performance is generally not important, the database just needs to compress as many IoT data points as possible (our test data has ~60M data points) and deliver time segments of values for a graph.

    2. You are mistaken that the optimal compression for random floats as doubles should be “roughly equal to two”.

      The entropy of IEEE floats between 0 and 1 is (slightly under) 25 bits: 23 bits for the mantissa, plus 2 bits for the exponent, using Shannon’s formula for entropy.

      So that means that random floats should already be compressible to a ratio of 25/32 = 0.78125 = ~78% and when converted to doubles, the optimal compression ratio is 25/64 = 0.390625 = ~39%, not 50%. That explains why Stefan was able to compress these files in much less than 50%, which would have been theoretically impossible otherwise.

      1. You are mistaken that the optimal compression for random floats as doubles should be “roughly equal to two”.

        Your computation suggests that the exact answer is ~2.6x.

  2. Hi Daniel, interesting experiment, but I guess you’ll agree the results can not be generalized, as it depends dramatically on the details of the data.

    For example, the floats you’re generating are not really random, as you’re using a very small subset of the mantisa domain (half of your values will have the exact same mantisa), making every 8 bytes you generate have identical 4 bytes, and there are only 8 versions of 5-byte patterns there. zstd can surely compress that very well, with “-9” you’ll even get under your 50% threshold.

    Similarly, if you change your integer domain to [0, 255], suddenly you get to 13% compression, because you only generate 7-byte sequences of 00s, not both 00s and FFs.

    In general, you’re right, it’s easy to generate data distributions where zstd will lose badly to specific encodings. On the flip side, for any of these encodings, there will be distributions where zlib will crush it 🙂

    Side note: zstd for me is a true revolution in compression tech – the compression ratios and speed it provides makes most of the general purpose alternatives mostly obsolete IMO.

    Fun!

  3. Hi Daniel, interesting experiment, but I guess you’ll agree the results can not be generalized, as it depends dramatically on the details of the data.

    I think we are in agreement.

    I expect a codec like zstd to be often within a factor of two of a reasonable information theoretical bound when doing data engineering work. And it is often fast enough. There are specific instances, and these instances are important, where you can do much better (better compression, better speed)… and I care a lot about these instances… but if you just have generic data… then using something like zstd will be good enough… meaning that the engineering work needed to do better will not be worth the effort.

    the floats you’re generating are not really random, as you’re using a very small subset of the mantisa domain (half of your values will have the exact same mantisa)

    Am I? The code is meant to generate random numbers between 0 and 1…

     std::random_device rd;
      std::default_random_engine eng(rd());
      std::uniform_real_distribution<float> distr(0.0f, 1.0f);
      for (size_t i = 0; i < N; i++) {
        x[i] = distr(eng);
      }
    

    Admittedly, not all floats fall in this interval… Only about a quarter of them… so I expect slightly less than 30 bits of randomness per float…

    Looking at the raw data, I do not see that half of the mantissa have the exact same value… maybe I misunderstand what you meant?

     ./generate & hexdump test.dat |  head -n 10
    0000000 0000 0000 1c50 3fcb 0000 4000 e34a 3feb
    0000010 0000 e000 8443 3fee 0000 6000 6eeb 3fdb
    0000020 0000 4000 5b10 3fbf 0000 c000 f3ae 3fd8
    0000030 0000 0000 3b2b 3fe2 0000 8000 eb88 3fb4
    0000040 0000 e000 fa1a 3fe4 0000 a000 de15 3fbb
    0000050 0000 4000 1eb4 3fe5 0000 e000 833a 3fe8
    0000060 0000 c000 906b 3fe0 0000 e000 88e3 3fdf
    0000070 0000 a000 69d3 3fe2 0000 0000 4785 3f92
    0000080 0000 8000 dd59 3fe5 0000 2000 613a 3fc1
    


    Similarly, if you change your integer domain to [0, 255], suddenly you get to 13% compression, because you only generate 7-byte sequences of 00s, not both 00s and FFs.

    In that scenario, we get roughly an 8x compression ratio, so effectively as good as it gets. When I built my example, I deliberately used a signed value because I think it is more impressive that you can get a 5x compression ratio with signed values !!!

    I was neither trying to set zstd for a fall nor trying to make it look good.

    1. Argh, silly me, I meant exponent, not mantisa – half of the values will fall in the range [0.5, 1) which will have the same mantisa 🙂 Sorry, I was typing this late.

    2. (Sorry, I posted this comment in reply to Stefan above, but it’s more appropriate here. Please feel free to delete my earlier reply.)

      The entropy of IEEE floats between 0 and 1 is (slightly under) 25 bits: 23 bits for the mantissa, plus 2 bits for the exponent, using Shannon’s formula for entropy. So that means that random floats should already be compressible to a ratio of 25/32 = 0.78125 = ~78% and when converted to doubles, the optimal compression ratio is 25/64 = 0.390625 = ~39%, not 50%. That explains why Stefan was able to compress these files in much less than 50%, which would have been theoretically impossible otherwise.

Leave a Reply

Your email address will not be published. The comment form expects plain text. If you need to format your text, you can use HTML elements such strong, blockquote, cite, code and em. For formatting code as HTML automatically, I recommend tohtml.com.

You may subscribe to this blog by email.