Thanks for the link.

zstd is generic, indeed, and there are definitely preprocessing steps that may help compression.

For example, the “byte stream split” encoding recently added to the Parquet format provides a valuable preprocessing step that increases the efficiency of Zstd compression on floating-point data:

https://issues.apache.org/jira/browse/PARQUET-1622
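A minimal sketch of the idea, assuming little-endian 4-byte floats (the `byte_stream_split` helper here is illustrative, not the Parquet implementation): byte j of every value is gathered into plane j, so the highly repetitive sign/exponent bytes end up adjacent, which suits a generic compressor like zstd.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Byte stream split: byte j of every float goes into plane j, so bytes
// with similar statistics (e.g. exponent bytes) become contiguous.
std::vector<uint8_t> byte_stream_split(const std::vector<float>& in) {
    const size_t n = in.size();
    std::vector<uint8_t> out(n * sizeof(float));
    for (size_t i = 0; i < n; i++) {
        uint8_t b[sizeof(float)];
        std::memcpy(b, &in[i], sizeof(float));
        for (size_t j = 0; j < sizeof(float); j++)
            out[j * n + i] = b[j];  // plane j, element i
    }
    return out;
}
```

Decoding is the same transpose in reverse; the transform itself stores no fewer bytes, it only rearranges them so the entropy coder does better.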

https://www.fujitsu.com/global/products/computing/servers/supercomputer/a64fx/

Weird, I searched that exact document but search didn’t find it for some reason. Oh well, thanks for the correction!

I agree that svcntb is nicer. Thanks.

The svlen_* intrinsics are documented in a currently available manual.

Reference:

Arm C Language Extensions for SVE

https://developer.arm.com/documentation/100987/0000/

Section 6.27.6. LEN: Return the number of elements in a vector

Should probably use the documented svcntb() instead though.

I think Neoverse V1 is the only ARM processor with 256-bit vectors. Neoverse V2 has reverted to 128-bit vectors.

Depends how extreme you wanna get, and what the requirements are. :) Like, do you need random access? What language?

sux4j, a Java package, has a large list of data structures for this kind of thing that provide close to the information-theoretical lower bound, like the Elias-Fano encoding. There’s a C++ implementation from Facebook (https://github.com/facebook/folly/blob/main/folly/experimental/EliasFanoCoding.h). You mentioned embedded, so that’s why I threw the C++ lib in there. I bet there’s a C implementation out there too.
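For illustration, a toy Elias-Fano sketch (mine, not the sux4j or folly code, and without the rank/select machinery that makes real implementations fast, so `get` here is linear-time): the low bits of each sorted value are stored verbatim and the high parts are unary-coded, landing close to the information-theoretic bound for sorted sequences.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Toy Elias-Fano: n sorted values in [0, u) take about
// n*floor(log2(u/n)) low bits plus ~2n high bits.
struct EliasFano {
    int l;                      // low bits kept per element
    std::vector<uint64_t> low;  // low l bits (unpacked for clarity)
    std::vector<bool> high;     // unary-coded high parts
    EliasFano(const std::vector<uint64_t>& v, uint64_t u) {
        const size_t n = v.size();
        l = (u > n) ? (int)std::floor(std::log2((double)u / (double)n)) : 0;
        high.assign(n + (size_t)(u >> l) + 1, false);
        for (size_t i = 0; i < n; i++) {
            low.push_back(v[i] & ((1ull << l) - 1));
            high[(size_t)(v[i] >> l) + i] = true;  // bucket + offset
        }
    }
    // Random access: locate the (i+1)-th set bit (a real implementation
    // uses a select structure instead of scanning).
    uint64_t get(size_t i) const {
        size_t seen = 0, pos = 0;
        for (;; pos++)
            if (high[pos] && seen++ == i) break;
        return ((uint64_t)(pos - i) << l) | low[i];
    }
};
```

Used as `EliasFano ef({2, 3, 5, 7, 11, 13}, 16); ef.get(4);`, which returns 11.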

Also, check out https://pdal.io/en/stable/, a LIDAR compression software.

It would probably enhance zstd compression in this example:

https://lemire.me/blog/2022/11/28/generic-number-compression-zstd/

It is also obviously applicable to other formats: https://github.com/lemire/streamvbyte

Of course it doesn’t. That’s why I’m here in the comments! However, the blog post does directly state that there exists, or at least probably exists, some function that compresses (positive) integers well.

This is why I was quoting this piece:

> This could be useful if you have a fast function to compress integers that fails to work well for negative integers.

I’m wondering what the usefulness you mention there is like in practice. If it weren’t an important detail, it seems like it wouldn’t have been included in your post. Maybe it’s not important. However, I don’t think it’s a trivial detail, which is why I’m asking questions about it.

I’ll think about this some more on my own.

My blog post absolutely does not answer this question:

*Is zigzag encoding the best encoding method for “getting rid of” negative integers, for whichever function compresses integers well but doesn’t compress negative integers well?*

> This could be useful if you have a fast function to compress integers that fails to work well for negative integers.

This is what motivated that question (from the top of the post). I’d also need a definition of what it means for a function that compresses integers to work or not work well for negative integers :).

I guess a clearer question is — if we’re talking about zigzag encoding as a solution for “dealing with” negative integers so that they can be compressed “well” by some function that compresses integers: is zigzag encoding the best encoding method for “getting rid of” negative integers, for whichever function compresses integers well but doesn’t compress negative integers well?

The second part of your response, I think, partially answers my question. And while it does map those negative values close to 0 to small positive integers, it also maps existing positive integers to larger positive integers.

Smart people make mistakes and yet the world does not fall apart.

Ha, you see, my brain does it again. “which will have the same exponent”!

Argh, silly me, I meant exponent, not mantissa – half of the values will fall in the range [0.5, 1) which will have the same mantissa :) Sorry, I was typing this late.
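The exponent point is easy to check: every double in [0.5, 1) carries the same biased exponent field, 0x3FE, and only the mantissa (and sign) bits vary. A small sketch (the `exponent_bits` helper name is mine):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Extract the 11-bit biased exponent field of an IEEE-754 double.
// All doubles in [0.5, 1) share the value 0x3FE (i.e. 2^-1).
uint64_t exponent_bits(double d) {
    uint64_t u;
    std::memcpy(&u, &d, sizeof u);  // type-pun safely via memcpy
    return (u >> 52) & 0x7FF;
}
```

This is why the hexdump of random values in [0, 1) shows so many repeated high bytes (the 3f.. patterns): the exponent byte barely varies.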

The first paragraph is as follows…

*You sometimes feel the need to make all of your integers positive, without losing any information. That is, you want to map all of your integers from “signed” integers (e.g., -1, 1, 3, -3) to “unsigned integers” (e.g., 3,2,6,7). This could be useful if you have a fast function to compress integers that fails to work well for negative integers.*

I am sure it could be improved, but it is meant to provide motivation.

You should start with that as a motivator, instead of burying the lede.

It is commonly used as part of compression routines.

Thanks for your comments.

I think we are in agreement.

I expect a codec like zstd to often be within a factor of two of a reasonable information-theoretic bound when doing data engineering work. And it is often fast enough. There are specific instances, and these instances are important, where you can do much better (better compression, better speed)… and I care a lot about these instances… but if you just have generic data… then using something like zstd will be good enough… meaning that the engineering work needed to do better will not be worth the effort.

*the floats you’re generating are not really random, as you’re using a very small subset of the mantissa domain (half of your values will have the exact same mantissa)*

Am I? The code is meant to generate random numbers between 0 and 1…

std::random_device rd;
std::default_random_engine eng(rd());
std::uniform_real_distribution<float> distr(0.0f, 1.0f);
for (size_t i = 0; i < N; i++) {
    x[i] = distr(eng);
}

Admittedly, not all floats fall in this interval… Only about a quarter of them… so I expect slightly less than 30 bits of randomness per float…

Looking at the raw data, I do not see that half of the mantissa have the exact same value… maybe I misunderstand what you meant?

./generate & hexdump test.dat | head -n 10
0000000 0000 0000 1c50 3fcb 0000 4000 e34a 3feb
0000010 0000 e000 8443 3fee 0000 6000 6eeb 3fdb
0000020 0000 4000 5b10 3fbf 0000 c000 f3ae 3fd8
0000030 0000 0000 3b2b 3fe2 0000 8000 eb88 3fb4
0000040 0000 e000 fa1a 3fe4 0000 a000 de15 3fbb
0000050 0000 4000 1eb4 3fe5 0000 e000 833a 3fe8
0000060 0000 c000 906b 3fe0 0000 e000 88e3 3fdf
0000070 0000 a000 69d3 3fe2 0000 0000 4785 3f92
0000080 0000 8000 dd59 3fe5 0000 2000 613a 3fc1

*Similarly, if you change your integer domain to [0, 255], suddenly you get to 13% compression, because you only generate 7-byte sequences of 00s, not both 00s and FFs.*

In that scenario, we get roughly an 8x compression ratio, so effectively as good as it gets. When I built my example, I deliberately used a signed value because I think it is more impressive that you can get a 5x compression ratio with signed values!

I was neither trying to set zstd for a fall nor trying to make it look good.

For example, the floats you’re generating are not really random, as you’re using a very small subset of the mantissa domain (half of your values will have the exact same mantissa), making every 8 bytes you generate have identical 4 bytes, and there are only 8 versions of 5-byte patterns there. zstd can surely compress that very well, with “-9” you’ll even get under your 50% threshold.

Similarly, if you change your integer domain to [0, 255], suddenly you get to 13% compression, because you only generate 7-byte sequences of 00s, not both 00s and FFs.

In general, you’re right, it’s easy to generate data distributions where zstd will lose badly to specific encodings. On the flip side, for any of these encodings, there will be distributions where zlib will crush it :)

Side note: zstd for me is a true revolution in compression tech – the compression ratios and speed it provides make most of the general purpose alternatives mostly obsolete IMO.

Fun!

And speaking of vectorization with AVX-512, there is even vprold to convert many integers in a single instruction.
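One way to read that: a rotate left by 1 moves the top (sign) bit into the lowest bit, the “sign in the low bit” form discussed elsewhere in this thread, and vprold applies that rotation to many 32-bit lanes at once (sixteen per 512-bit register). A scalar sketch of the per-lane operation (the `rotl1` helper is mine):

```cpp
#include <cassert>
#include <cstdint>

// Rotate left by 1: the top (sign) bit wraps around into bit 0.
// On a sign-magnitude value such as 0x80000005 ("negative five"),
// this yields 0x0000000B: magnitude shifted up, sign in the low bit.
uint32_t rotl1(uint32_t v) { return (v << 1) | (v >> 31); }
```

With AVX-512 the same thing is a single vprold with an immediate rotate count, no shifts or blends needed.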

Here are some more results for all levels of zstd, brotli and gzip:

https://github.com/evelance/generic-number-compression

Somehow, brotli manages to get the file size down to 44% for the float32-in-double file:

Versions: gzip 1.12, brotli 1.0.9, zstd 1.5.2

Checking compression for file 'testfloat.dat'
00000000: 0000 00c0 128d e63f 0000 00a0 f321 cf3f
00000010: 0000 00a0 2580 eb3f 0000 00e0 012a ea3f
gzip-1      0.14s   4497086   56.21%
gzip-5      0.26s   4217599   52.72%
gzip-9      5.50s   4093342   51.17%
brotli-0    0.02s   4835457   60.44%
brotli-5    0.31s   4045934   50.57%
brotli-9    1.55s   3986579   49.83%
brotli-11  10.15s   3517421   43.97%
zstd-1      0.02s   4508213   56.35%
zstd-3      0.04s   4190227   52.38%
zstd-8      0.17s   3878348   48.48%
zstd-16     1.56s   3754120   46.93%
zstd-22     2.31s   3755501   46.94%

Checking compression for file 'testint.dat'
00000000: 7e00 0000 0000 0000 f1ff ffff ffff ffff
00000010: 2200 0000 0000 0000 2100 0000 0000 0000
gzip-1      0.06s   1896180   23.70%
gzip-5      0.15s   1675779   20.95%
gzip-9      7.20s   1519492   18.99%
brotli-0    0.01s   1743049   21.79%
brotli-5    0.15s   1523142   19.04%
brotli-9    0.48s   1521837   19.02%
brotli-11   9.44s   1234645   15.43%
zstd-1      0.02s   1593200   19.91%
zstd-3      0.02s   1656052   20.70%
zstd-8      0.12s   1675177   20.94%
zstd-16     1.56s   1323872   16.55%
zstd-22     2.67s   1297221   16.22%

I am currently pondering on the implementation of a time series database for tiny embedded devices and simply compressing a list of appropriately sized (delta) values yields pretty good results :)
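A minimal sketch of that delta idea for non-decreasing uint32 timestamps (the `encode` helper is hypothetical, using LEB128-style varints on the deltas): consecutive differences are small, so most deltas fit in a single byte.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Delta + varint (LEB128): each timestamp is stored as the difference
// from its predecessor, 7 payload bits per byte, high bit = "more bytes".
std::vector<uint8_t> encode(const std::vector<uint32_t>& ts) {
    std::vector<uint8_t> out;
    uint32_t prev = 0;
    for (uint32_t t : ts) {
        uint32_t d = t - prev;  // non-negative since ts is non-decreasing
        prev = t;
        while (d >= 0x80) { out.push_back((d & 0x7F) | 0x80); d >>= 7; }
        out.push_back((uint8_t)d);
    }
    return out;
}
```

For timestamps arriving at a roughly fixed rate, this alone gets close to one byte per sample; a general-purpose compressor on top then squeezes the remaining redundancy.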

By the way, can you recommend a good compression algorithm for uint32 timestamp values that are increasing or strictly increasing? A pointer to the right direction would be greatly appreciated.

Indeed, zigzag doesn’t necessarily compress per se. At the same time, like @me (who’s that? :)) mentioned, it enables e.g. varint encoding (which, in my book, is also not compression, but hey :) )

@moonchild also mentioned adjusting the base, aka FOR encoding. If there’s a difference in positive/negative ranges, FOR indeed will create a better (smaller) range of integers. But you need to know that base upfront, which is a weakness.

In general, if someone is interested in more efficient integer compression, Daniel’s PFOR library is not the worst place to start: https://github.com/lemire/FastPFor :)

Can you define what you mean by “compression ratios”? The blog post does not describe a compression routine.

This being said, zigzag encoding tends to favour values that are close (in absolute value) to zero… in the sense that such values get mapped to ‘small positive integers’.
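For concreteness, the usual zigzag map and its inverse might be sketched like this (the helper names are mine, not from the post):

```cpp
#include <cassert>
#include <cstdint>

// Zigzag interleaves: 0, -1, 1, -2, 2, ... -> 0, 1, 2, 3, 4, ...
// so integers of small absolute value become small positive integers.
uint32_t zigzag_encode(int32_t v) {
    return ((uint32_t)v << 1) ^ (v < 0 ? 0xFFFFFFFFu : 0u);
}
int32_t zigzag_decode(uint32_t u) {
    return (int32_t)((u >> 1) ^ (0u - (u & 1u)));
}
```

The map is a bijection, so decoding recovers the original value exactly; it only pays off when the downstream coder rewards small values.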

You are correct. Thank you.

Can you safely cast all signed integers to unsigned integers (in current C standards)?

The link is https://www.sciencedirect.com/science/article/pii/S095937802200142X

I.e., let’s say our initial set is: -1, -2, -3, 1, 2, 3, 4, 5, 6, 7, 8

With zig-zag encoding applied we are left with

1, 3, 5, 2, 4, 6, 8, 10, 12, 14, 16.

Which leaves us with “gaps” (below). These gaps now make the positive integers in our initial set take up more space in their binary representation.

7, 9, 11, 13, 15.

What do compression ratios end up looking like in the varying scenarios of

1. Equal amounts of negative and positive integers

2. More or less negative and positive integers relative to each other.

It’s not, you have to cast to unsigned to make it safe.
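A small illustration of why (the `shift_safely` helper name is mine): signed-to-unsigned conversion is always well-defined in C and C++, the value is simply reduced modulo 2^N, while left-shifting a negative signed value is undefined behaviour, so the shift should happen on the unsigned value.

```cpp
#include <cassert>
#include <cstdint>

// Cast first, then shift: (uint32_t)v is defined for every int32_t
// (reduction modulo 2^32), whereas `v << 1` for negative v is UB.
uint32_t shift_safely(int32_t v) {
    uint32_t u = (uint32_t)v;  // -1 becomes 0xFFFFFFFF, no UB
    return u << 1;             // wraps modulo 2^32, fully defined
}
```

The reverse direction, unsigned back to signed when the value does not fit, was implementation-defined for a long time; C++20 finally pinned integers to two’s complement.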

I am failing to parse this.

… but they are more likely to be hypocrites according.

This sentence looks to be incomplete. According to what?

That’s not quite what it is though, is it?

You are making the lowest bit the ‘sign’ bit.

It is a sign bit in the floating-point sense, a separate ‘sign’ bit, as opposed to two’s-complement representation.

Once you see that, it is much more intuitive.

I don’t know why it isn’t presented like this more often.
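To make that reading concrete, a small sketch (the helper is mine): after zigzag encoding, bit 0 carries the sign and the remaining bits carry the magnitude (minus one for negatives), much like a float’s separate sign bit.

```cpp
#include <cassert>
#include <cstdint>

// Zigzag as sign-magnitude with the sign in bit 0: bit 0 says
// "negative", the upper bits hold the magnitude (minus one when
// negative, since there is no negative zero to waste a code on).
uint32_t zigzag(int32_t v) {
    return ((uint32_t)v << 1) ^ (v < 0 ? 0xFFFFFFFFu : 0u);
}
```

So zigzag(-5) = 9 = binary 1001: low bit 1 (negative), upper bits 100 = 4 = magnitude minus one.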

Thanks for pointing out the typo.
