In an earlier post, I announced Stream VByte, claiming that it was very fast. Our paper was peer reviewed (Information Processing Letters) and we shared our code.
Still, as Feynman said, science is the belief in the ignorance of experts. It is not because I am an expert that you should trust me.
There is a high-quality C++ framework to build search engines called Trinity. Its author, Mark Papadakis, decided to take Stream VByte out for a spin to see how well it does. Here is what Mark had to say:
As you can see, Stream VByte is over 30% faster than the second fastest, FastPFOR in decoding, where it matters the most, and also the fastest among the 3 codecs in encoding (though not by much). On the flip side, the index generated is larger than the other two codecs, though not by much (17% or so larger than the smallest index generated when FastPFOR is selected).
This is quite an impressive improvement in terms of query execution time, which is almost entirely dominated by postings list access time (i.e integers decoding speed).
I was pleased that Mark found the encoding to be fast: we have not optimized this part of the implementation at all… because everyone keeps telling me that encoding speed is irrelevant. So it is “accidentally fast”. It should be possible to make it much, much faster.
Mark points out that Stream VByte does not quite compress as well, in terms of compression ratios, than other competitive alternatives. That’s to be expected because Stream VByte is a byte-oriented format, not a bit-oriented format. However, Stream VByte really shines with speed and engineering convenience.
I made a start at SIMD-izing the encoding; so far my best estimate is at least 16 instructions per quadword. One hitch is that there is no SIMD unsigned gt/lt compare, so instead I’m doing shift-right and compare to zero for the three different widths. The output bitmasks are conveniently -1 or 0 however, so we can just sum them together with 3 to get the 2-bit code value for each lane.
The shuffle looks like it will need a table unless there’s some fancy way to build it — that’s still on my todo list.
Wow.
Note that there is an open issue regarding endianess…
https://github.com/lemire/streamvbyte/issues/4
It seems that the current encoder/decoder are little-endian whereas the paper describes a big-endian format. I have not had time to look into the matter, but for interoperability, this matters.
One hitch is that there is no SIMD unsigned gt/lt compare
I think that if you subtract 1<<(L-1) to L-bit integers prior to doing the signed comparison, you get an unsigned comparison.
I did overlook the fact that the sub would only need to happen once for the three comparisons — I’ll look at this again tonight.
I think you can almost always get unsigned comparisons very cheaply, which is probably why Intel did not bother adding support for it.