Don’t read your data from a straw

It is common for binary data to be serialized to bytes. Data structures like indexes can be saved to disk or transmitted over the network in such a form. Many serialized data structures can be viewed as sets of ‘integer’ values. That is the case, for example, of a Roaring bitmap. We must then read back this data. An integer could be serialized as four bytes in a stream of bytes.

We are thus interested in the problem where we want to convert an array of bytes into an array of integer values.

In C or C++, you can safely convert from a stream of bytes to in-memory values using a function like ‘memcpy’. It is fast. Or you can just cheat and do a “cast”.

What do you do in Java?

A convenient approach is to wrap the byte array into an InputStream and then wrap that InputStream into a DataInputStream, like in this example where we convert an array of bytes into an array of integers:

byte[] array = ...
int[] recipient = new int[N / 4];
DataInput di = new DataInputStream(new 
for(int k = 0; k < recipient.length; k++)
    recipient[k] = di.readInt();

The benefit of this approach is improved abstraction: you do not care whether the data comes from an array of bytes or from disk, it is all the same code. If you are have serialization and deserialization code, it is probably written in terms of OutputStream and InputStream anyhow, so why not reuse that perfectly good code?

However, Java offers a performance-oriented concept called a ByteBuffer to represent and array of bytes. It is not as high level as an input stream since it assumes that you do have, somewhere, an array of bytes.

You can achieve the same conversion as before using a ByteBuffer instead:

byte[] array = ...
int[] recipient = new int[N / 4];
ByteBuffer bb = ByteBuffer.wrap(s.array);

Here is the time required to convert 1 million 32-bit integers on my 4.2GHz 2018 iMac:

DataInputStream 10 ms
ByteBuffer 1 ms

That is, the ByteBuffer is 10x faster. My code is available.

Because I have 1 million integers, we can convert back these timings into “time per integer”: the ByteBuffer approach achieves a speed of one 32-bit integer converted per nanosecond. Given that my iMac can execute probably something like a dozen operations per nanosecond, that’s not impressive… but it is at least a bit respectable. The DataInputStream takes 10 nanosecond (or something like 40 or 50 cycles) per integer: it is grossly inefficient.

This has interesting practical consequences. In the RoaringBitmap project, Alvarado pointed out that it is faster to create a bitmap using a ByteBuffer backend, and then convert it back into an normal data structure, than to construct it directly from an input steam. And the difference is not small.

Practically, this means that it may be worth it to provide a special function that can construct a bitmap directly from a ByteBuffer or a byte array, bypassing the stream approach. (Thankfully, we have bitmaps backed with ByteBuffer to support applications such as memory-file mapping.)

Speaking for myself, I was going under the impression that Java would do a fine job reading from a byte array using an input stream. At the very least, we found that ByteArrayInputStream is not the right tool. That a ByteBuffer would be fast is not so surprising: as far as I can tell, they were introduced in the standard API precisely for performance reasons. However, a factor of ten is quite a bit more than I expected.

In any case, it is a perfectly good example of the problem whereas abstractions force you to consume data as if it went through a straw. Streams and iterators are handy abstractions but they often lead you astray with respect to performance.

Further reading. Ondrej Kokes has reproduced these results in Go.

Simple table size estimates and 128-bit numbers (Java Edition)

Suppose that you are given a table. You know the number of rows, as well as how many distinct value each column has. For example, you know that there are two genders (in this particular table). Maybe there are 73 distinct age values. For a concrete example, take the standard Adult data set which is made of 48842 rows.

How many distinct entries do you expect the table to have? That is, if you remove all duplicate rows, what is the number of rows left?

There is a standard formula for this problem: Cardenas’ formula. It uses the simplistic model where there is no relationship between the distinct columns. In practice, it will tend to overestimate the number of rows. However, despite it is simplicity, it often works really well.

Let p be the product of all column cardinalities, and let n be number of rows, then the Cardenas estimate is p * (1 – (1 – 1/p)n). Simple right?

You can implement in Java easily enough…

double cardenas64(long[] cards, int n) {
    double product = 1;
    for(int k = 0;  k < cards.length; k++) {
      product *= cards[k];
    return product 
         * (1- Math.pow(1 - 1.0/product,n));

So let us put in the numbers… my column cardinalities are 16,16,15,5,2,94,21648,92,42,7,9,2,6,73,119; and I have 48842 rows. So what is Cardenas’ prediction?


At least, that’s what the Java function returns.

Why is that? The first problem is that 1 – 1/p is 1 when p is that large. And even if you could compute 1 – 1/p accurately enough, taking it to the power of 48842 is a problem.

So what do you do?

You can switch to something more accurate than double precision, that is quadruple precision (also called binary128). There is no native 128-bit floats in Java, but you can emulate them using the BigDecimal class. The code gets much uglier. Elegance aside, I assumed it would be a walk in the park, but I found that the implementation of the power function was numerically unstable, so I had to roll my own (from multiplications).

The core function looks like this…

double cardenas128(long[] cards, int n) {
    BigDecimal product = product(cards);
    BigDecimal oneover = BigDecimal.ONE.divide(product,
    BigDecimal proba = BigDecimal.ONE.subtract(oneover,
    proba = lemirepower(proba,n);
    return product.subtract(
       product.multiply(proba, MathContext.DECIMAL128),

It scales up to billions of rows and up to products of cardinalities that do not fit in any of Java’s native type. Though the computation involves fancy data types, it is probably more than fast enough for most applications.

My source code is available.

Update: You can avoid 128-bit numbers by using the log1p(x) and expm1(x) functions; they compute log(x + 1) and exp(x) – 1 in a numerically stable manner. The updated code is as follow:

 double cardenas64(long[] cards, int n) {
  double product = 1;
  for(int k = 0;  k < cards.length; k++) {
    product *= cards[k];
  return product * 
    -Math.expm1(Math.log1p(-1.0/product) * n);

(Credit: Harold Aptroot)