Simple table size estimates and 128-bit numbers (Java Edition)

Suppose that you are given a table. You know the number of rows, as well as how many distinct values each column has. For example, you know that there are two genders (in this particular table). Maybe there are 73 distinct age values. For a concrete example, take the standard Adult data set, which is made of 48842 rows.

How many distinct entries do you expect the table to have? That is, if you remove all duplicate rows, what is the number of rows left?

There is a standard formula for this problem: Cardenas’ formula. It uses the simplistic model where there is no relationship between the distinct columns. In practice, it will tend to overestimate the number of rows. However, despite its simplicity, it often works really well.

Let p be the product of all column cardinalities, and let n be the number of rows; then the Cardenas estimate is p * (1 – (1 – 1/p)^n). Simple right?

You can implement it in Java easily enough…

double cardenas64(long[] cards, int n) {
    double product = 1;
    for (int k = 0; k < cards.length; k++) {
        product *= cards[k];
    }
    return product
        * (1 - Math.pow(1 - 1.0 / product, n));
}
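On a small table the function behaves as expected. As a quick sanity check (my own toy numbers, not from the post), take two columns with 2 and 3 distinct values, so p = 6, and n = 4 rows; the formula predicts 6 * (1 – (5/6)^4) ≈ 3.1:

```java
public class CardenasDemo {
    // Cardenas' estimate in plain double precision.
    static double cardenas64(long[] cards, int n) {
        double product = 1;
        for (long c : cards) product *= c;
        return product * (1 - Math.pow(1 - 1.0 / product, n));
    }

    public static void main(String[] args) {
        // Two columns with 2 and 3 distinct values: p = 6, n = 4 rows.
        // Estimate: 6 * (1 - (5/6)^4) ≈ 3.1065
        System.out.println(cardenas64(new long[]{2, 3}, 4));
    }
}
```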

So let us put in the numbers… my column cardinalities are 16,16,15,5,2,94,21648,92,42,7,9,2,6,73,119; and I have 48842 rows. So what is Cardenas’ prediction?

Zero.

At least, that’s what the Java function returns.

Why is that? The first problem is that 1 – 1/p rounds to exactly 1 in double precision when p is that large: 1/p falls below the unit roundoff. And even if you could compute 1 – 1/p accurately enough, taking it to the power of 48842 is a problem.
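You can see the collapse directly. The product of the cardinalities above is roughly 2e21 (my back-of-the-envelope value), while the gap between 1.0 and the next smaller double is about 1.1e-16, so the subtraction loses everything:

```java
public class PrecisionDemo {
    public static void main(String[] args) {
        double p = 2e21; // roughly the product of the column cardinalities

        // 1/p is far below the double-precision unit roundoff (~1.1e-16),
        // so 1 - 1/p rounds to exactly 1.0...
        System.out.println(1.0 - 1.0 / p == 1.0); // prints true

        // ...and 1 raised to any power is still 1, so the estimate
        // becomes p * (1 - 1) = 0.
        System.out.println(Math.pow(1.0 - 1.0 / p, 48842)); // prints 1.0
    }
}
```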

So what do you do?

You can switch to something more accurate than double precision, namely quadruple precision (also called binary128). There are no native 128-bit floats in Java, but you can emulate them using the BigDecimal class. The code gets much uglier. Elegance aside, I assumed it would be a walk in the park, but I found that the implementation of the power function was numerically unstable, so I had to roll my own (from multiplications).

The core function looks like this…

double cardenas128(long[] cards, int n) {
    BigDecimal product = product(cards);
    BigDecimal oneover = BigDecimal.ONE.divide(product,
        MathContext.DECIMAL128);
    BigDecimal proba = BigDecimal.ONE.subtract(oneover,
        MathContext.DECIMAL128);
    proba = lemirepower(proba, n);
    return product.subtract(
        product.multiply(proba, MathContext.DECIMAL128),
        MathContext.DECIMAL128).doubleValue();
}
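The post does not show lemirepower. A plausible sketch, built purely from multiplications as described, is exponentiation by squaring (the name matches the post; the body below is my guess, not necessarily the author's code):

```java
import java.math.BigDecimal;
import java.math.MathContext;

public class PowerSketch {
    // Raises base to the non-negative integer power n using exponentiation
    // by squaring. Every step is a single BigDecimal multiplication rounded
    // to DECIMAL128, so the error stays bounded; this avoids the
    // instability the post reports with the built-in power routine.
    static BigDecimal lemirepower(BigDecimal base, int n) {
        BigDecimal result = BigDecimal.ONE;
        BigDecimal square = base;
        while (n > 0) {
            if ((n & 1) == 1) {
                result = result.multiply(square, MathContext.DECIMAL128);
            }
            square = square.multiply(square, MathContext.DECIMAL128);
            n >>= 1;
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(lemirepower(new BigDecimal("2"), 10)); // 1024
    }
}
```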

It scales up to billions of rows and up to products of cardinalities that do not fit in any of Java’s native types. Though the computation involves fancy data types, it is probably more than fast enough for most applications.

My source code is available.

Update: You can avoid 128-bit numbers by using the log1p(x) and expm1(x) functions; they compute log(x + 1) and exp(x) – 1 in a numerically stable manner. The updated code is as follows:

double cardenas64(long[] cards, int n) {
    double product = 1;
    for (int k = 0; k < cards.length; k++) {
        product *= cards[k];
    }
    return product *
        -Math.expm1(Math.log1p(-1.0 / product) * n);
}

(Credit: Harold Aptroot)
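As a sanity check (my own test, not from the post), applying this version to the cardinalities above returns essentially n, which matches the exact answer of 48841.999999999999398 quoted in the comments below:

```java
public class Expm1Demo {
    // Cardenas' estimate, rewritten with log1p/expm1 so that the tiny
    // quantity 1/p is never absorbed into a subtraction from 1.
    static double cardenas64(long[] cards, int n) {
        double product = 1;
        for (long c : cards) product *= c;
        return product * -Math.expm1(Math.log1p(-1.0 / product) * n);
    }

    public static void main(String[] args) {
        long[] cards = {16, 16, 15, 5, 2, 94, 21648, 92,
                        42, 7, 9, 2, 6, 73, 119};
        // With p on the order of 2e21 and n = 48842, the estimate is
        // essentially n itself: nearly every row is distinct.
        System.out.println(cardenas64(cards, 48842));
    }
}
```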

Published by

Daniel Lemire

A computer science professor at the Université du Québec (TELUQ).

6 thoughts on “Simple table size estimates and 128-bit numbers (Java Edition)”

  1. To address the precision problem you could also take the first few terms of the binomial expansion of (1 – 1/p)^n – with a large p the terms go to zero essentially immediately, so only the first few terms matter. The first term is 1, which nicely cancels out with the preceding (1 – part. Finally, you can multiply in the initial p since it cancels out also, since every other term is of the form C / p^m.

    So you are left only with a few terms like A + B/p + C/p^2 … where A, B, C are the binomial coefficients (are these easy to calculate when big? maybe).

    I dunno if that’s better or worse, but it’s a common approach to tackling this kind of problem: re-arranging the problematic part of the formula.

    1. Since p is HUGE, the zeroth-order approximation is going to be pretty much perfect:
      (1 – 1/p)^n ≈ 1 – n/p,
      expand everything out, and to zeroth order the approximation is n.

      If you want to go further, do a Taylor series in 1/p and you get
      n – ((-1 + n) n)/(2 p) + ((-2 + n) (-1 + n) n)/(6 p^2)

      But these correction terms are miniscule.
      The exact (to 20 digits) answer is
      48841.999999999999398
      and that correction (of magnitude 10^-12) is the first order in 1/p correction. The second order correction is magnitude 10^-30.

      So, yeah, for all practical purposes, no need for 128-bit arithmetic, just approximate Cardenas ≈ n
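The commenter's first-order correction term can be checked numerically (a quick sketch of my own, using the product of the post's cardinalities for p):

```java
public class CorrectionDemo {
    // First-order term of the Taylor expansion in 1/p:
    // n*(n-1)/(2p), the leading correction to the estimate "n".
    static double firstOrderCorrection(long[] cards, double n) {
        double p = 1;
        for (long c : cards) p *= c;
        return n * (n - 1) / (2 * p);
    }

    public static void main(String[] args) {
        long[] cards = {16, 16, 15, 5, 2, 94, 21648, 92,
                        42, 7, 9, 2, 6, 73, 119};
        // With n = 48842 and p ~ 2e21, this is on the order of 1e-12,
        // matching the magnitude quoted in the comment above.
        System.out.println(firstOrderCorrection(cards, 48842));
    }
}
```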
