Suppose that you are given a table. You know the number of rows, as well as how many distinct value each column has. For example, you know that there are two genders (in this particular table). Maybe there are 73 distinct age values. For a concrete example, take the standard Adult data set which is made of 48842 rows.

How many distinct entries do you expect the table to have? That is, if you remove all duplicate rows, what is the number of rows left?

There is a standard formula for this problem: Cardenas’ formula. It uses the simplistic model where there is no relationship between the distinct columns. In practice, it will tend to overestimate the number of rows. However, despite it is simplicity, it often works really well.

Let p be the product of all column cardinalities, and let n be number of rows, then the Cardenas estimate is p * (1 – (1 – 1/p)^{n}). Simple right?

You can implement in Java easily enough…

double cardenas64(long[] cards, int n) { double product = 1; for(int k = 0; k < cards.length; k++) { product *= cards[k]; } return product * (1- Math.pow(1 - 1.0/product,n)); }

So let us put in the numbers… my column cardinalities are 16,16,15,5,2,94,21648,92,42,7,9,2,6,73,119; and I have 48842 rows. So what is Cardenas’ prediction?

Zero.

At least, that’s what the Java function returns.

Why is that? The first problem is that 1 – 1/p is 1 when p is that large. And even if you could compute 1 – 1/p accurately enough, taking it to the power of 48842 is a problem.

So what do you do?

You can switch to something more accurate than double precision, that is quadruple precision (also called binary128). There is no native 128-bit floats in Java, but you can emulate them using the BigDecimal class. The code gets much uglier. Elegance aside, I assumed it would be a walk in the park, but I found that the implementation of the power function was numerically unstable, so I had to roll my own (from multiplications).

The core function looks like this…

double cardenas128(long[] cards, int n) { BigDecimal product = product(cards); BigDecimal oneover = BigDecimal.ONE.divide(product, MathContext.DECIMAL128); BigDecimal proba = BigDecimal.ONE.subtract(oneover, MathContext.DECIMAL128); proba = lemirepower(proba,n); return product.subtract( product.multiply(proba, MathContext.DECIMAL128), MathContext.DECIMAL128).doubleValue(); }

It scales up to billions of rows and up to products of cardinalities that do not fit in any of Java’s native type. Though the computation involves fancy data types, it is probably more than fast enough for most applications.

**Update**: You can avoid 128-bit numbers by using the log1p(x) and expm1(x) functions; they compute log(x + 1) and exp(x) – 1 in a numerically stable manner. The updated code is as follow:

double cardenas64(long[] cards, int n) { double product = 1; for(int k = 0; k < cards.length; k++) { product *= cards[k]; } return product * -Math.expm1(Math.log1p(-1.0/product) * n); }

(Credit: Harold Aptroot)

To address the precision problem you could also take the first few terms of the binomial expansion of (1 – 1/p)^n – with a large p the terms go to zero essentially immediately so only the first few terms matter. The first term is 1, which nicely cancels out with the the preceeding

`(1 -`

part. Finally, you can multiply in the initial p since it cancels out also, since every other term is of the form`C / p^m`

.So you are left only with a few terms like A + B/p + C/p^2 … where A, B, C are the binomial coefficents (are these easy to calculate when big? maybe).

I dunno if that’s better or worse, but it’s a common approach to tackling this kind of problem: re-arranging the problematic part of the formula.

Forgot to add: I haven’t tried it, so I could definitely be full of it…

Since p is HUGE, the zero’th order approximation is going to be pretty much perfect.

(1-1/p)^n~ 1-n/p,

expand everything out, and to zeroth’s order the approximation is n.

If you want to go further, do a Taylor series in 1/p and you get

n – ((-1 + n) n)/(2 p) + ((-2 + n) (-1 + n) n)/(6 p^2)

But these correction terms are miniscule.

The exact (to 20 digits) answer is

48841.999999999999398

and that correction (of magnitude 10^-12) is the first order in 1/p correction. The second order correction is magnitude 10^-30.

So, yeah, for all practical purpose, no need for 128-bit arithmetic, just approximate Cardenas ~= n

You can use the lim (1+1/n)^n approximation to $e$ to solve this problem mathematically. An example computation: http://www.moderndescartes.com/essays/birthday_paradox/

This can also be done in standard “double” arithmetic, using the log1p and expm1 functions, which are inteded exactly to counter this sort of instabilities.

See my answer here for how to perform exactly this calculation: https://math.stackexchange.com/a/2317045/65548 . I guess this method is going to be much faster than using BigDecimal.

Aptroot on twitter suggested product * -Math.expm1(Math.log1p(-1.0/product) * n).