Suppose that you are working on a business problem where you have multiple attributes… maybe you have a table with multiple columns such as “age, gender, income, location, occupation”.

You might be interested in determining whether there are relations between some of these attributes. Maybe the income depends on the gender or the age?

That is fairly easy to do. You can take the gender column and the income column and do some statistics. You can compute Pearson’s correlation or some other measure.

If you have *N* attributes, you have *N* (*N*-1) / 2 distinct pairs of attributes, so there can be many pairs to check, but it is not so bad.

However, what if you have established that there is no significant relationship between any of your attributes when you take them two-by-two. Are you done?

Again, you could check for all possible sets (e.g., {age, gender, income}, {income, location, occupation}). The set of all possible sets is called the power set. It contains 2^{N} sets. So it grows exponentially with *N*, which means that for any large value of *N*, it is not practical to check all such sets.

But maybe you think that because you checked all pairs, you are done.

Maybe not.

Suppose that *x* and *y* are two attributes taking random integer values. So there is no sensible dependency between *x* and *y*. Then introduce *z* which is given by *z* = *x* + *y*. Clearly, *x* and *y* determine *z*. But there is no pairwise dependency between any of *x*, *y*, *z*.

To be precise, in Java, if you do the following…

Random r = new Random(); for(int k = 0; k < N; k++) { x[k] = r.nextInt(); y[k] = r.nextInt(); z[k] = x[k] + y[k]; }

then there is no correlation between (*y*, *z*) or (*x*, *z*) even though *x* + *y* = *z*.

So if you look only at (*x*, *y*), (*y*, *z*) and (*x*, *z*), this tells less than you might think about (*x*, *y*, *z*).

Thus, checking relationships pairwise is only the beginning…

Hi, if z = x + y, surely there are correlations between (x, z) and (y, z)? Perhaps there needs to be a nonlinear relationship – e.g. an XOR of binary random variables would have this property in this situation

Hi, if z = x + y, surely there are correlations between (x, z) and (y, z)?Is there?

Given full knowledge of

x, what can you say aboutz? By my assumptions, all you can say is thatzisxplus some random unknow value… so you cannot know whatzis givenx.If you have a data set with a bunch of values of xyz, and you measure the correlation between (x,z) and (y,z) with linear regression you’ll find that both have positive correlation. The correlation will be noisy, but in general higher values of x will likely have higher values of z. Thus in a statistical sense (x,z) and (y,z) aren’t independent.

Thus in a statistical sense (x,z) and (y,z) aren’t independent.They are in this example:

https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/blob/master/2017/12/12/Test.java

The point of my blog post is that pairwise relations do not capture all relations.

This experimental results seems to occur because you are adding two uniformly-random full-range integers, and the addition overflows. This “destroys” the correlation. Replace both calls of r.nextInt() with r.nextInt(100), and I get a correlation of about 0.7 on the second test. I didn’t do the math, but I can imagine that “your” x, y and z are indeed all independent.

Your actual point is, of course, still completely valid: pairwise independence does not imply mutual independence. Thanks for the public service announcement 🙂

p(z=a,x=b) = p(y=a-b,x=b) = p(y=a-b)*p(x=b)

now for independence to hold, p(z=a)=p(x+y=a) must be always equal to p(y=a-b) for arbitrary a and b. Why would this generally be the case?

It depends on the variances of random variables x and y. If x has a variance much larger than that of y, you may see a strong correlation between x and z but none between y and z. If x and y both have small variances, you would be correct.

Thanks for this work, Daniel. It disproves one of the key assumptions I made in Calcite’s data profiler https://issues.apache.org/jira/browse/CALCITE-1616. Now I need to find an alternative approach.

For this work, my definition of “independence” is as follows. x and y are independent if the number of distinct values of (x, y) is as you would expect given the number of distinct values of (x) and (y) individually in a particular sample.

Example 1. In the Customers table, zipcode is dependent on id. There are 1,000,000 records in the table, 1,000,000 distinct values of id, 41,665 distinct values of zipcode, and 1,000,000 distinct values of (id, zipcode) combined. id is in fact a key, so it is unsurprising that zipcode is dependent upon it.

Example 2. In the Customers table, zipcode is almost dependent on state. Among the 1,000,000 records, there are 50 states, 41,665 zipcodes, and 41,811 combinations of (state, zipcode). In other words, a few zipcodes cross state boundaries, but not many.

Of course, you can’t prove that there is a functional dependency by looking at a sample of the data; someone could come along in a minute and add a record that breaks the dependency. But for some applications (e.g. query optimization) it is useful to be able to find groups of columns that are approximately functionally dependent.

Your example reminds me of exotic normal forms. If (x, y), (y, z) and (x, z) are unique, then we have a textbook example of a relation that is in 4th normal form but not 5th normal form. See https://en.wikipedia.org/wiki/Fifth_normal_form. Conventional wisdom is that tables that are in 4NF but not in 5NF are rare in the real world, but I’m not sure the same can be said if (x, y), (y, z) and (x, z) are not necessarily keys.

You could build a 2D matrix consisting of the counts of each distinct tuple. Then, for each x, you can compute P(Y=y|X=x) by dividing the count at (x,y) by the sum of the column, if this number is close to 1, you have a functional dependency of Y on X. If P(X=x|Y=y), computed by dividing by the sum of the row, is close to zero, you know the dependency is not mutual. For instance the probability that the state is NY given the zip is 10001 is 1, whereas the probability that the zip is 10001 given that the state is NY is approximately zero.

Richard, Rather than storing a count at each (x, y) I am storing a boolean – whether any records exist that have that (state, zipcode) combination – and then doing an approximate sum of the booleans. Thus the whole matrix is summarized by a single integer. Clearly your approach has more information, but my challenge is to make do with less information, because I want to cover all possible matrices.

Don’t know much about statistics but I guess this is related:

Multivariate Dependence Beyond Shannon Information

jld, I don’t know much statistics either but the paper you cite seems to be exactly on target. I will read and digest – thanks!