Many real-life data sets follow power laws or Zipfian distributions. An integer-valued random variable X follows a power law with parameter a if P(X=k) is proportional to k^(-a). Panos asked what the sum of two power laws is. He cites Wilke et al., who claim that the sum of two power laws X and Y with parameters a and b is a power law with parameter min(a,b).

I relate this problem to the sum of exponentials. Any engineer knows that if a>b, then e^(at) + e^(bt) will be approximately e^(at) for t sufficiently large.

Hence, I think that the sum of power law distributions X and Y is a power law distribution with parameter min(a,b) if you are only interested in large values of k in P(X+Y=k).
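A quick numerical check of this tail claim can be sketched as follows. It is only an illustration under assumptions of my own: X and Y are taken to be independent, both laws are truncated to 1..n, and the exponent is estimated from the log-log slope of P(X+Y=k) between k and 2k. The function names are hypothetical.

```python
import math

def power_law_pmf(a, n):
    """Truncated power law on 1..n; entry k holds P(X=k), entry 0 is unused."""
    weights = [0.0] + [k ** (-a) for k in range(1, n + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def sum_pmf_at(p, q, s):
    """P(X+Y=s) for independent X ~ p and Y ~ q (discrete convolution)."""
    n = len(p) - 1
    lo, hi = max(1, s - n), min(n, s - 1)
    return sum(p[i] * q[s - i] for i in range(lo, hi + 1))

def local_exponent(p, q, k):
    """Estimate the power-law exponent from the slope between k and 2k."""
    ratio = sum_pmf_at(p, q, 2 * k) / sum_pmf_at(p, q, k)
    return -math.log(ratio) / math.log(2)

if __name__ == "__main__":
    n = 4000                      # truncation point (an assumption)
    p = power_law_pmf(2.0, n)     # a = 2
    q = power_law_pmf(3.5, n)     # b = 3.5
    print(local_exponent(p, q, 2000))  # should land close to min(a, b) = 2
```

Keeping 2k at or below n means the truncation only changes the normalizing constants, not the slope. A finite-range slope is evidence about the tail, not a proof that the sum is exactly a power law.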

For extra credit, help me solve this problem. Suppose that I have two power laws with the same parameter. Is their sum a power law with the same parameter? (I predict it is not!)

Egghe showed in "The distribution of N-grams" that even if the words follow a power law, the n-grams won’t!

Disclaimer: Yes, I am being lazy. I could work it out.

Many democratic systems require vote diversity. You do not get elected prime minister of Canada by rallying the largest number of voters. You also need to have your votes spread out over several regions.

Similarly, Scott Karp argues that completely open social networks fail. He takes two examples: Digg and Wikipedia.

Digg recommends web sites based on user votes. They recently modified their algorithm:

The algorithm change effectively holds back from the homepage any story that is Dugg by the same groups of friends, i.e. a group that is not “diverse,” (…)

As for Wikipedia, Karp points out that it is not really an open system, since a group of editors has a great deal of control.

Stephen Downes asks an interesting question: what constraints make a network effective?

The wisdom of crowds is not obtained by mere voting. What is required — as the new Digg algorithm explicitly recognizes — is diversity.

I would like to formalize this problem. You are given a set of users and their votes on several issues, as in the Digg community. You are not told explicitly what the cliques, or sets of friends, are. Is there a canonical way to take diversity into account when counting votes?
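To make the question concrete, here is one possible heuristic, certainly not a canonical answer and not Digg's actual algorithm. It assumes that users who vote alike are likely friends, measures that with the Jaccard similarity of their voting histories, and discounts each vote by how many look-alike voters backed the same story. The names `jaccard` and `diverse_score` are made up for this sketch.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets of stories (0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def diverse_score(story, votes):
    """Diversity-discounted vote count for one story.

    votes maps each user to the set of stories that user voted for.
    Cliques are never given: they are inferred from similar voting histories.
    """
    voters = [u for u, stories in votes.items() if story in stories]
    score = 0.0
    for u in voters:
        # Redundancy is 1 for a lone voter and grows with look-alike co-voters.
        redundancy = sum(jaccard(votes[u], votes[v]) for v in voters)
        score += 1.0 / redundancy
    return score

if __name__ == "__main__":
    votes = {
        "a": {"s1", "s2", "s3"}, "b": {"s1", "s2", "s3"}, "c": {"s1", "s2", "s3"},
        "d": {"s4", "s9"}, "e": {"s4", "s8"}, "f": {"s4", "s7"},
    }
    print(diverse_score("s1", votes))  # three identical voters count as about one
    print(diverse_score("s4", votes))  # three dissimilar voters count for more
```

Both stories have three raw votes, but the one supported by users with otherwise different histories scores higher. Whether any such weighting deserves to be called canonical is exactly the open question.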

According to Curt Monash, few people should be buying high-end database management systems:

There are relatively few applications that wouldn’t run perfectly well on PostgreSQL or EnterpriseDB today. (…)

What’s more, these mid-range database management systems can have significant advantages over their high-end brethren. The biggest is often price, for licenses and maintenance alike. Beyond that, they can be much easier to administer than their more complex counterparts. (…)

And what these mid-range DBMS don’t do today, they likely will do soon. (…)

EnterpriseDB is equal or superior in every way I can think of to Oracle7, a few security certifications perhaps excepted.

If you work for an organization that has expensive contracts with Oracle or Microsoft for its DBMS, that money is most certainly spent in vain.

Meanwhile, the world of open source Business Intelligence is getting more interesting every day. We now have Pentaho Mondrian, Jedox, Birt, Enhydra Octopus, and so on. In 2005, I asked whether open source was ready for Business Intelligence. The question seems less controversial in 2008, doesn’t it?

Most of the database industry has been commoditized. If you stick around with these old schemas, you lose.

Tag clouds are graphical representations of attributes and their relative importance. In a recent paper, we have argued that tag clouds might help bridge the gap between collaboration and Business Intelligence.

Here are some fun things to do with tag clouds:

  • In our paper, a tag cloud computation is the equivalent of an approximate orthogonal top-k range query. There has been little work in this area. We propose error measures for this problem. Our own approach is based on the pre-computation of icebergs.
  • Unlike bar charts, a tag cloud can have 50, 100 or 150 attributes. This makes it easier to collaborate because you need to rely on hierarchies less often. However, tag clouds tend to mix badly with non-nominal dimensions such as time or price. More generally, more work is needed on multidimensional tag clouds.
  • The problem of optimally drawing tag clouds is still very much open.
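To illustrate the first point in the list above, here is a minimal sketch of a tag cloud computed as a top-k range query. It is exact and brute-force, so it shows the query being approximated rather than our iceberg-based approach. The row layout, the `measure` field, the linear font scaling and the sample data are all assumptions made for the example.

```python
from collections import Counter

def tag_cloud(rows, tag_key, k, selection=None, min_pt=10, max_pt=32):
    """Return (tag, weight, font size) for the k heaviest tags.

    selection maps a dimension name to an inclusive (low, high) range;
    only rows falling inside every range are aggregated.
    """
    selection = selection or {}
    totals = Counter()
    for row in rows:
        if all(lo <= row[dim] <= hi for dim, (lo, hi) in selection.items()):
            totals[row[tag_key]] += row["measure"]
    top = totals.most_common(k)
    if not top:
        return []
    heaviest, lightest = top[0][1], top[-1][1]
    span = heaviest - lightest
    cloud = []
    for tag, weight in top:
        # Linear scaling between min_pt and max_pt (one choice among many).
        frac = (weight - lightest) / span if span else 1.0
        cloud.append((tag, weight, round(min_pt + frac * (max_pt - min_pt))))
    return sorted(cloud)  # alphabetical order, as tag clouds are often drawn

if __name__ == "__main__":
    rows = [  # made-up sample data
        {"city": "Montreal", "year": 2006, "measure": 5},
        {"city": "Toronto", "year": 2007, "measure": 3},
        {"city": "Montreal", "year": 2007, "measure": 4},
        {"city": "Ottawa", "year": 2007, "measure": 1},
        {"city": "Quebec", "year": 2007, "measure": 2},
    ]
    print(tag_cloud(rows, "city", 2, {"year": (2007, 2007)}))
```

Scanning every row per query is what makes precomputation attractive on large data sets. The layout problem, that is, where each tag goes on the screen, is not addressed here.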

Now that January 2008 is coming to an end, maybe it is time to give 2007 a final look. According to my logs, my most popular blog posts in 2007 are:
