Slashdot: Why Is Data Mining Still A Frontier?

Slashdot asks “Why Is Data Mining Still A Frontier?” The article itself is not very exciting, but the comments are great. Here are some I like:

I would suggest that, in practice, the real difficulty is that the problems that need to really be solved for data mining to be as effective as some people seem to wish it was are, when you actually get down to it, issues of pure mathematics. Research in pure mathematics (and pure CS which is awfully similar really) is just hard. Pretending that this is a new and growing field is actually somewhat of a lie.

Available datasets are not themselves in anything like normal relational form, and so have potential internal inconsistencies. And that gets in the way before you even have the chance to try to form intelligent inferences based on relations between data sets, which of course are terribly inconsistent.

The ultimate problem is that, for most datasets, there is an infinite set (at least) of relations that can be induced from the data. This doesn’t even address the issue that the choice of available data is a human task. However, going back to assuming we have all the data possible, you still need to have a specific performance task in mind.

To sum it up:

  • Data Mining requires hard and fancy Mathematics.
  • Data cleaning and integration are hard.
  • There are infinitely many ways to mine data and it is not obvious a priori what is useful.
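
The last point can be made concrete with a toy sketch. This is my own illustration, not something taken from the Slashdot thread: the three observations and the family of rules `y = x + k*(x-1)*(x-2)*(x-3)` are assumptions chosen so that every rule fits the data exactly, yet each one predicts something different for the next point.

```python
# Three observations that look like they follow y = x (made-up toy data).
observations = [(1, 1), (2, 2), (3, 3)]


def make_rule(k):
    """Return the rule y = x + k*(x-1)*(x-2)*(x-3).

    The extra term vanishes at x = 1, 2, 3, so every k fits the data.
    """
    def rule(x):
        return x + k * (x - 1) * (x - 2) * (x - 3)
    return rule


def fits(rule, data):
    """True if the rule reproduces every observed (x, y) pair exactly."""
    return all(rule(x) == y for x, y in data)


# One candidate rule per integer k; any other range of k behaves the same way.
candidates = {k: make_rule(k) for k in range(-2, 3)}

# Every candidate is consistent with the observations...
consistent = [k for k, rule in candidates.items() if fits(rule, observations)]

# ...but they disagree about the unseen point x = 4.
predictions = {k: candidates[k](4) for k in consistent}

print(consistent)
print(predictions)
```

Since `k` can be any number, the data alone cannot pick a rule. You need a performance task, or some prior preference such as simplicity, to decide which relation is worth keeping.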

I think Data Mining is a beautiful research topic. However, as the comments indicate, it is very hard and it requires wide-ranging expertise.

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

2 thoughts on “Slashdot: Why Is Data Mining Still A Frontier?”

  1. I’ve always wanted to get into Data Mining, but I never learned where to start. Daniel – do you have any suggestions? Should I start with a book, read some reference manuals, subscribe to some blogs…?

  2. There are good books out there, but if you are only interested in some applications, it is easier to look for specific papers. Data Mining is a wide-ranging field and studying it all is virtually impossible. For example, most agree that OLAP and data warehousing are a crucial part of Data Mining or at least closely related to it (which raises the question of how you define “Data Mining”). Then, you have time series (wavelets, Fourier,…), machine learning and so on.
