Slashdot asks “Why Is Data Mining Still A Frontier?” The article itself is not very exciting, but the comments are great. Here are some I like:

I would suggest that, in practice, the real difficulty is that the problems that need to really be solved for data mining to be as effective as some people seem to wish it was are, when you actually get down to it, issues of pure mathematics. Research in pure mathematics (and pure CS which is awfully similar really) is just hard. Pretending that this is a new and growing field is actually somewhat of a lie.

Available datasets are not themselves in anything like normal relational form, and so have potential internal inconsistencies. And that gets in the way before you even have the chance to try to form intelligent inferences based on relations between data sets, which of course are terribly inconsistent.

The ultimate problem, is that for most datasets, there are an infinite (at least), set of relations that can be induced from the data. This doesn’t even address the issue, that the choice of available data is a human task. However, going back to assuming we have all the data possible, you still need to have a specific performance task in mind.

To sum it up:

- Data Mining requires hard and fancy Mathematics.
- Data cleaning and integration are hard.
- There are infinitely many ways to mine data and it is not obvious *a priori* what is useful.

I think Data Mining is a beautiful research topic. However, as the comments indicate, it is very hard and it requires wide-ranging expertise.

I’ve always wanted to get into Data Mining, but I never learned where to start. Daniel – do you have any suggestions? Should I start with a book, read some reference manuals, subscribe to some blogs…?

There are good books out there, but if you are only interested in some applications, it is easier to look for specific papers. Data Mining is a wide-ranging field and studying it all is virtually impossible. For example, most agree that OLAP and data warehousing are a crucial part of Data Mining or at least closely related to it (which raises the question of how you define “Data Mining”). Then, you have time series (wavelets, Fourier, …), machine learning and so on.