Toward data-driven science

Science and business, so far, have been mostly model-driven: you collect a few data points, just enough to fit your model, and then proceed from the model. However, things have changed:

  • Old: manually take samples of the water in a nearby lake (4 times a year). New: set up a wireless sensor in the lake (5,000 samples a day).
  • Old: design an algorithm and test it once on an expensive mainframe computer. New: build dozens of prototypes and test them on cheap laptops.
  • Old: have an accountant prepare a business intelligence report once a year. New: watch how the business is doing through a dynamic data warehouse.

Hence, improving access to data is fast becoming a critical issue. In a thought-provoking post, Andre Vellino sketches the future of Information Retrieval for data sets. Some key points:

  • Back in the early nineties, we had many electronic documents, but a comparatively poor infrastructure to share them. Then came the web and the search engines such as Google. Currently, we have many good data sets, but sharing and indexing them is painful. Clearly, we need to produce a better infrastructure for sharing data!
  • Research papers should reference data sets, by a unique identifier (such as a Digital Object Identifier), so that we can ask “What research relied on this data set?” or “Where can I find the data these authors have used?”

This is one instance where funding agencies should step in and encourage this work. It is not enough to encourage researchers to share their data. We need better tools too!

Daniel Lemire, "Toward data-driven science," in Daniel Lemire's blog, May 3, 2010.

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

9 thoughts on “Toward data-driven science”

  1. Interesting. This data-centric, “use the world as its own model” approach is exactly the one advocated by subsumption architecture robotics and other “embodied intelligence” AI work.

  2. In addition to a way to reference data sets, I’d like to see every analysis begin by asserting a checksum of the data. So you’re working with this data set, but are you starting with *exactly* this data, or are you doing a little off-record “cleaning” before you begin?

    Sometimes data need to be cleaned, maybe a great deal. But it should be done on the record.

  3. @John

    Of course, the problem is that, right now, sharing the data sets (cleaned or not) is a bit difficult. I use large data sets, and sometimes I have to do non-trivial processing on them. How do I share my results? I can post files on my own web site, but that’s hardly satisfactory.

    But just imagine if you could drill down to the data sets people have used, and do an analysis of your own? I think research would really be improved.

  4. Thanks Daniel. I have experienced the “Where can I find the data these authors have used?” problem several times. Testing recommender system performance is a good example where the results depend heavily on the data set. Even if you use the same data to test algorithms, sometimes you end up with different outcomes, just because of a strange 5-fold cross-validation procedure or other artifacts. I like the idea of a “unique identifier for datasets” very much.

  5. “I like the idea of a ‘unique identifier for datasets’ very much.”

    Technically, this is trivial (a SHA checksum over the whole content as the ID), but it is not yet perceived as a solution.
    I don’t think the DOI will do any good: it is geared toward property enforcement, not usability.

  6. @kevembuangga The idea of DOIs for datasets is more geared to making it possible to reference datasets as “publications” so that (a) they count in peer review for promotions and (b) they can be referenced in journal articles. The DOI idea enforces the application of metadata (author, title, abstract, etc.) so that, among other things, datasets can be searched and discovered. See DataCite.
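The checksum idea raised in the comments above (assert a checksum of the data before any analysis, and use a SHA digest of the content as a dataset identifier) can be sketched in a few lines of Python. This is a minimal illustration, not anyone’s published tooling; the function names and file paths are made up for the example.

```python
import hashlib


def dataset_checksum(path, algorithm="sha256"):
    """Return a hex digest of the file's full contents, usable as a dataset ID."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # Read in chunks so large data sets need not fit in memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def assert_dataset(path, expected):
    """Refuse to start the analysis unless the data matches the published checksum."""
    actual = dataset_checksum(path)
    if actual != expected:
        raise ValueError(
            "dataset mismatch: expected %s, got %s" % (expected, actual)
        )
```

An analysis script would then open with something like `assert_dataset("ratings.csv", "9f86d0…")` (a hypothetical file and digest), making any off-record cleaning immediately visible, since altered data changes the digest.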
