Statistics is overrated: the rise of data science

With the industrial and scientific revolutions, we saw the rise of enormous bureaucracies collecting reliable numbers. For the first time in history, we could ask about the total production of silver in England and get a meaningful answer.

But data is rarely complete. Most often, we only have partial views. Thankfully, clever people observed that it is generally unnecessary to do the full computation. From a small representative sample, you can almost always tell what the whole population looks like. To know the average height of Americans, you do not need to collect the height of every single American… a few hundred or a few thousand are enough… as long as they are representative.
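
To make the sampling claim concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the population is synthetic (heights drawn from a normal distribution with a made-up mean of 170 cm and standard deviation of 10 cm), and `estimate_mean` is a hypothetical helper, not a library function. It draws a simple random sample and reports the sample mean with an approximate 95% margin of error.

```python
import math
import random
import statistics


def estimate_mean(population, sample_size, rng):
    """Estimate the population mean from a simple random sample.

    Returns the sample mean and an approximate 95% margin of error
    (1.96 standard errors, using the normal approximation).
    """
    sample = rng.sample(population, sample_size)
    mean = statistics.fmean(sample)
    margin = 1.96 * statistics.stdev(sample) / math.sqrt(sample_size)
    return mean, margin


# Assumption: a synthetic population of 200,000 heights in centimeters.
# The 170 cm mean and 10 cm spread are invented for this example.
rng = random.Random(42)
population = [rng.gauss(170.0, 10.0) for _ in range(200_000)]

true_mean = statistics.fmean(population)
sample_mean, margin = estimate_mean(population, 1_000, rng)

print(f"true mean:   {true_mean:.2f} cm")
print(f"sample mean: {sample_mean:.2f} cm (+/- {margin:.2f} cm)")
```

The sketch relies on the sample being drawn uniformly at random. If the sample is not representative, say because tall people are more likely to answer, the margin of error tells you nothing about the bias.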

So we went from a pre-industrial world where people rarely quantified anything, to a world where everything must be accounted for. When we can’t count, we sample and estimate with good margins of error. So far so good. But what to do with all these numbers? Well, we must do something, anything, and if it sounds impressive and reputable, all the better!

There is a deluge of research papers using fancy statistical tests. Among those are the significance tests based on p-values. Except that hardly anybody knows what a p-value actually means. And does any of it help anyone?
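
For reference, a p-value is the probability, assuming there is no real effect, of seeing a result at least as extreme as the one observed. One way to see this is a permutation test, sketched below in Python. This is only an illustration: the two groups are made-up numbers, and `permutation_p_value` is a hypothetical helper. It shuffles the group labels many times and counts how often chance alone produces a difference in means as large as the observed one.

```python
import random
import statistics


def permutation_p_value(group_a, group_b, n_shuffles, rng):
    """Approximate two-sided p-value for a difference in means.

    Under the null hypothesis the group labels are interchangeable,
    so we reshuffle them and count how often the shuffled difference
    is at least as large as the observed one.
    """
    observed = abs(statistics.fmean(group_a) - statistics.fmean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        diff = abs(statistics.fmean(pooled[:n_a]) - statistics.fmean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_shuffles


# Assumption: invented measurements for two groups, for illustration only.
treated = [5.1, 4.8, 6.0, 5.6, 5.9, 6.3]
control = [4.2, 4.9, 4.4, 5.0, 4.1, 4.6]

p = permutation_p_value(treated, control, 10_000, random.Random(0))
print(f"approximate p-value: {p:.4f}")
```

Note what the number is not: it is not the probability that the hypothesis is true, and it says nothing about whether the difference matters in practice.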

Does steak cause cancer? Who knows? We have all these statistical “proofs” of contradictory results, all based on a glorious statistical analysis. Where is the evidence that it brings us closer to the truth?

Even the American Statistical Association says that p-values cannot determine whether a hypothesis is true or whether results are important (Baker, 2016). And the famous statistician Andrew Gelman goes further: the problems are deeper, and the solution is not to reform p-values or to replace them with some other statistical summary or threshold.

Why do people go on? Is it because it brings an air of respectability to the whole process?

Meanwhile, silly computer scientists actually do separate spam from real emails. We really do defeat human beings at chess and Go. We really do figure out whether your credit card purchase was fraudulent or not.

The end-game for computer scientists is to match and surpass the human mind in its ability to process information. The end-game for medical researchers is to keep us all perfectly healthy. What is the end-game for statistics? Do statisticians bring us ever closer to the statistical truth, or do they expect us to churn out an ever greater number of p-values each year?

Like librarians and journalists, statisticians are ripe for disruption. There is a new discipline called “data science”. Ironically, it was founded by statisticians in 2001, shortly before the Human Genome Project was completed (2003). If you look around, you will find many young (and not-so-young) people calling themselves data scientists.

They all exploit data, they all make it speak, they all try to bring out value from data. But how many of them do you think majored in statistics in college?

Software ate libraries and newspapers and it is now eating statistics.

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

9 thoughts on “Statistics is overrated: the rise of data science”

  1. I would halfway disagree. Yes, some branches of classical statistics are ripe for takeover. That doesn’t mean that there’s not great theoretical work going on in statistics. “What is the best prior to put on a complicated distribution?” is an example of a question where stats remains relevant. However, from an applied perspective, I think that statistics without data analysis is at best marginally useful.

  2. There seem to be two largely separate fields called “statistics”. One is classical statistics, which has its roots in the social sciences, and which is also widely used in medicine.

    Another, more computational field, has been prominent since at least the 90s. Some of its practitioners call themselves statisticians, while others prefer to be called mathematicians, computer scientists, physicists, or electrical engineers. Regardless of what they call themselves, everyone seems to be doing more or less the same thing. As far as I understand, “data science” is supposed to be a new name for this field of computational statistics, though it sounds more like marketing BS in the same way as “big data”.

  3. With all due respect, you need to do some reading about inductive inference and the history of science. The goal of statistical inference is to draw conclusions about the real world from data, whereas in computer science, we are mostly just interested in exploiting statistical regularities to make predictions, filter spam, and so on. You had better hope that the people doing medical research are performing randomized trials and controlling for the type I errors of their causal inferences. Statistics did NOT start in the social sciences, although social science is where statistics are most commonly abused. Statistics started in physics and chemistry, but took its biggest leaps in agricultural research under R. A. Fisher. Recommended blogs: Andrew Gelman (http://andrewgelman.com/) and Deborah Mayo (http://errorstatistics.com/).

    1. I am a long-time follower of Andrew Gelman, and he has referenced my own blog on at least one occasion.

      Medical research is a mess specifically because of statistics.

      1. Hi, Daniel! I’m interested in more of what you have to say regarding statistics making medical research a mess. Can you please elaborate? I intend to get my statistics degree in medical research, but I am having doubts myself about its efficacy. I would love to get your opinion.

    2. Where do you think the word “statistics” comes from? Its literal meaning is something like “(the science) of the state”.

      Statistics started as the study of demographic and economic data, and later widened its scope to the gathering and analysis of data in general. Because of this legacy, some traditional universities still place statistics in the Faculty of Social Sciences.

  4. Well, your conclusion seems to depend very much on where you draw the line between statistics and data science. At least for some, it is the same thing.

  5. Stupid you are not. But ignorant and stupid, this blog post is. Arrogant and prideful, you are. You want to throw the baby out with the bath water. Computer scientist, I am, but sadly some computer scientists let it go to their head and start trashing other fields that they fully not understand. If people don’t know what a p-value is, that says more about them than about statistics. Common sense.
