# Data scientists need to learn about significant digits

Suppose that you classify people by income or gender. Your boss asks you about the precision of your model. Which answer do you give: whatever your software tells you (e.g., 87.14234%), or a number with a small, fixed number of significant digits (e.g., 87%)?

The latter is the right answer in almost all instances. And the difference matters:

1. There is a general principle at play when communicating with human beings: you should give just the relevant information, nothing more. Most human beings are happy with a 1% error margin. There are, of course, exceptions: high-energy physicists might need the mass of a particle down to 6 significant digits. However, if you are doing data science or statistics, it is highly unlikely that people will care about more than two significant digits.
2. Overly precise numbers are often misleading because your actual accuracy is much lower. Wikipedia tells us that the number of significant digits implies some knowledge about your uncertainty:

Uncertainty may be implied by the last significant figure if it is not explicitly expressed. The implied uncertainty is ± the half of the minimum scale at the last significant figure position. For example, if the mass of an object is reported as 3.78 kg without mentioning uncertainty, then ± 0.005 kg measurement uncertainty may be implied.

So if you give 4 digits, you are telling us that you know the true value very precisely. Yes, you have 10,000 samples and properly classified 5,124 of them, so your mathematical precision is 0.5124. But if you stop there, you show that you have not given much thought to your error margin. First of all, you are probably working from a sample. If someone else redid your work, they might have a different sample. Even if they used exactly the same algorithm as you, implementation matters. Small things like how your records are ordered can change results. Moreover, most software is not truly deterministic. Even if you were to run exactly the same software twice on the same data, you probably would not get the same answers. Software needs to break ties, and often does so arbitrarily or randomly. Some algorithms involve sampling or other randomization. Cross-validation is often randomized.
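To put rough numbers on this, here is a minimal Python sketch using the 5,124 out of 10,000 example. It assumes a simple binomial model with a normal approximation, which is an assumption made here for illustration and not something the post prescribes. It captures sampling noise only, not the ordering, tie-breaking, or randomization effects mentioned above. The function names are hypothetical.

```python
import math

def binomial_standard_error(correct: int, total: int) -> float:
    """Standard error of a proportion under a simple binomial model (assumption)."""
    p = correct / total
    return math.sqrt(p * (1 - p) / total)

def implied_uncertainty(decimals: int) -> float:
    """Half a unit in the last reported decimal place, as in the quoted rule."""
    return 0.5 * 10 ** (-decimals)

correct, total = 5124, 10000
p = correct / total
se = binomial_standard_error(correct, total)

print(f"raw value:               {p}")
print(f"standard error:          {se:.4f}")
print(f"95% margin (1.96 * se):  {1.96 * se:.3f}")
print(f"implied by four digits:  +/- {implied_uncertainty(4)}")
print(f"implied by two digits:   +/- {implied_uncertainty(2)}")
```

Under these assumptions the standard error is roughly 0.005, about a hundred times larger than the ± 0.00005 that writing 0.5124 implies. Reporting 0.51 implies ± 0.005, which is much closer to what the sample can support.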

I am not advocating that you should go as far as reporting exact error margins for each and every measure you report. It gets cumbersome for both the reader and the author. It is also not the case that you should never use many significant digits. However, if you write a report or a research paper, and you report measures, like precision or timings, and you have not given any thought to significant digits, you are doing it wrong. You must choose the number of significant digits deliberately.
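One way to make that choice explicit in code is to round at reporting time, with the number of significant digits as a visible parameter. The following Python sketch is illustrative only: `round_sig` and `format_percent` are hypothetical helper names, and the default of two digits simply mirrors the suggestion earlier in this post.

```python
import math

def round_sig(x: float, sig: int = 2) -> float:
    """Round x to `sig` significant digits."""
    if x == 0:
        return 0.0
    # Position of the leading digit: 87.1 -> 1, 0.51 -> -1
    magnitude = math.floor(math.log10(abs(x)))
    return round(x, sig - 1 - magnitude)

def format_percent(fraction: float, sig: int = 2) -> str:
    """Format a fraction in [0, 1] as a percentage with `sig` significant digits."""
    return f"{round_sig(100 * fraction, sig):g}%"

print(format_percent(0.8714234))   # the 87.14234% example from the top of the post
print(round_sig(0.5124, 2))
print(round_sig(12345.678, 2))
```

Keeping the full-precision value in your data and rounding only when you print means the choice of digits is made once, on purpose, and is easy to change. Note that a float cannot carry trailing zeros, so a value such as 0.50 would need string formatting to show both digits.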

There are objections to my view:

• “I have been using 6 significant digits for years and nobody ever objected.” That is true. There are entire communities that have never heard of the concept of significant digits. But that is not an excuse.
• “It sounds more serious to offer more precision, this way people know that I did not make it up.” It may be true that some people are easily impressed by very precise answers, but serious people will not be so easily fooled, and non-specialists will be turned off by the excessive precision.

Daniel Lemire, "Data scientists need to learn about significant digits," in Daniel Lemire's blog, January 29, 2019.

### Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

## 3 thoughts on “Data scientists need to learn about significant digits”

1. John the Scott says:

most excellent post. i recommend gustafson’s book for another angle on digital error.

https://www.amazon.com/End-Error-Computing-Chapman-Computational/dp/1482239868/ref=sr_1_1?s=books&ie=UTF8&qid=1548866338&sr=1-1&keywords=the+end+of+error

2. ttoinou says:

Of course you’re right.

If you’re exchanging information with scientists or engineers, you could also provide with every figure F its ±P “precision” (a Y% chance of being in the Gaussian centered on F with k(Y)*P standard deviation, k to be computed from Y). That way, if the person you’re giving information to needs to compute a new statistic, they can combine the Gaussian models and get a new (F’ ± P’).

3. Michael Nelson says:

I would add to the statement “serious people will not be so easily fooled.” When I see such precision, it reduces my confidence in the source. My internal “bozo” warning light comes on.
I had the concept of significant digits pounded into my head by my (very excellent) high school science teachers. Now I have an aversion to over-precision.
