Data scientists need to learn about significant digits

Suppose that you classify people by income or gender. Your boss asks you about the precision of your model. Which answer do you give: whatever your software tells you (e.g., 87.14234%) or a number with a small, fixed number of significant digits (e.g., 87%)?

The latter is the right answer in almost all instances. And the difference matters:

  1. There is a general principle at play when communicating with human beings: give just the relevant information, nothing more. Most human beings are happy with a 1% error margin. There are, of course, exceptions. High-energy physicists might need the mass of a particle down to 6 significant digits. However, if you are doing data science or statistics, it is highly unlikely that your audience will care about more than two significant digits.
  2. Overly precise numbers are often misleading because your actual accuracy is much lower. Yes, you have 10,000 samples and properly classified 5,124 of them, so your mathematical precision is 0.5124. But if you stop there, you show that you have not given much thought to your error margin. First of all, you are probably working from a sample: if someone else redid your work, they might have a different sample. Even if they used exactly the same algorithm you have been using, implementation matters. Small things like how your records are ordered can change results. Moreover, most software is not truly deterministic. Even if you were to run exactly the same software twice on the same data, you probably would not get the same answers: software needs to break ties, and often does so arbitrarily or randomly. Some algorithms involve sampling or other randomization. Cross-validation is often randomized.

I am not advocating that you should go as far as reporting exact error margins for each and every measure you report. It gets cumbersome for both the reader and the author. It is also not the case that you should never use many significant digits. However, if you write a report or a research paper, and you report measures, like precision or timings, and you have not given any thought to significant digits, you are doing it wrong. You must choose the number of significant digits deliberately.
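
If you use C, note that the printf family already understands significant digits: the precision of the %g conversion counts significant digits, not decimals. A minimal illustration, using the 5,124-out-of-10,000 example from above:

    #include <stdio.h>

    int main(void) {
        double precision = 5124.0 / 10000.0; /* 5,124 correct out of 10,000 samples */
        printf("%.7f\n", precision); /* prints 0.5124000: false precision */
        printf("%.2g\n", precision); /* prints 0.51: two significant digits */
        return 0;
    }

Most languages have an equivalent facility, so there is no excuse for pasting raw floating-point output into a report.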

There are objections to my view:

  • “I have been using 6 significant digits for years and nobody ever objected.” That is true. There are entire communities that have never heard of the concept of significant digits. But that is not an excuse.
  • “It sounds more serious to offer more precision; that way, people know that I did not make it up.” It may be true that some people are easily impressed by very precise answers, but serious people will not be so easily fooled, and non-specialists will be turned off by the excessive precision.

Rethinking Hamming’s questions

Richard Hamming is a famous computer scientist. In his talk You and Your Research, Hamming recounts how he asked researchers three questions, which I paraphrase:

  1. What are the important problems of your field?
  2. What important problems are you working on?
  3. If what you are doing is not important, and if you don’t think it is going to lead to something important, why are you working on it?

It is important to qualify what Hamming meant by “important problem”. He meant not only the result of the quest (curing cancer) but also the path that takes you there (e.g., by using a new generation of antibiotics).

My thoughts on these questions:

  • In truth, hardly anyone knows what the important problems are. We typically know which methods are at our disposal, and we know of many easy tasks, but we are effectively blind to objectives beyond what has already been done.

    We know something about our track record in this respect because, for example, researchers frequently make assessments about their fields and where they should go. It is not uncommon that when you read these assessments, years later, they appear hopelessly naive and misguided.

    Hamming believed that great scientists knew the important problems. I doubt it. I’d call it hubris.

  • To make matters worse, any popular answer is almost surely worthless to the individual. If an objective and a method are known to be promising, you can bet good money that people better than you have been working on it for some time already. In some sense, this almost ensures that whatever the great scientists tell us is the future should be viewed with suspicion. It might be their future, but it is much less likely to be your future.
  • Most importantly, I claim that most people do not care whether they work on important problems or not. My experience is that more than half of researchers are not even trying to produce something useful. They are trying to publish, to get jobs and promotions, to secure grants and so forth, but advancing science is a secondary concern. That’s why most people are not troubled by Hamming’s questions. And, of course, Hamming already observed that researchers who do not care for his questions tend not to go very far. He assumed that it is because they do not “know” the important problems, but I believe that a much more reasonable point of view is that they don’t care.

What we need, of course, is to filter out people who do not really care. I think that’s where informal settings help.

Tell smart people to work on what is important to them, but don’t tell them (ever) what exactly they must do. Do not reward any of them visibly for any repeatable action.

Soon enough, only the people who truly care will remain.

Science and Technology links (January 26th, 2019)

  1. We are training many more doctors (PhDs) than we need, judging by the number of new faculty positions. In science, this has been true since at least the 1980s. It is not uncommon for even so-so faculty positions to receive dozens of top-notch applicants with PhDs. Yet it is often believed that for a new PhD, anything short of a professorship is a failure. Logically, PhD students should prepare for a job in industry, and I often try to prepare my own students in this way. Some students take the hint that they should prepare for an industry job as an insult. Jeff Dean is probably one of the ten most famous computer scientists in the world. He works for Google and he has a PhD. He wrote on Twitter:

    When I was finishing grad school, I was applying for both academic faculty positions and industrial research positions. Only got one academic interview (& offer), and not from a top- or even mid-tier place, so I went into industrial research. It’s turned out okay.

    My own unpopular view is that governments put too much money toward funding new PhDs. We too easily confuse training new PhDs with “doing science”. We generate many, many PhDs who are too often poorly prepared for the actual jobs they may find.

  2. Cape Town (South Africa) and Magadan (Russia) are 22 thousand kilometers apart. It would take two years to walk the distance. Though it is a long time, it also puts in perspective the fact that it often took hundreds of years for technological innovations to spread from one corner of the Earth to the other.
  3. We have no serious evidence that artificial sweeteners are harmful.
  4. The Guardian tells us that up to 1,500 private flights will bring world leaders to Davos to take urgent action on climate change. Al Gore, the author of the celebrated “An Inconvenient Truth” documentary on climate change, resides in a 10,000 square-foot home with eight bathrooms and uses orders of magnitude more electricity than regular folks. He also charters private flights. Similarly, in my experience, climate-change researchers (including some close colleagues of mine) routinely fly to remote locations. The counterargument is usually that people buy carbon offsets. Roughly speaking, what these offsets do is, for example, pay for a poor village to receive a solar panel, thus replacing their inefficient CO2-emitting generator. That is what happens when these carbon offset programs are genuine and well managed. However, do these offsets really work? That is, will the poor village with subsidized electricity now use this new wealth to buy a new car? Who gets to measure the environmental effect of these carbon offsets, and what are their incentives? Anderson wrote an essay for Nature a few years ago on this topic:

    Offsetting is worse than doing nothing. It is without scientific legitimacy, is dangerously misleading and almost certainly contributes to a net increase in the absolute rate of global emissions growth.

    Corbera and Martin are similarly skeptical:

    carbon-offsetting activities have (…) not been able to meet their stated goals (…) the carbon-offset market is designed to serve polluters

    A major initiative to offset climate change is REDD: in short, rich countries pay poor countries to preserve their forests. The key idea, just like other carbon offsets, is that the rich country does not need to change its practices. In Why REDD will fail, DeShazo et al. explain why it may not work as you expect:

    the eventual goal of REDD is that developed countries will pay developing countries to protect and enhance their forests in order to offset carbon emissions. This practice allows developed countries to claim that they are reducing emissions when no actual reduction has taken place (…) should a developing country adopt REDD (…) this does not equate to a reduction in deforestation.

    Carbon offsets work in specific ways: they make some people feel better and allow virtue signalling. However, you should not believe that we could all live in mansions with eight bathrooms and fly private jets without increasing our carbon emissions, if only we paid for carbon offsets. You should not believe that researchers and world leaders can keep on organizing conferences at Davos without significant environmental impacts. These are extraordinary claims that require extraordinary evidence. They just happen to be convenient beliefs.

  5. As we age, our bones tend to break more easily. We also accumulate senescent cells: cells that should be dead but somehow survive. Though the technology is not yet ready for actual therapies, we can now remove senescent cells. Farr and Khosla find that removing senescent cells in old mice prevents age-related bone loss and frailty.
  6. A bacterium that causes gum disease might cause Alzheimer’s.
  7. Scandinavians with weird names were more likely to emigrate to the United States. The hypothesis is that people with weird names are more individualistic. Both times my wife was pregnant, I insisted on us picking original first names. I would define myself as an individualist.

Science and Technology links (January 19th, 2019)

  1. Losing even just a bit of weight can be enough to restore fertility in women.
  2. Digital technology does not appear to have a significant negative effect on teenagers, according to an article published by Nature.
  3. According to an article published by Nature, woody vegetation cover over sub-Saharan Africa increased by 8% over the past three decades. This is part of a larger trend: the Earth is getting greener.
  4. The number of children in India has peaked. This suggests that the population of India will peak soon as well.
  5. Scientists are looking for chemical factors that either decrease or increase with age: normalizing these factors could slow or reverse aging. Nature reports on a factor called MANF which decreases with age in mice, human beings and flies. It seems that MANF supplements could have anti-aging benefits.
  6. Netflix is approaching 150 million subscribers. This means that about 2% of the world’s population is made of Netflix subscribers. This almost certainly underestimates the true reach of Netflix.
  7. In a 2004 article in the respected Guardian newspaper, we can read:

    A secret report, suppressed by US defence chiefs and obtained by The Observer, warns that major European cities will be sunk beneath rising seas as Britain is plunged into a ‘Siberian’ climate by 2020. Nuclear conflict, mega-droughts, famine and widespread rioting will erupt across the world. (…) Already (…) the planet is carrying a higher population than it can sustain. By 2020 ‘catastrophic’ shortages of water and energy supply will become increasingly harder to overcome, plunging the planet into war.

  8. Our brains are made of white and gray matter. Obesity correlates with lower gray matter volume in the brain.
  9. Men and women have (statistically) different cognitive abilities, with women being better at verbal tasks whereas men are better at spatial tasks. This survey concludes that homosexuals have abilities closer to those of the other sex:

    The meta-analysis revealed that homosexual men performed like heterosexual women in both male-favouring (e.g., spatial cognition) and female-favouring (e.g., verbal fluency) cognitive tests, while homosexual women performed like heterosexual men only in male-favouring tests. The magnitude of the sexual orientation difference varied across cognitive domains (larger for spatial abilities).

Faster intersections between sorted arrays with shotgun

A common problem within databases and search engines is to compute the intersection between two sorted arrays. Typically, one array is much smaller than the other one.

The conventional strategy is the “galloping intersection”. In effect, you go through the values in the small array and, for each one, do a binary search in the large array. A binary search is a simple but effective algorithm to search through a sorted array. Given a target, you compare it with the midpoint value. If your target is smaller than the midpoint value, you search in the first half of the array; otherwise, you search in the second half. You can recurse through the array in this manner, cutting the search space in half each time. Thus the search time is logarithmic.

If the small array has M elements and the large array has N elements, then the complexity of a galloping search is O(M log N). In fact, you can be more precise: you never need more than M * log N + M comparisons.
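
To make this concrete, here is a minimal C sketch of the conventional approach as described above, with one binary search per value of the small array (the production code in CRoaring is more sophisticated):

    #include <stddef.h>
    #include <stdint.h>

    /* Binary search: returns 1 if target is present in array[0..n-1]. */
    static int contains(const uint32_t *array, size_t n, uint32_t target) {
        size_t low = 0, high = n;
        while (low < high) {
            size_t mid = low + (high - low) / 2;
            if (array[mid] < target) {
                low = mid + 1;  /* target can only be in the second half */
            } else if (array[mid] > target) {
                high = mid;     /* target can only be in the first half */
            } else {
                return 1;       /* found it */
            }
        }
        return 0;
    }

    /* Intersects small[0..m-1] with large[0..n-1], writing matches to out;
       returns the number of matches. Cost: O(m log n) comparisons. */
    size_t intersect(const uint32_t *small, size_t m,
                     const uint32_t *large, size_t n, uint32_t *out) {
        size_t count = 0;
        for (size_t i = 0; i < m; i++) {
            if (contains(large, n, small[i])) {
                out[count++] = small[i];
            }
        }
        return count;
    }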

Can you do better? You might.

Let me describe an improved strategy which I call “shotgun intersection”. It has been in production use for quite some time, through the CRoaring library, a C/C++ implementation of Roaring Bitmaps.

The idea is that a galloping search implies multiple binary searches, in sequence, through basically the same array. Doing them consecutively might not be best. A binary search, when the large array is not in cache, is memory-bound: it waits for the memory subsystem to deliver the data. So you are constantly waiting. What if you tried to do something else while you wait? What about starting right away on the next binary search?

That is how a shotgun search works. You take, say, the first four values from the small array. You load the midpoint value from the large array, then you compare each of your four values against this midpoint value. If a target value is larger, you set the corresponding index so that its next search step will hit the second half of the array. And so forth. In effect, shotgun search does many binary searches at once.
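
Here is a minimal C sketch of the core idea, under the simplifying assumptions that we process exactly four values and only check membership (CRoaring and my Java code handle the general case): the four binary searches advance in lockstep, so their memory loads can be in flight at the same time.

    #include <stddef.h>
    #include <stdint.h>

    /* Runs 4 binary searches over large[0..n-1] in lockstep. found[k] is
       set to 1 if targets[k] is present. Because the four searches are
       independent, the processor can overlap their cache misses. */
    void shotgun4(const uint32_t *large, size_t n,
                  const uint32_t targets[4], int found[4]) {
        size_t low[4] = {0, 0, 0, 0};
        size_t high[4] = {n, n, n, n};
        for (int k = 0; k < 4; k++) found[k] = 0;
        int active = 4;
        while (active > 0) {
            active = 0;
            for (int k = 0; k < 4; k++) {
                if (low[k] >= high[k]) continue; /* this search has converged */
                active++;
                size_t mid = low[k] + (high[k] - low[k]) / 2;
                uint32_t midvalue = large[mid]; /* up to 4 loads in flight */
                if (midvalue < targets[k]) {
                    low[k] = mid + 1;   /* continue in the second half */
                } else if (midvalue > targets[k]) {
                    high[k] = mid;      /* continue in the first half */
                } else {
                    found[k] = 1;       /* match */
                    low[k] = high[k];   /* stop this search */
                }
            }
        }
    }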

I make my Java code available, if you want a full implementation.

Does it help? It does. Sometimes it helps a lot. Let us intersect an array made of 32 integers with an array made of 100 million sorted integers. I use a Cannonlake processor with Java 8.

1-way (conventional galloping): 1.3 microseconds
4-way (shotgun):                0.9 microseconds

Credit: Shotgun intersections are based on an idea and an initial implementation by Nathan Kurz. I’d like to thank Travis Downs for inspiring discussions.

Science and Technology links (January 12th, 2019)

  1. You can buy a 512GB memory card for $140 from Amazon.
  2. We have been told for decades to avoid saturated fats, the kind found in meat, cheese and butter. After an extensive review, Grasgruber et al. conclude:

    Our results do not support the association between cardiovascular diseases and saturated fat, which is still contained in official dietary guidelines. In the absence of any scientific evidence connecting saturated fat with cardiovascular diseases, these findings show that current dietary recommendations regarding cardiovascular diseases should be seriously reconsidered.

  3. Stoet and Geary propose a new measure of gender inequality and conclude…

    We found that low levels of human development are typically associated with disadvantages for girls and women, while medium and high levels of development are typically associated with disadvantages for boys and men. Countries with the highest levels of human development are closest to gender parity, albeit typically with a slight advantage for women.

  4. Budish argues that Bitcoin would be hacked if it became sufficiently important economically.
  5. We can use deep-learning software to reconstruct intelligible speech from the human auditory cortex.
  6. Nelson’s research indicates that there will be more calories available per capita in 2050 than today, in all income quintiles, and even in an extreme climate change scenario. However, both obesity and dietary shortages (e.g., vitamin A) will remain a challenge:

    We must shift our emphasis from food security to nutrition security.

Science and Technology links (January 5th, 2019)

  1. There are nearly 70,000 centenarians in Japan.
  2. China’s population fell by 1.27 million in 2018.
  3. Obesity is associated with increased senescent cell burden and neuropsychiatric disorders, including anxiety and depression. Clearing senescent cells could alleviate these symptoms.
  4. Women who come from families where the mother is the breadwinner are less likely to pursue a science degree at university level.
  5. Stronger men live longer:

    Muscular strength is inversely and independently associated with death from all causes and cancer in men, even after adjusting for cardiorespiratory fitness and other potential confounders.

  6. Inflammation is a crude but effective part of our immune system that serves as a first line of defense. It is necessary but tends to get out of hand as we age. Arai et al. found that your level of inflammation was a potential predictor of how likely you were to reach extreme old age:

    Inflammation is the prime candidate amongst potential determinants of mortality, capability and cognition up to extreme old age

  7. An article published by Nature argues for rejuvenation strategies:

    Ageing is associated with the functional decline of all tissues and a striking increase in many diseases. Although ageing has long been considered a one-way street, strategies to delay and potentially even reverse the ageing process have recently been developed. Here, we review four emerging rejuvenation strategies—systemic factors, metabolic manipulations, senescent cell ablation and cellular reprogramming—and discuss their mechanisms of action, cellular targets, potential trade-offs and application to human ageing.

  8. Childhood leukaemia is a feature of developed societies and may have to do with children being insufficiently exposed to bacteria and infections.
  9. China has landed a rover on the far side of the Moon.
  10. While climate change models predict that as the world warms, biomass will decompose more quickly, which would send a lot more carbon dioxide into the atmosphere, new research finds that it may work the other way around: warmer temperatures may reduce decomposition and carbon emissions.
  11. You can defeat Google’s captchas (meant to prove you are a human being) by taking the audio test and feeding it back to Google’s services to get a transcription.

Memory-level parallelism: Intel Skylake versus Intel Cannonlake

All programmers know about multicore parallelism: your CPU is made of several nearly independent processors (called cores) that can run instructions in parallel. However, our processors are parallel in many different ways. I am interested in a particular form of parallelism called “memory-level parallelism” where the same processor can issue several memory requests. This is an important form of parallelism because current memory subsystems have high latency: it can take dozens of nanoseconds or more between the moment the processor asks for data and the time the data comes back from RAM. The general trend has not been a positive one in this respect: in many cases, the more advanced and expensive the processor, the higher the latency. To compensate for the high latency, we have parallelism: you can ask for many data elements from the memory subsystems at the same time.

In earlier work, we showed that current Intel processors (Skylake microarchitecture) are limited to about ten concurrent memory requests whereas Apple’s A12 processor scales to 40 or more memory requests.

Intel just released a more recent microarchitecture (Cannonlake) and we have been putting it to the test. Is Intel improving?

It seems so. In a benchmark where you randomly access a large array, using a number of separate paths (which I call “lanes”), we find that the Cannonlake processor appears to support twice as many concurrent memory requests as the Skylake processor.
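
The benchmark is essentially a multi-lane pointer chase. Here is a simplified C sketch of what each measurement does (our actual benchmarking code, available below, is more careful): the array stores a random permutation, each lane follows its own chain of indices, and because the lanes are independent, the processor is free to service several cache misses concurrently.

    #include <stddef.h>
    #include <stdint.h>

    /* Random-access benchmark with a configurable number of lanes (<= 16).
       The array is assumed to contain a random permutation of 0..n-1, so
       each access into a large array is likely a cache miss. */
    uint64_t chase(const uint64_t *array, size_t n, size_t lanes, size_t steps) {
        uint64_t index[16];
        for (size_t l = 0; l < lanes; l++) {
            index[l] = (l * n) / lanes; /* spread out the starting points */
        }
        for (size_t s = 0; s < steps; s++) {
            for (size_t l = 0; l < lanes; l++) {
                index[l] = array[index[l]]; /* independent chains: the misses overlap */
            }
        }
        uint64_t sum = 0; /* keep the result so the loads are not optimized away */
        for (size_t l = 0; l < lanes; l++) sum += index[l];
        return sum;
    }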

The Skylake processor has lower latency (70 ns/query) compared to the Cannonlake processor (110 ns/query). Nevertheless, the Cannonlake is eventually able to beat the Skylake processor in bandwidth by a wide margin (12 GB/s vs 9 GB/s).

The story is similar to the Apple A12 experiments.

This suggests that even though future processors may not have lower latency when accessing memory, we might be better able to hide this latency through more parallelism.

Even if you are writing single-threaded code, you ought to think more and more about parallelism.

Our code is available.

Credit: Though all the mistakes are mine, this is joint work with Travis Downs.

Further details: Processors access the memory through pages. By default, many Intel systems have “small” pages (4kB). When doing random accesses in large memory regions, you are likely to access too many pages, so that you incur expensive “page misses” that lead to “page walks”. It is possible to use larger page sizes, even “huge pages”. But memory is allocated in pages, so you may end up with many under-utilized pages if the pages are too large. In practice, under-utilized pages (a form of what is sometimes called “memory fragmentation”) can be detrimental to performance. To get the good results above, I use huge pages. Because there is just one large memory allocation in my tests, memory fragmentation is not a concern. With small pages, the Cannonlake processor loses its edge over Skylake: they are both limited to about 9 concurrent requests. Thankfully, on Linux, programmers can request huge pages with a madvise call when they know it is a good idea.
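
For reference, here is roughly what that looks like in C on Linux; a sketch assuming a glibc system with transparent huge pages enabled (the madvise call is only a hint):

    #define _GNU_SOURCE
    #include <stdlib.h>
    #include <sys/mman.h>

    /* Allocates a buffer aligned on a 2 MB boundary (the usual huge-page
       size on x86-64) and asks the kernel to back it with huge pages. */
    void *allocate_with_huge_pages(size_t bytes) {
        void *buffer = NULL;
        if (posix_memalign(&buffer, 1 << 21, bytes) != 0) return NULL;
        madvise(buffer, bytes, MADV_HUGEPAGE); /* advisory: may be ignored */
        return buffer;
    }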

Important science and technology findings in 2018

  1. The Gompertz-Makeham law statistically predicts the mortality rate of human beings. The key takeaway is that it is an exponential function: every few years, the mortality rate of a human being doubles (a formula is given after this list). It is not unique to human beings: most mammals and many other animals have an exponentially rising mortality rate over time. It does not affect all animals, however. Lobsters do not appear to age like we do. Many trees age in reverse, meaning that their mortality rate diminishes over time. In 2018, a scientist who has studied naked mole rats for decades published an analysis of over 3,000 of them and found that their mortality rate remains constant throughout their life. We do not know why naked mole rats age differently from most other mammals.
  2. Type 1 diabetes is when your pancreas is unable to supply insulin to your cells. Though it can be treated with expensive and inconvenient insulin shots, there is no cure. In 2018, we found that a heart-disease drug can partially reverse type 1 diabetes. This could make some diabetics less dependent on insulin.
  3. Though artificial intelligence has been making a lot of progress in tasks such as image classification and game playing, we are still a long way from being able to intelligently animate human-like body parts like hands. Simple tasks like folding laundry or turning a door knob are still a massive challenge. In 2018, researchers from OpenAI trained a human-like robot hand to manipulate objects like we would.
  4. Older people tend to have a less efficient immune system. In 2018, we learned that we can at least partially reverse age-related immune-system decline using drugs: such treatment substantially reduces infections in older people.
  5. The human genome project set out in 1990 to map the human chromosomes. We thought at the time that human beings would have 100,000 genes, but they have only about 25,000 genes. The map was completed in 2003. Yet applications of the human genome project have been scarce. In 2018, the first gene-silencing drug was approved in the USA.
  6. Electrocardiograms (ECG) have been used since the 19th century to monitor human hearts. The first commercially available ECG machines were produced at the beginning of the 20th century, but they remained specialized devices, mostly used within hospitals. They are also somewhat invasive. In 2018, Apple released a watch with government-approved ECG (heart monitoring) capabilities.
  7. CRISPR/Cas9 is a technique developed in 2012 to edit the genes of living organisms. It is unclear whether it is safe to use it on human beings… Using this technique, a Chinese researcher helped produce the first genetically modified babies, who may be immune to HIV.
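
For reference, the Gompertz-Makeham law mentioned in the first item models the mortality rate (hazard) at age x as

    \lambda(x) = A e^{Bx} + C

where A e^{Bx} is the age-dependent (Gompertz) component and the constant C is the age-independent (Makeham) background risk. In human beings, the exponential term dominates, with a mortality-rate doubling time of roughly eight years. A constant mortality rate, as reported for naked mole rats, corresponds to a negligible Gompertz term.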

Science and Technology links (December 29th, 2018)

  1. Low-dose radiation from A-bombs elongated lifespan and reduced cancer mortality relative to un-irradiated individuals (Sutou, 2018):

    The US National Academy of Sciences (NAS) presented the linear no-threshold hypothesis (LNT) in 1956, which indicates that the lowest doses of ionizing radiation are hazardous in proportion to the dose. This spurious hypothesis was not based on solid data. (…) A-bomb survivors (…) showed longer than average lifespans. Average solid cancer death ratios of(…) A-bomb survivors (…) were lower than the average for Japanese people (…), essentially invalidating the LNT model. Nevertheless, LNT has served as the basis of radiation regulation policy. If it were not for LNT, tremendous human, social, and economic losses would not have occurred in the aftermath of the Fukushima Daiichi nuclear plant accident. For many reasons, LNT must be revised or abolished, with changes based not on policy but on science.

    This is important work. I am surprised at how few people know about hormesis. Many people assume that if you avoid stress, toxins and challenges, you will maximize your health and longevity. That is just flat out wrong.

  2. In climate talks, we use year 1850 as a reference: the implicit goal is to maintain Earth’s global temperature close to the global temperature that existed in year 1850 (say within 1.5 degrees). To my knowledge, nothing makes year 1850 special. In fact, in the absence of both ancient and recent carbon emissions from agriculture and industrialization, current global average temperatures would likely have been about 1.3 degrees lower than they were around 1850 (Vavrus et al., Nature 2018). We were headed down toward another glaciation and, instead, due to human beings, we are headed toward warmer and warmer temperatures. Left unchecked, neither direction is desirable. As argued by Deutsch in the Beginning of Infinity, we have no choice but to accept that there is no such thing as “sustainability” (which assumes an ideal steady state) and that we must learn to engineer our climate.
  3. Human beings do not originate from a single region in Africa: the story of our origin is complicated, involving a mix of different populations and cultures.
  4. McGuff and Little (2009) write:

    there is no additional physiological advantage afforded to one’s body, including endurance or cardio benefits, by training that lasts more than six to nine minutes a week.

  5. The videogame Fortnite led to 3 billion dollars in profit for its creators. It has 200 million players.
  6. In older mammals, the skin loses fat. Zhang et al. restored the ability of skin in older mice to store fat, thus making the skin more resistant to some infections.
  7. Students who make friends and study with them tend to do better. This is painfully obvious to anyone who has given serious thought to how schooling works.