## Science and Technology links (June 23rd, 2017)

Elon Musk, Making Humans a Multi-Planetary Species, New Space. June 2017, 5(2): 46-61.

Reportedly, Ikea is working on augmented reality software that would allow you to see the furniture in your home before buying it.

Current virtual reality headsets provide a good experience, but if you ever tried to read text while where one of the headsets, you may have noticed that it is quite hard. It seems that it is because the image resolution is too low. When using prototypes with very high resolution, you can “read text across the room”.

You would think that a tree that is over 200 years old would have accumulated a lot of (random) genetic mutations. However, it seems that it is not the case: as trees grow old, even very old, they preserve their genes intact. We do not know why or how.

## Top speed for top-k queries

Suppose that I give you a stream of values and you want to maintain the top-k smallest or largest values. You could accumulate all these values, sort them all and keep the smallest k values.

You can do somewhat better by using a binary heap to construct a priority queue. And you can also use the QuickSelect algorithm. I compared naive implementations of both the approach based on a binary heap and on QuickSelect in a previous blog post.

Let us review the code of these techniques. I use JavaScript because it is supported everywhere.

We can sort and slice…

```var a = new Array();
for (var i = 0; i < streamsize; i++) {
a.push(rand(i));
}
a.sort(function(a, b) {
return a - b
});
```

We can dump everything naively into a priority queue (backed by a binary heap)…

```var b = new FastPriorityQueue(reverseddefaultcomparator);
for (var i = 0; i < k; i++) {
}
for (i = k; i < streamsize; i++) {
b.poll();
}
var answer = b.array.slice(0, k).sort(function(a, b) {
return a - b
});
```

Each time we call poll in the priority queue, we remove the largest element. Thus we maintain a list of k smallest elements.

I now want to consider less naive implementations of these ideas.

At all time, we have access to the largest element in the priority queue through the peek function. Long-time reader Kendall Willets pointed out to me that we can make good use of the peek function as follows.

```var b = new FastPriorityQueue(reverseddefaultcomparator);
for (var i = 0; i < k; i++) {
}
for (i = k; i < streamsize; i++) {
var x = rand(i);
if (x < b.peek()) {
b.poll();
}
}
var answer = b.array.slice(0, k).sort(function(a, b) {
return a - b
});
```

This has the potential of reducing substantially the amount of work that the priority queue has to do.

Michel Lemay pointed out to me what the Google Guava library does, which is to use a small buffer (of size k) and to run QuickSelect on this small buffer instead of the whole stream of data.

```var a = new Array();
for (var i = 0; i < 2 * k; i++) {
a.push(rand(i));
}
QuickSelect(a, k, defaultcomparator);
var eaten = 2 * k;
while (eaten < streamsize) {
for (var i = 0; i < k; i++) {
a[k + i] = rand(i + eaten);
}
QuickSelect(a, k, defaultcomparator);
eaten += k;
}
var answer = a.slice(0, k).sort(function(a, b) {
return a - b
});
```

David Eppstein described to me something slightly more complex where we use our knowledge of the pivot in the QuickSelect algorithm to only merge potential candidates.

```var a = new Array();
var i = 0;
for (; i < 2 * k; i++) {
a.push(rand(i));
}
QuickSelect(a, k, defaultcomparator);
var pivotvalue = a[k];
var locationinbuffer = k;
for (; i < streamsize; i += 1) {
var newvalue = rand(i);
if (newvalue < pivotvalue) {
a[locationinbuffer++] = newvalue;
}
if (locationinbuffer == 2 * k) {
QuickSelect(a, k, defaultcomparator);
locationinbuffer = k;
}
}
if (locationinbuffer != k) {
QuickSelect(a, k, defaultcomparator);
}
var answer = a.slice(0, k).sort(function(a, b) {
return a - b
});
```

It is not exactly what David described, but I believe that it captures the essence of his idea.

So how do these techniques fare?
I have implemented a benchmark that you can run yourself, here are my raw results:

```\$ node test.js
Platform: linux 4.4.0-38-generic x64
Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
Node version 4.5.0, v8 version 4.5.103.37

starting dynamic queue/enqueue benchmark
FastPriorityQueue x 3,095 ops/sec ±0.15% (97 runs sampled)
FastPriorityQueue-KWillets x 18,875 ops/sec ±0.59% (93 runs sampled)
sort x 490 ops/sec ±0.37% (94 runs sampled)
QuickSelect x 8,576 ops/sec ±0.31% (97 runs sampled)
QuickSelect-naiveGoogleGuava x 6,003 ops/sec ±0.20% (98 runs sampled)
QuickSelect-naiveDavidEppstein x 7,317 ops/sec ±0.22% (98 runs sampled)
```

So in my particular test, the approach proposed by Kendall Willets, using a priority queue with pruning based on the peek value, is a net winner. Your results might vary, but I think that the Willets approach is attractive. It is simple, it uses little memory and it should provide consistent performance.

Exercise for the reader: I am using a random input in my benchmark. As pointed out by David Eppstein, this means that we can bound the probability that the ith value is in the top-k by min(1, k/i), so that the Willets check only fails O(k log n) times. How well would the different techniques do for nearly sorted data?

Update: As pointed out in the comments, we can further improve the Willets approach by merging the add and poll functions into a single function. The code is similar…

```var b = new FastPriorityQueue(reverseddefaultcomparator);
for (var i = 0; i < k; i++) {
}
for (i = k; i < streamsize; i++) {
var x = rand(i);
if (x < b.peek()) {
b.replaceTop(x);
}
}
var answer = b.array.slice(0, k).sort(function(a, b) {
return a - b
});
```

So what is the performance this time?

```\$ node test.js
Platform: linux 4.4.0-38-generic x64
Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
Node version 4.5.0, v8 version 4.5.103.37

starting dynamic queue/enqueue benchmark
FastPriorityQueue x 3,137 ops/sec ±0.13% (97 runs sampled)
FastPriorityQueue-KWillets x 19,027 ops/sec ±0.37% (94 runs sampled)
FastPriorityQueue-KWillets-replaceTop x 22,200 ops/sec ±0.46% (95 runs sampled)
sort x 479 ops/sec ±0.42% (92 runs sampled)
QuickSelect x 8,589 ops/sec ±0.31% (98 runs sampled)
QuickSelect-naiveGoogleGuava x 5,661 ops/sec ±0.22% (97 runs sampled)
QuickSelect-naiveDavidEppstein x 7,320 ops/sec ±0.25% (98 runs sampled)
```

So we improve the performance once more!

## Science and Technology links (June 16th, 2017)

How much bandwidth do we have? It seems that each of our eyes has 1 megabyte per second. That’s about 100GB per day assuming you close one of your eyes and you never sleep.

I found a fascinating article on “academic urban legends” including the myth that spinach is a good source of iron:

First, it falsified an idea that I had carried with me since I was a child, that spinach is an excellent source of iron. The most striking thing, however, was that a single decimal point, misplaced 80 years ago, had affected not just myself and my now deceased parents, but also a large number of others in what we place on our table.

(Read the whole article, it is not as simple as a decimal point.)
(via John D. Cook)

There is an app for the Apple Watch called cardiogram that is taking the wearers of the watch by storm:

Cardiogram uses the Apple Watch’s built-in heart sensor to give you advice about heart health — for instance, showing when you’re unusually stressed out by checking your heart rate.

I really need to get myself one of these watches!

As we age, we tend to lose muscle mass, to get weaker. The process is not entirely well understood, but it seems that it could be mediated at least in part by a loss of activity… As we age, we exercise less and this leads to detrimental changes in our metabolism. There is some hard evidence for this theory:

(…) we find that sedentary but not active humans display an age-related decline in the mitochondrial protein that is associated with muscle loss.

(via P. D. Mangan)

China just switched on the world’s largest floating solar power plant. It is still tiny (40MW) in comparison with many conventional power plants.

You can find melatonin at your local drug store or you can order it from Amazon. Many people take it at night to help them sleep. It seems that it could prevent neurological diseases. I sleep very well, so I am not tempted to take melatonin, but if I ever have trouble sleeping, I might quickly change my mind.

WordPress (the company) is shutting down its San Francisco office because employees never show up, preferring to work from home. I tend to work mostly from home myself, and I might never leave my home, except that I have graduate students and postdoctoral fellows to meet, and I am also department chair, so I have to show up for meetings from time to time. This being said, for programming or writing, I am far more productive at home. If I were a full-time programmer, I would certainly expect my employer to let me work from home. Somewhat ironically, it is part of my job to try to entice graduate students to show up to our lab.

When your skin gets red and your body aches, you are probably suffering from inflammation. It is a crude response from our immune system that helps us fight viruses and bacterias. We can suppress inflammation using something as simple as aspirin. However, inflammation is also involved in tissue repair. A recent paper shows that a protein that drives inflammation is critical for muscle regeneration.

The number of scientists in the biomedical field is growing exponentially (…)

Contrary to some assertions, exponential advances in science and technology are not due to some fundamental law of nature but rather the exponential increase in the number of scientists doing science. (…) Biological research progress doesn’t generally scale with computer chip density although some specific areas might scale with available computational power. In other areas, you still need to spend a large number of hours in the lab actually doing the work.

Science, even medical science, is not homogeneous. Many areas of science advance at a breakneck speed, even when relatively few people are involved. Back in 2012, doing gene editing was science-fiction for most people, but today high school students can do it, affordably. Back a few years ago, advanced computing applications such as face recognition required a Ph.D. in the field and advanced mathematics… whereas today you can get it done in a few hours using free software libraries. Meanwhile, other research areas, like cancer research, remain painfully slow and expensive. What we hope is that the areas that make rapid progress can rescue the slow moving fields. This has happened time and time again in our history. That is, it would have been useless for the Roman Empire to try to build the Internet. No matter how much they would have invested, the progress would have been non-existent. However, the Roman Empire could have benefited greatly from an industrial revolution.

Senescent cells are “old” cells that won’t die. We now have the technology to kill them, albeit that’s not yet perfected to the point where your doctor can get rid of your senescent cells. What do senescent cells do? A lot of bad. A recent article shows that senescent cells drive the “fatty liver” disease.

## QuickSelect versus binary heap for top-k queries

In a previous post, I considered the problem of finding the k smallest (or k largest) elements from a stream of values.

The naive approach is to collect all the values in an array, sort the array, and return the first k values. When an index is not available, that is how some relational databases reportedly solve SELECT/ORDER BY/LIMIT queries. It is also how many programmers will solve this problem.

Using a JavaScript benchmark, I showed that an approach based on a binary heap could be several times faster. That is a net win for computer science since an approach based on a binary heap has complexity O( n log k ) where n is the number of elements. Moreover the binary heap is attractive because it has an optimal storage requirement of O( k ).

Lots of people objected that we could do even better using an algorithm called QuickSelect (e.g., Rasmus Pagh). Whereas QuickSelect has far higher memory requirements than a binary heap (O( n )) and a worst complexity of O( n2 ), it has a superior average case complexity of O( n ).

Finally, Anno Langen sent me a code sample in JavaScript, so I no longer any excuse not to test QuickSelect. I put together a JavaScript benchmark that you can run and review.

In this benchmark, we receive 1408 random integers and we must collect the smallest 128.

The approach using a binary heap can run about 37,000 times a second, whereas QuickSelect runs 45,000 times per second, or about 20% faster. They are both about an order of magnitude faster than the naive sort/slice approach.

For all practical purposes, 20% faster is negligible in this case. I have actually hit a sweet spot where QuickSelect and the binary heap are comparable.

What are other cases of interest?

• If you only seek the top 5 elements out of an array, then the binary heap is likely to beat QuickSelect, irrespective of how many elements I have. The binary heap will fit in one or two cache lines, and the log k factor will be small.
• If I keep k at a sizeable 128, but I increase substantially the size of the array, then QuickSelect will start to win big.
• However, if I keep increasing the array size, the benefits of QuickSelect might start to fade. Indeed, QuickSelect will start to operate in RAM whereas the binary heap will remain in CPU cache. QuickSelect will become increasingly limited by potentially expensive cache misses.
• QuickSelect still has the worst case quadratic-time scenario that could be triggered by really bad input data. The binary heap is more likely to provide consistent speed.

What else could we do?

• Martin Cohen suggested a simple insertion sort on the basis that, for large streams, you are unlikely to encounter many candidates for the top-k position. This would probably work well for very long streams, but it could degenerate badly in some cases.
• Michel Lemay refered to a fancier approach used by Google Guava: maintain a buffer of 2k elements initially empty. Fill it up. Once it is full, use QuickSelect on it and discard half of it. Rinse and repeat. This seems like an appealing idea and Michel testifies that it provides very good practical performance.

So I am probably going to have to return to this problem.

Follow-up blog post: Top speed for top-k queries.

## Science and Technology links (June 9th, 2017)

This week, Apple told us about two new interesting pieces of technology.

• On the one hand, we have ARKit. It makes it easy for developers to build and deploy “augmented reality” applications to the iPhone in your pocket. What is augmented reality? In the visual space, it is a way to add “virtual” objects on top of what you’d normally see. So, for example, a furniture company could make it easy for you to “see” the new furniture in your living room as if you owned it. That’s nice except that looking at your mobile phone is an awkward way to do augmented reality. We’d prefer to have something like “smart glasses”. It is surely coming in the near future.
• On the other hand, Apple released Core ML which is a way for app developers to make use of fancy machine learning techniques without too much effort. Suppose you want to build an app that can look at people’s skin to help diagnose a disease. You could train a model using your PC, and then drag-and-drop this model into your app, and then deploy it to all new iPhones.

Suppose that we had smart watches that measure your pace, your sleep, your heart rate and, one might dream, your glucose level… combine this with Core ML and you might have an app that assesses your health and makes personalized recommendations.

It is interesting to look at the contrast between Apple and the rest of the industry. Google and Microsoft want us to use their services, Apple wants us to buy devices on which we can run apps. This leads to a different focus.

Speaking of Apple, its new iPad Pro uses a processor designed by Apple. It is faster than the Intel processor Apple uses in its cheapest laptops. Given that even the cheapest laptop from Apple is good enough for most people since its cheap laptops are about “mid-range” compared to other vendors, we can say that an iPad Pro could, in theory, replace adequately a laptop for most people, if we only cared about performance. To be fair to Intel, the Apple processor used in its iPad Pro has extra cores and it runs at a much higher clock frequency than the mobile Intel processor used in the laptops (though the laptop has more RAM). Still, we are at the point where top-of-the-line mobile phones and tablets are more powerful than mid-range laptops. It will be interesting to see how the trend goes in coming years.

Brent Smith and Greg Linden review two decades of recommender systems at Amazon.com. Do read it no matter what your scientific level is, it is worth your time. The mathematics can be safely skipped if you want.

A professor was fascinated by the fact that cats that had their spinal cord severed at birth could still walk. After decades, he managed to get some positive results in human beings that had their spinal cord damaged. That is, paralyzed human beings can regain some control of their legs. As a researcher, he was the subject of mockery.

If you combine machine learning and radiology, you can make predictions about one’s health, including longevity.

Digital therapeutics is the idea that you can use software as medicine. I don’t think it is as crazy as it sounds. That you can cure a disease with a pill is not so intuitive, it is only because we are used to it that we take it for granted. I do think we may see software in the future being a strong component of medical therapies. It is much cheaper to deploy software than to distribute and monitor pills.

Sugar is often thought to be bad for you, but a special type of sugar called trehalose seems to protect against cardiovascular diseases. You can find trehalose on Amazon. It is widely used in Japan, so it ought to be safe, at least in modest quantities.

## Quickly returning the top-k elements: computer science vs. the real world

A standard problem in computer science is to quickly return the smallest or largest K elements of a list. If the list is made of N elements, then we can solve this problem in time N log (K) using a binary heap.

How would the average programmer solve this problem? My guess is that unless K is 1, then most programmers would simply sort the list and pick the first or last K elements.

That’s not such a bad deal. The difference between log (K) and log (N) might not be so large. The logarithm function grows slowly.

It could even be that a well optimized sort function could beat a poorly optimized binary heap.

To test it out in the real world, I wrote a benchmark using a JavaScript library. In this benchmark, I generate 1408 random integers and I seek to select the smallest 128 integers.

You might ask: why JavaScript? Because JavaScript is everywhere and we should care about its performance.

The code with a priority queue backed by a binary heap looks like this… we eat the first 128 integers, and then we do a push/poll danse to maintain always the smallest 128 integers. (The blocks parameter is set at 10 in my tests.)

```var b = new FastPriorityQueue(defaultcomparator);
for (var i = 0 ; i < 128  ; i++) {
}
for (i = 128 ; i < 128 * blocks  ; i++) {
b.poll();
}
```

The sorting routine is somewhat simpler. Just construct a big array, sort it and then extract the first few values:

```var a = new Array();
for (var i = 0 ; i < 128 * (blocks + 1)  ; i++) {
a.push(rand(i));
}
a.sort()
return a.slice(0,128);
```

How does it fare?

The approach using a binary heap can run about 37,000 times a second, whereas the sorting approach is limited to about 4,000 times a second. So a factor-of-ten difference.

In this instance, computer science wins out: using a binary heap pays handsomely.

Another way programmers might implement a top-K is by a SQL ORDER BY/LIMIT query. My friend, professor Antonio Badia, checked how PostgreSQL solves these queries and he believes that they result in a full sort unless there is a tree index on the relevant attribute. Can other databases, such as Oracle or SQL Server do better than PostgreSQL? It is possible, but PostgreSQL is hardly a naive database engine.

Interestingly, it might be quite challenging for a database user to somehow implement a binary heap solution on top of a relational database. Database engines rarely give you direct access to the underlying data structures.

So all your fancy computer science knowledge might be of limited use if your data is managed by an engine you cannot hack.

## Science and Technology links (June 2nd, 2017)

Methylene blue rejuvenates old skin. You can buy methylene blue on Amazon and spray it on your face. It is reportedly safe, even in high concentration. Of course, you are likely to turn blue.

A school in the UK is offering a degree in eSport (video games).

According to Bloom et al., ideas — and in particular the exponential growth they imply — are getting harder and harder to find.

The US pulled out of the Paris Agreement on climate change. What does that mean in concrete terms?

(…) since Paris was an entirely voluntary deal, Trump’s climate policies weren’t constrained in any meaningful way by our participation in it. Conversely, his climate policies won’t be substantively different now that we’re pulling out of Paris.

Samsung plans to make processors with 4nm features in 2020. As far as I know, this is half as small as any current processor.

The cortex is the part of the brain of mammals that is responsible for most higher-level brain functions. We sometimes lose some of the neurons in the cortex which, if this damage accumulate, can lead to serious cognitive problems. French researchers have shown that we can replenish these neurons using stem cells.

Japan is facing a decade-long period of stagnation, and it is running out of workers. So it is reportedly moving to legalize drone deliveries by 2020.

Some company claims that the blood plasma from young people can rejuvenate older people. This seems highly speculative and I would not recommend you try it. I would bet that the results are wrong. I am sure we can rejuvenate old human beings, but probably not through plasma transfusions.

## Unsigned vs. signed integer arithmetic

Given any non-negative integers x and d, we can write uniquely x = q d + r where q (the quotient) and r (the remainder) are non-negative and r is less than d. We write r = x mod d.

Most modern processors represent signed and unsigned integers in the following manner:

• Unsigned integers as simply 32-bit or 64-bit binary numbers. E.g., we write the number 15 as 0b0...01111. Programmers write bits from right to left, starting with the least significant bits: the right-most bit has value 1, the second right-most bit has value 2, and so forth.

We have to worry about overflows: e.g., when adding two 32-bit integers that won’t fit in 32 bits. What happens if there is an overflow? How your programming language handles it is one thing, but processors often make it easy to compute the operations “modulo the word size”. This means that x + y is the same as (x + y) mod 2w where w is the word size in bits (typically 32 or 64).

Dividing by two can be achieved by “right-shifting” the bits by 1. Multiplying by two is equivalent to “left-shifting” the bits by 1.

• Signed integers are implemented at the processor level in a manner similar to unsigned integers, using something called Two’s complement. We can think about Two’s complement as a way of mapping signed values to unsigned (binary) values. Positive values (in [0,2w-1)) are mapped to themselves (m(x)= x) whereas negative values (in [-2w-1, -1]) are mapped to the complement (m(x)= 2wx).

You can distinguish negative integers from positive integers since their left-most bit is set to true. That’s convenient.

Programmers prefer to say that we can negate a number by flipping all its bits (sometimes called “one’s complement”) and adding the value one. You can verify that if you take a number, flip all its bits and add the original number, you get a binary value with all the bits set to 1, or 2w-1. So the result follows by inspection but I prefer my formulation (m(x)= 2wx) as it makes the mathematics clearer.

A fun problem with Two’s complement is that you have more negative numbers than you have positive (non-zero) numbers. This means that taking the absolute number of a signed integer is a bit tricky and may lead to an overflow. Thus we cannot do something like a/b = sign(b) * a / abs(b) or a * b = sign(a) * sign(b) * abs(a) * abs(b) since there is no safe absolute-value function.

The nice thing with Two’s complement is that because the processor implements unsigned integer arithmetic modulo a power of two (e.g., (x + y) mod 2w ), then you “almost” get the signed arithmetic for free. Indeed, suppose that I want to add two values x and y but that one of them (y) is negative. Then I first apply my map to transform them into unsigned values (y becomes 2w-y), and I end up with x + 2w-y mod 2w which is just x - y mod 2w as one would expect.

So we get that multiplications (including right shifts), additions, and subtractions are nearly identical operations whether you have signed or unsigned integers.

This means that a language like Java can avoid unsigned integers for the most part. Signed and unsigned integers more or less work the same.

When you program in assembly, it is not quite true. For example, on x86 processors, the MUL (unsigned multiplication) and IMUL (signed multiplication) instructions are quite different. However, as long as you only care for the multiplication modulo the word size, you can probably use them interchangeably: a critical difference is that the unsigned multiplication actually compute the full result of the multiplication, using another word-sized register to store the more significant bits. Other processors (e.g., ARM) work slightly differently, of course, but the need to distinguish between signed and unsigned integers still arise from time to time.

The division, remainder and right shifts are something else entirely.

Let us divide by two. The value -2 is mapped to 0b111...10. With unsigned arithmetic, we would simply shift all bits right by one, to get 0b0111...1, but that’s no longer a negative value in Two’s complement notation! So when we shift right signed integers, we typically replicate the left-most bit, so that we go from 0b111...10 to 0b111...11.

This works well as long as we only have negative even numbers, but let us consider a negative odd number like -1. Then doing the signed right shift trick, we go from 0b111...11 to 0b111...11, that is (-1>>1) is -1. It is not entirely unreasonable to define -1/2 as -1. That’s called “rounding toward -infinity”. However, in practice people seem to prefer to round toward zero so that -1/2 is 0.

The net result is that whereas division by two of unsigned integers is implemented as a single shift, the division by two of signed integers ends up taking up a handful of instructions.

## Science and Technology links (May 26th, 2017)

Linux, the operating system driving cloud computing and the web, was developed using an open source model. For a time, Linux was seen as a direct competitor to Microsoft, but things have changed and Microsoft is happy to see Linux as just a piece of technology. Because of how large and complicated the software got, the Linux project manager, Linus Torvalds, ended up writing its own source control tool, Git. It quickly became a standard. Today, Windows is built on Git. This shows the power of open source. “Open source” is a concept just as powerful and important for our civilization as “the scientific method”. Though both science and open source can wipe out business models, they are also engines of innovation making us all richer. Microsoft does not use Linux and Git because it gave up on having viable competitors, but rather because it understands that fighting open source is about as useful as fighting the scientific method. (As an aside, modern implementations of Git are accelerated with compressed bitmaps called EWAH, something I personally worked on.)

Organic food is better for you, right? Galgano et al. (2016) find no evidence of such benefits:

The organic food market is growing in response to an ever increasing demand for organic products. They are often
considered more nutritious, healthier, and free from pesticides than conventional foods. However, the results of scientific studies do not show that organic products are more nutritious and safer than conventional foods.

Ok but organic food is better for the environment, right? Maybe not because organic farming requires more land and more animals:

Furthermore, higher on-farm acidification potential and global warming potential per kilogram organic milk implies that higher ammonia, methane, and nitrous oxide emissions occur on farm per kilogram organic milk than for conventional milk. Total acidification potential and global warming potential per kilogram milk did not differ between the selected conventional and organic farms.

A company called Warby Parker has a mobile app you can use to sidestep entirely optometrists and opticians in some cases. The main point seem to be that a lot of what opticians do can be computerized easily and that it is not hard to check your prescription.

Omega-3 polyunsaturated fatty acids reduce mitochondrial oxidant emissions, increase postabsorptive muscle protein synthesis, and enhance anabolic responses to exercise in older adults.

You have heard that teenage acne was unrelated to diet, right? Not so fast:

We found a positive association between intake of skim milk and acne. This finding suggests that skim milk contains hormonal constituents, or factors that influence endogenous hormones, in sufficient quantities to have biological effects in consumers.

The epidemic incidence of adolescent acne in Western milk-consuming societies can be explained by the increased insulin- and IGF-1-stimulation of sebaceous glands mediated by milk consumption.

We found a positive association with acne for intake of total milk and skim milk. We hypothesize that the association with milk may be because of the presence of hormones and bioactive molecules in milk.

Keytruda is the first cancer drug that targets a genetic dysfunction rather than specific cancer type. This type of drug might open the door to cheap and effective genetic cancer therapies. Note that Keytruda is actually approved for use in the US, so this is not purely speculative.

Volvo makes self-driving garbage-collecting trucks. A report from Goldman Sachs suggests that self-driving cars could destroy 25,000 jobs per month in the US. That sounds like a lot, but I don’t think it is anything dramatic, if true. See also the Sector Disruption Report (May 2017) by James Arbib and Tony Seba that sees the effect of self-driving car as enormous:

This will keep an additional \$1 trillion per year in Americans’ pockets by 2030, potentially generating the largest infusion of consumer spending in history

Arthritis is a terrible disease where some people end up living in near constant pain. Yet it could be prevented with good diet and exercise, according to an article in Nature.

Jeff Bezos, Amazon’s CEO, wants us to build permanent settlements on the Moon. This is the same man who wants to deliver us goods using drones, and help cure aging through the clearance of senescent cells.

Scientists have linked 52 genes to intelligence. Let me caution you: no we cannot build super smart babies by manipulating genes. Not yet at least.

## Counting exactly the number of distinct elements: sorted arrays vs. hash sets?

Suppose that you have ever larger sets of 64-bit integers, and you want to quickly find out how many distinct integers there are. So given {10, 12, 10, 16}, you want an algorithm to output 3, as there are three distinct integers in the set. I choose 64-bit integers, but strings would do fine as well.

There are sensible algorithms to estimate this number, but you want an exact count.

Though there are many good ways to solve this problem, most programmers would first attempt to use one of these two techniques:

• Create a hash set. Throw all the values in the hash set (implemented with a hash table). Then check how many values are found in the hash set in the end. In C++, you might implement it as such:
```size_t distinct_count_hash(const uint64_t * values, size_t howmany) {
std::unordered_set<uint64_t> hash(values, values + howmany);
return hash.size();
}
```
• Put all the values in an array, sort the array then run through it, deduplicating the values. In C++, you might implement it as follows:
```size_t distinct_count_sort(const uint64_t * values, size_t howmany) {
std::vector<uint64_t> array(values, values + howmany);
std::sort(array.begin(), array.end());
return std::unique(array.begin(), array.end()) - array.begin();
}
```

Which is best? Sorting has complexity O(n log n) whereas insertion in a hash set has expected constant time O(1). That would seem to predict that the hash set approach would always be best.

However, there are many hidden assumptions behind textbook naive big-O analysis, as is typical. So we should be careful.

Simple engineering considerations do ensure that as long as the number of distinct elements is small (say no larger than some fixed constant), then the hash set approach has to be best. Indeed, sorting and copying a large array with lots of repeated elements is clearly wasteful. There is no need for fancy mathematics to understand that scenario.

But that’s not the difficult problem that will give you engineering nightmares. The nasty problem is the one where the number of distinct elements can grow large. In that case, both the array and the hash set can become large.

Which is best in that difficult case? I wrote a small C++ benchmark which you can run yourself.

 N hash set (cycles/value) array sort (cycles/value) 1,000 220 161 10,000 260 163 100,000 340 200 1,000,000 820 245 10,000,000 1,100 282

So when there are many distinct values to be counted, sorting an array is an efficient approach whereas the hash table should be avoided.

How can we understand this problem? One issue is that as the hash table becomes large, it comes to reside in RAM (as it no longer fits in CPU cache). Because of how hash sets work, each operation risks incurring an expensive cache miss. A single retrieval from RAM can take dozens of CPU cycles. Meanwhile, sorting and scanning an array can be done while avoiding most cache misses. It may involve many more operations, but avoiding cache misses can be worth it.

What if I kept cranking up the data size (N)? Would the hash set ever catch up? It might not.

The problem is the underlying assumption that you can access all memory using a constant time. That’s not even close to true.