Science and Technology links (June 29th 2019)

  1. Consuming pornography has no effect on sexual desire for one’s partner.
  2. Exercise may increase the likelihood that a woman will become pregnant.
  3. Consuming more protein makes it less likely that you will become frail as you age.
  4. There is no correlation between how often university professors fly and their academic success. Furthermore, academics who study environmental issues like global warming fly just as much as the other professors.
  5. In mammals, the heart does not regenerate. Dead heart cells are dead and no replacement is produced. New research shows that this can be changed, in mice, by pumping them up with a small molecule called miR-294. The treated mice saw good heart regeneration.
  6. In human beings, an important part of the immune system, the thymus, simply disappears over time, leaving older people with a weaker immune system. In mice, this can be reversed using transgenic exosomes.

Bounding the cost of the intersection between a small array and a large array

Consider the scenario where you are given a small sorted array of integers (e.g., [1,10,100]) and a large sorted array ([1,2,13,51,…]). You want to compute the intersection between these two arrays.

A simple approach would be to take each value in the small array and do a binary search in the large array. Let the size of the large array be n, then a binary search takes up to log n comparisons and data accesses. Of course, log n  may not be an integer, so I need to round it up to the smallest integer at least as large as log n: ceil( log n ). If the small array contains k elements, then I need a total of k ceil( log n ) comparisons. Each comparison involves one data access in the large array.
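
To make this baseline concrete, here is a minimal sketch of the k-binary-searches approach (the function name and the use of 32-bit values are mine, for illustration):

#include <algorithm>
#include <cstdint>
#include <vector>

// Intersect a small sorted array with a large sorted array using one binary
// search per key: about k * ceil(log n) comparisons and data accesses.
std::vector<uint32_t> intersect_by_binary_search(const std::vector<uint32_t>& small_array,
                                                 const std::vector<uint32_t>& large_array) {
  std::vector<uint32_t> answer;
  for (uint32_t key : small_array) {
    if (std::binary_search(large_array.begin(), large_array.end(), key)) {
      answer.push_back(key);
    }
  }
  return answer;
}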

Is k log n comparisons the best bound? Clearly not: if k is almost as big as n, then k log n can far exceed k+n. Yet a standard merge algorithm (as in the Merge Sort algorithm) requires as few as k+n comparisons. So for the bound to be reasonable, we must require that k is much smaller than n.
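
For reference, here is a sketch of such a merge-style intersection (illustrative code, not tuned):

#include <cstdint>
#include <vector>

// Walk both sorted arrays once: at most about k + n comparisons in total.
std::vector<uint32_t> intersect_by_merge(const std::vector<uint32_t>& small_array,
                                         const std::vector<uint32_t>& large_array) {
  std::vector<uint32_t> answer;
  size_t i = 0, j = 0;
  while (i < small_array.size() && j < large_array.size()) {
    if (small_array[i] < large_array[j]) {
      i++;
    } else if (large_array[j] < small_array[i]) {
      j++;
    } else {
      answer.push_back(small_array[i]);
      i++;
      j++;
    }
  }
  return answer;
}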

Even so, intuitively, it is wasteful to do k distinct binary searches. If the values in the small array are in sorted order, then you have new knowledge after doing the first binary search that you can reuse to help in the next binary search.
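
One simple way to reuse that knowledge is to restrict each new search to the part of the large array that follows the previous match, as in this sketch (not necessarily optimal, just an illustration of the idea):

#include <algorithm>
#include <cstdint>
#include <vector>

// Because the small array is sorted, each search can resume where the
// previous one stopped, so the searched range keeps shrinking.
std::vector<uint32_t> intersect_with_shrinking_range(const std::vector<uint32_t>& small_array,
                                                     const std::vector<uint32_t>& large_array) {
  std::vector<uint32_t> answer;
  auto from = large_array.begin();
  for (uint32_t key : small_array) {
    from = std::lower_bound(from, large_array.end(), key);
    if (from == large_array.end()) { break; }
    if (*from == key) { answer.push_back(key); }
  }
  return answer;
}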

How few comparisons can you do? We can use an information-theoretical argument. The basic idea behind such an argument is that for an algorithm based on a decision tree to generate M distinct possible outputs, it must involve at least ceil( log M ) decisions in the worst case, a count corresponding to the height of the corresponding balanced binary tree.

To be clear, this means that if an adversary gets to pick the input data, and you have a divine algorithm, then I can give you a lower bound on the number of decisions that your algorithm takes. It is entirely possible that, in practice, you will do better given some data; it is also entirely possible that you could do much worse, evidently. Nevertheless, let us continue.

We can recast the intersection problem as the problem of locating, for each key in the small array, the largest value in the large array that is smaller than or equal to the key. How many possible outputs (of this algorithm) are there? I have to choose k values from a set of n elements, but the order does not matter: that’s roughly n^k/k! possibilities. This means that if my algorithm relies on a (balanced) decision tree, it must involve ceil( k log n – log k! ) decisions.

Let us put some numbers in there. Suppose that n is 4 million and k is 1024. Then ceil( log n ) is 22 so each key in the small array will involve 22 decisions. On a per key basis, our tighter model gives something like log n – (log k!)/k. The log of the factorial is almost equal to k log k by Stirling’s approximation, so we can approximate the result as log n – log k.

Let us plug some numbers into these formulas to track the lower bound on the number of decisions per key in the small array:

k      ceil( log n )      log n – (log k!)/k
8      22                 20
16     22                 19
64     22                 17
1024   22                 13

This suggests that you may be able to do quite a bit better than k distinct binary searches.
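
For reference, the numbers in the table above can be reproduced with a short program; here is a sketch that uses lgamma to compute log k! and rounds the per-key value down to the nearest integer:

#include <cmath>
#include <cstdio>

int main() {
  const double n = 4000000; // 4 million, as in the example above
  const int ks[] = {8, 16, 64, 1024};
  for (int k : ks) {
    double logn = std::ceil(std::log2(n));                   // 22
    double logkfact = std::lgamma(k + 1.0) / std::log(2.0);  // log2(k!)
    std::printf("k = %4d   ceil(log n) = %2.0f   log n - (log k!)/k = %2.0f\n",
                k, logn, std::floor(logn - logkfact / k));
  }
  return 0;
}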

Often, however, it is not so much the decisions that one cares about as the number of accesses. Can this value of 13 in the table, when k is 1024, be taken as a lower bound on the number of accesses? Not as such, because you can access one value from the large array and then reuse it for many comparisons.

To complicate matters, our systems have cache and cache lines. The cache means that repeated accesses to the same value can be much cheaper than accesses to distinct values and may even be free (if the value remains in a register). And our cache does not operate on a per-value basis, but rather data is loaded in chunks of, say, 64 bytes (a cache line).

All in all, my derivations so far may not be highly useful in practical applications, except maybe to argue that we ought to be able to beat the multiple binary search approach.

Can we sketch such an approach? Suppose that we start by sampling B different values in the big array, at intervals of size n / ( B + 1 ). Then we issue the same k binary searches, but each targeting one of the intervals of size n / ( B + 1 ). This incurs B + k log( n / ( B + 1 ) ) accesses in the large array. Let us pick B so as to minimize the number of accesses: we could pick B to be k / ln(2) – 1, which gives k / ln(2) – 1 + k log( n ln(2) / k ) accesses in the large array. For simplicity, I am ignoring the fact that B must be an integer. I am also ignoring the cost of matching each value in the small array to the right subinterval. However, if I am only interested in the number of accesses in the large array, it is an irrelevant cost. Even so, under the assumption that k is small, it is a small cost (O(k)).
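
Here is a sketch of this idea; the code is illustrative and keeps the B samples in a separate small buffer so that locating the right subinterval does not touch the large array:

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Sample B values from the large array (B accesses), then run each of the k
// binary searches inside a subinterval of size about n / (B + 1), for about
// log(n / (B + 1)) accesses per key.
std::vector<uint32_t> intersect_with_sampling(const std::vector<uint32_t>& small_array,
                                              const std::vector<uint32_t>& large_array) {
  std::vector<uint32_t> answer;
  const size_t n = large_array.size();
  const size_t k = small_array.size();
  if (n == 0 || k == 0) { return answer; }
  // B close to k / ln(2) - 1, ignoring that B must be an integer
  const size_t B = (size_t)std::max(1.0, std::round(k / std::log(2.0) - 1.0));
  const size_t step = std::max<size_t>(1, n / (B + 1));
  std::vector<uint32_t> samples;  // the sampled values
  std::vector<size_t> offsets;    // their positions in the large array
  for (size_t pos = step; pos < n; pos += step) {
    samples.push_back(large_array[pos]);
    offsets.push_back(pos);
  }
  for (uint32_t key : small_array) {
    // locate the right subinterval using only the cached samples
    size_t idx = std::upper_bound(samples.begin(), samples.end(), key) - samples.begin();
    size_t lo = (idx == 0) ? 0 : offsets[idx - 1];
    size_t hi = (idx == offsets.size()) ? n : offsets[idx];
    // binary search restricted to the subinterval [lo, hi)
    if (std::binary_search(large_array.begin() + lo, large_array.begin() + hi, key)) {
      answer.push_back(key);
    }
  }
  return answer;
}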

Is k / ln (2) + k log (n ln (2) / k) accesses in the large array much better than the naive approach doing k log n accesses? It is unclear to me how much better it might be on real systems once caching effects are accounted for.

Credit:  Travis Downs for the initial questions and many ideas. The mistakes are mine.

Further reading: SIMD Compression and the Intersection of Sorted Integers, Software: Practice and Experience 46 (6), 2016

Science and Technology links (June 22nd 2019)

    1. An implanted chip can improve long-term memory. It is currently impractical, but a start-up company will try to bring this technology to market.
    2. Never before in the history of drug research have so many distinct new approaches been in development simultaneously.
    3. Once more, we have evidence that exercise can rejuvenate your muscles.
    4. After controlling for sun exposure and other factors, sunscreen neither causes nor prevents skin cancers. Sun exposure itself is associated with fewer cancers and better cardiovascular health. Of course, increased sun exposure is also associated with an increased rate of skin cancers. Yet the association is relatively modest, e.g., living all day under the sun probably does not double your risk of skin cancers (while it has other strong benefits). Note that there are other benefits to sunscreen than cancer prevention.
    5. Fast food is most popular among the upper-middle income brackets.

How fast is getline in C++?

A standard way to read a text file in C++ is to call the getline function. To iterate over all lines in a file and sum up their length, you might do as follows:

while(getline(is, line)) {
    x += line.size();
}

How fast is this?

On a Skylake processor with a recent GNU GCC compiler without any disk access, it runs at about 2 GB/s. That’s slower than the maximal throughput of a good flash drive. These results suggest that reading a text file in C++ could be CPU bound in the sense that buying an even faster disk would not speed up your single-threaded throughput.

nanoseconds per byte 0.46 ns
speed 2.0 GB/s

My code is available.
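
If you just want a rough idea of the speed on your own machine, a minimal timing sketch might look as follows (this is not my benchmark; the file name is a placeholder):

#include <chrono>
#include <cstdio>
#include <fstream>
#include <string>

int main() {
  std::ifstream is("largefile.txt"); // placeholder: any large text file
  std::string line;
  size_t volume = 0;
  auto start = std::chrono::steady_clock::now();
  while (std::getline(is, line)) {
    volume += line.size() + 1; // count the discarded newline character as well
  }
  auto end = std::chrono::steady_clock::now();
  double seconds = std::chrono::duration<double>(end - start).count();
  std::printf("%zu bytes read at %.2f GB/s\n", volume, volume / seconds / 1e9);
  return 0;
}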

If you write code that processes the strings generated by the getline function calls, in the worst case, the total time will be the sum of the time required by the getline function plus the time required by your code. In other words, you are unlikely to achieve speeds near 2 GB/s.

In comparison, a software library like simdjson can parse and validate JSON inputs, doing everything from Unicode validation to number parsing, at a speed of over 2 GB/s.

I have not tried to do so, but I expect that you could locate lines and iterate over them at much greater speeds than 2 GB/s.

What should we do with “legacy” Java 8 applications?

Java is a mature programming language. It was improved over many successive versions. Mostly, new Java versions did not break your code. Thus Java was a great, reliable platform.

For some reason, the Oracle engineers decided to break things after Java 8. You cannot “just” upgrade from Java 8 to the following versions. You have to update your systems, sometimes in a significant way.

For management purposes, my employer uses an ugly Java application, launched by browsers via something called Java Web Start. I am sure that my employer’s application was very modern when it launched, but it is now tragically old and ugly. Oracle ended maintenance of Java 8 in January. It may stop making Java 8 available publicly at the end of 2020. Yet my employer’s application won’t work with anything beyond Java 8.

Java on the desktop is not ideal. For business applications, you are much better off with a pure Web application: it is easier to maintain and secure, and it is more portable. Our IT staff knows this, they are not idiots. They are preparing a Web equivalent that should launch… some day… But it is complicated. They do not have infinite budgets and there are many stakeholders.

What do we do while something more modern is being built?

If you are a start-up, you can just switch to an open-source version of Java 8 such as OpenJDK. But we are part of a large organization. We want to rely on supported software: doing otherwise would be irresponsible.

So what do we do?

I think that their current plan is just to stick with Java 8. They have an Oracle license, so they can keep on installing Java 8 on PCs even if Oracle pulls the plug.

But is that wise?

I think that a better solution would be to switch to Amazon Corretto. Amazon recruited James Gosling, Java’s inventor. It feels like the future of Java may be moving into Amazon’s hands.

Update: RedHat is offering paid support for OpenJDK 8.

Nearly Divisionless Random Integer Generation On Various Systems

It is common in software to need random integers within a range of values. For example, you may need to pick an object at random in an array. Random shuffling algorithms require such random integers.

Typically, you generate a regular integer (say, a 64-bit integer). From this integer, you find a way to get an integer within a range. O’Neill showed that this apparently unimportant operation can be more expensive than generating the original random integer.

Naively, you could take the random integer and compute the remainder of the division by the size of the interval. It works because the remainder of the division by D is always smaller than D. Yet it introduces a statistical bias. You can do better without being slower. The conventional techniques supported in Java and Go require at least one or two integer divisions per integer generated. Unfortunately, division instructions are relatively expensive.
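
As a sketch, an unbiased division-based approach might look as follows; it is in the spirit of the conventional techniques, not the exact Java or Go code, and random64 is assumed to return uniformly random 64-bit integers:

#include <cstdint>

uint64_t random64(); // assumed: returns a uniformly random 64-bit integer

// Reject the values that would cause a modulo bias, then take the remainder:
// unbiased, but at the cost of integer divisions (the two % operations).
uint64_t division_based(uint64_t s) {
  uint64_t threshold = -s % s; // 2^64 mod s
  uint64_t x = random64();
  while (x < threshold) { // rarely taken when s is much smaller than 2^64
    x = random64();
  }
  return x % s;
}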

If you are willing to suffer a slight statistical bias, you can generate a floating-point value in [0,1) and multiply the result by the size of the interval. It avoids the division but introduces other overhead.
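
A sketch of that floating-point approach, again with an assumed random64 function:

#include <cstdint>

uint64_t random64(); // assumed: returns a uniformly random 64-bit integer

// Use the top 53 bits so the conversion to double is exact, map to [0, 1),
// then scale by the interval size: no division, but a slight bias remains.
uint64_t floating_point_based(uint64_t s) {
  double r = (random64() >> 11) * (1.0 / 9007199254740992.0); // divide by 2^53
  return (uint64_t)(r * (double)s);
}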

There is a better approach that requires no division most of the time:

uint64_t nearlydivisionless ( uint64_t s ) {
  uint64_t x = random64 () ; // uniformly random 64-bit integer
  // keep the full 128-bit product: the high 64 bits are (almost) the answer,
  // the low 64 bits tell us whether the value must be rejected
  __uint128_t m = ( __uint128_t ) x * ( __uint128_t ) s;
  uint64_t l = ( uint64_t ) m; // low 64 bits of the product
  if (l < s) { // only in this (rare) case can the result be biased
    uint64_t t = -s % s; // 2^64 mod s: the only division, rarely executed
    while (l < t) { // reject and retry
      x = random64 () ;
      m = ( __uint128_t ) x * ( __uint128_t ) s;
      l = ( uint64_t ) m;
    }
  }
  return m >> 64; // high 64 bits: an unbiased value in [0, s)
}
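
Such a function is typically called once per element during a shuffle, with a shrinking bound, as in this Fisher-Yates sketch (bounded stands for any of the strategies discussed here):

#include <cstdint>
#include <utility>
#include <vector>

uint64_t bounded(uint64_t s); // any strategy returning a random value in [0, s)

// Fisher-Yates shuffle: one bounded random index per element.
void shuffle(std::vector<uint32_t>& a) {
  for (size_t i = a.size(); i > 1; i--) {
    size_t j = (size_t)bounded(i); // random index in [0, i)
    std::swap(a[i - 1], a[j]);
  }
}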

Let us suppose that I shuffle an array of size 1000. I generate 64-bit integers (or floating-point numbers) which I convert to indexes. I use C++ but I reimplement the algorithm used in Java. Let me look at the number of nanoseconds per input key being shuffled on a Skylake processor (Intel):

Java’s approach 7.30 ns
Floating point approach (biased) 6.23 ns
Nearly division-free 1.91 ns

So the division-free approach is much better.

Is this result robust to the hardware? Let us try on a Cannon Lake processor, where the division instruction is faster…

Java’s approach 3.75 ns
Floating point approach (biased) 8.24 ns
Nearly division-free 2.53 ns

The division-free approach is less beneficial because of the faster division, but the gains are still there.

What about an ARM processor? Let us try on an Ampere Skylark.

Java’s approach 20.67 ns
Floating point approach (biased) 14.73 ns
Nearly division-free 8.24 ns

Again, the division-free approach wins.

Practically speaking, avoiding integer divisions is a good way to generate faster code.

I make my code available. To ease reproducibility, I have used docker containers: you should be able to reproduce my results.

Further reading: Fast Random Integer Generation in an Interval, ACM Transactions on Modeling and Computer Simulation 29 (1), 2019

Science and Technology links (June 1st 2019)

  1. The DeepMind engineers built an artificial intelligence (a software program) that can learn to play 3D shooter games at super-human levels.
  2. Researchers have found a way to grow large quantities of cells that can generate blood; it works for mice.
  3. Europe passed a law to protect privacy online (GDPR). It comes with annoying warnings regarding cookies. This law had a significant negative effect on emerging European technology firms (startups). It has also increased Google’s monopoly power at the expense of smaller players.
  4. “When adjusting for the unreliability of solar and wind, existing coal will be cheaper than new solar and wind in all the major regions until at least 2040”.
  5. There has been a massive drop in the number of books borrowed from libraries by college students. Myself, I switched to ebooks a decade ago and my office contains a single book: Knuth’s book.
  6. Even old people suffering from Alzheimer’s make new neurons (brain cells).

Science and Technology links (May 25th 2019)

  1. Oculus released the Quest, its latest VR goggles. It requires no wiring, no PC. It supports six degrees of freedom, so it is “true VR”. They cost less than $600. The reviews are critical. The goggles still weigh too much for long-term use and the software is much too limited. I have ordered a pair which I should get in a couple of weeks. I expect that it would work with our GapMinder VR. Oculus is owned by Facebook and there are privacy concerns that come with the Oculus software.
  2. Though many jobs do not legally require a college education, it is increasingly the case that employers require a college degree. It might be irrational according to a Harvard Business School report:

    While a majority of employers pay between 11% and 30% more for college graduates, many employers also report that non-graduates with experience perform nearly or equally well on critical dimensions like time to reach full productivity, time to promotion, level of productivity, or amount of oversight required. Moreover, employers incur significant indirect costs. Seeking college graduates makes many middle-skills jobs harder to fill, and once hired, college graduates demonstrate higher turnover rates and lower engagement levels. A systemic view of the total economics of hiring college graduates shows that companies should be extraordinarily cautious before raising credential requirements for middle-skill positions and should not gravitate toward college graduates based only on a vague notion that it might improve the quality of their workforce.

    Requiring college degrees is a policy that harms some communities:

    Degree inflation particularly hurts populations with college graduation rates lower than the national average, such as Blacks and Hispanics, age 25 years and older.

    Apple’s CEO Tim Cook stated that half of Apple’s US employment last year was made up of people who did not have four-year degrees. Cook stresses that there is a mismatch between what colleges teach and assess, and the actual business needs.

    Some of the best software engineers I have worked with did not have a college degree or even any college. I have done some of my best work with these people.

    My impression is that the leading software companies put little weight on even a computer-science degree in the sense that they have applicants undergo extensive technical testing. Such testing would be foreign to lawyers, nurses or medical doctors.

    I still think that if you are going to go to college, studying computer science is a wise move. But, as an employer, I would not make a degree a requirement when hiring if I could avoid it.

  3. Many of my environmentally conscious colleagues attempt to offset the footprint of their international travel and conferences by the purchase of carbon credits. The general idea is that you can keep on flying many times a year, you can keep on organizing international events, because you give money so that people in Brazil plant trees, forgo farming and oil in favor of solar panels, and so forth. It is like magic! But if an idea is incredibly convenient, maybe it is just not true… Song does an excellent job of arguing against carbon credits.
  4. Between 40% and 50% of your income is due to your genes.
  5. According to an article in Nature, blocking the CD22 protein with antibodies rejuvenates the brain of a mouse. There is hope that it might be applicable to human beings as well.
  6. Google software can translate your speech using your own voice.
  7. Xenon gas can protect the brain and neurons after a brain injury, in mice.
  8. Smartphones can detect who is carrying them by their gait.