Visiting all values in an array exactly once in “random order”

Suppose that you want to visit all values in an array exactly once in “random order”. You could do it by shuffling your array, but shuffling requires extra storage.

You want your code to use just a tiny bit of memory, and you want the code to be super fast. You do not want to assume that your array size is a power of two.

One way to do it is to use the fact that (a x + b) modulo n will visit all integer values in [0,n) exactly once as x iterates through the integers in [0, n), as long as a is coprime with n. Being coprime just means that the greatest common divisor of a and n is 1. There are fast algorithms, such as the Euclidean algorithm, to compute the greatest common divisor of a and n.

A trivial coprime number would be a = 1, but that’s bad for obvious reasons. So we pick a coprime number in [n/2,n) instead. There is always at least one no matter what n is: for example, n − 1 lies in that range and is always coprime with n.

Enumerating all coprime numbers in [n/2,n) could get tiresome when n is very large, so maybe we just look at up to 100,000 of them. There is no need to actually store them in memory, we can just select one at random, so it requires very little memory.
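The selection can be sketched in a few lines of C. This is my own sketch, not the code from the post (the function names `gcd` and `pick_coprime` are mine); it tests random candidates in [n/2, n) until one is coprime with n:

```c
#include <stdint.h>
#include <stdlib.h>

// Greatest common divisor via the Euclidean algorithm.
static uint64_t gcd(uint64_t a, uint64_t b) {
  while (b != 0) {
    uint64_t t = a % b;
    a = b;
    b = t;
  }
  return a;
}

// Pick a multiplier in [n/2, n) that is coprime with n.
// Coprime values are plentiful in that range, so the loop
// terminates quickly in practice.
static uint64_t pick_coprime(uint64_t n) {
  for (;;) {
    uint64_t a = n / 2 + (uint64_t)rand() % (n - n / 2);
    if (gcd(a, n) == 1) return a;
  }
}
```

No list of coprime numbers is ever materialized: we only keep the one candidate being tested, so the memory usage is constant.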

To see why the mathematics work, suppose that (a x + b) modulo n = (a x' + b) modulo n.
Then a (x − x') modulo n = 0, which only happens when (x − x') is a multiple of n,
because a and n are coprime. Thus if you map a consecutive range of n
values x, you will get n distinct values (a x + b) modulo n.
The choice of the parameter a is critical however: if you set a to 1 or 2,
even if it is coprime with n, the result will not look random.

The running code is ridiculously simple:

    public int getCurrentValue() {
      return (int) (((long) index * prime + offset) % maxrange);
    }

    public boolean hasNext() {
      return index < maxrange;
    }

    public int next() {
      int answer = getCurrentValue();
      index++;
      return answer;
    }
There are various ways to optimize this code further (e.g., you can increment a counter instead of doing a multiplication), but it is likely faster than most other approaches people come up with in practice. Of course, it is not really random in the sense that no (good) statistician should accept the result as a fair shuffle of the indexes. Still, it might be “good enough” to fool your colleagues into thinking that it is random.

While my implementation assumes that you are visiting the values in order, nothing prevents you from going back in time, or jumping forward and backward arbitrarily: just set the index accordingly.

I make my Java code available. It can be made more elegant, but it should work just fine in your projects.

(As pointed out by Leonid Boytsov, this approach is reminiscent of the Linear congruential generators that are used to produce random numbers.)

If you can find ways to make the result “look” more random without significantly making it slower and without increasing memory usage, please let us know.

You can find ready-made solutions to visit all values in an array with a power of two number of elements. And by restricting your traversal to the subset of elements in [0,n) from a larger virtual array having a power of two size, you get an alternative to the approach I describe, with the caveat that your main code will require branching. The worst-case computational complexity of a call to “next” becomes O(n), whereas I use a small, fixed number of instructions.
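The rejection idea can be sketched as follows. This is a minimal illustration under my own naming (`next_power_of_two`, `toy_permute`, `next_in_range` are mine), with a deliberately trivial bijection standing in for a proper power-of-two shuffle:

```c
#include <stdint.h>

// Round up to the next power of two.
static uint64_t next_power_of_two(uint64_t n) {
  uint64_t p = 1;
  while (p < n) p <<= 1;
  return p;
}

// A toy bijection over [0, m) for m a power of two: XOR with a
// constant below m. Real code would use a better-scrambled mapping.
static uint64_t toy_permute(uint64_t x, uint64_t m) {
  return x ^ (m >> 1);
}

// Traverse [0, n) by permuting the larger range [0, m) and rejecting
// values >= n. A single call may have to skip many values, hence the
// branching and the worse worst-case cost mentioned above.
static uint64_t next_in_range(uint64_t *index, uint64_t n) {
  uint64_t m = next_power_of_two(n);
  uint64_t candidate;
  do {
    candidate = toy_permute((*index)++, n == 0 ? 1 : m);
  } while (candidate >= n);
  return candidate;
}
```

Calling `next_in_range` n times (with the index starting at zero) visits every value in [0, n) exactly once, at the price of consuming up to m indexes.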

Computing the inverse of odd integers

Given x, its (multiplicative) inverse is another value y such that x y = y x = 1. We all know that the multiplicative inverse of x is 1/x and it exists as long as x is non-zero. That’s for real numbers, or at least, rational numbers.

But the idea of a multiplicative inverse is more general.

It certainly fails for integers in general. For example, there is no integer x such that 2 x is 1. But, maybe surprisingly, all odd integers have an inverse if you use normal computer arithmetic. Indeed, when your processor computes x y, it actually outputs x y modulo 2^32 or x y modulo 2^64, depending on whether you use 32-bit or 64-bit instructions. (The value x y modulo 2^64 can be defined as the remainder of the division of x y by 2^64.)

Let me briefly explain why there must be an inverse. You can skip this part if you want. Take any odd integer x. Because x is odd, it is not divisible by 2. In fact, that’s what it means to be odd. But this also means that powers of x will also be odd. E.g., x^k is also odd for any integer k. Ok. So x^k modulo 2^64 is never going to be zero. The only way it could be zero is if x^k were divisible by 2^64, but that’s impossible because it is an odd value. At the same time, we only have a finite number of distinct odd values smaller than 2^64. So it must be the case that x^k modulo 2^64 = x^k' modulo 2^64 for some pair of powers k and k'. Assume without loss of generality that k is larger than k'. Then we have that x^(k−k') modulo 2^64 = 1 modulo 2^64 (I am not proving this last step, but you can figure it out from the previous one). And thus it follows that x^(k−k'−1) is the inverse of x. If you did not follow this sketch of a proof, don’t worry.

So how do you find this inverse? You can brute force it, which works well for 32-bit values, but not so well for 64-bit values.

Wikipedia has a page on this, entitled modular multiplicative inverses. It points to an Euclidean algorithm that appears to rely on repeated divisions.

Thankfully, you can solve for the inverse efficiently using very little code.

One approach is based on “Newton’s method”. That is, we start with a guess and from the guess, we get a better one, and so forth, until we naturally converge to the right value. So we need some formula f(y), so that we can repeatedly call y = f(y) until y converges.

A useful recurrence formula is f(y) = y (2 − y x) modulo 2^64. You can verify that if y is the 64-bit inverse of x, then this will output y. So the formula passes a basic sanity test. But would calling y = f(y) repeatedly converge to the inverse?

Suppose that y is not quite the inverse; suppose that x y = 1 + z 2^N for some z and some N that is smaller than 64. So y is the inverse “for the first N bits” (where “first” means “least significant”). That is, x y modulo 2^N = 1.

It is easy to find such a y for N greater than zero. Indeed, let y = 1; then x y = x = 1 + z 2^1 for some z, because x is odd.

Ok, so substituting x y = 1 + z 2^N in
y (2 − y x) modulo 2^64, we get
y (2 − (1 + z 2^N)) modulo 2^64, or
y (1 − z 2^N) modulo 2^64.
So I set y' = f(y) = y (1 − z 2^N) modulo 2^64.
What is x y'? It is (1 + z 2^N) (1 − z 2^N) modulo 2^64,
or (1 − z^2 2^(2N)) modulo 2^64.

That is, if y was the inverse “for the first N bits”, then y' = f(y) is the inverse “for the first 2 N bits”. In some sense, I double my precision each time I call the recurrence formula. This is great! This means that I will quickly converge to the inverse.

Can we do better, as an initial guess, than y = 1? Yes. We can start with a very interesting observation: if we use 3-bit words, instead of 32-bit or 64-bit words, then every number is its own inverse. E.g., you can check that 3*3 modulo 8 = 1.

(Marc Reynolds pointed out to me that you can get 4 bits of accuracy by starting out with x * x + x - 1 .)

So a good initial guess is y = x, and that already buys us 3 bits. The first call to the recurrence formula gives me 6 bits, then 12 bits for the second call, then 24 bits, then 48 bits, then 96 bits, and so forth. So, we need to call our recurrence formula 4 times for 32-bit values and 5 times for 64-bit values. I could actually go to 128-bit values by calling the recurrence formula 6 times.

Here is the code to compute the inverse of a 64-bit integer:

uint64_t f64(uint64_t x, uint64_t y) {
  return y * (2 - y * x);
}

static uint64_t findInverse64(uint64_t x) {
   uint64_t y = x; // 3 bits of accuracy
   y = f64(x, y);  // 6 bits
   y = f64(x, y);  // 12 bits
   y = f64(x, y);  // 24 bits
   y = f64(x, y);  // 48 bits
   y = f64(x, y);  // 96 bits: more than enough for 64-bit values
   return y;
}
I wrote a complete command-line program that can invert any odd number quickly.
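As a sanity check, multiplying an odd x by its computed inverse must wrap around to exactly 1 modulo 2^64, since unsigned overflow in C is defined to wrap. Here is a self-contained sketch (it restates the two routines so the snippet compiles on its own; `is_inverse_correct` is my own helper name):

```c
#include <stdint.h>

static uint64_t f64(uint64_t x, uint64_t y) {
  return y * (2 - y * x); // Newton step; wraps modulo 2^64
}

static uint64_t findInverse64(uint64_t x) {
  uint64_t y = x;    // correct on the low 3 bits
  y = f64(x, y);     // 6 bits
  y = f64(x, y);     // 12 bits
  y = f64(x, y);     // 24 bits
  y = f64(x, y);     // 48 bits
  y = f64(x, y);     // 96 bits, more than the 64 we need
  return y;
}

// Returns 1 when x * findInverse64(x) wraps around to 1 (x must be odd).
static int is_inverse_correct(uint64_t x) {
  return x * findInverse64(x) == 1;
}
```

For example, the inverse of 3 modulo 2^64 is 0xAAAAAAAAAAAAAAAB, since 3 times that value is 2^61 + … + 1, which wraps to 1.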

Each call to the recurrence formula should consume about 5 CPU cycles so that the whole function should take no more than 25 cycles or no more than the cost of a single integer division. Actually, it might be cheaper than a single integer division.

Because of the way we construct the inverse, if you somehow knew the 32-bit inverse, you could call the recurrence formula just once to get the 64-bit inverse.

How did we arrive at this formula (y (2 − y x))? It is actually a straightforward application of Newton’s method as one would apply it to finding the zero of g(y) = 1/y − x. So there is no magic involved.
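Spelling out the Newton step for g(y) = 1/y − x makes the recurrence appear directly:

```latex
y' = y - \frac{g(y)}{g'(y)}
   = y - \frac{1/y - x}{-1/y^2}
   = y + y^2\left(\frac{1}{y} - x\right)
   = 2y - x y^2
   = y\,(2 - x y).
```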

My code seems to assume that I am working with unsigned integers, but the same algorithm works with signed integers, and in binary form, it will provide the same results.

Reference and further reading: Granlund and Montgomery, SIGPLAN Not. (1994). Some people point me at On Newton-Raphson iteration for multiplicative inverses modulo prime powers by Dumas (2012).

Credit: Marc Reynolds asked on Twitter for an informal reference on computing the multiplicative inverse modulo a power of two. It motivated me to write this blog post.

The Xorshift1024* random number generator fails BigCrush

In software, we use random number generators to simulate randomness. They take the form of functions which return numbers that appear random. To test their quality, we apply statistical tests. The gold standard is a statistical test suite called BigCrush. It tests the quality of 32-bit random values.

In an earlier post, I reported that contrary to what you can read on the Xorshift Wikipedia page, the Xorshift128+ random number generator fails BigCrush. This is somewhat significant because Xorshift128+ has been widely adopted.

The Xorshift Wikipedia page also states that a more expensive generator, xorshift1024*, passes BigCrush. So I wondered, does it, indeed, pass BigCrush?

So I used my testing framework to run BigCrush. I use four different “seeds” and I only worry about a failure if it occurs with all four seeds, and with an excessively improbable p-value. Because xorshift1024* generates 64-bit outputs, and BigCrush requires 32-bit inputs, I only keep the 32 least significant bits of each word produced by xorshift1024*.

Here are my results:

 ./ testxorshift1024star*.log
reviewing xorshift1024star lsb 32-bits
Summary for xorshift1024star lsb 32-bits (4 crushes):
- #81: LinearComp, r = 29: FAIL!! -- p-values too unlikely (1 - eps1, 1 - eps1, 1 - eps1, 1 - eps1) -- ALL CRUSHES FAIL!!

reviewing xorshift1024star lsb 32-bits (bit reverse)
Summary for xorshift1024star lsb 32-bits (bit reverse) (4 crushes):
- #80: LinearComp, r = 0: FAIL!! -- p-values too unlikely (1 - eps1, 1 - eps1, 1 - eps1, 1 - eps1) -- ALL CRUSHES FAIL!!

So xorshift1024* fails BigCrush systematically when providing the values as is (log file 1, 2, 3, 4), and it fails again when reversing the bit order (log file 1, 2, 3, 4).

So the Wikipedia entry is misleading. Both xorshift128+ and xorshift1024* fail BigCrush.

My code is available for review, and you can rerun the tests for yourself.

How fast are malloc_size and malloc_usable_size in C?

When programming in C, we allocate memory dynamically using the malloc function, by passing how many bytes we wish to allocate. If successful, the function returns a pointer to the newly allocated memory. Later we can free the allocated memory with the free function.

In general, unless you are keeping track of it yourself, you cannot recover the number of bytes you wanted to allocate. That’s because the malloc function can allocate more memory than you request. For example, if you call malloc(1), you should not assume that only one byte was allocated.

However, you can ask the system to report the number of allocated bytes. It will always be at least as much as you requested but could be more (possibly a lot more). Under macOS, you call malloc_size and under Linux you call malloc_usable_size.
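A minimal Linux illustration follows (the helper name `usable_for_request` is mine; on macOS you would include `malloc/malloc.h` and call `malloc_size` instead of `malloc_usable_size`):

```c
#include <malloc.h> // Linux: declares malloc_usable_size
#include <stddef.h>
#include <stdlib.h>

// Returns the number of usable bytes the allocator actually handed out
// for a request of `requested` bytes (0 on allocation failure).
size_t usable_for_request(size_t requested) {
  char *buffer = malloc(requested);
  if (buffer == NULL) return 0;
  size_t usable = malloc_usable_size(buffer);
  free(buffer);
  return usable;
}
```

The reported count is always at least the requested size, and every one of the reported bytes is safe to read and write.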

I have never seen these functions actually used in the wild. But they could possibly be useful. For example, if you allocate memory for a string buffer, why would you painfully keep track of how much memory you requested when you can simply ask for the number of allocated bytes, which is potentially more useful?

At a minimum, these functions could help with debugging: you could easily check dynamically that you have enough memory allocated throughout the life of your program.

One reason to avoid these functions is that they are not standard. This means that if your code relies on them, it is possible that you could have difficulties porting your code to a new platform in the future.

Another good reason to track the number of bytes yourself is performance.

So how fast are they?

Function                       Time (CPU cycles)
malloc_size (macOS)            ~50
malloc_usable_size (Linux)     ~10

On Linux, malloc_usable_size is fast enough for almost every purpose, except maybe for your most performance-critical code. On macOS, malloc_size should be used sparingly.

This matches a performance pattern I have observed: the standard C libraries under Linux have very fast memory allocation compared to Apple platforms.

My source code is available.

Credit: This blog post was inspired by Slava Pestov.

Science and Technology links (September 15th, 2017)

An actual robot that can write in Japanese passed university entrance tests in Japan. How well did it do? It bested 80% of all human candidates. It is not good enough to enter the very best Japanese university, but it is good enough to enter in many other universities. The robot’s artificial intelligence shows better reading comprehension, as measured by the tests, than most human beings. The lead designer stresses that the robot does not really understand what it reads, but then it seems that something like 30% of high school students, in Japan, can’t pass basic reading comprehension tests. I stress that the robot had to use the same forms that the students used. It had to write with a pen.

China uses facial scanners to monitor university dorms. To enter the dorms, students must have their card and they must be recognized by the camera.

This week, Apple released the iPhone X. You unlock the phone through facial recognition (by looking at the phone). The new phones have a new processor, the A11 Bionic. According to some reports, it matches recent high-end Intel chips in performance, for both single-threaded and multi-threaded performance. It is hard to say whether this comparison is meaningful, but the results do indicate that it is a significantly faster processor than what the older iPhone 7 uses (Apple refers to a 30% gain). The term “bionic” is a reference to the support, in silicon, for machine-learning (or AI) code. Also, Apple took to designing its own graphics processors, instead of licensing an external design. Apple also released a version of its watch that can be used to make phone calls, without relying on a nearby phone.

Another point worth stressing is that the new iPhones are dropping support for 32-bit applications, at the hardware level. It might allow Apple to optimize the silicon somewhat, though I am unclear as to how much of a benefit this move provides in terms of performance and/or power usage. In any case, this means that it is all 64-bit applications from now on. In comparison, Microsoft Windows is still available in a 32-bit edition.

The current therapies for tooth decay are rather invasive. Scientists in Belfast believe that tooth decay could be reversed using the body’s own repair mechanism, with the help of aspirin.

The next step is to go and try to figure out how you are going to apply the aspirin to the teeth, to regenerate the dentine and to replace the need for fillings.

And before you try it: putting an actual aspirin on your tooth won’t work.

Wired reports that we have achieved a breakthrough in human-computer interface. Using an armband, you can interact with your computer which implies that you can type on a keyboard without having an actual keyboard. This product from CTRL-Labs sounds like a more powerful product than the Myo armband that you can get cheaply on Amazon. However, the CTRL-Labs armband is unreleased and the Wired article reads a bit like an ad.

Scholarship is conservative, tinkering is progressive

We are not rational beings. We cannot string simple sequences of logical arguments without making mistakes. We cannot reason probabilistically.

There is ample evidence that rational arguments fail to convince. That’s not surprising given how poor we are at evaluating rational arguments.

We teach what we have learned. We repeat what we have heard.

People hold on to their conservative models… science and scholarship are fundamentally conservative. Being a scholar is a bet against progress. If you write a book about a given field, and the field gets fundamentally transformed, all your hard-earned expertise becomes worthless.

And yet we make progress. Not on all fronts, and not always fast enough… but we move forward. Kuhn advocated that progress happens through “paradigm shifts”.

How is that possible if every intellectual is a conservative?

Not everybody, not every culture, is betting against progress. For every theoretician that insists on keeping his model, there is an engineer or an entrepreneur whose mind is set on constructing reality.

To construct a new reality, you cannot rely on a comforting theoretical framework, as these tend to be fundamentally conservative. So you have to accept that you will be tinkering.

Long before we had Maxwell’s laws, we had tinkerers designing applications for electricity. Long before we had thermodynamics, we had useful engines.

How many people place their bet on progress? This depends very much on your culture. You can very easily make it untenable to be a tinkerer if you ensure that progress cannot happen. Somehow the Western civilization embraced tinkerers, at least in part.

But tinkering is dangerous. Inventing electricity can make the world safer or riskier. You can’t tell which one ahead of time. You have to try.

It is not immediately obvious why a civilization would allow tinkering. But I think it has to do with the fact that without some tinkering, civilizations tend to be fragile, in Nassim Taleb’s sense.

We do not live in a static world. There are strong external shocks. We deplete our resources, we face new diseases, we enter into unforeseen wars…

In comparison, tinkering causes harm to the civilization, but at a fairly low intensity. And a tinkering civilization becomes more robust because it is kept on its toes. That’s ultimately a lot better than waiting for a massive collapse.

The Xorshift128+ random number generator fails BigCrush

In software, we generate randomness with algorithms called “pseudo-random number generators”, sometimes simply called “random number generators” or RNGs. A popular random number generator is xorshift128+: it is used by many JavaScript engines. It was designed by an influential computer science professor, Sebastiano Vigna, who has done a lot of great work. I suspect that the JavaScript engineers made a fine choice by selecting xorshift128+.

How can you tell that your random number generator has a good quality? A standard test is TestU01 designed by professor L’Ecuyer from the University of Montreal. TestU01 comes with different sets of tests, but BigCrush is the gold standard. A good random number generator should pass BigCrush when you pass the values as-is, as well as when you reverse the bits produced by the random number generator. Indeed, Vigna writes about another popular random number generator:

A recent example shows the importance of testing the reverse generator. Saito and Matsumoto [2014] propose a different way to eliminate linear artifacts (…) However, while their generator passes BigCrush, its reverse fails systematically (…) which highlights a significant weakness in its lower bits.

Passing BigCrush, even after reversing the bits, is not an impossible task. The SplittableRandom class in Java appears to pass (Steele et al., 2014), and so does the PCG family. And, of course, all cryptographic random number generators pass BigCrush, irrespective of the order of the bits.

On Wikipedia, we can read about xorshift128+ passing BigCrush robustly:

the following xorshift128+ generator uses 128 bits of state (…) It passes BigCrush, even when reversed.

Is Wikipedia correct? It offers a reference by Vigna who invented xorshift128+ (Vigna, 2017). Vigna’s journal article states:

(…) we propose a tightly coded xorshift128+ generator that does not fail any test from the BigCrush suite of TestU01 (even reversed) (…) xorshift128+ generator (…) is currently the fastest full-period generator we are aware of that does not fail systematically any BigCrush test (not even reversed)

That seems like a pretty definitive statement. It admits that there might be statistical failure, but no systematic one. So it would seem to support the Wikipedia entry.

The xorshift128+ code is not difficult:

#include <stdint.h>
uint64_t s[2];
uint64_t next(void) {
  uint64_t s1 = s[0];
  uint64_t s0 = s[1];
  uint64_t result = s0 + s1;
  s[0] = s0;
  s1 ^= s1 << 23; // a
  s[1] = s1 ^ s0 ^ (s1 >> 18) ^ (s0 >> 5); // b, c
  return result;
}

Can we check the claim? The BigCrush benchmark is available as free software, but it is a pain to set it up and run it. So I published a package to test it out. Importantly, you can check my software, compile it, run it, review it… at your leisure. I encourage you to do it! I use Vigna’s own C implementation.

Statistical tests are never black and white, but you can use p-values to arrive at a reasonable decision as to whether the test passes or not. The BigCrush implementation does this analysis for us. To make things more complicated, random number generators rely on an initial “seed” that you input to initiate the generator. Provide a different seed and you will get different results.

So I did the following when running BigCrush:

  • I use four different input seeds. I only care if a given test fails for all four seeds. I use 64-bit seeds from which I generate the needed 128-bit seed using another generator (splitmix64), as recommended by Vigna. I just chose my seeds arbitrarily, you can try with your own if you have some free time!
  • I only care when BigCrush gives me a conclusive p-value that indicates a clear problem.
  • I use the least significant 32 bits produced by xorshift128+, in reversed bit order. (BigCrush expects 32-bit inputs.)
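For reference, reversing the bit order of a 32-bit word can be done with the usual swap-in-halves trick. This is my own helper for illustration, not the routine from the testing framework:

```c
#include <stdint.h>

// Reverse the bits of a 32-bit word by swapping progressively
// smaller groups: 16-bit halves, bytes, nibbles, pairs, single bits.
uint32_t reverse_bits32(uint32_t x) {
  x = (x >> 16) | (x << 16);
  x = ((x & 0xFF00FF00) >> 8) | ((x & 0x00FF00FF) << 8);
  x = ((x & 0xF0F0F0F0) >> 4) | ((x & 0x0F0F0F0F) << 4);
  x = ((x & 0xCCCCCCCC) >> 2) | ((x & 0x33333333) << 2);
  x = ((x & 0xAAAAAAAA) >> 1) | ((x & 0x55555555) << 1);
  return x;
}
```

Applying it twice recovers the original word, which makes it easy to test.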

Let us review the results.

seed: 12345678

 The following tests gave p-values outside [0.001, 0.9990]:
 (eps  means a value < 1.0e-300):
 (eps1 means a value < 1.0e-15):
       Test                          p-value
 68  MatrixRank, L=1000, r=0          eps  
 71  MatrixRank, L=5000               eps  
 80  LinearComp, r = 0              1 - eps1

seed: 412451

The following tests gave p-values outside [0.001, 0.9990]:
 (eps  means a value < 1.0e-300):
 (eps1 means a value < 1.0e-15):
       Test                          p-value
 68  MatrixRank, L=1000, r=0          eps  
 71  MatrixRank, L=5000               eps  
 80  LinearComp, r = 0              1 - eps1

seed: 987654

The following tests gave p-values outside [0.001, 0.9990]:
 (eps  means a value < 1.0e-300):
 (eps1 means a value < 1.0e-15):
       Test                          p-value
 68  MatrixRank, L=1000, r=0          eps  
 71  MatrixRank, L=5000               eps  
 80  LinearComp, r = 0              1 - eps1

seed: 848432

 The following tests gave p-values outside [0.001, 0.9990]:
 (eps  means a value < 1.0e-300):
 (eps1 means a value < 1.0e-15):
       Test                          p-value
 24  ClosePairs mNP1, t = 9          9.2e-5
 24  ClosePairs NJumps, t = 9        1.0e-4
 68  MatrixRank, L=1000, r=0          eps  
 71  MatrixRank, L=5000               eps  
 80  LinearComp, r = 0              1 - eps1


Xorshift128+ fails BigCrush when selecting the least significant 32 bits and reversing the bit order. The evidence is clear: I used four different seeds, and it failed MatrixRank and LinearComp in all four cases. Thus the Wikipedia entry is misleading in my opinion.

While I reversed the bit orders, you can also get systematic failures by simply reversing the byte orders. You select the least significant 32 bits, and reverse the resulting four bytes.

The recommended replacement for xorshift128+, xoroshiro128+, also fails BigCrush in a similar manner (as you can verify by yourself). Yet the xoroshiro128+ Wikipedia page could serve as an unequivocal recommendation:

Xoroshiro is the best general purpose algorithm currently available. (…) Mersenne Twister might still be a better choice for highly conservative projects unwilling to switch to such a new algorithm, but the current generation of statistically tested algorithms brings a baseline of assurance from the outset that previous generations lacked.

I feel that this would need to be qualified. In my tests, Xoroshiro is no faster than SplittableRandom (which I call splitmix64 following Vigna’s terminology), and SplittableRandom passes BigCrush while Xoroshiro does not. Recall that the PCG functions also pass BigCrush and they are quite fast. There are other choices as well.

To be clear, there is probably no harm whatsoever in using xorshift128+, but it does systematically fail reasonable statistical tests. If your core argument for choosing a generator is that it passes hard statistical tests, and it fails, I think you have to change your argument somewhat.

Science and Technology links (September 8th, 2017)

I always naively assumed that the discovery that the speed of light is finite was recent. Yet back in the 17th century, scientists had already concluded that the speed of light must be finite due to astronomical aberrations.

Scientists rejuvenated the brain of old mice using a single protein (ARC).

We can use software to write fake reviews that can pass as being useful. This can be used to potentially spam review sites.

Nature reports that injecting the brains of monkeys with stem cells seems to help combat Parkinson’s disease. By the way, we can transform skin cells directly into neurons.

Magic Leap is a company that claims to be working on very advanced augmented reality glasses. They have received more money than God. Their technology is supposedly beyond anything others are working on. We now have access to one of their patents. This might be the first sign of the future. Or not.

As we age, our mitochondria (power cells) become less efficient, fewer in numbers and damaged. UCLA scientists showed how to improve matters in fruit flies by dosing them with a single protein.

Microsoft released Seeing AI, an iPhone app that lets you take pictures and then tries to guess what the picture represents. If you take a picture of an individual, it will try to guess the mood and the age of the individual. The app is free and works well.

The magazine Wired reports on experiments where we took away all forms of sugars from mice. It seems beneficial. A related Canadian study found that eating proteins all day long can help you remain more muscular. In an extensive study based on 18 countries, sugars (carbohydrates) were associated with higher mortality whereas fat was associated with lower mortality.

If you have a PlayStation 4 and you are looking for a relaxing game this weekend, I recommend Uncharted: The Lost Legacy by Naughty Dog. It is a beautiful game.

Nature reports on a new class of senolytics, these are products that kill the senescent cells we accumulate as we age. If you have been paying attention to my posts, senescent cells are currently the “next big thing” in anti-aging. We have pretty much established that if we could safely remove senescent cells from your body, you would be more fit, assuming that you are past your prime. The research question is whether we can safely remove senescent cells, at an affordable cost.

Go does not inline functions when it should

I have designed a benchmark that I run in different programming languages. Two languages that I like are Go (from Google) and Java (from Oracle). My expectation would be that Go and Java should have similar performance in due time. Indeed, both are modern garbage-collected languages supported by major corporations with deep pockets.

It used to be that Go was handicapped in my benchmark because the language did not offer a simple way to compute population counts using specialized processor instructions. Go 1.9 changes that with the introduction of the math/bits package. I was hoping for Go to catch up to Java.

So how does Go fare against Java now?

Let us sum this up:

          Create    Count     Iterate
Java 8    5 ms      0.6 ms    4 ms
Go 1.9    10 ms     1.2 ms    6 ms

So Go is often running at half the speed compared to Java. The Count benchmark in Go is about two times faster than it was prior to Go 1.9, but it is still far from Java.

What could explain such a difference? If you look at the code, it is all very simple loops.

Modern programming involves many short functions. Programming with short functions makes code review much easier, and it avoids code duplication.

Calling a function can be a relatively expensive process. The system has to copy the data in the right registers, allocate stack memory, and so forth. Moreover, functions are somewhat opaque to the optimizing compiler so that many easy optimizations are simply not possible without function inlining.

As programmers, we usually do not worry about the performance cost of having many function calls because we expect the function calls to be mostly optimized away. One way the system optimizes away function calls is by “inlining” the function… in effect, it “copies” the code in place, so that there is no function call at all. That’s hardly bleeding edge technology in 2017: most optimizing compilers have been doing it for decades.

Inlining is not always useful: when used indiscriminately, it can grow the size of the executables. You do not want to inline large functions. But if you have a long loop that repeatedly calls the same small function, you can expect to greatly benefit by inlining the function in question.

Yet, as far as I can tell, Go is terribly shy about inlining function calls even when it would obviously make sense. We can verify that Go does not inline small hot functions within tight loops in my benchmark by examining the assembly:

$ go test -c && go tool objdump -S -s BenchmarkLemireCreate bitset.test |grep CALL
  0x5193d3		e848c9ffff		CALL*BitSet).Set(SB)

What the Set function does is simple: it sets a single bit in a single machine word, but Go won’t inline it, possibly because there is a branch involved. We can double the speed of the Create test if we manually inline the Set function calls and do some minor surgery on how the data gets reallocated.

Even when Go apparently inlines the function calls, as in the math/bits calls, it seems to surround the single instruction that should be emitted with guarding code. In effect, the processor checks that the instruction in question is supported each and every time it is called. That can probably reduce the performance of the instruction by a factor of two.

Should you care? I think you should. Having half the speed means that you might end up using two cores to solve the same problem in the same time. That’s twice the energy usage! And that is if you are lucky: parallelizing is hardly free from complexity and pitfalls.

Of course, my benchmark is probably not representative of whatever systems people build, but it is also not crazily unrealistic. It is likely representative of many performance bottlenecks.

Go has to get inlining right.

Further reading:

George Tankersley has a GopherCon 2017 talk (I want to Go fast) about how to get Go to inline small functions. He gets good results, but the process is not pretty. It is clearly a case where he is fighting against the optimizing compiler in a manner that should not be necessary.

Appendix. My results are reproducible.

Let us first run the benchmark using Java 8:

$ git clone
$ cd microbenchmarks
$ mvn install
$ java -cp target/microbenchmarks-0.0.1-jar-with-dependencies.jar me.lemire.microbenchmarks.bitset.Bitset

Benchmark                   Mode  Samples  Score   Error  Units
m.l.m.b.Bitset.construct    avgt        5  0.008 ± 0.001   s/op
m.l.m.b.Bitset.count        avgt        5  0.001 ± 0.000   s/op
m.l.m.b.Bitset.iterate      avgt        5  0.005 ± 0.000   s/op

Let us run the benchmark using Go 1.9:

$ go get
$ cd ~/go/src/
$ go test -bench=Lemire
goos: linux
goarch: amd64
BenchmarkLemireCreate-2    	     100	  10440889 ns/op
BenchmarkLemireCount-2     	    1000	   1271236 ns/op
BenchmarkLemireIterate-2   	     200	   6111728 ns/op
ok	4.414s

Science and Technology links (September 1st, 2017)

Richer countries tend to have higher longevity and lower fertility. What is the cause and what is the effect? The Hajnal line separates Western Europe from Eastern Europe, and it seems that, going back centuries, people on the western side of the line had lower fertility because women married late or not at all. It is unclear what explains this cultural fact, but it is perhaps significant that it preceded the rise of capitalism and the industrial revolution. Thus one could theorize that cultural changes often drive economics and technology. That is, we got the industrial revolution because more women married late, as opposed to the industrial revolution freeing women from early marriage.

There are now SD cards (the tiny cards you put in your Android smartphone) with a 400GB capacity. That’s more storage capacity than many laptops today.

It looks like canakinumab, an anti-arthritis drug, can help fight heart disease. This is yet more evidence that uncontrolled inflammation is a driver of age-related conditions. By the way, you can drastically reduce your heart-disease risk by taking a daily dose of aspirin (caveat: it has the side effect of making it more likely that you will bleed out in case of injury).

Osteocalcin is produced in our bones. As we age, we produce less of it. In previous work, it was shown to rejuvenate the muscles of old mice. In recent work, it was shown to rejuvenate memory (in mice). It should be quite safe to increase one’s level of osteocalcin.

After the age of 25, your exercise capacity tends to diminish year after year. Recent research suggests that this is largely preventable, however. High-intensity interval training (HIIT) can do wonders (in mice): “Those four mice who had exhibited the kinds of deficits that correlate to frailty in humans (…) The HIIT actually reversed frailty in them.” To be clear, HIIT does not mean taking a walk or running for 30 minutes. You have to push your body to its limits. Surprisingly, it is well tolerated even by very old individuals.

Aubrey de Grey is a controversial gerontologist, controversial because he proposed 20 years ago that we could stop aging using engineering. It is evidently much less controversial today. He recently gave a very nice talk entitled How old can we get? In this talk, using what appears to be solid statistics, he shows that human beings appear to hit a “wall” near age 115. That is, while your risk of death increases exponentially with age, doubling every few years… things get much, much worse for those of us making it close to 115 years of age. That is quite interesting because we do not know why that is. The context is that many more of us make it to age 90, or to age 100. In Japan, the number of centenarians has exploded. Yet despite these longevity gains, despite the fact that there are billions of us, hardly anyone ever gets close to 120 years of age. It appears that something happens when you get close to 115 years old that makes your risk of death rise “super exponentially”. You could easily conjecture that the body “wears out”, and in some sense you are certainly right, but “wearing out” tends to follow statistical laws… if all cars of a given make fail exactly one day after the warranty expires, it is not mere “wearing out”, it is programmed death. So is there some kind of fail safe that ensures none of us can make it beyond 115 years of age? Probably there is nothing so sinister, but it makes for a nice narrative.