Fast random shuffling

In a random shuffle, you want to take the elements of a list and reorder them randomly. In a “fair” random shuffle, all possible permutations must be equally likely. It is surprisingly hard to come up with a fair algorithm. Thankfully, there is a fast and easy-to-implement algorithm: the Fisher-Yates shuffle. It is a rather intuitive algorithm and there are YouTube videos about it… so, I will just point you at a piece of C code:

for (uint32_t i = size; i > 1; i--) {
   uint32_t p = random_bounded(i); // number in [0,i)
   swap(array + i - 1, array + p); // swap the values at positions i-1 and p
}
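For completeness, here is a minimal self-contained sketch (the swap helper and the shuffle wrapper are mine; random_bounded can be any routine returning a uniform value in [0, range), such as the ones discussed below):

#include <stdint.h>

uint32_t random_bounded(uint32_t range); // uniform value in [0, range), see below

static void swap(uint32_t *a, uint32_t *b) {
  uint32_t tmp = *a;
  *a = *b;
  *b = tmp;
}

// Fisher-Yates shuffle of 'size' 32-bit values
void shuffle(uint32_t *array, uint32_t size) {
  for (uint32_t i = size; i > 1; i--) {
    uint32_t p = random_bounded(i);  // random position in [0, i)
    swap(array + i - 1, array + p);
  }
}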

What can we expect to limit the speed of this algorithm? Let me assume that we do not use fancy SIMD instructions or parallelization.

If the input array is not in the cache, and we cannot fetch it in time or it is just too large, then cache misses will dominate the running time. So let us assume that the array is in the CPU’s cache.

If we have N input words, we go through the loop N – 1 times. At each iteration of the loop, we need to read two values and write two values. A recent x64 processor can only store one value to memory per cycle, so we cannot do better than two cycles per input word. In the very next iteration, we may need to read one of the recently written values. So, two cycles per input word is probably optimistic.

What else could be the problem? The generation of the random numbers could hurt us. Let us assume that we are given a random number generation routine that we cannot change. For this blog post, I will stick with PCG.

What remains? Notice how the Fisher-Yates shuffle requires numbers in a range. The typical techniques to generate random numbers in a range involve frequent divisions.

For example, you might want to look at how the Go language handles it:

func (r *Rand) Int31n(n int32) int32 {
	max := int32((1 << 31) - 1 - (1<<31)%uint32(n))
	v := r.Int31()
	for v > max {
		v = r.Int31()
	}
	return v % n
}

This function always involves two divisions. Java, the PCG library… all involve at least one division per function call, often many more than one. Sadly, divisions are many times more expensive than any other operation, even on recent processors.
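For reference, here is a sketch of what such a function might look like in C (my own adaptation of the Java-style rejection approach, assuming the same random32() routine used below); it needs one division per call, and more whenever a candidate gets rejected:

#include <stdint.h>

uint32_t random32(void); // assumed: uniform 32-bit random number

uint32_t java_like_bounded(uint32_t range) {
  uint32_t candidate = random32();
  uint32_t result = candidate % range;  // one division
  // reject candidates from the final, incomplete bucket of size (2^32 mod range)
  while (candidate - result > UINT32_MAX - range + 1) {
    candidate = random32();
    result = candidate % range;         // another division
  }
  return result;
}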

In an earlier blog post, I showed how to (mostly) get around divisions.

In general, no map from all 32-bit integers to a range can be perfectly fair (unless the size of the range divides 2^32). In practice, the effect is quite small unless your range is close to the maximal value of a 32-bit integer. Thus you can simply use the following function:

uint32_t random_bounded(uint32_t range) {
  uint64_t random32bit = random32(); // 32-bit random number
  uint64_t multiresult = random32bit * range;
  return multiresult >> 32;
}

Maybe you feel bad about introducing a slight bias. You probably should not since the random-number generation itself is unlikely to be perfect.

Still, we can correct the bias. Recall that some of the values are mapped ceil(4294967296/range) times whereas others are mapped floor(4294967296/range) times. By sometimes redrawing a new random value, we can avoid the bias entirely:

uint32_t random_bounded(uint32_t range) {
  uint64_t random32bit = random32(); // 32-bit random number
  uint64_t multiresult = random32bit * range;
  uint32_t leftover = (uint32_t) multiresult;
  if (leftover < range) {
    uint32_t threshold = -range % range; // 2^32 mod range
    while (leftover < threshold) {
      random32bit = random32();
      multiresult = random32bit * range;
      leftover = (uint32_t) multiresult;
    }
  }
  return multiresult >> 32;
}

This looks quite a bit worse, but the “if” clause containing divisions is very rarely taken. Your processor is likely to mostly ignore it, so the overhead of this new function is smaller than it appears.

So how do we fare? I have implemented these functions in C, using them to compute a random shuffle. Before each shuffle, I ensure that the array is in the cache. I report the number of clock cycles used per input word, on a recent Intel processor (Skylake). As usual, my code is available.

Random shuffle timings, varying the range function

range function                    cycles per input word
PCG library                       18.0
Go-like                           20.1
Java-like                         12.1
no division, no bias              7
no division (with slight bias)    6

Avoiding divisions makes the random shuffle run twice as fast.

Could we go faster? Yes, if we use a cheaper/faster random number generator. However, keep in mind that without SIMD instructions or multi-core processing, we cannot realistically hope to reach the lower bound of 2 cycles per input word. That is, I claim that no function can be 3 times faster than the fastest function we considered.

You can save a little bit (half a cycle per input word) if you replace the 32-bit PCG calls by 64-bit calls, processing input words in pairs. Using SIMD instructions, we could go even faster, but I do not have access to a SIMD-accelerated PCG implementation… We could, of course, revisit the problem with different random-number generators.
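To illustrate the pairing idea, here is one possible sketch (my own, not necessarily the exact code I benchmarked): split each 64-bit draw into two 32-bit halves and use the multiply-shift trick to pick two positions per draw, assuming i is at least 2 and fits in 32 bits:

// one step of the shuffle processing two positions (i-1 and i-2) per 64-bit draw
void shuffle_step_pair(uint32_t *array, uint32_t i, uint64_t rand64) {
  uint32_t r1 = (uint32_t) rand64;          // lower 32 bits
  uint32_t r2 = (uint32_t) (rand64 >> 32);  // upper 32 bits
  uint32_t p1 = (uint32_t) (((uint64_t) r1 * i) >> 32);       // in [0, i)
  swap(array + i - 1, array + p1);
  uint32_t p2 = (uint32_t) (((uint64_t) r2 * (i - 1)) >> 32); // in [0, i-1)
  swap(array + i - 2, array + p2);
}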

A fast alternative to the modulo reduction

Suppose you want to pick an integer at random in a set of N elements. Your computer has functions to generate random 32-bit integers; how do you transform such numbers into indexes smaller than N? Suppose you have a hash table with a capacity N. Again, you need to transform your hash values (typically 32-bit or 64-bit integers) down to an index smaller than N. Programmers often get around this problem by making sure that N is a power of two, but that is not always ideal.

We want a map that is as fair as possible for an arbitrary integer N. That is, ideally, we would want that there are exactly 2^32/N values mapped to each value in the range {0, 1, …, N – 1} when starting from all 2^32 32-bit integers.

Sadly, we cannot have a perfectly fair map if 2^32 is not divisible by N. But we can have the next best thing: we can require that there be either floor(2^32/N) or ceil(2^32/N) values mapped to each value in the range.

If N is small compared to 2^32, then this map could be considered as good as perfect.

The common solution is to do a modulo reduction: x mod N. (Since we are computer scientists, we define the modulo reduction to be the remainder of the division, unless otherwise stated.)

uint32_t reduce(uint32_t x, uint32_t N) {
  return x % N;
}

How can I tell that it is fair? Well. Let us just run through the values of x starting with 0. You should be able to see that the modulo reduction takes on the values 0, 1, …, N – 1, 0, 1, … as you increment x. Eventually, x arrives at its last value (2^32 – 1), at which point the cycle stops, leaving the values 0, 1, …, (2^32 – 1) mod N with ceil(2^32/N) occurrences, and the remaining values with floor(2^32/N) occurrences. It is a fair map with a bias for smaller values.

It works, but a modulo reduction involves a division, and divisions are expensive. Much more expensive than multiplications. A single 32-bit division on a recent x64 processor has a throughput of one instruction every six cycles with a latency of 26 cycles. In contrast, a multiplication has a throughput of one instruction every cycle and a latency of 3 cycles.

There are fancy tricks to “precompute” a modulo reduction so that it can be transformed into a couple of multiplications as well as a few other operations, as long as N is known ahead of time. Your compiler will make use of them if N is known at compile time. Otherwise, you can use a software library or work out your own formula.
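For example, when the divisor is known at compile time, you can leave the modulo in your code and let the compiler do the precomputation (a trivial illustration; the exact machine code depends on your compiler and target):

// with the divisor fixed at compile time, an optimizing compiler typically
// replaces this division by a multiplication and a few shifts
uint32_t mod_by_9(uint32_t x) {
  return x % 9;
}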

But it turns out that you can do even better! That is, there is an approach that is easy to implement, and provides just as good a map, without the same performance concerns.

Assume that x and N are 32-bit integers, and consider the 64-bit product x * N. You have that (x * N) div 2^32 is in the range, and it is a fair map.

uint32_t reduce(uint32_t x, uint32_t N) {
  return ((uint64_t) x * (uint64_t) N) >> 32 ;
}

Computing (x * N) div 2^32 is very fast on a 64-bit processor. It is a multiplication followed by a shift. On a recent Intel processor, I expect that it has a latency of about 4 cycles and a throughput of at least one call every 2 cycles.

So how fast is our map compared to a 32-bit modulo reduction?

To test it out, I have implemented a benchmark where you repeatedly access random indexes in an array of size N. The indexes are obtained either with a modulo reduction or with our approach. On a recent Intel processor (Skylake), I get the following number of CPU cycles per access:

modulo reduction   fast range
8.1                2.2

So it is four times faster! Not bad.

As usual, my code is freely available.

What can this be good for? Well… if you have been forcing your arrays and hash tables to have power-of-two capacities to avoid expensive divisions, you may be able to use the fast range map to support arbitrary capacities without too much of a performance penalty. You can also generate random numbers in a range faster, which matters if you have a very fast random number generator.
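For instance, a hash table with an arbitrary (not power-of-two) capacity could pick its bucket with the same map (a trivial sketch with hypothetical names):

// map a 32-bit hash value to a bucket index in [0, capacity) without a division
static inline uint32_t bucket_of(uint32_t hash, uint32_t capacity) {
  return (uint32_t) (((uint64_t) hash * (uint64_t) capacity) >> 32);
}

Note that this map is determined mostly by the most significant bits of the hash value, whereas the modulo reduction depends on the least significant bits, so you want a hash function with well-distributed high bits.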

So how can I tell that the map is fair?

By multiplying by N, we take integer values in the range [0, 2^32) and map them to multiples of N in [0, N * 2^32). By dividing by 2^32, we map all multiples of N in [0, 2^32) to 0, all multiples of N in [2^32, 2 * 2^32) to one, and so forth. To check that this is fair, we just need to count the number of multiples of N in intervals of length 2^32. This count must be either ceil(2^32/N) or floor(2^32/N).

Suppose that the first value in the interval is a multiple of N: that is clearly the scenario that maximizes the number of multiples in the interval. How many will we find? Exactly ceil(2^32/N). Indeed, if you draw sub-intervals of length N, then every complete sub-interval begins with a multiple of N and, if there is any remainder, there will be one extra multiple of N. In the worst-case scenario, the first multiple of N appears at position N – 1 in the interval. In that case, we get floor(2^32/N) multiples. To see why, again, draw sub-intervals of length N. Every complete sub-interval ends with a multiple of N.

This completes the proof that the map is fair.

For fun, we can be slightly more precise. We have argued that the number of multiples was maximized when a multiple of N appears at the very beginning of the interval of length 2^32. At the end, we get an incomplete sub-interval of length 2^32 mod N. If instead of having the first multiple of N appear at the very beginning of the interval, it appeared at index 2^32 mod N, then there would not be room for the incomplete sub-interval at the end. This means that whenever a multiple of N occurs before 2^32 mod N, then we shall have ceil(2^32/N) multiples, and otherwise we shall have floor(2^32/N) multiples.

Can we tell which outcomes occur with frequency floor(2^32/N) and which occur with frequency ceil(2^32/N)? Yes. Suppose we have an output value k. We need to find the location of the first multiple of N no smaller than k 2^32. This location, measured from the start of the interval, is ceil(k 2^32 / N) * N – k 2^32, which we just need to compare with 2^32 mod N. If it is smaller, then we have a count of ceil(2^32/N), otherwise we have a count of floor(2^32/N).
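If you prefer to convince yourself numerically, here is a small program of mine that checks the fairness claim exhaustively, using 8-bit words in place of 32-bit words (so the map becomes (x * N) div 2^8):

#include <stdio.h>
#include <stdint.h>

int main(void) {
  for (uint32_t N = 1; N <= 256; N++) {
    uint32_t counts[256] = {0};
    for (uint32_t x = 0; x < 256; x++) {
      counts[(x * N) >> 8]++; // the map, with 2^8 in place of 2^32
    }
    uint32_t lo = 256 / N;           // floor(2^8/N)
    uint32_t hi = (256 + N - 1) / N; // ceil(2^8/N)
    for (uint32_t k = 0; k < N; k++) {
      if (counts[k] != lo && counts[k] != hi) {
        printf("unfair for N = %u at output %u\n", N, k);
        return 1;
      }
    }
  }
  printf("the map is fair for every N from 1 to 256\n");
  return 0;
}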

(Update: I have made the proof more intuitive following a comment by Kendall Willets.)

I do not use a debugger

I learned to program with BASIC back when I was twelve. I would write elaborate programs and run them. Invariably, they would surprise me by failing to do what I expected. I would struggle for a time, but I’d eventually give up and just accept that whatever “bugs” I had created were there to stay.

It would take me a decade to learn how to produce reliable and useful software. To this day, I still get angry with people who take it for granted that software should do what they expect without fail.

In any case, I eventually graduated to something better: Turbo Pascal. Turbo Pascal was a great programming language coupled with a fantastic programming environment that is comparable, in many ways, to modern integrated development environments (IDEs). Yet it is three decades old. It had something impressive: you could use a “debugger”. What this means is that you could run through the program, line by line, watching what happened to variables. You could set breakpoints where the program would halt and give you control.

At the time, I thought that programming with a debugger was the future.

Decades later, I program in various languages, C, JavaScript, Go, Java, C++, Python… and I almost never use a debugger. I use fancy tools, and I certainly do use tools that are called debuggers (like gdb), but I almost never step through my programs line-by-line watching variable values. I almost never set breakpoints. I say “almost” because there are cases where a debugger is the right tool, mostly on simple or quick-and-dirty projects, or in contexts where my brain is overwhelmed because I do not fully master the language or the code. This being said, I do not recall the last time I used a debugger as a debugger to step through the code. I have a vague recollection of doing so to debug a dirty piece of JavaScript.

I am not alone. In five minutes, I was able to find several famous programmers who took positions against debuggers or who reported barely using them.

I should make it clear that I do not think that there is one objective truth regarding tools. It is true that our tools shape us, but there is a complex set of interactions between how you work, what you do, who you work with, what other tools you are using and so forth. Whatever works for you might be best.

However, the fact that Linus Torvalds, who is in charge of a critical piece of our infrastructure made of 15 million lines of code (the Linux kernel), does not use a debugger tells us something about debuggers.

Anyhow, so why did I stop using debuggers?

Debuggers were conceived in an era where we worked on moderately small projects, with simple processors (no threads, no out-of-order execution), simple compilers, relatively small problems and no formal testing.

For what I do, I feel that debuggers do not scale. There is only so much time in life. You either write code, or you do something else, like running line-by-line through your code. Doing “something else” means (1) rethinking your code so that it is easier to maintain or less buggy (2) adding smarter tests so that, in the future, bugs are readily identified effortlessly. Investing your time in this manner makes your code better in a lasting manner… whereas debugging your code line-by-line fixes one tiny problem without improving your process or your future diagnostics. The larger and more complex the project gets, the less useful the debugger gets. Will your debugger scale to hundreds of processors and terabytes of data, with trillions of closely related instructions? I’d rather not take the risk.

My ultimate goal when working on a difficult project is that when problems arise, as they always do, it should require almost no effort to pinpoint and fix them. Relying on a debugger as your first line of defense can be a bad investment; you should always try to improve the code first.

Rob Pike (one of the authors of the Go language) once came to a similar conclusion:

If you dive into the bug, you tend to fix the local issue in the code, but if you think about the bug first, how the bug came to be, you often find and correct a higher-level problem in the code that will improve the design and prevent further bugs.

I don’t want to be misunderstood, however. We need to use tools, better tools… so that we can program ever more sophisticated software. However, running through the code line-by-line checking the values of your variables is no way to scale up in complexity and it encourages the wrong kind of designs.

How fast is tabulation-based hashing? The downsides of Zobrist…

In practice, hashing is the process of taking an input, such as a string, and turning it into an integer value. It is a fundamental tool in programming, as most software relies on hashing in one way or another. We often expect hashing to appear “random” so that any two strings are unlikely to have the same hash value.

One of the earliest and simplest forms of hashing is Zobrist hashing. Suppose you want to hash strings of up to 16 characters. For each of the 16 character positions, you generate a table made of 256 randomly chosen words. To hash a given string, you pick the first character, look up the corresponding chosen word in the first table, then you pick the second character and look up the word in the second table… and you XOR the retrieved words. In C, you get something like this:

uint64_t h = 0; // hash accumulator
for (size_t i = 0; i < length; i++)
      h ^= hashtab[i][(unsigned char) s[i]];
return h;

Zobrist hashing is an instance of a more general class of hashing said to be tabulation-based.

The benefits of Zobrist are obvious. It is simple. It can be implemented in seconds. It has excellent theoretical properties and is generally very convenient.

The first downside is also obvious: Zobrist hashing uses a lot of memory. To be able to hash all 4-byte strings to 64-bit values, you need 8 KB (4 positions × 256 entries × 8 bytes). But let us ignore this obvious downside.

State-of-the-art universal hash functions such as CLHash can process over 10 bytes per cycle on a recent Intel processor. The good old CityHash can process over 4 bytes per cycle.

How fast can Zobrist hashing be? To get a better idea, I implemented Zobrist hashing as a small C library. How fast is it? Not very fast. I can barely process 0.65 bytes per cycle when repeatedly hashing the same short string, taking into account the function-call overhead. Thus, tabulation-based hashing is probably about an order of magnitude slower than commonly used fast hash functions, assuming you can avoid cache misses.

Software reference. zobristhashing: Zobrist hashing in C

The strange case of the copyright of open-source software

Economists make a grave mistake when they fail to mention open-source software as one of the critical innovations of our era. Open-source software offers a great reference for the type of innovation we will get in the post-industrial era. To economists, open-source software does not matter because it does not count very much toward the GDP… but I think it is a critical hidden variable.

The Internet runs largely on open-source software, from its servers (Apache) to its operating systems (Linux). The current smartphone war is being won by Android, an open-source system that uses the open-source Linux kernel. Cloud computing is largely a matter of open-source software, again based on Linux.

In our current legal system, we can own information through the concept of copyright. You would think that who owns this software would be a critical issue.

So who does? In many instances, it is one company, such as Oracle and Google. Or sometimes it might be a particular university. But in a lot of important cases, the copyright is simply owned by whoever wrote it (which might include companies):

  • Linux is copyrighted by “Linus Torvalds and others who actually wrote it”.
  • Most web browsers (Chrome, Opera, Safari) are based on WebKit. Who owns WebKit? Its authors. WebKit, like Linux, started out as a one-man project (KHTML by Lars Knoll).

Why is it worth noting?

Many would assume that critical pieces of software should be owned and controlled by a large company, a reputed university or another impressive organization. For end-user software, that is often the case. But much of the underlying software is written by regular programmers without much fanfare… It is put together without permission.

My theory is that open-source software is an invisible college for programmers. Software programmers have an impact that is out of proportion with how much they get paid (and they get paid well enough) in part because of all the brilliant processes they have invented. Programmers are masters at collaboration. And open-source software is one of these ingredients.

Many people, including many professional programmers, remain unaware… they often work for a corporation, and deal with other programmers working for other serious-sounding companies… they never stop to consider that much of the underlying infrastructure is built through an invisible college.

I asked a few people who owned Linux. Many people told me that it was the Linux Foundation. That’s false. I asked how open Apple was regarding its browser software, and most people told me that Apple was secretive and that all their software was built in-house in a proprietary fashion. That’s false.

All these computer companies along with a good fraction of the very best independent programmers take part in the invisible college that is open-source software. This invisible college is not based on authority. Prestige buys you little good will. Contributions are what matters in the end. And the copyright ownership is a reflection of this underlying culture.

To be smart, work on problems you care about

Never do anything that bores you. My experience in science is that someone is always telling you to do something that leaves you flat. Bad idea. I’m not good enough to do something I dislike. In fact, I find it hard enough to do something that I like. (James Watson)

A high school student who is preparing for college recently asked me a few questions by email. He wants to study computer science. I thought that his email was great so I will reply publicly. Let us call him John.

John says… I intend to major in Computer Science (though I’ve also considered majoring and/or minoring in neuroscience or math, of some sort). So, I’ll obviously be taking math courses. Now, for most of my life, I’ve pretty much hated math. I had a poor early educational experience in general and came to despise math, even though I typically got good grades.

I don’t think we should be surprised by John’s feelings about mathematics. We spend a great deal of time teaching stupid things like how to divide 46712 by 13 using a long division, with a pen and some paper. Or adding fractions like 3/5 and 7/9. That would be like getting students to take 5 years of Microsoft Office classes and then be surprised that they hate “computer science”.

How do you approach learning new math concepts? And, in particular, how do you understand formulas? (…) I feel like I am probably missing the bigger picture somehow, and failing to gain a true understanding of math.

I work from examples that I care about. In fact, I avoid learning new mathematical concepts for their own sake. It is a matter of motivation. Having a practical problem I care about activates my intelligence. Boredom makes me dumb.

Have you learned “new math” since obtaining your Ph.D.? When you go to work every day, do you gain new insights and learn things about math, like, say, a programmer would?

I do not use a lot of mathematics as a computer scientist beyond the basics: set theory, Boolean logic, elementary probability theory, some descriptive statistics, functions and relations, linear algebra. You can go a very long way with this small amount of mathematics.

In my experience, mathematics suffers from a diminishing return effect. The basics are very, very useful… but more advanced mathematics often turns into solutions seeking problems. It is best to learn the most advanced mathematics as needed.

John says… A lot of math involves calculations, but I’ve begun to realize that calculating isn’t really what math is about. It seems math is a far more abstract topic and calculating is just an “unfortunate” chore that must be done before having the real fun. Problem is, up to now I’ve always thought that the calculations were the math, and that was all there was to it. How do you abstract away from the rote number crunching and gain an intuitive grasp of math concepts? Must one be good at quick mental calculations and basic math to truly understand and enjoy math?

Fast and accurate number crunching might be useful to impress teachers. But computers are about a billion times better than the best human being could ever be.

John says… How do you learn new math? When you do, have you always found it necessary to work through lengthy sets of problems involving the new concept? If so, how do you learn math outside of textbooks and college courses when problem sets are not present? Have you always just learned from college courses? Do you find math as presented in textbooks dry? Is there, perhaps, another, better way to learn math?

I got rid of my books several years ago. I am a university professor, but if you come to my office, you will find no textbook. For some classes I teach, I expect the students to get a textbook, but that’s mostly to satisfy students’ expectations.

Solving problems is definitely the best way to learn mathematics. For most topics, there are thousands of great problems online. There are also great YouTube videos and so forth.

It is not that textbooks are bad… but learning mathematics from a handful of textbooks is like trying to learn about love by reading a handful of romance novels. You need to get out there, find your own references.

Obviously, you have a Ph.D. in math. If you could do things over again, would you get the Ph.D., or not? Would you major in another field?

Degrees, especially advanced degrees, are overrated. You sometimes need degrees to get through the door. I bet that it is a lot easier to get an interview with Google if you have a computer science degree. If you want to get a research job, the Ph.D. makes things easier. But, the degree itself is a small thing.

Think of it as being clean. It is hard to get a job if you are dirty. But if you clean yourself up, you may still never get the job you want. The fact is that it is hard to get good jobs, even with a computer-science degree.

Whether you have a college degree or not, only about a third of all workers are enthusiastic about their work. A degree gives you access, potentially, to higher paying jobs… but it won’t make your career.

So don’t go just do a degree. Do something cool and fun on the side. Build a mobile app. Build a VR mini-game. Design a website. Build a new AI. Find something you like doing and get good at it. It comes down to the same idea: work on examples you care about.

Would I still get a Ph.D.? That’s a bad question. A better question: Do I advise students to get PhDs? No. The rule is that you should only get a Ph.D. if you cannot imagine living without one. Would I discourage my own boys from getting a Ph.D.? Yes. Do I encourage them to think about college? No. I repeatedly, tenaciously, discourage my kids from focusing on diplomas and degrees. I keep telling them not to wait for the teachers. Do cool stuff now, on your own.

A degree can be a useful tool, but that’s all it is. If that’s all you have, you have no career. You are at a high risk of ending up in a dead-end job if you get a job at all. There are more boring jobs out there than you know.

Do you have any suggestions or advice for a soon-to-be college student? I’m a pretty solid introvert, and mostly keep to myself – so I’m not looking for the “Don’t party, focus on your studies” advice :). I’m more looking for intellectual sort of advice. Perhaps suggestions on how to get more out of college than I otherwise might?

  • Courses, professors, and grades matter less than you might think. Make sure you get the required grade. Make sure you take the right course, but don’t worry more than needed about it. In the real world, employers don’t care whether you aced your math courses.
  • The people you interact with (and that’s not necessarily the professors) matter a great deal. If you are studying a topic, try to meet people, either on campus or online, who are interested in this topic.
  • Don’t focus on the degree and the courses. Build strong side interests. Acquire a demonstrable expertise. Do something that people might care about and then tell the world. Remember that this is how Linus Torvalds started. It is how the web started. Who cares whether Linus got a good grade in math? Just do stuff, and do not wait passively for your teachers to tell you what to do. Again, do projects you care about; you are going to learn a lot more than you could otherwise. And learn to tell the world about what you do. Communicating your ideas is just as important as doing the cool work. Fanciful mathematics should be far down the list of priorities unless that is your thing.

What if you are not interested in anything, and you prefer to just go to your classes and maximize time spent in front of the TV? If you don’t know what to do, and you prefer being told what to do, that’s fine. Just don’t complain later when people pick your work for you.

Enough with the bogus medical studies!

Every week, we hear about how eating such and such food gives cancer, or how working out can save you from a heart attack. If you have been reading these studies with attention and changing your life to follow their recommendations… you have been wasting your time and you have possibly harmed yourself.

Feynman famously described a common phenomenon which he called cargo-cult science. True science is a fantastic tool that has taken us, in a couple of centuries, from superstitious fools to a highly advanced civilization. Almost all major advances in the last two centuries can be linked with science. So people want to appear to be doing good science… but they are much less concerned about actually doing good science, when they even know what it is.

What is science? As Feynman wrote, it is the belief in the ignorance of experts. It is the idea that our intuition, our good sense… are not enough to tackle the world. We need objective measures. We often need mathematics. We often need hard logic. We need repeatable experiments capable of falsifying theories.

We need these things because we are easily fooled. Science acts as some sort of “software update” for the brain. You train your brain to rely on objective, verifiable facts, and hard logic… instead of impressions.

One of the core ideas behind science is that authority-based arguments should be rejected. So if Professor X from Harvard University or MIT says something… it is not scientific to believe her because she is from Harvard, or because she is a professor. It is similarly unscientific to believe something because it has appeared in a prestigious journal. It is not even scientific to believe something because of a “consensus”. (Hint: there was a widespread consensus that the aeroplane was “impossible” before the Wright brothers came along.)

Let me take this further. Believing something because impressive people say it must be so is the illustration of pre-scientific thought. And this means that every time you hear that a study by impressive people proved something new, you should doubt… you should doubt automatically, without even thinking.

How do we determine what is true? That’s one of the core lessons of the scientific revolution: it is much harder than you might expect. We are feeble creatures that very easily fall prey to bogus arguments. We must relentlessly seek incontrovertible evidence. There is no shortcut.

The real question is not how to make your mind about what is true… that is easy, your brain will eagerly accept many things as true given a chance. The hard part is to keep on doubting. We are just not wired for constant doubt, we are “believers”.

Many people do not really understand what science is… and this includes fully trained “scientists” who receive government research grants and train other scientists. What they do understand, however, is how to speak and act like a scientist. It looks like science, but it does not give the same result. That’s what Feynman calls cargo-cult science.

When it comes to medicine, we invest a lot of money in medical research, several times more than in any other type of research. Moreover, lots of real people can end up suffering needlessly if the research is wrong.

Is medical research wrong? Yes. But that’s not the most tragic part. The real problem is that much of it is pure cargo-cult science. And well funded at that. Cargo-cult science does not merely lead to the wrong ideas… it actively sets us back.

What are the problems?

  • Most work is based on observational studies

    Just watching and taking notes is not science. It is valid scholarship, as long as you report it as such. But it gets reported as “science” instead which is misleading.

    There is nothing wrong with historians who tell us that Vikings once traveled to North America. It is interesting. It is scholarship. But it is not science. It is an observation.

    Don’t get me wrong. Scholarship is precious. We know, for example, that women live longer than men. Cataloging such facts can help us… but it is not science. Our ancestors knew exactly where the stars were. They knew a lot about the seasons. Their high priests observed, cataloged… but they did not know about science.

    When you hear that exercise has been shown to reduce your risk of cancer… that’s (usually) a flat out lie. What actually happens is that scientists take a group of people. Then they ask them whether they exercise. Then they ask whether they have cancer. And then they run the statistics. “Oh! See, the people who say they exercise also less often say that they have cancer.” So far so good, but you have merely observed, you have not done any science. You have done nothing that is above what scholars routinely did in the pre-scientific era.

    All you can show with observational studies is an observation. Nothing else. It can be a lot, but it is not science.

    It could be that when Joe complains about his knee, it rains. That does not show that complaining about one’s knee brings the rain. Nor does it show that the rain brings pain to one’s knee. It is an observation, not science.

  • Small effects are greatly exaggerated

    Even when scientists actually do science… they do not use the proper tools to avoid being fooled or fooling others. In fact, they often appear to seek to fool us.

    Suppose that you are treating patients for cancer. The average survival rate is 9 months. All patients die within 3 years. You give them some pill and the survival rate increases to 10 months, but all patients still die within 3 years. It is progress, but a small progress. If the pill costs 100k$ and makes you twist in pain for 3 weeks, maybe it is no progress at all.

    However, that is not how most studies will present it. They will choose a short window, say a year. Maybe 60% of all patients die within a year. However, with the extra month offered by the pill, only 40% of all patients die within the year. Then the study will conclude that the survival rate has been increased by 50%: one-year survival went from 40% to 60% of patients, and 20 is half of 40. I kid you not.

  • They report selective effects

    Without even looking at the numbers, I can tell you that people with high-risk occupations have a lower cancer mortality. Why? Because they die younger. So, there you go… I got a cancer prevention technique: choose a high-risk occupation. Of course, this reasoning is silly. But that’s often how medical studies are presented.

  • So how do you stay cancer free? You will hear things like “eating well, exercise, keeping your weight low, eating broccoli, avoiding red meat, getting tested repeatedly”. But if you actually look at the numbers, there are mostly two factors that have an effect worth noticing on your risk of having cancer: your age and whether you smoke. Your risk of having cancer grows exponentially with age. If eating bacon increases your cancer risk, it is by a tiny amount. What about early diagnosis? Again, a nice idea, but, in general, it is a waste of time to get regularly tested for cancer. You won’t live an extra day.

    What about heart attacks? Age and gender are basically the things you need to watch, assuming you are not smoking. Again, your risk grows exponentially with age. Also, men are three times more likely to have a heart attack than women. Why? We don’t know. Smoking is often considered a really bad habit. It is true that it doubles your risk of cardiovascular diseases. However, aging from 40 to 90 multiplies your risk of cardiovascular diseases by 100.

    So, how do you stay healthy? By staying young. Sadly, we don’t know how to prevent aging. Yet.

    Demonizing hamburgers is a dangerous distraction.

    Note: I am not a medical professional. Consult your doctor if you have questions. While you are at it, ask her if she knows about cargo-cult science.

The surprising cleverness of modern compilers

I wanted to know how a modern C compiler like clang would process the following C code:

#include <stdint.h>
int count(uint64_t x) {
  int v = 0;
  while(x != 0) {
    x &= x - 1; // clear the least significant set bit
    v++;
  }
  return v;
}

Can you guess?

popcntq	%rdi, %rax

That is right. A fairly sophisticated C function, one that might puzzle many naive programmers compiles down to a single instruction. (Tested with clang 3.8 using -O3 -march=native on a recent x64 processor.)
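If you want to reproduce this on your own machine, save the function in a file (say, count.c, a name I made up) and ask clang for the assembly; the exact listing will vary with the compiler version and target:

clang -O3 -march=native -S count.c   # writes the generated assembly to count.s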

What does that mean? It means that C is a high-level language. It is not “down to the metal”. It might have been back when compilers were happy to just translate C into correct binary code… but these days are gone. One consequence of the cleverness of our compilers is that it gets hard to benchmark “algorithms”.

In any case, it is another example of externalized intelligence. Most people, most psychologists, assume that intelligence is what happens in our brain. We test people’s intelligence in room, disconnected from the Internet, with only a pencil. But my tools should get as much or even more credit than my brain for most of my achievements. Left alone in a room with a pencil, I’d be a mediocre programmer, a mediocre scientist. I’d be no programmer at all. And this is good news. It is hard to expand or repair the brain, but we have a knack for building better tools.

We are passing the Turing test right on schedule

In 1950, the brilliant computing pioneer Alan Turing made the following prediction in his paper Computing Machinery and Intelligence:

I believe that in about fifty years’ time it will be possible, to programme computers, with a storage capacity of about 10^9, to make them play the imitation game so well that an average interrogator will not have more than 70 per cent chance of making the right identification after five minutes of questioning. The original question, “Can machines think?” I believe to be too meaningless to deserve discussion. Nevertheless I believe that at the end of the century the use of words and general educated opinion will have altered so much that one will be able to speak of machines thinking without expecting to be contradicted.

We are now well past Turing’s 50-year mark, but I think it is fair to say, at least, that the storage capacity prediction has been reached. Turing used bits (or “binary digits”) as units of storage, so 10^9 bits is roughly 125 MB. That must have sounded like an enormous storage capacity in 1950. But today, all smartphones have a lot more storage than that. In fact, it seems that Turing somewhat underestimated storage capacities. He reported the brain to have a storage capacity of less than 10^15 bits, or roughly 125 TB, when today’s estimates put the brain’s storage capacity at 10 times this amount. To be fair to Turing, storage bits were tremendously precious in 1950, so programmers used them with care. Today we waste bits without second thoughts. Many programs that require gigabytes of memory could make do with 100 MB if they were carefully engineered. And even today, we know too little about the engineering of our brain to appreciate its memory usage.

As for the last part, where Turing says that people will accept that machines think, if you listen to people talk, they will routinely refer to software as “thinking”. Your mobile phone thinks you should turn left at the next corner. Netflix thinks I will like this movie. And so forth. Philosophers will still object that machines cannot think, but who listens to them?

We call our phones “smart”, don’t we?

What about fooling people into thinking that the machine is human? I think that Alan Turing, as an observer of our time, would have no doubt that this prediction has come to pass.

In 2014, a computer managed to pass for a 13-year-old boy and fool 33% of the judges. But it could be dismissed as an anecdote. However, recently, Ashok Goel, a professor of computer science, used IBM Watson’s technology to create a teaching assistant called Jill Watson. The assistant apparently fooled the students. Quoting from the New York Times:

One day in January, Eric Wilson dashed off a message to the teaching assistants for an online course at the Georgia Institute of Technology. “I really feel like I missed the mark in giving the correct amount of feedback,” he wrote, pleading to revise an assignment. Thirteen minutes later, the TA responded. “Unfortunately, there is not a way to edit submitted feedback,” wrote Jill Watson, one of nine assistants for the 300-plus students. Last week, Mr. Wilson found out he had been seeking guidance from a computer. “She was the person—well, the teaching assistant—who would remind us of due dates and post questions in the middle of the week to spark conversations,” said student Jennifer Gavin. “It seemed very much like a normal conversation with a human being,” Ms. Gavin said. Shreyas Vidyarthi, another student, ascribed human attributes to the TA—imagining her as a friendly Caucasian 20-something on her way to a Ph.D. Students were told of their guinea-pig status last month. “I was flabbergasted,” said Mr. Vidyarthi.

So Turing was right. We are about 15 years after his 50-year mark, but a 30% error margin when predicting the future is surely acceptable.

Let us reflect on how Turing concluded his 1950 article…

We may hope that machines will eventually compete with men in all purely intellectual fields. But which are the best ones to start with? Even this is a difficult decision. Many people think that a very abstract activity, like the playing of chess, would be best. It can also be maintained that it is best to provide the machine with the best sense organs that money can buy, and then teach it to understand and speak English.

Computers have definitively surpassed the best human players at chess since 1997. Our brains are now obsolete as far as the game of chess is concerned. Computers can speak English quite well. They understand us, to a point.

All in all, Turing was nearly prescient in how he imagined the beginning of the XXIst century.

Professors intentionally slow down science to make themselves look better

Recently, the president of the United States announced a big anti-cancer initiative, to be headed by his vice-president Joe Biden. Will it be fruitful? Maybe. But an important obstacle has become clear. Researchers are glad to receive funds from the government, but they do not want to share back data, software, and raw results. They are willing to publish, but only as a way to claim credit, not as a way to open their laboratories to the world. This means that years can be lost while others try to reproduce research results from partial information.

Though American medical researchers are required to post results from clinical trials on clinicaltrials.gov, it takes only a few minutes to find out that the information is not available: only about a tenth of completed studies report results a year after the study is completed. The results of most government-funded clinical trials are not shared, and when they are, the information is often limited. Chen et al. recently concluded:

Despite the ethical mandate and expressed values and mission of academic institutions, there is poor performance and noticeable variation in the dissemination of clinical trial results across leading academic medical centers.

This is far from limited to medicine. If you are trying to access some academic research results, and you are failing to get access to the data and the software, it is most likely by design.

If you ask representatives from the research community, they might tell you that there is too little funding, so academic researchers must compete fiercely for positions. If governments only gave a bit more money, people would be more open to share. But if this theory held, then established researchers who have jobs for life and a healthy track record of securing large grants should be more open to sharing than the post-docs who are hunting for a job. I have never heard of any such documented correlation. And given that the established researchers set the pace in science, it is unlikely that they are advocates of openness. In any case, according to this theory, we would see variations in how willing people are to share according to the level of government funding… and, again, no such correlation was ever established.

The other story is that if they were to share their data and software, others could benefit while they would get, at best, partial credit. That story is closer to the truth.

The real reason the Harvard medical researcher does not fully open his lab to the Stanford medical researcher is that he is afraid that the Stanford medical researcher will go back home and use the data to achieve a breakthrough.

To put it in clear terms, if you are an academic researcher, and you have data or software that could be useful to others… that could enable them to cure cancer, get trees to absorb CO2 faster, or crack AI… you’d rather that they do not do these things as they would cast a shadow on your own accomplishments.

But, surely, academic researchers are a lot more open than people from industry? Not so.

“trials by industry were 3 times more likely to report results than were trials funded by the NIH [government funding agency]” (Law et al.)

The problem is that academic researchers are overly obsessed with their own personal social status. It is a stronger pull than the desire to advance science, the public good or even commercial interests.

Some people then object that it is not an obsession with their own social status, but rather a lack of incentives. Reward the researchers with more social status if they share and they will, says this theory. Of course, if you think it through, this objection is nothing more than a rewording of my own explanation: researchers tend to only think about their own social status and everything else is secondary.

Engineering the system so that it gives more social status to people who share is hard and open to gaming.

Instead, I think it is time that we get busy mocking academic researchers. Someone needs to go through clinicaltrials.gov, find all the government-funded studies that failed to report results, and openly shame the offenders.

And this would only be the start. We need to take down academia’s ego.

Next, someone needs to start issuing data, software or details requests to various government-funded laboratories. When the information is not forthcoming, we should openly report the status-obsessed professors.

Eventually, if there is enough shaming, the professors will have no choice but to share so as to protect their status.

Further reading: Michael Nielsen, Reinventing Discovery: The New Era of Networked Science, 2011.