Don’t underestimate the nerds

I’m a little nerdy. According to my wife, I even look like a nerd. I am not very big. I have a long resume posted online, and I’ll proudly post my follower count, but if you meet me in person, I am unlikely to come across as “impressive”. I don’t talk using “big words”. I have been told that I lack “vision”. Given a choice between spending time with powerful people getting their attention, and reading a science article… I will always go for the latter.

I’m not at all modest, but I am humble. I get most things wrong, and I will gladly advertise my failures.

I’m lucky in that I have a few like-minded colleagues. I have a colleague, let us call her “Hass”. She gave us a talk about power laws. (The mathematical kind.) Who spends their lunchtime talking about power laws and probabilistic distributions?

We do.

However, if you have been deep down in the bowels of academia… you will find another animal. You have “political professors” whose main game is to achieve high status in the most visible manner. Academia rewards this kind of behavior. If you can convince others that you are important, well regarded and that you do great work for humanity, you will receive lavish support. It makes sense given what business schools are into: delivering prestige.

If you visit a campus, you might be surprised at how often computing labs are empty, no professor to be found. Because of who I am, I would never ask for space unless I really needed it. But, see, that’s not how political animals think… to them, having space is a matter of status.

Nerds are, at best, part-time political animals. It would seem that nerds are weak. Are they?

My view is that nerds are almost a different species. Or, at least, a subspecies. They do signal strength, but not by having a luxurious car, a big house, a big office, a big title.

I remember meeting with the CEO of a company that was doing well. The CEO kept signaling to me. He talked endlessly about his prestigious new car. He was sharply dressed in what was obviously a very expensive suit. He kept telling me about how many millions they were making. Yet we were in my small office, in a state university. He kept on signaling… and you know how I felt in the end? Maybe he expected me to feel inferior to him. Yet I lost interest in anything he had to tell me. He wanted me to review some technology for them, but I discouraged him.

Big titles, displays of money… those do not impress me. If you signal strength through money alone, I’m more likely to pity you.

If Linus Torvalds were to meet Bill Gates, you think that Linus would be under Bill in the nerdom hierarchy? I doubt it. I have no idea how much money Linus has, and the fact that nobody cares should be a clue.

What did my colleague Hass do? She came and presented a kick-ass nerdy presentation. The kind of stuff you cannot make up if you don’t know what you are talking about. She displayed strength, strength that I recognize. I think everyone in the room saw it. Yet she did not wear expensive clothes and she did not advertise big titles.

My wife recently taught me how to recognize signaling between cats. You could live all your life with cats and never realize how they broadcast signals of strength.

It is a mistake to think that the introverted nerds are weak. This is a very common mistake. I once bought a well-rated book on introverts, written by an extrovert. The whole book was about how introverts should face their fears. The author clearly thought that we were weak, in need of help somehow.

You are making a mistake if you think that my colleague Hass is weak. She could kick your nerd ass anytime.

Science and Technology links (February 2nd, 2018)

  1. Most mammals, including human beings, age according to a Gompertz curve. It is a fancy way of saying that your risk of death goes up exponentially with age (see the formula sketched after this list). Naked mole rats are mammals that do not age, in the following sense:

    unlike all other mammals studied to date, and regardless of sex or breeding-status, the age-specific hazard of mortality did not increase with age, even at ages 25-fold past their time to reproductive maturity

  2. It seems that the brain of male infants differs from that of female infants:

    We show that brain volumes undergo age-related changes during the first month of life, with the corresponding patterns of regional asymmetry and sexual dimorphism. Specifically, males have larger total brain volume and volumes differ by sex in regionally specific brain regions, after correcting for total brain volume.

  3. The American National Institutes of Health are launching a major research program in genome editing ($190 million over six years).
  4. It appears that many of us are deficient in magnesium, and that’s an important driver for cardiovascular diseases. Most of us will die of a cardiovascular disease (given current medical knowledge).
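
In formula form (this is just the textbook Gompertz law of mortality, not a formula taken from the article): the hazard of death at age t is h(t) = a e^{bt} with a, b > 0, so the mortality rate itself grows exponentially with age. The naked mole rat result says that, for them, the hazard looks roughly constant with age instead.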

Picking distinct numbers at random: benchmarking a brilliant algorithm (JavaScript edition)

Suppose you want to choose m distinct integers at random within some interval [0, n). How would you do it quickly?

I have a blog post on this topic dating back to 2013. This week I came across Adrian Colyer’s article where he presents a very elegant algorithm to solve this problem, attributed to Floyd by Bentley. The algorithm was presented in an article entitled “A sample of brilliance” in 1987.

Adrian benchmarks the brilliant algorithm and finds it to be very fast. I decided to revisit Adrian’s work. Like Adrian, I used JavaScript.

The simplest piece of code to solve this problem is a single loop…

let s = new Set();
while(s.size < m) {
  s.add(randInt(n)); // randInt(n) returns a random integer in [0, n) (defined elsewhere)
}

The algorithm is “non-deterministic” in the sense that you will generally loop more than m times to select m distinct integers.
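
How much more than m times? A standard calculation (not spelled out in the original posts) gives the expected number of iterations: once i distinct values have been selected, it takes n/(n − i) attempts on average to get a new one, so the expected total is

n/n + n/(n − 1) + … + n/(n − m + 1) = n (H_n − H_{n−m}),

where H_k denotes the k-th harmonic number. When m is much smaller than n, this is barely more than m; when m approaches n, it grows like n ln n, which helps explain why the naive code does so poorly in the last row of the benchmark table below.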

The brilliant algorithm is slightly more complicated, but it always loops exactly m times:

let s = new Set();
for (let j = n - m; j < n; j++) {
        const t = randInt(j + 1); // random integer in [0, j]
        s.add( s.has(t) ? j : t ); // if t was already selected, select j instead
}

It may seem mysterious, but it is actually an intuitive algorithm, as Adrian explains in his original article.

It seems like the second algorithm is much better and should be faster. But how much better is it?

Before I present my results, let me port my 2013 algorithm over to JavaScript. Firstly, we introduce a function that can generate the answer using a bitset instead of a generic JavaScript Set.

function sampleBitmap(m, n) {
   var s = new FastBitSet();
   var cardinality = 0;
   while(cardinality < m) {
      // checkedAdd returns 1 when the value was not already present, 0 otherwise
      cardinality += s.checkedAdd(randInt(n));
   }
   return s;
}

Bitsets can be much faster than generic sets; see my post JavaScript and fast data structures.

Secondly, consider the fact that when you need to generate more than half of the values (m > n/2) in the range [0,n), you can, instead, generate the n – m values you do not want, and then negate the result:

function negate(s, n) {
  var news = new FastBitSet()
  let i = 0
  // add every value in [0, n) that is not in s
  s.forEach(j => {
    while (i < j) { news.add(i); i++ }
    i = j + 1
  })
  while (i < n) { news.add(i); i++ }
  return news
}

My complete algorithm is as follows:

function fastsampleS(m, n) {
    if(m > n / 2 ) {
      // it is cheaper to pick the values we do not want, then negate
      let negatedanswer = fastsampleS(n-m, n)
      return negate(negatedanswer, n)
    }
    if(m * 1024 > n) {
      // dense selection: use a bitset
      return sampleBitmap(m, n)
    }
    // otherwise fall back on the Set-based code
    return sampleS(m, n)
}

So we have three algorithms, a naive algorithm, a brilliant algorithm, and my own (fast) version. How do they compare?

m          n          naive          brilliant      my algo
10,000     1,000,000  1,200 ops/sec  1,000 ops/sec  4,000 ops/sec
100,000    1,000,000  96 ops/sec     80 ops/sec     700 ops/sec
500,000    1,000,000  14 ops/sec     14 ops/sec     120 ops/sec
750,000    1,000,000  6 ops/sec      8 ops/sec      80 ops/sec
1,000,000  1,000,000  0.4 ops/sec    5 ops/sec      200 ops/sec

So the brilliant algorithm does not fare better than the naive algorithm (in my tests), except when you need to select more than half of the values in the interval. However, in that case, you should probably optimize the problem by selecting the values you do not want to pick.

My fast bitset-based algorithm is about an order of magnitude faster. It relies on the FastBitSet.js library.

My complete source code is available.
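
As an aside, ops/sec numbers like the ones above are typically produced, in JavaScript, with the benchmark.js library. Here is a minimal sketch of such a harness; it is not the exact code behind the table, and the names sampleNaive and sampleS are stand-ins for the functions defined in the linked source code:

const Benchmark = require("benchmark");

const m = 10000;
const n = 1000000;
new Benchmark.Suite()
  .add("naive", () => sampleNaive(m, n))    // stand-in name for the Set-based loop above
  .add("brilliant", () => sampleS(m, n))    // stand-in name for the brilliant algorithm
  .add("my algo", () => fastsampleS(m, n))
  .on("cycle", (event) => console.log(String(event.target)))  // prints e.g. "naive x 1,200 ops/sec"
  .run();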

Science and Technology links (January 26th, 2018)

  1. We have reached “peak coal” meaning that coal usage is going to diminish in the coming years.
  2. McGill professor Ishiang Shih has been accused by the US government of leaking chip designs to the Chinese government. The professor runs a business called JYS Technologies. This sounds impressive and mysterious until you check out the company web site and its headquarters. If professor Ishiang Shih is a spy, he is clearly not in the James-Bond league.

    Anyhow, I thought it was interesting that the US would worry about China having access to chip designs. China is the world’s largest consumer of computer chips, but it still produces few of them, relying on imports instead. Obviously, the Chinese government would like to start making its own chips. Soon.

  3. Though we can increase networking bandwidth, we are unlikely to improve network latency in the future, because nothing can go faster than the speed of light. This means that we are hitting the physical limits of how quickly web sites can acknowledge your requests.

    (…) current network (Internet) latencies are here to stay, because they are already within a fairly small factor of what is possible under known physics, and getting much closer to that limit – say, another 2x gain – requires heroics of civil and network engineering as well as massive capital expenditures that are very unlikely to be used for general internet links in the foreseeable future.

    This would have impressed Einstein, I think.

  4. Men differ from women in that they are much more diverse:

    Human studies of intrasex variability have shown that males are intellectually more variable. Here we have performed retrospective statistical analysis of human intrasex variability in several different properties and performances that are unrelated or indirectly related to intelligence: (a) birth weights of nearly 48,000 babies (Medical Birth Registry of Norway); (b) adult weight, height, body mass index and blood parameters of more than 2,700 adults aged 18–90 (NORIP); (c) physical performance in the 60 meter dash event of 575 junior high school students; and (d) psychological performance reflected by the results of more than 222,000 undergraduate university examination grades (LIST). For all characteristics, the data were analyzed using cumulative distribution functions and the resultant intrasex variability for males was compared with that for females. The principal finding is that human intrasex variability is significantly higher in males, and consequently constitutes a fundamental sex difference.

    If you take this result to its logical conclusion, you realize that whether you look at top performers or worst performers, you will find more men than women, assuming that the average performance is the same. Biology takes more risks with men than with women.

  5. Scholars who believe nurture trumps nature also tend to doubt the scientific method. I am not sure what to make of this observation.
  6. Curcumin is a yellow-ish chemical found in the turmeric spice (commonly used in Indian cuisine). It has long been reported to have anti-inflammatory properties. It seems to be good against arthritis (according to the guy who renovated my kitchen) and there are reports that people who eat turmeric-rich Indian food have fewer cancers. To my knowledge, the evidence for the benefits of curcumin remains somewhat anecdotal, which suggests that the beneficial effect, if any, is small. To make matters worse, curcumin is not very bio-available, meaning that you’d need to eat truckloads of turmeric to get a lot of curcumin in your cells. Some clever folks have commercialized more bio-available (and typically much more expensive) forms of curcumin. You can buy some on Amazon (it is not cheap: about $2 a day). We can hope that they would have greater effects. A paper in the American Journal of Geriatric Psychiatry reports that taking bio-available curcumin improves cognition in adults. Presumably, it reduces brain inflammation. (credit: David Nadeau)

    I expect that the effect, if real, is small. Still, it is also probably safe enough.

  7. Chinese scientists have successfully cloned two monkeys through somatic cell nuclear transfer. When I asked my wife why it was such a big deal, she pointed out that it suggested human cloning. Indeed, if you can clone monkeys, why not clone human beings?
  8. China has overtaken the United States in terms of the total number of science publications, according to statistics compiled by the US National Science Foundation (NSF). If you find this interesting, you might want to read my post China is catching up to the USA, while Japan is being left behind.
  9. Intel (and AMD) processors have instructions to compute the sine and the cosine. These instructions are surprisingly inaccurate:

    The worst-case error for the fsin instruction for small inputs is actually about 1.37 quintillion units in the last place, leaving fewer than four bits correct. For huge inputs it can be much worse, but I’m going to ignore that. (…) It is surprising that fsin is so inaccurate. I could perhaps forgive it for being inaccurate for extremely large inputs (which it is) but it is hard to forgive it for being so inaccurate on pi which is, ultimately, a very ‘normal’ input to the sin() function.

Initializing arrays quickly in Swift: be wary of Sadun’s initializers

Swift is Apple’s go-to programming language. It is the new default to build applications for iPhones. It also runs well on Linux.

It is not as low-level as C or C++, but it has characteristics that I like. For example, it does not use a “fancy” garbage collector, relying instead on deterministic reference counting. It is also a compiled language. It also benefits from a clean syntax.

Suppose you want to initialize an array in Swift with the values 0, 100, 200… Let us pick a sizeable array (containing 1000 elements). The fastest way to initialize the array in my tests is as follows:

Array((0..<1000).lazy.map { 100 * $0 })

The “lazy” call is important for performance… I suspect that without it, some kind of container is created with the desired values, and then it gets copied back to the Array.

One of the worst approaches, from a performance point of view, is to repeatedly append elements to the Array:

var b = [Int]()
for i in stride(from: 0, to: maxval, by: skip) {
  b.append(i)
}

It is more than 5 times slower! Something similar is true with vectors in C++. In effect, constructing an array by repeatedly adding elements to it is not ideal performance-wise.

One nice thing about Swift is that it is extensible. So while Arrays can be initialized by sequences, as in my code example… they cannot be initialized by “generators” by default (a generator is a function with a state that you can call repeatedly)… we can fix that in a few lines of code.

Erica Sadun proposed to extend Swift arrays so that they can be initialized by generators… Her code is elegant:

public extension Array {
  public init(count: Int, generator: @escaping () -> Element) {
    precondition(count >= 0, "arrays must have non-negative size")
    self.init(AnyIterator(generator).prefix(count)) // build the array from an iterator capped at count elements
  }
}

I can use Erica Sadun’s initializer to solve my little problem:

var runningTotal : Int = 0
let b = Array(count: 1000) {() -> Int in
           runningTotal += 100
           return runningTotal
}

How fast is Erica’s initializer?

$ swift build    --configuration release && ./.build/release/arrayinit
append                           6.091  ns
lazymap                          1.097  ns
Erica Sadun                      167.311  ns

So over 100 times slower. Not good.

Performance-wise, Swift is a rather opinionated language: it really wants you to initialize arrays from sequences.

We can fix Erica’s implementation to get back to the best performance:

public extension Array {
  public init(count: Int, generator: @escaping () -> Element) {
    precondition(count >= 0, "arrays must have non-negative size")
    self.init((0..<count).lazy.map { _ in generator() })
  }
}

My source code is available.

Microbenchmarking is hard: virtual machine edition

To better understand software performance, we often use small controlled experiments called microbenchmarks. In an earlier post, I remarked that it is hard to reason from a Java benchmark. This brought me some criticism from Aleksey Shipilëv who is one of the top experts on Java benchmarking. I still stand by my belief and simply promised Aleksey to, one day, argue with him over a beer.

In a follow-up post, I insisted that microbenchmarks should rely on very tightly controlled conditions, and I recommended avoiding just-in-time compilers if possible (such as is standard in Java). Indeed, you want your microbenchmarks to be as deterministic as possible (the code should always run the same way), yet just-in-time compilers are, almost by definition, non-deterministic. There is no reason to believe that your Java code will always be executed in the same manner from run to run. I also advocate avoiding memory allocation (and certainly garbage collection).

I am basing my opinion on practice. When developing software, I have often found it frustratingly difficult to determine whether a change would impact performance positively or negatively when using a language like Java or JavaScript, but much easier when using a more deterministic language like Go, Swift, C or C++.

Laurence Tratt shared with me his paper “Virtual Machine Warmup Blows Hot and Cold” (presented at OOPSLA last year). I believe that it is a remarkable paper, very well written. Tratt’s paper is concerned with microbenchmarks written for languages with a virtual machine, like Java, JavaScript, Python (PyPy), Ruby, Scala and so forth. Note that they use machines configured for testing and not any random laptop.

Here are some interesting quotes from the paper:

in almost half of cases, the same benchmark on the same VM on the same machine has more than one performance characteristic

However, in many cases (…) neither JIT compilation, nor garbage collection, are sufficient to explain odd behaviours (…) we have very little idea of a likely cause of the unexpected behaviour.

It is clear that many benchmarks take considerable time to reach a steady state; that different process executions of the same benchmark reach a steady state at different points; and that some process executions do not ever reach a steady state.

What should we do if P contains a no steady state? (…) no meaningful comparison can be made.

We suggest that in many cases a reasonable compromise might be to use smaller numbers (e.g. 500) of in-process iterations most of the time, while occasionally using larger numbers (e.g. 1500) to see if longer-term stability has been affected.

My thoughts on their excellent work:

  1. Their observation that many benchmarks never reach a steady state is troubling. The implicit assumption in many benchmarks is that there is some true performance, plus some noise. Many times, it is assumed that the noise is normally distributed. So, for example, you may rarely hit a performance that is much higher or much lower than the true (average) performance. That’s, of course, not how it works. If you plot timings, you rarely find a normal distribution. But Tratt’s paper puts into question the concept of a performance distribution itself… it says that performance may evolve, and keep on evolving. Furthermore, it hints at the fact that it might be difficult to determine whether your benchmark has reached a true steady state.
  2. They recommend running more benchmarks, the idea being that quantity is a quality of its own. I agree with them. The counterpart to this that they do not fully address is that benchmarking has to be easy if it is to be plentiful. It is not easy to write a microbenchmark in Java (despite Aleksey’s excellent work). Languages like Go make it much easier.
  3. They argue for long-running benchmarks on the basis that a single event (e.g., a context switch) will have a larger relative effect on a short benchmark than on a long benchmark. My view is that, as far as microbenchmarks are concerned, you want to idealize away outlier events (like a context switch), that is, you do not want them to enter into your reported numbers at all, and that’s difficult to do with a long-running benchmark if you are reporting an aggregate like an average.

    Moreover, if you have a really idealized setup, the minimum running time should be a constant: it is the fastest your processor can do. If you cannot measure that, you are either working on a problem that is hard to benchmark (e.g., involving random memory accesses, involving hard-to-predict branches, and so forth), or you have a non-ideal scenario.

    Of course, if you have a more complicated (non-ideal) setup, as is maybe unavoidable in a language like Java, then it is a different game. I would argue that you should then be headed toward “system benchmarks” where you try to benchmark a whole system for engineering purposes. The downside is that it is going to be harder to reason about the performance with confidence.

    Thus, when I really want to understand something difficult, even if it arose from Java or JavaScript, I try to reproduce it with a low-level language like C where things are more deterministic. Even that can be ridiculously difficult at times, but it is at least easier.

I would conclude that benchmarking is definitively not a science. But I’m not sure paranoia is the answer, I think we need better, easier tools. We need more visualization. We need more metrics. And, no, we don’t want to wait longer while sipping coffee. That won’t make us any smarter.

Science and Technology links (January 19th, 2018)

  1. The Raspberry Pi 3, a $15 computer that I use for various fun projects, is 15 times more powerful than the Cray-1 supercomputer, but it is 130,000 times lighter. The Cray-1 was worth $9 million in 1977. (Source: Joe Armstrong‏)
  2. Stem cells can be used to replace or modify the behavior of our own cells. It is likely that many breakthrough therapies will involve stem cells. But the production of stem cells is expensive. To solve the cost issue, the Mayo clinic will be producing stem cells in an automated manner for clinical trials.
  3. As we age, we tend to lose hair, and it does not just come back on its own. And if it did, at an old age, you would expect the hair to be white. But it looks like wounds can regrow hair at any age:

    We reported an 80-year-old patient with a large wound on the scalp (…) The patient’s wound healed very well aesthetically. Interestingly, on approximate post wound day 180, a hair was observed to be growing towards the surface and eventually erupted in the center of the wound. The hair remained black at 42-month follow-up. This case demonstrated that neogenesis of hair is possible even in a geriatric patient. (Source)

  4. The Alibaba corporation has developed an artificial intelligence model that scored better than humans in a Stanford University reading and comprehension test. I have not looked into it, but as far as I know, we don’t know how to build computers that can read like human beings do. I mean that we don’t even know how to do it in principle.
  5. Some chameleons have fluorescent bones.
  6. In the novel Rainbows End, sci-fi author Vernor Vinge describes a hero who is recovering from Alzheimer’s. The novel is set in 2025. We are in 2018, and we still have no clue how to halt, let alone cure, Alzheimer’s. If we were to cure Alzheimer’s, would the individual be able to recover normal use of his memory? Many people doubt it: they think that the synapses are being destroyed. Research from McGill University suggests that Rainbows End’s vision might be correct. The synapses are still present, even in advanced stages of Alzheimer’s; they are just unable to function. If correct, this means that we might, one day, reverse Alzheimer’s.
  7. As I suspected all throughout 2017, not all is well in Hollywood:

    While the average price of a movie ticket in the U.S. rose to $8.97 in 2017, an increase of 3.69 percent, total domestic box office in North America dropped by 2.55 percent to $11.091 billion, according to information released Wednesday by the National Association of Theatre Owners. Despite the increase in ticket prices, the overall decline in ticket revenue was caused by a drop in overall admissions, which fell by 6.03 percent to 1.236 billion.

  8. Birth order within a family seems to matter quite a bit:

    (…) we found strong birth order effects on IQ that are present when we look within families. Later-born children have lower IQs, on average, and these differences are quite large. For example, the difference between firstborn and second-born average IQ is on the order of one-fifth of a standard deviation

    The difference in educational attainment between the first child and the fifth child in a five-child family is roughly equal to the difference between the educational attainment of blacks and whites calculated from the 2000 Census.

    Firstborn children are significantly more likely to be employed and to work as top managers (…) firstborn children are more likely to be in occupations requiring sociability, leadership ability, conscientiousness, agreeableness, emotional stability, extraversion, and openness.

    later-borns are less likely to consider themselves to be in good health, and measures of mental health generally decline with birth order

Ridiculously fast base64 encoding and decoding

Computers store data as streams of bits. Binary files like image, audio or video files are allowed to contain just about any sequence of bits.

However, we also often use text formats; for example, web pages and emails are required to be text. So how do we send images by email? How do we embed images within web pages? One possibility is to point to a distinct binary file. Another common approach is to directly embed the binary data within the web page or the email using base64. Base64 is just a standard text format that can be used to code any binary input. To be precise, a base64 code is always a valid ASCII text (and thus, it is also valid UTF-8). Each byte of base64 code contains 6 bits of data. Thus we “waste” about 2 bits per byte. Hence the base64 equivalent of a binary file is about 33% larger. In practice, this size increase is rarely a source of concern. As far as I know, email attachments are almost always encoded as base64.

When writing HTML, I find it handy to encode my images directly in the HTML using a data URI. For example, in a recent post, I included a PNG file within my HTML code. Major websites like Google use data URIs all the time. It has a small downside in that the web pages are larger (obviously) and they cannot benefit from image caching. On the upside, you save the browser a separate network request.
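
To make this concrete, here is a minimal sketch (using the Node.js Buffer API) of how one might base64-encode a few bytes and wrap them in a data URI; the byte values and the image type are made up for illustration:

// encode some bytes to base64 (Node.js)
const bytes = Buffer.from([0x89, 0x50, 0x4e, 0x47]); // arbitrary example bytes
const b64 = bytes.toString("base64");

// a data URI embeds the base64 text directly in the HTML
const dataURI = "data:image/png;base64," + b64;
console.log(dataURI);

// decoding goes the other way
const decoded = Buffer.from(b64, "base64");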

If you are a web developer, you can use Web Storage to create a client-side database for your application. This client-side database can contain images and arbitrary data, but it must be all base64-encoded.

Most database engines support binary data, but several require that it be encoded as base64 at some point: MongoDB, Elasticsearch, Amazon SimpleDB, and Amazon DynamoDB. I am probably missing a few.

Base64 is commonly used in cryptography to exchange keys. A form of base64 is also used to pass arbitrary data as part of a URI.

Thankfully, encoding and decoding base64 is fast. Yet there are cases where it can become a problem. Matt Crane and Jimmy Lin found that decoding binary attributes from base64 in Amazon DynamoDB is slow.

How fast can you decode base64 data? On a recent Intel processor, it takes roughly 2 cycles per byte (from cache) when using a fast decoder like the one from the Chrome browser. This fast decoder is basically doing table lookups. That’s much slower than copying data within the cache (which takes less than 0.05 cycles per byte).
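
To give a rough idea of what “table lookups” means here, the following is a bare-bones sketch of a scalar base64 decoder; unlike a production decoder such as Chrome’s, it assumes an unpadded input whose length is a multiple of 4 and does no error checking:

// maps each base64 character code to its 6-bit value, -1 for invalid characters
const ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
const LOOKUP = new Int8Array(256).fill(-1);
for (let i = 0; i < 64; i++) LOOKUP[ALPHABET.charCodeAt(i)] = i;

function decodeBase64(str) {
  const out = new Uint8Array((str.length / 4) * 3);
  let pos = 0;
  for (let i = 0; i < str.length; i += 4) {
    // four 6-bit values give 24 bits, split into 3 bytes
    const v = (LOOKUP[str.charCodeAt(i)] << 18) |
              (LOOKUP[str.charCodeAt(i + 1)] << 12) |
              (LOOKUP[str.charCodeAt(i + 2)] << 6) |
               LOOKUP[str.charCodeAt(i + 3)];
    out[pos++] = (v >> 16) & 0xff;
    out[pos++] = (v >> 8) & 0xff;
    out[pos++] = v & 0xff;
  }
  return out;
}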

Is this the best you can do?

Alfred Klomp showed a few years ago that you could do much better using vector instructions. Wojciech Muła, myself and a few others (namely Howard and Kurz) decided to seriously revisit the problem. Muła has a web page on the topic.

We found that, in the end, you could speed up the problem by a factor of ten and use about 0.2 cycles per byte on recent Intel processors using vector instructions. That’s still more than a copy, but much less likely to ever be a bottleneck. I should point out that this 0.2 cycles per byte includes error handling: the decoder must decode and validate the input (e.g., if illegal characters are found, the decoding should be aborted).

Our research code is available so you can reproduce our results. Our paper is available from arXiv and has been accepted for publication by ACM Transactions on the Web.

My understanding is that our good results have been integrated in Klomp’s base64 library.

Microbenchmarking calls for idealized conditions

Programmers use software benchmarking to measure the speed of software. We need to distinguish system benchmarking, where one seeks to measure the performance of a whole system, from microbenchmarking, where one seeks to measure the performance of a small, isolated operation.

For example, if you are building a browser, you might want to know how quickly the iPhone 7 can retrieve and display the Wikipedia home page. That’s what I call system benchmarking.

Microbenchmarking is much more narrow in scope. So maybe you are interested in how quickly your processor can divide two numbers. Or maybe you want to know how quickly your processor can generate a random number.

Microbenchmarking is almost entirely distinct from system benchmarking, even if it looks superficially the same.

If you are trying to find out how quickly your processor can divide two integers, you want to know that, nothing else. You don’t want to measure how quickly the processor can divide two numbers while loading a website and cleaning out its registry.

Microbenchmarking is a form of science. Think about the physicist trying to measure the speed of light in some medium. It usually calls for idealized conditions. Why would you want to use idealized conditions when you know that they will not occur in the normal course of system operation?

  1. Idealized conditions make it easier to reason about the results. You are much more likely to know about the factors that influence your code if you have far fewer factors.
  2. Idealized conditions are more consistent and reproducible. Real systems are not stationary. Their performance tends to fluctuate. And to reproduce even your own results can be a challenge when working with real systems. A small idealized scenario is much easier to reproduce.

Though microbenchmarking can be used as part of an engineering project, the primary purpose is to gain novel insights.

You might object to idealized conditions… what if you want to know how fast your code would run “in practice”? The problem is that “in practice” is ill-defined. You need to tell us what the “practice” is. That is, you need to describe a system, and then you are back into system benchmarking.

System benchmarking is more typically engineering. You seek to make a system that people care about better. You can and should be able to get reproducible results with system benchmarking, but it is much more costly. For one thing, describing a whole system is much harder than describing a tiny function.

In the type of microbenchmarking I like to do, CPU microbenchmarking, the idealized conditions often take the following form:

  • As much as possible, all the data is readily available to the processor. We often try to keep the data in cache.
  • We make sure that the processor runs at a flat clock frequency. In real systems, processors can run slower or faster depending on the system load. As far as I know, it is flat out impossible to know, on Intel processors, how many CPU cycles were spent on a given task if the CPU frequency varies. Let me quote Wikipedia on time-stamp counters (TSC):

    Constant TSC behavior ensures that the duration of each clock tick is uniform and makes it possible to use the TSC as a wall clock timer even if the processor core changes frequency. This is the architectural behavior for all later Intel processors.

    Your processor can run faster or slower than the advertised frequency, unless you specifically forbid it.

  • If the code performance depends on branching, then we make sure that the branch predictor has had the chance to see enough data to make sound predictions. An even better option is to avoid branching as much as possible (long loops are ok though).
  • When using a programming language like Java, we make sure that the just-in-time compiler has done its work. Ideally, you’d want to avoid the just-in-time compiler entirely because it is typically not deterministic: you cannot count on it to compile the code the same way each time it runs.
  • You keep memory allocation (including garbage collection) to a minimum.
  • You keep system calls to a minimum.
  • You must benchmark something that lasts longer than ~30 cycles, but you can use loops.
  • You have to examine the assembly output from your compiler to make sure that your compiler is not defeating your benchmark. Optimizing compilers can do all sorts of surprising things. For example, your compiler could figure out that it does not need to compute the trigonometric functions you are calling very accurately, and so it could fall back on a cheaper approximation. If you can’t examine the assembly output, you should be very careful.

Of course, these are not hard requirements for a good microbenchmark; they are just a set of conditions that I have often found useful.

If you are lucky and can meet the above requirements, then you can run reliable CPU microbenchmarks in microseconds.

What do I mean by reliable? Last week, I reported that standard C code can interleave two 32-bit integers into a 64-bit integer using about 3.6 cycles on a Skylake processor. In my original post, I only ran 5 trials, I picked the minimum and I divided the total clock time by the number of pairs. This was criticized: according to some, my benchmark did not last long enough to “stabilize” the results; according to others, my test is too short to have good accuracy.

I returned to this Skylake processor and computed a histogram. I ran my short function (which computes 1000 interleaves) 2048 times, and each time I recorded the duration, before dividing by the number of pairs (1000). Running this whole experiment takes less than a second, and only requires a simple script. So let us look at the histogram:

The bulk of the results are in the 3.6 to 3.65 range (72% of all data points). The median is 3.636. As a first approximation, the noise follows a log-normal distribution. You might expect the measurement noise to follow a normal distribution (a bell curve), but normal distributions are uncommon when measuring computer performance.

From this histogram, one could argue that the “real” time is maybe slightly above 3.6 cycles per pair. It is maybe somewhere between 3.6 and 3.7 cycles. But that’s clearly a reasonable uncertainty.

Even though 72% of data points are between 3.6 and 3.65, the average is 3.6465… but that can be explained by the presence of outliers (for example, one measure was 17.2).

I like to report the minimum because it is easy to reason about: it is close to the best-case scenario for the processor. We know a lot about what processors can do in the best case… Yes, it is idealized but that’s fine.

My minimum is still the minimum of large averages (1000 pairs)… so it is more robust than you might expect. Also I can reasonably expect the noise to be strictly positive, so the minimum makes sense as an estimator.

If you want to know how fast a human being can run, you look at the world’s records and pick the smallest time. When I run microbenchmarks, I want to know how fast my processor can run my code in ideal conditions.

If you had a normal distribution (bell curve), then taking the minimum would be a terrible idea because you would tend to track unlikely events (also called sigma events). But that’s not the case in performance: with a proper testing methodology, your minimum will be consistent from run to run.

Is the minimum necessarily a good metric? I think it is in CPU-bound microbenchmarks. If you can check that the average (on a quiet machine) is close to the minimum, then the minimum is surely sound. Intel recommends looking at the variance of the minimum from run to run. If the variance is small (e.g., 0), then the minimum is likely a sound metric.

There is nothing magical about the average: your distributions are almost never normal; it is just easy to compute. If you are doing service benchmarks, percentiles (1%, 20%, 50%, 80%, 99%) are very useful, and the minimum and maximum are just instances of percentiles.

When in doubt regarding which metric to use, just plot your data. If you can’t script it, just throw the numbers in a spreadsheet (like Excel) and generate some graphs.
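
For what it is worth, here is a minimal sketch of the kind of script I have in mind, written in JavaScript (Node.js) purely for illustration; the interleave numbers above come from a C program using cycle counters, so this only illustrates the methodology (time a short function many times, then look at the minimum and a few percentiles):

// times fn() `trials` times and reports the minimum and a few percentiles (in nanoseconds)
function microbench(fn, trials) {
  const samples = new Float64Array(trials);
  for (let i = 0; i < trials; i++) {
    const start = process.hrtime.bigint(); // nanosecond-resolution clock
    fn();
    const end = process.hrtime.bigint();
    samples[i] = Number(end - start);
  }
  samples.sort(); // typed arrays sort numerically
  const pct = (p) => samples[Math.min(trials - 1, Math.floor(p * trials))];
  return { min: samples[0], median: pct(0.5), p99: pct(0.99) };
}

// example: time a tiny loop 2048 times
console.log(microbench(() => { let x = 0; for (let i = 0; i < 1000; i++) x += i; return x; }, 2048));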

Notice how I returned to this benchmark a week later, after the machine was updated and rebooted, and was able, without effort, to reproduce exactly my prior results.

Why don’t I include normality tests, standard errors and all that? Because I don’t need to. The best statistician is the one you never need. It is something of a myth that scientific results need fancy statistics: great scientific results require very little statistical sophistication. You just need to control the factors involved so that the numbers can speak for themselves.

A good analogy is how you weigh people. If you do things right, that is, you weigh the individual naked, always at the same time at the beginning of the day, before any food is eaten, a single (short) measurement each day is more than good enough. The “error” is probably small and irrelevant. If you weigh people randomly during the day, after random meals, wearing random clothes, then you may need many measurements to accurately estimate someone’s weight. Of course, once you throw in the complexity of having to deal with a whole population, then it becomes much harder to control all the factors involved and you may need fancy statistics again.

Wouldn’t I get more accurate results if I repeated my tests more often? Maybe not. If you pick a textbook, you might learn that averaging repeated measures brings you closer to a true value… but there are hidden assumptions behind this result that typically do not hold. Moreover, it is very hard to keep a modern system busy doing just one small thing for a long time. You can do it, but it requires a lot more care.

So long-running benchmarks are often not good idealized microbenchmarks because they also measure all sorts of irrelevant operations like unnecessary cache misses, context switches and so forth. The numbers you get often depend on how quiet your machine was at the time. You can run the same benchmark for three days in a row and get different numbers (with error margins far above 1%). It is certainly possible to get good long-running microbenchmarks, it is just harder.

Think about the surface of the Earth. If you move very little, then you can safely assume that the ground is flat and that we live on a planar surface. If you move a lot, you will encounter mountains and seas, and so forth.

(Of course, system benchmarks can and maybe should be run over a long time… but that’s different!)

Can it hurt to run really exhaustive tests? I think it can. The computer might not tire of waiting for seconds and minutes, but the human mind fares a lot better when answers keep coming quickly. I favor many small and cheap microbenchmarks over a few long-running ones, even when the long-running benchmarks are perfectly clean.

The idea that if a benchmark takes longer to run, it ought to be better might seem intuitively self-evident, but I think we should be critical of this view. Fast is good.

Science and Technology links (January 12th, 2018)

  1. A few years ago, researchers in Denmark expressed concerns regarding high concentrations of pesticides showing up in urine samples of Danish mothers and children. Last time I was in Denmark, a newspaper was reporting that there are surprising amounts of pesticides in the products sold in Denmark. A recent research article found that

    the adverse health effects of chronic pesticide residue exposure in the Danish population are very unlikely. The Hazard Index for pesticides for a Danish adult was on level with that of alcohol for a person consuming the equivalent of 1 glass of wine every seventh year.

  2. For the first time in American history, healthcare is the largest source of jobs, ahead of retail, the previous largest source.
  3. Farmers use laser beams to stop eagles from attacking their animals.
  4. Last year, we saw many new therapies based on gene editing. The most exciting gene editing technique right now is CRISPR-Cas9. A paper just reported that most of us are probably immune to CRISPR-Cas9, which means that we can’t receive an in vivo gene therapy based on CRISPR-Cas9 without concern that our immune system will fight it. This suggests that new techniques are needed.
  5. Gary Marcus caused quite a debate online by posting a paper entitled Deep Learning: A Critical Appraisal. I believe that Marcus’s main point is that deep learning is maybe not the golden path toward human-like intelligence. I’m not exactly sure why this should be so controversial given that I have heard one of the founders of the deep-learning school, Hinton, say exactly the same thing.

    Deep Learning is the dominant paradigm in artificial intelligence, right now. And it works. It is not all hype. It solves real problems people care about.

    But we don’t know how far it can go. Yet it seems that there are fairly obvious limits. For example, nobody knows how to use deep-learning alone to prove theorems. Yet we have pretty good tools (that work right now) to automatically prove theorems.

  6. Many people are concerned about how our climate could change in the near future. The Atlantic has a piece on how we can use technology to regulate the climate (something they call “geo-engineering”).