Performance overhead when calling assembly from Go

The Go language allows you to call C functions and to rewrite entire functions in assembly. As I have previously documented, calling C functions from Go comes with a significant overhead. It still makes sense, but only for sizeable functions, or when performance is irrelevant.

What about functions written in assembly? To illustrate the performance constraints, I am going to use an example designed by Jason Aten.

Recent Intel processors have an instruction (tzcnt) that counts the “number of trailing zeroes” of an integer. That is, given a non-zero unsigned integer, you count the number of consecutive zeros starting from the least significant bits. For example, all odd integers have no trailing zero, all integers divisible by 4 have at least two trailing zeros, and so forth. You might reasonably wonder why anyone would care about such an operation… it is often used to iterate over the 1-bits in a word. It is useful in data compression, indexing and cryptography.

A fast way to compute the number of trailing zeros without special instructions is as follows (see Leiserson and Prokop, Using de Bruijn Sequences to Index a 1 in a Computer Word, 1998):

table[(x&-x)*0x03f79d71b4ca8b09>>58]

where table is a short array of bytes that fits in a cache line…

var table = []byte{
	0, 1, 56, 2, 57, 49, 28, 3, 61,
	58, 42, 50, 38, 29, 17, 4, 62,
	47, 59, 36, 45, 43, 51, 22, 53,
	39, 33, 30, 24, 18, 12, 5, 63,
	55, 48, 27, 60, 41, 37, 16, 46,
	35, 44, 21, 52, 32, 23, 11, 54,
	26, 40, 15, 34, 20, 31, 10, 25,
	14, 19, 9, 13, 8, 7, 6,
}

Such a function is going to be fast, using only a handful of machine instructions and running in a handful of cycles. Still, Intel’s tzcnt instruction is superior, as it is a single instruction of a cost comparable to a single multiplication. Roughly speaking, we could expect tzcnt to be twice as fast.

Go can call tzcnt through a function written in assembly. So it seems that Go can easily make use of such instructions. Sadly no. Based on Aten’s code, I designed a benchmark. I am using a test server with a Skylake processor running at a flat 3.4 GHz, and my benchmark measures the instruction throughput. I think that the results speak for themselves:

pure Go (de Bruijn)  3.55 cycles/call
assembly (tzcnt)     11.5 cycles/call

In this instance, the function that calls tzcnt (and does little else) runs at nearly half the speed of the pure Go function. Evidently, Go does not take the assembly and inline it.

Programmers have asked the Go authors to inline assembly calls, but there seems to be little support from the core Go team for such an approach.

My point is not that you can’t accelerate Go code using functions written in assembly but rather that if the function is tiny, the function-call overhead will make the performance worse. So you will be better off using pure Go.

The source code is available.

What is a useful theory?

As a teenager and a young adult, I was an adept of thinkism: the idea that intelligence alone can solve problems. I thought I was smart, so I could just sit down and solve important problems, one after the other. Whatever contributions I ended up making had little to do with sitting down and thinking hard… and much more to do with getting my hands dirty.

When you learn mathematics, you are given theorems that all seem to descend from the minds of brilliant men (and, in some cases, women). So you think that society makes progress when people sit down and think really hard about problems.

Then you realize that it is not quite how the world advances. There is often a myriad of minor conceptual advances… solar panels don’t go from being overly expensive and inefficient to being cheaper than oil due to one breakthrough… the progress follows some kind of random walk. You come to realize that most of the things we think we know are not quite right. Even the simplest ideas are often misleading. We generally move forward, but there are steps backward as well. And, more critically, useful theories from the past can suddenly become anchors.

Progress is Darwinian, not Cartesian.

You can recognize good theoretical ideas because they make you smarter. Algebra is one such idea. For example, when I try to explain mathematical ideas to kids who have not yet taken algebra, I struggle.

Before you learn any new theory, you ought to figure out what kind of problems will become easier.

One concept I have criticized is “probability”. That’s a controversial stance because so much of software, engineering, and machine learning appears to rely on probabilities. My own research makes abundant use of probability theory. However, I struggle to find real-world problems that become easier once you introduce probability theory. It is not that there is any shortage of hard problems worth solving… it is that too few of these problems are made easier when applying probabilities.

I remember being on the examination committee of a Ph.D. student a few years back. She was making use of probabilities and Bayesian networks throughout her presentation. At some point, to test her out, I asked what should have been a trivial hypothetical question… all she needed to do was really know what independent probabilities are… Not only did she fail to answer properly, but I suspect that many of the other members of the examining board failed to realize she was confused.

I cannot blame her because I get probabilities wrong all the time. Let us consider the Monty Hall problem, as stated on Wikipedia:

Suppose you’re on a game show, and you’re given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say No. 1, and the host, who knows what’s behind the doors, opens another door, say No. 3, which has a goat. He then says to you, “Do you want to pick door No. 2?” Is it to your advantage to switch your choice?

I have been confused by this problem… that’s ok, maybe I am just not that smart… but I have watched several people with a science Ph.D. being unable to even think about the problem, even when they took all the probability theory that they need to answer this question.

In this case, the problem is not hard if you stop short of using probability theory. You can just enumerate the possibilities:

  1. Door 1: Goat, Door 2: Car, Door 3: Goat
  2. Door 1: Car, Door 2: Goat, Door 3: Goat
  3. Door 1: Goat, Door 2: Goat, Door 3: Car

So you have picked Door 1. In the first scenario, the host will pick Door 2, and switching is the right move. In the second scenario, switching is the bad move. In the last scenario, switching would again be the right move. So in 2 out of 3 scenarios, switching is the right move.

If I explain it in this manner, I am convinced that even 10-year-olds can understand.

I have argued in more detail in my blog post Probabilities are unnecessary mathematical artifacts that we should use different tools that make it easier for us to get the right answer.

My friend Antonio Badia wrote to me…

(…) it’s impossible to do Artificial Intelligence today without using probabilities.

I suspect he is right. However, the fact that we use painful tools today can be interpreted as a hint that there are massive conceptual breakthroughs to be made in the near future.

We should not assume that ideas are universally useful. We live on a moving landscape.

Calculus, as it is currently taught, was made exceedingly abstract so that pre-computing scientists could do their work. But in the computing era, it makes less and less sense. Zeilberger correctly argued, in my mind, that we should just do away with the concept of infinity…

So I deny even the existence of the Peano axiom that every integer has a successor. Eventually, we would get an overflow error in the big computer in the sky, and the sum and product of any two integers is well-defined only if the result is less than p, or if one wishes, one can compute them modulo p. Since p is so large, this is not a practical problem, since the overflow in our earthly computers comes so much sooner than the overflow errors in the big computer in the sky. (Zeilberger in “Real” Analysis is a Degenerate Case of Discrete Analysis)

Even if you disagree with Zeilberger, I don’t think that there can be much disagreement on the fact that school-taught mathematics pays little attention to modern-day problems. Long division is about as useful as Latin.

So, yes, a useful theory is one that makes you smarter. We are surrounded by theories that fail to pass this basic test. It is a problem and an opportunity.

Further reading: Kuhn’s Structure of Scientific Revolutions, Pragmatism (Wikipedia)

Science and technology: what happened in 2016

This year, you are able to buy CRISPR-based gene editing toolkits for $150 on the Internet as well as autonomous drones, and you can ask your Amazon Echo to play your favorite music or give you a traffic report. You can buy a fully functional Android tablet for $40 on Amazon. If you have made it to a Walmart near you lately, you know that kids are going to receive dirt cheap remote-controlled flying drones for Christmas this year. Amazon now delivers packages by drone with its Prime Air service. There are 2.6 billion smartphones in the world.

So what else happened in 2016?


Cancer death rates have dropped by 23% since 1991. And, no, it is not explained away by the fact that fewer people smoke.

Though most insects are short-lived, we have learned that some ants never age.


In the UK, scientists have been allowed to modify genetically human embryos.

As you age, your body accumulates “senescent cells”. These are non-functional cells that are harmless in small quantities but can cause trouble when they accumulate. Researchers have found that by removing them (using drugs called senolytics), they could extend the life of mice. A company funded by Amazon’s CEO Jeff Bezos, Unity Biotechnology, is working to bring this technology to human beings. The CEO of this new company has given a talk which you can watch on YouTube.

Google reports that it can determine the location of almost any picture with superhuman ability. Take a picture of your favorite landscape and Google will tell you where you are just by looking at the picture.

Researchers are attempting to regrow knee cartilage in human beings using stem cells.


Google defeated the world champion at the board game of Go. Twenty years after Kasparov’s defeat at the hands of IBM’s Deep Blue, there is no longer any board game where human beings are superior to computers.

Google supplemented its Google Photo service with advanced artificial intelligence. If you are looking for pictures of your dog, Google can find them for you, without anyone ever entering metadata.


Europe has authorized a gene therapy to help cure children. Soon enough, we will routinely use genetic engineering to cure diseases.

We have entered the era of fully functional consumer-grade virtual-reality gear. Many of us have experienced high-quality virtual experiences for the first time in 2016. According to some estimates, over 3 million VR units were sold in 2016: 750k PlayStation VR, 261k Google DayDream, 2.3 million Gear VR, 450k HTC Vive and 355k Oculus Rift.

Dementia rates in human beings are falling. We do not know why.


Foxconn, the company that makes the iPhone for Apple, has replaced 60,000 employees with robots.


Pokemon Go is the first massively popular augmented-reality game.

China holds the first clinical trials of CRISPR-based anti-cancer therapies.


Netflix now has more subscribers than any cable TV company.

There are ongoing clinical trials where blood plasma from young people is given to older people, in the hope of rejuvenating them. Blood plasma is abundant and routinely discarded in practice, so if it were to work, it might be quite practical. (Further research that appeared later this year allows us to be somewhat pessimistic regarding the results of these trials in the sense that rejuvenation might require the removal of aging agents rather than the addition of youthful factors.)

Singapore is the first city in the world to have self-driving taxis.

We know that some invertebrates, like lobsters, do not age (like we do). For example, we have recently found 140-year-old lobsters in the wild, and they aren’t the oldest. The hydra has constant mortality throughout its life. There is definitely a lot of variation in how different species age. Some sea creatures, like sharks, are both long-lived and vertebrates. We don’t know how long sharks can live, but we found a shark that is at least 272 years old. This suggests that even vertebrates, and maybe mammals, could be effectively ageless.


What would happen if you took stem cells and put them in a brain? Would they turn into functional neurons and integrate into your brain… potentially replacing lost neurons? Researchers have shown that it works in mice.

Japan had 153 centenarians in 1963. There are now over 60,000 centenarians in Japan. According to researchers (Christensen et al., 2009), a child born today in a developed country has a median life expectancy of over 100 years.

The iPhone 7 is just as fast as a 2013 MacBook Pro. That is, our best smartphones from 2016 can run some software just as fast as the best laptop from three years ago.

Software can now read mammograms more accurately than any doctor. Software is also faster and cheaper.

Google upgraded its “Google Translate” service using a new engine leveraging recent progress in “deep learning” algorithms. The quality of the translations has been much improved. What is really impressive is that it works at “Google scale” meaning that all of us can use it at the same time.


The most popular game console of this generation, the PlayStation 4, released a well-reviewed virtual-reality headset.

Between 2011 and 2016, traditional TV viewing by 18-24-year-olds dropped by more than 9 hours per week.

Mammal heart muscles do not regenerate. So if your heart runs out of oxygen and is damaged, you may never recover the lost function. However, scientists have shown (in monkeys) that we could induce regeneration with stem cells.

Microsoft claims to be able to match human beings in conversational speech recognition.

Tesla, the car maker, has enabled self-driving on all its cars. Even General Motors is producing self-driving cars in Michigan. You should still check whether it is legal for you to drive without holding the wheel.


Crazy researchers gave young human blood to old mice. The old mice were rejuvenated. Though we do not know yet what it means exactly, it is strong evidence that your age is inscribed in your blood and that by changing your blood, we can age or rejuvenate you, to a point.

On the same theme, the Conboy lab at Berkeley, with funding from Google (via Calico) has shown, using a fancy blood transfer technique, that there are “aging agents” in old mice blood. If you take these agents and put them into a young mouse, it is aged. Presumably, if we were to identify these agents, we could remove them with drugs or dialysis and rejuvenate old mice. In time, we could maybe rejuvenate human beings as well.

Facebook’s CEO, Mark Zuckerberg, said that it will soon be normal for human beings to live over 100 years. This is supported by the work of demographers such as James W. Vaupel.


It was generally believed that women eventually ran out of eggs and became irreversibly sterile. Researchers have found that a cancer drug can be used to spur the generation of new eggs, thus potentially reversing age-related infertility. In the future, menopause could be reversed.

Bones grown in a lab were successfully transplanted in human beings.

We don’t know exactly how long seabirds could live. It is hard work to study long-lived animals and it takes a long time (obviously). Biologists have encountered unexpected problems such as the fact that the rings that they use to tag the birds wear out faster than the birds do. In 1956, Chandler Robbins tagged an albatross. The 66-year-old albatross laid an egg this year.

Old people (in their 30s and 40s) can have young babies. Evidently, biological cells have a way to reset their age. Ten years ago (in 2006), Shinya Yamanaka showed how to do just that in the lab by activating only four genes. So no matter how old you are, I can take some of your cells and “reset them” so that they are “young again”. Until now, however, nobody knew how to apply this to a multicellular organism. The New York Times reports that researchers from the Salk Institute did just that. (They have a video on YouTube.) By activating the four genes, they showed that they could rejuvenate human skin tissue in vitro, and then that they could extend the lifespan of mice.

How to build robust systems

Millions of little things go wrong in your body every minute. Your brain processes the data in a noisy manner. Even trained mathematicians can’t think logically most of the time.

In our economy, most companies fail within a decade. Most products are flawed.

Over a million people die every year in car accidents. We have made cars much safer over the last few decades, not by turning them into tanks or by limiting speed to 10 km/h, but by adding air bags and other safety features that make accidents less harmful.

A large fraction of our software is built in C. The C language is great, but it does not easily lend itself to a rigorous analysis. For example, a lot of production C software relies on “undefined” behavior that is compiler specific.

Irrespective of the programming language you choose, your software will run out of resources, it will contain bugs… Faults are unavoidable but our computers still keep us productive and entertained.

There are two distinct strategies to build robust systems:

  1. You can reduce the possibility of a fault.
  2. You can reduce the harm when faults do occur.

We all know what it means to reduce the possibility of a fault… we can make software safer by adding more checks, we can train better drivers…

But what does the latter (reducing harm) mean concretely? Our bodies are designed so that even if hundreds of our cells break down, you barely notice it. Our economy can keep working even if a large fraction of its enterprises fail. Robust software systems are all around us. For example, modern operating systems (Linux, Windows, Android, iOS, macOS) can sustain many faults before failing. Our web servers (e.g., Apache) often run with long streams of error logs.

As a programmer, it is simply not possible to account for everything that may go wrong. Systems have limited resources (e.g., memory) and these may run out at any time. Your memory and storage may become corrupted… there is a small but non-zero probability that a single bit may flip. Cosmic rays, for example, are bound to slightly corrupt your memory. We can shield memory and storage, and add layers of redundancy… but, beyond a certain point, reducing the possibility of a fault becomes tremendously complicated and expensive… and it becomes far more economical to minimize the harm due to expected faults.

Don’t assume that safety comes for free: a Swift case study

Most modern languages try to be “safer” by checking runtime values in the hope of producing more secure and less buggy software. Sadly, it makes it harder to reason about the performance of the code. And, sometimes, the cost can be far from negligible.

Let me illustrate it through a story using Swift programming (Apple’s new programming language).

How do you sum the values in an array? If you are an idiot like myself, you might use a for loop:

var s = 0
for i in array {
    s += i
}

According to my benchmark, this runs at a speed of 0.4 ns per array element. My Skylake processor runs at a flat 3.4 GHz, so we are using a bit over one CPU cycle per array element (1.36 to be precise). Not bad, but we know we could do better. How could Swift get something so simple “wrong”? We shall come back to this issue.

We can get fancier and compute the sum using “functional programming” with a call to reduce. It is slightly shorter and more likely to get you a job interview:

array.reduce(0,{$0 + $1})

For extra obfuscation and even better luck during interviews, you can use an equivalent and even shorter syntax:

array.reduce(0, +)
Though it is more scientific-sounding, this function is actually 50% slower (clocking at 0.6 ns per element). Why is it slower? We are evidently paying a price for the higher level of abstraction.

Can we do better?

Swift’s integers (Int) are 64-bit integers on 64-bit platforms. What happens if the sum of the values cannot be represented using a 64-bit integer (because it is too large)? Swift helpfully checks for an overflow and crashes if it occurs (instead of silently continuing as Java, C or C++ would). You can disable this behavior by prefixing your operators with the ampersand as in this code…

var s = 0
for i in array {
    s &+= i
}

or this one…

array.reduce(0,{$0 &+ $1})

These are “unsafe” functions in the sense that they might silently overflow.

On my Skylake processor, the silent-overflow (“unsafe”) approach can be twice as fast:

technique                    clock cycles per value
simple loop                  1.36
reduce                       2.0
"unsafe" simple loop (&+)    0.65
"unsafe" reduce (&+)         1.87

So between the “safe” reduce and the “unsafe” simple loop, there is a factor of 3. I consider such a difference quite significant. The “unsafe” simple loop has the same speed as a C function: that’s what we want, ideally.

All these functions achieve nearly the same best performance if you disable safety checks at build time (swift build --configuration release -Xswiftc -Ounchecked). This means, in particular, that we get the fancy abstraction (the reduce approach) for free, as long as we disable safety checks. That makes performance a lot easier to understand.

So why the large performance difference? With regular checked arithmetic in a for loop, Swift generates add instructions (addq) followed by a conditional jump (jo). Without checked arithmetic, it is able to vectorize the sum (using paddq instructions). (Thanks to John Regehr for pointing me down this path of analysis.) To paraphrase Steve Canon, “overflow checking itself isn’t very expensive, but it inhibits other optimizations.”

You might notice that I am using quotes around the terms “safe” and “unsafe”. That’s because I do not think it is universally a good thing that software should crash when an overflow occurs. Think about the software that runs in your brain. Sometimes you experience bugs. For example, an optical illusion is a fault in your vision software. Would you want to fall dead whenever you encounter an optical illusion? That does not sound entirely reasonable, does it? Moreover, would you want this “fall dead” switch to make all of your brain run at half its best speed? In software terms, this means that a mission-critical application could crash because an unimportant function in a secondary routine overflows. This is not a merely hypothetical scenario: an overflow in an unnecessary software component halted the software of the Ariane 5 rocket leading to a catastrophe.

My Swift and C code is freely available.

Further reading: We Need Hardware Traps for Integer Overflow and Understanding Integer Overflow in C/C++

Appendix: Overflow checks are not unique to Swift. You can compile your C and C++ with runtime overflow checks using LLVM clang or GNU GCC. The Rust language checks for overflow in debug mode (but not in release mode).

Update: A HackerNews user commented on the performance issues:

Looking at the generated code, we have a couple issues that prevent full optimization. One of them I already knew about — as a debugging aid, we guard every trap we emit with the equivalent of an `asm volatile(“”)` barrier, because LLVM will otherwise happily fold all the traps together into one trap. We really want every trap to have a distinct address so that the debugger can map that trap back to the source code that emitted it, but the asm barrier prevents other optimizations as well. As for `reduce`, it looks like the compiler does inline away the closure and turn the inner loop into what you would expect, but for some reason there’s more pre-matter validation of the array than there is with a for loop. That’s a bug; by all means, reduce is intended to optimize the same.

Update 2: Nathan Kurz ran additional tests using arrays of different sizes…

$ swift build --configuration release
$ cset proc -s nohz -e .build/release/reduce

# count  (basic, reduce, unsafe basic, unsafe reduce)
1000      (0.546, 0.661, 0.197, 0.576)
10000     (0.403, 0.598, 0.169, 0.544)
100000    (0.391, 0.595, 0.194, 0.542)
1000000   (0.477, 0.663, 0.294, 0.582)
10000000  (0.507, 0.655, 0.337, 0.608)
100000000 (0.509, 0.655, 0.339, 0.608)
1000000000(0.511, 0.656, 0.345, 0.611)

$ swift build --configuration release  -Xswiftc -Ounchecked
$ cset proc -s nohz -e .build/release/reduce

# count  (basic, reduce, unsafe basic, unsafe reduce)
1000      (0.309, 0.253, 0.180, 0.226)
10000     (0.195, 0.170, 0.168, 0.170)
100000    (0.217, 0.203, 0.196, 0.201)
1000000   (0.292, 0.326, 0.299, 0.252)
10000000  (0.334, 0.337, 0.333, 0.337)
100000000 (0.339, 0.339, 0.340, 0.339)
1000000000(0.344, 0.344, 0.344, 0.344)

Getting a job in the software industry

I am routinely asked about how to break into the software industry as a programmer. It is true that there is high demand for qualified programmers, but not all jobs are equal. There are good and bad tracks. And most jobs aren’t that good.

How do you get a good job as a programmer? Here is my somewhat educated answer:

  • Make sure you really want to be a programmer. It is hard work. Can you spend hours reading or writing technical documentation? Can you spend hours reviewing someone’s code? Don’t get into programming because you like video games or because you like the salaries; it is not going to end well.
  • It is entirely possible to get a good job without any formal training as a programmer. Good programmers can recognize one of their own without checking a resume. However, the overwhelming majority of programmers with great jobs seem to have some college education. At a minimum, you need to be able to communicate clearly (in English) and know basic mathematics. You can do a lot of useful programming without knowing about logarithms, probabilities, standard deviations, functions, set theory… but you are just going to be a lot less convincing in some contexts. So if you have no college education whatsoever, it might be easier to get some. At the other end of the spectrum, having many advanced degrees might not help as much as you’d expect. The prestige of your college probably does not matter a whole lot, and you can get good mileage out of things like Coursera.
  • You need to learn to recognize good jobs. A good tell is the people interviewing you… if there is nobody in charge of recruiting that has a strong background in programming, run away. I’d make an exception if it is a start-up and you are the first programmer being hired… but then you should probably be getting some equity in the company.
  • It is tempting to look at job ads and focus on getting all the right keywords on your resume. But without an in-depth knowledge of the domain, it can be a misleading activity because (1) most jobs being advertised are not good jobs and (2) not all keywords have equal weight.
  • Right now, it probably pays to get yourself a GitHub profile and to contribute usefully to some projects. Show that you can be a team player, show that you can follow through and produce useful code. Getting a StackOverflow profile and contributing to the community is probably wise. There might be other online communities that are just as good, and these things tend to evolve over time… it is possible that in 5 years, GitHub will be replaced entirely by something better. You need to keep up.
  • Get involved with local programming-oriented meetups. Show up, ask questions, give talks. You can get to meet some of the best engineers from some of the best companies, and they might be quite happy to discuss job prospects.
  • You may need a little bit of shallow knowledge about all sorts of things, especially the latest trends. It probably pays to follow news sites catering to programmers: Hacker News, Slashdot, Reddit, blogs, Twitter…
  • Technical interviews can be quite difficult. I am not exactly sure why the interviews are so difficult, but they can be a lot harder than your average college exam. It probably pays to train with the help of a book or two.

Software evolves by natural selection

Software evolves by natural selection, not by intelligent design. It is a massive trial-and-error process. There are many thousands of programmers working every day to build new things in the hope of replacing the old…

From time to time, you will hear about a new fantastic piece of computer science. For example, right now deep learning is the hot new thing. Some years ago, people were very excited about MapReduce.

As an ecosystem changes, some tools become less likely to be useful while others gain dominance in common use cases. So, because of a myriad of other changes, deep learning is suddenly a lot more useful than it would have been 10 years ago. Deep learning is great to classify pieces of data like images, given that you have the kind of computing power a modern GPU offers. Decades ago, we had much less data, and nothing like a modern GPU, so perceptrons (the ancestors of deep learning) were useless.

What does it mean for the programmer? If software advances by trial and error, then the more quickly you can move and adapt, the better are your chances. It also becomes important to anticipate changes and to try many things. Trial-and-error is an underrated approach in software.

You will no doubt object that you are using the latest stuff… maybe it is deep learning running on the sexiest big-data framework (e.g., Apache Spark) with the fanciest programming language. So you are set, right? Maybe, but be concerned that even a brand new bus turns slowly. Don’t be a bus.

So what is the software of the future? Where should you place your bet?

I have been making a boring prediction for more than a decade, and it has served me well: we will have more data. We should think about what could become possible if we were given 10x as much data and 10x as much processing power. What kind of software would we need then?

It is a safe bet that current PC architecture cannot cope. So do not get too attached to it.

On metadata

I remember a time, before the Web, when you would look for relevant academic papers by reading large books with tiny fonts that would list all relevant work in a given area published in a given year. Of course, you could have just gone to the shelves and checked the research articles themselves but, for a slow human being, this would have been just too time consuming. These large volumes contained nothing but “metadata”: lists of article titles, authors, keywords… They were tremendously valuable to the researchers.

One of the earliest applications of computers was to help manage document collections. In this sense, the Web and Google are very natural applications for computers. In the early days, computers could not have access to the books themselves, but human experts (librarians) could enter the precious metadata in the computers to make them useful.

This was at a time when many computer scientists were worried that computers would be starved for data. Indeed, there are only so many highly paid experts who will sit quietly at a computer entering valuable data.

That was an era when people dreamed of intelligent computers. So they imagined a world where human experts would enter data and rules into computer systems, and these rules would allow the computers to match or even outmatch human intelligence.

History proved these early artificial-intelligence researchers wrong. This “classical” approach toward machine intelligence did not bear fruit. It is not, as you might imagine, that these dreamers lacked support. They got billions of dollars in funding from just about every government agency you can think of, including the military.

More recently, this classical artificial-intelligence approach has led to the semantic web (1998) and its idea of “linked data”. The thesis of this approach is to supplement the existing web with a thick layer of “RDF” that describes the resources and enables advanced machine reasoning.

There are a few problems that make classical artificial intelligence a bad approach:

Collecting, curating and interpreting a large volume of predicates in an open world is a fundamentally intractable problem. If you must track the list of authors, the publication date and the topic of a set of books, and you have enough highly trained librarians, it will work. But the system fails to scale up.

If you have enough time and money, you can build things up, to a point. Walmart, the world’s largest private employer, has an extensive supply chain that needs to track precise and accurate metadata about a vast range of products. It works. And it made Walmart very rich and powerful.

But we should not underestimate the difficulty. When the American retailer Target tried to enter the Canadian market, it ended up with empty shelves. These empty shelves led to poor business and to a massive failure that will be a subject of study for years to come. Why were the shelves empty? Because Target tried to build a new inventory management system and failed. What does an inventory management system do? Essentially, it is a metadata system. To build it up quickly, Target had its employees collect and enter the data, day after day… And then they faced reality: when the metadata was there at all, it was often badly wrong. This made it difficult if not impossible to keep the shelves filled up.

Walmart has a worldwide distribution and supply-chain system, with well-oiled computing systems. But getting it to work is not a secondary or trivial activity. It is Walmart’s core business. It takes enormous effort for everyone involved to keep it going. We know because it is hard for other people to reproduce it, even when they have tremendous resources.

Even the simplest things can become headaches. Creative Commons allows you to specify a license for your work. A popular choice is to allow “noncommercial” use. But what does that mean? Reasonable people can disagree. Is a class given at a public college a “commercial” or “noncommercial” activity? Dublin Core aims at specifying simple things like the author and title of a piece of work, but when you examine the generated metadata, there is widespread disagreement about what the various attributes mean.

In the early days of the web, Yahoo! was the standard search engine. It worked through a taxonomy of websites. You looked for a taxidermist in your town by drilling down to the right category. But despite billions in revenues, Yahoo! could not keep this system going. It collapsed under its own weight.

This should not be underestimated. Most metadata is unreliable. Maintaining high-quality data is simply hard work. And even when people have good intentions, it takes more mental acuity than you might think. For example, in the Target Canada case, product dimensions were sometimes entered in inches, sometimes in centimeters. You might dismiss such errors as the result of hiring poorly trained employees… but the NASA Mars Climate Orbiter crashed because highly paid engineers got confused and used English units instead of metric units.

One of the problems with metadata in the real world is that you are in an adversarial setting. So assuming that the people entering the metadata are motivated and highly trained, you still have to worry that they are going to lie to you. That’s what happened with Yahoo!, and what happens with a lot of metadata. Suppose that I am a book editor and I am asked to classify the book… and I know that customers will browse the content according to my classification… my incentive is strong to fit my product in the most popular categories.

To make matters worse, metadata systems are not objectively neutral entities on their own. They were designed by people with biases. There is no such thing as an objective oracle in the real world. If you have ever had to reconcile different databases, data stored in different formats, you know how painful it can be. There are simply different and equally valid ways to describe the world.

You would think that errors can be objectively assessed… and certainly a box measures about 12 inches or it does not. However, the level and type of determinism in natural languages has evolved and reached a sweet spot. We are neither fully pedantic nor entirely vague in our use of a natural language. There is plenty of room for reasonable people to disagree about the meaning of the words, about how to categorize things. Is Pluto a planet? Well. It is a dwarf planet. Is that like a small planet or not a planet at all?

We silently make some assumptions. One of them is that the world can be understood in a top-down fashion… we start from abstract concepts and work our way to concrete cases. Peter Turney argues that though we think that “red” and “ball” are the fundamental logical atoms, the opposite might be true. We first get a “red ball” as the core logical atom, and the most abstract notions, such as “red” or “ball” are derived from the more concrete ones.

So far, I have been addressing the case of simple metadata. The kind that can help you determine whether a book was published in June or July of this year. But if you want to build “intelligent” software, you need a lot more than that. You need to represent your knowledge about the world in a formal form.

How hard is that? Let us turn to mathematicians. Surely representing mathematics formally is the easiest thing. If you can’t formalize mathematics, what chances do you have in the real world? So for many decades, a collection of super smart mathematicians (under the pseudonym Bourbaki) attempted to give mathematics a formal foundation. It was interesting and challenging, but, ultimately, it just proved too difficult to pursue.

But wait! Isn’t Google able to go through your pictures and find all pictures of a cat? Isn’t Google able to take a picture of your house and find out your street address?

Yes. But the muscle behind these feats has little to do with the prowess of metadata. You see, while the classical artificial-intelligence folks pursued their dream of cataloging all knowledge and achieving intelligence in this manner, other people thought that intelligence was more like a probabilistic engine. It is more about statistics than reasoning. We usually refer to this approach as “machine learning”, meaning that the knowledge enters the machine through “learning”, and not through the wisdom of human experts.

The opposition between the two points of view has been dubbed “neats versus scruffies”. The metadata and formal-reasoning people are in the “neat” camp, while the others are scruffies.

Google is the child of the scruffies. Google does not hire large teams of experts to catalog knowledge; the overwhelming majority of its effort is invested in machine learning. Of course, given a web page, Google has metadata… but the bulk of this metadata was derived by the machine.

Suppose you needed to build a representation of the world… would you recruit thousands of people to fill out forms? Or would you build software that can parse textbooks, Wikipedia, reference manuals…? The latter approach has to deal directly with ambiguity and contradictions, but it is also the only way to scale up to billions of facts.

So, at this point, if anyone asks you to invest a lot of time entering metadata in a system… you really ought to be skeptical. That’s not the approach that has taken over the world. It is the approach that has failed, again and again, to deliver the promised goods.

Let me summarize. Representing knowledge formally is hard. Harder than intuition dictates. It is really expensive. It does not fare well once adversarial interests come in. You usually end up with faulty, barely usable data, either because people did not make the effort, because they gamed the system or because they made honest mistakes. And that’s for simple data such as maintaining an address book of your customers. Intelligent applications need finer representations, and then it gets really hard and expensive to get right. Meanwhile, approaches based on machine learning are getting better and better every year. They can defeat the best human players at most games, they can classify your pictures automatically, they can translate books for you… We have to embrace machine intelligence. The last thing we want our highly paid experts to do is to be filling out forms.

Further study: Everything is Miscellaneous by David Weinberger.

Credit: Thanks to Andre Vellino and Antonio Badia for their valuable feedback and contributions.

Framed in the past

We can’t predict the future. However, I still read futurologists like Calum Chace, Alvin Toffler, Bruce Sterling, Vernor Vinge, Joël de Rosnay and so forth. A good futurologist won’t even attempt to predict the future for real, but he will offer a new way to think about the present. Indeed, I think that to really understand the present, it is useful to first put ourselves into an imagined future, and then look back. A fish can only see the pond if he first jumps out of it.

The default in our culture works in reverse. Instead of imagining the future and thinking back, we put ourselves in the past. Instead of thinking as futurologists, we think as historians.

People buy phones. In 2016. Of course, they actually mean portable computers with a touch screen. Nobody in the 1970s would recognize an iPhone as a phone. It is actually far superior to the Star Trek communicators that people in the 1970s saw on TV. That should not be surprising given that an iPhone is more powerful than any computer that existed in the 1970s. But we still frame these portable computers as things (phones) from the past.

We talk about shows that appear (sometimes exclusively) on Netflix as “TV shows”, whereas a “TV” is entirely optional to watch the shows in question.

By framing radical innovations in the past, we are effectively dismissing the genuine novelty. That is the very purpose of this framing. It is hard to sell something new. So marketers have to relate it to what you know. Everyone knows about phones. They are useful. Companies pay for them. So let us sell phones, even if they are no longer phones.

I was invited to talk at a major music industry event in Montreal recently. People complained repeatedly that they had a harder time selling discs. They were angry. Yes, it is not a typo. In Montreal, people in the music industry still talk about selling discs. Of course, they mean, I suppose, getting royalties from people listening to their music online. Or maybe that’s not what they mean. I am not sure. What is sure is that they frame their problem in the past.

A colleague conducted a set of interviews with education professors specializing in distance learning. They were asked whether technology (the web, basically) changed distance learning. Their answer was unequivocal. It did not. The change is superficial, they say. I am sure that they are being truthful: they view contemporary distance learning as it was 20 years ago.

You will find exactly the same thing in a computer-science department. Did the technology of the past 20 years disrupt software? You will have no trouble finding professors who will carefully explain that nothing at all is new.

Past-tense framing is useful in that it gets people to accept radical change. But that’s not entirely a good thing: it also prevents people (including smart people) from thinking clearly about the present. If the present is like the past, then whatever worked in the past will work in the present.

Historians are fond of quoting Santayana: “Those who cannot remember the past are condemned to repeat it.” I prefer to say that those who can’t see different futures won’t have any say in the present.

As you organize your upcoming industry event, you should probably recruit speakers who can tell you about the future of your industry. Of course, they are going to be wrong, but you can be wrong in interesting and useful ways.

My review of Deus Ex: Mankind Divided (video game)

The Deus Ex series is set in a dystopian futuristic universe. (It reminds me a bit of Japan’s Ghost in the Shell.) The latest game in the series (Deus Ex: Mankind Divided) is set less than 15 years in the future (2029). In this future, many people have artificial organs (eyes, legs, hands) called “augmentations” because they are superior to biological organs. You play as Adam Jensen, a man who barely survived a terrible attack… but was given superior augmentations by the corporation he works for. So you are something of a super-hero. As the name suggests (“mankind divided”), people with augmentations are oppressed and put in ghettos. Oddly enough, you are free to go and generally given a lot of respect. There is a lot of terrorism, organized crime, high-level conspiracies, perverted media…

I am not exactly sure whether the authors believe that they are offering a credible depiction of the future. I don’t think that the game will age well in this respect. In the game, laptops are omnipresent. If we have the technology to replace your lungs with better artificial organs… why would we carry around bulky 2010-era laptops? There is a single virtual-reality setup, and it takes a whole room. I would think that by 2029, virtual-reality gear would be as convenient as a pair of regular glasses? There are many oppressed workers, lots of poverty… but few robots and no talk of technological unemployment. There are cults embracing the technological singularity, but we see no evidence that technology is changing at an accelerated pace.

Ok. It is a video game, not a futurology essay.

So how is the game?

The fun part in Deus Ex is that there are many different ways to do your missions. The game invites you to be creative and to have fun. And you need to be a bit clever: Adam Jensen is hardly invincible, he is just one man. You are just not going to take down machine-gun-wielding gangsters in a frontal assault, as you would in other games. You need to look around for an alternate path. You get to hack robots, turrets… knock quite a few heads… steal secret documents… sneak into people’s apartments…

It is the first game in the Deus Ex series that I can recommend. If you have played the previous ones, this one will look familiar… but everything is smoother. The level design was thoroughly tested and finely tuned. I never got stuck. Also, it helps that this game was designed for today’s beefier hardware.

The major downside is that the scenario feels like gibberish. The theme and the mood are well done, but the authors chose to play the “layer-upon-layer mystery” card. There are bad guys pulling the strings, but it all feels overengineered in the end. The final boss is evil alright, but his motivations seem all too complicated. There are too many characters pulling the strings. I chose to ignore all of it while playing.

Other recent games that I have really liked: Uncharted 4 and The Last of Us. The Last of Us is my favorite game of all time.

Who is keeping an eye on the tech companies?

A corporation such as Spotify was founded a few years ago by a 23-year-old man, and it now plays a key role in the music industry. YouTube is now the big-boss of music delivery. Had I predicted these things ten years ago, you would have thought me odd.

Kurzweil, the inventor, popularized the notion of a technological singularity. It is a moment in our history where things move so fast, we can no longer comprehend the world nor make even short-term predictions. Are you game to predict the next ten years of the music industry?

Let us think about the present.

There is an explosion of diversity because it becomes affordable. Spotify can offer millions of songs. This was unthinkable in the past. The net result is that more musicians than ever can be found by more people than ever.

Chris Anderson in 2004 promised us a world where more of us could make a living on the “long tail” (the long stream of less popular products and services) without needing a big break. Is that what happened? I think we still debate it. Recently, a study on the Google Play store found that it is a super-star market where the bulk of the money ends up in the pockets of the few, with most people hardly making a penny. Should we blame the bulk of the players for their bad luck?

Having your data stored in the index is not enough to be found. Being visible is not the same thing as being found. When you use Google to search for something, it can return 20,000 relevant pages, but you are unlikely to look at more than 3 or 4. Tools like Spotify are not different. They use recommender systems or, in other words, AI, to help us find what we are looking for. They think for us.

People don’t always realize that tools such as Google know about you and provide personalized results. These tools are mirrors, necessarily imperfect ones. They are not neutral.

What do I mean by neutral? When using polls to predict election results, we measure what people think, but we also change what people think by publishing the results. A top-10 list in music is not just a mirror, it is also an active agent that changes what people listen to.

The rules of engagement also matter. Our democratic systems are often set up to favor the emergence of two dominant parties. It is the intended effect. It couldn’t be more obvious in the USA. Yet who is setting up the rules in the digital world? Do you know?

The big players like Google or Spotify have access to a remarkable wealth of data. They know who you are, where you are and what you do. They know your age, your race, your sexual orientation. No business could have dreamed of having so much data 30 years ago.

In itself, this data collection is not harmful.

But any complex and powerful system can have unexpected and undesirable effects. For example, “there’s a general tendency for automated decisions to favor those who belong to the statistically dominant groups.”

At a minimum, I think we should study what systems like Google and Spotify do so we can discuss it openly.

Problems don’t necessarily arise because the big players are malicious. Datta et al. (2015) found that online ads for highly paid jobs tended to be shown more often to men than women. It seems that the reason is that younger women are a prized demographic so they get ads from advertisers with deeper pockets instead.

What could we do in concrete terms? We could use volunteers who agree to have some of their Internet interactions monitored. We could also proceed with automated sousveillance, where we create fake accounts to keep an eye on the big players.

We already keep track of journalists, to check whether they are being fair. I think we should spend a bit more time tracking Google, Apple, Amazon, Spotify… This would seem only prudent in such a fast-changing world.

Update to my VR bet with Greg Linden

I have an ongoing bet with Greg Linden stating that we are going to sell 10 million virtual-reality (VR) units per year by 2019.

I have been paying close attention to VR technology and its impact.

What have we learned in the last few months?

  • The technology is fantastic. VR headsets work.
  • The software is currently the weak point. Besides games and “experiences”, we simply have no compelling application. And while there are some good games, none of them is good enough to motivate millions into purchasing a VR headset.
  • A nice surprise: Sony has managed to make VR work with the relatively underpowered PlayStation 4. The initial reports are very positive. I think that’s important: it shows that relatively weak and inexpensive processors are sufficient for VR.

A key indicator will be how many VR headsets Sony manages to sell over Christmas.

Intel will add deep-learning instructions to its processors

Some of the latest Intel processors support the AVX-512 family of vector instructions. These instructions operate on blocks of 512 bits (or 64 bytes). The benefit of such wide instructions is that, even without increasing the processor clock speed, systems can still process a lot more data. Most code today operates on 64-bit words (8 bytes). In theory, keeping everything else constant, you could go 8 times faster by using AVX-512 instructions instead.

Of course, not all code can make use of vector instructions… but that’s not relevant. What matters is whether your “hot code” (where the processor spends much of its time) can benefit from them. In many systems, the hot code is made of tight loops that need to run billions of times. Just the kind of code that can benefit from vectorization!
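To make this concrete, here is a minimal sketch of such a tight loop (the function name is mine). Modern compilers can auto-vectorize it without any source change:

```c
#include <stddef.h>
#include <stdint.h>

// A tight loop of the kind that benefits from vectorization.
// Compiled with, e.g., -O3 -march=skylake-avx512, the compiler can
// turn this into AVX-512 code processing sixteen 32-bit additions
// per instruction instead of one.
void add_arrays(const uint32_t *a, const uint32_t *b,
                uint32_t *out, size_t n) {
    for (size_t i = 0; i < n; i++) {
        out[i] = a[i] + b[i];
    }
}
```

Whether the compiler actually emits AVX-512 instructions depends on the target flags, but the scalar and vectorized versions compute the same result.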

The hottest trend in software right now is “deep learning”. It can be used to classify pictures, recognize speech or play the game of Go. Some say that the quickest “get rich quick” scheme right now is to launch a deep-learning venture, and get bought by one of the big players (Facebook, Google, Apple, Microsoft, Amazon). It is made easier by the fact that companies like Google have open-sourced their code, such as TensorFlow.

Sadly for Intel, it has been mostly left out of the game. Nvidia graphics processors are the standard off-the-shelf approach to running deep-learning code. That’s not to say that Intel lacks good technology. But for the kind of brute-force algebra that’s required by deep learning, Nvidia graphics processors are simply a better fit.

However, Intel is apparently preparing a counter-attack, of sorts. In September of this year, Intel discreetly revealed that its future processors will support dedicated deep-learning instructions. Intel’s AVX-512 family of instructions is decomposed into sub-families. There will be two new sub-families for deep learning: AVX512_4VNNIW and AVX512_4FMAPS.

A case study in the performance cost of abstraction (C++’s std::shuffle)

Statisticians and machine-learning experts sometimes need to shuffle data quickly and repeatedly. There is one standard and simple algorithm to shuffle an array, the so-called Fisher-Yates shuffle:

for (i = size; i > 1; i--) {
  nextpos = random_numbers_in_range(0, i);
  swap(storage[i-1], storage[nextpos]);
}
Not very difficult, is it? The C++ programming language, like many others, has a standard function to solve this problem (std::shuffle).

How does it fare against a very basic Fisher-Yates shuffle without any optimization whatsoever? To make sure that the comparison is fair, let us work with the same data (an array of strings) and use the same random number generator (I chose PCG). To avoid caching issues, let us use a small array that fits in cache.
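For reference, PCG has a well-known minimal generator; this sketch follows the public-domain pcg32 reference implementation (the struct and function names here are mine):

```c
#include <stdint.h>

// Minimal PCG32 generator, after the public-domain reference
// implementation from pcg-random.org (struct and function names are mine).
typedef struct { uint64_t state; uint64_t inc; } pcg32_t;

uint32_t pcg32_random(pcg32_t *rng) {
    uint64_t oldstate = rng->state;
    // Advance the linear congruential state.
    rng->state = oldstate * 6364136223846793005ULL + (rng->inc | 1ULL);
    // Output function: xorshift-high followed by a random rotation.
    uint32_t xorshifted = (uint32_t)(((oldstate >> 18u) ^ oldstate) >> 27u);
    uint32_t rot = (uint32_t)(oldstate >> 59u);
    return (xorshifted >> rot) | (xorshifted << ((-rot) & 31u));
}
```

Seeded identically, two such generators produce the same deterministic sequence, which is exactly what you want for a reproducible benchmark.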

Here is my unoptimized C++ code:

template <class T>
void shuffle(T *storage, uint32_t size) {
    for (uint32_t i = size; i > 1; i--) {
        uint32_t nextpos = pcg32_random_bounded(i);
        std::swap(storage[i - 1], storage[nextpos]);
    }
}
I claim that this is the “textbook” implementation… meaning that it is the closest thing you can get to taking a textbook and coding up the pseudo-code in C++. (Yes I know that people copy code from Wikipedia or StackOverflow, not textbooks, but humor me.)

The pcg32_random_bounded function I call is implemented in a standard (but suboptimal) way: it gets a random number in a range using two integer divisions. You can do it with a single multiplication instead, but let us ignore such optimizations.
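For the curious, the single-multiplication alternative maps a 32-bit random word into [0, range) by keeping the upper half of a 64-bit product. This is only a sketch (the function name is mine, and a production version adds a rejection step to correct the slight bias):

```c
#include <stdint.h>

// Map a 32-bit random word into [0, range) with one multiplication:
// the 64-bit product of random32 and range, shifted right by 32,
// always lands in [0, range). Sketch only: a slight bias remains
// unless a rejection step is added.
uint32_t random_bounded(uint32_t random32, uint32_t range) {
    return (uint32_t)(((uint64_t)random32 * (uint64_t)range) >> 32);
}
```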

Here are my results, expressed in CPU clock cycles per value… (Skylake processor, clang++ 4.0)

technique clock cycles per value
std::shuffle 73
textbook code 29

So the textbook code is more than twice as fast as the standard C++ function.

Why is that?

It is often the case that default random number generators are slow due to concurrency issues. But we provide our own random number generators, so it should not be an issue.

A cursory analysis reveals that the most likely reason for the slowdown is that the standard C++ library tends to use 64-bit arithmetic throughout (on 64-bit systems). I implicitly assume, in my textbook implementation, that you are not going to randomly shuffle arrays containing more than 4 billion elements. I don’t think I am being unreasonable: an array of 4 billion std::string values would use at least 128 GB of RAM. If you need to shuffle that much data, you probably want to parallelize the problem. But, from the point of view of the engineers working on the standard library, they have to work with the requirements set forth by the specification. So 64-bit arithmetic it is! And that’s how I can beat them without any effort.

The absolute numbers might also surprise you. The Fisher-Yates shuffle is very efficient. We do a single pass over the data. We do not really need to look at the data, just move it. We use a fast random number generator (PCG). How can we end up with 30 or 70 cycles per array element?

Part of the problem is our use of std::string. On my system, a single (empty) std::string uses 32 bytes whereas a pointer (char *) uses only 8 bytes. If we fall back on C strings (char *), we can accelerate the processing simply because there is less data to move. Without going overboard with optimizations, I can bring the computational cost to about 7 cycles per element by avoiding divisions and using C strings instead of std::string objects. That’s an order of magnitude faster than the standard shuffle function.

So while std::shuffle is very general, being able to shuffle just about any array using just about any random-number generator… this generality has a cost in terms of performance.

My code is available.

Variable-length strings can be expensive

Much of our software deals with variable-length strings. For example, my name “Daniel” uses six characters whereas my neighbor’s name (“Philippe”) uses eight characters.

“Old” software often shied away from variable-length strings. The Pascal programming language supported fixed-length strings. Most relational databases support fixed-length strings. Flat files with fixed-length lines and fixed-length records were common in the old days.

It seems that people today never worry about variable-length strings. All the recent programming languages I know ignore the concept of fixed-length strings.

Yet fixed-length strings have clear engineering benefits. For example, if I were to store “tweets” (from Twitter) in memory, using a flat 140 characters per tweet, I would be able to support fast direct access without having to go through pointers. Memory allocation would be trivial.
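As a sketch of the idea (the names and the 140-byte record size are purely illustrative): with fixed-length records, the location of element i is a simple computation, with no pointer chasing or per-string allocation.

```c
#include <stddef.h>

// Fixed-length records: tweet i lives at offset i * TWEET_LEN,
// so direct access needs no pointer array and no per-string
// memory allocation. (Names and the 140-byte size are illustrative.)
#define TWEET_LEN 140

const char *get_tweet(const char *buffer, size_t i) {
    return buffer + i * TWEET_LEN;
}
```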

However, maybe processors have gotten so smart, and string operations so optimized, that there is no clear performance benefit to fixed-length strings in today’s software.

I don’t think so.

Even today, it is often possible to accelerate software by replacing variable-length strings by fixed-length ones, when you know ahead of time that all your strings are going to be reasonably short.

To examine this idea, I have created lists of strings made simply of the numbers (as strings) from 0 to 1024. Such strings have lengths ranging between 1 and 4 characters. In C, we use null-terminated strings, so the actual length is between 2 and 5 bytes. In C++, we have standard strings (std::string), with significant overhead: on my system, a single std::string uses 32 bytes, not counting the string content itself.

Instead of using variable-length strings, I can “pad” my strings so that they have a fixed length (8 characters), adding null characters as needed.
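The padding step itself is trivial; here is a minimal sketch (the function name is mine, and strings longer than 8 bytes are simply truncated):

```c
#include <string.h>

// Pad a short C string into an 8-byte fixed-length slot, filling the
// remainder with null characters. An 8-byte slot fits in a single
// machine word, so such strings can be compared with one 64-bit load.
// (Function name is mine; longer strings are truncated in this sketch.)
void pad8(const char *src, char dst[8]) {
    size_t n = strlen(src);
    if (n > 8) n = 8;
    memset(dst, 0, 8);
    memcpy(dst, src, n);
}
```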

How long does it take, in CPU cycles per string, to sort a short array of strings (1024 strings)?

string type CPU cycles per element
C++ std::string 520
standard C string 300
padded C string 140

So for this experiment, replacing variable-length strings by fixed-length strings more than doubles the performance! And my code isn’t even optimized. For example, to keep the comparison “fair”, I sorted pointers to strings… but my padded C strings fit in a machine word and do not require a pointer. So, in fact, fixed-length strings could be nearly three times faster with enough work.

To summarize: Variable-length strings are a convenient abstraction. You may hear that string operations are very cheap, unlikely to be a bottleneck and so forth… That might be so…

But I submit to you that the omnipresence of variable-length strings as a universal default can make us blind to very sweet optimization opportunities.

My source code is available.

Can Swift code call C code without overhead?

Swift is the latest hot new language from Apple. It is becoming the standard programming language on Apple systems.

I complained in a previous post that Swift 3.0 has only about half of Java’s speed in tests that I care about. That’s not great for high-performance programming.

But we do have a language that produces very fast code: the C language.

Many languages like Objective-C, C++, Python and Go allow you to call C code with relative ease. C++ and Objective-C can call C code with no overhead. Go makes it very easy, but the performance overhead is huge. So it is almost never a good idea to call C from Go for performance. Python also suffers from a significant overhead when calling C code, but since native Python is not so fast, it is often a practical idea to rewrite performance-sensitive code in C and call it from Python. Java makes it hard to call C code, so it is usually not even considered.

What about Swift? We know, as per Apple’s requirements, that Swift must interact constantly with legacy Objective-C code. So we know that it must be good. How good is it?

To put it to the test, I decided to call from Swift a simple Fibonacci recurrence function:

void fibo(int *x, int *y) {
  int c = *y;
  *y = *x + *y;
  *x = c;
}
(Note: this function can overflow, and signed overflow is undefined behavior in C.)

How does it fare against pure Swift code?

let c = j;
j = i &+ j;
i = c;

To be clear, this is a really extreme case. You should never rewrite such a tiny piece of code in C for performance. I am intentionally pushing the limits.

I wrote a test that calls these functions 3.2 billion times. The pure Swift version takes 9.6 seconds on a Haswell processor… or about 3 nanoseconds per call. The C function takes a bit over 13 seconds or about 4 nanoseconds per iteration. Ok. But what if I rewrote the whole thing into one C function, called only once? Then it runs in 11 seconds (it is slower than the pure Swift code).

The numbers I have suggest that calling C from Swift is effectively free.

In these tests, I do not pass to Swift any optimization flag. The way you build a Swift program is by typing “swift build”, which is nice and elegant. To optimize the binary, you can type “swift build --configuration release”. Nice! But benchmark code is part of your tests. Sadly, Swift seems to insist on only testing “debug” code for some reason. Typing “swift test --configuration release” fails since the test option does not have a configuration flag. (Calling “swift test -Xswiftc -O” gives me linking errors.)

I rewrote the code using a pure C program, without any Swift. Sure enough, the program runs in about 11 seconds without any optimization flag. This confirms my theory that Swift is testing the code with all optimizations turned off. What if I turn on all C optimizations? Then I go down to 1.7 seconds (or about half a nanosecond per iteration).

So while calling C from Swift is very cheap, ensuring that Swift properly optimizes the code might be trickier.

It seems odd that, by default, Swift runs benchmarks in debug mode. It is not helping programmers who care about performance.

Anyhow, a good way around this problem is to simply build binaries in release mode and measure how long it takes them to run. It is crude, but it gets the job done in this case:

$ swift build --configuration release
$ time ./.build/release/LittleSwiftTest

real       0m2.030s
user       0m2.028s
sys        0m0.000s
$ time ./.build/release/LittleCOverheadTest

real       0m1.778s
user       0m1.776s
sys        0m0.000s

$ clang -Ofast -o purec  code/purec.c
$ time ./purec

real       0m1.747s
user       0m1.744s
sys        0m0.000s

So there is no difference between a straight C program and a Swift program that calls a C function billions of times. They are both just as fast.

The pure Swift program is slightly slower in this case, however. It suggests that using C for performance-sensitive code could be beneficial in a Swift project.

So I have solid evidence that calling C functions from Swift is very cheap. That is very good news. It means that if for whatever reason, Swift is not fast enough for your needs, you stand a very good chance of being able to rely on C instead.

My Swift source code is available (works under Linux and Mac).

Credit: Thanks to Stephen Canon for helping me realize that I could lower the call overhead by calling directly the C function instead of wrapping it first in a tiny Swift function.

Sorting already sorted arrays is much faster?

If you are reading a random textbook on computer science, it is probably going to tell you all about how good sorting algorithms take linearithmic time. To arrive at this result, they count the number of operations. That’s a good model to teach computer science, but working programmers need more sophisticated models of software performance.

On modern superscalar processors, we expect in-memory sorting to be limited by how far ahead the processor can predict where the data will go. Though moving the data in memory is not free, the cost is small if the moves can be done predictably.

We know that sorting “already sorted data” can be done in an easy-to-predict manner (just do nothing). So it should be fast. But how much faster is it than sorting randomly shuffled data?

I decided to run an experiment.

I use arrays containing one million distinct 32-bit integers, and I report the time in CPU cycles per value on a Haswell processor. I wrote my code in C++.

function     sorted data    shuffled data    sorted in reverse
std::sort    38             200              30

For comparison, it takes roughly n log(n) comparisons to sort an array of size n in the worst case with a good algorithm. In my experiment, log(n) is about 20.

The numbers bear out our analysis. Sorting an already-sorted array takes a fraction of the time needed to sort a shuffled array. One could object that sorting already-sorted arrays is fast simply because the data does not have to move much. So I also included arrays sorted in reverse order: every element must travel to the opposite end of the array, yet std::sort is even faster on reversed arrays than on already-sorted ones! Predictability, not data movement, dominates the cost.

(The C++ source code is available. My software includes timsort results if you are interested.)

Swift versus Java: the bitset performance test

I claimed online that the performance of Apple’s Swift was not yet on par with Java. People asked me to back my claim with numbers.

I decided to construct one test based on bitsets. A bitset is a fast data structure to implement sets of integers.

Java comes with its own bitset class called java.util.BitSet. I wrote three tests for it: the time it takes to add a million integers in sequence to a bitset, the time it takes to count how many integers are present in the bitset, and the time it takes to iterate through the integers.

Here are the results in Java:

git clone
cd microbenchmarks
mvn clean install
java -cp target/microbenchmarks-0.0.1-jar-with-dependencies.jar me.lemire.microbenchmarks.bitset.Bitset

Benchmark Mode Samples Score Error Units
m.l.m.b.Bitset.construct avgt 5 0.008 ± 0.002 s/op
m.l.m.b.Bitset.count avgt 5 0.001 ± 0.000 s/op
m.l.m.b.Bitset.iterate avgt 5 0.005 ± 0.001 s/op

So all tests take at most a few milliseconds. Good.

Next I did my best to reproduce the same tests in Swift:

git clone
cd SwiftBitset

swift test -Xswiftc -Ounchecked -s BitsetTests.BitsetTests/testAddPerformance
Test Case '-[BitsetTests.BitsetTests testAddPerformance]' measured [Time, seconds] average: 0.019

swift test  -Xswiftc -Ounchecked -s BitsetTests.BitsetTests/testCountPerformance
Test Case '-[BitsetTests.BitsetTests testCountPerformance]' measured [Time, seconds] average: 0.004

swift test  -Xswiftc -Ounchecked  -s BitsetTests.BitsetTests/testIteratorPerformance
Test Case '-[BitsetTests.BitsetTests testIteratorPerformance]' measured [Time, seconds] average: 0.010

These tests are rough. I got these numbers on my laptop, without even trying to keep various noise factors in check. Notice, however, that I disable bounds checking in Swift, but not in Java, thus giving something of an unfair advantage to Swift.

As is evident, however, SwiftBitset can be two times slower than Java’s BitSet. That is not a small difference.

SwiftBitset is brand new whereas Java’s BitSet underwent thorough review over the years. It is likely that SwiftBitset leaves at least some performance on the table. So we could probably close some of the performance gap with better code.

Nevertheless, it does not make me hopeful regarding Swift’s performance compared to Java, at least for the type of work I care about. Swift may well use less memory, however, which might be important on mobile devices.

Some people seem to claim that Swift gives iOS an advantage over android and its use of Java. I’d like to see the numbers.

Update. I decided to add a Go benchmark and a C benchmark. Here are my results:

language        create    count    iterate
Java’s BitSet   8 ms      1 ms     5 ms
C’s cbitset     11 ms     0 ms     4 ms
SwiftBitset     19 ms     4 ms     10 ms
Go’s bitset     18 ms     3 ms     13 ms

It might seem surprising to some, but Java can be a fast language. (The C implementation suffers from my Mac’s poor “malloc” performance. Results would be vastly different under Linux.)

Note that the release 1.9 of Go improves performance substantially.

My thoughts on Swift

Swift is a new programming language produced by Apple for its iOS devices (primarily the iPhone). It first appeared two years ago and it has been gaining popularity quickly.

Before Swift, Apple programmers were “stuck” with Objective-C. Objective-C is old and hardly ever used outside the Apple ecosystem.

Swift, at least the core language, is fully available on Linux. There are rumours that it should soon become available for Windows.

If you want to build mobile apps, then learning Swift is probably wise. But setting this context aside, how does Swift stand on its own?

To find out, I wrote and published a small Bitset library in Swift 3.0.

  • Java initially compiles to byte code and then uses just-in-time compilation to generate machine code on the fly, based on profiling. Swift compiles directly to machine code once and for all, like Go, Rust, C and C++. In theory, this gives an advantage to Java, since it can compile code with knowledge of how it runs. In particular, this should help Java inline functions that Swift cannot inline, due to Swift’s lack of profile-guided optimization (PGO) by default. However, it also means that Java needs more time and memory when the application starts up. Moreover, ahead-of-time compilation makes performance easier to measure for the developer, which may help people write better and faster Swift code.
  • Like most recent languages (e.g., Rust, Go), Swift 3.0 comes with standard and universal tools to test, build and manage dependencies. In contrast, languages like C, C++ or Java depend on additional tools that are not integrated in the language per se. There is no reason, in 2016, to not include unit testing, benchmarking and dependency management as part of a programming language itself. Swift shines in this respect.
  • Swift feels a lot like Java. It should be easy for Java programmers to learn the language. Swift passes classes by reference and everything else by value, though there is an inout parameter annotation to override the default. The ability to take what would have been a pass-by-reference class in Java and make it a pass-by-value “struct” opens up optimization opportunities in Swift.
  • Java only supports advanced processor instructions in a very limited manner. Go is similarly limited. Swift compiles to optimized machine code with full support for autovectorization like Rust, C and C++.
  • In Java, all strings are immutable. In Swift, strings can be either immutable or mutable. I suspect that this may give Swift a performance advantage in some cases.
  • Swift supports automatic type inference which is meant to give the syntax a “high-level” look compared to Java. I am not entirely convinced that it is actual progress in practice.
  • Swift uses automatic reference counting instead of a more Java-like garbage collection. Presumably, this means fewer long pauses which might be advantageous when latency is a problem (as is the case in some user-facing applications). Hopefully, it should also translate into lower memory usage in most cases. For the programmer, it appears to be more or less transparent.
  • Swift has operator overloading like C++. It might even be more powerful than C++ in the sense that you can create your own operators on the fly.
  • By default, Swift “crashes” when an operation overflows (casting, multiplication, addition…). The intention is noble, but I am not sure crashing applications in production is a good default, especially if it comes with a performance penalty. Swift also “crashes” when you try to allocate too much memory, with apparently no way for the application to recover sanely. Again, I am not sure why this is a good default, though maybe it is.
  • It looks like it is easy to link against C libraries. I built a simple example. Unlike in Go, I suspect that the performance will be good.
  • Available benchmarks so far indicate that Swift is slower than languages like Go and Java, which are themselves slower than C. So Swift might not be a wise choice for high-performance programming.

My verdict? Swift compares favourably with Java. I’d be happy to program in Swift if I needed to build an iOS app.

Would I use Swift for other tasks? There is a lot of talk about using Swift to build web applications. I am not sure.

The rise of dark circuits

The latest iPhone 7 from Apple has more peak computing power than most laptops. Apple pulled this off using a technology called ARM big.LITTLE where half of the processor is only used when high performance is needed; otherwise it remains idle.

That’s hardly the sole example of a processor with parts that remain idle most of the time. For example, all recent desktop Intel processors come with an “Intel processor graphics” that can process video, replace a graphics card and so forth. It uses roughly half the silicon of your core processor but in many PCs where there is either no display or where there is a graphics card, most of this silicon is unused most of the time.

If you stop to think about it, it is somewhat remarkable. Silicon processors have gotten so cheap that we can afford to leave much of the silicon unused.

In contrast, much of the progress in computing has to do with miniaturization. Smaller transistors use less power, are cheaper to mass-produce and can enable processors running at a higher frequency. Yet transistors in the CPU of your computer are already only dozens of atoms in diameter. Intel has thousands of smart engineers, but none of them can make a silicon-based transistor with less than one atom. So we are about to hit a wall… a physical wall. Some would argue that this wall is already upon us. We can create wider processors, processors with fancier instructions, processors with more cores… specialized processors… but we have a really hard time squeezing more conventional performance out of single cores.

You can expect companies like Intel to provide us with more efficient processors the conventional manner (by miniaturizing silicon transistors) up till 2020, and maybe at the extreme limit up till 2025… but then it is game over. We may buy a few extra years by going beyond silicon… but nobody is talking yet about subatomic computing.

I should caution you against excessive pessimism. Currently, for $15, you can buy a Raspberry Pi 3 computer which is probably closer than you imagine to the power of your laptop. In five years, the successor of the Raspberry Pi might still sell for $15 but be just as fast as the iPhone 7… and be faster than most laptops sold today. This means that a $30 light bulb might have the computational power of a small server in today’s terms. So we are not about to run out of computing power… not yet…

Still… where is the next frontier?

We can build 3D processors, to squeeze more transistors into a smaller area… But this only helps you so much if each transistor still uses the same power. We can’t pump more and more power into processors.

You might argue that we can cool chips better or use more powerful batteries… but none of this helps us if we have to grow the energy usage exponentially. Granted, we might be able to heat our homes with computers, at least those of us living in cold regions… but who wants an iPhone that burns through your skin?

How does our brain work despite these limitations? Our neurons are large and we have many of them… far more than we have transistors in any processor. The total computing power of our brain far exceeds that of the most powerful silicon processor ever made… How do we not burst into flames? The secret is that our neurons are not all firing at the same time, billions of times per second.

You might have heard that we only use 10% of our brain. Then you have been told that this is a myth. There is even a Wikipedia page about this “myth”. But it is not a myth. At any one time, you are probably using less than 1% of your brain:

The cost of a single spike is high, and this severely limits, possibly to fewer than 1%, the number of neurons that can be substantially active concurrently. The high cost of spikes requires the brain not only to use representational codes that rely on very few active neurons, but also to allocate its energy resources flexibly among cortical regions according to task demand. (Lennie, 2003)

So, the truth is that you are not even using 10% of your brain… more like 1%… Your brain is in constant power-saving mode.

This, I should add, can make us optimistic about intelligence enhancement technologies. It seems entirely possible to force the brain into a higher level of activity, with the trade-off that it might use more energy and generate more heat. For our ancestors, energy was scarce and the weather could be torrid. We can afford to control our temperature, and we overeat.

But, even so, there is no way you could get half of your neurons firing simultaneously. Our biology could not sustain it. We would go into shock.

It stands to reason that our computers must follow the same pattern. We can build ever larger chips, with densely packed transistors… but most of these circuits must remain inactive most of the time… that’s what they call “dark silicon”. “Dark silicon” assumes that our technology has to be silicon-based, clearly something that may change in the near future, so let us use the term “dark circuits” instead.

Pause to consider: it means that in the near future, you will buy a computer made of circuits that remain mostly inactive most of the time. In fact, we might imagine a law of the sort…

The percentage of dark circuits will double every two years in commodity computers.

That sounds a bit crazy. It means that one day, we might use only 1% of the circuits in our processors at any one time, not unlike our brain. Though it sounds crazy, we will see a first effect of this “law” with the rise of non-volatile memory. Your current computer relies on volatile memory made of transistors that must be constantly “charged” to remain active. As transistors stop shrinking, the energy usage of RAM per byte will plateau. Hence, the energy usage due to memory will start growing exponentially, assuming that the amount of memory in systems grows exponentially. Exponentially growing energy usage is not good. So we will switch, in part or in full, to non-volatile memory; that is an example of “dark circuits”, often called “dark memory”.

You may assume that memory systems in a computer do not use much energy, but by several accounts, they often account for half of the energy usage because moving data is expensive. If we are to have computers with gigantic memory capacities, we cannot keep moving most of the data most of the time.

In this hypothetical future, what might programming look like? You have lots and lots of fast memory. You have lots and lots of efficient circuits capable of various computations. But we must increasingly “budget” our memory transfers and accesses. Moving data takes energy and creates heat. Moreover, though you might have gigantic computational power, you cannot afford to keep it on for long, because you will either run out of energy or overheat your systems.

Programming might start to sound a lot like biology.

Credit: This blog post benefited from an email exchange with Nathan Kurz.