Thoughts on my new laptop (Dell XPS 13 with Windows 10)

I love computers. Unlike many people, who stick to one brand and one operating system, I like to use many different systems. I own several game consoles, several types of tablets and so forth.

My primary laptop for many years has been an inexpensive and light MacBook Air. It is nearly perfect.

I also have a Thinkpad. My first Thinkpad did not even have wifi. I wrote my Ph.D. thesis on a Thinkpad. The one I have by my desk looks like it was made in the 1990s. And using it feels quite retro too! Anyhow, it is not a serious machine for me.

Here are my general requirements:

  • I do not need powerful laptops. For compute intensive tasks, I use remote servers. I do play games, but not on laptops. However, a laptop should be fast enough to run a modern IDE and compile C++ code without falling apart. If I need to run serious computations, I will use a server.
  • I expect my machines to be perfectly silent.
  • I carry around my laptops all the time. If I need a stationary machine, I will use a desktop. So I want my laptops to be lightweight.

I recently acquired a somewhat expensive Dell XPS 13. These are Dell’s flagship laptops, the best of the best. It is a serious machine.


  • Windows 10 is a good operating system. The usability is as good as macOS. This wasn’t always true.
  • The Windows Subsystem for Linux really does work. You have “something” like a bash shell with the common Unix tools running within a hidden directory on your machine.
  • The Dell XPS looks good. The screen is gorgeous and the fonts look good. I must admit that it is much better than my aging MacBook Air.
  • I like that Microsoft is offering a decent and free version of Visual Studio. On weaker machines, Visual Studio 2015 works really poorly, but on the XPS 13, it is fast enough.
  • I really like the work that the GitHub crew has been doing. GitHub Desktop for Windows is great. Throw in the Atom editor and you are good.
  • Though it is not nearly as easy as it should, you can generate Visual-Studio “solution files” with CMake, make it possible to build your macOS/Linux C/C++ code on Windows without too much fuss.

Neither positive nor negative:

  • Once booted up, the XPS 13 feels quite fast. It has a significantly better (and more recent) CPU than my MacBook Air… but I would have a hard time telling them apart.
  • For what I use them for, Cortana and Siri appears to be on par. Both of them can give me the weather. It is rather amazing that we can take for granted speech recognition in 2017.


  • Maybe I am crazy, or maybe my machine was hacked, but it feels like Microsoft is showing me ads?
  • On a MacBook, you just lift the lid and the machine awakes, instantly. On my XPS, I have to enter a passcode, wait several seconds for the machine to “boot up” and I can resume my work. It is annoying when you are used to better.
  • Windows 10 seems to suffer from random Wifi dropouts. I have had something like it on the new laptop: after rebooting, the Wifi might randomly fail to reconnect automatically. It is a mere nuisance, but I find it surprisingly amateurish. How hard can it be to connect to a Wifi network? Why do I have to manually do it?
  • Windows definitively feels like it needs to clean the house. I have a “settings” application along with a “control panel” application. Both appear to have closely related functions, but the layout is drastically different. Windows 10 feels like it was added to Windows 7.
  • I spend a lot of time copying and pasting. The ctrl-c and ctrl-v shortcuts are fine, but Dell put the left ctrl key to the extreme left of the keyboard, beyond the useless “Fn” key. I will probably have to remap the keys at some point, somehow.
  • I find it unacceptably hard to grab a Window and move it or to scroll down a web page. Thankfully, I have a touch screen and I can just put my finger on the window and slide it. It often works fine.
  • Though the Windows Subsystem for Linux works well, it is poorly integrated into Windows. You can’t (or shouldn’t) edit Linux files using your Windows text editor. The two systems run different file systems and there are (apparently) synchronization issues. This is probably fine.
  • The webcam is located at the bottom of the screen so that it will point to the underside of your nose during a video conference. It feels like it was designed by monkeys.
  • There is more plastic than I’d like on the Dell laptop, and I can already see some wear, but I suppose it is fine if you plan on replacing the machine regularly.
  • Soon after getting the machine, I had a bad case of screen flickering and tearing. I found out on posting boards and YouTube that it has been common for years with Dell XPS machines. For a time, I thought that my hardware was broken, but a common recommendation (turning off “Hyper-V”) solved the problem. It was simple enough, but how can it escape quality testing considering that the Dell support forums are filled with people complaining about screen flickering?
  • I can “hear” the machine read and write on disk. I did not know that SSDs could make noise!
  • There is “coil whining” (the computer makes high-pitch sounds). Most of the time (99%), it is unaudible (you have to put your hear to the machine to hear something), but it can randomly become quite audible whether the machine is under load or not.


I like the XPS 13 well enough as a secondary laptop. If I had to use it as a primary laptop, I’d invest in noise-cancelling headphones.

How fast can you count lines?

Inspired by earlier work by Llogiq, I decided to look at how fast I could count the number of lines in a string. By assuming that the string relies on ASCII, UTF-8 or other ASCII superset character encoding, the bulk of the work consists in counting the linefeed (‘\n’) characters.

Most programmers would proceed as follows…

cnt = 0;
 for(i = 0; i < size; i++)
      if(buffer[i] == '\n') cnt ++;
 return cnt;

It runs at about 0.66 cycles per input character on a recent x64 processor.

Your compiler achieves this speed by automatically vectorizing the problem.

Can we do better?

Let us try a vectorized approach using AVX intrinsics. The idea is simple. We use a vectorized counter made of 32 individual counters, initialized at zero. Within a loop, we load 32 bytes, we compare them (using a single instruction) with the linefeed character. We add the result to the running counter.

__m256i cnt = _mm256_setzero_si256(); // 32 counters
// ...
__m256i newdata = // load 32 bytes
__m256i cmp = _mm256_cmpeq_epi8(newline,newdata);
cnt = _mm256_add_epi8(cnt,cmp);

So per blocks of 32 input bytes, we use 3 instructions (one load, one comparison and one addition), plus some overhead. Could we do better? Maybe but I suspect it would be very difficult to use far fewer than 3 instructions per 32 input bytes on current processors.

With appropriate loop unrolling, this vectorized approach runs about ten times faster than the naive approach (0.04 cycles per input byte) or about 1.3 cycles per 32 input bytes. Because x64 processors have a hard time sustaining more than 4 instructions per cycle, we can estimate that our vectorized implementation is probably within a factor of two of optimality.

How does it compare with Llogiq’s best approach? I’d be interested in knowing, but his code is written in Rust, so this makes a direct comparison with my C code somewhat more difficult.

My source code is available.

Note: I have updated my code following a comment by Turbo.

Sorting sorted arrays in Swift

Sorting arrays quickly is a classical computer science problem. It is also a common task worth optimizing.

Sadly, there is no best approach, no silver bullet. Most people rely on whatever their programming language provides as a sorting function, but it is almost always suboptimal.

The new Swift language has made an interesting choice for a sorting function. They went with a variation on the textbook quicksort approach called introsort. Introsort is a popular choice, found in C++ standard libraries as well as in the Microsoft .Net platform. Introsort initially works like quicksort, partitioning the data log(n) times, but it then falls back on heapsort. This makes it a robust choice. That is, it is not possible to design a set of ever larger arrays so that the running time of the algorithm goes up quadratically. You have linearithmic complexity. So far, it would seem that the Swift designers made a safe choice… Except for one thing… A critical step in quicksort, and introsort is effectively just a safe variant of quicksort, is to pick a good “pivot”. Swift’s sort function uses the first value as the pivot.

That’s a somewhat odd choice. The state-of-the-art is to pick the median of the first, last and middle element as pivot (median-of-3). Indeed, the GNU ISO C++ standard library uses a median-of-3 pivot (as per the stl_algo.h header).

The performance of the sorting algorithm is reliant on the pivot splitting the array in two. That is, the pivot should be close to the median of the array.

But what if your array is already sorted? Then the first value of the array is the worst possible pivot!

We can test it out… calling the sort function on an array twice in Swift… the second call is almost assuredly going to take longer…

array.sort() // fast
array.sort() // really slow

Of course, that’s silly… why would you sort an array that’s already sorted? Well.

  • In some instances, developers do not want to take any chances. They get an array from some other part of the code, they expect the array to be sorted, but they can’t be entirely sure. To be safe, they will sort it again.
  • Or maybe the array was sorted but the code had to change a value or two in the array. In these instances, the developer might think that resorting the array ought to be sufficiently cheap.

So even if it sounds silly, actual code often sorts arrays that are nearly sorted. For this reason, Java uses Timsort, a sort function optimized for “almost sorted” data. Swift made the opposite choice: its sort function does poorly on “almost sorted” data.

I think it is useful to know that in Swift, sorting an already sorted array is probably the worse possible scenario. If it happens a lot in your code, then you might want to use a different sort function.

My source code is available.

Update: According to the comments, this was fixed in Swift’s source code and will probably be fixed in future releases.

Montreal researchers “prove” that aging is the result of a genetic program

Unless you are a tree, a lobster, or some other sea creature, you are probably aging… which is another way of saying that beyond adulthood, your fitness decreases while your mortality risk increases over time. Though some pretend that aging is a well-understood process… from a scientific point of view, it largely remains a mystery.

There is plenty of room for debate, but there are generally two different ways to understand aging.

Some believe that aging is simply the result of an evolutionary neglect or compromise. That is, given that animals never get very old in the wild, their genes are simply not geared toward keeping them healthy and fertile for a long time. This is problematic as a theory because slightly older individuals in the wild are affected by aging (a 30-year-old man is less fit than a 25-year-old man) sufficiently to cause their death or prevent them from reproducing. We also know that many of our ancestors did live quite old and chances that many died in part due to the fact that they were weaker due to age. Others believe that aging is the result of some kind of compromise… the body exhausts all its energy on reproduction, and is incapable, as a consequence, of remaining fit. This is also problematic and I have written an entire blog post to explain why this appealing theory does not fit the facts.

There is also another entirely different theory that says that aging serves an evolutionary purpose. It is a population control mechanism. Without aging, the old and strong individuals would dominate, leaving little room from younger people with newer genes… and this would create a less adaptable population more likely go through extinction events. This theory is also problematic because if, in a given population, some people remain fitter and more fertile, their genes would spread more easily and they would come to dominate the population.

Thus, we are left with multiple theories, all of them with apparent disqualifying faults.

Vladimir Titorenko, a biology professor at Concordia University in Montreal, has come up with an experiment that seems to confirm that aging is the result of an evolutionary program. For his demonstration, Titorenko used yeast. Though you may not think of yeast as an animal lifeform, it is a well-established model for our biology. We have more in common with yeast than you may expect. Importantly, yeast ages.

In any case, Titorenko put yeast into some chemical (lithocholic acid) that Titorenko describes as aging-delaying. This created mutants that live five times longer while growing and reproducing just as well as normal yeast if you keep them separate from normal yeast.

In Titorenko’s view, this constitutes a validation of the programmed theory of aging. Indeed, if you are able to reprogram biology to significantly increase lifespan without affecting fitness, then you are “proving” that evolution had the option of much longer lived individuals but somehow selected against them.

We provide evidence that the dominant polygenic trait extending longevity of each of these mutants 1) does not affect such key features of early-life fitness as the exponential growth rate, efficacy of post-exponential growth and fecundity; and 2) enhances such features of early-life fitness as susceptibility to chronic exogenous stresses, and the resistance to apoptotic and liponecrotic forms of programmed cell death. These findings validate evolutionary theories of programmed aging. We also demonstrate that under laboratory conditions that imitate the process of natural selection within an ecosystem, each of these long-lived mutant strains is forced out of the ecosystem by the parental wild-type strain exhibiting shorter lifespan. (Aging 2016)

It is still unclear how the longer-lived individuals lose out to the normal years… but they speculate that normal yeast has some trick to penalize longer-lived individuals. That should be expected: for the programmed theory of aging the work, you need some robust approach to do away with longer-lived individuals (i.e., “cheaters”). In multicellular organisms like human beings, you can count on several builtin mechanisms to limit lifespan… but in something as simple as yeast, you probably need to count on the community because it is hard to build enough safeguards without a single cell.

Interesting times. I am not convinced that Titorenko proved the programmed theory of aging, but I am looking forward to reading his future work.

Further reading : Mitteldorf’s Cracking the Aging Code, Fossel’s The Telomerase Revolution, de Grey’s Ending Aging, Farrelly’s Biologically Modified Justice and Wood’s The Abolition of Aging.

Maps and sets can have quadratic-time performance

Swift is a new programming language launched by Apple slightly over two years ago. Like C and C++, it offers ahead-of-time compilation to native code but with many new modern features. It is available on Linux and macOS.

Like C++, Swift comes complete with its own data structures like dictionaries (key-value or associative maps) and sets. It seems that Apple is simply reusing its Objective-C data structures.

How do you implement dictionaries and sets? We typically use sets. The general idea is simple:

  • We map values to “random” integers, a process called hashing. We call the random integer a “hash value”. The key fact is that whenever we have identical values, they should generate the same hash value. This is usually achieved using a funny-looking function that is carefully crafted to generate numbers that look random.
  • We store the values at the location indicated by the hash value.

This latter step is critical. At high level you can just build a large array and store the values with hash value x at index x. This does not quite work because different values might share the same hash value. You get “collisions”.

How do we resolve this problem? There are two popular techniques. One of them is called chaining. In effect, at each location in the big array, we have a pointer to a container where the values are actually stored. This is called chaining. Many popular languages use chaining (Java, Go…). The theory of chaining is simple and the implementation is not difficult.

In Swift, they use linear probing. Linear probing is also simple to implement: if you try to store a value at an occupied slot, you just move to the next available slot. When you seek a value, you start from the index indicated by the hash value, and move forward until you either find your value or an empty slot.

Suppose that you want to insert a new element, and you get a collision. With chaining, this means that you need to append a value to a container (typically an array). Before you do so, you need to check whether your value is already present in the container… This means that if you repeatedly insert to the same container, you will have quadratic-time complexity. That is, it gets very slow very fast. You have to be quite unlucky for this to happen.

Things get somewhat worse with linear probing. In linear probing, not only can you collide with values that have the same hash value, but you also tend to collide with other values to have nearby hash values! This means that the performance of linear probing can be quite a bit worse than the performance of chaining, in the worst case.

In several languages, such as PHP and JavaScript, maps preserve the “insertion order” which means that when iterating through the values stored, you do so in the order in which they were inserted. So if you first inserted a key having to do with “John” and then one having to do with “Mary”, the map will remember and default in this order. However, Swift does not appear to keep track of the insertion order by default. This creates a performance trap.

Let me illustrate it with code. Suppose that I have a set that stores all values between 1 and some large number. My point does not rely on the data taking this exact form, but it is the simplest example I could think of… (It does not matter that I am using integers!)

var d = Set<Int>()
for i in 1...size {

Next, maybe I have another set with a few values in it…

var dcopy = Set<Int>()

… and then I want to dump all of the big set into the small set like so…

for i in d {

Ah! You have fallen into a performance trap!

You think: “Surely not! I was taught that insertion in a hash table takes constant time…” To this, I object that you were taught no such thing. Or, rather, that you failed to read the small prints!

Basically, in both instances, I am building a hash table starting from nothing. The only difference is the order in which I build the hash table. It turns out that the order in which you insert elements matters.

Let us look at the number. Let us start with a small set made of a million elements… I time the results on my laptop…

  • Build time: 130 ns/element
  • Reinsertion: 670 ns/element

Ok. So reinserting the data from the big set to the small set takes 5 times longer. “No big deal”, you think, “I have CPU cycles to spare.”

Let us grow the set a bit more, to 16 million elements:

  • Build time: 180 ns/element
  • Reinsertion: 3400 ns/element

You see where I am going with this? Roughly speaking, the reinsertion time has quadratic complexity where the build time has linear complexity.

We can run a simulation and count the number of collisions generated on average using either a random insertion, or an adversarial insertion as illustrated by my code example. When we do so, we observe that the number of collisions grows very fast in the adversarial setting.

So the performance issue can be explained entirely due to the fast rising number of collisions with adversarial insertions on larger and larger sets.

Where does the problem come from? When we resize the hash table, we have to reinsert the values into a larger array. Normally, the values would spread out nicely, but our adversarial insertion order is not at all random, and it facilitates expensive collisions.

(As Tim points out in the comments, you can get around this problem by creating Sets having a large capacity. Or you can use a different data structure.)

My source code is available.

Further reading: Rust hash iteration+reinsertion (same problem as I describe, but in Rust)

How expensive are the union and intersection of two unordered_set in C++?

If you are using a modern C++ (C++11 or better), you have access to set data structures (unordered_set) which have the characteristics of a hash set. The standard does not provide us with built-in functions to compute the union and the intersection of such sets, but we can make our own. For example, the union of two sets A and B can be computed as follow:

out.insert(A.begin(), A.end());
out.insert(B.begin(), B.end());

where out is an initially empty set. Because insertion in a set has expected constant-time performance, the computational complexity of this operation is O(size(A) + size(B)) which is optimal.

If you are a computer scientist who does not care about real-world performance, your work is done and you are happy.

But what if you want to build fast software for the real world? How fast are these C++ sets?

I decided to populate two sets with one million integers each, and compute how how many cycles it takes to compute the intersection and the union, and then I divide by 2,000,000 to get the time spent per input element.

intersection (unordered_set)100 cycles/element
union (unordered_set)250 cycles/element

How good or bad is this? Well, we can also take these integers and put them in sorted arrays. Then we can invoke the set_intersection and set_union methods that STL offers.

set_intersection (std::vector)3 cycles/element
set_union (std::vector)5 cycles/element

That’s an order of magnitude better!

So while convenient, C++’s unordered_set can also suffer from a significant performance overhead.

What about std::set which has the performance characteristics of a tree? Let us use code as follows where out is an initially empty set.

std::set_union(A.begin(), A.end(), B.begin(), B.end(),
set_intersection (std::set)150 cycles/element
set_union (std::set)750 cycles/element

As we can see, results are considerably worse.

The lesson is that a simple data structure like that of std::vector can serve us well.

My source code is available.

Deep learning: the silver bullet?

In 2016, we saw a wide range of breakthroughs having to do with artificial intelligence and deep learning in particular. Google, Facebook, and Baidu announced several breakthroughs using deep learning. Google also defeated Go.

Deep learning is one specific class of machine learning algorithms. It has a long history, taking its roots in the earlier days of computer science. However, not all of machine learning is deep learning.

We are in 2017, it is only January, and the breakthroughs in artificial intelligence keep on being announced. These days, we hear about how the best human poker players are being defeated.

In particular, a system from Carnegie Mellon University called Libratus seems to have a lot of successes.

Details are scarce regarding Libratus. What caught my attention was the mention that it used “counterfactual regret minimization”, a conventional machine learning that is not a form of deep learning.

Given all of the hype going to deep learning, I find it almost surprising… are there really still artificial intelligence researchers working on techniques other than deep learning? (I’m being half serious.)

Last year, I participated to a panel on algorithms and their impact on Canadian culture. The organizers retroactively renamed the panel “Are we promoting Canadian content through deep-learning algorithms?” Yet the panel did not address deep learning per se. The successes of deep learning have been so remarkable that “deep learning” has become synonymous with “algorithm”.

I was recently on a graduate scholarship committee… and it seems that every smart young computer scientist is planning to work on deep learning. Maybe I exaggerate a little, but barely. I have seen proposals to apply deep learning to everything, from recognizing cancer cells all the way to tutoring kids.

A similar process is under way in business. If you are a start-up in artificial intelligence, and you are not focused on deep learning, you have to explain yourself.

Of course, machine learning is a vast field with many classes of techniques. However, one almost gets the impression that all of the major problems are going to be solved using deep learning. In fact, some proponents of deep learning have almost made this explicit claim… they often grant that other techniques may work well… on the small problems… but they often stress that for the large problems, deep learning is bound to win out.

We will keep on seeing very hard problems being defeated using various techniques, often unrelated to deep learning. If I am right, this means that these young computer scientists and start-up founders who flock to deep learning should be cautious. They may end up in an overcrowded field, missing out on important breakthroughs happening elsewhere.

It is hard to predict the future. Maybe deep learning is, indeed, the silver bullet and we will soon “solve intelligence” using deep learning… all problems will fall using variations on deep learning… Or it could be that researchers will soon hit diminishing returns. They will need to work harder and harder for ever smaller gains.

There are significant limitations to deep learning. For example, when I reviewed scholarship applications… many of the young computer scientists aiming to solve hard problems with deep learning did not have correspondingly massive data sources. Having an almost infinite supply of data is a luxury few can afford.

I believe that one unstated assumption is that there must be a universal algorithm. There must exist some technique that makes software intelligent in a general way. That is, we can “solve intelligence”… we can build software in a generic way to solve all other problems.

I am skeptical of the notion of general intelligence. Kevin Kelly, in his latest book, suggests that there is no such thing. All intelligence is specialized. Our intelligence “feels” general to us, but that’s an illusion. We think we are good at solving most problems but we are not.

For example, the human brain is showing its limits with advanced mathematics. Given a lot of training and dedication, some of us can write formal proofs without error. However, we are highly inefficient at it. I predict that it is a matter of decade or two before the unassisted human brain is recognized as being obsolete for research in advanced mathematics. The old guy doing mathematics on a blackboard? That’s going to look quaint.

So our brain is no silver bullet with respect to intelligence and thus I don’t believe that any one machine-learning technique can be a silver bullet.

I should qualify my belief. There might be one overarching “algorithm”. Matt Ridley points out in his latest book that everything evolves by a process akin to natural selection. Nature acquires an ever growing bag of tricks that are being constantly refined. In effect, there is an overarching process of trial and error. This is truly general, but with a major trade-off: it is expensive. Our biology evolved but it took all of the Earth ecosystem millions of years to produce homo sapiens. Our technology evolves, but it takes all of the power of human civilization to keep it going. It is also not fool-proof. There are regular extinction events.

Credit: Thanks to Martin Brooks for inspiring this blog post.

Further reading: DeepStack: Expert-Level Artificial Intelligence in No-Limit Poker. Simon Funk’s Welcome to AI. Performance Trends in AI by Sarah.

Resizing arrays can be slow in Swift

Swift a recent high-performance programming language. It is still primarily used develop iOS applications, but it has the potential to be a general-purpose language.

The lack of maturity of the language is still apparent. For example, it is almost trivially easy to create source code that the Swift compiler won’t be able to compile in a reasonable time.

Another pain point I have encountered is the performance of arrays. Arrays are important in programming and you want them to be fast.

Initializing an array is efficient in Swift. The following code…

var z = Array(repeating: 0, count: 1001)
z[0] = 10

… uses about 1.5 cycles per entry.

But arrays in Swift are dynamic: you can grow them. And it is a lot less clear whether Swift is efficient in this case.

Let me illustrate. Suppose that you have an array containing the value 10 and then, afterward, you want to extend it with lots of zeros. This should be nearly as fast as initializing the array in one go…

Irrespective of the programming language, given a dynamic array, I would code it like so…

var z = [Int]() // z in a dynamic array (empty)
 for i in 0..<10000 { 

On a recent Intel processor, this uses up 10 cycles per added value. That’s a lot of cycles just to create an array filled with zeros!

Swift arrays have the reserveCapacity method that we can put to good use to avoid multiple reallocations…

var z = [Int]()
z.reserveCapacity(1000 + 1)
for i in 0..<1000 {

We are now down to 5 cycles per entry… better but still disappointing…

We can get the desired performance by ignoring the fact that Swift has dynamic arrays and just creating a new array instead…

func extendArray(_ x : inout [Int], size: Int) 
         -> [Int] {
   var answer = Array(repeating: 0, count: size)
   for i in 0..<x.count {
     answer[i] = x[i]
   x = answer

I consider these results, if they bear out, as some kind of performance bug. Any dynamic array implementation worth its salt should let you resize it as fast as you create a new array and copy the old result. If you can’t achieve that much… make it clear in the API that your structure is not meant to be regularly resized.

My Swift source code is available.

Update: Joe Groff suggest resizing arrays in this manner: x = Array((0.. but my tests suggests that it is not the fastest way.

How quickly can you remove spaces from a string?

Sometimes programmers want to prune out characters from a string of characters. For example, maybe you want to remove all line-ending characters from a piece of text.

Let me consider the problem where I want to remove all spaces (‘ ‘) and linefeed characters (‘\n’ and ‘\r’).

How would you do it efficiently?

size_t despace(char * bytes, size_t howmany) {
  size_t pos = 0;
  for(size_t i = 0; i < howmany; i++) {
      char c = bytes[i];
      if (c == '\r' || c == '\n' || c == ' ') {
      bytes[pos++] = c;
  return pos;

This code will work on all UTF-8 encoded strings… which is the bulk of the strings found on the Internet if you consider that UTF-8 is a superset of ASCII.

That’s simple and should be fast… I had a blast looking at how various compilers process this code. It ends up being a handful of instructions per processed byte.

But we are processing bytes one by one while our processors have a 64-bit architecture. Can we process the data by units of 64-bit words?

There is a somewhat mysterious bit-twiddling expression that returns true whenever your word contains a zero byte:

(((v)-UINT64_C(0x0101010101010101)) & ~(v)&UINT64_C(0x8080808080808080))

All we need to know is that it works. With this tool, we can write a faster function…

uint64_t mask1 = ~UINT64_C(0) / 255 * (uint64_t)('\r');
uint64_t mask2 = ~UINT64_C(0) / 255 * (uint64_t)('\n');
uint64_t mask3 = ~UINT64_C(0) / 255 * (uint64_t)(' ');

for (; i + 7 < howmany; i += 8) {
    memcpy(&word, bytes + i, sizeof(word));
    uint64_t xor1 = word ^ mask1;
    uint64_t xor2 = word ^ mask2;
    uint64_t xor3 = word ^ mask3;

    if (haszero(xor1) || haszero(xor2) || haszero(xor3)) {
      // check each of the eight bytes by hand?
    } else {
      memmove(bytes + pos, bytes + i, sizeof(word));
      pos += 8;

It is going to be faster as long as most blocks of eight characters do not contain any white space. When this occurs, we are basically copying 64-bit words one after the other, along with a moderately expensive check that our superscalar processors can do quickly.

As it turns out, we can better without using 64-bit words.

// table with zeros at indexes corresponding  to white spaces
int jump_table[256] = { 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
  0, 1, 1, 0, 1,  ...};

size_t faster_despace( char* bytes, size_t howmany ) {
  size_t i = 0, pos = 0;
  while( i < howmany )
    bytes[pos] = bytes[i++];
    pos += jump_table[ (unsigned char)bytes[pos] ];
  return pos;

This approach proposed by Robin Leffmann is very clever, and it is fast because it avoids penalties due to mispredicted branches.

Can we do even better? Sure! Ever since the Pentium 4 (in 2001), we have had 128-bit (SIMD) instructions.

Let us solve the same problem with these nifty 128-bit SSE instructions, using the (ugly?) intel intrinsics…

__m128i spaces = _mm_set1_epi8(' ');
__m128i newline = _mm_set1_epi8('\n');
__m128i carriage = _mm_set1_epi8('\r');
size_t i = 0;
for (; i + 15 < howmany; i += 16) {
    __m128i x = _mm_loadu_si128((const __m128i *)(bytes + i));
    __m128i xspaces = _mm_cmpeq_epi8(x, spaces);
    __m128i xnewline = _mm_cmpeq_epi8(x, newline);
    __m128i xcarriage = _mm_cmpeq_epi8(x, carriage);
    __m128i anywhite = _mm_or_si128(_mm_or_si128(xspaces, 
                xnewline), xcarriage);
    int mask16 = _mm_movemask_epi8(anywhite);
    x = _mm_shuffle_epi8(
          x, _mm_loadu_si128(despace_mask16 + mask16));
    _mm_storeu_si128((__m128i *)(bytes + pos), x);
    pos += 16 - _mm_popcnt_u32(mask16);

The code is fairly straight-forward if you are familiar with SIMD instructions on Intel processors. I have made no effort to optimize it… so it is possible, even likely, that we could make it run faster. My original SIMD code had a branch but Nathan Kurz realized that it was best to simplify the code and remove it.

Let us see how fast it runs!

I designed a benchmark using a recent (Skylake) Intel processor over text entries where only a few characters are white space.

regular code5.5 cycles / byte
using 64-bit words 2.56 cycles/byte
with jump table 1.7 cycles/byte
SIMD (128 bits) code0.39 cycles / byte
memcpy0.08 cycles / byte

So the vectorized code is nearly 14 times faster than the regular code. That’s pretty good.

Yet pruning a few spaces is 5 times slower than copying the data with memcpy. So maybe we can go even faster. How fast could we be?

One hint: Our Intel processors can actually process 256-bit registers (with AVX/AVX2 instructions), so it is possible we could go twice as fast. Sadly, 256-bit SIMD instructions on x64 processors work on two 128-bit independent lanes which make the algorithmic design more painful.

Leffmann’s approach is not as fast as SIMD instructions, but it is more general and portable… and it is still three times faster than the regular code!

My C code is available.

Best programming language for high performance (January 2017)?

I keep hoping that the field of programming language will evolve. I am a bit tired to program in Java and C… I’d like better languages.

I am particularly interested in what I generally call “high-performance programming”. I want to pick languages where I can get the most out of my hardware. It is fine to program in JavaScript and Python, but there comes a time where you tackle large problems on hardware that is barely sufficient.

  • For my purposes, I am going to narrow it down to popular “system programming languages” which means Swift, Rust, Go, C++, C. These are languages that well-suited to build more than applications, but also servers, database engines and so forth. They all offer ahead-of-time compilation, which makes the performance easier to reason about.

    I think there will be no disagreement that C and C++ and system programming languages. Whether Go and Swift qualify is somewhat subjective. Would you write an operating system in Go? I would not. But Wikipedia seems to be happy to consider them all as system programming languages.

  • I am going to ignore assembly because I can’t imagine using it all the time. It is simply not practical unless you are working on a tiny code base (maybe on embedded software).
  • I am going to ignore Rust, D and Nim as they have too few users. Programming is a social activity: if nobody can use your code, it does not matter that it runs fast. I think that they are very interesting languages but until many more people start coding in these languages, I am going to ignore them. Fortran is out for similar reasons: it still has important users, but they are specialized.
  • I am not including non-system languages such as JavaScript, PHP, Matlab, Ruby, R, Python because though they can be fast, they are simply not generally well suited for high performance. They are typically “fast enough” for specific purposes, but they also have hard limitations. For example, JavaScript and Python appear unable to support natively multithreaded execution. (Spawning new processes does not count.)
  • Java and C# are interesting cases. They have been both described and used as “system programming languages”. I think that the evidence is overwhelming that Java can be used to build significant systems that perform very well. However, I think that running in a virtual machine (JVM, CLR) is a hard limitation: you can never be “bare metal”. I lump Scala with Java.
  • There is a countless number of niche languages that I have not mentioned. Most of them will never pick up any amount of users.

Let me run through a few desirable features for high-performance programming along with the current languages that fare well.

  • Operating systems are built in C or C++. Moreover, many important libraries have C interfaces. This means that being able to call C functions with minimal overhead is important in some instance. While Go allows you to call C functions, there is considerable overhead in terms of performance. All other programming languages can call C with minimal overhead.
    • Obvious winners: C, C++
    • Good players: Swift
    • Loser: Go
  • Though this is controversial, I think that a language with manual memory management may have the upper hand. In modern systems, managing memory is often a major overhead. C and C++ have manual memory management. Swift has reference counting which is a form of automated memory management, but with the advantage of being purely deterministic: when your data goes out of scope, the memory is reclaimed. However, reference counting does have an overhead, especially in a multithreaded context. Go has generational garbage collection (like Java or C#) which means that “it stops the world” at some point to reclaim memory, but the Go engineers have optimized for short pauses. That’s a good thing too because “stopping the world” is not desirable. I don’t know which approach I prefer: Swift or Go.
    • Winners: C, C++
    • Losers: Go, Swift
  • Our computers have many cores. Parallel programming is important. We would like our languages to support elegantly multicore programming. I think that Go has clearly the upper hand. Of course, for example, it is possible to do everything Go does from C as far as multicore programming, but not as easily. As far as I can tell, Swift has one standard way to use multiple cores: Grand Central Dispatch. It also has Operation/OperationQueue for higher level multi-threading.  It might be fine, but it does not look on par with what I find in other languages.
    • Special mention: Go
    • Winners : C++, C
    • Loser: Swift
  • For high performance, a good programming language should have “easy to reason about performance”. This means that when looking at a piece of code, you should have a good appreciation for how fast it will run. Honestly, this has gotten very hard over time, forcing us to rely more and more on benchmarking. However, simpler languages like C and Go makes it easier to reason about performance. Simply put, the more expansive the language, the harder it gets to compile it in your head.
    • Winners: C, Go
    • Losers: Swift, C++
  • Our processors have fancy new instructions, like SIMD instructions (AVX2, Neon…). These should be readily available from the programming language. No programming language is easy to program with something like SIMD instructions, but C and C++ clearly come on top. In Go, you can write functions in assembly and call them from the language, but it comes with a discouraging performance overhead. Swift is barely better (there is a simd package, but it only works under macOS).
    • Winners: C, C++
    • Losers: Go, Swift
  • It has been decades and there is still no universal way to manage dependencies in C. There is also no standard build mechanism. Sure, your favorite IDE can make it easier to add a dependency and build your program, but having something built into the language makes life much easier. In Go and Swift, there is a standard, cross-platform way to build programs, run tests and add dependencies.
    • Winners: Go, Swift (special mention for Go)
    • Losers: C, C++

Let me summarize my analysis with a table:

C C++ GoSwift
Call C Yes Yes SlowFair
Bare metal memory Yes Yes No No
Cores Yes Yes Great Maybe (GCD)
Simple syntax Yes No Yes No
Safety Minimal Minimal Yes Yes
SIMD Yes Yes Slow Maybe (macOS only)
Tools None None GreatGood

If I were hired to build high-performance programming, I’d be happy to be “forced” into any one of these languages. They are all fine choices. They are all improving at a fast pace. I think that Go, for example, has improved tremendously since I first learned it.