GNU GCC on x86 does not round floating-point divisions to the nearest value

I know that floating-point arithmetic is a bit crazy on modern computers. For example, floating-point addition is not associative:

0.1+(0.2+0.3) == 0.599999999999999978
(0.1+0.2)+0.3 == 0.600000000000000089

But, at least, this is fairly consistent in my experience. You should simply not assume that fancy properties like associativity hold in the real world.

Today I stumbled on a fun puzzle. Consider the following:

double x = 50178230318.0;
double y = 100000000000.0;
double ratio = x/y;

If God did exist, the variable ratio would be 0.50178230318 and the story would end there. Unfortunately, there is no floating-point number that is exactly 0.50178230318. Instead it falls between the floating-point number 0.501782303179999944 and the floating-point number 0.501782303180000055.

It is important to be a bit more precise. The 64-bit floating-point standard represents numbers using a 53-bit mantissa multiplied by a power of two. So let us spell it out the way the computer sees it:

0.501782303179999944 == 4519653187245114  * 2 ** -53
0.501782303180000055 == 4519653187245115  * 2 ** -53

We have to pick the mantissa 4519653187245114 or the mantissa 4519653187245115. There is no way to represent exactly anything that falls in-between using 64-bit floating-point numbers. So where does 0.50178230318 fall exactly? We have approximately…

0.50178230318 = 4519653187245114.50011795456 * 2 ** -53

So the number is best approximated by the larger of the two values (0.501782303180000055).
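Here is one way to check these numbers, using Python's fractions module for exact rational arithmetic:

```python
import math
from fractions import Fraction

# The exact ratio 50178230318 / 100000000000, as a rational number.
exact = Fraction(50178230318, 100000000000)

# Scale by 2**53 to locate it between the two candidate mantissas.
scaled = exact * 2**53
lower = math.floor(scaled)    # 4519653187245114
print(float(scaled - lower))  # about 0.50011795456: closer to the upper mantissa

# Python parses the literal with round-to-nearest, so it picks the upper value:
print(0.50178230318 == 4519653187245115 * 2.0**-53)  # True
```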

Goldberg in What every computer scientist should know about floating-point arithmetic tells us that rounding must be to the nearest value:

The IEEE standard requires that the result of addition, subtraction, multiplication and division be exactly rounded. That is, the result must be computed exactly and then rounded to the nearest floating-point number (using round to even). (…) One reason for completely specifying the results of arithmetic operations is to improve the portability of software. When a program is moved between two machines and both support IEEE arithmetic, then if any intermediate result differs, it must be because of software bugs, not from differences in arithmetic. Another advantage of precise specification is that it makes it easier to reason about floating-point. Proofs about floating-point are hard enough, without having to deal with multiple cases arising from multiple kinds of arithmetic. Just as integer programs can be proven to be correct, so can floating-point programs, although what is proven in that case is that the rounding error of the result satisfies certain bounds.

Python gets it right:

>>> "%18.18f" % (50178230318.0/100000000000.0)
'0.501782303180000055'

JavaScript gets it right:

> 0.50178230318 == 0.501782303180000055
true
> 0.50178230318 == 0.501782303179999944
false
So the story would end there, right?

Let us spin up the GNU GCC 7 compiler for x86 systems and run the following C/C++ program:

#include <stdio.h>
int main() {
  double x = 50178230318.0;
  double y = 100000000000.0;
  double ratio = x/y;
  printf("x/y  = %18.18f\n", ratio);
}

Can you predict the result?

$ g++ -o round round.cpp
$ ./round
x/y  = 0.501782303179999944

So GNU GCC actually picks the smaller, and thus furthest, value instead of the nearest value.

You may doubt me so I have created a docker-based test.

You might think that it is a bug that should be reported, right?

There are dozens if not hundreds of similar reports to the GNU GCC team. They are being flagged as invalid.

Let me recap: the GNU GCC compiler may round the result of a division between two floating-point numbers to a value that is not the nearest. And it is not considered a bug.

The explanation is that the compiler first rounds to nearest using 80 bits and then rounds again (this is called double rounding). This is what fancy numerical folks call FLT_EVAL_METHOD = 2.
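We can simulate this double rounding with exact rational arithmetic. The sketch below (in Python, with a hand-rolled round-to-nearest, ties-to-even) rounds the exact quotient first to a 64-bit significand, as the x87 unit does, and then to 53 bits; it recovers the smaller value, matching what GCC produces:

```python
from fractions import Fraction

def round_to_bits(x, bits):
    # Round x (assumed to lie in [1/2, 1)) to `bits` significant bits,
    # using round-to-nearest, ties-to-even.
    scaled = x * 2**bits
    lower = scaled.numerator // scaled.denominator
    frac = scaled - lower
    if frac > Fraction(1, 2) or (frac == Fraction(1, 2) and lower % 2 == 1):
        lower += 1
    return Fraction(lower, 2**bits)

exact = Fraction(50178230318, 100000000000)

# Direct rounding to 53 bits, as IEEE 754 requires: the larger candidate.
direct = round_to_bits(exact, 53)
print(direct == Fraction(4519653187245115, 2**53))   # True

# Double rounding: first to the x87's 64-bit significand, then to 53 bits.
# The first rounding lands exactly halfway between two 53-bit values,
# and ties-to-even then rounds down to the smaller candidate.
twice = round_to_bits(round_to_bits(exact, 64), 53)
print(twice == Fraction(4519653187245114, 2**53))    # True
```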

However, the value of FLT_EVAL_METHOD remains at 2 even if you add optimization flags such as -O2, and yet the result will change. The explanation is that the optimizer figures out the solution at compile-time and does so ignoring the FLT_EVAL_METHOD value. Why it is allowed to do so is beyond me.

Maybe you think it does not matter. After all, the value is going to be close, right? However, if you are an experienced programmer, you know the value of having deterministic code. You run the code and you always get the same results. If the results depend, some of the time, on your exact compiler flags, it makes your life much more difficult.

You can also try to pass GNU GCC the flags -msse -mfpmath=sse, as experts recommend, but as my script demonstrates, it does not solve the issue (and then you get FLT_EVAL_METHOD = -1). You need to also add an appropriate target (e.g., -msse -mfpmath=sse -march=pentium4). In effect, when using GNU GCC, you cannot get away from specifying a target. The flags -msse -mfpmath=sse alone will silently fail to help you.

Some people have recommended using other flags to switch the compiler into pc64 mode (-pc64). While it would fix this particular example, it does not fix the general problem of floating-point accuracy. It will just create new problems.

If you are confused as to why all of this could be possible without any of it being a bug, welcome to the club. My conclusion is that you should probably never compile C/C++ using GNU GCC for a generic x86 target. It is broken.

You can examine the assembly on godbolt.

Note: Please run my tests in the specified docker images so that you get the exact same configuration as I do.

Science and Technology links (June 20th 2020)

  1. UCLA researchers have achieved widespread rejuvenation in old mice through blood plasma dilution, a relatively simple process.

    (…) these results establish broad tissues rejuvenation by a single replacement of old blood plasma with physiologic fluid: muscle repair was improved, fibrosis was attenuated, and inhibition of myogenic proliferation was switched to enhancement; liver adiposity and fibrosis were reduced; and hippocampal neurogenesis was increased.(…) These findings are most consistent with the conclusion that the age-altered systemic milieu inhibits the health and repair of multiple tissues in the old mice, and also exerts a dominant progeric effect on the young partners in parabiosis or blood exchange.

    They plan to conduct clinical trials in human beings “soon”.

  2. It used to be that universities would happily pay large sums to private publishers like Elsevier for access to the research articles. The prestigious MIT joins the ranks of the universities who are challenging Elsevier.
  3. Medical doctors follow clinical practice guidelines. In turn, the producers of these guidelines are often funded by the industry, and they fail to disclose it.
  4. The upcoming Sony PlayStation 5 will have a disk with a bandwidth of over 5 GB/s. For comparison, good Apple laptops typically achieve only about 2 GB/s, and older conventional disks are 10 to 20 times slower.

    Our already-low tolerance for slow and unresponsive applications and web sites will fall. Loading screens, loading bars, and similar “make the user wait” strategies will become more and more annoying. We will come to expect application updates to occur in the blink of an eye.

    Programmers used to blame disk and network performance, but these excuses will not hold in the near future. More and more, poor performance will be due to poor software engineering. I gave a talk recently on the topic: data engineering at the speed of your disk (slides).

Update: Someone objected that disks with 6Gb/s bandwidth are already commonplace and have been inexpensive for many years. That is true, but 6Gb/s is 10 times slower than 5 GB/s. Notice that ‘b’ stands for bit whereas ‘B’ stands for byte.

Computational overhead due to Docker under macOS

For my programming work, I tend to assume that I have a Linux environment. That is true whether I am under Windows, under macOS or under a genuine Linux.

How do I emulate Linux wherever I go? I use docker containers. Effectively, the docker container gives me a small subsystem where everything is “as if” I was under Linux.

Containers are a trade-off. They offer a nice sandbox where your code can run, isolated from the rest of your system. However, they also have lower performance. Disk and network access is slower. I expect that it is true wherever you run your containers.

However, part of your computing workload might be entirely computational. If you are training a model or filtering data, you may not be allocating memory, writing to disk or accessing the network. In such cases, my tests suggest that you have pretty much the same performance whether you are running your tasks inside a container, or outside of the container… as long as your host is Linux.

When running docker under Windows or macOS, docker must rely on a virtual machine. Under Windows, it may use VirtualBox or other solutions, depending on your configuration, whereas it appears to use Hyperkit under macOS. These virtual machines are highly efficient, but they still carry an overhead.

Let me benchmark a simple Go program that just repeatedly computes random numbers and compares them with the value 0. It prints out the result at the end.

package main

import (
        "fmt"
        "math/rand"
)

func main() {
        counter := 0
        for n := 0; n < 1000000000; n++ {
                if rand.Int63() == 0 {
                        counter += 1
                }
        }
        fmt.Printf("number of zeros: %d \n", counter)
}

It is deliberately simple. I am going to use Go 1.14 (always).

Under macOS, I get that my program takes 11.7 s to run.

$ go build -o myprogram
$ time ./myprogram
number of zeros: 0

real	0m11.911s
user	0m11.660s
sys	0m0.034s

I am ignoring the “sys” time since I only want the computational time (“user”).

Let me run the same program after starting a docker container (from an ubuntu 20.10 image):

$ go build -o mydockerprogram
$ time ./mydockerprogram
number of zeros: 0

real	0m12.038s
user	0m12.026s
sys	0m0.025s

So my program now takes 12 s, so 3% longer. Observe that my methodology is not fool-proof: I do not know that this 3% slowdown is due to the overhead incurred by docker. However, it bounds the overhead.

Let me do something fun. I am going to start the container and run my program in the container, and then shut it down.

$ time run 'bash -c "time ./mydockerprogram"'
number of zeros: 0

real	0m12.012s
user	0m12.003s
sys	0m0.008s

real	0m12.545s
user	0m0.086s
sys	0m0.041s

It now takes 0.5 s longer. That is the time it takes for me to start a container, do nothing, and then shut it down. Doing it in this manner takes 8% longer than running it natively in macOS.

Of course, if you run many small jobs, the 0.5 s is going to hurt you. It may come to dominate the running time.

If you want to squeeze every ounce of computational performance out of your machine, it is likely that you should avoid the docker overhead under macOS. A 3% overhead may prove to be unacceptable. However, for developing and benchmarking your code, it may well be an acceptable trade-off.

Reusing a thread in C++ for better performance

In a previous post, I measured the time necessary to start a thread, execute a small job and return.

auto mythread = std::thread([] { counter++; });

The answer is thousands of nanoseconds. Importantly, that is the time as measured by the main thread. That is, sending the query, and getting back the result, takes thousands of nanoseconds and thousands of cycles. The work in my case is just incrementing a counter: any task more involved will increase the overall cost. The C++ standard API also provides an async function to call one function and return: it is practically equivalent to starting a new thread and joining it, as I just did.

Creating a new thread each time is fine if you have a large task that needs to run for milliseconds. However, if you have tiny tasks, it won’t do.
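To get a feel for the order of magnitude, here is a rough analogue in Python (not the C++ benchmark itself; Python threads are more expensive still): we time the creation, execution and joining of a thread whose only work is to increment a counter:

```python
import threading
import time

# Measure the round-trip cost of spawning a thread for a tiny task:
# create, start, run (increment a counter), join.
counter = 0

def work():
    global counter
    counter += 1

n = 200
start = time.perf_counter()
for _ in range(n):
    t = threading.Thread(target=work)
    t.start()
    t.join()
elapsed = (time.perf_counter() - start) / n

print(f"{elapsed * 1e9:.0f} ns per start/join")
```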

What else could you do? Instead of creating a thread each time, you could create a single thread. This thread loops and periodically sleeps, waiting to be notified that there is work to be done. I am using the C++11 standard approach.

  std::thread thread = std::thread([this] {
    while (!exiting) {
      std::unique_lock<std::mutex> lock(locking_mutex);
      cond_var.wait(lock, [this]{return has_work||exiting;});
      if (exiting) {
        return;
      }
      // ... do the work ...
      has_work = false;
    }
  });

It should be faster and overall more efficient. You should expect gains ranging from 2x to 5x. If you use a C++ library with thread pools and/or workers, it is likely to adopt such an approach, albeit with more functionality and generality. However, the operating system is in charge of waking up the thread and may not do so immediately so it is not likely to be the fastest approach.
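For illustration, here is the same pattern sketched in Python rather than C++: one long-lived worker thread sleeps on a condition variable until work is submitted. The class and method names are mine, and the "work" is again just incrementing a counter:

```python
import threading

# A long-lived worker thread that sleeps on a condition variable until
# work is submitted, mirroring the C++ mutex/condition-variable approach.
class Worker:
    def __init__(self):
        self.cond = threading.Condition()
        self.has_work = False
        self.exiting = False
        self.counter = 0
        self.thread = threading.Thread(target=self._run)
        self.thread.start()

    def _run(self):
        while True:
            with self.cond:
                self.cond.wait_for(lambda: self.has_work or self.exiting)
                if self.exiting:
                    return
                self.counter += 1          # do the work
                self.has_work = False
                self.cond.notify_all()     # tell the caller we are done

    def submit_and_wait(self):
        with self.cond:
            self.has_work = True
            self.cond.notify_all()
            self.cond.wait_for(lambda: not self.has_work)

    def stop(self):
        with self.cond:
            self.exiting = True
            self.cond.notify_all()
        self.thread.join()

w = Worker()
for _ in range(3):
    w.submit_and_wait()
w.stop()
print(w.counter)  # 3
```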

What else could you do? You could simply avoid system dependencies as much as possible and just loop on an atomic variable. The downside of the tight loop (spinlock) approach is that your thread might fully use the processor while it waits. However, you should expect it to get to work much quicker.

  std::thread thread = std::thread([this] {
    while (true) {
      while (!has_work.load()) {
        if (exiting.load()) { return; }
      }
      // ... do the work ...
      has_work.store(false);
    }
  });

The results will depend crucially on your processor and on your operating system. Let me report the rough numbers I get with an Intel-based Linux box and GNU GCC 8.

new thread each time 9,000 ns
async call 9,000 ns
worker with mutexes 5,000 ns
worker with spinlock 100 ns

My source code is available.

Science and Technology links (June 6th 2020)

  1. A small number of people are responsible for a disproportionate number of inventions and innovations. Why are these people different? Using neuroimaging techniques, scientists find that superior individuals may not have distinct processes per se, but rather they use common processes differently, in a more economical fashion:

    (…) extraordinary creative ability is not the outcome of a unique set of neurocognitive processes; rather, it is associated with the same neural mechanisms that support ordinary creativity, but to a different degree (…). Indeed, our findings would support the argument that similar creative outcomes (…) come about with a less extensive recruitment of brain networks shown to contribute to creative thought (…), which we speculate may allow eminent creators to pursue concurrently, for example, multiple lines of creative thought.

    This suggests that most of us with a healthy brain could potentially become highly creative thinkers. It would seem important to determine whether that is true.

  2. The average female mammal lives about 20% longer than the corresponding male. In human beings, women have only an 8% advantage (men have significantly shorter lives). This difference is not attributed to the riskier behavior of males. Rather, a credible explanation is that males have a weaker immune system.
  3. Sea urchins can regenerate appendages throughout their lives. They come in different subspecies with long and short lives. The end of our chromosomes contains repeated sequences called telomeres. These telomeres get shorter with every cell division, unless they are replenished (e.g., via telomerase). It is often theorized that aging is characterized or explained by telomere shortening. However, in sea urchins, the telomeres do not get shorter with age because of constant telomerase activity. And yet, there are short-lived (i.e. aging) sea urchins.
  4. Scientists have created hair-bearing human skin from stem cells. Though the authors do not stress this possibility, it seems obvious that it could lead to therapies for treating hair loss in human beings:

    Moreover, we show that skin organoids form planar hair-bearing skin when grafted onto nude mice. Together, our results demonstrate that nearly complete skin can self-assemble in vitro and be used to reconstitute skin in vivo.

    This was not tested in human beings. Yet some doctors are optimistic:

    This achievement places us closer to generating a limitless supply of hair follicles that can be transplanted to the scalps of people who have thinning or no hair.

  5. The sun creates skin damage over time and contributes to a particular form of skin aging that is quite visible in some older people. The damage goes deep in the skin and is therefore challenging. Scientists have used stem cells to attempt to reverse the damage. In at least some (human) patients, the results can be characterized as an extensive reversal of the damage deep in the skin.
  6. Our cells need a compound called NAD+ to produce energy. As we age, we tend to have less and less NAD+. A particular disease called mitochondrial myopathy leads to NAD+ deficiency. Scientists found that niacin (an inexpensive supplement) was an efficient NAD+ booster in these patients. (It does not mean that you should take niacin.)

How Innovation Works (book review)

I read How Innovation Works by Matt Ridley in a few hours. It is a delicious book.

Ridley distinguishes invention from innovation. The inventor creates something new, the innovator applies the novelty to change the world. Jeff Bezos innovated with his Amazon site, but he may not be much of an inventor. Scientists often invent new “things” but they often fail to innovate.

Innovation is often quite positive and Ridley attributes to it much of the large gains in wealth and well-being that we have known. So what makes innovation possible, or what causes innovation?

Ridley does not know exactly. However, he identifies some conditions that favour innovation:

  • You need some freedom, the freedom to try new enterprises, the freedom to apply new ideas.
  • You need some wealth. Necessity does not reliably drive innovation according to Ridley.
  • You need the ability to make mistakes and learn quickly from them. Innovation does not happen overnight from the brain of a genius. It is a deeply iterative process.

Thus we get little innovation in the nuclear industry because it is difficult to get approval for a new nuclear power plant. And it is difficult to experiment. However, we get a lot of innovation on the Web where anyone can try to offer a new service and where it is easy to iterate quickly.

What is the role of government in innovation? It is often believed that government drives innovation. However, it does not match the historical record. Until sometime in the XXth century, governments did not have any idea that they could promote innovation. Yet innovation did occur. Furthermore, governments decided to adopt a laissez-faire policy with respect to the Web, which enabled massive innovation.

When you look at how much difficulty the state has even embracing innovation, you cannot believe that it is also the source of our innovation. Ridley, like myself, is pessimistic regarding government interventions like patent protections. We get more innovation when people are free to iterate and reuse ideas. Furthermore, focusing the innovators on patents takes time and money away from innovation.

I am also skeptical of the ability of research grants to spur innovation, even when said grants are tied to industry collaboration. I do think that governments can play a positive role, besides protecting free enterprise and free markets: governments can issue prizes and contracts. For example, NASA may not be able to innovate and get us in space cheaply. However, NASA can offer contracts to space companies. Governments could similarly encourage progress in medicine by giving prizes to the first innovators to reach certain milestones. For example, we could give 10 billion dollars to the first team to stop cognitive decline in Alzheimer’s patients, at a reasonable cost. I get the feeling that Ridley would agree.

Research grants tend to favour the incumbents. Research prizes are more likely to reward innovators.

It is true that the innovators are not rewarded for nearly all of the wealth that their innovations generate… but innovation is frequently a team sport. We may think that only the Wright brothers invented the aeroplane and made it work, leading to all the marvellous innovations… but many people had a hand in their work. They corresponded with enthusiasts who were experimenting with gliders. They had an employee design a very light and powerful engine. It is not clear that, by trying to ensure that innovators get a larger share of the benefits, we get more innovation; and this assumes that the innovators are always fairly identified. Is the first person to patent an idea the sole innovator?

Where will innovation happen in the near future? The answer that seems obvious today, and the one that Ridley goes to, is China. China seems uniquely able to try new things as far as technology and industry go. Ridley is not pleased by this likely outcome given that China lacks democracy. Instead, Ridley believes that we could renew our habit of creating innovation elsewhere if only we choose to.

The Go compiler needs to be smarter

One of my favorite languages is the Go language. I love its simplicity. It is popular and useful in a cloud setting. Many popular tools are written in Go, and for good reasons.

I gave a talk on Go last year and I was asked for a criticism of Go. I do not mind that Go lacks exceptions or generics. These features are often overrated.

However, as charming as Go might be, I find that its compiler is not on par with what I expect from other programming languages. It could be excused when Go was young and immature. But I think that the Go folks now need to tackle these issues.

My first beef with the compiler is that it is shy about inlining. Inlining is the process by which you bring  a function into another function, bypassing the need for a function call. Even when the function call is inexpensive, inlining often brings many benefits.

Go improved its inlining strategies over time, but I would still describe it as “overly shy”.

Let me consider the following example where I first define a function which sums the elements of an array, and then I call this function on an array I just defined:

func sum(x []uint64) uint64 {
    var sum = uint64(0)
    for _, v := range x {
        sum += v
    }
    return sum
}

func fun() uint64 {
    x := []uint64{10, 20, 30}
    return sum(x)
}

Whether you use Rust, Swift, C, C++… you expect a good optimizing compiler to basically inline the call to the ‘sum’ function and then to figure out that the answer can be determined at compile time and to optimize the ‘fun’ function to something trivial.

Not so in Go. It will construct the array and then call the ‘sum’ function.

In practice, it means that if you want good performance in Go, you often have to manually inline your functions. And I mean: fully inline. You have to write a really explicit function if you want the Go compiler to optimize the computation away, like so:

func fun3() uint64 {
    x := [3]uint64{10001, 21, 31}
    return x[0] + x[1] + x[2]
}

My second concern with the Go language is that it has no real concept of a runtime constant. That is, you have compile-time constants, but if you have a variable that is set once in the life of your program and never changes, Go will still treat it as if it could change. The compiler does not help you.

Let us take an example. Go has added nice functions that give you access to fast processor instructions. For example, most x64 processors have a popcnt instruction that gives you the number of 1-bits in a 64-bit word. It used to be that the only way to access this instruction in Go was by writing assembly. That has been resolved. So let us put this code into action:

import "math/bits"

func silly() int {
    return bits.OnesCount64(1) + bits.OnesCount64(2)
}
This function should return 2 since both values provided (1 and 2) have exactly one bit set. I bet that most C/C++ compilers can figure that one out. But we may excuse Go for not getting there.

Go needs to check, before using the popcnt instruction, that the processor supports it. When you start Go, it queries the processor and fills a variable with this knowledge. This could be done at compile-time but then your binary would crash or worse when run on a processor that does not support popcnt.

In a language with just-in-time compilation like Java or C#, the processor is detected at compile-time so no check is needed. In less fanciful languages like C or C++, programmers need to check for themselves what the processor supports.

I can excuse Go for checking that popcnt is supported each and every time that the ‘silly’ function is called. But that is not what Go does. Go checks it twice:

        cmpb    runtime.x86HasPOPCNT(SB), $0
        jeq     silly_pc115
        movl    $1, AX
        popcntq AX, AX
        cmpb    runtime.x86HasPOPCNT(SB), $0
        jeq     silly_pc85
        movl    $2, CX
        popcntq CX, CX

That is because the compiler does not trust, or cannot determine, that the variable ‘runtime.x86HasPOPCNT’ is a runtime constant.

Some people will object that such checks are inexpensive. I think that this view should be challenged:

  • As is apparent in the assembly code I provide, you might be doubling or at least increasing by 50% the number of instructions required. A comparison and a jump is cheap, but so is popcnt (some processors can retire two popcnt per cycle!). Increasing the number of instructions makes code slower.
  • It is true that the branch/jump is likely to be correctly predicted by the processor. This makes the guarding code much cheaper than a branch that could sometimes be mispredicted. But that does not mean that you are not getting hurt:

    Even when it is impossible to remove all branches, reducing the number of branches “almost always taken” or “almost never taken” may help the processor better predict the remaining branches.  (…) A possible simplified explanation for this phenomenon is that processors use the history of recent branches to predict future branches. Uninformative branches may reduce the ability of the processors to make good predictions.

Go’s saving grace is that it makes it easy to integrate assembly code into your code base. So you can write your performance-critical code in C, compile it, and use the result in your Go project. That is how we do it in roaring, for example. People have ported the really fast Stream VByte encoding and the very fast simdjson parser to Go, again by using assembly. It works.

However, it leaves the bulk of the Go software running at a fraction of the performance it could reach with a great optimizing compiler.

Appendix: Compiling Go with gccgo solves these particular problems. However, reportedly, the overall performance of gccgo is worse.

Science and Technology links (May 30th 2020)

  1. We might soon be able to buy memory cards with speeds nearing 4 GB/s. For comparison, an expensive and recent MacBook currently has a disk with a 2 GB/s bandwidth. The PlayStation 5 should have a 5 GB/s bandwidth.
  2. Human industry has boosted the amount of CO2 in the atmosphere. This has two predictable outcomes: slightly higher global temperature (CO2 has a mild greenhouse effect) and higher plant productivity (CO2 acts as a fertilizer). The CO2 fertilization effect is strong: a 30% gain from 1900 in photosynthesis efficiency. Moreover, higher plant productivity translates into more CO2 capture and thus it tends to reduce the quantity of CO2. Haverd et al. report that we may have underestimated the carbon sink effect of CO2 fertilization.
  3. The successful e-commerce firm, Shopify, will allow most of its employees to work remotely in the future.
  4. Human beings may have hunted mammoths by chasing them into predetermined traps.
  5. There is a theory that sending your kids to a more selective school helps them because being exposed to high-achieving peers raises their level. But it seems that this peer effect is a myth. In other words, paying a lot of money to send your kids to an exclusive school is probably a waste. (It does not imply that sending your kids to a dysfunctional school is harmless.)
  6. We should apply with care the principle that extraordinary claims require extraordinary evidence. Indeed, this principle can be used to reject results that violate the human consensus and slow the progress of science. Yet scientific progress is often characterized by a change in the consensus as we go from one realization to another.
  7. We can prevent age-related bone losses in mice by tweaking the content of their blood plasma.
  8. Many recent advances in artificial intelligence do not hold up to scrutiny. This is why you will often hear me dismiss novelty with the phrase “originality is overrated”.

    Kolter says researchers are more motivated to produce a new algorithm and tweak it until it’s state-of-the-art than to tune an existing one. The latter can appear less novel, he notes, making it “much harder to get a paper from.”

    The net result is that researchers tend to overrate novelty and originality. In practice, you often get better results by selecting time-tested approaches and ignoring the hype.

    So, how should one read research, knowing that much of it won’t stand up to scrutiny?

    1. Do not dismiss older research merely because it is older. Do the opposite: focus your energies on older work still in use.
    2. Instead of picking up papers one by one, try to find the underlying themes. In effect, dismiss each individual paper, and instead focus on the recurring themes and effects. If an idea only appears in one paper, it probably can be discarded. If it appears again and again and proves useful, it might be worth knowing.

Mapping an interval of integers to the whole 64-bit range, fairly?

In my blog post A fast alternative to the modulo reduction, I described how one might map 64-bit values to an interval of integers (say from 0 to N) with minimal bias and without using an expensive division. All one needs to do is to compute x * N ÷ 2^64 where ‘÷’ is the integer division. A division by a power of two is just a shift. Such an approach is simple and efficient.
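As a reminder, the forward map is a one-liner. In Python, with its arbitrary-precision integers, a sketch looks like this (in C you would compute the full 128-bit product and keep the high 64 bits):

```python
def fast_reduce(x, N):
    # Map a 64-bit value x to the interval [0, N) without a division:
    # take the high 64 bits of the 128-bit product x * N.
    return (x * N) >> 64

# A value halfway through the 64-bit range lands halfway through [0, N):
print(fast_reduce(2**63, 52))   # 26
assert all(0 <= fast_reduce(x, 52) < 52 for x in (0, 123456789, 2**64 - 1))
```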

Let us consider the opposite problem.

Suppose that I give you integers in the range from 0 (inclusively) to some value N (exclusively). We want to map these values to integers spanning the whole 64-bit range. Obviously, since we only have N distinct values to start with, we cannot expect our function to cover all possible 64-bit values. Nevertheless, we would like to be “as fair as possible”. Let us use only integers.

Let 2^64 ÷ N be the integer division of 2^64 by N. Let 2^64 % N be the remainder of the integer division. Because N is a constant, these are constants as well.

As a first attempt, I might try to map integer x using the function (2^64 ÷ N) * x. It works beautifully when N is a power of two, but, in general, it is a tad crude. In particular, the result of this multiplication can never exceed 2^64 – (2^64 % N), whereas it starts at value 0 when x is 0, so it is biased toward smaller values when (2^64 % N) is non-zero (which is always true when N is not a power of two).

To compensate, we need to add a value that can be as large as 2^64 % N. Mapping integers in the interval from 0 to N to integers in the interval from 0 to 2^64 % N cannot be done with a simple multiplication. We do not want to use divisions in general because they are expensive. However, we can use a shift, that is, a division by a power of two. So let us look for a map of the form (2^64 ÷ N) * x + (u * x) ÷ 2^64 for some unknown value u. We know that x can never exceed N, but that when x reaches N, the value of (u * x) ÷ 2^64 should be close to 2^64 % N. So we set (u * N) ÷ 2^64 to 2^64 % N and solve for u, which gives us that u must be at least (2^64 % N) * 2^64 ÷ N.

The values (2^64 ÷ N) and (2^64 % N) * 2^64 ÷ N need to be computed just once: let us call them m and n respectively. We finally get the function m * x + (n * x) ÷ 2^64. In Python, the code might look as follows…

def high(x, y):
    return (x * y >> 64) % 2**64

def inv(x):
    n = ((2**64) % N) * 2**64 // N
    m = 2**64 // N
    return m * (x + 1) + high(x, n)

How good is this formula? To test it out, I can check how well it inverts the approach that goes in the other direction. That is, if I plug in x * N ÷ 2^64, I would hope to get back x, or something very close, for suitable values of N. That is, I want m * (x * N ÷ 2^64) + (n * (x * N ÷ 2^64))÷2^64 to be almost x. In general, I cannot hope to find x exactly because some information was lost in the process. Yet I expect a lower bound on the error of ceil(2^64 / N). I benchmark my little routine using a probabilistic test and the results are promising. In the worst case, the error can be orders of magnitude larger than the lower bound, but most of the time, my estimated error is close to the lower bound. This suggests that though my approach is not quite suitable to get back x, with a little bit of extra mathematics and a few more instructions, I could probably make it work exactly within a strict error bound.
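As a quick sanity check, here is a self-contained Python sketch that composes the map with its inverse; the constant N is an arbitrary choice of mine for illustration:

```python
N = 1000003  # arbitrary test modulus, not a power of two

m = 2**64 // N
n = ((2**64) % N) * 2**64 // N

def high(x, y):
    # most significant 64 bits of the 128-bit product x*y
    return (x * y) >> 64

def forward(x):
    # map x in [0, 2**64) to [0, N): x * N // 2**64
    return (x * N) >> 64

def inv(x):
    # map x in [0, N) back into the 64-bit range
    return m * (x + 1) + high(x, n)

# for this N, the composition recovers x exactly
for x in range(0, N, 101):
    assert forward(inv(x)) == x
assert inv(N - 1) < 2**64
```

For this particular N the round trip is exact; for other values of N I would expect it to be exact or off by at most a small amount.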

Programming inside a container

I have a small collection of servers, laptops and desktops. My servers were purchased and configured at different times. By design, they have different hardware and software configurations. I have processors from AMD, Intel, Ampere and Rockchip. I have a wide range of Linux distributions, both old and new. I also mostly manage everything myself, with some help from our lab technician for the initial setup.

The net result is that I sometimes end up with very interesting systems that are saddled with old Linux distributions. Reinstalling Linux and keeping it safe and secure is hard. Furthermore, even if I update my Linux distributions carefully, I may end up with a different Linux distribution than my collaborators, with a different compiler and so forth. Installing multiple different compilers on the same Linux distribution is time consuming.

So what can you do instead?

I could run virtual machines. With something like VirtualBox, you can run Linux inside Windows or inside macOS. It is beautiful. But it is also slow and computationally expensive.

You can switch to containers, and Docker specifically, which have much less overhead. Docker is a ubiquitous tool in cloud computing. As a gross characterization, Docker allows you to run Linux inside Linux. It is a sandbox, but a sandbox that runs almost directly on the host. Unlike virtual machines, Docker containers run, in my tests, at “native speed” (bare metal) on computationally intensive tasks. There are reports that system interactions are slower: network connections and disk access are slower. For my purposes, it is fine.

If you must, you can also run Docker under macOS and Windows, though there will then be more overhead, I expect.

The idea of a container approach is to always start from a pristine state. So you define the configuration that your database server needs to have, and you launch it, in this precise state each time. This makes your infrastructure predictable.

It is not as perfect as it sounds. You still critically depend on the quality of the container you start from. Various hacks can be necessary if you need two applications with different requirements to run together in the same image.

Still, containers work well enough that they are basically sustaining our civilization: many cloud-based applications run in containers one way or another.

Containers were built to deploy software into production. Programming inside containers is not directly supported: you will not find much documentation about it and there is simply no business model around it. What do I mean by “programming inside containers”? I mean that I’d like to start a C programming project, decide that I will use the Linux Ubuntu 16.10 distribution, and that I will compile and run my code under Linux Ubuntu 16.10, even though my server might be running a totally different Linux distribution (or might be under macOS).

The first problem is that your disk and the disk of the container built from the image are distinct. A running image does not have free access to the underlying server (the host). Remember that it is a sandbox.

So you can do all of your work inside the image. However, remember that the point of container technology is to always start from a pristine state. If you load up an image, do some work, and leave… your work is gone. Images are immutable by design. It is a great thing: you cannot easily mess up an image by tinkering with it accidentally.

You can, after doing some work inside an image, take a snapshot of the new state, commit it and create a new image from which you would start again. It is complicated and not practical.

What else could you do? What you can do instead is keep the image stateless, as images are meant to be. The image will only contain the compiler and the build tools. There is no reason to change any of these tools. You will have all of your code in a directory, as you would normally do. To run and compile code, you will enter the image and run your commands. You can bind the repository from the host disk to the image just as you enter it.

This works much better, but there are glitches if you issue your Docker command lines directly:

  1. Depending on how Docker is configured on your machine, you may find that you are unable to read or write the disk bound to the image from within the image. A quick fix is to run the image with privileged access, but that is normally frowned upon (and unnecessary).
  2. The files that you create or modify from within the Docker image will appear on the host disk, often with strange file permissions. For example, maybe all of the files are owned by the root user. I had a research assistant who had a good workaround: he ran Linux as root all the time. I do not recommend such a strategy.

These glitches come from the strange way in which Docker deals with permissions and security. Contrary to what you may have read, it is not a simple matter of setting user and group identifiers: it may be sufficient on some systems but not on systems supporting Security-Enhanced Linux which require additional care.

And finally, you need to remember lots of complicated commands. If you are anything like me, you would rather not have to think about Docker. You want to focus all of your attention on the code.

So the solution is to use a little script. In my case I use a bash script. You can find it on GitHub. It handles messy commands and file permissions for you.

For years, I tried to avoid having to rely on a script, but it is simply unavoidable if you want to work productively.

Basically, I copy two files at the root of the directory where I want to work (Dockerfile and run), and then I type:

./run bash

And that is all. I am now in a subshell, inside the host directory. I can run programs, compile them. I have complete access to a recent Ubuntu distribution. This works even under the ARM-based servers that I have.
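The Dockerfile itself can stay tiny. Here is a hypothetical example; the distribution tag and the package list are illustrative, not the contents of my actual files:

```dockerfile
# Illustrative only: pin the distribution, install the build tools, nothing else.
FROM ubuntu:16.10
RUN apt-get update && apt-get install -y build-essential git
```

Because the image only holds tools, it never needs to change as your code evolves; your source files stay on the host.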

The run script can take other commands as well, so I can use it as part of other scripts.

Science and Technology links (May 16th 2020)

  1. Most of our processors, whether in our PCs or mobile phones, are 64-bit processors. In the case of your PC, it has been so for a couple of decades. Unfortunately, we have been stuck with 32-bit operating systems for a long time. Apple has stopped supporting 32-bit applications in the most recent version of its macOS operating system and 32-bit applications have stopped running on the iPhone quite some time ago. Microsoft will now stop providing 32-bit versions of its operating system.
  2. Moths transport pollen at night and play an important role as pollinators.
  3. If you transplant fecal microbes from old to young rats, you age the young rats. We have solid evidence that blood transfusions from old mammals to young mammals will age the young recipient (Nature, 2016). The other direction, evidently more useful, is achievable in vitro, but harder to achieve in actual organisms. In what might be a breakthrough, Horvath et al. report system-wide rejuvenation in rats using plasma (blood) transfusion: the treatment more than halved the epigenetic ages of blood, heart, and liver tissue. Details are missing and we should reserve judgement. Horvath is a well regarded scientist from UCLA.

Encoding binary in ASCII very fast

In software, we typically work with binary values. That is, we have arbitrary streams of bytes. To encode these arbitrary streams of bytes in standard formats like email, HTML, XML, JSON, we often need to convert them to a standard encoding like base64. You can encode and decode base64 very quickly.

But what if you do not care for standards and just want to go fast and have simple code? Maybe all you care about is that the string is ASCII. That is, it must be a stream of bytes with the most significant bit of each byte set to zero. In such a case, you want to convert any 7-byte value into an 8-byte ASCII value, and back.

I can do it in about five instructions (not counting stores and moves) both ways in standard C: five instructions to encode, and five instructions to decode. It is less than one instruction per byte. I could not convince myself that my solution is optimal.

// converts any value in [0, 2**56) into a value that
// could pass as ASCII (every 8th bit is zero);
// can be inverted with convert_from_ascii
uint64_t convert_to_ascii(uint64_t x) {
  return ((0x2040810204081 * (x & 0x80808080808080)) 
         & 0xff00000000000000) +
         (x & 0x7f7f7f7f7f7f7f);
}

// converts any 8 ASCII chars into an integer in [0, 2**56);
// this inverts convert_to_ascii
uint64_t convert_from_ascii(uint64_t x) {
  return ((0x102040810204080 * (x >> 56)) 
         & 0x8080808080808080) +
         (x & 0xffffffffffffff);
}

Under recent Intel processors, the pdep/pext instructions may serve the same purpose. However, they are slow under AMD processors and there is no counterpart under ARM.
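The two routines transcribe directly to Python, which makes it easy to check the round trip (a sketch for illustration; the C functions above are the intended implementation):

```python
def convert_to_ascii(x):
    # gather the top bit of each of the 7 bytes into the most significant byte
    return ((0x2040810204081 * (x & 0x80808080808080))
            & 0xff00000000000000) + (x & 0x7f7f7f7f7f7f7f)

def convert_from_ascii(x):
    # scatter the bits of the most significant byte back into the 7 bytes
    return ((0x102040810204080 * (x >> 56))
            & 0x8080808080808080) + (x & 0xffffffffffffff)

for x in [0, 1, 0x80, 0xdeadbeefcafe, 2**56 - 1]:
    encoded = convert_to_ascii(x)
    # every byte of the encoded value passes as ASCII (top bit clear)
    assert all((encoded >> (8 * i)) & 0x80 == 0 for i in range(8))
    assert convert_from_ascii(encoded) == x
```

The multiplications act as parallel shifts: the first gathers the seven high bits into the top byte, the second scatters them back.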

Science and Technology links (May 2nd 2020)

  1. As we age, we tend to produce less of NAD+, an essential chemical compound for our bodies. We can restore youthful levels of NAD+ by using freely available supplements such as NMN. It is believed that such supplements have an anti-aging or rejuvenating effect. Some scientists believe that restoring NAD+ might make the powerplants (mitochondria) of your cells youthful again. A recent paper shows that we can reverse cognitive aging in mice using NMN. The treated mice were more agile, comparable to young mice. The paper suggests that it might be beneficial for some neurodegenerative diseases. (Note: I caution you that many things that work in mice may not work in human beings.)
  2. While we reject at a visceral level discrimination based on ethnicity, sexual orientation or gender, we are collectively happy to discriminate against older people. Nobody would think it is acceptable to fire women when they become pregnant, but it is acceptable to terminate someone’s employment because they are too old. Professor Paul Ewart from Oxford University fought his employer in court because he was forced to leave his post. He is now 73 years old. I do not know Ewart, but he was a co-author of over 10 papers in 2019 and he still held research grants, so I take it that he is still active in research. Ewart wrote:

    It cannot be right to dismiss active physicists, or indeed any productive academic, at some arbitrary age. It is traumatic to be forcibly retired, especially when one’s work is still in full swing and there are new ideas to be explored. The “emeritus” status offered by Oxford, instead of full employment, is of no use to experimental scientists who need a research team and principal-investigator status to apply for their own research funding. It is simply ageism that underlies many of the arguments used to justify mandatory retirement. Age is often used as a proxy for competence and this lazy stereotype feeds off the myth that scientists have their best ideas when they are young. Indeed, studies have shown that a scientist’s most impactful work can occur at any stage in their career. (…) It is ageism that sees a 40-year-old as “filling” a post whereas a 65-year-old is “blocking” a post. Apart from providing the dignity and fulfilment of employment, there are general imperatives for people to work longer, including the growing pension burden and increases in life expectancy. Recent studies by the World Health Organization also highlight the physical and mental health benefits of working longer. (…) Ageism is endemic in our society and attitudes persist that would be recognized as shameful if they related to race, religion or sexual orientation. The University of Oxford’s claim that dismissing older academics is necessary to maintain its high standards is simply ageism, implying that academic performance deteriorates with age. If retirement policies are to be truly evidence-based, they need to be justified by reasoned estimates of proportionality that are consistent with the available data.

    Ewart showed, both with a model and hard data, that forcing older people to go only had marginal benefits for younger people and for gender diversity. It can be fair for an employer to review the performance of its staff and to let people go when they fall below a productivity threshold. You may object that it is warranted to let 73-year-olds go since they are, maybe, less productive than younger folks. But on the same account, would we let young female professors who choose to have young children go if we could establish, statistically, that female professors with young children are less productive? Of course, we would not. Our policies against older people are rooted in prejudice. On the basis of reason, we should encourage older people who are productive to keep working as long as possible. So what happened? Ewart went to court and won. How did Oxford respond? They are considering an appeal. No doubt, they will have to argue that forcing older productive people to leave is good…

  3. Economists tell us that our economies do not grow as fast as they once did and thus, we are getting richer (or less poor) at a slower rate than what might have been expected a few decades ago. There is debate regarding these numbers since it is hard to tell how much Facebook, Zoom and the iPhone XR would be worth to someone in 1980. What might be causing this stagnation? If we believe that the problem lies with economics, then financial and fiscal tools might be the solution. Ramey argues for a different cause: she suggests that we might be in a technological lull. If she is correct, then the fix might be to encourage more technological innovation rather than jump at fiscal or financial interventions.
  4. Most of the cells in our bodies can only divide so many times. This is known as the Hayflick limit and was once argued to fundamentally limit the lifespan of animals. However, we now know that stem cells, the cells that produce our specialized cells, do not face such a limit normally. The Hayflick limit comes from the fact that with each division, the tail end of our chromosome (the telomeres) get shorter. However, stem cells benefit from telomerase and thus may replenish their telomeres. Yet there are diseases where stem cells lose this ability to restore their telomeres. Researchers have shown that we can restore telomere lengths in such cases, in mice.
  5. As we age, we accumulate “senescent cells”. They are large and dysfunctional cells that have often reached the Hayflick limit. They should normally die, but some of them stick around. Removing them is believed to form the basis for a rejuvenation therapy. The difficulty is to find effective compounds that kill senescent cells but nothing else. A paper in Nature presents a new approach that is apparently safe and selective, in both human beings and mice. In old mice, the therapy restored physical function and reduced age-related inflammation.

For case-insensitive string comparisons, avoid char-by-char functions

Sometimes we need to compare strings in a case-insensitive manner. For example, you might want ‘abc’ and ‘ABC’ to be considered equal. It is a well-defined problem for ASCII strings. In C/C++, there are basically two common approaches. You can do whole string comparisons:

bool isequal = (strncasecmp(string1, string2, N) == 0);

Or you can do character-by-character comparisons, mapping each and every character to a lower-case version and comparing that:

bool isequal{true};
for (size_t i = 0; i < N; i++) {
  if (tolower(string1[i]) != tolower(string2[i])) {
    isequal = false;
    break;
  }
}

Intuitively, the second version is worse because it requires more code. We might also expect it to be slower. How much slower? I wrote a quick benchmark to test it out:

            Linux/GNU GCC   macOS/LLVM
strncasecmp 0.15 ns/byte    1 ns/byte
tolower     4.5 ns/byte     4.0 ns/byte

I got the first column of results with GNU GCC under Linux, and the second column on a different machine running macOS.

So for sizeable strings, the character-by-character approach might be 4 to 40 times slower! Results will vary depending on your standard library and on the time of day. However, in all my tests, strncasecmp is always substantially faster. (Note: Microsoft provides similar functions under different names, see _strnicmp for example.)

Could you go faster? I tried rolling my own and it runs at about 0.3 ns/byte. So it is faster than the competition under macOS, but slower under Linux. I suspect that the standard library under Linux must rely on a vectorized implementation which might explain how it beats me by a factor of two.

I bet that if we use vectorization, we can beat the standard libraries.


My code is available.

Sampling efficiently from groups

Suppose that you have to sample a student at random in a school. However, you cannot go into a classroom and just pick a student. All you are allowed to do is to pick a classroom, and then let the teacher pick a student at random. The student is then removed from the classroom. You may then have to pick a student at random, again, but with a classroom that is now missing a student. Importantly, I care about the order in which the students were picked. But I also have a limited memory: there are too many students, and I can’t remember too many. I can barely remember how many students I have per classroom.

If the classrooms contain a variable number of students, you cannot just pick classrooms at random. If you do so, students in classrooms with lots of students are going to be less likely to be picked.

So instead, you can do it as follows. If there are N students in total, pick a number x between 1 and N. If the number x is smaller or equal to the number of students in the first classroom, pick the first classroom and stop. If the number is smaller or equal to the number of students in the first two classrooms, pick the second classroom, and so forth. If you remove the student, you have to account for the fact that the sum (N) is reduced and the number of students in the classroom is being updated.

    while(N > 0) {
        int y = r.nextInt(sum); // random integer in [0,sum)
        int runningsum = 0;
        int z = 0;
        for(; z < runninghisto.length; z++) {
            runningsum += runninghisto[z];
            if(y < runningsum) { break; }
        }
        // found z
        runninghisto[z] -= 1;
        N -= 1;
    }

If you have many classrooms, this naive algorithm can be expensive: each time you must pick a classroom, you may need to iterate over all classroom counts. If you have K classrooms, you may need on the order of N K operations to pick all N students.

Importantly, the problem is dynamic: the number of students in each classroom is not fixed. You cannot simply precompute some kind of map once and for all.

Can you do better than the naive algorithm? I can craft a “logarithmic” algorithm, one where the number of operations is proportional to the logarithm of the number of classrooms. I believe that my algorithm is efficient in practice.

So divide your classrooms into two sets holding R and S students respectively, so that R + S = N. Pick an integer in [0,N). If it is smaller than R, decrement R and pick a classroom in the corresponding set. If it is larger than or equal to R, go with the other set of classrooms. So we have cut the size of the problem in two. We can repeat the same trick recursively: divide the two subsets of classrooms, and so forth. Instead of merely maintaining the total number of students over all of the classrooms, you keep track of the counts in a tree-like manner.

We can operationalize this strategy efficiently in software:

  • Build an array containing the number students in each classroom.
  • Replace every two counts by the sum of two counts: instead of storing the number of students in second classroom, store the number of students in the first two classrooms.
  • Replace every four counts by the sum of the four counts: where the count of the fourth class was, store the sum of the number of students in the first four classrooms.
  • And so forth.

In Java, it might look like this:

int level = 1;
for(; (1<<level) < runninghisto.length; level++) {
  for(int z = 0; 
      z + (1<<level) < runninghisto.length; 
      z += 2*(1<<level)) {
    runninghisto[z + (1<<level)] += runninghisto[z];
  }
}

You effectively have a tree of prefixes which you can then use to find the right classroom without visiting all of the classroom counts. You visit a tree with a height that is the logarithm of the number of classrooms. At each level, you may have to decrement a counter.

        while(N > 0) {
            int y = r.nextInt(sum);
            // select in logarithmic time
            level = maxlevel;
            int offset = 0;
            for(; level >= 0; level -= 1) {
                if(y >= runninghisto[offset + (1<<level)]) {
                    runninghisto[offset + (1<<level)] -= 1;
                    offset += (1<<level);
                }
            }
            // found offset
            N -= 1;
        }

I have written a Java demonstration of this idea. For simplicity, I assume that the number of classrooms is a power of two, but this limitation could be lifted by doing a bit more work. Instead of N K operations, you can now do only N log K operations.

Importantly, the second approach uses no more memory than the first one. All you need is enough memory to maintain the histogram.

My Java code includes a benchmark.

Update. If you are willing to use batch processing and a little bit of extra memory, you might be able to do even better in some cases. Go back to the original algorithm. Instead of picking one integer in [0,N), pick C of them for some number C. Then find out in which room each of these C integers fits, as before, decrementing the counts. If you pick the C integers in sorted order, a simple sequential scan should be enough. For C small compared to N, you can pick the C integers in linear time using efficient reservoir sampling. Then you can shuffle the result using Knuth’s random shuffle, in linear time. So you get to do about N K / C operations. (Credit: Travis Downs and Richard Startin.)
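The batched idea can be sketched in Python; this is my own illustration (the counts array plays the role of the classrooms), not code from the credited authors:

```python
import random

def sample_batch(counts, C):
    # Draw C distinct student indices, sort them, resolve them all in one
    # sequential scan over the classroom counts, then shuffle to restore
    # a random order.
    N = sum(counts)
    picks = sorted(random.sample(range(N), C))  # C distinct indices in [0, N)
    rooms = []
    prefix, z = 0, 0
    for p in picks:
        while prefix + counts[z] <= p:  # advance to the classroom holding p
            prefix += counts[z]
            z += 1
        rooms.append(z)
    for z in rooms:
        counts[z] -= 1  # the sampled students leave their classrooms
    random.shuffle(rooms)  # linear-time random shuffle for a random order
    return rooms

counts = [3, 1, 4, 2]
sampled = sample_batch(counts, 5)
assert len(sampled) == 5 and sum(counts) == 5
```

Because the C indices are distinct, this is equivalent to removing C distinct students, and the final shuffle recovers a random order.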

Further reading: Hierarchical bin buffering: Online local moments for dynamic external memory arrays, ACM Transactions on Algorithms.

Follow-up: Sampling Whisky by Richard Startin

Science and Technology links (April 25th 2020)

  1. People’s muscles tend to become weaker with age, a process called sarcopenia. It appears that eating more fruits and vegetables is associated with far lower risks of sarcopenia. Unfortunately, it is merely an association: it does not mean that if you eat more fruits and vegetables, you are at lower risk for sarcopenia. We just do not know.
  2. In human beings, males are more diverse than females with respect to many attributes. Just look at height: it is common that, in a classroom, both the shortest and tallest individual are boys. In chimpanzees, researchers report that males have greater brain structure diversity.
  3. Scientists have created “healing patches” for hearts. They greatly accelerate heart repair following a heart attack, in mice and pigs. It is made out of heart pig tissue, with the pig cells removed. The patches are believed to be safe, they can be frozen for later use.
  4. Many people lose some of their hair over time. About half of all men go bald over time. We do not quite understand the process. New research suggests that hair loss may result from the loss of stem cells in the skin. Stem cells are “progenitor cells” able to produce new specialized cells. The research suggests that stem cell therapies could be beneficial to fight hair loss.
  5. It is commonly believed that evolution made us as fit as possible. In this model, aging is the inevitable result of entropy. I do not believe this entropic model. In simulations, researchers have shown that a population of worms might be fitter if its individuals age and die. This should not surprise us: annual plants have evolved out of perennial plants. Your garden flowers can often easily be kept alive indefinitely even though they are annuals. Your plants age and die on schedule because it is advantageous for the species. Animals of a similar size have frequently vastly different lifespans… naked mole rats are not subject to Gompertz curves (they are effectively ageless).

Rounding integers to even, efficiently

When dividing a numerator n by a divisor d, most programming languages round “down”. It means that 1/2 is 0. Mathematicians will insist that 1/2 is 0.5 and claim that you are really computing floor(1/2). But let me think like a programmer. So 3/2 is 1.

If you always want to round up, you can instead compute (n + d – 1)/d. If I apply it to n is 3 and d is 2, I do (3 + 2 – 1)/2 = 2.

We sometimes want to round the value to the nearest integer. So 1.4 becomes 1, 1.6 becomes 2. But what if you have 1.5 (the result of 3/2)? You can either round up or round down.

  • You can round “up” when hitting the midpoint with (n + d/2)/d.
  • You can round “down” when hitting the midpoint with (n + (d – 1)/2)/d.

But there is a third common way to round. You want to round to even. It means that when you hit the midpoint, you round to the nearest even integer. So 1.5 becomes 2, 2.5 becomes 2, 3.5 becomes 4 and so forth. I am sure that it has been worked out before, but I could not find an example so I rolled my own:

off = (n + d / 2);
roundup = off / d;
ismultiple = ((off % d) == 0);
iseven = (d & 1) == 0;
return (ismultiple && iseven) ? roundup - (roundup & 1) : roundup;

Though there is a comparison and what appears like a branch, I expect most compilers to produce essentially branchless code. The result should be about five instructions on most hardware. I expect that it will be faster than converting the result to a floating-point number, rounding it up and converting the resulting floating-point number back.

I have posted working source code.

Why does it work?

Firstly, you never have to worry about hitting the midpoint if the divisor is odd. The remainder of a division by an odd number cannot be equal to d/2. Thus, starting from the round-up approach (n + d/2)/d, we only need to correct the result. When n is equal to d + d/2, we need to round up, when it is equal to 2d + d/2 we need to round down and so forth. So we only need to remove 1 from (n + d/2)/d when n is equal to 2kd + d/2 for some integer k. But when n is equal to 2kd + d/2, I have that n + d/2 is equal to (2k+1)d. That is, the quotient is odd.

Even better: You can make it explicitly branchless (credit: Falk Hüffner):

off = (n + d / 2);
roundup = off / d;
ismultiple = ((off % d) == 0);
return (d | (roundup ^ ismultiple)) & roundup;

Nitpicking: You may be concerned with overflows. Indeed, if both n and d are large, it is possible for n + d/2 to exceed the allowable range. However, the above functions are used in the Linux kernel. And you can make them more robust at the expense of a little bit more work.
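As a sanity check, here is a Python transcription of both routines, compared against Python’s own round-half-to-even on exact fractions (my illustration; the C snippets above are the real thing):

```python
from fractions import Fraction

def round_even(n, d):
    off = n + d // 2
    roundup = off // d
    ismultiple = (off % d) == 0
    iseven = (d & 1) == 0
    return roundup - (roundup & 1) if (ismultiple and iseven) else roundup

def round_even_branchless(n, d):
    off = n + d // 2
    roundup = off // d
    ismultiple = int((off % d) == 0)
    return (d | (roundup ^ ismultiple)) & roundup

# round() on a Fraction rounds halves to the nearest even integer
for n in range(0, 200):
    for d in range(1, 50):
        expected = round(Fraction(n, d))
        assert round_even(n, d) == expected
        assert round_even_branchless(n, d) == expected
```

The exhaustive check over small values covers both odd and even divisors, including all the midpoint cases.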

Credit: This blog post was motivated by an email exchange with Colin Bartlett.

Science and Technology links (April 11th 2020)

  1. Greenland sharks reach their sexual maturity when they are 150 years old and they live hundreds of years. Some sharks alive today were born in the 16th century.
  2. Only 5% of the general population rates themselves as below average in intelligence. And they overestimate the intelligence of their mate even more.
  3. Fresh soil has a distinctive smell which may be caused by bacteria that are trying to attract specific animals.
  4. Earth’s mantle is kept warm due to radioactivity. So we are basically living on a warm radioactive rock.
  5. Employees who work from home can be noticeably more productive. My own estimate is that I am about 20% more productive when working from home. However, social interactions are easier in person.
  6. We had the coldest Arctic winter since 1979. It caused the first genuine ozone hole in the Arctic. The ozone layer protects the Earth from radiation.
  7. According to a recent research paper in Nature, fermented foods (cheese, butter, alcoholic beverages, yogurt) as well as fibers may have anti-obesity properties due to the production of short-chain fatty acids (acetate, propionate, and butyrate). Propionate is also used as a food preservative, as well as being produced by our own gut bacteria when digesting fibers, or by bacteria in the process of fermentation. There is much evidence that propionate may be helpful against obesity: Lin et al. (2012), Chambers et al. (2019), Arora et al. (2011). In contrast, a widely reported previous work by Tirosh et al. suggests that propionate leads to weight gain in mice and might promote obesity and metabolic abnormalities. The press reported the idea that propionate could make us fat by presenting it as a food preservative.

    My take? The science is probably “complicated” and we do not have all the answers. This being said, while it is true that there may be propionate in your bread and pizza, to prevent it from moulding, and while it is also true that pizza and bread may make you fat, I’d worry a lot more about the calories in bread and pizza rather than the effect of propionate itself.

  8. The current pandemics might be the start of a revolution in peer review.

Multiplying backward for profit

Most programming languages have integer types with arithmetic operations like multiplications, additions and so forth. Our main processors support 64-bit integers, which means that you can deal with rather large integers. However, you cannot represent everything with a 64-bit integer. What if you need to multiply 5^100 by 37? Programming languages like Python or JavaScript will handle it gracefully. In programming languages like C/C++, you will need to use a library. In Java, you will have to use special classes from the standard library.

You will almost never need to deal with integers that do not fit in 64 bits, unless you are doing cryptography or some esoteric numerical analysis. Nevertheless, it does happen that you need to work with very large integers.

How does the product of 5^100 by 37 get computed if processors are limited to 64-bit words? Underneath, your software represents such large integers using multiple 64-bit words. The standard approach computes the product starting from the least significant bits. Thus if you are multiplying an integer that fits in a single machine word (w) with an integer that requires n machine words, you will use n multiplications, starting with a multiplication between the word w and the least significant word of the other integer, going up to the most significant words. The code in C/C++ might look like so:

// multiply w * b where b contains n words, n > 0
p = full_multiplication(w, b[0]); // 64-bit x 64-bit => 128-bit
output[0] = p.low; // least significant 64 bits
uint64_t r = p.high; // most significant 64 bits
for (i = 1; i < n; i++) {
  p = full_multiplication(w, b[i]);
  p.low += r; // add r, with carry
  if(p.low < r) p.high++;
  output[i] = p.low; // least significant 64 bits
  r = p.high;
}
output[n] = r;

It is akin to how we are taught to multiply in school: start with the least significant digits, and go up. I am assuming that all integers are positive: it is not difficult to generalize the problem to include signed integers.

So far, it is quite boring. So let me add a twist.

You can also multiply backward, starting from the most significant words. Though you might think it would be less efficient, you can still compute the same product using the same n multiplications. The code is going to be a bit more complicated because you have to carry the overflow that you may encounter in the less significant words upward. Nevertheless, you can implement it in C/C++ using only a few extra lines of code. According to my hasty tests, it is only marginally slower (by about 20%).

// multiply w * b where b contains n words, n > 0
// the output array must have room for n + 1 words
p = full_multiplication(w, b[n - 1]); // 64-bit x 64-bit => 128-bit
output[n - 1] = p.low; // least significant 64 bits of this product
output[n] = p.high;    // most significant 64 bits
for (i = n - 2; i >= 0; i--) { // i must be a signed integer
  p = full_multiplication(w, b[i]);
  output[i] = p.low; // least significant 64 bits
  // check for overflow
  bool overflow = (output[i + 1] + p.high < output[i + 1]);
  output[i + 1] += p.high;
  for (size_t j = i + 2; overflow; j++) {
    output[j]++; // propagate the carry
    overflow = (output[j] == 0);
  }
}

Why would computing the multiplication backward ever be useful?

Suppose that you are not interested in the full product. Maybe you just need to check whether the result is within some bounds, or maybe you just need a good approximation of the product. Then starting from the most significant bits could be helpful if you can stop the computation after you have enough significant bits.

It turns out that you can do so, efficiently. You can compute the most significant k words using no more than an expected k + 0.5 multiplications. Furthermore, if you are careful, you can later resume the computation and complete it.

At each step, after doing k multiplications and computing k + 1 words, these k + 1 most significant words possibly underestimate the true k + 1 most significant words because we have not yet added the carry from the less significant multiplications. However, we can bound the value of the carry: it is strictly less than w. To see that it must be so, let r be the number of remaining words in the multiword integer that we have not yet multiplied by w. The maximal value of these words is 2^(64r) – 1. So we are going to add, at most, (2^(64r) – 1)w to our partially computed integer: the overflow (carry) above these r words is at most ((2^(64r) – 1)w)/2^(64r), a value strictly less than w.

This means that if we add w – 1 to the least significant of the k + 1 words and it does not overflow, then we know that the k most significant words are exact, and we can stop. This might happen, roughly, half the time, assuming that we are dealing with random inputs. When it does overflow, you can just continue and compute one more product. If, again, adding w – 1 to the least significant word you computed does not create an overflow, you can stop, confident that the k + 1 most significant words are exact.

However, you then gain another, more powerful stopping condition: if the second least significant word is not exactly 2^64 – 1, then you can also stop, confident that the k most significant words are exact, because adding at most w – 1 to the least significant word can, at worst, translate into a carry of +1 to the second least significant word. Because it is quite unlikely that you will end up with exactly the value 2^64 – 1, we know that, probabilistically, you will not need more than k + 1 multiplications to compute exactly the k most significant words. And quite often, you will be able to stop after k multiplications.

My code implementing these ideas is a bit too long for a blog post, but I have published it as its own GitHub repository, so you can check it out for yourself. It comes complete with tests and benchmarks.

I have restricted my presentation to the case where at least one integer fits in a single word. Generalizing the problem to the case where you have two multiword integers is a fun problem. If you have ideas about it, I’d love to chat.

Implementation notes: The code is mostly standard C++. The main difficulty is computing the full multiplication, which takes two 64-bit words and produces the 128-bit product as two 64-bit words. To my knowledge, there is no standard way to do it in C/C++, but most compilers offer you some way to do it. At the CPU level, computing the full product is always supported (64-bit ARM and x64) with efficient instructions.

Credit: This work is inspired by notes and code from Michael Eisel.