Accelerating Conway’s Game of Life with SIMD instructions

Conway’s Game of Life is one of the simplest non-trivial simulations one can program. It simulates the emergence of life from chaos. Though the rules are simple, the Game of Life has been under study for five decades.

The rules are simple. You have a grid where each cell has a single bit of state: it is either alive or dead. During each iteration, we look at the 8 neighbours of each cell and count the number of live neighbours. If a cell is dead but has exactly three live neighbours, it comes alive. If a live cell has more than 3 or fewer than 2 live neighbours, it dies. That is all.

Implemented in a straightforward manner, the main loop might look like this…

// note: a complete implementation reads the current generation and
// writes the next one to a separate grid, so that the neighbour
// counts are not corrupted as cells are updated
for (size_t i = 0; i < height; i++) {
  for (size_t j = 0; j < width; j++) {
    bool alive = states[i][j];
    int neighbours = count_live_neighbours(i, j);
    if (alive) {
      if (neighbours < 2 || neighbours > 3) {
        states[i][j] = false; // dies of under- or over-population
      }
    } else {
      if (neighbours == 3) {
        states[i][j] = true; // birth
      }
    }
  }
}

However, if you implement it in this manner, it is hard for an optimizing compiler to generate clever code. For a 10,000 by 10,000 grid, my basic C implementation takes 0.5 seconds per iteration.

So I wondered whether I could rewrite the code in a vectorized manner, using the SIMD instructions available on our commodity processors. My first attempt brought this down to 0.02 seconds per iteration, or about 25 times faster. My code is available.

implementation           time per iteration
scalar (C)               0.5 s
vectorized (C + SIMD)    0.02 s

I use 32-byte AVX2 intrinsics. I did not profile my code or do any kind of hard work.
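To give a flavour of the approach, here is a minimal sketch of a vectorized inner loop, assuming one byte per cell (0 or 1) and rows padded on both sides so that the unaligned loads never read out of bounds; my actual code (linked above) differs in its details.

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

// Process one output row given pointers to the row above, the row
// itself and the row below. Cells are one byte each (0 = dead,
// 1 = alive), so one 32-byte AVX2 register covers 32 cells; width is
// assumed to be a multiple of 32 and the rows are assumed padded.
void step_row(const uint8_t *above, const uint8_t *row,
              const uint8_t *below, uint8_t *out, size_t width) {
  for (size_t j = 0; j < width; j += 32) {
    // sum the eight neighbours, 32 cells at a time
    __m256i count = _mm256_setzero_si256();
    count = _mm256_add_epi8(count, _mm256_loadu_si256((const __m256i *)(above + j - 1)));
    count = _mm256_add_epi8(count, _mm256_loadu_si256((const __m256i *)(above + j)));
    count = _mm256_add_epi8(count, _mm256_loadu_si256((const __m256i *)(above + j + 1)));
    count = _mm256_add_epi8(count, _mm256_loadu_si256((const __m256i *)(row + j - 1)));
    count = _mm256_add_epi8(count, _mm256_loadu_si256((const __m256i *)(row + j + 1)));
    count = _mm256_add_epi8(count, _mm256_loadu_si256((const __m256i *)(below + j - 1)));
    count = _mm256_add_epi8(count, _mm256_loadu_si256((const __m256i *)(below + j)));
    count = _mm256_add_epi8(count, _mm256_loadu_si256((const __m256i *)(below + j + 1)));
    __m256i alive = _mm256_loadu_si256((const __m256i *)(row + j));
    // next = (count == 3) | (alive & (count == 2)), without branches
    __m256i three = _mm256_cmpeq_epi8(count, _mm256_set1_epi8(3));
    __m256i two = _mm256_cmpeq_epi8(count, _mm256_set1_epi8(2));
    __m256i next = _mm256_or_si256(three, _mm256_and_si256(alive, two));
    // the comparison masks are 0xFF where true; normalize back to 0/1
    _mm256_storeu_si256((__m256i *)(out + j), _mm256_and_si256(next, _mm256_set1_epi8(1)));
  }
}

Eight byte-wise additions produce the neighbour counts for 32 cells at once, and two byte-wise comparisons apply the life-and-death rules without a single branch.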

Thoughts:

  1. At a glance, I would guess that the limiting factor is the number of “loads”. An x64 processor can, at best, load two registers from memory per cycle, and I have many loads. The arithmetic operations (additions, subtractions) probably come for free. My implementation uses 8 bits per cell whereas a single bit is sufficient; going to a more concise representation would reduce the number of loads by nearly an order of magnitude (see the sketch after this list). My guess is that, on mainstream CPUs, I am probably between a factor of 5 and 10 away from the optimal implementation, and certainly at least a factor of two away from the optimal speed.
  2. The game-of-life problem is very similar to an image processing problem: it is a 3×3 moving/convolution filter, so tricks from image processing can be brought to bear. In particular, the problem is a good fit for GPU processing.
  3. I did not look at existing game-of-life implementations. I was mostly trying to come up with the answer by myself as quickly as possible. My bet would be on GPU implementations beating my implementation by a wide margin (orders of magnitude).
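To illustrate the more concise representation from the first thought above, here is a hypothetical bit-packed layout (my code does not do this; it uses a byte per cell):

#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

// Hypothetical layout with one bit per cell: a single 64-bit load
// brings in 64 cells, so the number of loads drops by roughly 8x
// compared to a byte-per-cell layout.
static inline bool cell_at(const uint64_t *bits, size_t words_per_row,
                           size_t i, size_t j) {
  return (bits[i * words_per_row + (j >> 6)] >> (j & 63)) & 1;
}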

Update: John Regehr points me to Hashlife as a better high-speed reference.

Science and Technology links (July 15th, 2018)

  1. The majority of people dying are 80 years old or older.
  2. A heart-disease drug can partially reverse type 1 diabetes.
  3. We can at least partially reverse age-related immune-system decline using drugs.
  4. Senescent cells are old cells that refuse to die; they are behind several age-related diseases. We know how to clear them, at least in part. If we inject even a few senescent cells into mice, the mice suffer; clearing the senescent cells afterward rescues them.
  5. Omega 3 supplements are probably a good idea if you suffer from type 1 diabetes.
  6. The price of commercially-available computing power has come down about 7x in just four years.
  7. Less prestigious institutions receive less research funding but are better on a per-dollar basis. Effectively, we overrate prestigious institutions and should look at the good work from less prestigious places if we want to maximize productivity.

Are fungi making us sick?

Fungi are everywhere. Yeasts, molds, mushrooms. We eat them. They live on our skin.

But are they making us sick?

That’s a theory strongly held by Martin Laurence. Martin created his own lab (Shipshaw labs) and even wrote a book on the topic. He reports that psoriasis and inflammatory bowel disease are both strongly associated with fungi.

It appears that we all get fungal infections when we become sexually active. For men, this fungal infection could migrate to the prostate. What if you are never sexually active? Catholic priests, who are presumably less likely to have been sexually active, have an exceptionally low rate of prostate cancer.

If you fail to see how fungus could cause cancer, consider that cervical cancer in women is strongly related to a viral infection. That is, “bugs” can give you cancer.

What is well known is that cancer patients suffer from fungal infections. However, the fungal infection is generally viewed as the consequence, not the cause of the cancer.

So if Martin is right, why is this not known, or at least under investigation? One problem, according to Martin, is that it is hard to test for some asymptomatic fungal infections, or, at least, researchers and hospitals might be ill-equipped to do so. You may need to use something fancy like DNA analysis… not something your average doctor can do.

Martin is especially interested in Malassezia. Malassezia lives on our skin, eating our fat, and causes dandruff and some forms of eczema. It seems able to infect your prostate as well as your gut. There is no obvious reason why it could not infect other organs, including the brain. In fact, there is a theory that Alzheimer’s might be caused or triggered by fungal infections.

That’s all a bit freaky if you ask me.

Ok. So what do we do to test Martin’s theory? The obvious path forward is to try to clear the fungal infections. Martin likes itraconazole a lot: it is a cheap and well-tolerated drug. It won’t cure cancer, but if you could flush out fungal infections, you might be able to prevent or delay some diseases. In a short note, doctors from the Mayo Clinic reported that treatment with itraconazole was effective against Crohn’s disease.

My understanding is that Martin believes that what often passes as an autoimmune disease (e.g., psoriasis) might actually be a reaction to fungi. That is, most of the time your body does not attempt to fight off fungus, but if it is somehow forced to do so, it wages a losing war that hurts your own body.

Obviously, as you grow older and especially if your immune system becomes less effective, you are more likely to suffer from a fungal infection. Thus it is plausible that fungi could harm older people and be related to age-related diseases.

The main reason I find Martin’s theory appealing is that, if he is even partially correct, there might be dirt-cheap therapies that help many of us remain healthier. It seems that Martin is ready to move to clinical trials. Hopefully we shall soon find out whether there is substance to this theory.

Appendix: Martin had more pointers:

  1. Kellermayer (2012) found an association between Malassezia in gut biopsies and Crohn’s disease in teenagers.
  2. Kanda (2002) found an association between a response to Malassezia and psoriasis.
  3. Squiquera (1994) found an association between antibodies against Malassezia and psoriasis.
  4. Lober (1982) induced psoriasis lesions by applying dead Malassezia to the skin of 10/10 genetically susceptible patients.
  5. Rautemaa (2007) and Delsing (2012): chronic Candida infections increase oral cancer risk. Unfortunately, we are finding that Malassezia chronically infects the breast (Boix-Amaros 2017).

Science and Technology links (July 6th, 2018)

  1. In the United Kingdom, only a tiny minority of female high-school students take computer science (0.4% in 2017). Physics is ten times more popular.
  2. Russians once drilled a 12-km deep hole. Apparently, it gets very hot as you dig deeper.
  3. Deep neural networks can detect myocardial infarction in electrocardiography (ECG) data better than the state of the art. In plain terms: artificial intelligence can determine, with high accuracy, whether you are having a heart attack, assuming that you are wearing the right sensors.
  4. Drinking coffee, even lots of it, is good for you.
  5. Every year, more and more research papers are made freely available online. Sadly, in most fields, it is still a tiny minority of the published research. The net result is that if you are not affiliated with a university, it is hard to have access to the latest research.
  6. Spiders can fly over long distances.
  7. Aspirin might help keep Alzheimer’s at bay (speculative).
  8. Viruses might trigger Alzheimer’s (speculative).
  9. Stem cells helped a bit to improve the outcome after a heart attack in macaque monkeys.

How quickly can you compute the dot product between two large vectors?

A dot product (or scalar product) is a fairly simple operation that sums many products:

#include <stddef.h>

float dot(const float *x1, const float *x2, size_t len) {
  float sum = 0;
  for (size_t i = 0; i < len; i++) {
    sum += x1[i] * x2[i];
  }
  return sum;
}

It is nevertheless tremendously important. You know these fancy machine learning algorithms we keep hearing about? If you dig deep under all the sophistication, you often find lots and lots of dot products.

Xavier Arteaga had a few interesting comments on a previous post of mine:

  1. “I would say that CPU frequency does not really make a difference when computing large amounts of data.”
  2. “I guess for intensive computations which require little memory and lots of operations AVX512 [new fancy instruction sets provided by Intel] provides a better performance.”

I disagree with Xavier, but I feel the need to provide some analysis. So let us evaluate Xavier’s observations with the dot product. Before I begin, let me point out that I am not a linear algebra expert: lots and lots of fancy engineering has been done on these problems. If you need fast dot-product software, find a good library; don’t grab it out of my blog.

I implemented a simple benchmark that computes dot products over increasingly large vectors. I use three modes: “standard” math, which is what you get when you simply compile a dot-product loop; a version with 128-bit vectors (SSE); and a version with 256-bit vectors (AVX2).

total input size    cycles per pair of floats    bytes per cycle    mode
8 kB                4                            2                  standard math
8 kB                1.0                          7.8                fast math (SSE)
8 kB                0.55                         15                 fast math (AVX2)
16 MB               4                            2                  standard math
16 MB               1.0                          7                  fast math (SSE)
16 MB               0.9                          9                  fast math (AVX2)
256 MB              4                            2                  standard math
256 MB              1.3                          6.0                fast math (SSE)
256 MB              1.2                          6.7                fast math (AVX2)
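To make the discussion concrete, here is a minimal sketch of an AVX2 dot product, assuming that the length is a multiple of eight; my benchmark code differs in its details.

#include <immintrin.h>
#include <stddef.h>

// Multiply and accumulate eight pairs of floats per iteration, then
// reduce the eight partial sums at the end. Note that reordering the
// additions this way is exactly what strict (standard) math forbids,
// hence the "fast math" label.
float dot_avx2(const float *x1, const float *x2, size_t len) {
  __m256 acc = _mm256_setzero_ps();
  for (size_t i = 0; i < len; i += 8) {
    __m256 a = _mm256_loadu_ps(x1 + i);
    __m256 b = _mm256_loadu_ps(x2 + i);
    acc = _mm256_add_ps(acc, _mm256_mul_ps(a, b));
  }
  // horizontal reduction of the eight lanes to a single float
  __m128 lo = _mm256_castps256_ps128(acc);
  __m128 hi = _mm256_extractf128_ps(acc, 1);
  __m128 s = _mm_add_ps(lo, hi);
  s = _mm_add_ps(s, _mm_movehl_ps(s, s));
  s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 1));
  return _mm_cvtss_f32(s);
}

A production version would use several independent accumulators to hide the latency of the additions, and fused multiply-add instructions where available.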

Can we reason about the AVX2 speed? You have two loads per iteration (which can execute in as little as one cycle), one cheap vectorized multiplication and one cheap vectorized addition. A recent Intel processor can easily multiply two vectors of eight floating-point values per cycle. Then there is overhead due to the loop. Generously, say that it takes 5 cycles to multiply a pair of vectors and add the result to a set of accumulators… that is about 0.6 cycles per pair of floats. As expected, my system does better (0.55 cycles per pair), which means that it takes fewer than 5 cycles to process each pair of eight-float vectors.

What is apparent is that we quickly hit a wall for large inputs (e.g., 256 MB and more). This wall has to do with how quickly a single core can grab data from the memory subsystem. I suspect that Xavier is correct: this wall has little to do with the CPU frequency. Furthermore, having fancier instructions (e.g., those from the AVX-512 instruction sets) will not help you.

What can we conclude?

  1. If you have to process lots of data with dirt-cheap operations (e.g., a vectorized dot product), then your single processor core is easily starved for data. That’s the part where Xavier is right.
  2. However, it is important to qualify what we mean by “cheap tasks”. Even just computing the dot product in a standard-compliant manner is entirely compute-bound in my experiments. Lots and lots of streaming operations like parsing a document, compressing data, and so forth, are likely to be compute bound.

To put it differently, while it is possible to design functions that stream through data in such a way that they are memory-bound over large inputs (e.g., copying data), such functions need to be able to eat through several bytes of data per cycle. Because our processors are vectorized and superscalar, this is certainly in the realm of possibilities, but it is harder than you might think.

Income, wealth, intelligence and the fall of the American empire

Arcand’s latest movie (The Fall of the American Empire) depicts a young man (Pierre-Paul Daoust) who is supposedly very intelligent, but not very wealthy. The movie begins with the character explaining that intelligence actually gets in the way of success.

I have certainly met my share of folks who must have very high intelligence and who are not particularly wealthy. There are many interesting intellectuals like Grigori Perelman who are obviously three standard deviations above the average in intelligence, but not in wealth.

What does the research say about this? Psychologists measure intelligence using IQ tests.

There is much evidence that people with a higher IQ have a higher income. The correlation is relatively high (between 0.3 and 0.4). And the elite are largely drawn from the intellectually gifted.

But wait! Income is not the same thing as wealth… the impact of IQ scores on net worth after accounting for other factors is either zero or close to zero, according to economist Jay Zagorsky. Sadly, this is a single study from 2007 and I could not find any relevant follow-up work to validate the analysis.

All in all, I can find no evidence that high intelligence is an obstacle in the quest for success.

Predicting the truncated xorshift32* random number generator

Software programmers need random number generators. For this purpose, they often use functions with outputs that appear random. Gerstmann has a nice post, Better C++ Pseudo Random Number Generator, in which he investigates the following generator:

uint32_t xorshift(uint64_t *m_seed) {
  // the returned value is derived from the current 64-bit state
  uint64_t result = *m_seed * 0xd989bcacc137dcd5ull;
  // the state is then advanced with three xorshift steps
  *m_seed ^= *m_seed >> 11;
  *m_seed ^= *m_seed << 31;
  *m_seed ^= *m_seed >> 18;
  // only the top 32 bits of the 64-bit product are returned
  return (uint32_t)(result >> 32ull);
}

This “truncated xorshift32*” function returns 32-bit “random” integers and takes a 64-bit state as input. Each time you call the function, the state is updated, so that subsequent random integers vary. Thus the state and the returned random numbers are distinct concepts. The PCG family of random number generators also uses this nice trick.

Gerstmann asks whether the generator is “predictable” and writes “Unknown?” as an answer. What is the missing answer? The answer is that it is predictable.

What does predictable mean?

Suppose that I tell you that the first random number generated is 1 and the second is 2… can you infer what the state is? If you try to set up the problem mathematically, you may find that it is quite vexing.

But, in fact, it is easy. I wrote a small program that gives you the answer in 4 seconds, using brute force. And once you know what the state is, you can predict all subsequent random integers.

How does it work? Simply put, from the first 32-bit output of the function (1 in my case), you know the equivalent of 32 bits of the state. Thus you only have a bit over 4 billion possibilities. That’s too much for a human being, but remember that your processor does billions of instructions per second… so 4 billion possibilities is not very many.
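Here is a sketch of the brute-force attack; my actual program (linked below) may differ in its details. The multiplier is odd, so it is invertible modulo 2^64.

#include <stdint.h>
#include <stdio.h>

// Given the first two outputs, try all 2^32 possibilities for the
// unknown low bits of seed * M, invert the multiplication to get a
// candidate state, advance it, and keep it if it predicts the second
// output.
int main(void) {
  const uint64_t M = 0xd989bcacc137dcd5ull;
  uint64_t Minv = 1; // modular inverse of M (mod 2^64), by Newton iteration
  for (int i = 0; i < 6; i++) Minv *= 2 - M * Minv;
  uint32_t out1 = 1, out2 = 2;         // the observed outputs
  uint64_t top = (uint64_t)out1 << 32; // the known top bits of seed * M
  for (uint64_t low = 0; low < (1ull << 32); low++) {
    uint64_t seed = (top | low) * Minv; // candidate state
    uint64_t s = seed;
    s ^= s >> 11; s ^= s << 31; s ^= s >> 18; // advance the state
    if ((uint32_t)((s * M) >> 32) == out2)
      printf("candidate state: 0x%016llx\n", (unsigned long long)seed);
  }
  return 0;
}

Each candidate requires only a handful of cheap operations, so a modern core chews through the 4 billion possibilities quickly; checking against a third output would weed out any accidental matches.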

My source code is available.

To be clear, this is not an argument against this particular generator.

Science and Technology links (June 29th, 2018)

  1. Hogarth imagines that artificial intelligence (AI) could progress much faster than we might anticipate due to what he calls “AI nationalism”:

    I believe that the current government spending on AI is tiny compared to the investment we will see as they come to realise what is at stake. What if rather than spending ~£500 million of public money on AI over a number of years the UK spent something closer to its annual defence budget of £45 billion?

  2. This year alone, the game Pokemon Go made 1.1 billion dollars in revenues. That is a lot of money, but another game, Fortnite, made even more: 1.2 billion dollars. On the whole, the videogame industry is much larger than the movie industry. Grand Theft Auto V made six billion dollars, more than any movie ever made.
  3. Today, in Canada, around 15% of all college-level course enrolments are online. A report finds that not only do students enjoy studying at any place, but they also crave continuous enrolment and accelerated programs. Online education is less about reaching far away students, as there are campus colleges nearly everywhere, and more about reinventing time and increasing student productivity.
  4. Old female mice treated with a specific class of antibodies live 9% longer, according to an article in Nature. It is interesting for a few reasons. One of them is that the sizeable life extension happens in mice that are already old; it is obviously much easier to make mice live longer if you intervene when they are still young. Another reason it is interesting is that it only works with females. Sex differences with respect to longevity are a fascinating topic. Though most animals show some sex differences with respect to longevity, human beings are the only known species where one gender (female) has a clear-cut longevity advantage. The difference is huge: in the U.S., life expectancy is 77 years for males and 82 years for females; in Canada, it is 80 and 84. So we are talking about a difference of between 4 and 6 years. I don’t think we know why there is such a difference, and any explanation would need to take into account the fact that it is unique to human beings. It is not uncommon, however, for life-extending therapies to be gender-specific.
  5. Naked mole rats are among the rare mammals that “do not age” (meaning that their rate of death does not increase with age). They are also largely immune to cancer. A recent article shows that they have more efficient gene repair abilities than normal mice.
  6. The Fermi paradox is the idea that intelligent life elsewhere in the universe is likely, yet we see no trace of it. Sandberg et al. argue that Fermi was wrong: intelligent life elsewhere in the universe is not likely. We are probably alone.
  7. It is believed that age-related arthritis is caused by cell senescence. UNITY, a company funded in part by Jeff Bezos (Amazon), is starting clinical trials to wipe out senescent cells. Currently, there is little that can be done to reverse arthritis. You can go for joint replacement, but it is invasive and only works for major joints.
  8. Hanley, Bains and Church argue that we should not attempt to prevent medical self-experimentation:

    We conclude that self-experimenters should not have attempts made to terminate them, bar them from use of facilities, nor be barred from using themselves or their tissues except in exceptional circumstances. Organizational uncertainty over the ethical and regulatory status of self-experimentation, and resulting fear of consequences is unjustified and may be blocking a route to human experiments that practicing scientists widely consider appropriate, and which historical precedent had shown is valuable.

  9. In an article published by Nature, Wang et al. find support for the theory that aging is caused by a “quasi-program” as opposed to the accumulation of damage.

Data processing on modern hardware

If you had to design a new database system optimized for the hardware we have today, how would you do it? And what is the new hardware you should care about? This was the topic of a seminar I attended last week in Germany at Dagstuhl.

Here are some thoughts:

  • You can try to offload some of the computation to the graphics processor (GPU). This works well for some tasks (hint: deep learning) but not so well for generic data processing tasks. There is an argument that says that if you have the GPU anyhow, you might as well use it, even if it is inefficient. I do not like this argument, especially given how expensive good GPUs are. (Update: Elias Stehle pointed out to me that GPU memory is growing exponentially, which could allow GPUs to serve as smart memory subsystems.)
  • The problem, in general, with heterogeneous data processing systems is that you must, somehow, at some point, move the data from one place to the other. Moving lots of data is slow and expensive. However, pushing the processing to the data is appealing. So you can do some of the processing within the memory subsystem (processing in memory or PIM) or within the network interface controllers. I really like the idea of applying a filter within the memory subsystem instead of doing it at the CPU level. It is unclear to me at this point whether this can be made into generically useful technology. The challenges are great.
  • There are more and more storage systems, including persistent memory, solid state drives, and so forth. There is expensive fast storage and cheaper, slower storage. How do you decide where to invest your money? Jim Gray’s 5-minute rule is still relevant, though it needs to be updated. What is this 5-minute rule? You would think that slow and inexpensive storage is always cheaper… but being half as fast and half as expensive is no bargain at all! So you always want to compare the price per query: frequent queries justify more expensive but faster storage (see the back-of-the-envelope sketch after this list).
  • There is much talk about FPGAs: programmable hardware that can be more power efficient than generic processors at some tasks. Because you have more control over the hardware, you can do nifty things like make use of all processing units at once in parallel, feed the output of one process directly into another process, and so forth. Of course, though it is used more efficiently, the silicon you have in an FPGA costs more. If your application is a great fit (e.g., signal processing), then it is probably a good idea to go the FPGA route… but it is less obvious to me why data processing, in general, would fit in this case. And if you need to outsource only some of your processing to the FPGA, then you need to pay a price for moving data.
  • The networking folks like something called RDMA (remote direct memory access). As I understand it, it allows one machine to access “directly” the memory of another machine, without impacting the remote CPU. Thus it should allow you to take several distinct nodes and make them “appear” like a multi-CPU machine. Building software on top of that requires some tooling and it is not clear to me how good it is right now. You can use MPI if it suits you.
  • There is talk of using “blockchains” for distributed databases. We have been doing a lot of work to save energy and keep computing green, yet this work is being undone as we burn CPU cycles like madmen on bitcoin and other cryptocurrencies. People talk about all the new possible applications, but details are scarce. I do not own any bitcoin.
  • We have cheap and usable machine learning; it can run on neat hardware if you need it to. Meanwhile, databases are hard to tune. It seems that we could combine the two to automagically tune database systems. People are working on this. It sounds a bit crazy, but I am actually excited about it and I plan to do some of my own work in this direction.
  • Cloud databases are a big deal. The latest “breakthrough” appears to be Amazon Aurora: you have a cheap, super robust, extendable relational database system. I have heard it described as an “Oracle killer”. The technology sounds magical. Sadly, most of us do not have the scale that Google and Amazon have, so it is less clear how we can contribute. However, we can all use it.
  • Google has fancy tensor processors. Can you use them for things that are not related to deep learning and low-precision matrix multiplications? It seems that you can. Whether it makes sense is another story.
  • People want more specialized silicon, deeper pipelines, more memory requests in flight. It is unclear whether vendors like Intel are willing to provide any of it. There was some talk about going toward RISC-V; I am not sure why.
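As promised in the storage bullet above, here is a back-of-the-envelope version of the 5-minute-rule computation, with entirely made-up prices and throughput numbers; only the shape of the formula matters.

#include <stdio.h>

// Break-even reference interval: if a page is accessed more often
// than this, caching it in RAM is cheaper per access than fetching
// it from storage every time. All numbers below are hypothetical.
int main(void) {
  double price_per_drive = 100.0;     // $ per storage device
  double accesses_per_second = 100.0; // random accesses/s per device
  double price_per_mb_ram = 0.01;     // $ per MB of RAM
  double pages_per_mb = 256.0;        // 4 kB pages
  double interval = (pages_per_mb / accesses_per_second)
                  * (price_per_drive / price_per_mb_ram);
  printf("break-even interval: %.0f seconds\n", interval); // about 7 hours
  return 0;
}

Halve both the price and the speed of the storage device and the break-even interval does not budge, which is the sense in which such a device is no bargain.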

I am missing many things, and I am surely misreporting much of what was said. Still, I will write some more about some of these ideas over the next few weeks.

Some subjects that people did not cover much at the seminar I was at, as far as I know:

  • Data processing on mobile devices or “Internet of things” was not discussed.
  • I felt that there was relatively little said about large chips made of many, many cores. Vectorization was touched upon, but barely.

The female representation at the event was low in number, but not in quality.

Credit: The Dagstuhl seminar I attended was organized by Peter A. Boncz, Goetz Graefe, Bingsheng He, and Kai-Uwe Sattler. The list of participants includes Anastasia Ailamaki, Gustavo Alonso, Witold Andrzejewski, Carsten Binnig, Philippe Bonnet, Sebastian Breß, Holger Fröning, Alfons Kemper, Thomas Leich, Viktor Leis, myself (Daniel Lemire), Justin Levandoski, Stefan Manegold, Klaus Meyer-Wegener, Onur Mutlu, Thomas Neumann, Anisoara Nica, Ippokratis Pandis, Andrew Pavlo, Thilo Pionteck, Holger Pirk, Danica Porobic, Gunter Saake, Ken Salem, Kai-Uwe Sattler, Caetano Sauer, Bernhard Seeger, Evangelia Sitaridi, Jan Skrzypczak, Olaf Spinczyk, Ryan Stutsman, Jürgen Teich, Tianzheng Wang, Zeke Wang, and Marcin Zukowski. The beer was quite good.

Science and Technology links (June 24th, 2018)

  1. Video gamers may soon be paid more than top pro athletes. Meanwhile, if you want to stand out in a crowd of university professors, point out that you are a fan of videogames.
  2. You can get your whole genome sequenced for $500.
  3. Machines can learn to solve the Rubik’s Cube “without human knowledge”.
  4. In most countries in the world, there are more cellular subscriptions than there are human beings. In the worst case (Afghanistan), there are 0.6 cellular subscriptions per person.
  5. More than 70 per cent of Australian men are now considered overweight or obese.
  6. There have never been more hives of honey bees: there are about 90 million in the world, compared with about 60 million 50 years ago. In Europe and the UK, we are near a record number of hives.
  7. Too much physical training can lead to bone loss.
  8. Heart disease is rising in India. We do not know why.
  9. The placebo effect does not exist or is much weaker than usually claimed.
  10. Vitamin K2 makes bone stronger.