Conway’s Game of Life is one of the simplest non-trivial simulation one can program. It simulates the emergence of life from chaos. Though the rules are simple, the game of life is still being studied for the last five decades.

The rules are simple. You have a grid where each cell in the grid has a single bit of state: it is either alive or dead. During each iteration, we look at the 8 neighbours of each cells and count the number of live neighbours. If a cell is dead but has exactly three live neighbours, it comes alive. If a live cell has more than 3 or less than 2 live neighbours, it dies. That is all.

Implemented in a straight-forward manner, the main loop might look like this…

for (i = 0; i < height; i++) { for (j = 0; j < width; j++) { bool alive = states[i][j]; int neighbours = count_live_neighbours[i][j]; if (alive) { if (neighbours < 2 || neighbours > 3) { states[coord] = false; } } else { if (neighbours == 3) { states[coord] = true; } } } }

However, if you implement it in this manner, it is hard for an optimizing compiler to generate clever code. For a 10,000 by 10,000 grid, my basic C implementation takes 0.5 seconds per iteration.

So I wondered whether I could rewrite the code in a vectorized manner, using the SIMD instructions available on our commodity processors. My first attempt brought this down to 0.02 seconds per iteration or about 25 times faster. My code is available.

scalar (C) | 0.5 s |

vectorized (C + SIMD) | 0.02 s |

I use 32-byte AVX2 intrinsics. I did not profile my code or do any kind of hard work.

Thoughts:

- At a glance, I would guess that the limiting factor is the number of “loads”. An x64 processor can, at best, load two registers from memory per cycle and I have many loads. The arithmetic operations (additions, subtractions) probably come for free. My implementation uses 8 bits per cell whereas a single bit is sufficient. Going to more concise representation would reduce the number of loads by nearly an order of magnitude. My guess is that, on main CPUs, I am probably between a factor of 5 and 10 away from the optimal implementation. I expect that I am at least a factor of two away from the optimal speed.
- The game-of-life problem is very similar to an image processing problem. It is a 3×3 moving/convolution filter. Tricks from image processing can be brought to bear. In particular, the problem a good fit for GPU processing.
- I did not look at existing game-of-life implementations. I was mostly trying to come up with the answer by myself as quickly as possible. My bet would be on GPU implementations beating my implementation by a wide margin (orders of magnitude).

**Update**: John Regher points me to Hashlife as a better high-speed reference.

I wonder how lookup tables and block compression would perform in comparison to AVX. (compression as all states are probably not equally likely)

I wonder how complicated the logic of computing new cell liveness value would be using binary logic – essentially reading cell and its neighborhood in nine (interleaved) AVX registers compromising of 2304 cell values and computing 256 cell liveness updates per step using essentially circuit logic of ANDs, ORs, XORs and such.

Unsurprisingly somebody has had a look at this already, and apparently 35 logic gates are enough for the task:

https://www.moria.us/old/3/programs/life/

Considering consecutive cells have four shared neighbors, this could be reduced to maybe 30 ANDs/ORs per round of 256 cells. (Further reductions are possible, but not really practical on circuit level.) Modern CPUs can do three such operations per clock cycle, and if register pressure doesn’t turn out to be a problem and overhead related to edges of vectors could be handled, this ineffective-sounding approach might be almost tenfold times faster than vectorized variant in the blog post!

But this is a back-of-an-envelope estimate without any code written for the task.

Golly uses a hashlife algorithm. I’ve been using it lately and finding it very useful. It can be astoundingly fast for some patterns.

https://en.wikipedia.org/wiki/Golly_(program)

http://golly.sourceforge.net/

http://conwaylife.com/w/index.php?title=Golly

I would guess that representing each row as bits in a SIMD register would yield good results, especially because we can (with a little adjustment) reuse the 3×1 sum 3 times (for the row above, below and the row itself).