Programming competition with $1000 in prizes: make my code readable!

Colm MacCárthaigh is organizing a programming competition with three prizes: $500, $300, and $200. The objective? Produce the most readable, easy-to-follow, and well-tested implementation of the nearly divisionless random integer algorithm. The scientific reference is Fast Random Integer Generation in an Interval (ACM Transactions on Modeling and Computer Simulation, 2019). I have a few blog posts on the topic, such as Nearly Divisionless Random Integer Generation On Various Systems.

This algorithm has been added to the Swift standard library by Pavol Vaskovic. It has also been added to the Go standard library by Josh Snyder. And it is part of the Python library NumPy thanks to Bernardt Duvenhage and others.

I have a C++ repository on GitHub with relevant experiments as well as a Python extension.

Colm wants the implementation to be licensed under the Apache Software License 2.0. It could be in any programming language you like. The deadline is September 1st 2019. You can find Colm on Twitter.

Arbitrary byte-to-byte maps using ARM NEON?

Modern processors have fast instructions that can operate on wide registers (e.g., 128-bit). ARM processors, the kind of processors found in your phone, have such instructions, called “NEON”. Sebastian Pop pointed me to some of his work doing fast string transformations using NEON instructions. Sebastian has done some great work to accelerate the PHP interpreter on ARM processors. One of his recent optimizations is a way to transform the case of strings quickly.

This suggested the following problem to me. Suppose that you have a stream of bytes. You want to transform the byte values in an arbitrary manner. Maybe you want to map the byte value 1 to the byte value 12, the byte value 2 to the byte value 53… and so forth.

Here is how you might implement such a function in plain C:

void transform(uint8_t *values, size_t volume, const uint8_t *map) {
  for (size_t i = 0; i < volume; i++) {
    values[i] = map[values[i]];
  }
}

For each byte, you need two loads (one for values[i] and one for map[values[i]]) and one store, assuming that the compiler does not do any magic.

To implement such a function over blocks of 16 bytes with NEON, we use the vqtbl4q_u8 function, which is essentially a way to do 16 independent look-ups in a 64-byte table. It uses the least significant 6 bits of each byte as a look-up index; if either of the two most significant bits is non-zero, it outputs zero. Because there are 256 different byte values, we need four distinct calls to the vqtbl4q_u8 function. One of them gives non-zero results for byte values in [0,64), another one for byte values in [64,128), another one for byte values in [128,192), and a final one for byte values in [192,256). We select the right range with a bitwise XOR on the input (the veorq_u8 function): XORing with 0x40, 0x80 or 0xc0 maps the corresponding range down to [0,64) while pushing all other byte values out of range. Finally, we just need to apply bitwise ORs to glue the results back together (via the vorrq_u8 function).

uint8x16_t simd_transform16(uint8x16x4_t *table, uint8x16_t input) {
  // each call covers one 64-value range; out-of-range indices yield zero,
  // so at most one of t1, t2, t3, t4 is non-zero for any given byte
  uint8x16_t t1 = vqtbl4q_u8(table[0], input);
  uint8x16_t t2 = vqtbl4q_u8(table[1], veorq_u8(input, vdupq_n_u8(0x40)));
  uint8x16_t t3 = vqtbl4q_u8(table[2], veorq_u8(input, vdupq_n_u8(0x80)));
  uint8x16_t t4 = vqtbl4q_u8(table[3], veorq_u8(input, vdupq_n_u8(0xc0)));
  return vorrq_u8(vorrq_u8(t1, t2), vorrq_u8(t3, t4));
}

In terms of loads and stores, assuming that you have enough registers, you only need one load and one store per block of 16 bytes.
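For concreteness, here is a sketch (my own, not Sebastian's code) of how one might build the four 64-byte tables from a plain 256-byte map and drive simd_transform16 over a buffer; I assume, for simplicity, that the number of bytes is a multiple of 16:

#include <arm_neon.h>
#include <stddef.h>
#include <stdint.h>

void neon_transform(uint8_t *values, size_t volume, const uint8_t map[256]) {
  uint8x16x4_t table[4];
  for (int i = 0; i < 4; i++) { // table[i] covers byte values [64 * i, 64 * (i + 1))
    table[i].val[0] = vld1q_u8(map + 64 * i);
    table[i].val[1] = vld1q_u8(map + 64 * i + 16);
    table[i].val[2] = vld1q_u8(map + 64 * i + 32);
    table[i].val[3] = vld1q_u8(map + 64 * i + 48);
  }
  for (size_t i = 0; i < volume; i += 16) {
    uint8x16_t in = vld1q_u8(values + i); // one load per 16-byte block
    vst1q_u8(values + i, simd_transform16(table, in)); // one store per block
  }
}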

A more practical scenario might be to assume that all my byte values fit in [0,128), as is the case with a stream of ASCII characters…

uint8x16_t simd_transform16_ascii(uint8x16x4_t *table, uint8x16_t input) {
  uint8x16_t t1 = vqtbl4q_u8(table[0], input);
  uint8x16_t t2 = vqtbl4q_u8(table[1], veorq_u8(input, vdupq_n_u8(0x40)));
  return vorrq_u8(t1, t2);
}

To test it out, I wrote a benchmark, which I ran on a Cortex-A72 processor. My source code is available. I get a sizeable speed bump when I use NEON with ASCII input, but the general NEON scenario is slower than the plain C version.

plain C        1.15 ns/byte
NEON           1.35 ns/byte
NEON (ASCII)   0.71 ns/byte

What about Intel and AMD processors? Most of them do not have vectorized 64-byte lookup tables: they are limited to 16-byte tables. We need to wait for AVX-512 instructions to get wider vectorized lookup tables. Unfortunately, AVX-512 is only available on some Intel processors, and it is unclear when it will appear on AMD processors.
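For contrast, here is a sketch (mine) of the 16-byte table lookup that x86 processors offer via the SSSE3 intrinsic _mm_shuffle_epi8; as with the NEON table instructions, out-of-range indices (here, bytes with their most significant bit set) produce zero:

#include <tmmintrin.h> // SSSE3

// Looks up each of the 16 input bytes in a 16-byte table; this implements
// an arbitrary byte-to-byte map only when all byte values fall in [0,16).
__m128i x86_transform16(__m128i table16, __m128i input) {
  return _mm_shuffle_epi8(table16, input);
}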

Science and Technology links (July 20th 2019)

  1. Researchers solve the Rubik’s cube puzzle using machine learning (deep learning).
  2. There has been a rise in the popularity of “deep learning” following some major breakthroughs in tasks like image recognition. Yet, at least as far as recommender systems are concerned, there are reasons to be skeptical of the good results being reported:

    In this work, we report the results of a systematic analysis of algorithmic proposals for top-n recommendation tasks. Specifically, we considered 18 algorithms that were presented at top-level research conferences in the last years. Only 7 of them could be reproduced with reasonable effort. For these methods, it however turned out that 6 of them can often be outperformed with comparably simple heuristic methods, e.g., based on nearest-neighbor or graph-based techniques. The remaining one clearly outperformed the baselines but did not consistently outperform a well-tuned non-neural linear ranking method. Overall, our work sheds light on a number of potential problems in today’s machine learning scholarship and calls for improved scientific practices in this area.

  3. Your blood contains about four grams of glucose/sugar (it is a tiny amount).
  4. Our brain oscillates at a rate of about 40 Hz (40 times per second). Some researchers may have found the cells responsible for coordinating these waves.
  5. Are we suffering from record-setting heat waves? A recent American government report concludes that we are not:

    (…) the warmest daily temperature of the year increased in some parts of the West over the past century, but there were decreases in almost all locations east of the Rocky Mountains. In fact, all eastern regions experienced a net decrease (…), most notably the Midwest (about 2.2°F) and the Southeast (roughly 1.5°F). (…) As with warm daily temperatures, heat wave magnitude reached a maximum in the 1930s.

    The same report observes that cold extremes have become less common, however.

Science and Technology links (July 13th 2019)

  1. Drinking juice increases your risk of having cancer.
  2. College completion rates are increasing and this may have to do with lower standards.
  3. Australia is going to try a flu vaccine that was partially designed with help from an artificial intelligence called Sam. Sam acquires knowledge and has new ideas, according to its creators. I think that this is misleading wording. What is real, however, is that we routinely underutilize software: many designs that are made entirely by human beings could be greatly improved via software. We are nowhere close to the end of the information technology revolution.
  4. Videoconferencing is now common. We know that it does not work as well as it should in part because we keep looking at the screen, not at the camera, and this breaks eye contact… an important behavior for human beings. Apple is deploying a fix for this problem in its own videoconferencing software by slightly altering the image to make it appear as if you were looking straight at the camera. It should be available publicly in a few months. Hopefully, other videoconferencing software will follow the lead.
  5. Omega 3 supplements lower risk of heart attacks. Calcium with vitamin D increases the risk of strokes. Multivitamins and antioxidants have no significant effect on mortality.
  6. People who live in cities are not less civil than people who live in the country. I think that if you live in a city, you have more social interactions and are therefore more likely to have rude encounters per time period. However, people in a big city are just as nice as people in a village, on average.
  7. Women with children who drink alcohol and use illegal substances are more likely to be in a relationship.
  8. Researchers who study obesity frequently (50% of the time) confuse correlation or association with causality.
  9. Though the United States is often thought to be a leader in research, in a comparison between the leading economies (G20), researchers found that the output and impact of American researchers are in decline and that the output per researcher has fallen below the G20 average.
  10. Most new drugs entering the market are useless. I think that this should be expected: progress is hard and most of our attempts fail, even if we can only recognize the failures much later. In my mind, we should not worry that too many useless drugs are marketed, we should fear that companies shy away from risky new ventures. Don’t blame the innovators when they take risks and fail, encourage them to fail more often, to take bigger risks.
  11. Though our brains make new brain cells all the time, this effect (neurogenesis) can be slowed or prevented by our own immune system.

Parsing JSON quickly on tiny chips (ARM Cortex-A72 edition)

I own an inexpensive card-size ROCKPro64 computer ($60). It has ARM Cortex-A72 processors, the same processors you find in the recently released Raspberry Pi 4. These processors are similar to those in your smartphone, although they are far less powerful than the best Qualcomm and Apple have to offer.

A common task in data science is to parse JSON documents. We wrote a fast JSON parser in C++ called simdjson; to go fast, it uses fancy (“SIMD”) instructions. I wondered how well it would fare on the ARM Cortex-A72 processor.

I am comparing against two top-of-the-line fast parsers (RapidJSON and sajson). I cannot stress enough how good these parsers are: out of dozens of JSON parsers, they clearly stand out. I think it is fair to say that RapidJSON is the “standard” parser in C++. Of course, our parser (simdjson) is good too.

Here are the speeds compared on some of our standard documents. I use GNU GCC 8.3.

file            simdjson    RapidJSON   sajson
twitter         0.39 GB/s   0.14 GB/s   0.27 GB/s
update-center   0.35 GB/s   0.14 GB/s   0.27 GB/s
github_events   0.39 GB/s   0.20 GB/s   0.29 GB/s
gsoc-2018       0.52 GB/s   0.20 GB/s   0.36 GB/s

So it is interesting to note that code optimized with fancy “SIMD” instructions runs fast even on small ARM processors.

Yet these speeds are far below what is possible on better processors. Our simdjson parser on an Intel processor can easily achieve speeds well beyond 2 GB/s. There are at least three reasons for the vast difference:

  1. Intel processors have “faster” instructions (AVX2) that can process 256-bit registers. There is no equivalent on ARM. In our case, this means that the Intel processor only needs about 75% as many instructions as the ARM processor.
  2. Intel processors run at a higher clock. It is easy to find an Intel processor that reaches 3.5 GHz or more, maybe much more. My ROCKPro64 runs at 1.8 GHz, so roughly half the speed.
  3. Intel processors easily retire 3 instructions per cycle on a task like JSON parsing whereas the ARM Cortex-A72 is capable of about half of that.

The last two points alone explain about a 4x difference in favour of Intel processors.
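To spell out the arithmetic behind this 4x estimate: the clock ratio is about 3.5/1.8 ≈ 1.9, and the instructions-per-cycle ratio is about 2, so the product is roughly 1.9 × 2 ≈ 3.9.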

To reproduce the results using Docker, you can do the following:

git clone https://github.com/lemire/simdjson.git
cd simdjson
docker build -t simdjson . 
docker run --privileged -t simdjson

Credit: Geoff Langdale with contributions from Yibo Cai and Io Daza Dillon.

Parsing JSON using SIMD instructions on the Apple A12 processor

Most modern processors have “SIMD instructions”. These instructions operate over wide registers, doing many operations at once. For example, you can easily subtract 16 values from 16 other values in a single SIMD instruction. It is a form of parallelism that can drastically improve the performance of some applications like machine learning, image processing or numerical simulations.
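As a concrete illustration (my own), here is what subtracting 16 byte values from 16 others looks like with ARM NEON intrinsics; a single instruction does all 16 subtractions:

#include <arm_neon.h>

uint8x16_t subtract16(uint8x16_t a, uint8x16_t b) {
  return vsubq_u8(a, b); // 16 independent byte subtractions at once
}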

A common data format on the Internet is JSON, a ubiquitous format for exchanging data between computers. If you are using some data-intensive web or mobile application written in this decade, it almost surely relies on JSON, at least in part.

Parsing JSON can become a bottleneck. The simdjson C++ library applies SIMD instructions to the problem of parsing JSON data. It works pretty well on modern Intel processors, the kind you have in your laptop if you bought it in the last couple of years. These processors have wide registers (256-bit) and a powerful corresponding instruction set (AVX2).

That is not the end of the road: the upcoming generation of Intel processors supports AVX-512, with its 512-bit registers. But such fancy processors are still uncommon, even though they will surely be everywhere in a few short years.

But what about processors that do not have such fancy SIMD instructions? The processor in your iPhone (an Apple A12) is an ARM processor and it “merely” has 128-bit registers, so half the width of current Intel processors and a quarter of the width of upcoming Intel processors.

It would not be fair to compare an ARM processor with its 128-bit registers to an Intel processor with 256-bit registers… but we can even the odds somewhat by disabling the AVX2 instructions on the Intel processor, forcing it to rely only on the smaller 128-bit registers (and the old SSE instruction sets).

Another source of concern is that mobile processors run at a lower clock frequency. You cannot easily compensate for differences in clock frequencies.

So let us run some code and look at a table of results! I make available the source code necessary to build an iOS app to test the JSON parsing speed. If you follow my instructions, you should be able to reproduce my results. To run simdjson on an Intel processor, you can use the tools provided with the simdjson library. I rely on GNU GCC 8.3.

file            AVX (Intel Skylake 3.7 GHz)   SSE (Intel Skylake 3.7 GHz)   ARM (Apple A12 2.5 GHz)
gsoc-2018       3.2 GB/s                      1.9 GB/s                      1.7 GB/s
twitter         2.2 GB/s                      1.4 GB/s                      1.3 GB/s
github_events   2.4 GB/s                      1.6 GB/s                      1.2 GB/s
update-center   1.9 GB/s                      1.3 GB/s                      1.1 GB/s

So we find that the Apple A12 processor is somewhere between an Intel Skylake processor with AVX disabled and a full Intel Skylake processor, if you account for the fact that it runs at a lower frequency.

Having wider registers and more powerful instructions is an asset: no matter how you look at the numbers, AVX instructions are more powerful than ARM SIMD instructions. Once Intel makes AVX-512 widely available, it may leave many ARM processors in the dust as far as highly optimized codebases are concerned. In theory, ARM processors are supposed to get more modern SIMD instructions (e.g., via the Scalable Vector Extension), but we are not getting our hands on them yet. It would be interesting to know whether Qualcomm and Apple have plans to step up their game.

Note: There are many other differences between the two tests (Intel vs. Apple). Among them is the compiler. The SSE and NEON implementations have not been optimized. For example, I merely ran the code on the Apple A12 without even trying to make it run faster. We just checked that the implementations are correct and pass all our tests.

Credit: The simdjson port to 128-bit registers for Intel processors (SSE4.2) was mostly done by Sunny Gleason, the ARM port was mostly done by Geoff Langdale with code from Yibo Cai. Io Daza Dillon completed the ports, did the integration work and testing.

Science and Technology links (July 6th, 2019)

  1. Jim Keller, the vice president of silicon engineering at Intel, is optimistic regarding the continued exponential progress in computing: “It’s going to keep going, Moore’s law is relentless.”
  2. England did away with free college. The result?

    England’s shift has resulted in increased funding per head and rising enrolments, with no apparent widening of the participation gap between advantaged and disadvantaged students.

    The authors make available a short summary of their paper.

  3. The Japanese have significantly lower risks of cardiovascular disease. Yet the difference is not explained by standard risk factors such as cholesterol (which is higher in Japan) or blood pressure (also higher in Japan).
  4. A protein called eNAMPT has a rejuvenating effect on mice when injected in the blood. Older mice and humans have lower levels of this protein compared to the young.
  5. Music training does not improve academic outcomes.
  6. We can grow functional hair out of stem cells.
  7. Even a little bit of exercise appears to be enough to promote learning and make the brain healthier. (Source: Nature, tested in mice)
  8. Future computers might benefit from persistent memory that is both fast and requires very little power. (Source: Nature)
  9. We sent a spaceship that runs on a solar sail. A solar sail is, as the name suggests, a sail pushed by the pressure of sunlight.
  10. The Bitcoin cryptocurrency uses about 0.2% of the world’s electricity consumption (46 TWh).
  11. China produces over 25% of the world’s factory output. The USA is second at 17%, Japan produces less than half as much as the USA, while India is at a mere 3%.
  12. A Russian scientist is demanding the right to do “germline gene editing” to cure deafness in some families. Germline editing is drastic in the sense that it changes the genome of children and all their descendants.

A fast 16-bit random number generator?

In software, we often need to generate random numbers. Commonly, we use pseudo-random number generators.

A simple generator is wyhash. It amounts to a multiplication followed by an XOR, done twice:

uint64_t wyhash64_x; // the 64-bit state: seed it to any value you like

uint64_t wyhash64() {
  wyhash64_x += 0x60bee2bee120fc15;
  __uint128_t tmp;
  tmp = (__uint128_t) wyhash64_x * 0xa3b195354a39b70d;
  uint64_t m1 = (tmp >> 64) ^ tmp; // fold the 128-bit product down to 64 bits
  tmp = (__uint128_t)m1 * 0x1b03738712fad5c9;
  uint64_t m2 = (tmp >> 64) ^ tmp;
  return m2;
}

It generates 64-bit numbers. I was recently asked whether we could build an equivalent generator at a smaller scale.

What if you only wanted 16-bit integers and you wanted to follow the model of the wyhash64 function? This might be useful if you are working with a limited processor and only have modest needs. Here is my proposal:

uint16_t wyhash16_x; // the 16-bit state

uint32_t hash16(uint32_t input, uint32_t key) {
  uint32_t hash = input * key;
  return ((hash >> 16) ^ hash) & 0xFFFF; // fold the product down to 16 bits
}

uint16_t wyhash16() {
  wyhash16_x += 0xfc15; // any odd increment preserves the full period
  return hash16(wyhash16_x, 0x2ab);
}

To use this code, you should first initialize wyhash16_x to a value of your choice. You may also be interested in replacing the increment (0xfc15) with any other odd integer. The period of this generator is 65536, meaning that after 65536 values, it comes full circle.

The essential idea is in the hash16 function. For a given key value, it is a map from 16-bit values to 16-bit values. The map is not invertible, in general. I picked the key value 0x2ab because this choice “maximizes” the avalanche effect. That is, if you take a given value, hash it with hash16, then modify a single bit and hash it again, the difference between the two outputs should have about 8 bits flipped (out of 16). That is, changing the input value even a little should change the output a lot.
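Here is a rough sketch (my own, not the analysis code from the repository) of how one might estimate the avalanche effect of hash16 for a candidate key: flip each input bit in turn and count how many output bits change, hoping for an average near 8:

#include <stdint.h>
#include <stdio.h>

uint32_t hash16(uint32_t input, uint32_t key) { // as defined above
  uint32_t hash = input * key;
  return ((hash >> 16) ^ hash) & 0xFFFF;
}

int popcount16(uint32_t v) { // count the set bits among the low 16
  int c = 0;
  for (int i = 0; i < 16; i++) c += (v >> i) & 1;
  return c;
}

double avalanche(uint32_t key) {
  long long flipped = 0;
  for (uint32_t x = 0; x < 65536; x++) {
    uint32_t h = hash16(x, key);
    for (int b = 0; b < 16; b++) { // flip each of the 16 input bits
      flipped += popcount16(h ^ hash16(x ^ (1u << b), key));
    }
  }
  return flipped / (65536.0 * 16); // average flipped output bits (ideal: 8)
}

int main() {
  printf("avalanche(0x2ab) = %f\n", avalanche(0x2ab));
  return 0;
}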

The hash16 function is not invertible… to make it invertible I would need to make the avalanche effect quite weak. With the key 0x2ab, its image (the set of output values) only has size 44,114, whereas there are 65,536 distinct 16-bit values. Is that bad? Well. If you generate 65,536 random values between 0 and 65,536, how many distinct values do you expect? The answer is about 41,427. So, if anything, my hash16 function might have too large an image.
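For the record, the 41,427 figure comes from a standard calculation: each of the 65,536 possible values is missed by all 65,536 draws with probability (1 – 1/65536)^65536 ≈ 1/e, so the expected number of distinct values is 65536 × (1 – (1 – 1/65536)^65536) ≈ 65536 × (1 – 1/e) ≈ 41,427.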

If you only need to generate a few thousand 16-bit values, this simple generator might be good enough. It is also going to be fast if you have a fast 32-bit multiplier. Evidently, if you are planning to do serious number crunching, it will quickly show its limits. I did not even try to push it through standard statistical tests as none of them are designed for such small generators.

What I found interesting is that by optimizing the avalanche effect, I also got a generator that had a decent image size.

The source code I used to analyze this problem is available.

Appendix. Furthermore, you can generate integer values in a range [0,s) efficiently:

uint16_t rand_range16(const uint16_t s) {
    uint16_t x = wyhash16();
    uint32_t m = (uint32_t)x * (uint32_t)s;
    uint16_t l = (uint16_t)m;
    if (l < s) {
        // 2^16 mod s; the cast avoids the integer promotion that would
        // otherwise make -s % s evaluate to zero
        uint16_t t = (uint16_t)(-s) % s;
        while (l < t) { // reject the biased candidates
            x = wyhash16();
            m = (uint32_t)x * (uint32_t)s;
            l = (uint16_t)m;
        }
    }
    return (uint16_t)(m >> 16);
}
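For illustration, here is a hypothetical usage sketch, assuming the definitions above are in scope: seed the state, then draw values in a range.

#include <stdio.h>

int main() {
  wyhash16_x = 42; // seed the generator first
  for (int i = 0; i < 5; i++) {
    printf("%d\n", 1 + rand_range16(6)); // fair die rolls in [1,6]
  }
  return 0;
}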

Science and Technology links (June 29th 2019)

  1. Consuming pornography has no effect on sexual desire for one’s partner.
  2. Exercise may increase the likelihood that a woman will become pregnant.
  3. Consuming more protein makes it less likely that you will become frail as you age.
  4. There is no correlation between how often university professors fly and their academic success. Furthermore, academics who study environmental issues like global warming fly just as much as the other professors.
  5. In mammals, the heart does not regenerate: dead heart cells are not replaced. New research shows that this can be changed, in mice, by pumping them up with a small molecule called miR-294. The treated mice saw good heart regeneration.
  6. In human beings, an important part of the immune system, the thymus, simply disappears over time, leaving older people with a weaker immune system. In mice, this can be reversed using transgenic exosomes.

Bounding the cost of the intersection between a small array and a large array

Consider the scenario where you are given a small sorted array of integers (e.g., [1,10,100]) and a large sorted array ([1,2,13,51,…]). You want to compute the intersection between these two arrays.

A simple approach would be to take each value in the small array and do a binary search in the large array. Let the size of the large array be n; then a binary search takes up to log n comparisons and data accesses. Of course, log n may not be an integer, so I need to round it up to the smallest integer at least as large as log n: ceil( log n ). If the small array contains k elements, then I need a total of k ceil( log n ) comparisons. Each comparison involves one data access in the large array.
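For reference, here is a minimal sketch (mine, not from any particular library) of this naive approach in C; it assumes both arrays are sorted and that out has room for k values:

#include <stddef.h>
#include <stdint.h>

size_t intersect_naive(const uint32_t *small, size_t k,
                       const uint32_t *large, size_t n, uint32_t *out) {
  size_t count = 0;
  for (size_t i = 0; i < k; i++) {
    size_t lo = 0, hi = n; // binary search for small[i] in large[0..n)
    while (lo < hi) {
      size_t mid = lo + (hi - lo) / 2;
      if (large[mid] < small[i]) lo = mid + 1;
      else hi = mid;
    }
    if (lo < n && large[lo] == small[i]) out[count++] = small[i];
  }
  return count;
}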

Is k log n comparisons the best bound? Clearly not: if k is almost as big as n, then k log n can far exceed k+n. Yet a standard merge algorithm (as in the Merge Sort algorithm) requires as little as k+n comparisons. So for the bound to be reasonable, we must require that k is much smaller than n.

Even so, intuitively, it is wasteful to do k distinct binary searches. If the values in the small array are in sorted order, then you have new knowledge after doing the first binary search that you can reuse to help in the next binary search.

How few comparisons can you do? We can use an information-theoretical argument. The basic idea behind such an argument is that for an algorithm based on a decision tree to generate M distinct possible outputs, it must involve at least ceil( log( M ) ) decisions in the worst case, a count corresponding to the height of the corresponding balanced tree.

To be clear, this means that if an adversary gets to pick the input data, then even if you have a divine algorithm, I can give you a lower bound on the number of decisions that your algorithm takes. It is entirely possible that, in practice, you will do better given some data; it is also entirely possible that you could do much worse, evidently. Nevertheless, let us continue.

We can recast the intersection problem as the problem of locating, for each key in the small array, the largest value in the large array that is smaller than or equal to the key. How many possible outputs (of this algorithm) are there? I have to choose k values from a set of n elements, but the order does not matter: that’s roughly n^k/k! possibilities. This means that if my algorithm relies on a (balanced) decision tree, it must involve ceil( log( n^k/k! ) ) = ceil( k log n – log k! ) decisions.

Let us put some numbers in there. Suppose that n is 4 million and k is 1024. Then ceil( log n ) is 22, so each key in the small array involves 22 decisions under the naive approach. On a per-key basis, our tighter model gives something like log n – (log k!)/k. By Stirling’s approximation, log k! is almost k log k (more precisely, about k log k – k log e), so we can approximate the per-key count as log n – log k + log e.

Let us plug some numbers into these formulas to track the lower bound on the number of decisions per key in the small array:

k      ceil( log n )   log n – (log k!)/k
8      22              20
16     22              19
64     22              17
1024   22              13

This suggests that you may be able to do quite a bit better than k distinct binary searches.

Often, however, it is not so much the decisions that one cares about as the number of accesses. Can the value 13 in the table (for k = 1024) be taken as a lower bound on the number of accesses? Not as such, because you can access one value from the large array and then reuse it for many comparisons.

To complicate matters, our systems have caches and cache lines. The cache means that repeated accesses to the same value can be much cheaper than accesses to distinct values and may even be free (if the value remains in a register). And our cache does not operate on a per-value basis: rather, data is loaded in chunks of, say, 64 bytes (a cache line).

All in all, my derivations so far may not be highly useful in practical applications, except maybe to argue that we ought to be able to beat the multiple binary search approach.

Can we sketch such an approach? Suppose that we start by sampling B different values in the big array, at intervals of size n / ( B + 1 ). Then we issue the same k binary searches, but each targeting one of the subintervals of size n / ( B + 1 ). This incurs B + k log ( n / ( B + 1 ) ) accesses to the large array. Let us pick B so as to minimize the number of accesses: the minimum is at B = k / ln (2) – 1, which gives k / ln (2) – 1 + k log ( n ln (2) / k ) accesses to the large array. For simplicity, I am ignoring the fact that B must be an integer. I am also ignoring the cost of matching each value in the small array to the right subinterval. If I am only interested in the number of accesses to the large array, it is an irrelevant cost; even so, under the assumption that k is small, it is a small cost (O(k)).
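Here is a rough sketch (my own, under the assumptions above) of what such a two-level approach could look like; it walks once through the B samples while going over the sorted keys, then finishes each key with a binary search confined to one subinterval. I ignore rounding issues by assuming that B + 1 divides n:

#include <stddef.h>
#include <stdint.h>

size_t intersect_sampled(const uint32_t *small, size_t k,
                         const uint32_t *large, size_t n,
                         size_t B, uint32_t *out) {
  size_t count = 0;
  size_t step = n / (B + 1); // length of each subinterval
  size_t j = 0;              // index of the current sample, advanced once overall
  for (size_t i = 0; i < k; i++) {
    // walk forward through the samples: O(B + k) work in total
    while (j < B && large[(j + 1) * step] <= small[i]) j++;
    size_t lo = j * step;
    size_t hi = (j < B) ? (j + 1) * step : n;
    while (lo < hi) { // binary search within one subinterval
      size_t mid = lo + (hi - lo) / 2;
      if (large[mid] < small[i]) lo = mid + 1;
      else hi = mid;
    }
    if (lo < n && large[lo] == small[i]) out[count++] = small[i];
  }
  return count;
}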

Is k / ln (2) + k log ( n ln (2) / k ) accesses to the large array much better than the naive approach and its k log n accesses? It is unclear to me how much better it might be on real systems once caching effects are accounted for.

Credit:  Travis Downs for the initial questions and many ideas. The mistakes are mine.

Further reading: SIMD Compression and the Intersection of Sorted Integers, Software: Practice and Experience 46 (6), 2016