Performance of ranged accesses into arrays: modulo, multiply-shift and masks

Suppose that you wish to access values in an array of size n, but instead of having indexes in [0,n), you have arbitrary non-negative integers. This sort of problem happens when you build a hash table or another array-backed data structure.

The naive approach to this problem is to use the remainder of the division by n (sometimes called a modulo reduction):

uint32_t access(uint32_t * array, size_t n, size_t index) {
  return array[index % n];
}

However, if the compiler cannot inline this call and determine the value of n, then this code is likely to compile to a division instruction. Division instructions are among the slowest instructions on modern-day processors.

To avoid division, many people assume that n is a power of two. Then they use the mask trick: i & (n-1) = i % n.

uint32_t access(uint32_t * array, size_t n, size_t index) {
  return array[index & ( n - 1 )];
}

That’s typically much faster. You do pay a price, however. Your array lengths must all be a power of two. It is a small price to pay, but a price nonetheless.
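To convince yourself of the identity, you can compare the two reductions directly; a small sketch (the helper names are mine, for illustration):

```c
#include <stdint.h>

// For a power-of-two n, i % n keeps exactly the low bits of i,
// which is also what the mask n - 1 selects.
static inline uint32_t mod_reduce(uint32_t i, uint32_t n) {
  return i % n;
}

static inline uint32_t mask_reduce(uint32_t i, uint32_t n) {
  return i & (n - 1); // valid only when n is a power of two
}
```
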

Another approach I like is the multiply-shift fast alternative to the modulo reduction (see the fastrange library). It involves a multiplication followed by a shift:

uint32_t access(uint32_t * array, uint64_t n, size_t index) {
  return array[(index * n) >> 32];
}

Undeniably, the masked approach ought to be faster. You cannot get much faster than a bitwise AND. It is nearly a free instruction on modern processors. Multiplications and shifts are typically more expensive.
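Note that the multiply-shift reduction always lands in [0, n): the product of a 32-bit index and n is at most (2^32 - 1) * n, so shifting right by 32 yields a value smaller than n. A standalone sketch in the style of the fastrange library's 32-bit reduction:

```c
#include <stdint.h>

// Maps a 32-bit value x down to [0, n): x * n < 2^32 * n,
// so (x * n) >> 32 is always strictly less than n.
static inline uint32_t fastrange32(uint32_t x, uint32_t n) {
  return (uint32_t)(((uint64_t)x * (uint64_t)n) >> 32);
}
```
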

But let us measure the throughput of these operations. One thing to take into account is that if you have to do more than one such access, the processor can vectorize it. That is, it can use fast instructions that do several multiplications at once. On x64 processors, the vpmuludq instruction can do four full 32-bit by 32-bit multiplications at once.

Let us try it out with the GCC compiler (5.5):

                 scalar       vectorized
modulo           8 cycles     8 cycles
multiply-shift   2.2 cycles   1.5 cycles
mask             1.7 cycles   1.7 cycles

Clearly, my particular compiler does a poor job at optimizing the masked approach. It should be able to beat the multiply-shift approach. Yet what I think should be evident is that the approach with a mask is not massively more efficient than the multiply-shift approach, and might even be slower (depending on your compiler).

In effect, it may not be warranted to force your array lengths to be a power of two. With careful engineering, you might get much of the same performance with any array length. It is especially likely to be true if you often access several values at once in your array, because you can rely on vectorization.

Relying on the compiler to do some vectorization magic is fine in most instances, but what if you want more control? My original code looks like this…

uint32_t fastsum(uint32_t * z, uint32_t N, uint32_t * accesses, 
   uint32_t nmbr) {
  uint32_t sum = 0;
  uint64_t N64 = (uint64_t) N;
  for(uint32_t j = 0; j < nmbr ; ++j ) {
    sum += z[(accesses[j] * N64) >> 32];
  }
  return sum;
}

Here is a version with Intel intrinsics:

uint32_t vectorsum(uint32_t * z, uint32_t N, uint32_t * accesses, 
     uint32_t nmbr) {
  __m256i Nvec = _mm256_set1_epi32(N);
  __m128i sum = _mm_setzero_si128();
  for(uint32_t j = 0; j < nmbr ; j+=4) {
     // widen four 32-bit indexes into four 64-bit lanes
     __m256i fourints = _mm256_cvtepu32_epi64(
         _mm_loadu_si128((__m128i const *)(accesses + j)));
     __m256i f4 = _mm256_mul_epu32(fourints, Nvec); // four 32x32 products
     __m256i ft = _mm256_srli_epi64(f4, 32);        // keep the top 32 bits
     __m128i fi = _mm256_i64gather_epi32((int const *)z, ft, 4);
     sum = _mm_add_epi32(sum, fi);
  }
  uint32_t buffer[4];
  _mm_storeu_si128((__m128i *)buffer, sum);
  return buffer[0] + buffer[1] + buffer[2] + buffer[3];
}

The catch is that you have to process values four at a time.
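When nmbr is not a multiple of four, one option (not shown above) is to finish the leftover values with a scalar loop that applies the same multiply-shift; a hypothetical helper:

```c
#include <stdint.h>

// Finishes the last nmbr % 4 accesses that a 4-at-a-time vectorized
// loop cannot process, using the plain multiply-shift reduction.
static uint32_t scalar_tail(const uint32_t *z, uint64_t N64,
                            const uint32_t *accesses,
                            uint32_t from, uint32_t to) {
  uint32_t sum = 0;
  for (uint32_t j = from; j < to; j++) {
    sum += z[(accesses[j] * N64) >> 32];
  }
  return sum;
}
```

You would call the vectorized routine on the first nmbr - (nmbr % 4) values and then add scalar_tail(z, N, accesses, nmbr - nmbr % 4, nmbr).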

My source code is available.

Science and Technology links (August 19th, 2018)

  1. Publishing your ideas is a central component of science and scholarship. To make it easier to publish, some companies and organizations have begun to offer pay-to-publish journals and conferences where pretty much anything can get published. A Canadian economics professor has apparently been suspended without pay from his tenured job for writing an article where he documents how common this questionable practice is. His paper is entitled The rewards of predatory publications at a small business school and the abstract is as follows:

    This study is the first to compare the rewards of publishing in predatory journals with the rewards of publishing in traditional journals. It finds that the majority of faculty with research responsibilities at a small Canadian business school have publications in predatory journals. In terms of financial compensation, these publications produce greater rewards than many non-predatory journal publications. Publications in predatory journals are also positively correlated with receiving internal research awards. By improving the understanding of the incentives to publish in predatory journals, this research aims to contribute to a better-informed debate on policies dealing with predatory journals.

    Another report finds that hundreds of researchers from some of the top universities (Yale, Harvard, Stanford) also use the same trick.

  2. Scientists who publish more have more impact (per publication unit). I wonder whether this includes researchers who use bogus venues to boost their publication lists?
  3. Both children and adults have more allergies than before. I am not sure we know why.
  4. The greatest mathematician of all times, Gauss, was born in poverty from a mother who could barely read.
  5. Sometimes journals retract articles when it is discovered that they are fatally flawed. A third of the citations to these papers appear a year after the retraction, and 90% of these citations are approving.
  6. After correcting for our aging population, cancer rates have fallen 15-20 percent since 1990. The most prevalent cancer types are breast, colon and prostate cancers.
  7. Artificial intelligence conclusively helps doctors assess colonoscopies. Thus we can effectively fight against colon cancer using software.
  8. Violent video games are not associated with violent behavior.
  9. Poor sleep could lead to loneliness.
  10. Editing your genes is hard, but it is comparatively easier to silence one of your genes. The first gene-silencing drug was approved in the USA.
  11. George Monbiot believes that we owe the obesity epidemic to sugar consumption. This model pretty much contradicts everything nutrition academics are telling us.
  12. Broadly speaking, female primates live longer than male primates. I don’t think we know why.
  13. Whenever I travel in the US, I am frustrated by how slow the Internet is. The United States ranks 48 on the speed of their mobile networks. Canada and Australia are much better.
  14. Sony sold three million Playstation VR headsets.
  15. Medical doctors generally do not disclose their conflicts of interest.
  16. Spermidine delays aging in human beings. Green peas and chicken contain spermidine.

The dangers of AVX-512 throttling: a 3% impact

Intel’s latest processors come with powerful new instructions from the AVX-512 family. These instructions operate over 512-bit registers. They use more power than regular (64-bit) instructions. Thus, on some Intel processors, the processor core that is using AVX-512 might run at a lower frequency, to keep the processor from overheating.

Can we measure this effect?

In a recent post, I used a benchmark provided by Vlad Krasnov from Cloudflare, on a Xeon Gold 5120 processor. In the test provided by Krasnov, the use of AVX-512 actually made things faster.

So I just went back to an earlier benchmark I designed myself. It is a CPU-intensive Mandelbrot computation, with very few bogus AVX-512 instructions thrown in (about 32,000). The idea is that if AVX-512 causes frequency throttling, the whole computation will be slowed. I use two types of AVX-512 instructions: light (additions) and heavy (multiplications).
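The CPU-intensive part of such a benchmark is a classical escape-time iteration; a scalar sketch of the kernel (the actual code, with its interleaved AVX-512 instructions, is in my repository):

```c
#include <stdint.h>

// Classical Mandelbrot escape-time count for the point (cx, cy):
// iterate z <- z^2 + c until |z|^2 exceeds 4 or we hit maxiter.
static uint32_t mandelbrot_iterations(double cx, double cy,
                                      uint32_t maxiter) {
  double x = 0.0, y = 0.0;
  uint32_t i = 0;
  while (i < maxiter && x * x + y * y <= 4.0) {
    double xtemp = x * x - y * y + cx; // real part of z^2 + c
    y = 2.0 * x * y + cy;              // imaginary part of z^2 + c
    x = xtemp;
    i++;
  }
  return i;
}
```
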

I measured no AVX-512 throttling on the Skylake X server I own… but what about the Xeon Gold 5120 processor?

I run the benchmark ten times and measure the wall-clock time using the Linux/bash time command. I sleep 2 seconds after each sequence of ten tests. A complete script is provided. In practice, I just run the script after typing make and I record the user timings of each test.

Because there are run-to-run variations, I repeat the whole process several times. Here are my raw numbers:

trial   No AVX-512   Light AVX-512   Heavy AVX-512
1       13.23 s      13.64 s         13.51 s
2       13.15 s      13.49 s         13.53 s
3       13.10 s      13.59 s         13.51 s
4       13.07 s      13.54 s         13.59 s
5       13.08 s      13.52 s         13.51 s

So the No-AVX-512 program runs in about 13.1 s. The AVX-512 ones run in about 13.5 s. Thus AVX-512 incurs a 3% penalty. I can’t measure a difference between light and heavy AVX-512 instructions.

It is the first time ever that I am able to measure a negative difference that can be attributed, presumably, to AVX-512 throttling. It is quite exciting.

Is that a lot? It is hard for me to get terribly depressed at the fact that a benchmark I specifically designed to make AVX-512 look bad sees a 3% performance degradation on one core. Real code is not going to use AVX-512 in such a manner: the AVX-512 instructions will do useful work. It is not super difficult to recoup a 3% difference.

Of course, maybe my adversarial benchmark is not sufficiently bad. That may well be: please provide me with your examples.

Fast strongly universal 64-bit hashing everywhere!

In software, hashing is the process of taking a value and mapping it to a random-looking value. Suppose you are given 64-bit integers (a long in Java). You might want to “hash” these integers to other 64-bit values.

There are many good ways to achieve this result, but let me add some constraints:

  1. The hashing should be strongly universal, also called pairwise independent. This means that if h is a hash function picked at random, then the knowledge that h(x)=y for some x and some y gives you absolutely no information about the value of h(x') for x' distinct from x. That's not, as it may appear, a security/cryptography issue, but more of a useful constraint for probabilistic algorithms. Indeed, there are many "probabilistic" algorithms that require different, and "independent", hash functions. You want to be absolutely sure that your hash functions are unrelated. Strong universality is not perfect independence, but it is pretty good in practice.
  2. You don’t want to have large look-up tables occupying your cache.
  3. You want code that works efficiently in most programming languages (including, say, Java).
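Item 1 has a compact textbook restatement. A family H of hash functions into a range R is strongly universal when, for a randomly picked h in H:

```latex
\Pr_{h \in H}\left[\, h(x) = y \;\text{and}\; h(x') = y' \,\right]
= \frac{1}{|R|^2}
\qquad \text{for all } x \neq x' \text{ and all } y, y' \in R.
```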

My proposal is as follows. Instead of trying to hash the 64-bit values to other 64-bit values directly, we can hash them to 32-bit values. If you repeat twice (using two different hash functions), you get the 64-bit result you seek. All that is needed per hash function are three 64-bit values chosen at random, and then two multiplications, two additions and a single shift. The two multiplications are faster than they appear because they have no data dependency and can be executed simultaneously. To get a full 64-bit output, you thus need four multiplications.

// in Java, long is 64-bit, int 32-bit

long a,b,c; // randomly assigned 64-bit values

int hash32(long x) {
  int low = (int)x;
  int high = (int)(x >>> 32);
  return (int)((a * low + b * high + c) >>> 32);
}

How fast is it? We can try to compare it against a standard bit-mixing function:

long murmur64(long h) {
  h ^= h >>> 33;
  h *= 0xff51afd7ed558ccdL;
  h ^= h >>> 33;
  h *= 0xc4ceb9fe1a85ec53L;
  h ^= h >>> 33;
  return h;
}

This bit-mixing function is “obviously” faster. It has half the number of multiplications, and none of the additions. However, in my tests, the difference is less than you might expect (only about 50%). Moreover if you do need two 32-bit hash values, the 64-bit mixing function loses much of its edge and is only about 25% faster.
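For readers outside Java, here is my own C transcription of the strongly universal scheme (an illustration, not code from the paper; in C the unsigned 64-bit arithmetic is explicit):

```c
#include <stdint.h>

// Strongly universal 64-bit -> 32-bit hash. a, b, c are random
// 64-bit keys; two multiplications, two additions, one shift,
// all modulo 2^64.
static uint32_t hash32(uint64_t x, uint64_t a, uint64_t b, uint64_t c) {
  uint64_t low = (uint32_t)x;  // low 32 bits of the input
  uint64_t high = x >> 32;     // high 32 bits of the input
  return (uint32_t)((a * low + b * high + c) >> 32);
}
```

Running it twice with two independent triples (a, b, c) and concatenating the two 32-bit outputs gives the 64-bit result.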

My code is available so you can run your own Java benchmark.

How do I know that this hashing function is, indeed, strongly universal? We wrote a paper about it: Strongly universal string hashing is fast. At the time, we were interested in hashing strings. The trick is to view a 64-bit word as a string of two 32-bit words. We could extend the same trick to 128-bit inputs or, indeed, inputs of any length.

The dangers of AVX-512 throttling: myth or reality?

Modern processors use many tricks to go faster. They are superscalar which means that they can execute many instructions at once. They are multicore, which means that each CPU is made of several baby processors that are partially independent. And they are vectorized, which means that they have instructions that can operate over wide registers (spanning 128, 256 or even 512 bits).

Regarding vectorization, Intel is currently ahead of the curve with its AVX-512 instruction sets. They have the only commodity-level processors able to work over 512-bit registers. AMD is barely doing 256 bits and your phone is limited to 128 bits.

The more you use your CPU, the more heat it produces. Intel does not want your CPU to burn out. So it throttles the CPU (makes it run slower). Your CPU stays warm but not too hot. When the processor does not have to use AVX-512 instructions, some of the silicon remains dark, and thus less heat is generated.

The cores that are executing SIMD instructions may run at a lower frequency. Thankfully, Intel has per-core voltages and frequencies. And Intel processors can switch frequency quickly, certainly within a millisecond.

Vlad Krasnov from Cloudflare wrote a blog post last year warning us against AVX-512 throttling:

If you do not require AVX-512 for some specific high performance tasks, I suggest you disable AVX-512 execution on your server or desktop, to avoid accidental AVX-512 throttling.

I am sure that it is the case that AVX-512 can cause problems for some use cases. It is also the case that some people die if you give them aspirin; yet we don’t retire aspirin.

Should we really disable AVX-512 as a precautionary stance?

In an earlier blog post, I tried to measure this throttling on a server I own but found no effect whatsoever. It could be that this server does not have AVX-512 throttling or maybe I did not make use of sufficiently expensive instructions.

Vlad offered me a test case in C. His test case involves AVX-512 multiplications, while much of the running time is spent on some bubble-sort routine. It can run in both AVX-512 mode and in the regular (non-AVX-512) mode. To be clear, it is not meant to be a demonstration in favour of AVX-512: it is meant to show that AVX-512 can be detrimental.

I did not want to run my tests using my own server this time. So I went to Packet and launched a powerful two-CPU Xeon Gold server (Intel Xeon Gold 5120 CPU @ 2.20GHz). Each of these processors has 14 cores, so we have 28 cores in total. With hyperthreading, the machine supports up to 56 hardware threads (two per core).

threads   AVX-512 disabled   with AVX-512
20        8.4 s              7.2 s
40        6.0 s              5.0 s
80        5.7 s              4.7 s

In an earlier test, I was using an older compiler (GCC 5) and when exceeding the number of physically supported threads (56), I was getting poorer results, especially with AVX-512. I am not sure why that would be but I suspect it is not related to throttling; it might have to do with context switching and register initialization (though that is speculation on my part).

In my latest test, I use a more recent compiler (GCC 7). As you can see, the AVX-512 version is always faster. Otherwise, I see no negative effect from the application of AVX-512. If there is throttling, it appears that the benefits of AVX-512 offset it.

My code is available along with all the scripts and the outputs for your inspection. You should be able to reproduce my results. It is not like Xeon Gold processors are magical faeries: anyone can grab an instance. For the record, the bill I got from Packet was $2.

Note: I have no conflict of interest to disclose. I do not own Intel stock.

Instructions: On Packet, hit “deploy servers”. Choose Sunnyvale CA and choose m2.xlarge.x86. I went with the latest Ubuntu (18.04). You can then access it through ssh. It is a pain that GCC is not already installed once the server is deployed, so that sudo apt-get install gcc fails. I fixed the problem by prepending “us.” to hyperlinks in /etc/apt/sources.list and running sudo apt-get update.

Science and Technology links (August 10th, 2018)

  1. There are far fewer forest fires now than there were 15 years ago.
  2. The Earth is getting greener. There are more forests:

    We show that—contrary to the prevailing view that forest area has declined globally—tree cover has increased by 2.24 million km2 (+7.1% relative to the 1982 level). Global bare ground cover has decreased by 1.16 million km2 (−3.1%), most notably in agricultural regions in Asia.

  3. You can buy a fast 256GB memory card for barely more than $100. This is more than enough to record everything you hear during two years. It is enough to record everything you see for a week.
  4. The widespread usage of anti-cold medication helps to spread infections. If you want to be generous toward others, you should tough out your cold symptoms.
  5. I am on an all-bacon diet: Sugars (carbohydrates) increase human mortality, whereas fat reduces it.
  6. Workplace wellness programs are supposed to promote health among employees and to eventually improve productivity. In practice, they are useless as far as improving health. Yet they can help employers find out which employees are health-conscious. Thus, as an employee, you probably want to participate to signal your good health.
  7. Fewer than 15% of psychology researchers are willing to share their data with a data curation site (with or without restriction). It seems that many of these researchers are unconcerned by the fact that these records will surely go missing once they retire or die.
  8. Governments typically do not publish the results of the clinical trials they hold. A landmark decision forces the Canadian government to publish a specific set of results. There is hope that more could come in the future.
  9. Belly fat is associated with reduced cognitive function in older adults.
  10. If you are a woman, you have better odds if your doctor is a woman.
  11. High blood sugar decreases your odds if you have cancer.
  12. Why do we have 23 pairs of chromosomes and not 24 or 12, or just one chromosome? We do not know. Two teams of scientists changed the genome of yeast so it has just one chromosome, and it is almost fine.
  13. Ford Motors is trying out exoskeletons with its line employees.
  14. OpenAI defeated a team of great Dota 2 players. Dota 2 is one of the top competitive e-games around. Defeating a team of human players is a high bar.
  15. High intelligence is associated with low social conservatism and high economic conservatism. That is, smart people don’t care that their kids’ teacher is a trans-gender lesbian who smokes pot, but they favour capitalism and small governments. Oddly enough, it is hard to find political parties supporting such a mix of values.
  16. New motor and other skills can be acquired at any age.
  17. According to an article in Nature, climate change policies might increase food insecurity, more so than climate change itself.
  18. Genetically modified silkworms produce spider silk. Assuming you want lots of spider silk, that’s probably cheaper than raising spiders. I know at least one superhero who will be happy about that.
  19. Governments have few technology workers and are typically behind the times. This has apparently nothing to do with low salaries and is all down to bad management:

    After exploiting within-person variation—that is, comparing earnings for a given individual before versus after they make a switch into or out of the public sector—I find that public sector workers earn a 3.9% annual earnings premium. Moreover, they report a greater incidence and quality of non-wage benefits, which implies that their overall compensation premium is even larger. Second, public sector workers tend to report much lower levels of work-place practices, including fewer career opportunities, less intellectually stimulating work, and less responsibility and scope in their work.

    To put it bluntly, government jobs pay well, but they are boring. Young talented hackers would rather go do something fun.

  20. A robot can find Waldo. I bet that this robot was not built by a government employee.

Science and Technology links (August 4th, 2018)

  1. About 80% of the ocean remains unmapped and unexplored.
  2. Even a mild concussion (a fall on your head) can double your risk of dementia (e.g., Alzheimer’s).
  3. Apple is selling approximately two smart watches for each laptop. And they are selling about three iPads for each laptop. (I do not own an Apple watch: I don’t know what I’d use it for.)
  4. Researchers from OpenAI have trained a human-like robot hand to manipulate objects like we would.
  5. Diabetes is not good for your brain:

    Compared with participants with normoglycaemia, the multivariable-adjusted rate of global cognitive decline associated with prediabetes and diabetes was increased (…) Similarly, memory, executive function and orientation z scores showed an increased rate of cognitive decline with diabetes.

  6. We have all been told to eat whole grain cereals. Do they keep the doctor away? Maybe not:

    There is insufficient evidence from randomised controlled trials to date to recommend consumption of whole grain diets to reduce the risk of cardiovascular disease, or lower blood cholesterol, or blood pressure.

  7. We are using much less land than we used to, which means that there is more land for forest and wild animals. People in 2014 used about a third of the land, compared with people from 1961, on a per capita basis.
  8. In the USA, energy use per capita has fallen by 20% since a peak in the late 1970s. In fact, in most occidental countries, energy use per capita has been falling for the last decade. Distance driven by cars, per capita, is also either plateauing or declining.

Getting 4 bytes or a full cache line: same speed or not?

Many software operations are believed to be “memory bound”, meaning that the processor spins empty while waiting for data to come from memory. To compensate, our processors rely on fast cache memory. If your problem fits in cache, it will typically run much faster than if the processor constantly needs to query the memory subsystem.

If you need to retrieve a block of data, the processor does not retrieve just the necessary bytes. It retrieves data in units of a “cache line” which is typically 64 bytes on Intel processors. So the first 64 bytes of memory addresses are on the first cache line, the next 64 bytes are on the second cache line, and so forth.
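Concretely, with 64-byte lines, the cache line an address falls on is just the address divided by 64; a small sketch (the helper is mine, assuming the common 64-byte line size):

```c
#include <stdint.h>

// With 64-byte cache lines, dividing a byte address by 64 gives the
// cache-line index: bytes 0..63 map to line 0, bytes 64..127 to
// line 1, and so forth.
static inline uint64_t cache_line_index(uint64_t byte_address) {
  return byte_address >> 6; // 2^6 = 64 bytes per line
}
```
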

To be clear, it means that you should not expect it to be faster to retrieve only 8 bytes of data rather than 64 bytes of data, assuming that all the data in question lies in a single cache line. That is what I expect most programmers to tell you.

Let me state this as a “law”:

Given a large out-of-cache memory block, the time (latency) required to access 4, 8, 16 or 64 bytes of data from a cache line is the same.

How well does this mental model work in the real world?

Suppose that you build a large array of 32-bit integers. There are sixteen 32-bit integers per cache line. The array is large enough to exceed your cache by a wide margin. Then you repeatedly select a cache line at random and sum up a few 32-bit integers in the cache line. Maybe you just pick one integer per cache line, maybe you pick four, maybe eight, maybe sixteen. Your code might look like the following, where percache is the number of integers you access per cache line:

uint32_t counter = 0;
for(size_t i = 0; i < howmany; i++) {
    size_t idx = pick_address_of_random_cache_line();
    // there are 16 uint32_t per 64-byte cache line...
    for(size_t j = 0; j < percache; j++) {
       counter += array[16 * idx + j];
    }
}

So, how sensitive is the running time to the parameter percache? A simple theory is that if the array is large enough, it does not matter whether you access one, two, four or sixteen values per cache line.

Let us run an experiment to see. I generate an 8 GB array, it far exceeds my CPU cache. Then I vary the number of integers accessed per cache line, and I sum it all up.

(table: integers accessed per cache line vs. cycles per cache line)

So even though I am randomly accessing cache lines, out of cache, whether I access one, eight, or sixteen 32-bit values per cache line matters a great deal.

To make things interesting, I can try to accelerate these operations by using SIMD instructions. SIMD instructions can do many additions at once. Let us see…

(table: integers accessed per cache line vs. cycles per cache line)

Wow! So hand-tuned vectorization helped this problem a lot, even though I am randomly accessing cache lines from a large array.

Let me twist the problem around. After accessing each cache line, and updating my counter by summing up one, two, four, eight or sixteen values, I am going to change the state of my random number generator in a way that depends on the newly computed counter. This makes the problem harder for my computer because, before it can fetch the next cache line, it absolutely must wait for the computation of the counter to be finished.
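To create that dependency, it suffices to fold the counter back into the generator's state. A sketch with a simple xorshift-style step (hypothetical; my benchmark uses its own generator):

```c
#include <stdint.h>

// One xorshift64 step. XORing `counter` into the state first means
// the next random cache-line pick cannot start before the freshly
// computed sum is available: the loads become serially dependent.
// (Real code must also avoid the all-zero state, which is a fixed
// point of xorshift.)
static uint64_t next_state(uint64_t state, uint32_t counter) {
  state ^= counter; // data dependency on the computed sum
  state ^= state << 13;
  state ^= state >> 7;
  state ^= state << 17;
  return state;
}
```
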

(table: integers accessed per cache line vs. cycles per cache line)

As we can see, the problem is about an order of magnitude harder. However, even then, whether I access one or sixteen values per cache line matters: the difference is about 10%. It is not negligible.

Thus it might be a lot harder than you expect to be “memory bound”. Counting the number of cache lines accessed is not a sane way to measure the efficiency of an out-of-cache algorithm.

My code is available.

Science and Technology links (July 27th, 2018)

  1. It is frequently argued that intelligence requires great complexity. But complex compared to what? To put things in perspective, my friend Leonid points out that tiny worms have simple brains and yet they can compute the solution to several problems:

    C elegans [a common worm] has less than 500 neural cells, but it has basic sensory system and muscle control! It can reproduce and mate.

  2. Almost everywhere in the world, people spend more time and money in schools. Sadly this does not obviously translate into smarter people.

    (…) if we want to increase education, it’s not enough to send kids to school. It’s important that attending school also translates into acquiring specific skills.

  3. Stronger men might be less fertile.
  4. In 2016, the fertility rate in the United States was the lowest it has ever been: 1,765 births per 1,000 women, below the replacement rate of 2,100.
  5. Japanese smoke about 50% more cigarettes than Americans. Yet Americans have almost twice as many lung cancers. Approximately half of the supercentenarians (110 years+) in Okinawa have a history of smoking.
  6. The death rate from Alzheimer’s is more than 5x lower in Japan compared with the United States.
  7. Cannabis has no apparent carcinogenic effect. Yet a common joint has 4 times more cancer causing material than a single cigarette.
  8. A drug that enhances garbage disposal in the brain made old mice smarter.
  9. There is a large underground lake on Mars.
  10. Some worms can eat and digest plastic; presumably, this includes plastic straws.

Writing about software and building software are different jobs

Code taken from a blog post is meant to illustrate an idea. Blogging is literature, not engineering. Don’t build production systems by copying and pasting random code from the Internet. It will not end well.

A reader commented that some of my C code from an earlier blog post is not proper C:

uint32_t random_bounded(uint32_t range) {
  uint64_t random32bit =  random32(); //32-bit random number 
  multiresult = random32bit * range;
  return multiresult >> 32;
}

This is true: the type of the “multiresult” variable is not specified. In that post, I deliberately omitted to specify the type because anyone familiar with C can correctly infer it. In fact, a large fraction of the code you will find on my blog is similarly simplified. I often remove special cases, I roll my loops and so forth. That’s because I hope to make it more readable. Real code is often filled with boring details.
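For completeness, here is the inferred declaration filled in, with a stub random32 so the example stands alone (the stub generator and its constants are mine, not from the original post, which links to the proper source):

```c
#include <stdint.h>

// Stub 32-bit generator for illustration only: a linear congruential
// step with the classic Numerical Recipes constants.
static uint32_t seed = 12345u;
static uint32_t random32(void) {
  seed = seed * 1664525u + 1013904223u;
  return seed;
}

uint32_t random_bounded(uint32_t range) {
  uint64_t random32bit = random32();          // 32-bit random number
  uint64_t multiresult = random32bit * range; // the omitted declaration
  return (uint32_t)(multiresult >> 32);       // always in [0, range)
}
```
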

It is not that I don’t believe in providing working code. I actually think it is super important and I do it systematically. For example, the post points to the source code in correct C. This blog post is accompanied by a research paper. The research paper points to a GitHub repository that provides reproducible experiments. This code was even peer-reviewed along with the research paper.

However, when someone explains a concept and provides a code sample, I take that as literature, not engineering. You are meant to read the code, not use it as a ready-made reusable artefact.

Simplification is not the same as carelessness, and simplified is not the same as wrong. You should report typographical errors and the like, but if you take code that served to illustrate an idea and copy it mindlessly into your production systems, you have missed a step. You are meant to read the code, not reuse it.