The dangers of AVX-512 throttling: a 3% impact

Intel’s latest processors come with powerful new instructions from the AVX-512 family. These instructions operate over 512-bit registers. They use more power than regular (64-bit) instructions. Thus, on some Intel processors, the processor core that is using AVX-512 might run at a lower frequency, to keep the processor from overheating.

Can we measure this effect?

In a recent post, I used a benchmark provided by Vlad Krasnov from Cloudflare, on a Xeon Gold 5120 processor. In the test provided by Krasnov, the use of AVX-512 actually made things faster.

So I went back to an earlier benchmark I designed myself. It is a CPU-intensive Mandelbrot computation, with very few bogus AVX-512 instructions thrown in (about 32,000). The idea is that if AVX-512 causes frequency throttling, the whole computation will be slowed. I use two types of AVX-512 instructions: light (additions) and heavy (multiplications). I run the benchmark ten times and measure the wall-clock time using the Linux/bash time command. I sleep 2 seconds after each sequence of ten tests. A complete script is provided. In practice, I just run the benchmark.sh script after typing make and I record the user timings of each test.

Because there are run-to-run variations, I repeat the whole process several times. Here are my raw numbers:

trial   No AVX-512   Light AVX-512   Heavy AVX-512
1       13.23 s      13.64 s         13.51 s
2       13.15 s      13.49 s         13.53 s
3       13.10 s      13.59 s         13.51 s
4       13.07 s      13.54 s         13.59 s
5       13.08 s      13.52 s         13.51 s

So the No-AVX-512 program runs in about 13.1 s. The AVX-512 ones run in about 13.5 s. Thus AVX-512 incurs a 3% penalty. I can’t measure a difference between light and heavy AVX-512 instructions.

It is the first time ever that I am able to measure a negative difference that can be attributed, presumably, to AVX-512 throttling. It is quite exciting.

Is that a lot? It is hard for me to get terribly depressed at the fact that a benchmark I specifically designed to make AVX-512 look bad sees a 3% performance degradation on one core. Real code is not going to use AVX-512 in such a manner: the AVX-512 instructions will do useful work. It is not super difficult to recoup a 3% difference.

Of course, maybe my adversarial benchmark is not sufficiently bad. That may well be: please provide me with your examples.

Fast strongly universal 64-bit hashing everywhere!

In software, hashing is the process of taking a value and mapping it to a random-looking value. Suppose you are given 64-bit integers (a long in Java). You might want to “hash” these integers to other 64-bit values.

There are many good ways to achieve this result, but let me add some constraints:

  1. The hashing should be strongly universal, also called pairwise independent. This means that if h is a hash function picked at random, then the knowledge that h(x)=y for some x and some y gives you absolutely no information about the value of h(x') for x' distinct from x. That is not, as it may appear, a security/cryptography issue, but rather a useful constraint for probabilistic algorithms. Indeed, many “probabilistic” algorithms require different, “independent”, hash functions, and you want to be absolutely sure that your hash functions are unrelated. Strong universality is not perfect independence, but it is pretty good in practice.
  2. You don’t want to have large look-up tables occupying your cache.
  3. You want code that works efficiently in most programming languages (including, say, Java).

My proposal is as follows. Instead of trying to hash the 64-bit values to other 64-bit values directly, we can hash them to 32-bit values. If you repeat the process twice (using two different hash functions), you get the 64-bit result you seek. All that is needed per hash function are three 64-bit values chosen at random, and then two multiplications, two additions and a single shift. The two multiplications are faster than they appear because they can be executed simultaneously, as there is no data dependency. To get a full 64-bit output, you thus need four multiplications.

// in Java, long is 64-bit, int 32-bit

long a,b,c; // randomly assigned 64-bit values

int hash32(long x) {
  int low = (int)x;
  int high = (int)(x >>> 32);
  return (int)((a * low + b * high + c) >>> 32);
}

How fast is it? We can try to compare it against a standard bit-mixing function:

long murmur64(long h) {
  h ^= h >>> 33;
  h *= 0xff51afd7ed558ccdL;
  h ^= h >>> 33;
  h *= 0xc4ceb9fe1a85ec53L;
  h ^= h >>> 33;
  return h;
}

This bit-mixing function is “obviously” faster. It has half the number of multiplications, and none of the additions. However, in my tests, the difference is less than you might expect (only about 50%). Moreover if you do need two 32-bit hash values, the 64-bit mixing function loses much of its edge and is only about 25% faster.

My code is available so you can run your own Java benchmark.

How do I know that this hashing function is, indeed, strongly universal? We wrote a paper about it: Strongly universal string hashing is fast. At the time, we were interested in hashing strings. The trick is to view a 64-bit word as a string of two 32-bit words. We could extend the same trick to 128-bit inputs or, indeed, inputs of any length.
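To make the construction concrete, here is a sketch of the same trick in C. The constants below are arbitrary fixed values standing in for the randomly chosen keys; in real use you would draw them from a random source.

```c
#include <stdint.h>

/* Arbitrary placeholder constants: in real use, draw a, b, c at random
   for each of the two hash functions. */
static const uint64_t A1 = 0x65d200ce55b19ad8ULL, B1 = 0x4f2162926e40c299ULL,
                      C1 = 0x162dd799029970f8ULL;
static const uint64_t A2 = 0x68b665e6872bd1f4ULL, B2 = 0xb6cfcf9d79b51db2ULL,
                      C2 = 0x7a2b92ae912898c2ULL;

/* strongly universal 64-bit to 32-bit hash:
   two multiplications, two additions, one shift */
static uint32_t hash32(uint64_t x, uint64_t a, uint64_t b, uint64_t c) {
  uint32_t low = (uint32_t)x;
  uint32_t high = (uint32_t)(x >> 32);
  return (uint32_t)((a * low + b * high + c) >> 32);
}

/* full 64-bit output: concatenate two independent 32-bit hashes */
static uint64_t hash64(uint64_t x) {
  return ((uint64_t)hash32(x, A1, B1, C1) << 32) | hash32(x, A2, B2, C2);
}
```

The two hash32 calls are independent, so a superscalar processor can execute their multiplications concurrently.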

The dangers of AVX-512 throttling: myth or reality?

Modern processors use many tricks to go faster. They are superscalar which means that they can execute many instructions at once. They are multicore, which means that each CPU is made of several baby processors that are partially independent. And they are vectorized, which means that they have instructions that can operate over wide registers (spanning 128, 256 or even 512 bits).

Regarding vectorization, Intel is currently ahead of the curve with its AVX-512 instruction sets. They have the only commodity-level processors able to work over 512-bit registers. AMD is barely doing 256 bits and your phone is limited to 128 bits.

The more you use your CPU, the more heat it produces. Intel does not want your CPU to burn out. So it throttles the CPU (makes it run slower). Your CPU stays warm but not too hot. When the processor does not have to use AVX-512 instructions, some of the silicon remains dark, and thus less heat is generated.

The cores that are executing SIMD instructions may run at a lower frequency. Thankfully, Intel supports per-core voltages and frequencies. And Intel processors can switch frequency very fast, certainly within a millisecond.

Vlad Krasnov from Cloudflare wrote a blog post last year warning us against AVX-512 throttling:

If you do not require AVX-512 for some specific high performance tasks, I suggest you disable AVX-512 execution on your server or desktop, to avoid accidental AVX-512 throttling.

I am sure that it is the case that AVX-512 can cause problems for some use cases. It is also the case that some people die if you give them aspirin; yet we don’t retire aspirin.

Should we really disable AVX-512 as a precautionary stance?

In an earlier blog post, I tried to measure this throttling on a server I own but found no effect whatsoever. It could be that this server does not have AVX-512 throttling or maybe I did not make use of sufficiently expensive instructions.

Vlad offered me a test case in C. His test case involves AVX-512 multiplications, while much of the running time is spent on some bubble-sort routine. It can run in both AVX-512 mode and in the regular (non-AVX-512) mode. To be clear, it is not meant to be a demonstration in favour of AVX-512: it is meant to show that AVX-512 can be detrimental.

I did not want to run my tests using my own server this time. So I went to Packet and launched a powerful two-CPU Xeon Gold server (Intel Xeon Gold 5120 CPU @ 2.20GHz). Each of these processors has 14 cores, so we have 28 cores in total. Because of hyperthreading, the machine supports up to 56 hardware threads (two threads per core).

threads   AVX-512 disabled   with AVX-512
20        8.4 s              7.2 s
40        6.0 s              5.0 s
80        5.7 s              4.7 s

In an earlier test, I was using an older compiler (GCC 5) and when exceeding the number of physically supported threads (56), I was getting poorer results, especially with AVX-512. I am not sure why that would be but I suspect it is not related to throttling; it might have to do with context switching and register initialization (though that is speculation on my part).

In my latest test, I use a more recent compiler (GCC 7). As you can see, the AVX-512 version is always faster. Otherwise, I see no negative effect from the application of AVX-512. If there is throttling, it appears that the benefits of AVX-512 offset it.

My code is available along with all the scripts and the outputs for your inspection. You should be able to reproduce my results. It is not like Xeon Gold processors are magical faeries: anyone can grab an instance. For the record, the bill I got from Packet was $2.

Note: I have no conflict of interest to disclose. I do not own Intel stock.

Instructions: On Packet, hit “deploy servers”. Choose Sunnyvale CA and choose m2.xlarge.x86. I went with the latest Ubuntu (18.04). You can then access it through ssh. It is a pain that GCC is not already installed once the server is deployed, so that sudo apt-get install gcc fails. I fixed the problem by prepending “us.” to hyperlinks in /etc/apt/sources.list and running sudo apt-get update.

Science and Technology links (August 10th, 2018)

  1. There are far fewer forest fires now than there were 15 years ago.
  2. The Earth is getting greener. There are more forests:

    We show that—contrary to the prevailing view that forest area has declined globally—tree cover has increased by 2.24 million km2 (+7.1% relative to the 1982 level). Global bare ground cover has decreased by 1.16 million km2 (−3.1%), most notably in agricultural regions in Asia.

  3. You can buy a fast 256GB memory card for barely more than $100. This is more than enough to record everything you hear during two years. It is enough to record everything you see for a week.
  4. The widespread usage of anti-cold medication helps to spread infections. If you want to be generous toward others, you should tough out your cold symptoms.
  5. I am on an all-bacon diet: Sugars (carbohydrates) increase human mortality, whereas fat reduces it.
  6. Workplace wellness programs are supposed to promote health among employees and to eventually improve productivity. In practice, they are useless as far as improving health. Yet they can help employers find out which employees are health-conscious. Thus, as an employee, you probably want to participate to signal your good health.
  7. Fewer than 15% of psychology researchers are willing to share their data with a data curation site (with or without restriction). It seems that many of these researchers are unconcerned by the fact that these records will surely go missing once they retire or die.
  8. Governments typically do not publish the results of the clinical trials they hold. A landmark decision forces the Canadian government to publish a specific set of results. There is hope that more could come in the future.
  9. Belly fat is associated with reduced cognitive function in older adults.
  10. If you are a woman, you have better odds if your doctor is a woman.
  11. High blood sugar decreases your odds if you have cancer.
  12. Why do we have 23 pairs of chromosomes and not 24 or 12, or just one chromosome? We do not know. Two teams of scientists changed the genome of yeast so it has just one chromosome, and it is almost fine.
  13. Ford Motors is trying out exoskeletons with its line employees.
  14. OpenAI defeated a team of great Dota 2 players. Dota 2 is one of the top competitive e-games around. Defeating a team of human players is a high bar.
  15. High intelligence is associated with low social conservatism and high economic conservatism. That is, smart people don’t care that their kids’ teacher is a trans-gender lesbian who smokes pot, but they favour capitalism and small governments. Oddly enough, it is hard to find political parties supporting such a mix of values.
  16. New motor and other skills can be acquired at any age.
  17. According to an article in Nature, climate change policies might increase food insecurity, more so than climate change itself.
  18. Genetically modified silkworms produce spider silk. Assuming you want lots of spider silk, that’s probably cheaper than raising spiders. I know at least one superhero who will be happy about that.
  19. Governments have few technology workers and are typically behind the time. This has apparently nothing to do with low salaries and is all down to bad management:

    After exploiting within-person variation—that is, comparing earnings for a given individual before versus after they make a switch into or out of the public sector—I find that public sector workers earn a 3.9% annual earnings premium. Moreover, they report a greater incidence and quality of non-wage benefits, which implies that their overall compensation premium is even larger. Second, public sector workers tend to report much lower levels of work-place practices, including fewer career opportunities, less intellectually stimulating work, and less responsibility and scope in their work.

    To put it bluntly, government jobs pay well, but they are boring. Young talented hackers would rather go do something fun.

  20. A robot can find Waldo. I bet that this robot was not built by a government employee.

Science and Technology links (August 4th, 2018)

  1. About 80% of the ocean remains unmapped and unexplored.
  2. Even a mild concussion (a fall on your head) can double your risk of dementia (e.g., Alzheimer’s).
  3. Apple is selling approximately two smart watches for each laptop. And they are selling about three iPads for each laptop. (I do not own an Apple watch: I don’t know what I’d use it for.)
  4. Researchers from OpenAI have trained a human-like robot hand to manipulate objects like we would.
  5. Diabetes is not good for your brain:

    Compared with participants with normoglycaemia, the multivariable-adjusted rate of global cognitive decline associated with prediabetes and diabetes was increased (…) Similarly, memory, executive function and orientation z scores showed an increased rate of cognitive decline with diabetes.

  6. We have all been told to eat whole grain cereals. Do they keep the doctor away? Maybe not:

    There is insufficient evidence from randomised controlled trials to date to recommend consumption of whole grain diets to reduce the risk of cardiovascular disease, or lower blood cholesterol, or blood pressure.

  7. We are using much less land than we used to, which means that there is more land for forests and wild animals. On a per capita basis, people in 2014 used about a third of the land that people used in 1961.
  8. In the USA, energy use per capita has fallen by 20% since a peak in the late 1970s. In fact, in most occidental countries, energy use per capita has been falling for the last decade. Distance driven by cars, per capita, is also either plateauing or on the decline.

Getting 4 bytes or a full cache line: same speed or not?

Many software operations are believed to be “memory bound”, meaning that the processor spins empty while waiting for data to come from memory. To compensate, our processors rely on fast cache memory. If your problem fits in cache, it will typically run much faster than if the processor constantly needs to query the memory subsystem.

If you need to retrieve a block of data, the processor does not retrieve just the necessary bytes. It retrieves data in units of a “cache line” which is typically 64 bytes on Intel processors. So the first 64 bytes of memory addresses are on the first cache line, the next 64 bytes are on the second cache line, and so forth.
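As a quick illustration, with 64-byte cache lines the cache line that holds a given memory address is simply the address divided by 64:

```c
#include <stdint.h>

/* assuming 64-byte cache lines, as on typical Intel processors */
static uint64_t cache_line_index(uint64_t addr) {
  return addr >> 6; /* divide by 64 */
}
```

Bytes 0 to 63 fall on line 0, bytes 64 to 127 on line 1, and so forth.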

To be clear, it means that you should not expect it to be faster to retrieve only 8 bytes of data rather than 64 bytes of data, assuming that all the data in question lies in a single cache line. That is what I expect most programmers to tell you.

Let me state this as a “law”:

Given a large out-of-cache memory block, the time (latency) required to access 4, 8, 16 or 64 bytes of data from a cache line is the same.

How well does this mental model work in the real world?

Suppose that you build a large array of 32-bit integers. There are sixteen 32-bit integers per cache line. The array is large enough to exceed your cache by a wide margin. Then you repeatedly select a cache line at random and sum up a few of its 32-bit integers. Maybe you just pick one integer per cache line, maybe you pick four, maybe eight, maybe sixteen. Your code might look like the following, where percache is the number of integers you access per cache line:

uint32_t counter = 0;
for(size_t i = 0; i < howmany; i++) {
    size_t idx = pick_address_of_random_cache_line();
    // there are 16 uint32_t per 64-byte cache line... 
    for(size_t j = 0; j < percache; j++) {
       counter += array[16 * idx + j];
    }
}
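For completeness, the fragment above can be fleshed out into a self-contained sketch; a simple xorshift generator stands in for pick_address_of_random_cache_line, and the array here is much smaller than the 8 GB used in the actual experiment:

```c
#include <stddef.h>
#include <stdint.h>

/* simple xorshift64 generator standing in for
   pick_address_of_random_cache_line() */
static uint64_t rng_state = 88172645463325252ULL;
static uint64_t xorshift64(void) {
  rng_state ^= rng_state << 13;
  rng_state ^= rng_state >> 7;
  rng_state ^= rng_state << 17;
  return rng_state;
}

/* sum 'percache' integers from each of 'howmany' randomly chosen
   cache lines; there are 16 uint32_t per 64-byte cache line */
static uint32_t sum_random_lines(const uint32_t *array, size_t cachelines,
                                 size_t howmany, size_t percache) {
  uint32_t counter = 0;
  for (size_t i = 0; i < howmany; i++) {
    size_t idx = (size_t)(xorshift64() % cachelines);
    for (size_t j = 0; j < percache; j++) {
      counter += array[16 * idx + j];
    }
  }
  return counter;
}
```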

So, how sensitive is the running time to the parameter percache? A simple theory is that if the array is large enough, it does not matter whether you access one, two, four or sixteen values per cache line.

Let us run an experiment to see. I generate an 8 GB array, which far exceeds my CPU cache. Then I vary the number of integers accessed per cache line, and I sum them all up.

Integers per cache line   Cycles per cache line
1                         38
4                         60
8                         70
16                        110

So even though I am randomly accessing cache lines, out of cache, whether I access one, eight, or sixteen 32-bit values per cache line matters a great deal.

To make things interesting, I can try to accelerate these operations by using SIMD instructions. SIMD instructions can do many additions at once. Let us see…

Integers per cache line   Cycles per cache line
8                         35
16                        41

Wow! So hand-tuned vectorized code helped this problem a lot, even though I am randomly accessing cache lines from a large array.

Let me twist the problem around. After accessing each cache line, and updating my counter by summing up one, two, four, eight or sixteen values, I am going to change the state of my random number generator in a way that depends on the newly computed counter. This makes the problem harder for my computer because, before it can fetch the next cache line, it absolutely must wait for the computation of the counter to be finished.
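The feedback can be sketched as follows (a sketch, not the exact benchmark code): after each cache line, the generator state absorbs the counter, so the next address cannot be computed until the sum is done.

```c
#include <stddef.h>
#include <stdint.h>

static uint64_t feedback_state = 88172645463325252ULL;
static uint64_t feedback_random(void) {
  feedback_state ^= feedback_state << 13;
  feedback_state ^= feedback_state >> 7;
  feedback_state ^= feedback_state << 17;
  return feedback_state;
}

static uint32_t dependent_sum(const uint32_t *array, size_t cachelines,
                              size_t howmany, size_t percache) {
  uint32_t counter = 0;
  for (size_t i = 0; i < howmany; i++) {
    size_t idx = (size_t)(feedback_random() % cachelines);
    for (size_t j = 0; j < percache; j++) {
      counter += array[16 * idx + j];
    }
    /* the twist: the next random address now depends on the counter,
       so the loads are serialized */
    feedback_state ^= counter;
  }
  return counter;
}
```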

Integers per cache line   Cycles per cache line
1                         300
4                         310
8                         320
16                        330

As we can see, the problem is about an order of magnitude harder. However, even then, whether I access one or sixteen values per cache line matters: the difference is about 10%. It is not negligible.

Thus it might be a lot harder than you expect to be “memory bound”. Counting the number of cache lines accessed is not a sane way to measure the efficiency of an out-of-cache algorithm.

My code is available.

Science and Technology links (July 27th, 2018)

  1. It is frequently argued that intelligence requires great complexity. But complex compared to what? To put things in perspective, my friend Leonid points out that tiny worms have simple brains and yet they can compute the solution to several problems:

    C elegans [a common worm] has less than 500 neural cells, but it has basic sensory system and muscle control! It can reproduce and mate.

  2. Almost everywhere in the world, people spend more time and money in schools. Sadly this does not obviously translate into smarter people.

    (…) if we want to increase education, it’s not enough to send kids to school. It’s important that attending school also translates into acquiring specific skills.

  3. Stronger men might be less fertile.
  4. In 2016, the fertility rate in the United States was the lowest it has ever been. It was 1,765 births per 1,000 women, below the replacement rate of 2,100.
  5. Japanese smoke about 50% more cigarettes than Americans. Yet Americans have almost twice as many lung cancers. Approximately half of the supercentenarians (110 years+) in Okinawa have a history of smoking.
  6. The death rate from Alzheimer’s is more than 5x lower in Japan compared with the United States.
  7. Cannabis has no apparent carcinogenic effect. Yet a common joint has 4 times more cancer causing material than a single cigarette.
  8. A drug that enhances garbage disposal in the brain made old mice smarter.
  9. There is a large underground lake on Mars.
  10. Some worms can eat and digest plastic; presumably, this includes plastic straws.

Writing about software and building software are different jobs

Code taken from a blog post is meant to illustrate an idea. Blogging is literature, not engineering. Don’t build production systems by copying and pasting random code from the Internet. It will not end well.

A reader commented that some of my C code from an earlier blog post is not proper C:

uint32_t random_bounded(uint32_t range) {
  uint64_t random32bit =  random32(); //32-bit random number 
  multiresult = random32bit * range;
  return multiresult >> 32;
}

This is true: the type of the “multiresult” variable is not specified. In that post, I deliberately omitted the type because anyone familiar with C can correctly infer it. In fact, a large fraction of the code you will find on my blog is similarly simplified. I often remove special cases, I roll my loops and so forth. That’s because I hope to make it more readable. Real code is often filled with boring details.

It is not that I don’t believe in providing working code. I actually think it is super important and I do it systematically. For example, the post points to the source code in correct C. This blog post is accompanied by a research paper. The research paper points to a GitHub repository that provides reproducible experiments. This code was even peer-reviewed along with the research paper.

However, when someone explains a concept and provides a code sample, I take that as literature, not engineering. You are meant to read the code, not use it as a ready-made reusable artefact.

Simplification is not the same as carelessness. Simplified is not the same as wrong. You should report typographical errors and the like, but if you take code that served to illustrate an idea and copy it mindlessly into your application, you have missed a step. You are meant to read the code, not paste it into your production code.

It is more complicated than I thought: -mtune, -march in GCC

My favourite C compilers are GNU GCC and LLVM’s Clang. In C, you compile for some architecture. Thus you have to tell the compiler what kind of machine you have.

In theory, you could recompile all the code for the exact machine you have, but it is slow and error prone. So we rely on prebuilt binaries, typically, and these are often not tuned specifically for our hardware. It is possible for a program to detect the hardware it is running on and automatically adapt (e.g., via runtime dispatch) but that is not often done in C.
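For instance, GCC and Clang expose the __builtin_cpu_supports builtin on x86, which makes such runtime dispatch possible; here is a minimal sketch (the kernel names are made up):

```c
/* choose a code path based on the CPU we are actually running on;
   __builtin_cpu_supports is a GCC/Clang builtin available on x86 */
static const char *select_kernel(void) {
  if (__builtin_cpu_supports("avx2")) {
    return "avx2";
  }
  if (__builtin_cpu_supports("sse2")) {
    return "sse2";
  }
  return "scalar";
}
```

A real library would point a function pointer at the selected implementation once, at startup, and pay no further dispatch cost.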

So GCC and Clang have flags that allow you to tell them what kind of hardware you have. They are “-march” and “-mtune”. I thought I understood them, until now.

Let me run through the basics of “-march” and “-mtune” that most experienced programmers will know about.

  • The first flag (“-march”) tells the compiler about the minimal hardware your code should run on. That is, if you write “-march=haswell”, then your code should run on machines that have haswell-type processors and anything better or compatible (anything that has the same instruction sets). They may not run on other machines. The GCC documentation is clear:

    -march=cpu-type allows GCC to generate code that may not run at all on processors other than the one indicated.

  • The other flag (“-mtune”) is just an optimization hint, e.g., if you write “-mtune=haswell”, you tell the compiler to generate code that runs best on “haswell”-type processors. The GCC documentation is clear enough:

    While picking a specific cpu-type schedules things appropriately for that particular chip, the compiler does not generate any code that cannot run on the default machine type unless you use a -march=cpu-type option. For example, if GCC is configured for i686-pc-linux-gnu then -mtune=pentium4 generates code that is tuned for Pentium 4 but still runs on i686 machines.

    By default, when unspecified, “-mtune=generic” applies which means that the compiler will “produce code optimized for the most common processors”. This is somewhat ambiguous and will strictly depend on the compiler version you are using, as new processors being released might change this tuning.

Thankfully, your compiler can automatically detect your processor; it calls this automatically detected processor “native”. So I have been compiling my code with “-march=native” because I want the compiler to do the best it can do on the machine I am using. I assumed, until now, that if my processor is detected as having architecture X, doing “-march=native” implied “-march=X -mtune=X”. And that could almost be inferred from the documentation:

Specifying -march=cpu-type implies -mtune=cpu-type.

This has led me to believe that “-march” trumps “-mtune”, meaning that if you set “-march=native”, then the “-mtune” is effectively irrelevant.

I was wrong.

Let us check using funny command lines. I use a skylake processor with GNU GCC 5.5. It is important to note that this compiler predates skylake processors.

  1. I can type gcc -march=native -Q --help=target | grep -- '-march=' | cut -f3 to check which processor is automatically detected. On my favourite machine, I get “broadwell”. That is slightly wrong, but close enough given that the compiler does not know about skylake processors.
  2. One reading of the documentation is that “-march=native” implies “-mtune=native”, so let us check. I type gcc -march=native -Q --help=target | grep -- '-mtune=' | cut -f3 and I get “generic”. Ah! The compiler has detected “broadwell” but it is not tuning for “broadwell” or for “native”, rather it is tuning for “generic”.
  3. What if instead of “-march=native”, I type “-march=broadwell”? Surely it should make no difference? I type gcc -march=broadwell -Q --help=target | grep -- '-mtune=' | cut -f3 and I get “broadwell”. So even if you have a broadwell processor that gets recognized as such, the flags “-march=native” and “-march=broadwell” differ: they affect the tuning differently.

Let me repeat this: if you have a skylake processor that gets recognized as a broadwell processor, then “-march=broadwell” and “-march=native” are different flags having a different effect on your code.

What you care about is whether it produces different binaries. Does it? Unfortunately yes, it does. See my code sample.

Does it matter in practice, as far as performance goes? Probably not in actual systems, but if you are doing microbenchmarking, studying a specific function, small differences might matter.

I will keep using “-march=native” as it is the expedient approach, but I would really like to know how to best tune specifically for my hardware without having to do messy command-line Kung Fu.

Credit: The example and the key observation are due to Travis Downs.

Are vectorized random number generators actually useful?

Our processors benefit from “SIMD” instructions. These instructions can operate on several values at once, thus greatly accelerating some algorithms. Earlier, I reported that you can multiply the speed of common (fast) random number generators such as PCG and xorshift128+ by a factor of three or four by vectorizing them using SIMD instructions.

A reader challenged me: is this actually useful in practice?

There are problems where the generation of random numbers is critical to the performance. That is the case in many simulations, for example. The simplest and best known one is probably the random shuffling of arrays. The standard algorithm is quite simple:

for (i = size; i > 1; i--) {
  var j = random_number(i); // j in [0, i)
  switch_values(array, i - 1, j);
}
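Written out as self-contained C, the shuffle might look as follows; this is a sketch where a simple xorshift generator stands in for O’Neill’s PCG32, and the bounded random value is computed with the multiply-shift trick:

```c
#include <stddef.h>
#include <stdint.h>

/* simple xorshift generator standing in for PCG32 */
static uint64_t shuffle_state = 0x853c49e6748fea9bULL;
static uint32_t random32(void) {
  shuffle_state ^= shuffle_state << 13;
  shuffle_state ^= shuffle_state >> 7;
  shuffle_state ^= shuffle_state << 17;
  return (uint32_t)(shuffle_state >> 32);
}

/* map a 32-bit random value to [0, range) via multiply-shift */
static uint32_t random_bounded(uint32_t range) {
  return (uint32_t)(((uint64_t)random32() * range) >> 32);
}

/* Fisher-Yates shuffle */
static void shuffle(uint32_t *array, size_t size) {
  for (size_t i = size; i > 1; i--) {
    size_t j = random_bounded((uint32_t)i); /* j in [0, i) */
    uint32_t tmp = array[i - 1];
    array[i - 1] = array[j];
    array[j] = tmp;
  }
}
```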

If you are interested, O’Neill wrote a whole blog post on this specific problem.

So can I accelerate the shuffling of an array using SIMD instructions?

So I threw together a vectorized shuffle and a regular (scalar) shuffle, both of them using O’Neill’s PCG32. The net result? I almost double the speed using SIMD instructions when the array is in cache:

SIMD shuffle     3.5 cycles per entry
scalar shuffle   6.6 cycles per entry

My code is available. I do not do anything sophisticated so I expect it is possible to do a lot better. My sole goal was to demonstrate that SIMD random number generators are practical.