The dangers of AVX-512 throttling: myth or reality?

Modern processors use many tricks to go faster. They are superscalar which means that they can execute many instructions at once. They are multicore, which means that each CPU is made of several baby processors that are partially independent. And they are vectorized, which means that they have instructions that can operate over wide registers (spanning 128, 256 or even 512 bits).

Regarding vectorization, Intel is currently ahead of the curve with its AVX-512 instruction sets. They have the only commodity-level processors able to work over 512-bit registers. AMD is barely doing 256 bits and your phone is limited to 128 bits.

The more you use your CPU, the more heat it produces. Intel does not want your CPU to burn out. So it throttles the CPU (makes it run slower). Your CPU stays warm but not too hot. When the processor does not have to use AVX-512 instructions, some of the silicon remains dark, and thus less heat is generated.

The cores that are executing SIMD instructions may run at a lower frequency. Thankfully, Intel processors manage voltage and frequency on a per-core basis. And they can switch frequency very quickly, certainly within a millisecond.

Vlad Krasnov from Cloudflare wrote a blog post last year warning us against AVX-512 throttling:

If you do not require AVX-512 for some specific high performance tasks, I suggest you disable AVX-512 execution on your server or desktop, to avoid accidental AVX-512 throttling.

I am sure that it is the case that AVX-512 can cause problems for some use cases. It is also the case that some people die if you give them aspirin; yet we don’t retire aspirin.

Should we really disable AVX-512 as a precautionary stance?

In an earlier blog post, I tried to measure this throttling on a server I own but found no effect whatsoever. It could be that this server does not have AVX-512 throttling or maybe I did not make use of sufficiently expensive instructions.

Vlad offered me a test case in C. His test case involves AVX-512 multiplications, while much of the running time is spent on some bubble-sort routine. It can run in both AVX-512 mode and in the regular (non-AVX-512) mode. To be clear, it is not meant to be a demonstration in favour of AVX-512: it is meant to show that AVX-512 can be detrimental.

I did not want to run my tests using my own server this time. So I went to Packet and launched a powerful two-CPU Xeon Gold server (Intel Xeon Gold 5120 CPU @ 2.20GHz). Each of these processors has 14 cores, so we have 28 cores in total. Because of hyperthreading, it supports up to 56 hardware threads (two threads per core).

threads    AVX-512 disabled    with AVX-512
20         8.4 s               7.2 s
40         6 s                 5.0 s
80         5.7 s               4.7 s

In an earlier test, I was using an older compiler (GCC 5) and when exceeding the number of physically supported threads (56), I was getting poorer results, especially with AVX-512. I am not sure why that would be but I suspect it is not related to throttling; it might have to do with context switching and register initialization (though that is speculation on my part).

In my latest test, I use a more recent compiler (GCC 7). As you can see, the AVX-512 version is always faster. Otherwise, I see no negative effect from the application of AVX-512. If there is throttling, it appears that the benefits of AVX-512 offset it.

My code is available along with all the scripts and the outputs for your inspection. You should be able to reproduce my results. It is not like Xeon Gold processors are magical faeries: anyone can grab an instance. For the record, the bill I got from Packet was $2.

Note: I have no conflict of interest to disclose. I do not own Intel stock.

Science and Technology links (August 10th, 2018)

  1. There are far fewer forest fires now than there were 15 years ago.
  2. The Earth is getting greener. There are more forests:

    We show that—contrary to the prevailing view that forest area has declined globally—tree cover has increased by 2.24 million km2 (+7.1% relative to the 1982 level). Global bare ground cover has decreased by 1.16 million km2 (−3.1%), most notably in agricultural regions in Asia.

  3. You can buy a fast 256GB memory card for barely more than $100. This is more than enough to record everything you hear during two years. It is enough to record everything you see for a week.
  4. The widespread usage of anti-cold medication helps to spread infections. If you want to be generous toward others, you should tough out your cold symptoms.
  5. I am on an all-bacon diet: Sugars (carbohydrates) increase human mortality, whereas fat reduces it.
  6. Workplace wellness programs are supposed to promote health among employees and, eventually, to improve productivity. In practice, they are useless at improving health. Yet they can help employers find out which employees are health-conscious. Thus, as an employee, you probably want to participate to signal your good health.
  7. Fewer than 15% of psychology researchers are willing to share their data with a data curation site (with or without restriction). It seems that many of these researchers are unconcerned by the fact that these records will surely go missing once they retire or die.
  8. Governments typically do not publish the results of the clinical trials they hold. A landmark decision forces the Canadian government to publish a specific set of results. There is hope that more could come in the future.
  9. Belly fat is associated with reduced cognitive function in older adults.
  10. If you are a woman, you have better odds if your doctor is a woman.
  11. High blood sugar decreases your odds if you have cancer.
  12. Why do we have 23 pairs of chromosomes and not 24 or 12, or just one chromosome? We do not know. Two teams of scientists changed the genome of yeast so it has just one chromosome, and it is almost fine.
  13. Ford Motors is trying out exoskeletons with its line employees.
  14. OpenAI defeated a team of great Dota 2 players. Dota 2 is one of the top competitive e-games around. Defeating a team of human players is a high bar.
  15. High intelligence is associated with low social conservatism and high economic conservatism. That is, smart people don’t care that their kids’ teacher is a trans-gender lesbian who smokes pot, but they favour capitalism and small governments. Oddly enough, it is hard to find political parties supporting such a mix of values.
  16. New motor and other skills can be acquired at any age.
  17. According to an article in Nature, climate change policies might increase food insecurity, more so than climate change itself.
  18. Genetically modified silkworms produce spider silk. Assuming you want lots of spider silk, that’s probably cheaper than raising spiders. I know at least one superhero who will be happy about that.
  19. Governments have few technology workers and are typically behind the times. This has apparently nothing to do with low salaries and is all down to bad management:

    After exploiting within-person variation—that is, comparing earnings for a given individual before versus after they make a switch into or out of the public sector—I find that public sector workers earn a 3.9% annual earnings premium. Moreover, they report a greater incidence and quality of non-wage benefits, which implies that their overall compensation premium is even larger. Second, public sector workers tend to report much lower levels of work-place practices, including fewer career opportunities, less intellectually stimulating work, and less responsibility and scope in their work.

    To put it bluntly, government jobs pay well, but they are boring. Young talented hackers would rather go do something fun.

  20. A robot can find Waldo. I bet that this robot was not built by a government employee.

Science and Technology links (August 4th, 2018)

  1. About 80% of the ocean remains unmapped and unexplored.
  2. Even a mild concussion (a fall on your head) can double your risk of dementia (e.g., Alzheimer’s).
  3. Apple is selling approximately two smart watches for each laptop. And they are selling about three iPads for each laptop. (I do not own an Apple watch: I don’t know what I’d use it for.)
  4. Researchers from OpenAI have trained a human-like robot hand to manipulate objects like we would.
  5. Diabetes is not good for your brain:

    Compared with participants with normoglycaemia, the multivariable-adjusted rate of global cognitive decline associated with prediabetes and diabetes was increased (…) Similarly, memory, executive function and orientation z scores showed an increased rate of cognitive decline with diabetes.

  6. We have all been told to eat whole grain cereals. Do they keep the doctor away? Maybe not:

    There is insufficient evidence from randomised controlled trials to date to recommend consumption of whole grain diets to reduce the risk of cardiovascular disease, or lower blood cholesterol, or blood pressure.

  7. We are using much less land than we used to, which means that there is more land for forests and wild animals. On a per capita basis, people in 2014 used about a third as much land as people did in 1961.
  8. In the USA, energy use per capita has fallen by 20% since a peak in the late 1970s. In fact, in most occidental countries, energy use per capita has been falling for the last decade. Distance driven by car, per capita, is also either plateauing or on the decline.

Getting 4 bytes or a full cache line: same speed or not?

Many software operations are believed to be “memory bound”, meaning that the processor spins empty while waiting for data to come from memory. To compensate, our processors rely on fast cache memory. If your problem fits in cache, it will typically run much faster than if the processor constantly needs to query the memory subsystem.

If you need to retrieve a block of data, the processor does not retrieve just the necessary bytes. It retrieves data in units of a “cache line” which is typically 64 bytes on Intel processors. So the first 64 bytes of memory addresses are on the first cache line, the next 64 bytes are on the second cache line, and so forth.
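As a small sketch of this mapping (the function names are mine, not a standard API), dividing a byte address by the 64-byte line size gives the cache line index, and the remainder gives the offset within the line:

```c
#include <stdint.h>

// Sketch: map a byte address to its 64-byte cache line and the offset
// within that line, assuming 64-byte lines as on Intel processors.
// These helper names are mine, for illustration only.
static inline uintptr_t cache_line_index(uintptr_t addr) {
  return addr / 64;
}

static inline uintptr_t cache_line_offset(uintptr_t addr) {
  return addr % 64;
}
```

For example, addresses 0 and 63 share a cache line, while address 64 starts the next one.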

To be clear, it means that you should not expect it to be faster to retrieve only 8 bytes of data rather than 64 bytes of data, assuming that all the data in question lies in a single cache line. That is what I expect most programmers to tell you.

Let me state this as a “law”:

Given a large out-of-cache memory block, the time (latency) required to access 4, 8, 16 or 64 bytes of data from a cache line is the same.

How well does this mental model work in the real world?

Suppose that you build a large array of 32-bit integers. There are sixteen 32-bit integers per cache line. The array is large enough to exceed your cache by a wide margin. Then you repeatedly select a cache line at random and sum up a few of the 32-bit integers in that cache line. Maybe you just pick one integer per cache line, maybe you pick four, maybe eight, maybe sixteen. Your code might look as follows, where percache is the number of integers you access per cache line:

uint32_t counter = 0;
for(size_t i = 0; i < howmany; i++) {
    size_t idx = pick_address_of_random_cache_line();
    // there are 16 uint32_t per 64-byte cache line... 
    for(size_t j = 0; j < percache; j++) {
       counter += array[16 * idx + j];
    }
}
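The function pick_address_of_random_cache_line is left unspecified above. One minimal way to implement it (my choice of generator for illustration; the actual benchmark may use a different one) is with a xorshift generator reduced modulo the number of cache lines:

```c
#include <stddef.h>
#include <stdint.h>

// A possible implementation of the random cache-line picker used above.
// The xorshift generator and the global state are my own choices for
// illustration; the benchmark code may differ.
static size_t number_of_cache_lines = 1; // set by the benchmark setup
static uint64_t rng_state = 88172645463325252ULL;

static uint64_t xorshift64(void) {
  rng_state ^= rng_state << 13;
  rng_state ^= rng_state >> 7;
  rng_state ^= rng_state << 17;
  return rng_state;
}

static size_t pick_address_of_random_cache_line(void) {
  return (size_t)(xorshift64() % number_of_cache_lines);
}
```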

So, how sensitive is the running time to the parameter percache? A simple theory is that if the array is large enough, it does not matter whether you access one, two, four or sixteen values per cache line.

Let us run an experiment to see. I generate an 8 GB array; it far exceeds my CPU cache. Then I vary the number of integers accessed per cache line, and I sum them all up.

Integers per cache line    Cycles per cache line
1                          38
4                          60
8                          70
16                         110

So even though I am randomly accessing cache lines, out of cache, whether I access one, eight, or sixteen 32-bit values per cache line matters a great deal.

To make things interesting, I can try to accelerate these operations by using SIMD instructions. SIMD instructions can do many additions at once. Let us see…

Integers per cache line    Cycles per cache line
8                          35
16                         41

Wow! So hand-tuned vectorized code helped this problem a lot, even though I am randomly accessing cache lines from a large array.

Let me twist the problem around. After accessing each cache line, and updating my counter by summing up one, two, four, eight or sixteen values, I am going to change the state of my random number generator in a way that depends on the newly computed counter. This makes the problem harder for my computer because, before it can fetch the next cache line, it absolutely must wait for the computation of the counter to be finished.

Integers per cache line    Cycles per cache line
1                          300
4                          310
8                          320
16                         330

As we can see, the problem is about an order of magnitude harder. However, even then, whether I access one or sixteen values per cache line matters: the difference is about 10%. It is not negligible.
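In code, the twist amounts to feeding the counter back into the generator state, so the next index cannot be computed (and the next line cannot be fetched) until the sum is done. Here is a sketch with my own xorshift generator; the original benchmark may differ in its details:

```c
#include <stddef.h>
#include <stdint.h>

// Sketch of the dependent version: the next random index depends on the
// counter we just computed, so the processor cannot start fetching the
// next cache line before the current sum is finished. The generator and
// function names are mine, for illustration only.
static uint32_t sum_with_dependency(const uint32_t *array,
                                    size_t number_of_lines,
                                    size_t percache, size_t howmany) {
  uint64_t state = 12345;
  uint32_t counter = 0;
  for (size_t i = 0; i < howmany; i++) {
    state ^= state << 13;
    state ^= state >> 7;
    state ^= state << 17;
    size_t idx = (size_t)(state % number_of_lines);
    for (size_t j = 0; j < percache; j++) {
      counter += array[16 * idx + j]; // 16 uint32_t per cache line
    }
    state += counter; // the dependency: the next index waits on the sum
  }
  return counter;
}
```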

Thus it might be a lot harder than you expect to be “memory bound”. Counting the number of cache lines accessed is not a sane way to measure the efficiency of an out-of-cache algorithm.

My code is available.

Science and Technology links (July 27th, 2018)

  1. It is frequently argued that intelligence requires great complexity. But complex compared to what? To put things in perspective, my friend Leonid points out that tiny worms have simple brains and yet they can compute the solution to several problems:

    C elegans [a common worm] has less than 500 neural cells, but it has basic sensory system and muscle control! It can reproduce and mate.

  2. Almost everywhere in the world, people spend more time and money in schools. Sadly this does not obviously translate into smarter people.

    (…) if we want to increase education, it’s not enough to send kids to school. It’s important that attending school also translates into acquiring specific skills.

  3. Stronger men might be less fertile.
  4. In 2016, the fertility rate in the United States was the lowest it has ever been. It was 1,765 births per 1,000 women, below the replacement rate of 2,100.
  5. Japanese smoke about 50% more cigarettes than Americans. Yet Americans have almost twice as many lung cancers. Approximately half of the supercentenarians (110 years+) in Okinawa have a history of smoking.
  6. The death rate from Alzheimer’s is more than 5x lower in Japan compared with the United States.
  7. Cannabis has no apparent carcinogenic effect. Yet a common joint has 4 times more cancer causing material than a single cigarette.
  8. A drug that enhances garbage disposal in the brain made old mice smarter.
  9. There is a large underground lake on Mars.
  10. Some worms can eat and digest plastic; presumably, this includes plastic straws.

Writing about software and building software are different jobs

Code taken from a blog post is meant to illustrate an idea. Blogging is literature, not engineering. Don’t build production systems by copying and pasting random code from the Internet. It will not end well.

A reader commented that some of my C code from an earlier blog post is not proper C:

uint32_t random_bounded(uint32_t range) {
  uint64_t random32bit =  random32(); //32-bit random number 
  multiresult = random32bit * range;
  return multiresult >> 32;
}

This is true: the type of the “multiresult” variable is not specified. In that post, I deliberately omitted the type because anyone familiar with C can correctly infer it. In fact, a large fraction of the code you will find on my blog is similarly simplified. I often remove special cases, I roll my loops and so forth. That is because I hope to make the code more readable. Real code is often filled with boring details.
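For completeness, a fully specified version simply declares multiresult as a 64-bit integer. (The random32 implementation below is a xorshift stand-in of my own; the original post uses its own generator.)

```c
#include <stdint.h>

// Fully specified version of the snippet above. random32 is a stand-in
// generator (xorshift32) used only so that this sketch is self-contained.
static uint32_t rng32_state = 2463534242u;

static uint32_t random32(void) {
  rng32_state ^= rng32_state << 13;
  rng32_state ^= rng32_state >> 17;
  rng32_state ^= rng32_state << 5;
  return rng32_state;
}

uint32_t random_bounded(uint32_t range) {
  uint64_t random32bit = random32(); // 32-bit random number
  uint64_t multiresult = random32bit * range;
  return (uint32_t)(multiresult >> 32); // uniform-ish value in [0, range)
}
```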

It is not that I don’t believe in providing working code. I actually think it is super important and I do it systematically. For example, the post points to the source code in correct C. This blog post is accompanied by a research paper. The research paper points to a GitHub repository that provides reproducible experiments. This code was even peer-reviewed along with the research paper.

However, when someone explains a concept and provides a code sample, I take that as literature, not engineering. You are meant to read the code, not use it as a ready-made reusable artefact.

Simplification is not the same as carelessness. Simplified is not the same as wrong. You should report typographical errors and the like, but if you take code that served to illustrate an idea and copy it mindlessly into your application, you have missed a step. You are meant to read the code, not paste it into your production system.

It is more complicated than I thought: -mtune, -march in GCC

My favourite C compilers are GNU GCC and LLVM’s Clang. In C, you compile for some architecture. Thus you have to tell the compiler what kind of machine you have.

In theory, you could recompile all the code for the exact machine you have, but it is slow and error prone. So we rely on prebuilt binaries, typically, and these are often not tuned specifically for our hardware. It is possible for a program to detect the hardware it is running on and automatically adapt (e.g., via runtime dispatch) but that is not often done in C.
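As an illustration of that runtime dispatch, GCC and Clang offer the builtin __builtin_cpu_supports on x86; a program can branch on the detected CPU features. The function names below are mine, and the "fast" path is only a stand-in for a real vectorized implementation:

```c
#include <stddef.h>

// Sketch of runtime dispatch in C: select an implementation based on
// the CPU detected at run time. A real AVX2 path would use intrinsics;
// here the scalar function stands in for both paths.
static int sum_scalar(const int *data, size_t n) {
  int total = 0;
  for (size_t i = 0; i < n; i++) total += data[i];
  return total;
}

static int cpu_has_avx2(void) {
#if defined(__x86_64__) || defined(__i386__)
  return __builtin_cpu_supports("avx2"); // GCC/Clang builtin, x86 only
#else
  return 0; // not an x86 processor: no AVX2
#endif
}

int sum(const int *data, size_t n) {
  if (cpu_has_avx2()) {
    // dispatch to a vectorized implementation here; we fall back to the
    // scalar one because this is only a sketch of the pattern
    return sum_scalar(data, n);
  }
  return sum_scalar(data, n);
}
```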

So GCC and Clang have flags that allow you to tell them what kind of hardware you have. They are “-march” and “-mtune”. I thought I understood them, until now.

Let me run through the basics of “-march” and “-mtune” that most experienced programmers will know about.

  • The first flag (“-march”) tells the compiler about the minimal hardware your code should run on. That is, if you write “-march=haswell”, then your code should run on machines that have haswell-type processors and anything better or compatible (anything that has the same instruction sets). They may not run on other machines. The GCC documentation is clear:

    -march=cpu-type allows GCC to generate code that may not run at all on processors other than the one indicated.

  • The other flag (“-mtune”) is just an optimization hint, e.g., if you write “-mtune=haswell”, you tell the compiler to generate code that runs best on “haswell”-type processors. The GCC documentation is clear enough:

    While picking a specific cpu-type schedules things appropriately for that particular chip, the compiler does not generate any code that cannot run on the default machine type unless you use a -march=cpu-type option. For example, if GCC is configured for i686-pc-linux-gnu then -mtune=pentium4 generates code that is tuned for Pentium 4 but still runs on i686 machines.

    By default, when unspecified, “-mtune=generic” applies which means that the compiler will “produce code optimized for the most common processors”. This is somewhat ambiguous and will strictly depend on the compiler version you are using, as new processors being released might change this tuning.

Thankfully, your compiler can automatically detect your processor; it calls this automatically detected processor “native”. So I have been compiling my code with “-march=native” because I want the compiler to do the best it can do on the machine I am using. I assumed, until now, that if my processor is detected as having architecture X, doing “-march=native” implied “-march=X -mtune=X”. And that could almost be inferred from the documentation:

Specifying -march=cpu-type implies -mtune=cpu-type.

This has led me to believe that “-march” trumps “-mtune”, meaning that if you set “-march=native”, then “-mtune” is effectively irrelevant.

I was wrong.

Let us check using funny command lines. I use a skylake processor with GNU GCC 5.5. It is important to note that this compiler predates skylake processors.

  1. I can type gcc -march=native -Q --help=target | grep -- '-march=' | cut -f3 to check which processor is automatically detected. On my favourite machine, I get “broadwell”. That is slightly wrong, but close enough given that the compiler does not know about skylake processors.
  2. One reading of the documentation is that “-march=native” implies “-mtune=native”, so let us check. I type gcc -march=native -Q --help=target | grep -- '-mtune=' | cut -f3 and I get “generic”. Ah! The compiler has detected “broadwell” but it is not tuning for “broadwell” or for “native”, rather it is tuning for “generic”.
  3. What if instead of “-march=native”, I type “-march=broadwell”. Surely it should make no difference? I type gcc -march=broadwell -Q --help=target | grep -- '-mtune=' | cut -f3 and I get “broadwell”. So even if you have a broadwell processor that gets recognized as such, the flags “-march=native” and “-march=broadwell” differ in the sense that they impact differently the tuning.

Let me repeat this: if you have a skylake processor that gets recognized as a broadwell processor, then “-march=broadwell” and “-march=native” are different flags having a different effect on your code.

What you care about is whether it produces different binaries. Does it? Unfortunately yes, it does. See my code sample.

Does it matter in practice, as far as performance goes? Probably not in actual systems, but if you are doing microbenchmarking, studying a specific function, small differences might matter.

I will keep using “-march=native” as it is the expedient approach, but I would really like to know how to best tune specifically for my hardware without having to do messy command-line Kung Fu.

Credit: The example and the key observation are due to Travis Downs.

Are vectorized random number generators actually useful?

Our processors benefit from “SIMD” instructions. These instructions can operate on several values at once, thus greatly accelerating some algorithms. Earlier, I reported that you can multiply the speed of common (fast) random number generators such as PCG and xorshift128+ by a factor of three or four by vectorizing them using SIMD instructions.

A reader challenged me: is this actually useful in practice?

There are problems where the generation of random numbers is critical to the performance. That is the case in many simulations, for example. The simplest and best known is probably the random shuffling of arrays. The standard algorithm is quite simple:

for (i = size; i > 1; i--) {
  size_t j = random_number(i); // j in [0, i)
  switch_values(array, i - 1, j);
}
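A self-contained version of this shuffle might look as follows. The generator is a simple xorshift stand-in (the post's benchmark uses O’Neill’s PCG32), and random_bounded maps a 32-bit random word into [0, range) with a multiply-and-shift:

```c
#include <stddef.h>
#include <stdint.h>

// Self-contained Fisher-Yates shuffle in the spirit of the loop above.
// The xorshift generator is a stand-in for PCG32, for illustration only.
static uint32_t shuffle_rng = 123456789u;

static uint32_t random32(void) {
  shuffle_rng ^= shuffle_rng << 13;
  shuffle_rng ^= shuffle_rng >> 17;
  shuffle_rng ^= shuffle_rng << 5;
  return shuffle_rng;
}

static uint32_t random_bounded(uint32_t range) {
  // multiply-and-shift: maps a 32-bit word into [0, range)
  return (uint32_t)(((uint64_t)random32() * range) >> 32);
}

void shuffle(uint32_t *array, size_t size) {
  for (size_t i = size; i > 1; i--) {
    size_t j = random_bounded((uint32_t)i); // j in [0, i)
    uint32_t tmp = array[i - 1];
    array[i - 1] = array[j];
    array[j] = tmp;
  }
}
```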

If you are interested, O’Neill wrote a whole blog post on this specific problem.

So can I accelerate the shuffling of an array using SIMD instructions?

I threw together a vectorized shuffle and a regular (scalar) shuffle, both of them using O’Neill’s PCG32. The net result? I almost double the speed using SIMD instructions when the array is in cache:

SIMD shuffle      3.5 cycles per entry
scalar shuffle    6.6 cycles per entry

My code is available. I do not do anything sophisticated so I expect it is possible to do a lot better. My sole goal was to demonstrate that SIMD random number generators are practical.

Science and Technology links (July 21st, 2018)

  1. It seems that something called “Cartesian Genetic Programming” could be a match for Deep Learning.
  2. If you look at the best paid individuals in most fields, most of them are men. Why is that? There are many possible explanations. One of them is that men take more risks. If you hope to ever get larger rewards, it helps to take large risks.

    It is undeniable, I think, that boys take more physical risks than girls. But do men really take more risks than women in their careers?

    Biologists tend to tell us that females are more risk averse. The usual story is that a female’s ability to reproduce is mostly limited by the availability of food and shelter, whereas a male’s ability to reproduce has far more to do with its ability to impress females, or to impose its will. We can test it out by entering animals in games. For example, Orsini et al. investigated the issue in rats:

    Consistent with findings in human subjects, females were more risk averse, choosing the large, risky reward significantly less than males. This effect was not due to differences in shock reactivity or body weight, and risk-taking in females was not modulated by estrous phase. Systemic amphetamine administration decreased risk-taking in both males and females; however, females exhibited greater sensitivity to amphetamine, suggesting that dopaminergic signaling may partially account for sex differences in risk-taking.

    Social scientists have another, different take. For example, Nelson, an economist, finds the whole question metaphysical:

    The statement “women are more risk averse than men” is fundamentally a metaphysical assertion about unobservable essences or characteristics, and therefore cannot be empirically proven or disproven. A review of the empirical literature, with attention paid to the misleading nature of generic beliefs and statements, the proper interpretation of statistical results, and the quantitative magnitudes of detectable differences and similarities, sheds doubt on whether statements such as these should have any place in an empirical science that aspires to objectivity. The widespread acceptance of such statements appears to perhaps be rooted more in confirmation bias than in reality.

  3. A honey bee has about one million neurons and a billion synapses. The best smartphones have many more transistors than honey bees have synapses.
  4. Do human beings have “general” intelligence? Kevin Kelly, a well-respected writer, does not think so:

    We like to call our human intelligence “general purpose,” because compared with other kinds of minds we have met, it can solve more types of problems, but as we build more and more synthetic minds we’ll come to realize that human thinking is not general at all. It is only one species of thinking.

  5. Our cells are powered by mitochondria (tiny cells living inside our cells). If your mitochondria cease to function well enough, your cells are as good as dead. But could they be revived? It seems so, at least as far as the heart of babies is concerned: fresh mitochondria can revive flagging cells and enable them to quickly recover.
  6. Each time your cells divide, the end of your chromosomes (the telomeres) shortens. Thus a given human cell can only divide so many times. This sounds terrible, but it is less terrible than it appears because cells have ways to elongate their telomeres when needed (telomerase)… however, most cells will just divide a fixed number of times. At that point, the cell either dies (apoptosis) or, rarely, becomes senescent. However, the cells don’t literally run out of telomeres. So what happens? According to Ly et al. the problem is structural: the telomeres loop around, hiding the end of the chromosome. When the telomeres are too short to effectively hide the end of the chromosome, it is treated as defective by our biochemistry.
  7. Will we continue to get faster computers? Denning and Lewis think so:

    These analyses show that the conditions exist at all three levels of the computing ecosystem to sustain exponential growth. They support the optimism of many engineers that many additional years of exponential growth are likely. Moore’s Law was sustained for five decades. Exponential growth is likely to be sustained for many more.

  8. There is a huge garbage patch in the Pacific ocean. It is huge: about the size of the state of Texas. A large fraction, nearly half of it, is made of fishing gear.
  9. We are making progress against cancer, but it is not enough. In 2018, an estimated 1,735,350 new cases of cancer will be diagnosed in the United States and 609,640 people will die from the disease.
  10. Bill Gates wants to help diagnose Alzheimer’s early. It is believed that by the time you have symptoms of Alzheimer’s, you have already extensive damages. Many researchers believe it could be far easier to prevent these damages than to reverse them.
  11. It is believed that Alzheimer’s might be caused by protein aggregation in the brain. But it is not entirely clear how these proteins cause problems. New research suggests that they induce cellular senescence in the brain. So cell senescence could be a key factor in the emergence of Alzheimer’s.

Accelerating Conway’s Game of Life with SIMD instructions

Conway’s Game of Life is one of the simplest non-trivial simulations one can program. It simulates the emergence of life from chaos. Though the rules are simple, the game of life has been studied for five decades.

The rules are simple. You have a grid where each cell has a single bit of state: it is either alive or dead. During each iteration, we look at the 8 neighbours of each cell and count the number of live neighbours. If a cell is dead but has exactly three live neighbours, it comes alive. If a live cell has more than three or fewer than two live neighbours, it dies. That is all.

Implemented in a straightforward manner, the main loop might look like this…

// newstates starts as a copy of states: updates must not affect
// this iteration's neighbour counts
for (i = 0; i < height; i++) {
    for (j = 0; j < width; j++) {
      bool alive = states[i][j];
      int neighbours = count_live_neighbours(states, i, j);
      if (alive) {
        if (neighbours < 2 || neighbours > 3) {
          newstates[i][j] = false;
        }
      } else {
        if (neighbours == 3) {
          newstates[i][j] = true;
        }
      }
    }
}

However, if you implement it in this manner, it is hard for an optimizing compiler to generate clever code. For a 10,000 by 10,000 grid, my basic C implementation takes 0.5 seconds per iteration.
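Here is a minimal self-contained sketch of that straightforward scalar approach, on a fixed small grid with dead cells beyond the boundary. This is only an illustration, not the benchmarked implementation:

```c
#include <stdbool.h>

// Minimal self-contained step of Conway's Game of Life on a small grid
// with a dead boundary. A sketch of the scalar approach only.
#define W 5
#define H 5

static int count_live_neighbours(bool g[H][W], int i, int j) {
  int count = 0;
  for (int di = -1; di <= 1; di++) {
    for (int dj = -1; dj <= 1; dj++) {
      if (di == 0 && dj == 0) continue; // skip the cell itself
      int ni = i + di, nj = j + dj;
      if (ni >= 0 && ni < H && nj >= 0 && nj < W && g[ni][nj]) count++;
    }
  }
  return count;
}

void life_step(bool in[H][W], bool out[H][W]) {
  for (int i = 0; i < H; i++) {
    for (int j = 0; j < W; j++) {
      int n = count_live_neighbours(in, i, j);
      // a live cell survives with 2 or 3 neighbours;
      // a dead cell with exactly 3 neighbours comes alive
      out[i][j] = in[i][j] ? (n == 2 || n == 3) : (n == 3);
    }
  }
}
```

For instance, a horizontal “blinker” of three live cells becomes a vertical one after a single step.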

So I wondered whether I could rewrite the code in a vectorized manner, using the SIMD instructions available on our commodity processors. My first attempt brought this down to 0.02 seconds per iteration or about 25 times faster. My code is available.

scalar (C)               0.5 s
vectorized (C + SIMD)    0.02 s

I use 32-byte AVX2 intrinsics. I did not profile my code or do any kind of hard work.

Thoughts:

  1. At a glance, I would guess that the limiting factor is the number of “loads”. An x64 processor can, at best, load two registers from memory per cycle, and I have many loads. The arithmetic operations (additions, subtractions) probably come for free. My implementation uses 8 bits per cell whereas a single bit is sufficient; a more concise representation would reduce the number of loads by nearly an order of magnitude. My guess is that, on mainstream CPUs, I am probably between a factor of 5 and 10 away from the optimal implementation; I expect that I am at least a factor of two away from the optimal speed.
  2. The game-of-life problem is very similar to an image processing problem: it is a 3×3 moving/convolution filter. Tricks from image processing can be brought to bear. In particular, the problem is a good fit for GPU processing.
  3. I did not look at existing game-of-life implementations. I was mostly trying to come up with the answer by myself as quickly as possible. My bet would be on GPU implementations beating my implementation by a wide margin (orders of magnitude).

Update: John Regehr points me to Hashlife as a better high-speed reference.