Microbenchmarking calls for idealized conditions

Programmers use software benchmarking to measure the speed of software. We need to distinguish system benchmarking, where one seeks to measure the performance of a whole system, from microbenchmarking, where one seeks to measure the performance of a small, isolated operation.

For example, if you are building a browser, you might want to know how quickly the iPhone 7 can retrieve and display the Wikipedia home page. That’s what I call system benchmarking.

Microbenchmarking is much narrower in scope. Maybe you are interested in how quickly your processor can divide two numbers. Or maybe you want to know how quickly your processor can generate a random number.

Microbenchmarking is almost entirely distinct from system benchmarking, even if it looks superficially the same.

If you are trying to find out how quickly your processor can divide two integers, you want to know that, nothing else. You don’t want to measure how quickly the processor can divide two numbers while loading a website and cleaning out its registry.

Microbenchmarking is a form of science. Think about the physicist trying to measure the speed of light in some medium. It usually calls for idealized conditions. Why would you want to use idealized conditions when you know that they will not occur in the normal course of system operations?

  1. Idealized conditions make it easier to reason about the results. You are much more likely to know about the factors that influence your code if you have far fewer factors.
  2. Idealized conditions are more consistent and reproducible. Real systems are not stationary. Their performance tends to fluctuate. And to reproduce even your own results can be a challenge when working with real systems. A small idealized scenario is much easier to reproduce.

Though microbenchmarking can be used as part of an engineering project, the primary purpose is to gain novel insights.

You might object to idealized conditions… what if you want to know how fast your code would run “in practice”? The problem is that “in practice” is ill-defined. You need to tell us what the “practice” is. That is, you need to describe a system, and then you are back into system benchmarking.

System benchmarking is more typically engineering. You seek to make a system that people care about better. You can and should be able to get reproducible results with system benchmarking, but it is much more costly. For one thing, describing a whole system is much harder than describing a tiny function.

In the type of microbenchmarking I like to do, CPU microbenchmarking, the idealized conditions often take the following form:

  • As much as possible, all the data is readily available to the processor. We often try to keep the data in cache.
  • We make sure that the processor runs at a flat clock frequency. In real systems, processors can run slower or faster depending on the system load. As far as I know, on Intel processors it is flat out impossible to know how many CPU cycles were spent on a given task if the CPU frequency varies. Let me quote Wikipedia on time-stamp counters (TSC):

    Constant TSC behavior ensures that the duration of each clock tick is uniform and makes it possible to use the TSC as a wall clock timer even if the processor core changes frequency. This is the architectural behavior for all later Intel processors.

    Your processor can run faster or slower than the advertised frequency, unless you specifically forbid it.

  • If the code performance depends on branching, then we make sure that the branch predictor has had the chance to see enough data to make sound predictions. An even better option is to avoid branching as much as possible (long loops are ok though).
  • When using a programming language like Java, we make sure that the just-in-time compiler has done its work. Ideally, you’d want to avoid the just-in-time compiler entirely because it is typically not deterministic: you cannot count on it to compile the code the same way each time it runs.
  • You keep memory allocation (including garbage collection) to a minimum.
  • You keep system calls to a minimum.
  • You must benchmark something that lasts longer than ~30 cycles, but you can use loops.
  • You have to examine the assembly output from your compiler to make sure that your compiler is not defeating your benchmark. Optimizing compilers can do all sorts of surprising things. For example, your compiler could figure out that it does not need to compute the trigonometric functions you are calling very accurately, and so it could fall back on a cheaper approximation. If you can’t examine the assembly output, you should be very careful.
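
To make this checklist concrete, here is a minimal sketch of the kind of cycle-counting harness these conditions allow (my sketch, not the exact harness behind the numbers reported below). It assumes an x64 processor with a constant TSC, a pinned clock frequency, and a GCC or Clang compiler; the function being measured is a placeholder.

#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

// Placeholder for the code under test; replace with your own kernel.
__attribute__((noinline)) static void function_under_test(void) {
  static volatile uint64_t sink;
  uint64_t acc = 0;
  for (uint64_t i = 0; i < 1000; i++) acc += i * i;
  sink = acc;
}

int main(void) {
  const int warmup = 10;    // let caches and the branch predictor settle
  const int trials = 2048;  // many short trials; we keep the minimum
  for (int i = 0; i < warmup; i++) function_under_test();
  uint64_t best = UINT64_MAX;
  for (int i = 0; i < trials; i++) {
    unsigned int aux;
    uint64_t start = __rdtscp(&aux);
    function_under_test();
    uint64_t end = __rdtscp(&aux);
    if (end - start < best) best = end - start;
  }
  // With a constant TSC and a pinned frequency, TSC ticks track CPU cycles.
  printf("minimum: %llu ticks per call\n", (unsigned long long)best);
  return 0;
}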

Of course, these are not strict requirements for a good microbenchmark; it is just a set of conditions that I have often found useful.

If you are lucky and can meet the above requirements, then you can run reliable CPU microbenchmarks in microseconds.

What do I mean by reliable? Last week, I reported that standard C code can interleave two 32-bit integers into a 64-bit integer using about 3.6 cycles on a Skylake processor. In my original post, I only ran 5 trials, picked the minimum and divided the total clock time by the number of pairs. This was criticized: according to some, my benchmark did not last long enough to “stabilize” the results; according to others, my test was too short to have good accuracy.

I returned to this Skylake processor and computed a histogram. I ran my short function (which computes 1000 interleaves) 2048 times, and each time I recorded the duration, before dividing by the number of pairs (1000). Running this whole experiment takes less than a second, and only requires a simple script. So let us look at the histogram:

The bulk of the results are in the 3.6 to 3.65 range (72% of all data points). The median is 3.636. As a first approximation, the noise follows a log-normal distribution. You might expect the measurement noise to follow a normal distribution (a bell curve), but normal distributions are uncommon when measuring computer performance.

From this histogram, one could argue that the “real” time is maybe slightly above 3.6 cycles per pair. It is maybe somewhere between 3.6 and 3.7 cycles. But that’s clearly a reasonable uncertainty.

Even though 72% of data points are between 3.6 and 3.65, the average is 3.6465… but that can be explained by the presence of outliers (for example, one measure was 17.2).

I like to report the minimum because it is easy to reason about: it is close to the best-case scenario for the processor. We know a lot about what processors can do in the best case… Yes, it is idealized but that’s fine.

My minimum is still the minimum of large averages (1000 pairs)… so it is more robust than you might expect. Also, I can reasonably expect the noise to be strictly positive, so the minimum makes sense as an estimator.

If you want to know how fast a human being can run, you look at the world’s records and pick the smallest time. When I run microbenchmarks, I want to know how fast my processor can run my code in ideal conditions.

Is the minimum necessarily a good metric? I think it is in CPU-bound microbenchmarks. If you can check that the average (on a quiet machine) is close to the minimum, then the minimum is surely sound. If you are doing service benchmarks, percentiles (1%, 20%, 50%, 80%, 99%) are very useful. There is nothing magical about the average because your distributions are almost never normal; it is just easy to compute.

When in doubt regarding which metric to use, just plot your data. If you can’t script it, just throw the numbers in a spreadsheet (like Excel) and generate some graphs.

Notice how I returned to this benchmark a week later, after the machine was updated and rebooted, and was able, without effort, to reproduce exactly my prior results.

Why don’t I include normality tests, standard errors and all that? Because I don’t need to. The best statistician is the one you never need. It is something of a myth that scientific results need fancy statistics: great scientific results require very little statistical sophistication. You just need to control the factors involved so that the numbers can speak for themselves.

A good analogy is how you weigh people. If you do things right, that is, you weigh the individual naked, always at the same time at the beginning of the day, before any food is eaten, then a single (short) measurement each day is more than good enough. The “error” is probably small and irrelevant. If you weigh people randomly during the day, after random meals, wearing random clothes, then you may need many measurements to accurately estimate someone’s weight. Of course, once you throw in the complexity of having to deal with a whole population, it becomes much harder to control all the factors involved and you may need fancy statistics again.

Wouldn’t I get more accurate results if I repeated my tests more often? Maybe not. If you pick a textbook, you might learn that averaging repeated measures brings you closer to a true value… but there are hidden assumptions behind this result that typically do not hold. Moreover, it is very hard to keep a modern system busy doing just one small thing for a long time. You can do it, but it requires a lot more care.

So long-running benchmarks are often not good idealized microbenchmarks because they also measure all sorts of irrelevant operations like unnecessary cache misses, context switches and so forth. The numbers you get often depend on how quiet your machine was at the time. You can run the same benchmark for three days in a row and get different numbers (with error margins far above 1%). It is certainly possible to get good long-running microbenchmarks, it is just harder.

Think about the surface of the Earth. If you move very little, you can safely assume that the ground is flat and that we live on a planar surface. If you move a lot, you will encounter mountains, seas, and so forth.

(Of course, system benchmarks can and maybe should be run over a long time… but that’s different!)

Can it hurt to run really exhaustive tests? I think it can. The computer might not tire of waiting for seconds and minutes, but the human mind fares a lot better when answers keep coming quickly. I favor many small and cheap microbenchmarks over a few long-running ones, even when the long-running benchmarks are perfectly clean.

The idea that if a benchmark takes longer to run, it ought to be better might seem intuitively self-evident, but I think we should be critical of this view. Fast is good.

Science and Technology links (January 12th, 2018)

  1. A few years ago, researchers in Denmark expressed concerns regarding the high concentrations of pesticides showing up in urine samples of Danish mothers and children. Last time I was in Denmark, a newspaper was reporting that there are surprising amounts of pesticides in the products sold in Denmark. A recent research article found that

    the adverse health effects of chronic pesticide residue exposure in the Danish population are very unlikely. The Hazard Index for pesticides for a Danish adult was on level with that of alcohol for a person consuming the equivalent of 1 glass of wine every seventh year.

  2. For the first time in American history, healthcare is the largest source of jobs, ahead of retail, the previous largest source.
  3. Farmers use laser beams to stop eagles from attacking their animals.
  4. Last year, we saw many new therapies based on gene editing. The most exciting gene editing technique right now is CRISPR-Cas9. A paper just reported that most of us are probably immune to CRISPR-Cas9, which means that we can’t receive an in vivo gene therapy based on CRISPR-Cas9 without concern that our immune system will fight it. This suggests that new techniques are needed.
  5. Gary Marcus caused quite a debate online by posting a paper entitled Deep Learning: A Critical Appraisal. I believe that Marcus’s main point is that deep learning is maybe not the golden path toward human-like intelligence. I’m not exactly sure why this should be so controversial given that I have heard one of the founders of the deep-learning school, Hinton, say exactly the same thing.

    Deep Learning is the dominant paradigm in artificial intelligence right now. And it works. It is not all hype. It solves real problems people care about.

    But we don’t know how far it can go. Yet it seems that there are fairly obvious limits. For example, nobody knows how to use deep-learning alone to prove theorems. Yet we have pretty good tools (that work right now) to automatically prove theorems.

  6. Many people are concerned about how our climate could change in the near future. The Atlantic has a piece on how we can use technology to regulate the climate (something they call “geo-engineering”).

How fast can you bit-interleave 32-bit integers? (SIMD edition)

In a previous post, I asked how fast one could interleave the bits between two 32-bit integers. That is, given 0b1011 (11 in decimal) and 0b1100 (12 in decimal), I want the number 0b11011010.

On recent (2013) Intel processors, the answer is that you process one pair of integers every two cycles using the pdep instruction. Deinterleaving can be done just as quickly using the pext instruction.

There are a few downsides to relying on the pdep instruction:

  • It is only available on recent x64 processors.
  • It is very slow on even the most recent AMD processors.

What else can we do? We can vectorize the processing. Recent x64 processors support the AVX2 instruction set. Vector instructions tend to get better over time. For example, we can expect future processors to include wider instructions (AVX-512 on x64 processors and ARM SVE). So vectorization is a good long-term investment.

For the most part, the approach I have taken is derived from the scalar approach. That is, we start by interleaving each element of the pair with zeros…

uint64_t interleave_uint32_with_zeros(uint32_t input)  {
    uint64_t word = input;
    word = (word ^ (word << 16)) & 0x0000ffff0000ffff;
    word = (word ^ (word << 8 )) & 0x00ff00ff00ff00ff;
    word = (word ^ (word << 4 )) & 0x0f0f0f0f0f0f0f0f;
    word = (word ^ (word << 2 )) & 0x3333333333333333;
    word = (word ^ (word << 1 )) & 0x5555555555555555;
    return word;
}

And then we put the result back together, shifting one of them by one bit…

interleave_uint32_with_zeros(x) 
  | (interleave_uint32_with_zeros(y) << 1);

With vector instructions, we can skip part of the processing because it is easy to shuffle bytes (with the _mm256_shuffle_epi8 intrinsic). I need to credit Geoff Langdale because I was initially, stupidly, trying to shuffle the bytes with shifts as in the scalar code. Otherwise, the code is very similar, except that it relies on intrinsics.

__m256i interleave_uint8_with_zeros_avx(__m256i word) {
  const __m256i m3 = _mm256_set1_epi64x(0x0f0f0f0f0f0f0f0f);
  const __m256i m4 = _mm256_set1_epi64x(0x3333333333333333);
  const __m256i m5 = _mm256_set1_epi64x(0x5555555555555555);
  word = _mm256_xor_si256(word, _mm256_slli_epi16(word, 4));
  word = _mm256_and_si256(word, m3);
  word = _mm256_xor_si256(word, _mm256_slli_epi16(word, 2));
  word = _mm256_and_si256(word, m4);
  word = _mm256_xor_si256(word, _mm256_slli_epi16(word, 1));
  word = _mm256_and_si256(word, m5);
  return word;
}

void interleave_avx2(uint32_2 *input, uint64_t *out) {
  __m256i xy = _mm256_lddqu_si256((const __m256i *)input);
  __m256i justx = _mm256_shuffle_epi8(
      xy, _mm256_set_epi8(-1, 11, -1, 10, -1, 9, -1, 8,
            -1, 3, -1, 2, -1, 1, -1, 0, -1, 11, -1, 10,
            -1, 9, -1, 8, -1, 3, -1, 2, -1, 1,-1, 0));
  __m256i justy = _mm256_shuffle_epi8(
      xy, _mm256_set_epi8(-1, 15, -1, 14, -1, 13, -1, 12,
             -1, 7, -1, 6, -1, 5, -1, 4, -1, 15, -1, 14,
             -1, 13, -1, 12, -1, 7, -1, 6, -1, 5, -1, 4));
  justx = interleave_uint8_with_zeros_avx(justx);
  justy = interleave_uint8_with_zeros_avx(justy);
  __m256i answer = _mm256_or_si256(justx, 
            _mm256_slli_epi16(justy, 1));
  _mm256_storeu_si256((__m256i *)out, answer);
}
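
As an aside, the uint32_2 type is not defined in the post; judging from how it is used here and in interleave_pdep further down (input.x, input.y), I assume it is simply a pair of 32-bit integers. Here is a quick sanity check, as a sketch, comparing the AVX2 routine against the scalar code; it assumes the functions above are pasted into the same file and is compiled with -mavx2.

#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

// My assumption for the pair type used above (it is not shown in the post):
typedef struct { uint32_t x, y; } uint32_2;

// ... paste interleave_uint32_with_zeros, interleave_uint8_with_zeros_avx
// and interleave_avx2 from above here ...

int main(void) {
  uint32_2 input[4] = {{11, 12}, {0xFFFFFFFFu, 0}, {0, 0xFFFFFFFFu},
                       {0xDEADBEEFu, 0xCAFEBABEu}};
  uint64_t simd[4], scalar[4];
  interleave_avx2(input, simd);  // four pairs at once
  for (int i = 0; i < 4; i++)
    scalar[i] = interleave_uint32_with_zeros(input[i].x)
              | (interleave_uint32_with_zeros(input[i].y) << 1);
  printf("%s\n", memcmp(simd, scalar, sizeof(simd)) ? "mismatch" : "match");
  return 0;
}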

Is this the best you can do? Kendall Willets commented that you can use a look-up for the interleave. Let us put this to good use:

__m256i interleave_uint8_with_zeros_avx_lut(__m256i word) {
  const __m256i m = _mm256_set_epi8(85, 84, 81, 80, 69, 68,
               65, 64, 21, 20, 17, 16, 5, 4, 1, 0, 85, 84, 
               81, 80, 69, 68, 65, 64, 21, 20, 17, 16, 5, 
               4, 1, 0);
  __m256i lownibbles =
      _mm256_shuffle_epi8(m, _mm256_and_si256(word,
            _mm256_set1_epi8(0xf)));
  __m256i highnibbles = _mm256_and_si256(word,
          _mm256_set1_epi8(0xf0));
  highnibbles = _mm256_srli_epi16(highnibbles, 4);
  highnibbles = _mm256_shuffle_epi8(m, highnibbles);
  highnibbles = _mm256_slli_epi16(highnibbles, 8);
  return _mm256_or_si256(lownibbles, highnibbles);
}
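
Where do the constants in m come from? Each entry is a nibble with its bits spread apart (bit k moved to bit 2k), that is, the low byte of interleave_uint32_with_zeros applied to the values 0 through 15. A small sketch that regenerates the sixteen table entries:

#include <stdint.h>
#include <stdio.h>

int main(void) {
  // Spread the bits of each nibble n: bit k goes to bit 2k.
  for (int n = 0; n < 16; n++) {
    uint8_t spread = 0;
    for (int k = 0; k < 4; k++) spread |= (uint8_t)(((n >> k) & 1) << (2 * k));
    printf("%d ", spread);  // prints 0 1 4 5 16 17 20 21 64 65 68 69 80 81 84 85
  }
  printf("\n");
  return 0;
}

The _mm256_set_epi8 call lists these values from index 15 down to index 0, twice, because _mm256_shuffle_epi8 looks up bytes within each 128-bit lane independently.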

So how well does it do?

shifts                3.6 cycles
pdep                  2.1 cycles
SIMD                  2.2 cycles
SIMD (with lookup)    1.6 cycles

So, roughly speaking, the vector code is just as fast as the code using the pdep instruction on my Skylake processor. I have not tested it on an AMD processor, but it should run fine (although maybe not as efficiently).

You can slightly improve the performance (down to 2.1 cycles) at the expense of using about twice as much code if you use an idea Geoff offered: save the final shift by using two interleaving-with-zeros functions.

When you use table lookups to interleave the nibbles, the vectorized code is clearly faster than the pdep instruction.

It could probably be ported to ARM NEON, but I am not sure it would be efficient on current ARM processors because they only have 128-bit vector registers. Moreover, ARM processors can do the XOR and the shift operation in a single instruction.

My source code is available.

(This post was updated with better results.)

How fast can you bit-interleave 32-bit integers?

A practical trick in software is to “bit-interleave” your data. Suppose that I have two 4-bit integers like 0b1011 (11 in decimal) and 0b1100 (12 in decimal). I can interleave the two numbers to get one 8-bit number, 0b11011010, where I simply pick the most significant bit from the first number, then the most significant bit from the second integer, then the second most significant bit from the first integer, and so forth. This is a useful trick because if you sort the interleaved numbers, you can then quickly filter numbers on the most significant bits of all components at once. For example, if you are looking for coordinates such that the first and second components are greater than or equal to 0b1000, then all the values you are looking for take the form 0b11****** and thus they will be sorted toward the end. This is called a z-order.

Of course, you still have to interleave and de-interleave bits. How expensive is that?

The problem is rather symmetrical, so I will only present the code for interleaving. A useful function is one that takes a 32-bit integer, and spreads its bits over a 64-bit integer, effectively interleaving it with zeros:

uint64_t interleave_uint32_with_zeros(uint32_t input)  {
    uint64_t word = input;
    word = (word ^ (word << 16)) & 0x0000ffff0000ffff;
    word = (word ^ (word << 8 )) & 0x00ff00ff00ff00ff;
    word = (word ^ (word << 4 )) & 0x0f0f0f0f0f0f0f0f;
    word = (word ^ (word << 2 )) & 0x3333333333333333;
    word = (word ^ (word << 1 )) & 0x5555555555555555;
    return word;
}

Given this function, you can interleave in the following manner:

interleave_uint32_with_zeros(x) 
  | (interleave_uint32_with_zeros(y) << 1);

I believe that this is the standard approach. It seems efficient enough.
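
As a quick sanity check against the worked example above (a sketch; it assumes interleave_uint32_with_zeros is in the same file): with this formula, the argument that gets shifted by one ends up in the more significant bit of each pair.

#include <stdint.h>
#include <stdio.h>

// ... paste interleave_uint32_with_zeros from above here ...

int main(void) {
  // Interleave 0b1100 (12) and 0b1011 (11), with 11's bits going to the odd
  // (more significant) positions: the result is 0b11011010 (0xDA), as above.
  uint64_t z = interleave_uint32_with_zeros(12)
             | (interleave_uint32_with_zeros(11) << 1);
  printf("0x%llX\n", (unsigned long long)z);  // prints 0xDA
  return 0;
}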

Can you go faster? You might. On recent x64 processors, there are bit manipulation instructions ideally suited to the problem (pdep and pext). The interleaving code looks like this:

uint64_t interleave_pdep(uint32_2 input)  {
    return _pdep_u64(input.x, 0x5555555555555555) 
     | _pdep_u64(input.y,0xaaaaaaaaaaaaaaaa);
}

The decoding is similar but uses the pext instruction instead.
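
For completeness, here is a sketch of what the decoding can look like (my reconstruction, not necessarily the code in the repository): pext gathers the bits selected by a mask, so the even bits give back one value and the odd bits the other. It needs the BMI2 instruction set (compile with -mbmi2).

#include <stdint.h>
#include <x86intrin.h>

// uint32_2 is assumed to be a plain pair of 32-bit integers (not shown in the post).
typedef struct { uint32_t x, y; } uint32_2;

uint32_2 deinterleave_pext(uint64_t input) {
  uint32_2 answer;
  answer.x = (uint32_t)_pext_u64(input, 0x5555555555555555);  // even bits
  answer.y = (uint32_t)_pext_u64(input, 0xaaaaaaaaaaaaaaaa);  // odd bits
  return answer;
}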

Suppose that you have a bunch of data points: is it worth it to use the fancy x64 instructions?

Let us record how many cycles are needed to interleave a pair of 32-bit values:

shifts   3.6 cycles
pdep     2.1 cycles

So, roughly speaking, using specialized instructions doubles the speed. In some cases, it might be worth the effort.

The pdep function is probably optimal in the sense that pdep has a throughput of one instruction per cycle, and I need two pdep instructions to interleave a pair of values.

Deinterleaving takes about as long when using my implementation and the clang compiler. The GCC compiler seems to hate my deinterleaving code and produces very slow binaries.

My source code is available.

Is this the best we can do? I suspect not. My guess is that with more careful engineering, we can go down to 1 cycle per interleave.

Science and Technology links (January 5th, 2018)

  1. Regarding solar energy, Tyler Cowen writes:

    There is now a doctrine of what I call “solar triumphalism”: the price of panels has been falling exponentially, the technology makes good practical sense, and only a few further nudges are needed for solar to become a major energy source. Unfortunately, this view seems to be wrong.

  2. Laura Deming is a venture capitalist who has a “longevity FAQ” where she summarizes what is known about human longevity.
  3. Mitch Daniels is president of Purdue University and a former governor of Indiana. He writes that Avoiding GMOs [genetically modified organisms] isn’t just anti-science: It’s immoral.
  4. For web developers, having a Computer Science degree does not improve income, or if it does, the effect is small (about $1000).
  5. Year 2017 was the best ever for airline safety.
  6. Lixisenatide, a drug developed to treat type 2 diabetes, shows neuroprotective effects in a mouse model of Alzheimer’s disease.
  7. Only 6% of Britain’s territory is built upon, and only 0.1% is densely built upon. Yet people estimate much greater fractions:

    The story of Britain’s treasured green landscapes being gobbled up by greedy industrialists and developers is part of national folklore. The myth that we are “concreting over” our countryside plays to these anxieties, magnified in recent years by the social changes brought about by globalisation and new technology.

  8. The number 2^77,232,917 – 1 is prime. It takes about 30 hours to verify that it is a prime number. Prime numbers that are one less than a power of two are called “Mersenne primes”, and this latest prime number is the largest Mersenne prime known to us. There are now 50 known Mersenne primes. They are useful in cryptography, though I am not sure this latest prime number will serve a useful purpose.
  9. A young postdoc has discovered a security flaw, called Meltdown, affecting all recent Intel x64 processors. The issue arises because Intel processors speculatively try to read arbitrary memory, and only later cancel the read when they find out that the read is not allowed. This allows a program to bring any memory into the CPU cache, where it can be read using some clever tricks. Given that processors cannot be changed and that most software won’t be recompiled anytime soon, this means that vendors like Apple and Microsoft have to issue fixes that will potentially slow down our computers. It seems that AMD processors are not affected. A related flaw (Spectre) is believed to theoretically affect most modern processors, including those in your mobile devices. Unlike Meltdown, Spectre is harder to exploit, but it is still being taken seriously.

    Simpler processors like that of the Raspberry Pi are immune to these problems. In effect, the lesson is that more complex systems are more difficult to secure.

Can 32-byte alignment alleviate 4K aliasing?

In my previous post, I considered some performance problems that can plague simple loops that read and write data…

for (int i = 0; i < a.length; ++i) {
  a[i] += s * b[i];
}

Once vectorized, such a loop can suffer from what Intel calls 4K Aliasing: if the arrays are stored in memory at locations that differ by “almost” a multiple of 4kB, then it can look to the processor like you are storing data and quickly reading it again. This confuses the processor because it discriminates between addresses using only their least significant 12 bits.

In my previous blog post, I conjectured that the problem is harder to generate if your arrays are aligned on 32-byte boundaries (that is, if the array starts at an address that is divisible by 32).

Aleksey Shipilëv disagreed and stated on Twitter that the problem was easily reproducible even when arrays are 32-byte aligned.

Aleksey refers to work done on OpenJDK to prevent this problem with respect to array copies.

Aleksey used an older Haswell processor, instead of the more recent Skylake processor that I am using. So I am going to go back to a Haswell processor.

What I am going to do this time is align the destination array (a) on 32-byte boundaries, always. Then I am going to set the other array 4096 bytes ahead, minus some offset in bytes. I then report the number of cycles used per array element.
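
Here is a sketch of how such a layout can be produced (my reconstruction; the actual align32byte.c may do it differently): carve both arrays out of one page-aligned block and nudge the second pointer so that its distance from the first is congruent to 4096 bytes minus the offset, modulo 4096.

#include <stdint.h>
#include <stdlib.h>

// Sketch: a is page aligned (hence 32-byte aligned) and the distance from a
// to b is (4096 - offset) modulo 4096, with no overlap between the arrays.
static void setup(size_t n, size_t offset, double **pa, double **pb) {
  size_t bytes = n * sizeof(double);
  size_t allocsize = ((2 * bytes + 2 * 4096) + 4095) / 4096 * 4096;
  char *base = aligned_alloc(4096, allocsize);  // freed by the caller via *pa
  char *a = base;
  char *b = a + bytes;                            // start past the end of a...
  b += 4096 - ((size_t)(b - a) + offset) % 4096;  // ...then fix the low 12 bits
  *pa = (double *)a;
  *pb = (double *)b;
}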

$ gcc -O3 -o align32byte align32byte.c -mavx2  && ./align32byte
offset: 0 bytes
vecdaxpy(a, b, s, N)	:  0.543 cycles per operation
offset: 1 bytes
vecdaxpy(a, b, s, N)	:  0.707 cycles per operation
offset: 2 bytes
vecdaxpy(a, b, s, N)	:  0.707 cycles per operation
offset: 3 bytes
vecdaxpy(a, b, s, N)	:  0.707 cycles per operation
offset: 4 bytes
vecdaxpy(a, b, s, N)	:  0.707 cycles per operation
offset: 5 bytes
vecdaxpy(a, b, s, N)	:  0.707 cycles per operation
offset: 6 bytes
vecdaxpy(a, b, s, N)	:  0.707 cycles per operation
offset: 7 bytes
vecdaxpy(a, b, s, N)	:  0.707 cycles per operation
offset: 8 bytes
vecdaxpy(a, b, s, N)	:  0.707 cycles per operation
offset: 9 bytes
vecdaxpy(a, b, s, N)	:  0.707 cycles per operation
offset: 10 bytes
vecdaxpy(a, b, s, N)	:  0.707 cycles per operation
offset: 11 bytes
vecdaxpy(a, b, s, N)	:  0.707 cycles per operation
offset: 12 bytes
vecdaxpy(a, b, s, N)	:  0.707 cycles per operation
offset: 13 bytes
vecdaxpy(a, b, s, N)	:  0.707 cycles per operation
offset: 14 bytes
vecdaxpy(a, b, s, N)	:  0.707 cycles per operation
offset: 15 bytes
vecdaxpy(a, b, s, N)	:  0.707 cycles per operation
offset: 16 bytes
vecdaxpy(a, b, s, N)	:  0.707 cycles per operation
offset: 17 bytes
vecdaxpy(a, b, s, N)	:  0.707 cycles per operation
offset: 18 bytes
vecdaxpy(a, b, s, N)	:  0.707 cycles per operation
offset: 19 bytes
vecdaxpy(a, b, s, N)	:  0.707 cycles per operation
offset: 20 bytes
vecdaxpy(a, b, s, N)	:  0.707 cycles per operation
offset: 21 bytes
vecdaxpy(a, b, s, N)	:  0.707 cycles per operation
offset: 22 bytes
vecdaxpy(a, b, s, N)	:  0.707 cycles per operation
offset: 23 bytes
vecdaxpy(a, b, s, N)	:  0.707 cycles per operation
offset: 24 bytes
vecdaxpy(a, b, s, N)	:  0.707 cycles per operation
offset: 25 bytes
vecdaxpy(a, b, s, N)	:  0.707 cycles per operation
offset: 26 bytes
vecdaxpy(a, b, s, N)	:  0.707 cycles per operation
offset: 27 bytes
vecdaxpy(a, b, s, N)	:  0.707 cycles per operation
offset: 28 bytes
vecdaxpy(a, b, s, N)	:  0.707 cycles per operation
offset: 29 bytes
vecdaxpy(a, b, s, N)	:  0.707 cycles per operation
offset: 30 bytes
vecdaxpy(a, b, s, N)	:  0.707 cycles per operation
offset: 31 bytes
vecdaxpy(a, b, s, N)	:  0.707 cycles per operation
offset: 32 bytes (1 32-byte)
vecdaxpy(a, b, s, N)	:  0.648 cycles per operation
offset: 64 bytes (2 32-byte)
vecdaxpy(a, b, s, N)	:  0.637 cycles per operation
offset: 96 bytes (3 32-byte)
vecdaxpy(a, b, s, N)	:  0.625 cycles per operation
offset: 128 bytes (4 32-byte)
vecdaxpy(a, b, s, N)	:  0.613 cycles per operation
offset: 160 bytes (5 32-byte)
vecdaxpy(a, b, s, N)	:  0.613 cycles per operation
offset: 192 bytes (6 32-byte)
vecdaxpy(a, b, s, N)	:  0.613 cycles per operation
offset: 224 bytes (7 32-byte)
vecdaxpy(a, b, s, N)	:  0.625 cycles per operation
offset: 256 bytes (8 32-byte)
vecdaxpy(a, b, s, N)	:  0.613 cycles per operation
offset: 288 bytes (9 32-byte)
vecdaxpy(a, b, s, N)	:  0.625 cycles per operation
offset: 320 bytes (10 32-byte)
vecdaxpy(a, b, s, N)	:  0.613 cycles per operation
offset: 352 bytes (11 32-byte)
vecdaxpy(a, b, s, N)	:  0.602 cycles per operation
offset: 384 bytes (12 32-byte)
vecdaxpy(a, b, s, N)	:  0.590 cycles per operation
offset: 416 bytes (13 32-byte)
vecdaxpy(a, b, s, N)	:  0.566 cycles per operation
offset: 448 bytes (14 32-byte)
vecdaxpy(a, b, s, N)	:  0.555 cycles per operation
offset: 480 bytes (15 32-byte)
vecdaxpy(a, b, s, N)	:  0.555 cycles per operation
offset: 512 bytes (16 32-byte)
vecdaxpy(a, b, s, N)	:  0.555 cycles per operation

So there is a 30% performance penalty for offsets by a small number of bytes. The penalty drops to 20% or less as soon as the second array is aligned on a 32-byte boundary. The penalty eventually goes away entirely when the offset reaches about 512 bytes.

Let us run the same experiment on my Skylake processor:

$ gcc -O3 -o align32byte align32byte.c -mavx2  && ./align32byte
offset: 0 bytes
vecdaxpy(a, b, s, N)	:  0.477 cycles per operation
offset: 1 bytes
vecdaxpy(a, b, s, N)	:  0.805 cycles per operation
offset: 2 bytes
vecdaxpy(a, b, s, N)	:  0.797 cycles per operation
offset: 3 bytes
vecdaxpy(a, b, s, N)	:  0.797 cycles per operation
offset: 4 bytes
vecdaxpy(a, b, s, N)	:  0.797 cycles per operation
offset: 5 bytes
vecdaxpy(a, b, s, N)	:  0.797 cycles per operation
offset: 6 bytes
vecdaxpy(a, b, s, N)	:  0.797 cycles per operation
offset: 7 bytes
vecdaxpy(a, b, s, N)	:  0.797 cycles per operation
offset: 8 bytes
vecdaxpy(a, b, s, N)	:  0.797 cycles per operation
offset: 9 bytes
vecdaxpy(a, b, s, N)	:  0.805 cycles per operation
offset: 10 bytes
vecdaxpy(a, b, s, N)	:  0.797 cycles per operation
offset: 11 bytes
vecdaxpy(a, b, s, N)	:  0.797 cycles per operation
offset: 12 bytes
vecdaxpy(a, b, s, N)	:  0.797 cycles per operation
offset: 13 bytes
vecdaxpy(a, b, s, N)	:  0.797 cycles per operation
offset: 14 bytes
vecdaxpy(a, b, s, N)	:  0.797 cycles per operation
offset: 15 bytes
vecdaxpy(a, b, s, N)	:  0.797 cycles per operation
offset: 16 bytes
vecdaxpy(a, b, s, N)	:  0.797 cycles per operation
offset: 17 bytes
vecdaxpy(a, b, s, N)	:  0.797 cycles per operation
offset: 18 bytes
vecdaxpy(a, b, s, N)	:  0.812 cycles per operation
offset: 19 bytes
vecdaxpy(a, b, s, N)	:  0.797 cycles per operation
offset: 20 bytes
vecdaxpy(a, b, s, N)	:  0.797 cycles per operation
offset: 21 bytes
vecdaxpy(a, b, s, N)	:  0.797 cycles per operation
offset: 22 bytes
vecdaxpy(a, b, s, N)	:  0.805 cycles per operation
offset: 23 bytes
vecdaxpy(a, b, s, N)	:  0.797 cycles per operation
offset: 24 bytes
vecdaxpy(a, b, s, N)	:  0.797 cycles per operation
offset: 25 bytes
vecdaxpy(a, b, s, N)	:  0.797 cycles per operation
offset: 26 bytes
vecdaxpy(a, b, s, N)	:  0.797 cycles per operation
offset: 27 bytes
vecdaxpy(a, b, s, N)	:  0.797 cycles per operation
offset: 28 bytes
vecdaxpy(a, b, s, N)	:  0.797 cycles per operation
offset: 29 bytes
vecdaxpy(a, b, s, N)	:  0.797 cycles per operation
offset: 30 bytes
vecdaxpy(a, b, s, N)	:  0.797 cycles per operation
offset: 31 bytes
vecdaxpy(a, b, s, N)	:  0.789 cycles per operation
offset: 32 bytes (1 32-byte)
vecdaxpy(a, b, s, N)	:  0.586 cycles per operation
offset: 64 bytes (2 32-byte)
vecdaxpy(a, b, s, N)	:  0.570 cycles per operation
offset: 96 bytes (3 32-byte)
vecdaxpy(a, b, s, N)	:  0.562 cycles per operation
offset: 128 bytes (4 32-byte)
vecdaxpy(a, b, s, N)	:  0.562 cycles per operation
offset: 160 bytes (5 32-byte)
vecdaxpy(a, b, s, N)	:  0.555 cycles per operation
offset: 192 bytes (6 32-byte)
vecdaxpy(a, b, s, N)	:  0.555 cycles per operation
offset: 224 bytes (7 32-byte)
vecdaxpy(a, b, s, N)	:  0.547 cycles per operation
offset: 256 bytes (8 32-byte)
vecdaxpy(a, b, s, N)	:  0.555 cycles per operation
offset: 288 bytes (9 32-byte)
vecdaxpy(a, b, s, N)	:  0.555 cycles per operation
offset: 320 bytes (10 32-byte)
vecdaxpy(a, b, s, N)	:  0.555 cycles per operation
offset: 352 bytes (11 32-byte)
vecdaxpy(a, b, s, N)	:  0.555 cycles per operation
offset: 384 bytes (12 32-byte)
vecdaxpy(a, b, s, N)	:  0.523 cycles per operation
offset: 416 bytes (13 32-byte)
vecdaxpy(a, b, s, N)	:  0.508 cycles per operation
offset: 448 bytes (14 32-byte)
vecdaxpy(a, b, s, N)	:  0.500 cycles per operation
offset: 480 bytes (15 32-byte)
vecdaxpy(a, b, s, N)	:  0.484 cycles per operation
offset: 512 bytes (16 32-byte)
vecdaxpy(a, b, s, N)	:  0.477 cycles per operation

Interestingly, the penalty on Skylake is higher (over 50%) for offsets by arbitrary numbers of bytes. But when the arrays differ by a multiple of 32 bytes, the penalty is reduced to about 20%, and it eventually goes away entirely.

My numbers suggest the following observations:

  • It does not look like 4K aliasing improved between Haswell and Skylake. I have a recent laptop with an even more recent Kaby Lake processor, and the problem is still there. I am not including my laptop numbers because I do not trust benchmarking on a laptop… but I have enough confidence to state that the 4k aliasing problem is still a thing on the very latest Intel processors. What about AMD processors?
  • Aligning arrays on 32-byte boundaries does seem to alleviate the problem. In my particular experiments, I only got a 20% penalty due to aliasing which seems more acceptable than a 50% penalty.

To me, it is an annoying problem because it means that when doing rather common computations over loops, your performance will differ quite a bit depending on where in memory your arrays are located. Memory addresses are not something you have a lot of control over in the normal course of software development. It is also not obvious to me how to make the problem go away. You can try to detect the problem and make the loop run in reverse… but it is not entirely trivial.

My source code is available.

Don’t make it appear like you are reading your own recent writes

Richard Startin recently published a Java benchmark where the performance of a loop varies drastically depending on the size of the arrays involved. The loop is simple:

for (int i = 0; i < a.length; ++i) {
  a[i] += s * b[i];
}

If the array size is 1000, the performance is lower than if the array size is 1024. The weird result only occurs if AVX2 autovectorization is enabled, that is, only if Java uses vector instructions.

It is hard to reason from a Java benchmark. So many things could be going on! What are the addresses of the arrays? I am sure you can find out, but it is ridiculously hard.

Let me rewrite the loop in C using AVX2 intrinsics:

void vecdaxpy(double * a, double * b, double s, size_t len) {
    const __m256d vs = _mm256_set1_pd(s);
    for (size_t i = 0; i + 4 <= len ; i += 4) {
      const __m256d va = _mm256_loadu_pd(a + i);
      const __m256d vb = _mm256_loadu_pd(b + i);
      const __m256d mults = _mm256_mul_pd(vb,vs);
      const __m256d vavb = _mm256_add_pd(va,mults);
      _mm256_storeu_pd(a + i,vavb);
    }
}

In Java, you have to work hard to even know that it managed to vectorize the loop the way you expect. With my C intrinsics, I have a pretty good idea of what the compiler will produce.

I can allocate one large memory block and fit two arrays of size N, for various values of N, in a continuous manner. It seems that this is what Richard expected Java to do, but we cannot easily know how and where Java allocates its memory.

I can then report how many cycles are used per pair of elements (I divide by N):

$ gcc -O3 -o align align.c -mavx2  && ./align
N = 1000
vecdaxpy(a,b,s,N)   	:  0.530 cycles per operation
N = 1004
vecdaxpy(a,b,s,N)   	:  0.530 cycles per operation
N = 1008
vecdaxpy(a,b,s,N)   	:  0.530 cycles per operation
N = 1012
vecdaxpy(a,b,s,N)   	:  0.528 cycles per operation
N = 1016
vecdaxpy(a,b,s,N)   	:  0.530 cycles per operation
N = 1020
vecdaxpy(a,b,s,N)   	:  0.529 cycles per operation
N = 1024
vecdaxpy(a,b,s,N)   	:  0.525 cycles per operation

So the speed is constant with respect to N (within an error margin of 1%).

There are four doubles in each 256-bit register, so I use about 2 cycles to process a pair of 256-bit registers. That sounds about right. I need to load two registers, do a multiplication, an addition, and a store. It is not possible to do two loads and a store in one cycle, so 2 cycles seems close to the best one can do.

I could flush the arrays from cache, and things get a bit slower (over four times slower), but the speed is still constant with respect to N.

Whatever hardware issue you think you have encountered, you ought to be able to reproduce it with other (simpler) programming languages. Anything hardware related should be reproducible with several programming languages. Why reason about performance from Java alone, unless it is a Java-specific issue? If you cannot reproduce it with another programming language, how can you be sure that you have the right model?

Still, Richard’s result is real. If I use arrays of size just under a multiple of 4kB, and I offset them just so that they are not 32-byte aligned (the size of a vector register), I get a 50% performance penalty.

Intel CPUs discriminate between memory addresses based on their least significant bits (e.g., the least significant 12 bits). A worst-case scenario is one where you read memory at an address that looks (as far as the least significant bits are concerned) like an address that was very recently written to.

The Intel documentation calls this 4K Aliasing:

4-KByte memory aliasing occurs when the code stores to one memory location and shortly after that it loads from a different memory location with a 4-KByte offset between them. The load and store have the same value for bits 5 – 11 of their addresses and the accessed byte offsets should have partial or complete overlap. 4K aliasing may have a five-cycle penalty on the load latency. This penalty may be significant when 4K aliasing happens repeatedly and the loads are on the critical path. If the load spans two cache lines it might be delayed until the conflicting store is committed to the cache. Therefore 4K aliasing that happens on repeated unaligned Intel AVX loads incurs a higher performance penalty.

You can minimize the risk of trouble by aligning your data on 32-byte boundaries. It is likely that Java does not align arrays on cache lines or even on 32-byte boundaries. Touching more than one 32-byte region increases the likelihood of aliasing.

The problem encountered by Richard is different from an old-school data alignment issue where loads are slower because the memory address is not quite right. Loading and storing vector registers quickly does not require alignment. The problem we have in Richard’s example is that we are storing values, and then it looks (from the processor’s point of view) like we might be loading them again quickly… this confusion incurs a penalty.

What would happen if Richard flipped a and b in his code?

for (int i = 0; i < a.length; ++i) {
  b[i] += s * a[i];
}

Because we write to array b in this example (instead of array a), I suspect he would find that the worst case is having arrays of size just slightly over 4kB.

He could also try to iterate over the data in reverse order but this could confuse the compiler and prevent autovectorization.

I also suspect that this type of aliasing is only going to get worse as vector registers get larger (e.g., as we move to AVX-512). In the C version of Roaring bitmaps, we require 32-byte alignment when allocating bitsets. We might consider 64-byte alignment in the future.

My source code is available.

(To make sure your results are stable, avoid benchmarking on a laptop. Use a desktop configured for testing with a flat CPU frequency.)

Year 2017: technological highlights

  1. DeepStack and Libratus become the first computer programs to beat professional poker players.
  2. We are using synthetic cartilage to help people with arthritis.
  3. Stem cells can regrow a whole new tooth.
  4. We have made significant progress in rejuvenating old tissues and organs with senolytics. Senescent cells are believed to be one of the major causes of aging, and we are approaching the day when we can clear them from our bodies. Clearing senescent cells has been shown to alleviate osteoporosis in mice.
  5. Google has computing pods capable of 11.5 petaflops. It is a lot. By some estimates, it should be sufficient to simulate the human brain.
  6. Intel releases the Core i9 X-series which can produce one teraflop of computing performance for about $2000.
  7. Apple now offers sophisticated augmented reality applications on iPhones and iPads.
  8. The German company DeepL achieves a breakthrough in machine translation. It is reported to be as good as amateur human translators at many tasks.
  9. For $350, you can buy a 400 GB memory card on Amazon. It is enough to record everything you hear for four years. It fits on the tip of a finger.
  10. Supplementation with osteocalcin, a natural hormone, rejuvenates memory in mice in addition to rejuvenating muscles.
  11. A robot did better than 80% of human students on a Japanese university entrance test.
  12. Google’s Pixel Buds are headphones that can translate in real time any one of 40 different languages.
  13. A team from Google (Alphabet/DeepMind) has created a computer system (AlphaZero) that can learn games like Go and Chess in a few hours based only on the rules.
  14. A woman with a transplanted uterus gives birth.

Multicore versus SIMD instructions: the “fasta” case study

Setting aside graphics processors, most commodity processors support at least two distinct parallel execution models.

  1. Most programmers are familiar with the multicore model, whereby we split the computation into distinct parts that get executed on distinct cores using threads or goroutines.

    Speeding up your program by using many cores is not exactly free, however. The cores you are using are not available to the other programs on your machine. Thus you may be able to complete the processing faster, but it is unclear whether, all things considered, you are going to use less energy. Moreover, on a busy server, multicore processing could offer little gain.

    More critically, as a rule, you should not expect your program to go N times faster even if you have N cores. Even when your problems are conveniently embarrassingly parallel, it can be difficult to scale in practice with the number of cores you have.

    It is not uncommon for threaded programs to run slower.

  2. There is another parallelization model: vectorization. Vectorization relies on Single instruction, multiple data (SIMD) instructions. It is relatively easy for hardware engineers to design SIMD instructions, so most processors have them.

    Sadly, unlike multicore programming, there is very little support within programming languages for vectorization. For the most part, you are expected to hope that your compiler will automatically vectorize your code. It works amazingly well, but compilers simply do not reengineer your algorithms… which is often what is needed to benefit fully from vectorization.

My belief is that the performance benefits of vectorization are underrated.

As an experiment, I decided to look at the fasta benchmark from the Computer Language Benchmarks Game. It is a relatively simple “random generation” benchmark. As is typical, you get the best performance with C programs. I ran the current best-performing programs for single-threaded (one core) and multithreaded (multicore) execution. The multicore approach shaves about 30% of the running time, but it uses the equivalent of two cores fully:

                    elapsed time    total time (all CPUs)
single-threaded     1.36 s          1.36 s
multicore           1.00 s          2.00 s

Is the multicore approach “better”? It is debatable. There are many more lines of code, and the gain is not large.

The numbers on the Computer Language Benchmarks Game page differ from mine, but the conclusion is the same: the multicore version is faster (up to twice as fast) but it uses twice the total running time (accounting for all CPUs).

I decided to reengineer the single-threaded approach so that it would be vectorized. Actually, I only vectorized a small fraction of the code (what amounts to maybe 20 lines in the original code).

                    elapsed time    total time (all CPUs)
vectorized          0.31 s          0.31 s

I am about four times faster, and I am only using one core.

My version is not strictly equivalent to the original, so extra work would be needed to qualify as a drop-in replacement, but I claim that it could be done without performance degradation.

I make my code available so that you can easily reproduce my results and correct as needed. Your results will vary depending on your compiler, hardware, system, and so forth.

Science and Technology links (December 29th, 2017)

  1. It is often believed that, in the developed world, more land is used to host human beings as time goes by, reducing space for forest and wildlife. That is not true:

    Forests are spreading in almost all Western countries, with the fastest growth in places that historically had rather few trees. In 1990 28% of Spain was forested; now the proportion is 37%. In both Greece and Italy, the growth was from 26% to 32% over the same period. Forests are gradually taking more land in America and Australia. Perhaps most astonishing is the trend in Ireland. Roughly 1% of that country was forested when it became independent in 1922. Now forests cover 11% of the land, and the government wants to push the proportion to 18% by the 2040s.

  2. Having older brothers makes you more likely to be gay (assuming you are a man). The evidence appears to be overwhelming. I have not yet read an explanation for why this would be true.
  3. Envy served us well in the past, when someone taking more food than needed could kill the rest of the tribe. Today, being envious of others is a disease:

    No evidence is found for the idea that envy acts as a useful motivator. Greater envy is associated with slower — not higher — growth of psychological well-being in the future. Nor is envy a predictor of later economic success.

    If you find yourself being envious of people having more success than you do, seek help… try to calm your envy. It is not good for you.

  4. Vitamin K is an anticalcification, anticancer, bone-forming and insulin-sensitising molecule. It is not uncommon to be vitamin-K deficient.
  5. From parabiosis experiments, we know that there are “aging factors” in the blood of old people that reduce their fitness. As we age, we tend to have more Eotaxin-1 and it gets passed along in blood plasma. Eotaxin-1 is associated with poorer mental abilities. What would happen if we removed Eotaxin-1 from the blood of older people? We do not know.