A dot (or scalar) product is a fairly simple operation that simply sums the many products:

float sum = 0; for (size_t i = 0; i < len; i++) { sum += x1[i] * x2[i]; } return sum;

It is nevertheless tremendously important. You know these fancy machine learning algorithms we keep hearing about? If you dig deep under all the sophistication, you often find lots and lots of dot products.

Xavier Arteaga had a few interesting comments on a previous post of mine:

- “I would say that CPU frequency does not really make a difference when computing large amounts of data.”
- “I guess for intensive computations which require little memory and lots of operations AVX512 [new fancy instruction sets provided by Intel] provides a better performance.”

I disagree with Xavier, but I feel the need to provide some analysis. So let us evaluate Xavier’s observations with the dot product. Before I begin, let me point out that I am not a linear algebra expert, there has been lots and lots of fancy engineering done on these problems. If you need fast dot product software, find a good library, don’t grab it out of my blog.

I implemented a simple benchmark that computes dot products over increasingly larger vectors. I use three modes: “standard” math which is what you get when you simply compile a dot-product loop, a version with 128-bit vectors (SSE) and a version with 256-bit vectors (AVX2).

total input size | cycles per pair of floats | bytes per cycle | mode |
---|---|---|---|

8 kB | 4 | 2 | standard math |

8 kB | 1.0 | 7.8 | fast math (SSE) |

8 kB | 0.55 | 15 | fast math (AVX2) |

16 MB | 4 | 2 | standard math |

16 MB | 1.0 | 7 | fast math (SSE) |

16 MB | 0.9 | 9 | fast math (AVX2) |

256 MB | 4 | 2 | standard math |

256 MB | 1.3 | 6.0 | fast math (SSE) |

256 MB | 1.2 | 6.7 | fast math (AVX2) |

Can we reason about the AVX2 speed? So you have two loads (uses as little as one cycle), one cheap vectorized multiplication and one cheap vectorized addition. You can multiply two vectors of eight floating-point values per cycle, easily on a recent Intel processor. Then there is overhead due to the loop. Generously, say that it takes 5 cycles to multiply to pair of vectors and add them to a set of accumulators… that is about 0.6 cycles per multiplication. As expected, my system does it faster (0.55 cycles per pair) which means that it takes fewer than 5 cycles per pair of eight values.

What is apparent is that we are quickly hitting a wall… for large inputs (e.g., 256 MB and more). This wall has to do with how quickly our single core can grab data from the memory subsystem. I suspect that Xavier is correct: this wall has probably little dependency on the CPU frequency. Furthermore, having fancier instructions (e.g., those from the AVX-512 instruction sets) will not help you.

What can we conclude?

- If you have to process lots of data, and do dirt cheap operations (e.g., a vectorized dot product), then your single processor core is easily starved for data. That’s the part where Xavier is right.
- However, it is important to qualify what we mean by “cheap tasks”. Even just computing the dot product in a standard-compliant manner is entirely compute-bound in my experiments. Lots and lots of streaming operations like parsing a document, compressing data, and so forth, are likely to be compute bound.

To put it otherwise, I would say that while it is possible to design functions that stream through data in such a way that they are memory-bound over large inputs (e.g., copying data), these functions need to be able to eat through several bytes of data per cycle. Because our processors are vectorized and superscalar, it is certainly in the realm of possibilities, but harder than you might think.

You didn’t try the dot product instruction available since SSE 4.1: DPPS. SSE code that use this instruction: https://github.com/WojciechMula/toys/blob/master/sse-dotprod/sse-dpps.c

It is going to be slower, no? Keep in mind that I am using large inputs.

Perhaps it’d be slower (AFAIR dpps throughput is low); I’m curious how bad it’d be.

I’m confused about your conclusion (2) that the operation is compute bound. As your benchmarks suggest and you have stated, the dot product operation is clearly memory bound when the input size is large. Do you mean that the bottleneck is the number of floating point units rather than instruction decode or something when the inputs size is small?

I refer to the standard-compliant version. The use of the fast intrinsics is not equivalent to the standard C code I offered.

Standard-compliant dot product is, in my tests, entirely compute bound. Look at my table, follow the last column.

Once you allow yourself to use fast SIMD instructions, then the problem becomes memory-bound… for large arrays.

Unless compilers are more clever than I thought (and I think you don’t actually care about exact order of floating point additions anyway, if my glance on code was correct) it would seem you are creating an unnecessarily tight dependency chain from one sum to another, and this… might be a limiting factor for throughput. Or are compilers that (overly) clever they would rewrite vector addition intrinsics with -ffast-math?

I would consider trying out a small amount of strided vector sums whose intermediate results would fit in the ISA register set. Then just sum stride vectors like you sum vector entries (“buffer”) now.

As a side note: at least Skylake CPUs have reciprocally throughput of two multiply-accumulates per clock cycle, but the latency of one instruction is four cycles, which is much more pronounced than one you would see in an operation such as a sum of vector elements. This might completely dominate your results on small inputs, and this “four cycles per pair of floats” seems curiously close to that.

My analysis is based on code I wrote using intrinsics, without multiply-accumulates. I suspect that in the fast-math mode, the compiler might use vfmadd instructions.

In any case, my analysis was a “back of the envelope” thing without a serious look at the sources of contention. I just wanted to check that things ran roughly as fast as they should.

the compiler is allowed to replace multiply+add by a single multiply-add even without fast-math.

Can you get a compiler to produce a memory-bound code while remaining standard compliant? (No fast-math)

not unless you make the code do several sums in parallel or use integers

If one would truly want to maintain the simple sequential order of additions, one could pre-permute vectors to produce this order of summation even with parallel (vectorised) sums. I think it might be completely plausible for some tasks (such as simple neural networks). It’s a different question if the summation order actually matters much except when using some sort of exotic reduced precision floats.

Got any code?

This applies to all variants, but effect is quite clear already on plain dot function (really dumb loop unrolling, but this way I’m more convinced the compiler does what I want):

`__attribute__((noinline)) float dot(float *x1, float *x2, size_t len) {`

float sum0, sum1, sum2, sum3, sum4, sum5, sum6, sum7;

sum0 = sum1 = sum2 = sum3 = sum4 = sum5 = sum6 = sum7 = 0;

for (size_t i = 0; i < len;) {

sum0 += x1[i] * x2[i];

i++;

sum1 += x1[i] * x2[i];

i++;

sum2 += x1[i] * x2[i];

i++;

sum3 += x1[i] * x2[i];

i++;

sum4 += x1[i] * x2[i];

i++;

sum5 += x1[i] * x2[i];

i++;

sum6 += x1[i] * x2[i];

i++;

sum7 += x1[i] * x2[i];

i++;

}

return sum0 + sum1 + sum2 + sum3 + sum4 + sum5 + sum6 + sum7;

}

Of course, order of floating point additions makes the result pedantically different from original, but if you would be performing vector dot products with constant vectors, you might permute vectors to produce the preferred order of additions.

Heh. What I thought the compiler was about to do to my code is to break dependencies, but it actually autovectorized the loop, at least on my Mac. This doesn’t actually demonstrate what I attempted to do.

Your compiler should autovectorize my original loop too.

It doesn’t without -ffast-math, because that would violate semantics of floating point math on C.

The following comical contraption autovectorises and removes loop dependencies as far as I can understand on my compiler (Apple LLVM version 9.1.0 (clang-902.0.39.2)), even without -ffast-math. It does stumble a bit at least on my system on final summation, but that’s only a minor part of it all:

`__attribute__((noinline)) float dot(float *x1, float *x2, size_t len) {`

float sum00, sum01, sum02, sum03, sum04, sum05, sum06, sum07;

float sum10, sum11, sum12, sum13, sum14, sum15, sum16, sum17;

float sum20, sum21, sum22, sum23, sum24, sum25, sum26, sum27;

float sum30, sum31, sum32, sum33, sum34, sum35, sum36, sum37;

float sum40, sum41, sum42, sum43, sum44, sum45, sum46, sum47;

float sum50, sum51, sum52, sum53, sum54, sum55, sum56, sum57;

float sum60, sum61, sum62, sum63, sum64, sum65, sum66, sum67;

float sum70, sum71, sum72, sum73, sum74, sum75, sum76, sum77;

sum00 = sum01 = sum02 = sum03 = sum04 = sum05 = sum06 = sum07 = 0;

sum10 = sum11 = sum12 = sum13 = sum14 = sum15 = sum16 = sum17 = 0;

sum20 = sum21 = sum22 = sum23 = sum24 = sum25 = sum26 = sum27 = 0;

sum30 = sum31 = sum32 = sum33 = sum34 = sum35 = sum36 = sum37 = 0;

sum40 = sum41 = sum42 = sum43 = sum44 = sum45 = sum46 = sum47 = 0;

sum50 = sum51 = sum52 = sum53 = sum54 = sum55 = sum56 = sum57 = 0;

sum60 = sum61 = sum62 = sum63 = sum64 = sum65 = sum66 = sum67 = 0;

sum70 = sum71 = sum72 = sum73 = sum74 = sum75 = sum76 = sum77 = 0;

for (size_t i = 0; i < len;) {

sum00 += x1[i] * x2[i];

i++;

sum01 += x1[i] * x2[i];

i++;

sum02 += x1[i] * x2[i];

i++;

sum03 += x1[i] * x2[i];

i++;

sum04 += x1[i] * x2[i];

i++;

sum05 += x1[i] * x2[i];

i++;

sum06 += x1[i] * x2[i];

i++;

sum07 += x1[i] * x2[i];

i++;

sum10 += x1[i] * x2[i];

i++;

sum11 += x1[i] * x2[i];

i++;

sum12 += x1[i] * x2[i];

i++;

sum13 += x1[i] * x2[i];

i++;

sum14 += x1[i] * x2[i];

i++;

sum15 += x1[i] * x2[i];

i++;

sum16 += x1[i] * x2[i];

i++;

sum17 += x1[i] * x2[i];

i++;

sum20 += x1[i] * x2[i];

i++;

sum21 += x1[i] * x2[i];

i++;

sum22 += x1[i] * x2[i];

i++;

sum23 += x1[i] * x2[i];

i++;

sum24 += x1[i] * x2[i];

i++;

sum25 += x1[i] * x2[i];

i++;

sum26 += x1[i] * x2[i];

i++;

sum27 += x1[i] * x2[i];

i++;

sum30 += x1[i] * x2[i];

i++;

sum31 += x1[i] * x2[i];

i++;

sum32 += x1[i] * x2[i];

i++;

sum33 += x1[i] * x2[i];

i++;

sum34 += x1[i] * x2[i];

i++;

sum35 += x1[i] * x2[i];

i++;

sum36 += x1[i] * x2[i];

i++;

sum37 += x1[i] * x2[i];

i++;

sum40 += x1[i] * x2[i];

i++;

sum41 += x1[i] * x2[i];

i++;

sum42 += x1[i] * x2[i];

i++;

sum43 += x1[i] * x2[i];

i++;

sum44 += x1[i] * x2[i];

i++;

sum45 += x1[i] * x2[i];

i++;

sum46 += x1[i] * x2[i];

i++;

sum47 += x1[i] * x2[i];

i++;

sum50 += x1[i] * x2[i];

i++;

sum51 += x1[i] * x2[i];

i++;

sum52 += x1[i] * x2[i];

i++;

sum53 += x1[i] * x2[i];

i++;

sum54 += x1[i] * x2[i];

i++;

sum55 += x1[i] * x2[i];

i++;

sum56 += x1[i] * x2[i];

i++;

sum57 += x1[i] * x2[i];

i++;

sum60 += x1[i] * x2[i];

i++;

sum61 += x1[i] * x2[i];

i++;

sum62 += x1[i] * x2[i];

i++;

sum63 += x1[i] * x2[i];

i++;

sum64 += x1[i] * x2[i];

i++;

sum65 += x1[i] * x2[i];

i++;

sum66 += x1[i] * x2[i];

i++;

sum67 += x1[i] * x2[i];

i++;

sum70 += x1[i] * x2[i];

i++;

sum71 += x1[i] * x2[i];

i++;

sum72 += x1[i] * x2[i];

i++;

sum73 += x1[i] * x2[i];

i++;

sum74 += x1[i] * x2[i];

i++;

sum75 += x1[i] * x2[i];

i++;

sum76 += x1[i] * x2[i];

i++;

sum77 += x1[i] * x2[i];

i++;

}

`return`

sum00 + sum10 + sum20 + sum30 + sum40 + sum50 + sum60 + sum70 +

sum01 + sum11 + sum21 + sum31 + sum41 + sum51 + sum61 + sum71 +

sum02 + sum12 + sum22 + sum32 + sum42 + sum52 + sum62 + sum72 +

sum03 + sum13 + sum23 + sum33 + sum43 + sum53 + sum63 + sum73 +

sum04 + sum14 + sum24 + sum34 + sum44 + sum54 + sum64 + sum74 +

sum05 + sum15 + sum25 + sum35 + sum45 + sum55 + sum65 + sum75 +

sum06 + sum16 + sum26 + sum36 + sum46 + sum56 + sum66 + sum76 +

sum07 + sum17 + sum27 + sum37 + sum47 + sum57 + sum67 + sum77;

}

I didn’t note the processor you used for the test but my experience (40 years of trying to improve dot_product using Fortran) would suggest a similar conclusion.

Recently, for Intel i5 and i7 processors:

CPU frequency is not significant (but should be for small len?).

Memory transfer rates are significant (especially for large len).

L3 cache defines when AVX improvement declines as data must be in cache for AVX+ to work effectively.

Using !$OMP doesn’t help as there is not enough work for each thread to overcome the thread initialisation overhead.

I have been trying to understand how to code the DO (for) loop so the compiler will provide an optimised instruction set for claimed AVX+ support. The advertised performance can be very impressive. Then there is the issue of transferring this calculation into a real code with other memory demands, besides X1 and X2. For me, this remains elusive.

Strategies where the data is blocked to reduce memory <> cache transfers do help. It would be interesting to see test results for architecture with larger cache.

There is also the situation where for multiple CPU : shared memory, what other tasks are running.

This is a fascinating micro-experiment, thanks! I find trying to understand these things useful for improving my intuition about computer system performance.

The general trend that concerns me is how are we going to make “general software” faster now that single thread performance is not increasing? As your experiment shows, we can get tremendous improvements (3-7X!) if we use application-specific hardware, but the difficulty of writing the code increases substantially.

I think it is an interesting time to be involved in trying to improve software performance.

As your experiment shows, we can get tremendous improvements (3-7X!) if we use application-specific hardware, but the difficulty of writing the code increases substantially.Though that’s true, I would argue that a lot of energy goes toward making sure that most programmers can get the great performance without having to worry about it. The goal is that few of us should ever need to worry day-to-day about performance so that most programmers can be free to focus on other things.

There is division of labor involved. If you are doing a lot of dot products, you shouldn’t write them by hand yourself… find a library that does it for you.

PyPy blog recently featured a possibly unintentionally tragicomical comparison of levels of performance between script languages and fine-tuned implementations taking advantage of domain-specific instructions on modern CPUs:

Of course most people understand that implementing matrix multiplication in a scripting language is going to be inefficient, but I suspect many don’t quite comprehend how inefficient it can be.

Maybe that inline image didn’t come through then. Anyway, it’s included in the linked blog post.

I hacked your comment so that the figure comes up, at least for now.

Right. Part of the problem is people tend to use one tool and do everything with this one tool, with no context awareness.

I’m also personally concerned about “general application” performance, where the CPU time is not spent in a small number of primitives like say deep learning training or large matrix math for scientific simulations. If you look at the profile for busy servers at Internet services (my personal experience), they tend to be fairly flat, with CPU cycles getting spent all over the place. If we want to be able to process more requests, are we just going to need bigger data centers and more electricity if CPUs no longer get faster?

Certainly step one is probably to stop writing these applications in Python and Ruby, as the benchmark mentioned above demonstrates.

So my blog post is not representative of actual loads, I am aware of that.

For now, electricity bills are not a concern. Chips and electricity are still way, way cheaper than developers for most applications. The big whales (Facebook, Google…) may have core applications where throwing millions of engineer time at the problem for years makes sense… but most businesses do not.

Possibly this could change if we get huge AI applications… but we are not there (except maybe within the core of the whales).

My experience is that most businesses are concerned with quality-of-service issues. They don’t mind spinning twice as many cores… but they do mind having high latency. Throwing more machines at a problem does not make the latency go down magically.

As you well know, it is entirely possible to get great performance with Python

ifyou have the critical parts written in a faster language.Yes for long vectors this is well understood and can be measured using the STREAM benchmark:

https://www.cs.virginia.edu/stream/

For short vectors compiler autovectorization will do an ok job but GCC will not do reductions by default, you need to at least enable -fno-math-errno -fno-signed-zeros -fno-trapping-math -fassociative-math (all a subset of the much more aggressive -ffast-math); #pragma omp simd reduction(+:result) also forces the compiler to vectorize without those flags.

I did some testing to evaluate possibilities for our new Beluga system (for Compute Canada here in Montreal, coming soon — you should get an account too, it’s free for all Canadian academics :), see the table below.

The 6.7 number above looks memory bound, and resonates with the Broadwell number of 17.4 GB/s (2.6GHz * 6.7)

Broadwell = E5-2683 (16 cores/socket, 2 sockets)

Skylake = Gold 6148 (20 cores/socket, 2 sockets)

EPYC = EPYC 7601 (32 cores/socket, 2 sockets)

n Broadwell Skylake EPYC

1 17.4 10.3 26.2

2 34.8 20.3 52.2

4 65.8 40.2 104.4

6 92.4 59.5

8 109.8 78.3 207.6

10 118.3 96.3

12 120.8 113.4

14 120.4 130.8

16 120.3 146.3 271.7

20 119.3 164.5

24 118.7 175.1

28 117.1 179.8

32 117.0 181.3 279.0

36 —– 181.8

40 —– 183.2

64 —– —– 267.6

I think it’s pretty easy to reason that dot product code is always memory bound (when vectorisation tricks are allowed) as, for example on Skylake, reciprocal throughput of multiply-accumulates is two per clock cycle, and each such operation needs two memory reads, but the CPU can perform only two reads per clock cycle, even at the best case scenario when all data is available optimally on L1.

For matrix multiply the scenario is different: for non-small matrices, amount of reads per multiply-accumulate operation can approach 1, which would put memory and computation operations nicely in balance on this microarchitecture. But of course this applies only to operations which can be performed at full L1 access rate!

As with many performance related discussions, one should draw a distinction between

latencyandthroughput.I think those who say that computation is more likely to bottleneck an algorithm than memory access have a good argument on the throughput side, but not as much on the latency side. This is because memory bandwidth has at least largely kept up with advances in computational speed [2], but latency has not. In fact latency has barely budged in a decade or more. There’s some expression about it that I forget, something like “More bandwidth is merely expensive, but better latency often can’t be had at any price”.

Some back of the envelope calculations for bandwidth, CPU vs memory. Current single core bandwidths for recent Intel chips are in the 15 – 30 GB range. Let’s say 20 GB. For a 3 GHz chip, that means about 7 bytes of bandwidth per cycle. On the computation size, if you are using scalar 64-bit instructions, you are going to execute

at most4 per cycle, i.e., 32 bytes of output per cycle, at best. More realistic code with an IPC of 1 will be at 8 bytes per cycle of computation. So we can say the computation/bandwidth ratio is somewhere between 4.8 and 1.2: meaning that each bit of data is touched by more than 4.8 to 1.2 instructions, you are likely to be computation limited, not bandwidth limited.Certainly most non-trivial algorithms are doing to exceed the low end of that range: it is rare that an algorithm would only execute a single computation on each word of data. In general we could say that except for high IPC, simple code scalar algorithms are likely to be limited by computation in a bandwidth sense. Vectorized codes are another matter, since AVX2 (for example) multiplies this ratio by 4, so there are many interesting algorithms are that are still bandwidth limited (well-implemented dot products being among them) when using SIMD algorithms. On the flip side, many algorithms operate on 32-bit values or even bytes, dividing the ratios discussed by 2 or 8, making them

verylikely to be computation bound.Let’s do the same calculation for latency!

Most ALU operations on modern chips take 1 cycle, although multiplication and some other operations often take a handful of cycles (3 on modern Intel and AMD), and division is usually much slower at 20 or more cycles. Let’s take 1 cycle as typical. A miss to DRAM takes at best 50 nanoseconds or so on the most favorable hardware (which, perhaps unexpectedly, is “low-end” stuff like client chips) while it approaches or exceeds 100 nanoseconds on heavy-duty hardware (chips with server uncore, multiple sockets, registered DIMMs, etc). Converted into cycles at 3 GHz, that’s 150 to 300 cycles.

So the computation/memory ratio now is 150 to 300 for single cycle instructions, or 50 to 100 for 3-cycle instructions. In other worse, close to a couple of orders of magnitude larger than before. Almost any algorithm that is latency bound and includes misses to memory in the dependency chain is likely to be memory bound, unless it is doing a ton of extra work (say at least 100 instructions per retrieved value).

The bandwidth and latency computation/memory ratios have been diverging over time (indeed, back in the days before caches the reciprocal throughput and latency were often identical: one memory access per cycle available that same cycle was common). I expect them to diverge more over time, as computation and bandwidth continues to increase, but latency is likely to stagnate.

A note about multiple cores: the tests above are implicitly for a single core. The pictures changes quite a bit with multiple cores: computation scales almost[Note 1] perfectly with additional cores, but memory bandwidth is generally shared among all cores on a socket. Modern Intel CPUs can’t always max out the memory bandwidth with a single core, but usually a few will do it. In particular, a single core may have a bandwidth of 25 GB/s, but the maximum available bandwidth across all cores is only 50 GB/s, so with 10 cores running at once, you’ll only have 1/5th the pro rata bandwidth you had in the single core case. This effect is especially pronounced on very high core count chips: chips can have 20+ cores, but no real increase in memory bandwidth over any other quad channel parts (and even the lowliest Intel chips are dual channel, I think, so there’s only a 2x scaling in memory bandwidth from a $100 chip to a $10,000 one).

BTW, your test is actually kind of comparing a latency bound computation scenario to a bandwidth bound memory one: AVX2 chips like Skylake can sustain 2 full-width FMAs per cycle (2×8 = 16 total input pairs), so you should be seeing something like 0.0625 cycles per pair (or maybe half that if using separate M and A), rather than 0.55: you get the worse result because the algorithm is bottlenecked on the addition latency with the single accumulator. With a throughput of 2 per cycle but a latency of 4 cycles, you need 8 accumulators (8 parallel computation chains) to get full throughput. The memory accesses aren’t part of the chain though (in the dependency graph they are root nodes with no important parents), so they are only throughput limited. So the comparison ended up being between computational latency and memory throughput.

[Note 1] One reason computation resources may not scale perfectly with increasing core usage is variable (lower) turbo ratios when more cores are used, and thermal, power or current limits. These effects are usually in the 10s of percent, however, compared to pro rata bandwidth limits of 5 or 10

timeson high core count CPUs.[Note 2] Here is a good blog entry on the topic.

Thanks for the interesting article as always.

I’ve enjoyed reading lots of interesting discussions while I am not sure if I entirely agree with all the above. Moreover, I’m a bit surprised to see no one brought up a hardware prefetcher in the discussion.

Even when the vectors are big enough not to fit in any of caches, it doesn’t mean that the data is brought from memory especially if the test was repeated some number of times (50 in this example?)

I wonder what it would look like if the hardware prefetcher is turned off and if the above conclusions remain same or not.