The speed of many algorithms depends on how quickly you can multiply matrices or compute distances. In turn, these computations depend on the scalar product. Given two arrays such as (1,2) and (5,3), the scalar product is the sum of products 1 Ć 5 + 2 Ć 3. We have strong incentives to compute the scalar product as quickly as possible. Here are a few facts that I find surprising:

**Recent processors (e.g., Intel i7) can multiply and add a 32-bit number in a single CPU cycle or less.**For each component in the arrays, we need to execute one multiplication and one addition. The latency of the multiplication is at least 3 cycles, meaning that we require 3 cycles to complete a multiplication. Similarly, additions require at least one cycle. Yet processors make aggressive use of pipe-lining: they execute several multiplications simultaneously so that they can produce one result every cycle.**If you work with 64-bit integers and use some recent GNU GCC compilers (e.g., 4.5), you should disable Streaming SIMD Extensions (SSE) for better speed.**In theory, the SSE instructions are ideally suited to the computation of scalar products. Yet the throughput with 64-bit integers goes from 1.3 cycle per multiplication with SSE2 disabled to 3.4 cycles per multiplication with SSE2. There is an optimization bug in the otherwise excellent GNU GCC compiler. Update: According to the numbers provided by John Regehr, this problem also affects some Intel compilers.**When using SSE, 64-bit floating point numbers may be faster than 32-bit floating numbers.**Years ago I was told to avoid 64-bit floating point numbers for performance reasons. It is not automatically good advice on all compilers especially if you require standards compliance.

For my tests, I initially used the flags “-funroll-loops -O3” on a recent Intel i7-2600 with the GNU GCC compiler version 4.5. In each instance, I have tested with and without manual loop-unrolling and I only report the best score (in cycles per multiplication). The C code is available.

computation | with SSE | SSE2 disabled (-mno-sse2) | with AVX (-mavx) |
---|---|---|---|

32-bit integers | 1.0 | 1.1 | 0.5 |

64-bit integers | 3.4 | 1.3 |
3.4 |

float | 10 |
10 |
10 |

double | 7.0 |
170 | 7.0 |

Upgrading to GCC 4.6.2 and replacing -O3 by the new -Ofast flag changes the results quite a bit at the expense of reliability (-Ofast disregards standards compliance).

computation | with SSE | SSE2 disabled (-mno-sse2) | with AVX (-mavx) |
---|---|---|---|

32-bit integers | 1.0 | 1.1 | 0.5 |

64-bit integers | 1.0 |
1.1 | 1.0 |

float | 0.7 | 0.9 | 0.3 |

double | 5.1 | 2.5 |
5.0 |

**Further reading**: See my previous post on this topic Fast computation of scalar products, and some lessons in optimization

**Code:** Source code posted on my blog is available from a github repository.

@John

Thanks!

Interesting. Thank you for sharing your results!

@Aleksey

Thanks. I confirm your good results, I will update the blog post.

@John

I’m trying to update my results the best I can.

Who knew that a technical blog post could be so time consuming?

@Aleksey

I know, I don’t usually update posts so fast. Sorry.

@John

Good question. I wonder why it would make a difference for integer arithmetic. Something to do with overflows maybe?

@Anonymous

(1) Scalar products are not limited to linear algebra. Almost all software uses something like a scalar product. Something like this:

sales = price * quantity

is a scalar product. Also, it is often a bottleneck.

Image processing is filled with scalar products. Computer graphics use scalar products all the time.

(2) BLAS implementations almost certainly use something that looks like a scalar product underneath. You may not want to tinker with it, but it will still be impacted by the features of your processors and of the compiler.

Wow, there are a lot of intricate details to know here that could be useful. Thanks for sharing!

Nice to see measurement-based blogging :).

When compiling any code of this sort, you should try the Intel compiler as well.

10,000 reps may be enough for your simple rdtsc to give reliable results, but I wouldn’t count on it. It’s easy to do better:

http://blog.regehr.org/archives/330

It’s probably worth to enable -mavx optimization option on recent core i7.

I’ve got speed up for 32-bit integers: from 0.991442 to 0.438392.

BTW, for 64-bit integer with SSE2 enabled I have best result – 1.073550 (manually unrolled loop).

The rest is pretty similar.

gcc version 4.6.1 (Ubuntu/Linaro 4.6.1-9ubuntu3)

Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz

Here’s what I get on an i7-920, which does not have AVX. “current-gcc” is from a few days ago, “icc” is Intel 12.1.0.

$ current-gcc -funroll-loops -Ofast scalar.c ; ./a.out

uint32 = 1.041528

uint64 = 4.266846

uint32 2×2 = 1.347658

uint64 2×2 = 4.050428

float = 0.786304

float 2×2 = 0.785491

double = 10.993460

double 2×2 = 5.679850

$ current-gcc -funroll-loops -Ofast -mno-sse2 scalar.c ; ./a.out

uint32 = 2.007275

uint64 = 2.011980

uint32 2×2 = 2.006285

uint64 2×2 = 2.013779

float = 0.942085

float 2×2 = 2.471436

double = 279.546434

double 2×2 = 142.930305

$ icc -fast scalar.c ; ./a.out

uint32 = 0.766303

uint64 = 3.629643

uint32 2×2 = 1.520190

uint64 2×2 = 3.605145

float = 0.512532

float 2×2 = 0.827342

double = 2.104804

double 2×2 = 2.390149

Even though my CPU has AVX, gcc utilizes only 4-element registers (xmm) even with -mavx option enabled.

Pretty strange…

I would expect to see ymm registers utilization.

I played a bit with the code and it seems gcc is not good enough to optimize scalar product with SIMD.

“In theory, the SSE instructions are ideally suited to the computation of scalar products.”

It’s true only for SOA data layout and there is no data dependency between loop iterations.

I separated multiplication of two arrays and accumulation: 3.583192 (compare with 10). More then 2x speed up because of full utilization of AVX registers.

I’m surprised that gcc failed to optimize so common problem.

Oh…

Did see the latests update – you updating your post too quick for me.

I seem to need GCC upgrade because -Ofast option doesn’t help for 4.6.1.

Thanks.

Another thing to ask is, does -ffast-math affect results in a way that matters for this application? I believe -Ofast enables that flag, and perhaps others that may have observable behavior. I’m not sure if icc’s -fast flag has similar consequences.

“The speed of many algorithms depends on how quickly you can multiply matrices or compute distances. In turn, these computations depend on the scalar product. ”

In high performance computing, normally you do not compute matrix products directly through scalar products, but rely on blocking as much as you can. The reason is that loading data to the cache is expensive, so you’d rather perform as much computation as you can while they’re in, and for this blocking is more effective than successive scalar products.

As for which is the best algorithm, the general answer for floating point numbers is “don’t do it yourself, rely a highly optimized BLAS implementation”. š

Daniel, I don’t know the implications of -ffast-math. I’m basically an integer person…

Re. BLAS, sure — but it’s still fun (for some of us at least) to understand the interactions between, and limitations of, our compilers and SIMD units.