3 surprising facts about the computation of scalar products

The speed of many algorithms depends on how quickly you can multiply matrices or compute distances. In turn, these computations depend on the scalar product. Given two arrays such as (1,2) and (5,3), the scalar product is the sum of products 1 Ɨ 5 + 2 Ɨ 3. We have strong incentives to compute the scalar product as quickly as possible. Here are a few facts that I find surprising:
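
In C, a straightforward implementation is a short loop (a minimal sketch; the function name is mine, not taken from the benchmark code):

    #include <stddef.h>

    /* Scalar product: the sum of a[i] * b[i] over all components. */
    int scalar_product(const int *a, const int *b, size_t n) {
        int sum = 0;
        for (size_t i = 0; i < n; i++)
            sum += a[i] * b[i];
        return sum;
    }

With a = {1, 2} and b = {5, 3}, it returns 1 Ɨ 5 + 2 Ɨ 3 = 11.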

  • Recent processors (e.g., the Intel i7) can multiply and add 32-bit numbers in a single CPU cycle or less. For each component of the arrays, we need to execute one multiplication and one addition. The latency of a multiplication is at least 3 cycles, meaning that 3 cycles must elapse before the result is available. Similarly, an addition requires at least one cycle. Yet processors make aggressive use of pipelining: they execute several multiplications simultaneously, so they can produce one result every cycle.
  • If you work with 64-bit integers and use some recent GNU GCC compilers (e.g., 4.5), you should disable Streaming SIMD Extensions (SSE) for better speed. In theory, the SSE instructions are ideally suited to the computation of scalar products. Yet the throughput with 64-bit integers goes from 1.3 cycles per multiplication with SSE2 disabled to 3.4 cycles per multiplication with SSE2. There is an optimization bug in the otherwise excellent GNU GCC compiler. Update: According to the numbers provided by John Regehr, this problem also affects some Intel compilers.
  • When using SSE, 64-bit floating-point numbers may be faster than 32-bit floating-point numbers. Years ago I was told to avoid 64-bit floating-point numbers for performance reasons. That is not automatically good advice on all compilers, especially if you require standards compliance.

For my tests, I initially used the flags “-funroll-loops -O3” on a recent Intel i7-2600 with the GNU GCC compiler version 4.5. In each instance, I tested with and without manual loop-unrolling and I report only the best score (in cycles per multiplication). The C code is available.
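
Here is the kind of manual loop-unrolling I mean (a minimal sketch, not the exact benchmark code; the real version is in the repository). Two independent accumulators break the dependency chain on a single running sum, so the pipelined multiplier can keep starting a new product every cycle:

    #include <stddef.h>

    /* Scalar product with a 2-way unrolled loop and two
       independent accumulators. */
    int scalar_product_unrolled(const int *a, const int *b, size_t n) {
        int sum1 = 0, sum2 = 0;
        size_t i = 0;
        for (; i + 1 < n; i += 2) {
            sum1 += a[i] * b[i];
            sum2 += a[i + 1] * b[i + 1];
        }
        if (i < n)  /* leftover element when n is odd */
            sum1 += a[i] * b[i];
        return sum1 + sum2;
    }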

computation        with SSE   SSE2 disabled (-mno-sse2)   with AVX (-mavx)
32-bit integers    1.0        1.1                         0.5
64-bit integers    3.4        1.3                         3.4
float              10         10                          10
double             7.0        170                         7.0

Upgrading to GCC 4.6.2 and replacing -O3 by the new -Ofast flag changes the results quite a bit at the expense of reliability (-Ofast disregards standards compliance).

computation        with SSE   SSE2 disabled (-mno-sse2)   with AVX (-mavx)
32-bit integers    1.0        1.1                         0.5
64-bit integers    1.0        1.1                         1.0
float              0.7        0.9                         0.3
double             5.1        2.5                         5.0

Further reading: See my previous post on this topic, Fast computation of scalar products, and some lessons in optimization.

Code: Source code posted on my blog is available from a github repository.

18 thoughts on “3 surprising facts about the computation of scalar products”

  1. @Anonymous

    (1) Scalar products are not limited to linear algebra. Almost all software uses something like a scalar product. For example:

    sales = price * quantity

    is a scalar product. Also, it is often a bottleneck.

    Image processing is filled with scalar products. Computer graphics use scalar products all the time.

    (2) BLAS implementations almost certainly use something that looks like a scalar product underneath. You may not want to tinker with it, but it will still be affected by the features of your processor and of your compiler.

  2. It’s probably worth enabling the -mavx optimization option on a recent Core i7.
    I got a speed-up for 32-bit integers: from 0.991442 to 0.438392.

  3. BTW, for 64-bit integers with SSE2 enabled my best result is 1.073550 (manually unrolled loop).
    The rest is pretty similar.

    gcc version 4.6.1 (Ubuntu/Linaro 4.6.1-9ubuntu3)
    Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz

  4. Here’s what I get on an i7-920, which does not have AVX. “current-gcc” is from a few days ago, “icc” is Intel 12.1.0.

    $ current-gcc -funroll-loops -Ofast scalar.c ; ./a.out
    uint32 = 1.041528
    uint64 = 4.266846
    uint32 2×2 = 1.347658
    uint64 2×2 = 4.050428
    float = 0.786304
    float 2×2 = 0.785491
    double = 10.993460
    double 2×2 = 5.679850

    $ current-gcc -funroll-loops -Ofast -mno-sse2 scalar.c ; ./a.out
    uint32 = 2.007275
    uint64 = 2.011980
    uint32 2×2 = 2.006285
    uint64 2×2 = 2.013779
    float = 0.942085
    float 2×2 = 2.471436
    double = 279.546434
    double 2×2 = 142.930305

    $ icc -fast scalar.c ; ./a.out
    uint32 = 0.766303
    uint64 = 3.629643
    uint32 2×2 = 1.520190
    uint64 2×2 = 3.605145
    float = 0.512532
    float 2×2 = 0.827342
    double = 2.104804
    double 2×2 = 2.390149

  5. Even though my CPU has AVX, gcc uses only the 4-element registers (xmm), even with the -mavx option enabled.
    Pretty strange…
    I would expect to see the ymm registers being used.

  6. I played a bit with the code and it seems gcc is not good enough to optimize the scalar product with SIMD.
    “In theory, the SSE instructions are ideally suited to the computation of scalar products.”
    It’s true only for an SOA data layout where there is no data dependency between loop iterations.
    I separated the multiplication of the two arrays from the accumulation: 3.583192 (compare with 10). More than a 2Ɨ speed-up because of full utilization of the AVX registers.
    I’m surprised that gcc failed to optimize such a common problem.
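
    Roughly along these lines (a sketch, not my exact code; tmp is a scratch buffer of the same length as the input arrays):

        #include <stddef.h>

        /* Split the scalar product into two phases. The first loop
           has no loop-carried dependency, so the compiler is free to
           vectorize it with full-width registers; the second loop is
           a plain reduction over the scratch buffer. */
        float scalar_product_split(const float *a, const float *b,
                                   float *tmp, size_t n) {
            for (size_t i = 0; i < n; i++)  /* phase 1: multiply */
                tmp[i] = a[i] * b[i];
            float sum = 0.0f;
            for (size_t i = 0; i < n; i++)  /* phase 2: accumulate */
                sum += tmp[i];
            return sum;
        }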

  7. Oh…
    I didn’t see the latest update – you’re updating your post too quickly for me.
    It seems I need a GCC upgrade, because the -Ofast option doesn’t help with 4.6.1.
    Thanks.

  8. Another thing to ask is: does -ffast-math affect the results in a way that matters for this application? I believe -Ofast enables that flag, and perhaps others that may change observable behavior. I’m not sure whether icc’s -fast flag has similar consequences.

  9. “The speed of many algorithms depends on how quickly you can multiply matrices or compute distances. In turn, these computations depend on the scalar product. ”

    In high-performance computing, you normally do not compute matrix products directly through scalar products; you rely on blocking as much as you can. The reason is that loading data into the cache is expensive, so you want to perform as much computation as possible while the data is there, and for this blocking is more effective than successive scalar products.
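
    A rough sketch of what blocking looks like (a toy version, not tuned; I assume square n Ɨ n row-major matrices with n a multiple of the block size):

        #include <stddef.h>

        enum { BS = 64 };  /* block size, chosen so blocks fit in cache */

        /* Blocked product: C += A * B for n x n row-major matrices.
           Each block of A and B is reused many times while it sits in
           cache, instead of being reloaded for every scalar product. */
        void matmul_blocked(const double *A, const double *B,
                            double *C, size_t n) {
            for (size_t ii = 0; ii < n; ii += BS)
                for (size_t kk = 0; kk < n; kk += BS)
                    for (size_t jj = 0; jj < n; jj += BS)
                        for (size_t i = ii; i < ii + BS; i++)
                            for (size_t k = kk; k < kk + BS; k++) {
                                double aik = A[i * n + k];
                                for (size_t j = jj; j < jj + BS; j++)
                                    C[i * n + j] += aik * B[k * n + j];
                            }
        }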

    As for which algorithm is best, the general answer for floating-point numbers is “don’t do it yourself, rely on a highly optimized BLAS implementation”. 🙂

  10. Daniel, I don’t know the implications of -ffast-math. I’m basically an integer person…

    Re. BLAS, sure — but it’s still fun (for some of us at least) to understand the interactions between, and limitations of, our compilers and SIMD units.
