Given two arrays, say (1,2,3,4) and (4,3,1,5), their scalar product is simply the sum of the products: 1 × 4 + 2 × 3 + 3 × 1 + 4 × 5. It is also known as the inner product or dot product and it is a routine operation in software graphics, database systems and machine learning. Many processors even have instructions to optimize the computation of the scalar product (e.g., the Multiply–accumulate operation on your iPhone, or the SSE instructions on your desktop).

How can you accelerate the scalar product without using assembly language? First, we should recognize that the main bottleneck might be the multiplications. On most processors, multiplying two numbers is several times more expensive than adding them. For example, 3 CPU cycles could be necessary for a multiplication as opposed to a single cycle for an addition.

Can I somehow reduce the number of required multiplications? To begin with, we can rewrite the scalar product

v1 × w1 + v2 × w2 + v3 × w3 + v4 × w4

as

(v1+w2)×(v2+w1)+(v3+w4)×(v4+w3)

- (v1 × v2 + w1 × w2 + v3 × v4 + w3 × w4).

Even though it looks like the second form has even more multiplications, you must realize that I can precompute values such as v1 × v2 + v3 × v4. For each one of my array, I can store the product of its successive terms. With this, I can replace the scalar product by (v1+w2)×(v2+w1)+(v3+w4)×(v4+w3) minus the sum of two precomputed quantities.

That is, I have reduced the number of multiplications by half! Can I do better? It turns out that you cannot. To compute scalar products, you need at least half as many multiplications as you have components.

Maybe a more interesting question is: is it faster? If I count one cycle per addition (or subtraction) and three cycles per multiplication, I get 15 cycles for the naive scalar product. My fast alternative uses up only 12 cycles. (Of course, the processor needs to do more than just add and multiply, so I am working off a simplified model.)

But how well does it fare in real life? I wrote a little test in Java.

- Regular scalar product: 460 ms
- Fast scalar product: 1,314 ms

Wow! Even though the fast version uses half the number of multiplications, it is at least twice as slow! Of course, your results may vary depending on the programming language, your coding skills and your processor. Nevertheless, I could have expected a small gain, and instead I get degraded performance. What is wrong with my mathematical analysis in terms of CPU cycles?

It is fundamentally flawed because modern processors are superscalar. It may take 3 cycles to complete a multiplication, but while the multiplication is being computed, other operations can be started. The scalar product is highly parallelizable: while one multiplication is computed, other multiplications can be computed (in parallel).

In some sense, when people contemplate the idea that smart software could automatically parallelize their programs, they should realize that, at a low level, their CPUs are already doing just that! The other take away message is that unless you have a lot of experience as a developer, you are unlikely to accurately predict the result of an optimization.

**Code:** Source code posted on my blog is available from a github repository.

The usual advise I’ve read is to use a linear algebra library, like LAPACK, for this type of operations. Do you know what type of “trick” LAPACK et. al use?

@Weinstein

I have no idea what LAPACK uses. For various reasons, I rarely use numerical libraries.

@John

Indeed. Sometimes, it is almost as if the computations themselves were free, but the bandwidth is very tight.

@cb

In any case, trying to count the clocks of arithmetic operations is not remotely the way to write fast code these days.I agree. This was the main point of my post.

This is not really how most modern CPU’s work.We can play semantic games, but that is not much fun. Effectively, while one multiplication is being computed, another multiplication begins so that even though 3 cycles are required to complete an integer multiplication, you can execute two multiplications in 4 cycles. I call this parallelism.

On modern Intel chips multiplication simply takes only 1 clock, no parallelism required, multiplication is just fast.The latency for multiplication on Intel processors is at least 3 cycles (for integers), and 1 cycle for additions. (Reference: http://www.intel.com/products/processor/manuals/, check “Intel® 64 and IA-32 Architectures Optimization Reference Manual”). Floating point multiplications are more expensive (at least 5 cycles). We could get technical and dig into SSE instructions and so on, but that wasn’t my intent. The throughput is often 1 cycle, of course.

Another factor is that in a number crunching application, most of the time is spent moving data around, not crunching numbers. Clever software rearranges calculations to do as much with data as it can while the data is in cache.

Once again we are reminded of Knuth:

“We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil”

“It is fundamentally flawed because modern processors are superscalar. It may take 3 cycles to complete a multiplication, but while the multiplication is being computed, other operations can be started. The scalar product is highly parallelizable: while one multiplication is computed, other multiplications can be computed (in parallel).”

This is not really how most modern CPU’s work. They almost always only have one multiplier unit, because multipliers are somewhat large on the die.

On some chips multiplication has long latency (7-9 clocks) but still has throughput of 1 mul per clock because of pipelining.

Pipelining is in some ways similar to parallelism but is not quite the same thing.

On modern Intel chips multiplication simply takes only 1 clock, no parallelism required, multiplication is just fast. (division is another story – that’s still slow)

In any case, trying to count the clocks of arithmetic operations is not remotely the way to write fast code these days. Writing fast code is something like :

1. use a better algorithm (use threads)

2. rearrange memory layout for better cache use

3. make data more compact

4. remove branches

5. remove data dependency chains

6. use SIMD

7. use CUDA or SPUs or what have you

8. remove divisions, float-to-ints very expensive ops like that

we just never worry about things like multiplies, they are a tiny fraction of a rounding error in overall speed.

I think a better justification for algorithms that focus on reducing multiplications (as so often show up in literature in algebraic complexity theory) is: what if your “numbers” aren’t just single-precision ints? For example, what if the entries in your vector are polynomials? I believe (though I have not tested) that multiplying polynomials is significantly more time-consuming in practice than adding them. I believe this is also true for “big ints” (multi-precision integers, as used e.g. in cryptography), but again I have not tested this.

The point of your post is still well-taken, I just wanted to add that there may be other situations that arise where the multiplication-minimizing algorithm actually does perform better.

@Josh: yes, multiplying two big-ints is more expensive than adding them, exactly like polynomials. In fact, the two problems are pretty similar – bigints are “more or less” polynomials in the variable x=10…