Earlier, I asked whether integer addition was faster than bitwise exclusive or. My tests showed no difference, and nobody contradicted me.

However, everyone knows that multiplication is slower than addition? Right? In cryptography, there are many papers on how to trade multiplications for additions, to speed up software.

So? Can you predict which piece of code runs faster?

**scalar product (N multiplications):**

for(int k =0; k < N ; ++k)

answer += vector1[k] * vector2[k];

**scalar product two-by-two (N multiplications):**

` for(int k =0; k < N ; k+=2)`

answer += vector1[k] * vector2[k]

+vector1[k+1] * vector2[k+1];

**non-standard scalar product (N/2 multiplications):**

for(int k =0; k < N ; k+=2)

answer += ( vector1[k] + vector2[k] )

* ( vector1[k+1] + vector2[k+1] );

**just additions (no multiplication):**

for(int k =0; k < N ; ++k)

answer += vector1[k] + vector2[k];

**Answer:** Merely reducing the number of multiplications has no benefit, in these tests. Hence, simple computational cost models (such as counting the number of multiplications) may not hold on modern superscalar processors.

My results using GNU GCC 4.2.1 on both a desktop and a laptop:

algorithm | Intel Core i7 | Intel Core 2 Duo |
---|---|---|

scalar product | 0.30 | 0.39 |

scalar product (2×2) | 0.25 | 0.39 |

fewer multiplications | 0.25 | 0.39 |

just additions | 0.16 | 0.23 |

Times are in seconds. The source code is available without pointer arithmetics. The same test with pointer arithmetics gives faster results, but the same conclusion. I tried a similar experiment in Java. It confirms my result.

**Code:** Source code posted on my blog is available from a github repository.

Using pointer arithmetics does speed things up, but it does not change the conclusion:

http://pastebin.com/d0kgLPwA

@Thomas In this case, I multiply 64-bit integers.

I’m not saying there is no difference at all between multiplications and additions. However, going from two to one multiplications per unit of data might not do you any good.

Basically, I’m arguing that it is silly to count the number of multiplications and conclude that whenever you have fewer multiplications, your code is faster. It does not seem to stand true on recent microprocessors.

Unless I am mistaken, the [] operator on an array in C/C++ is a multiplication itself (pointer + (index * sizeof element)). So removing the one multiplication out of the lot might not make a true impact.

The gcc optimizer might also mess with the loop and optimize some things out.

You can remove those [] operators by walking your array with pointer arithmetic (++). Maybe your results would be different.

Multiplication of what? I suppose you’re talking about integers or maybe floats? In defense of the cryptography papers, they probably had (very) large integers in mind, for which I’m pretty sure we don’t know how to multiply as fast as add.

Still, good to know there’s no time difference for machine words; bust the myth.

Answering this is going to be a bit tricky.

I believe the on-chip hardware for doing one-cycle multiply landed in mass-market CPUs quite a long ways back (much more than ten years). So if you are using a current generation mass-market desktop CPU, likely all you will see is one-cycle execution.

However…

The one-cycle multiply requires a much larger chunk of chip space than does (say) the more common [add, substract, and, or, xor, shift] operations. The device budget on desktop chips is huge, so the space needed for multiply does not matter.

When the device budget does matter, chip designers make trade-offs.

On some (past?) chips a multiply operation would put a bubble in the pipeline. Where a chip might routinely work one two or three instructions at once, with multiply in the pipeline it might only work on one. A CPU often has more than one unit capable simpler operations, but only one multiplier unit.

If you wrote a micro-benchmark (timing runs of a single instruction, written in assembly), you would find the multiplies take only a single cycle, but everything else runs a bit slower.

As to what exact trade-offs are used by current chip designers, I have no idea.

Chips aimed at low-cost and low-power are more likely to make this sort of trade-off. So you may not see any (or much) difference on a desktop CPU, but the CPU in a cell-phone (or iPad?) might yield more interesting (and different) results. You might also see a difference in designs like Sun’s “Niagara” chip – where the aim was many CPUs at the most efficient MIPS/power ratio.

Incidental notes…

Array index references do not necessarily invoke multiply. If the elements in the array are a power-of-two in size, a simple left-shift will suffice. Some CPUs embed the indexing operation into their addressing modes, and so do not even require a separate instruction. (I used to know which, but lost interest in this level much more than a decade back.)

Chip designers spend a lot of time analyzing “typical” instruction-stream workloads, so to choose the optimal numbers of integer-add, integer-multiply, and floating-point units within the chip. Conversely, you can make a pretty good guess at the intended market segment for a new CPU chip from the relative number of integer and floating point units within each CPU. Gamers, CAD workers, and scientific workloads require more floating point. Web servers have little need for floating point.

I’m not sure I agree with your conclusion, Daniel, at least if your conclusion is that we shouldn’t care about multiplications. It may be true that “in cryptography, there are many papers on how to trade multiplications for additions, to speed up software.” But the higher level point is that, in a tight inner loop that is responsible for most of the run-time of an application, it can be important to squeeze out every last bit of performance in that inner loop. Whether it is done by removing multiplications and whether that still works on modern hardware is irrelevant. The post is to maximize the performance on machines it will run on. Usually, but not always, that means minimizing instructions and maximizing cache hits, but the important thing is to test the changes on actual hardware.

On a related note, did you see “You’re Doing It Wrong” in the July 2010 CACM? Main point of that article is that very little else matters if you are ever hitting disk since they cost the equivalent of millions of instructions. A good point there, and more evidence that what matters for optimization is only what matters, that is, what actually is proven to matter on the hardware on which the code will be running.

@Linden

I’m confused about your disagreement. I totally agree with Kamp’s paper and with what you are saying : “what matters for optimization is only what matters, that is, what actually is proven to matter on the hardware on which the code will be running”.

That is why I have this humble quest here on this blog to get people to think about all these truisms we pick up, like “multiplication is slower than addition.”

Hi, Daniel. Sorry for my confusion, maybe it is a matter of emphasis. I completely agree that people should not focus on multiplication being slower than addition, but wanted to say what the focus should be on, which is not that we shouldn’t worry about multiplication in inner loops, but that we should focus on testing what makes a real difference in performance (and that what often makes the most difference in practice is cache misses or, much worse, hitting disk).

I agree with your conclusion. In my thesis, we were looking at designing efficient byte-aligned coding algorithms, and the old tricks of strength-reduced arithmetic and unwinding showed very little performance improvement. I’m pretty sure I discuss this somewhere in the final thesis.

You can now take some of the carefully crafted arithmetic coders of the late 90s which did a lot of work to avoid multiplication and division, simplify them, and get nearly identical performance for the two variants on new hardware. The same definitely wasn’t true 10 years ago.

Of course, this may not be true on low power embedded chips. These processors have different design goals, and are clearly the real growth area over the next few years. I haven’t done much work with these processors, but would be interested to see if this is a universal or localised phenomenon.

I find it interesting that the Intel Core i7 presents a different timing profile. The Intel Nehalem processor family, in which the Core i7 is a member, has a feature called Turbo Boost which automatically boosts performance on demand based on workload. If not disabled, this feature does appear to invalidate results from benchmarks run in sequence, since later benchmarks will run at higher clock frequencies. Unfortunately, I do not have a Nehalem processor on hand to test this, so I’m curious to know whether Turbo Boost was enabled when you ran the benchmarks.

With this aren’t you mistakenly indulging in premature optimization?

I’m arguing that it is silly to count the number of multiplications and conclude that whenever you have fewer multiplications, your code is faster.This is generally not true, because on modern architectures cache miss ratio (both L1/L2 … and disk cache) may cost much more than adding a few extra operations.

The well known example from searching is a trie (string prefix tree). Even though it performs approximately the same number of operations as a hash, it is often 5-10 times slower.

I agree with your conclusion. In my thesis, we were looking at designing efficient byte-aligned coding algorithms, and the old tricks of strength-reduced arithmetic and unwinding showed very little performance improvement.JSC, one should be careful to say that unwinding loops does not help on modern architectures. Because it does for many reasons: first it indeed decreases the number of operations. Second, unwinded loops work better with pipelines.

Next, byte-aligned codes do work considerably (2-3 times) faster in many cases.

However, I agree with your opinion on strength-reduced arithmetic. It may not have sense any more in many cases. For instance, because I/O operations may take much longer time.

On my CPU (Phenom II) a 32 bit MUL can retire every clock cycle which seems pretty brisk until you consider that 3 Adds can retire per clock cycle.