If you must multiply matrices, you should use dedicated libraries. However, we sometimes need to roll our own code. In C++, you can quickly write your own Matrix template:

```cpp
#include <cstddef>
#include <memory>

template <typename T> struct Matrix {
  Matrix(size_t rows, size_t cols)
      : data(new T[rows * cols]()), // () value-initializes: entries start at zero
        rows(rows), cols(cols) {}
  T &operator()(size_t i, size_t j) { return data[i * cols + j]; }
  const T &operator()(size_t i, size_t j) const { return data[i * cols + j]; }
  std::unique_ptr<T[]> data;
  size_t rows;
  size_t cols;
};
```

How do you implement matrix multiplication? A matrix multiplication is a sequence of three nested loops. If you do not want to get fancy, you therefore have six possibilities, one per loop ordering:

```cpp
template <typename T>
void multiply_ikj(const Matrix<T> &a, const Matrix<T> &b, Matrix<T> &c) {
  for (size_t i = 0; i < a.rows; i++)
    for (size_t k = 0; k < a.cols; k++)
      for (size_t j = 0; j < b.cols; j++)
        c(i, j) += a(i, k) * b(k, j);
}

template <typename T>
void multiply_ijk(const Matrix<T> &a, const Matrix<T> &b, Matrix<T> &c) {
  for (size_t i = 0; i < a.rows; i++)
    for (size_t j = 0; j < b.cols; j++)
      for (size_t k = 0; k < a.cols; k++)
        c(i, j) += a(i, k) * b(k, j);
}

template <typename T>
void multiply_kij(const Matrix<T> &a, const Matrix<T> &b, Matrix<T> &c) {
  for (size_t k = 0; k < a.cols; k++)
    for (size_t i = 0; i < a.rows; i++)
      for (size_t j = 0; j < b.cols; j++)
        c(i, j) += a(i, k) * b(k, j);
}

template <typename T>
void multiply_kji(const Matrix<T> &a, const Matrix<T> &b, Matrix<T> &c) {
  for (size_t k = 0; k < a.cols; k++)
    for (size_t j = 0; j < b.cols; j++)
      for (size_t i = 0; i < a.rows; i++)
        c(i, j) += a(i, k) * b(k, j);
}

template <typename T>
void multiply_jki(const Matrix<T> &a, const Matrix<T> &b, Matrix<T> &c) {
  for (size_t j = 0; j < b.cols; j++)
    for (size_t k = 0; k < a.cols; k++)
      for (size_t i = 0; i < a.rows; i++)
        c(i, j) += a(i, k) * b(k, j);
}

template <typename T>
void multiply_jik(const Matrix<T> &a, const Matrix<T> &b, Matrix<T> &c) {
  for (size_t j = 0; j < b.cols; j++)
    for (size_t i = 0; i < a.rows; i++)
      for (size_t k = 0; k < a.cols; k++)
        c(i, j) += a(i, k) * b(k, j);
}
```

If you use an optimizing compiler and you tell it to compile specifically for your processor, you should get some fast code, at least in some instances. Which order is best?

The exact result depends on your data type (double, float, int), on the size of the matrices, on your compiler, and on your hardware. I wrote a benchmark using 100 by 100 matrices containing double values, compiled with GCC 12 (at full optimization, -O3) on an Intel Ice Lake processor. I tell the compiler to optimize for the exact processor I have, so I expect it to use advanced AVX-512 instructions when possible.

The net result in my experiment is that the best ordering is ikj. The worst ordering is jik.

If you were to compute the matrix multiplication manually and naively, you would need to do 100 times 100 times 100 multiplications, so 1 million multiplications and 1 million additions. Interestingly, the best ordering (ikj) uses roughly a million instructions to load the data, do the multiplications and additions, and store the results.

| order | speed | frequency | instructions per cycle | instructions/mult. |
|-------|-------|-----------|------------------------|--------------------|
| ikj | 7240 mult/s | 2 GHz | 3.3 | 916k |
| ijk | 3190 mult/s | 2 GHz | 2.4 | 1526k |
| kij | 3160 mult/s | 2 GHz | 2.5 | 1561k |
| kji | 3090 mult/s | 2 GHz | 2.4 | 1519k |
| jki | 3120 mult/s | 3.2 GHz | 3.5 | 3526k |
| jik | 1280 mult/s | 3.2 GHz | 1.7 | 4331k |

The resulting assembly shows that for `ikj`, the compiler generates the fused multiply-add instruction `vfmadd213pd`. There are fancier routines that the compiler could use, so the result is likely far from optimal.

**Further work**. Justine Tunney suggests manually unrolling the outer loops. One might also use OpenMP to get good parallelism (e.g., `#pragma omp parallel for collapse(2) if (m*n*k > 300000)`).
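As a sketch of the OpenMP idea (my own adaptation: I collapse over the i and j loops rather than i and k, since parallelizing over k would race on the accumulator; the 300000 threshold is the one from the pragma above; compile with `-fopenmp`, without which the pragma is simply ignored):

```cpp
#include <cassert>
#include <cstddef>

// Raw row-major arrays: a is m x k, b is k x n, c is m x n.
// collapse(2) over i and j gives each thread exclusive ownership of a
// set of c entries, so no synchronization is needed. Each c entry is
// written once, so c need not be zeroed beforehand.
void multiply_omp(const double *a, const double *b, double *c,
                  size_t m, size_t n, size_t k) {
#pragma omp parallel for collapse(2) if (m * n * k > 300000)
  for (size_t i = 0; i < m; i++)
    for (size_t j = 0; j < n; j++) {
      double sum = 0;
      for (size_t kk = 0; kk < k; kk++)
        sum += a[i * k + kk] * b[kk * n + j];
      c[i * n + j] = sum;
    }
}
```

Note that this uses the ijk ordering inside each thread; keeping ikj while parallelizing would mean splitting only the outer i loop.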

Daniel Lemire, "Rolling your own fast matrix multiplication: loop order and vectorization," in *Daniel Lemire's blog*, June 13, 2024.

The reason that multiply_ikj and multiply_ijk take the same time is that they differ only in name. A copy-paste error. If you change multiply_ijk to do what its name suggests, it’s slower.

Blog post updated. Thanks.

I haven’t looked too closely, but I expect that the ordering you found to be fastest is likely the same as what Goto and Van de Geijn found: https://www.cs.utexas.edu/~flame/pubs/GotoTOMS_revision.pdf

Also, this page describes a number of fairly straightforward optimizations that can have a substantial benefit: https://en.algorithmica.org/hpc/algorithms/matmul

Disappointing that we’re still doing O(n^3) matrix multiplication. Is it not possible to add at least Strassen’s algorithm for an O(n^2.8) runtime?

Johan: pull request invited. If you propose an implementation of Strassen’s algorithm for my scenario, I will happily benchmark it.

I thought you would only see the benefit of Strassen-style algorithms for much larger matrices: their time complexity is better, but the actual cost of one iteration is higher?
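For reference, the recursion under discussion can be sketched as follows; this is my own illustration, limited to square power-of-two sizes, untuned, and with a fallback to the classic triple loop below a cutoff (in practice the crossover where the seven-product recursion wins is typically far above 100 by 100):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Mat = std::vector<double>; // row-major square matrix

static Mat add(const Mat &a, const Mat &b) {
  Mat r(a.size());
  for (size_t i = 0; i < a.size(); i++) r[i] = a[i] + b[i];
  return r;
}
static Mat sub(const Mat &a, const Mat &b) {
  Mat r(a.size());
  for (size_t i = 0; i < a.size(); i++) r[i] = a[i] - b[i];
  return r;
}

// Strassen multiplication of two n x n matrices, n a power of two.
Mat strassen(const Mat &a, const Mat &b, size_t n) {
  if (n <= 32) { // cutoff: plain triple-loop multiplication
    Mat c(n * n, 0.0);
    for (size_t i = 0; i < n; i++)
      for (size_t k = 0; k < n; k++)
        for (size_t j = 0; j < n; j++)
          c[i * n + j] += a[i * n + k] * b[k * n + j];
    return c;
  }
  size_t h = n / 2;
  auto block = [&](const Mat &m, size_t bi, size_t bj) { // copy out an h x h block
    Mat r(h * h);
    for (size_t i = 0; i < h; i++)
      for (size_t j = 0; j < h; j++)
        r[i * h + j] = m[(bi * h + i) * n + (bj * h + j)];
    return r;
  };
  Mat a11 = block(a, 0, 0), a12 = block(a, 0, 1), a21 = block(a, 1, 0), a22 = block(a, 1, 1);
  Mat b11 = block(b, 0, 0), b12 = block(b, 0, 1), b21 = block(b, 1, 0), b22 = block(b, 1, 1);
  // The seven Strassen products (instead of eight recursive multiplications):
  Mat m1 = strassen(add(a11, a22), add(b11, b22), h);
  Mat m2 = strassen(add(a21, a22), b11, h);
  Mat m3 = strassen(a11, sub(b12, b22), h);
  Mat m4 = strassen(a22, sub(b21, b11), h);
  Mat m5 = strassen(add(a11, a12), b22, h);
  Mat m6 = strassen(sub(a21, a11), add(b11, b12), h);
  Mat m7 = strassen(sub(a12, a22), add(b21, b22), h);
  Mat c11 = add(sub(add(m1, m4), m5), m7);
  Mat c12 = add(m3, m5);
  Mat c21 = add(m2, m4);
  Mat c22 = add(sub(add(m1, m3), m2), m6);
  Mat c(n * n);
  for (size_t i = 0; i < h; i++)
    for (size_t j = 0; j < h; j++) {
      c[i * n + j] = c11[i * h + j];
      c[i * n + j + h] = c12[i * h + j];
      c[(i + h) * n + j] = c21[i * h + j];
      c[(i + h) * n + j + h] = c22[i * h + j];
    }
  return c;
}
```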

Actually GCC does not use AVX-512 instructions for `multiply_ikj` (see https://godbolt.org/z/bMzaYx1cj). However, if we change the signature and mark all matrix arguments as `__restrict__`, it uses them (see https://godbolt.org/z/xvKKTfx1z). I guess the resulting code should be significantly faster; however, it encodes additional assumptions.
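A sketch of the kind of signature change the comment describes, assuming raw row-major arrays (`__restrict__` is a GCC/Clang extension promising the compiler that the pointers never alias, which is what unlocks the wider vectorization; passing overlapping arrays would then be undefined behavior):

```cpp
#include <cassert>
#include <cstddef>

// a is rows x inner, b is inner x cols, c is rows x cols (row-major),
// with c zeroed on entry. The __restrict__ qualifiers let the compiler
// assume a, b, and c do not overlap, enabling more aggressive
// auto-vectorization of the inner j loop.
void multiply_ikj_restrict(const double *__restrict__ a,
                           const double *__restrict__ b,
                           double *__restrict__ c,
                           size_t rows, size_t inner, size_t cols) {
  for (size_t i = 0; i < rows; i++)
    for (size_t k = 0; k < inner; k++)
      for (size_t j = 0; j < cols; j++)
        c[i * cols + j] += a[i * inner + k] * b[k * cols + j];
}
```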

Related: “Beating OpenBLAS’ Matrix Multiplication in 150 Lines of C Code”. From what I can see, the kernel there uses a different ordering from the compiler-vectorized code linked in Kevin M.’s comment above, so I’m not sure how it ends up performance-wise. (Most of the tricks described in that article after the kernel are not really relevant to matrices as small as 100×100, as far as I understand.)


Oops, link in angle brackets was stripped by the HTML sanitizer. Here it is: https://salykova.github.io/matmul-cpu