# Rolling your own fast matrix multiplication: loop order and vectorization

If you must multiply matrices, you should use dedicated libraries. However, we sometimes need to roll our own code. In C++, you can quickly write your own Matrix template:

```cpp
#include <cstddef>
#include <memory>

template <typename T>
struct Matrix {
  // The trailing parentheses value-initialize the buffer to zero, so the
  // multiplication routines below can safely accumulate with +=.
  Matrix(size_t rows, size_t cols)
      : data(new T[rows * cols]()), rows(rows), cols(cols) {}

  T &operator()(size_t i, size_t j) { return data.get()[i * cols + j]; }

  const T &operator()(size_t i, size_t j) const {
    return data.get()[i * cols + j];
  }

  std::unique_ptr<T[]> data;
  size_t rows;
  size_t cols;
};
```

How do you implement a matrix multiplication? A matrix multiplication is three nested loops. If you do not want to get fancy, there are therefore six possible loop orders:

```cpp
template <typename T>
void multiply_ikj(const Matrix<T> &a, const Matrix<T> &b, Matrix<T> &c) {
  for (size_t i = 0; i < a.rows; i++) {
    for (size_t k = 0; k < a.cols; k++) {
      for (size_t j = 0; j < b.cols; j++) {
        c(i, j) += a(i, k) * b(k, j);
      }
    }
  }
}

template <typename T>
void multiply_ijk(const Matrix<T> &a, const Matrix<T> &b, Matrix<T> &c) {
  for (size_t i = 0; i < a.rows; i++) {
    for (size_t j = 0; j < b.cols; j++) {
      for (size_t k = 0; k < a.cols; k++) {
        c(i, j) += a(i, k) * b(k, j);
      }
    }
  }
}

template <typename T>
void multiply_kij(const Matrix<T> &a, const Matrix<T> &b, Matrix<T> &c) {
  for (size_t k = 0; k < a.cols; k++) {
    for (size_t i = 0; i < a.rows; i++) {
      for (size_t j = 0; j < b.cols; j++) {
        c(i, j) += a(i, k) * b(k, j);
      }
    }
  }
}

template <typename T>
void multiply_kji(const Matrix<T> &a, const Matrix<T> &b, Matrix<T> &c) {
  for (size_t k = 0; k < a.cols; k++) {
    for (size_t j = 0; j < b.cols; j++) {
      for (size_t i = 0; i < a.rows; i++) {
        c(i, j) += a(i, k) * b(k, j);
      }
    }
  }
}

template <typename T>
void multiply_jki(const Matrix<T> &a, const Matrix<T> &b, Matrix<T> &c) {
  for (size_t j = 0; j < b.cols; j++) {
    for (size_t k = 0; k < a.cols; k++) {
      for (size_t i = 0; i < a.rows; i++) {
        c(i, j) += a(i, k) * b(k, j);
      }
    }
  }
}

template <typename T>
void multiply_jik(const Matrix<T> &a, const Matrix<T> &b, Matrix<T> &c) {
  for (size_t j = 0; j < b.cols; j++) {
    for (size_t i = 0; i < a.rows; i++) {
      for (size_t k = 0; k < a.cols; k++) {
        c(i, j) += a(i, k) * b(k, j);
      }
    }
  }
}
```

If you use an optimizing compiler and tell it to compile specifically for your processor, you should get fast code, at least in some instances. Which order is best?

The exact result depends on your data type (double, float, int), on the size of the matrices, on your compiler and on your hardware. I wrote a benchmark using 100 by 100 matrices containing double values. I use GCC 12 (with full optimization, -O3) and an Intel Ice Lake processor. I tell the compiler to optimize for the exact processor I have, so I expect it to use advanced AVX-512 instructions when possible.

The net result in my experiment is that the best ordering is ikj. The worst ordering is jik.
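The ranking is consistent with the row-major layout: in the ikj order, the innermost loop holds a(i, k) constant while scanning a row of b and a row of c left to right. Written out with raw pointers (a hypothetical helper, not the code from the benchmark), the inner loop is just a SAXPY-style update that the compiler can vectorize easily:

```cpp
#include <cstddef>

// The ikj inner loop with raw pointers: a(i, k) is a loop constant, and
// both b's row and c's row are traversed with unit stride, so the update
// is a SAXPY-style operation over contiguous memory.
void saxpy_row(double a_ik, const double *b_row, double *c_row, size_t n) {
  for (size_t j = 0; j < n; j++) {
    c_row[j] += a_ik * b_row[j];  // contiguous loads and stores
  }
}
```

By contrast, orders that put i innermost (jki, kji) stride through memory by a full row length on every iteration.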

If you were to compute the matrix multiplication naively by hand, you would need 100 times 100 times 100 multiplications, that is, 1 million multiplications and 1 million additions. Interestingly, the best ordering (ikj) uses roughly a million instructions to load the data, do the multiplications and additions, and store the results.

| order | speed | frequency | instructions per cycle | instructions/mult. |
|-------|-------------|---------|------|-------|
| ikj | 7240 mult/s | 2 GHz | 3.3 | 916k |
| ijk | 3190 mult/s | 2 GHz | 2.4 | 1526k |
| kij | 3160 mult/s | 2 GHz | 2.5 | 1561k |
| kji | 3090 mult/s | 2 GHz | 2.4 | 1519k |
| jki | 3120 mult/s | 3.2 GHz | 3.5 | 3526k |
| jik | 1280 mult/s | 3.2 GHz | 1.7 | 4331k |

The resulting assembly shows that for ikj, the compiler generates the fused multiply-add instruction vfmadd213pd. There are fancier routines the compiler could use, so the result is likely far from optimal.
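For context, vfmadd213pd is a packed fused multiply-add over doubles: it computes a * b + c on a whole vector in a single instruction. The scalar analogue in standard C++ is std::fma, sketched here as a minimal illustration:

```cpp
#include <cmath>

// std::fma(a, b, c) computes a * b + c with a single rounding: the scalar
// analogue of vfmadd213pd, which applies the same fused operation to a
// vector of doubles at once.
double fused(double a, double b, double c) { return std::fma(a, b, c); }
```

One fused instruction replaces a separate multiply and add, which helps keep the instruction count per multiplication low.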

Further work. Justine Tunney suggests manually unrolling the outer loops. One might also use OpenMP to get good parallelism (e.g., #pragma omp parallel for collapse(2) if (m*n*k > 300000)).
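A rough sketch of the OpenMP suggestion (my hypothetical code, using raw row-major pointers rather than the Matrix template; compile with -fopenmp). Note that with the ikj order, collapsing the i and k loops as in the quoted pragma would let two threads update the same element of c concurrently, so this sketch parallelizes only the i loop:

```cpp
#include <cstddef>

// Hypothetical OpenMP variant of the ikj multiplication over row-major
// buffers: a is m-by-k, b is k-by-n, c is m-by-n and zero-initialized.
// Rows of c are independent, so the outer i loop splits cleanly across
// threads. The `if` clause keeps small problems sequential (300000 is the
// arbitrary threshold from the suggestion above).
void multiply_ikj_parallel(const double *a, const double *b, double *c,
                           size_t m, size_t k, size_t n) {
#pragma omp parallel for if (m * n * k > 300000)
  for (size_t i = 0; i < m; i++) {
    for (size_t kk = 0; kk < k; kk++) {
      for (size_t j = 0; j < n; j++) {
        c[i * n + j] += a[i * k + kk] * b[kk * n + j];
      }
    }
  }
}
```

With the ijk order, collapse(2) over i and j would be safe, since each (i, j) pair owns a distinct element of c.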

Daniel Lemire, "Rolling your own fast matrix multiplication: loop order and vectorization," in Daniel Lemire's blog, June 13, 2024.

### Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

## 10 thoughts on “Rolling your own fast matrix multiplication: loop order and vectorization”

1. The reason that multiply_ikj and multiply_ijk take the same time is that they differ only in name. A copy-paste error. If you change multiply_ijk to do what its name suggests, it’s slower.

2. Johan Bontes says:

Disappointing that we’re still doing O(n^3) matrix multiplication. Is it not possible to add at least Strassen’s algorithm for an O(n^2.8) runtime?

1. Johan: pull request invited. If you propose an implementation of Strassen’s algorithm for my scenario, I will happily benchmark it.

2. Danny Prairie says:

I thought you would only see the benefit of Strassen’s et al. algorithms for much larger matrices, as their time complexity is better but the actual cost of one iteration is higher?

3. Kevin M. says:

Actually gcc does not use AVX-512 instructions for “multiply_ikj” (see https://godbolt.org/z/bMzaYx1cj). However, if we change the signature and mark all matrix arguments as “__restrict__”, it uses them (see https://godbolt.org/z/xvKKTfx1z). I guess the resulting code should be significantly faster; however, it relies on additional assumptions.
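A hypothetical sketch of the kind of signature change Kevin describes, using raw pointers rather than the Matrix template (__restrict__ is a gcc/clang extension):

```cpp
#include <cstddef>

// __restrict__ promises the compiler that a, b and c never alias, which
// can unlock wider vectorization. Layout is row-major: a is m-by-k,
// b is k-by-n, c is m-by-n and assumed zero-initialized.
void multiply_ikj_restrict(const double *__restrict__ a,
                           const double *__restrict__ b,
                           double *__restrict__ c, size_t m, size_t k,
                           size_t n) {
  for (size_t i = 0; i < m; i++) {
    for (size_t kk = 0; kk < k; kk++) {
      for (size_t j = 0; j < n; j++) {
        c[i * n + j] += a[i * k + kk] * b[kk * n + j];
      }
    }
  }
}
```

The extra assumption is real: calling this with overlapping buffers would be undefined behavior.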

4. Alex Shpilkin says:

Related: “Beating OpenBLAS’ Matrix Multiplication in 150 Lines of C Code”. From what I can see, the kernel there uses a different ordering from the compiler-vectorized code linked in Kevin M.’s comment above, so I’m not sure how it ends up performance-wise. (Most of the tricks described in that article after the kernel are not really relevant to matrices as small as 100×100, as far as I understand.)

1. Alex Shpilkin says:

Oops, link in angle brackets was stripped by the HTML sanitizer. Here it is: Beating OpenBLAS’ Matrix Multiplication in 150 Lines of C Code
