Rolling your own fast matrix multiplication: loop order and vectorization

If you must multiply matrices, you should use dedicated libraries. However, we sometimes need to roll our own code. In C++, you can quickly write your own Matrix template:

#include <cstddef>
#include <memory>

template <typename T>
struct Matrix {
  // Value-initialize (zero) the buffer: the multiplication routines below
  // accumulate into the output with +=, so it must start at zero.
  Matrix(size_t rows, size_t cols)
      : data(new T[rows * cols]()), rows(rows), cols(cols) {}

  T &operator()(size_t i, size_t j) { return data[i * cols + j]; }

  const T &operator()(size_t i, size_t j) const { return data[i * cols + j]; }

  std::unique_ptr<T[]> data;
  size_t rows;
  size_t cols;
};

How do you implement a matrix multiplication? A matrix multiplication is three nested loops, and if you do not want to get fancy, your only choice is the order of the loops. You therefore have six possibilities:

template <typename T>
void multiply_ikj(const Matrix<T> &a, const Matrix<T> &b, Matrix<T> &c) {
  for (size_t i = 0; i < a.rows; i++) {
    for (size_t k = 0; k < a.cols; k++) {
      for (size_t j = 0; j < b.cols; j++) {
        c(i, j) += a(i, k) * b(k, j);
      }
    }
  }
}

template <typename T>
void multiply_ijk(const Matrix<T> &a, const Matrix<T> &b, Matrix<T> &c) {
  for (size_t i = 0; i < a.rows; i++) {
    for (size_t j = 0; j < b.cols; j++) {
      for (size_t k = 0; k < a.cols; k++) {
        c(i, j) += a(i, k) * b(k, j);
      }
    }
  }
}

template <typename T>
void multiply_kij(const Matrix<T> &a, const Matrix<T> &b, Matrix<T> &c) {
  for (size_t k = 0; k < a.cols; k++) {
    for (size_t i = 0; i < a.rows; i++) {
      for (size_t j = 0; j < b.cols; j++) {
        c(i, j) += a(i, k) * b(k, j);
      }
    }
  }
}

template <typename T>
void multiply_kji(const Matrix<T> &a, const Matrix<T> &b, Matrix<T> &c) {
  for (size_t k = 0; k < a.cols; k++) {
    for (size_t j = 0; j < b.cols; j++) {
      for (size_t i = 0; i < a.rows; i++) {
        c(i, j) += a(i, k) * b(k, j);
      }
    }
  }
}

template <typename T>
void multiply_jki(const Matrix<T> &a, const Matrix<T> &b, Matrix<T> &c) {
  for (size_t j = 0; j < b.cols; j++) {
    for (size_t k = 0; k < a.cols; k++) {
      for (size_t i = 0; i < a.rows; i++) {
        c(i, j) += a(i, k) * b(k, j);
      }
    }
  }
}

template <typename T>
void multiply_jik(const Matrix<T> &a, const Matrix<T> &b, Matrix<T> &c) {
  for (size_t j = 0; j < b.cols; j++) {
    for (size_t i = 0; i < a.rows; i++) {
      for (size_t k = 0; k < a.cols; k++) {
        c(i, j) += a(i, k) * b(k, j);
      }
    }
  }
}
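
Note that all six routines accumulate with +=, so the output matrix must start at zero, which the constructor above guarantees. A minimal usage sketch (the size and the fill values are arbitrary, picked for illustration):

#include <cstdio>

int main() {
  const size_t n = 100;
  Matrix<double> a(n, n), b(n, n), c(n, n); // c starts zeroed
  for (size_t i = 0; i < n; i++) {
    for (size_t j = 0; j < n; j++) {
      a(i, j) = 1.0;
      b(i, j) = 2.0;
    }
  }
  multiply_ikj(a, b, c);
  // Each entry of c is a sum of n products: 100 * 1.0 * 2.0 = 200.
  printf("%f\n", c(0, 0));
}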

If you use an optimizing compiler and you tell it to compile specifically for your processor, you should get some fast code, at least in some instances. Which order is best?

The exact result depends on your data type (double, float, int), on the size of the matrices, on your compiler and on your hardware. I wrote a benchmark using 100 by 100 matrices containing double values, GCC 12 (with full optimization, -O3) and an Intel Ice Lake processor. I tell the compiler to optimize for the exact processor I have (e.g., -march=native), so I expect it to use advanced AVX-512 instructions when possible.
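
My benchmark is more careful than this, but a minimal timing harness along the following lines (the helper name and the repeat count are my own, not from the benchmark) reproduces the gist when compiled with something like g++ -O3 -march=native:

#include <chrono>

// Hypothetical helper: run `multiply` repeatedly on n-by-n double matrices
// and return the number of matrix products completed per second.
template <typename F>
double products_per_second(F multiply, size_t n, size_t repeats = 1000) {
  Matrix<double> a(n, n), b(n, n), c(n, n);
  for (size_t i = 0; i < n * n; i++) {
    a.data[i] = 1.0;
    b.data[i] = 2.0;
  }
  auto start = std::chrono::steady_clock::now();
  for (size_t r = 0; r < repeats; r++) {
    multiply(a, b, c); // c keeps accumulating; harmless for timing purposes
  }
  auto end = std::chrono::steady_clock::now();
  return repeats / std::chrono::duration<double>(end - start).count();
}

You would call it as, e.g., products_per_second(multiply_ikj<double>, 100).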

The net result in my experiment is that the best ordering is ikj and the worst is jik. This is consistent with the memory layout: with ikj, the innermost loop walks contiguously through a row of b and a row of c while a(i, k) stays constant, a pattern the compiler can vectorize, as the sketch below makes explicit.
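
To see why, hoist a(i, k) into a scalar; the function below is equivalent to multiply_ikj and only makes the access pattern explicit (a sketch for illustration, not a separate benchmark entry):

template <typename T>
void multiply_ikj_hoisted(const Matrix<T> &a, const Matrix<T> &b, Matrix<T> &c) {
  for (size_t i = 0; i < a.rows; i++) {
    for (size_t k = 0; k < a.cols; k++) {
      const T aik = a(i, k); // invariant across the inner loop
      // The inner loop streams through a row of b and a row of c
      // contiguously: the compiler can turn it into vector fused
      // multiply-adds.
      for (size_t j = 0; j < b.cols; j++) {
        c(i, j) += aik * b(k, j);
      }
    }
  }
}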

If you were to compute the matrix multiplication naively by hand, you would need 100 times 100 times 100 multiplications, so 1 million multiplications and 1 million additions. Interestingly, the best ordering (ikj) uses roughly a million instructions in total to load the data, do the multiplications and additions, and store the results.

order   speed         frequency   instructions per cycle   instructions per matrix product
ikj     7240 mult/s   2 GHz       3.3                      916k
ijk     3190 mult/s   2 GHz       2.4                      1526k
kij     3160 mult/s   2 GHz       2.5                      1561k
kji     3090 mult/s   2 GHz       2.4                      1519k
jki     3120 mult/s   3.2 GHz     3.5                      3526k
jik     1280 mult/s   3.2 GHz     1.7                      4331k

The resulting assembly shows that for ikj, the compiler generates the vfmadd213pd instruction, a fused multiply-add over packed double values. There are fancier routines that the compiler could use, so the result is likely far from optimal.

Further work. Justine Tunney suggests manually unrolling the outer loops. One might also use OpenMP to get good parallelism (e.g., #pragma omp parallel for collapse(2) if (m*n*k > 300000)).
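
For the ikj order, the two outer loops are not independent (different k values update the same entries of c), so collapse(2) does not apply directly; it fits orders such as ijk, where each (i, j) pair owns one entry of c. A minimal OpenMP sketch for ikj, parallelizing over i only and compiled with -fopenmp (the 300000 threshold is borrowed from the pragma above; I have not benchmarked this variant):

template <typename T>
void multiply_ikj_parallel(const Matrix<T> &a, const Matrix<T> &b, Matrix<T> &c) {
  // Each iteration of i writes a distinct row of c, so the outer loop is
  // race-free; the if clause skips threading overhead on tiny inputs.
#pragma omp parallel for if (a.rows * a.cols * b.cols > 300000)
  for (size_t i = 0; i < a.rows; i++) {
    for (size_t k = 0; k < a.cols; k++) {
      for (size_t j = 0; j < b.cols; j++) {
        c(i, j) += a(i, k) * b(k, j);
      }
    }
  }
}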

Daniel Lemire, "Rolling your own fast matrix multiplication: loop order and vectorization," in Daniel Lemire's blog, June 13, 2024.

10 thoughts on “Rolling your own fast matrix multiplication: loop order and vectorization”

  1. The reason that multiply_ikj and multiply_ijk take the same time is that they differ only in name. A copy-paste error. If you change multiply_ijk to do what its name suggests, it’s slower.

  2. Disappointing that we’re still doing O(n^3) matrix multiplication. Is it not possible to add at least Strassen’s algorithm for an O(n^2.8) runtime?

    1. I thought you would only see the benefit of Strassen’s and similar algorithms for much larger matrices: their time complexity is better, but the actual cost of one iteration is higher?

  3. Related: “Beating OpenBLAS’ Matrix Multiplication in 150 Lines of C Code”. From what I can see, the kernel there uses a different ordering from the compiler-vectorized code linked in Kevin M.’s comment above, so I’m not sure how it ends up performance-wise. (Most of the tricks described in that article after the kernel are not really relevant to matrices as small as 100×100, as far as I understand.)

    1. Oops, link in angle brackets was stripped by the HTML sanitizer. Here it is: Beating OpenBLAS’ Matrix Multiplication in 150 Lines of C Code
