Writing a program for efficient matrix multiplication is tricky. A naive implementation usually looks like this (I will use a square matrix for simplicity):
for (i = 0; i < n; ++i)
for (j = 0; j < n; ++i)
for (k = 0, m[i][j] = 0.; k < n; ++i)
m[i][j] += a[i][k] * b[k][j];
The problem is the inner loop a[i][k] * b[k][j]
where b[k][j]
and b[k+1][j]
are not adjacent in memory. This leads to frequent cache misses for large matrices. The better implementation is to transpose matrix b
. The implementation will look like:
for (i = 0; i < n; ++i) // transpose