Here I compared the effect of different compiler optimizations in both Fortran and C for a program that multiplies a matrix with a vector. The results are below.
Options | C (loop) | Fortran (intrinsic) | Fortran (loop)
---|---|---|---
(none) | 828 ms | 104 ms | 835 ms
-Ofast | 110 ms | 112 ms | 110 ms
-O3 | 362 ms | 361 ms | 363 ms
-O3 -march=native | 362 ms | 363 ms | 361 ms
-O3 -march=native -ffast-math -funroll-loops | 90.3 ms | 92.8 ms | 89.5 ms
-O3 -march=native -ffast-math -funroll-loops -fopenmp | 85.2 ms | 91.2 ms | 86.4 ms
I wanted to understand what differences, if any, exist between C and Fortran, and how the compiler options impact them. I also wanted to know how Fortran's intrinsic matmul compares with manually writing a loop, particularly for multiplying a matrix by a vector, which is just a simple nested loop and easily parallelized. I'm working on a Fortran implementation of LLM inference and want to understand how it can be optimized.
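For reference, the hand-written version boils down to this nested loop. This is a minimal C sketch, not the attached program itself (which also does timing and initialization):

```c
#include <stddef.h>

/* y = A*x for an n x n row-major matrix A.
   The inner loop is a dot product of one row of A with x. */
void matvec(size_t n, const double *A, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++) {
        double sum = 0.0;
        for (size_t j = 0; j < n; j++)
            sum += A[i * n + j] * x[j];
        y[i] = sum;
    }
}
```

The Fortran loop version is the same thing with the index order flipped, since Fortran arrays are column-major.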
The programs and a script to run them are attached. I ran them on my "Intel Core i7 vPro 9th Gen" Lenovo laptop running Ubuntu 20.04, using gfortran-10 and gcc-10. I used a matrix size of 20,000x20,000 and 10 iterations.
The numbers presented are from a random run that I did and were not selected for any special reason. Having run it several times, there are ~2-3 ms of uncertainty in some of the results, so don't read too much into small differences.
Overall, C vs Fortran is a wash, which shouldn't be a surprise. We get about the same performance for the hand-written matmul with all the options on, with and without parallelization.
The most noteworthy thing is how much faster Fortran's intrinsic matmul is out of the box without any optimization, and how that gets ruined with -O3 on. Naively I thought adding -O3 was a good generic way to get a speedup, but clearly that's not true here. Otherwise, -Ofast speeds up the hand-written implementations but still slightly degrades the Fortran intrinsic performance. And intrinsic performance is sped up by -ffast-math and -funroll-loops.
The biggest loser here (other than -O3) is parallelism. Running on 12 cores, parallelizing the loops only knocks a few ms off the time. That's presumably because matvec is memory-bandwidth bound: each matrix element is read exactly once and used for a single multiply-add, so once a couple of cores saturate memory bandwidth, the rest sit idle.
I'd like to know if there are other obvious optimizations that could make this code faster. Currently I think I need to look elsewhere for speedups, while being careful about which compiler options I use.