(Training material on PyTorch CPU performance optimization)
- Part II: Parallelization Techniques
- Part III: Vectorization Techniques
- Part IV: BFloat16 Kernel Optimization
Chinese version of this chapter: link.
This is a short post that explains how to write a high-performance matrix multiplication program on modern processors. In this tutorial I will use a single core of a Skylake-client CPU with AVX2, but the principles in this post also apply to processors with other instruction sets (such as AVX-512).
Matrix multiplication is a mathematical operation that defines the product of two matrices: each element C[i][j] of the result is the dot product of row i of A and column j of B.
#include <cblas.h>
#include <stdio.h>

int main(void)
{
    int i = 0;
    double A[6] = {1.0, 2.0, 1.0, -3.0, 4.0, -1.0};  /* 3x2, column-major */
    double B[6] = {1.0, 2.0, 1.0, -3.0, 4.0, -1.0};  /* 3x2, column-major */
    double C[9] = {.5, .5, .5, .5, .5, .5, .5, .5, .5};

    /* C = 1.0 * A * B^T + 2.0 * C, with M = N = 3, K = 2 */
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans,
                3, 3, 2, 1, A, 3, B, 3, 2, C, 3);

    for (i = 0; i < 9; i++)
        printf("%lf ", C[i]);
    printf("\n");
    return 0;
}