(Training material on PyTorch CPU performance optimization)
- Part II: Parallelization Techniques
- Part III: Vectorization Techniques
- Part IV: BFloat16 Kernel Optimization
Chinese version of this chapter: link.
This is a short post that explains how to write a high-performance matrix multiplication program on modern processors. In this tutorial I will use a single core of a Skylake-client CPU with AVX2, but the principles in this post also apply to processors with other instruction sets (such as AVX-512).
Matrix multiplication is a mathematical operation that defines the product of two matrices: each element C[i][j] of the result is the dot product of row i of A and column j of B.
#include <cblas.h>
#include <stdio.h>

int main(void)
{
    int i = 0;
    double A[6] = {1.0, 2.0, 1.0, -3.0, 4.0, -1.0};  /* 3x2, column-major */
    double B[6] = {1.0, 2.0, 1.0, -3.0, 4.0, -1.0};  /* 3x2, column-major */
    double C[9] = {.5, .5, .5, .5, .5, .5, .5, .5, .5};

    /* C = 1.0 * A * B^T + 2.0 * C, with M = N = 3, K = 2 */
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans,
                3, 3, 2, 1, A, 3, B, 3, 2, C, 3);

    for (i = 0; i < 9; i++)
        printf("%lf ", C[i]);
    printf("\n");
    return 0;
}