neoblizz/SoL-SpMM-GEMM.md

## SoL-SpMM-GEMM.md

      
    Raw
  

              SoL-SpMM-GEMM.md
            
          
    Hardware Speed-of-Light Analysis

The following numbers are based on NVIDIA's Volta microarchitecture. To perform a similar analysis for a newer architecture, I recommend changing the numbers below based on device_query CUDA sample or wikipedia page.
CUDA Cores = SM * Cores per SM (SM = 80, Cores/SM = 64)
Maximum Clock Rate = Clock Rate (KHz) * 1e-6 (GHz)
Maximum Throughput (type == floats, doubles or half) =
    CUDA Cores * Maximum Clock Rate * Type Ratio (device properties) (GFLOP/s)

Maximum Memory Bandwidth = 
    Memory Clock Rate (KHz) * 1e-6 * Pump Rate * (Bus Width / 8) (GB/s)
Maximum Memory Bandwidth = 
    Frequency * Pump Rate * (Bus Width / 8) (GB/s)

Bus Width = 4096 bit
Pump Rate = 2
Memory Clock Rate (KHz) = 877 MHz
Maximum Memory Bandwidth = 898 GB/s ~= 900 GB/s

Dense Matrix Multiplication

Problem Dependent Analysis


A = {m × k}
B = {k × n}
C = {m × n}

Number of Operations = 2 (multadd operations) * m * n * k * 1e-9 (GFLOP)
Number of Reads/Writes = (m * k) + (k * n) + (m * n) * size of type (64/8 doubles or 32/8 floats) * 1e-9 (GByte)

Problem & Hardware Dependent

Speed of Light Elapsed Time (s) Compute Bound = 
	Number of Operations (GFLOP) / Maximum Throughput (GFLOP/s)

Speed of Light Elapsed Time (s) Memory Bound = 
	Number of Reads/Writes (GB) /  Maximum Memory Bandwidth (GB/s)

Speed of Light Elapsed Time (s) = max(Compute, Memory)

Speed of Light Throuput (GFLOP/s) = 
	Number of Operations (GFLOP) / Speed of Light Elapsed Time (s)

Sparse Matrix Multiplication

Problem Dependent Analysis (CSR)


A = {m × k} represented as {nnz, nzr}
B = {k × n}
C = {m × n}

Number of Operations =  2 (multadd) * nnz * n * 1e-9 (GFLOP)
Number of Reads/Writes += size of type (nzr) * nzr           // row offsets
                       += size of type(k) * nnz              // column indices
                       += size of type (values) * nnz        // values of A
                       += size of type (values) * (k * n)    // values of B
                       += size of type (values) * (m * n)    // values of C
                       *= 1e-9 (GB)
										 
m, n, k = 1024
sparsity = 90%, nnz = (m * k) * 10%
dense = 0.2741 (ms)
sparse = 0.0274 (ms)