The following numbers are based on NVIDIA's Volta microarchitecture. To perform a similar analysis for a newer architecture, I recommend changing the numbers below based on device_query
CUDA sample or wikipedia page.
CUDA Cores = SM * Cores per SM (SM = 80, Cores/SM = 64)
Maximum Clock Rate = Clock Rate (KHz) * 1e-6 (GHz)
Maximum Throughput (type == floats, doubles or half) =
CUDA Cores * Maximum Clock Rate * Type Ratio (device properties) (GFLOP/s)
Maximum Memory Bandwidth =
Memory Clock Rate (KHz) * 1e-6 * Pump Rate * (Bus Width / 8) (GB/s)
Maximum Memory Bandwidth =
Frequency * Pump Rate * (Bus Width / 8) (GB/s)
Bus Width = 4096 bit
Pump Rate = 2
Memory Clock Rate (KHz) = 877 MHz
Maximum Memory Bandwidth = 898 GB/s ~= 900 GB/s
- A = {m × k}
- B = {k × n}
- C = {m × n}
Number of Operations = 2 (multadd operations) * m * n * k * 1e-9 (GFLOP)
Number of Reads/Writes = (m * k) + (k * n) + (m * n) * size of type (64/8 doubles or 32/8 floats) * 1e-9 (GByte)
Speed of Light Elapsed Time (s) Compute Bound =
Number of Operations (GFLOP) / Maximum Throughput (GFLOP/s)
Speed of Light Elapsed Time (s) Memory Bound =
Number of Reads/Writes (GB) / Maximum Memory Bandwidth (GB/s)
Speed of Light Elapsed Time (s) = max(Compute, Memory)
Speed of Light Throuput (GFLOP/s) =
Number of Operations (GFLOP) / Speed of Light Elapsed Time (s)
- A = {m × k} represented as {nnz, nzr}
- B = {k × n}
- C = {m × n}
Number of Operations = 2 (multadd) * nnz * n * 1e-9 (GFLOP)
Number of Reads/Writes += size of type (nzr) * nzr // row offsets
+= size of type(k) * nnz // column indices
+= size of type (values) * nnz // values of A
+= size of type (values) * (k * n) // values of B
+= size of type (values) * (m * n) // values of C
*= 1e-9 (GB)
m, n, k = 1024
sparsity = 90%, nnz = (m * k) * 10%
dense = 0.2741 (ms)
sparse = 0.0274 (ms)