Highlighted notes on:
NVIDIA Tesla V100 GPU Architecture Whitepaper
While doing research work with Prof. Dip Sankar Banerjee and Prof. Kishore Kothapalli.
Here is my short summary of the NVIDIA Tesla GV100 (Volta) architecture from the whitepaper:
- 84 SMs in full GV100 (80 enabled in Tesla V100), each with 64 independent FP32 and 64 INT32 cores.
- Shared mem. size config. up to 96KB / SM.
- 8 512-bit mem. controllers (total 4096-bit).
- Up to 6 bidirectional NVLink links, 25 GB/s per direction each (w/ IBM Power9 CPUs).
- 4 dies / HBM stack, 4 stacks. 16 GB w/ 900 GB/s HBM2 (Samsung).
- SECDED ECC (1-bit err. correcting, 2-bit err. detecting): native in HBM2, sideband elsewhere (regs, L1, L2) (1 bit / byte).
A few additional points:
- Each SM has 4 processing blocks (each with a warp scheduler issuing 1 warp of 32 threads per clock).
- L1 data cache is combined w/ shared mem. = 128 KB / SM (explicit caching not as imp.).
- Volta L1 also supports write-caching (not just load-caching, as in prev. arch.).
- NVLink supports coherency allowing data reads from GPU mem. to be stored in CPU cache.
- Addr. Translation Serv. (ATS) allows GPU to access CPU page tables directly (malloc ptr).
- Copy engines don't need pinned memory (that's why I saw ~no speedup w/ pinned mem. in PR).
- Volta's per-thread PC and call stack allow interleaved execution of a warp's threads, enabling fine-grained sync (__syncwarp()).
- Cooperative groups enable sync. at sub-warp, cross-warp, grid-wide, and multi-GPU scope.
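A minimal sketch of the pinned-memory point, assuming a CUDA-capable machine (untested here): time the same host-to-device copy from a pageable (malloc) buffer and a page-locked (cudaMallocHost) buffer and compare.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Time one host-to-device cudaMemcpy in milliseconds using CUDA events.
static float time_h2d(void *dst, const void *src, size_t bytes) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    const size_t bytes = 64 << 20;  // 64 MB transfer
    void *dev = nullptr, *pageable = nullptr, *pinned = nullptr;
    cudaMalloc(&dev, bytes);
    pageable = malloc(bytes);        // ordinary pageable host memory
    cudaMallocHost(&pinned, bytes);  // page-locked (pinned) host memory

    printf("pageable: %.2f ms\n", time_h2d(dev, pageable, bytes));
    printf("pinned:   %.2f ms\n", time_h2d(dev, pinned, bytes));

    cudaFree(dev);
    free(pageable);
    cudaFreeHost(pinned);
    return 0;
}
```

On older architectures pinned buffers usually win clearly; the note above suggests that on Volta, where the copy engines can fault on non-pinned pages, the gap can shrink, which is consistent with the ~no-speedup observation.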
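The last two bullets can be sketched in one kernel (assumes sm_70+ and the cooperative_groups header shipped since CUDA 9; the 16-thread tile size and the tile_sum kernel name are my own choices):

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Sub-warp reduction: each 16-thread tile sums its inputs independently.
__global__ void tile_sum(const int *in, int *out) {
    cg::thread_block block = cg::this_thread_block();
    // Partition the block into 16-thread tiles; sync and shuffle are
    // scoped to the tile, i.e. finer-grained than a full warp.
    cg::thread_block_tile<16> tile = cg::tiled_partition<16>(block);

    int v = in[block.thread_rank()];
    // Tree reduction within the tile using tile-scoped shuffles.
    for (int off = tile.size() / 2; off > 0; off /= 2)
        v += tile.shfl_down(v, off);

    if (tile.thread_rank() == 0)
        out[block.thread_rank() / 16] = v;

    // With Volta's independent thread scheduling, diverged warp threads are
    // not implicitly reconverged; __syncwarp() is the explicit sync point.
    __syncwarp();
}
```

Grid-wide and multi-GPU groups additionally require launching via cudaLaunchCooperativeKernel; the sketch above sticks to the in-block scopes.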