Highlighted notes on:
NVIDIA Tesla V100 GPU Architecture Whitepaper
While doing research work with Prof. Dip Sankar Banerjee and Prof. Kishore Kothapalli.
Here is my short summary of the NVIDIA Tesla GV100 (Volta) architecture from the whitepaper:
- 84 SMs in full GV100 (80 enabled in Tesla V100), each with 64 independent FP32 and 64 INT32 cores.
- Shared mem. size config. up to 96KB / SM.
- 8 512-bit mem. controllers (total 4096-bit).
- Up to 6 bidirectional NVLink links, 25 GB/s per direction each (w/ IBM Power9 CPUs).
- 4 dies / HBM stack, 4 stacks. 16 GB w/ 900 GB/s HBM2 (Samsung).
- SECDED ECC (1-bit err. correcting, 2-bit err. detecting): native in HBM2, sideband elsewhere (regs, L1, L2) (1 bit / byte).
A few additional points:
- Each SM has 4 processing blocks (each with a warp scheduler issuing 1 warp of 32 threads per clock).
- L1 data cache is combined w/ shared mem. = 128 KB / SM (explicit caching not as imp.).
- Volta L1 also supports write-caching (not just load-caching, as in prev. arch.).
- NVLink supports coherency allowing data reads from GPU mem. to be stored in CPU cache.
- Addr. Translation Serv. (ATS) allows GPU to access CPU page tables directly (malloc ptr).
- Copy engines don't need pinned memory (that's why I saw ~no speedup w/ pinned mem. in PR).
- Volta's per-thread PC and call stack allow interleaved execution of a warp's threads, enabling fine-grained sync (__syncwarp()).
- Cooperative groups enable sync. at sub-warp, cross-warp, grid-wide, and multi-GPU scope.
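A minimal sketch of the pinned-memory point, assuming a CUDA-capable machine (untested here): time the same host-to-device copy from a pageable (malloc) buffer and a page-locked (cudaMallocHost) buffer and compare.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Time one host-to-device cudaMemcpy in milliseconds using CUDA events.
static float time_h2d(void *dst, const void *src, size_t bytes) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    const size_t bytes = 64 << 20;  // 64 MB transfer
    void *dev = nullptr, *pageable = nullptr, *pinned = nullptr;
    cudaMalloc(&dev, bytes);
    pageable = malloc(bytes);        // ordinary pageable host memory
    cudaMallocHost(&pinned, bytes);  // page-locked (pinned) host memory

    printf("pageable: %.2f ms\n", time_h2d(dev, pageable, bytes));
    printf("pinned:   %.2f ms\n", time_h2d(dev, pinned, bytes));

    cudaFree(dev);
    free(pageable);
    cudaFreeHost(pinned);
    return 0;
}
```

On older architectures pinned buffers usually win clearly; the note above suggests that on Volta, where the copy engines can fault on non-pinned pages, the gap can shrink, which is consistent with the ~no-speedup observation.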
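The last two bullets can be sketched in one kernel (assumes sm_70+ and the cooperative_groups header shipped since CUDA 9; the 16-thread tile size and the tile_sum kernel name are my own choices):

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Sub-warp reduction: each 16-thread tile sums its inputs independently.
__global__ void tile_sum(const int *in, int *out) {
    cg::thread_block block = cg::this_thread_block();
    // Partition the block into 16-thread tiles; sync and shuffle are
    // scoped to the tile, i.e. finer-grained than a full warp.
    cg::thread_block_tile<16> tile = cg::tiled_partition<16>(block);

    int v = in[block.thread_rank()];
    // Tree reduction within the tile using tile-scoped shuffles.
    for (int off = tile.size() / 2; off > 0; off /= 2)
        v += tile.shfl_down(v, off);

    if (tile.thread_rank() == 0)
        out[block.thread_rank() / 16] = v;

    // With Volta's independent thread scheduling, diverged warp threads are
    // not implicitly reconverged; __syncwarp() is the explicit sync point.
    __syncwarp();
}
```

Grid-wide and multi-GPU groups additionally require launching via cudaLaunchCooperativeKernel; the sketch above sticks to the in-block scopes.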