While doing research work with Prof. Kishore Kothapalli and Prof. Dip Sankar Banerjee.
Abstract — GPUs are usually optimized for chunked memory access patterns and high compute. Although the PageRank algorithm has an unchunked-gather memory access pattern and low compute, implementing it on a CUDA-enabled GPU, such as the Volta GV100, can still bring a significant performance improvement. The first step of the parallelization process used here involved independently parallelizing the map- and reduce-like operations within the algorithm, to obtain a suitable implementation and launch config for each. The second step involved parallelizing the rank computation process using a switched thread/block-per-vertex approach, after the vertices have been partitioned by in-degree at a suitable switch point, with an apt launch config for each partition. Compared to nvGraph PageRank on fixed and temporal graphs, results indicate a small performance improvement. Note, however, that the CUDA implementation developed here uses the L1-norm for error measurement, while nvGraph PageRank uses the L2-norm, along with a per-iteration L2-norm rank scaling followed by an L1-norm rank scaling after the final iteration. It was also observed that a sequential CPU-based vector element sum with 32-bit floating-point values suffered from precision issues, while the CUDA-based sum did not. (🎞️ slides)
Index terms — PageRank algorithm, Volta GV100 GPU, CUDA implementation, Map/Reduce-like operations, Switched thread/block-per-vertex approach.
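As a reference, the two error measures compared in the abstract have the standard definitions below, where $r^{(t)}$ denotes the rank vector after iteration $t$:

```latex
% L1-norm error between successive rank vectors (used by the CUDA
% implementation developed here):
E_{L1} = \lVert r^{(t+1)} - r^{(t)} \rVert_1
       = \sum_i \left| r_i^{(t+1)} - r_i^{(t)} \right|

% L2-norm error (used by nvGraph PageRank):
E_{L2} = \lVert r^{(t+1)} - r^{(t)} \rVert_2
       = \sqrt{\sum_i \left( r_i^{(t+1)} - r_i^{(t)} \right)^2}
```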
| Map operations | launch | | | |
|---|---|---|---|---|
| Reduce operations | memcpy launch | in-place launch | vs | |
| Switched thread/block | sort vertices/edges | switch-point | block launch | thread launch |
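The switched thread/block-per-vertex approach referenced above can be sketched as a pair of pull-style CUDA kernels. This is a minimal sketch with hypothetical names (the actual repo code may differ): it assumes a CSR graph (`vfrom` offsets, `efrom` edge sources), vertices pre-sorted by in-degree, `c[u]` holding each source vertex's pre-scaled contribution `damping * rank[u] / degree[u]`, and `c0` the common teleport term.

```cuda
#include <cuda_runtime.h>

#define BLOCK 256

// Thread-per-vertex: one thread gathers all contributions of one
// low in-degree vertex v in [i, i+n).
__global__ void pagerankThreadKernel(float *a, const float *c,
    const int *vfrom, const int *efrom, int i, int n, float c0) {
  int v = i + blockIdx.x * blockDim.x + threadIdx.x;
  if (v >= i + n) return;
  float sum = 0.0f;
  for (int e = vfrom[v]; e < vfrom[v + 1]; ++e)
    sum += c[efrom[e]];            // unchunked gather (pull)
  a[v] = c0 + sum;
}

// Block-per-vertex: one block cooperatively gathers the contributions
// of one high in-degree vertex, then reduces them in shared memory.
__global__ void pagerankBlockKernel(float *a, const float *c,
    const int *vfrom, const int *efrom, int i, int n, float c0) {
  __shared__ float cache[BLOCK];
  for (int v = i + blockIdx.x; v < i + n; v += gridDim.x) {
    float sum = 0.0f;
    for (int e = vfrom[v] + threadIdx.x; e < vfrom[v + 1]; e += blockDim.x)
      sum += c[efrom[e]];          // strided gather across the block
    cache[threadIdx.x] = sum;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s /= 2) {  // block-wide reduce
      if (threadIdx.x < s) cache[threadIdx.x] += cache[threadIdx.x + s];
      __syncthreads();
    }
    if (threadIdx.x == 0) a[v] = c0 + cache[0];
    __syncthreads();
  }
}
```

In each iteration the host would launch `pagerankThreadKernel` on the partition of vertices below the switch point and `pagerankBlockKernel` on the partition at or above it, each with its own launch config.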
| Fixed graphs | static PageRank | nvGraph vs CUDA vs sequential |
|---|---|---|
| Temporal graphs | static PageRank | nvGraph vs CUDA |
| Temporal graphs | incremental PageRank | nvGraph vs CUDA |
- Comparing various launch configs for CUDA-based vector multiply.
- Comparing various launch configs for CUDA-based vector element sum (memcpy).
- Comparing various launch configs for CUDA-based vector element sum (in-place).
- Performance of memcpy-based vs in-place CUDA-based vector element sum.
- Studying the effect of sorting vertices and/or edges by in-degree for CUDA switched-per-vertex PageRank ...
- Comparing various switch points for CUDA switched-per-vertex PageRank (pull, CSR, switched-partition).
- Comparing various launch configs for CUDA switched-per-vertex PageRank, focusing on the block approach ...
- Comparing various launch configs for CUDA switched-per-vertex PageRank, focusing on the thread approach ...
- Performance of nvGraph-based vs CUDA-based PageRank (pull, CSR).
- Performance of static vs incremental CUDA-based PageRank (pull, CSR, scaled-fill).
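The memcpy vs in-place element-sum experiments above contrast two ways of finishing a block-level reduction. A minimal sketch of the shared kernel, with hypothetical names, might look like this:

```cuda
#include <cuda_runtime.h>

#define BLOCK 128

// Each block grid-strides over x accumulating a partial sum, then
// reduces its partials in shared memory; one value per block is
// written to a.
__global__ void sumKernel(float *a, const float *x, int N) {
  __shared__ float cache[BLOCK];
  float sum = 0.0f;
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < N;
       i += gridDim.x * blockDim.x)
    sum += x[i];
  cache[threadIdx.x] = sum;
  __syncthreads();
  for (int s = blockDim.x / 2; s > 0; s /= 2) {
    if (threadIdx.x < s) cache[threadIdx.x] += cache[threadIdx.x + s];
    __syncthreads();
  }
  if (threadIdx.x == 0) a[blockIdx.x] = cache[0];
}
```

The memcpy variant would copy the per-block partials back to the host with `cudaMemcpy` and finish the sum on the CPU; the in-place variant would relaunch the kernel on the partials array with a shrinking grid until a single value remains, keeping everything on the device.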
| Map operations | duty | |
|---|---|---|
| Reduce operations | memcpy duty | |
| Block per vertex | launch | sort vertices/edges |
| Thread per vertex | launch | sort vertices/edges |
- Comparing various per-thread duty numbers for CUDA-based vector multiply.
- Comparing various per-thread duty numbers for CUDA-based vector element sum (memcpy).
- Comparing various launch configs for CUDA block-per-vertex PageRank (pull, CSR).
- Studying the effect of sorting vertices and/or edges by in-degree for CUDA block-per-vertex PageRank ...
- Comparing various launch configs for CUDA thread-per-vertex PageRank (pull, CSR).
- Studying the effect of sorting vertices and/or edges by in-degree for CUDA thread-per-vertex PageRank ...
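Assuming per-thread "duty" means the number of elements each thread processes (so that fewer, heavier threads are launched), the duty experiments above could be sketched for vector multiply as follows (hypothetical names):

```cuda
#include <cuda_runtime.h>

// Each thread multiplies DUTY elements, one per grid-stride step.
// The host would size the grid as ceil(N / (blockDim.x * DUTY)) blocks.
template <int DUTY>
__global__ void multiplyKernel(float *a, const float *x,
                               const float *y, int N) {
  int stride = gridDim.x * blockDim.x;
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  for (int d = 0; d < DUTY; ++d, i += stride)
    if (i < N) a[i] = x[i] * y[i];  // element-wise map
}
```

A duty of 1 reduces to the plain one-thread-per-element map; larger duties trade thread-launch overhead and occupancy against per-thread work, which is what the comparisons above measure.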