
Parallelizing PageRank for a Volta GPU

While doing research work with Prof. Kishore Kothapalli and Prof. Dip Sankar Banerjee.

Abstract — GPUs are usually optimized for chunked memory access patterns and high compute. Although the PageRank algorithm has an unchunked-gather memory access pattern and low compute, implementing it on a CUDA-enabled GPU, such as the Volta GV100, can bring a significant performance improvement. The first step of the parallelization process used here involved independently parallelizing the map-like and reduce-like operations within the algorithm, in order to obtain a suitable implementation and launch config for each. The second step involved parallelizing the rank computation using a switched thread/block-per-vertex approach, after the vertices have been partitioned by in-degree at a suitable switch point, with apt launch configs for each partition. Compared to nvGraph PageRank on fixed and temporal graphs, results indicate a small performance improvement. It is however noted that the CUDA implementation developed here uses the L1-norm for error measurement, while nvGraph PageRank uses the L2-norm, along with a per-iteration L2-norm rank scaling followed by an L1-norm rank scaling after the final iteration. It was also observed that a sequential CPU-based vector element sum with 32-bit floating-point values suffered from precision issues, while the CUDA-based one did not. (🎞️ slides)

Index terms — PageRank algorithm, Volta GV100 GPU, CUDA implementation, Map/Reduce-like operations, Switched thread/block-per-vertex approach.
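
For reference, the rank-update step described above could be sketched in CUDA roughly as follows. This is a minimal illustration under stated assumptions, not the report's exact code: it assumes a pull-style (reverse) CSR in voff/vedg, a precomputed contribution vector c[u] = r[u] / out-degree(u), vertices already partitioned by in-degree so that [0, NS) use the thread-per-vertex kernel and [NS, N) use the block-per-vertex kernel, a power-of-two block size of at most 256 for the block kernel, and no special handling of dangling vertices.

```cuda
#include <cuda_runtime.h>

// Thread-per-vertex: one thread accumulates all in-neighbour contributions
// of a low in-degree vertex (grid-stride loop over vertices [0, NS)).
__global__ void rankThreadKernel(float *a, const float *c, const int *voff,
                                 const int *vedg, int NS, int N, float p) {
  for (int v = blockIdx.x*blockDim.x + threadIdx.x; v < NS;
       v += gridDim.x*blockDim.x) {
    float s = 0;
    for (int i = voff[v]; i < voff[v+1]; ++i) s += c[vedg[i]];
    a[v] = (1-p)/N + p*s;  // teleport term + damped contributions
  }
}

// Block-per-vertex: one block cooperatively accumulates the contributions of
// a high in-degree vertex, then tree-reduces the per-thread sums in shared
// memory (assumes blockDim.x is a power of two, at most 256).
__global__ void rankBlockKernel(float *a, const float *c, const int *voff,
                                const int *vedg, int NS, int N, float p) {
  __shared__ float cache[256];
  for (int v = NS + blockIdx.x; v < N; v += gridDim.x) {
    float s = 0;
    for (int i = voff[v] + threadIdx.x; i < voff[v+1]; i += blockDim.x)
      s += c[vedg[i]];
    cache[threadIdx.x] = s;
    __syncthreads();
    for (int t = blockDim.x/2; t > 0; t /= 2) {
      if (threadIdx.x < t) cache[threadIdx.x] += cache[threadIdx.x + t];
      __syncthreads();
    }
    if (threadIdx.x == 0) a[v] = (1-p)/N + p*cache[0];
    __syncthreads();
  }
}
```

Each kernel can then be launched with its own grid/block configuration, which is what the switched-per-vertex launch experiments below vary.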


Experiments

Map operations: launch
Reduce operations: memcpy launch, in-place launch, memcpy vs in-place
Switched thread/block: sort vertices/edges, switch point, block launch, thread launch
Fixed graphs, static PageRank: nvGraph vs CUDA vs sequential
Temporal graphs, static PageRank: nvGraph vs CUDA
Temporal graphs, incremental PageRank: nvGraph vs CUDA
  1. Comparing various launch configs for CUDA-based vector multiply.
  2. Comparing various launch configs for CUDA-based vector element sum (memcpy).
  3. Comparing various launch configs for CUDA-based vector element sum (in-place).
  4. Performance of memcpy-based vs in-place-based CUDA vector element sum (see the sketch after this list).
  5. Experimenting with the effect of sorting vertices and/or edges by in-degree for CUDA switched-per-vertex based PageRank ...
  6. Comparing various switch points for CUDA switched-per-vertex based PageRank (pull, CSR, switched-partition).
  7. Comparing various launch configs for CUDA switched-per-vertex based PageRank, focusing on the block approach ...
  8. Comparing various launch configs for CUDA switched-per-vertex based PageRank, focusing on the thread approach ...
  9. Performance of nvGraph-based vs CUDA-based PageRank (pull, CSR).
  10. Performance of static vs incremental CUDA-based PageRank (pull, CSR, scaled-fill).
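
As a rough illustration of the two element-sum variants compared in items 2-4 above, the sketch below reduces a vector with one kernel that writes per-block partial sums; the "memcpy" variant copies the partials to the host and finishes the sum on the CPU, while the "in-place" variant reduces the partials on the device with a second launch. Names, sizes, and the exact reduction shape are assumptions for illustration, not the report's code.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each block reduces its grid-strided slice of x into out[blockIdx.x].
__global__ void sumKernel(float *out, const float *x, int N) {
  __shared__ float cache[128];           // assumes blockDim.x == 128
  float s = 0;
  for (int i = blockIdx.x*blockDim.x + threadIdx.x; i < N;
       i += gridDim.x*blockDim.x) s += x[i];
  cache[threadIdx.x] = s;
  __syncthreads();
  for (int t = blockDim.x/2; t > 0; t /= 2) {  // tree reduction in shared memory
    if (threadIdx.x < t) cache[threadIdx.x] += cache[threadIdx.x + t];
    __syncthreads();
  }
  if (threadIdx.x == 0) out[blockIdx.x] = cache[0];
}

int main() {
  const int N = 1 << 20, G = 256, B = 128;
  std::vector<float> hx(N, 1.0f/N);
  float hpartial[G], *x, *partial;
  cudaMalloc(&x, N*sizeof(float));
  cudaMalloc(&partial, G*sizeof(float));
  cudaMemcpy(x, hx.data(), N*sizeof(float), cudaMemcpyHostToDevice);

  // "memcpy" variant: copy the G partial sums back, finish on the CPU.
  sumKernel<<<G, B>>>(partial, x, N);
  cudaMemcpy(hpartial, partial, G*sizeof(float), cudaMemcpyDeviceToHost);
  float sumMemcpy = 0;
  for (int i = 0; i < G; ++i) sumMemcpy += hpartial[i];

  // "in-place" variant: a second launch reduces the partials on the device.
  sumKernel<<<G, B>>>(partial, x, N);
  sumKernel<<<1, B>>>(partial, partial, G);
  float sumInplace;
  cudaMemcpy(&sumInplace, partial, sizeof(float), cudaMemcpyDeviceToHost);

  printf("memcpy: %f  in-place: %f\n", sumMemcpy, sumInplace);  // both ~1.0
  cudaFree(x); cudaFree(partial);
  return 0;
}
```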

Related Experiments

Map operations: duty
Reduce operations: memcpy duty
Block per vertex: launch, sort vertices/edges
Thread per vertex: launch, sort vertices/edges
  1. Comparing various per-thread duty numbers for CUDA-based vector multiply (see the sketch after this list).
  2. Comparing various per-thread duty numbers for CUDA-based vector element sum (memcpy).
  3. Comparing various launch configs for CUDA block-per-vertex based PageRank (pull, CSR).
  4. Experimenting with the effect of sorting vertices and/or edges by in-degree for CUDA block-per-vertex based PageRank ...
  5. Comparing various launch configs for CUDA thread-per-vertex based PageRank (pull, CSR).
  6. Experimenting with the effect of sorting vertices and/or edges by in-degree for CUDA thread-per-vertex based PageRank ...
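
The "per-thread duty" idea in items 1-2 can be pictured with the small sketch below: each thread handles DUTY consecutive elements of the map operation instead of one, so the grid shrinks by that factor. The kernel name, DUTY value, and launch shape are illustrative assumptions, not the report's code.

```cuda
#include <cuda_runtime.h>

// Element-wise multiply (the map operation) with DUTY elements per thread.
template <int DUTY>
__global__ void multiplyKernel(float *a, const float *x, const float *y, int N) {
  int base = (blockIdx.x*blockDim.x + threadIdx.x) * DUTY;
  for (int i = base; i < base + DUTY && i < N; ++i)
    a[i] = x[i] * y[i];
}

// Example launch: with DUTY elements per thread, fewer blocks are needed
// for the same vector length N.
void multiply(float *a, const float *x, const float *y, int N) {
  const int DUTY = 4, BLOCK = 256;
  int grid = (N + BLOCK*DUTY - 1) / (BLOCK*DUTY);
  multiplyKernel<DUTY><<<grid, BLOCK>>>(a, x, y, N);
}
```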