While doing research work with Prof. Kishore Kothapalli and Prof. Dip Sankar Banerjee.
Abstract — GPUs are usually optimized for chunked memory access patterns and high compute. Although the PageRank algorithm has an unchunked-gather memory access pattern and low compute, implementing it on a CUDA-enabled GPU, such as the Volta GV100, can still bring a significant performance improvement. The first step of the parallelization process used here involved independently parallelizing the map- and reduce-like operations within the algorithm, to obtain a suitable implementation and launch config for each. The second step involved parallelizing the rank computation process using a switched thread/block-per-vertex approach, after the vertices have been partitioned by in-degree at a suitable switch point, with an apt launch config for each partition. Compared to nvGraph PageRank on fixed and temporal graphs, results indicate a small performance improvement. Note, however, that the CUDA implementation developed here uses the L1-norm for error measurement, while nvGraph PageRank uses the L2-norm, along with a per-iteration L2-norm rank scaling followed by an L1-norm rank scaling after the final iteration. It was also observed that a sequential CPU-based vector element sum with 32-bit floating-point values suffered from precision issues, while the CUDA-based sum did not. (🎞️ slides)
Index terms — PageRank algorithm, Volta GV100 GPU, CUDA implementation, Map/Reduce-like operations, Switched thread/block-per-vertex approach.
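As a reference, the two error measures compared in the abstract have the standard definitions below, where $r^{(t)}$ denotes the rank vector after iteration $t$:

```latex
% L1-norm error between successive rank vectors (used by the CUDA
% implementation developed here):
E_{L1} = \lVert r^{(t+1)} - r^{(t)} \rVert_1
       = \sum_i \left| r_i^{(t+1)} - r_i^{(t)} \right|

% L2-norm error (used by nvGraph PageRank):
E_{L2} = \lVert r^{(t+1)} - r^{(t)} \rVert_2
       = \sqrt{\sum_i \left( r_i^{(t+1)} - r_i^{(t)} \right)^2}
```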
| Map operations | launch | | | |
|---|---|---|---|---|
| Reduce operations | memcpy launch | in-place launch | vs | |
| Switched thread/block | sort vertices/edges | switch-point | block launch | thread launch |
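The switched thread/block-per-vertex approach referenced above can be sketched as a pair of pull-style CUDA kernels. This is a minimal sketch with hypothetical names (the actual repo code may differ): it assumes a CSR graph (`vfrom` offsets, `efrom` edge sources), vertices pre-sorted by in-degree, `c[u]` holding each source vertex's pre-scaled contribution `damping * rank[u] / degree[u]`, and `c0` the common teleport term.

```cuda
#include <cuda_runtime.h>

#define BLOCK 256

// Thread-per-vertex: one thread gathers all contributions of one
// low in-degree vertex v in [i, i+n).
__global__ void pagerankThreadKernel(float *a, const float *c,
    const int *vfrom, const int *efrom, int i, int n, float c0) {
  int v = i + blockIdx.x * blockDim.x + threadIdx.x;
  if (v >= i + n) return;
  float sum = 0.0f;
  for (int e = vfrom[v]; e < vfrom[v + 1]; ++e)
    sum += c[efrom[e]];            // unchunked gather (pull)
  a[v] = c0 + sum;
}

// Block-per-vertex: one block cooperatively gathers the contributions
// of one high in-degree vertex, then reduces them in shared memory.
__global__ void pagerankBlockKernel(float *a, const float *c,
    const int *vfrom, const int *efrom, int i, int n, float c0) {
  __shared__ float cache[BLOCK];
  for (int v = i + blockIdx.x; v < i + n; v += gridDim.x) {
    float sum = 0.0f;
    for (int e = vfrom[v] + threadIdx.x; e < vfrom[v + 1]; e += blockDim.x)
      sum += c[efrom[e]];          // strided gather across the block
    cache[threadIdx.x] = sum;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s /= 2) {  // block-wide reduce
      if (threadIdx.x < s) cache[threadIdx.x] += cache[threadIdx.x + s];
      __syncthreads();
    }
    if (threadIdx.x == 0) a[v] = c0 + cache[0];
    __syncthreads();
  }
}
```

In each iteration the host would launch `pagerankThreadKernel` on the partition of vertices below the switch point and `pagerankBlockKernel` on the partition at or above it, each with its own launch config.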
| Fixed graphs | static PageRank | nvGraph vs CUDA vs sequential |
|---|---|---|
| Temporal graphs | static PageRank | nvGraph vs CUDA |
| Temporal graphs | incremental PageRank | nvGraph vs CUDA |
- Comparing various launch configs for CUDA-based vector multiply.
- Comparing various launch configs for CUDA-based vector element sum (memcpy).
- Comparing various launch configs for CUDA-based vector element sum (in-place).
- Performance of memcpy-based vs in-place CUDA-based vector element sum.
- Studying the effect of sorting vertices and/or edges by in-degree for CUDA switched-per-vertex PageRank ...
- Comparing various switch points for CUDA switched-per-vertex PageRank (pull, CSR, switched-partition).
- Comparing various launch configs for CUDA switched-per-vertex PageRank, focusing on the block approach ...
- Comparing various launch configs for CUDA switched-per-vertex PageRank, focusing on the thread approach ...
- Performance of nvGraph-based vs CUDA-based PageRank (pull, CSR).
- Performance of static vs incremental CUDA-based PageRank (pull, CSR, scaled-fill).
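The memcpy vs in-place element-sum experiments above contrast two ways of finishing a block-level reduction. A minimal sketch of the shared kernel, with hypothetical names, might look like this:

```cuda
#include <cuda_runtime.h>

#define BLOCK 128

// Each block grid-strides over x accumulating a partial sum, then
// reduces its partials in shared memory; one value per block is
// written to a.
__global__ void sumKernel(float *a, const float *x, int N) {
  __shared__ float cache[BLOCK];
  float sum = 0.0f;
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < N;
       i += gridDim.x * blockDim.x)
    sum += x[i];
  cache[threadIdx.x] = sum;
  __syncthreads();
  for (int s = blockDim.x / 2; s > 0; s /= 2) {
    if (threadIdx.x < s) cache[threadIdx.x] += cache[threadIdx.x + s];
    __syncthreads();
  }
  if (threadIdx.x == 0) a[blockIdx.x] = cache[0];
}
```

The memcpy variant would copy the per-block partials back to the host with `cudaMemcpy` and finish the sum on the CPU; the in-place variant would relaunch the kernel on the partials array with a shrinking grid until a single value remains, keeping everything on the device.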
| Map operations | duty | |
|---|---|---|
| Reduce operations | memcpy duty | |
| Block per vertex | launch | sort vertices/edges |
| Thread per vertex | launch | sort vertices/edges |
- Comparing various per-thread duty numbers for CUDA-based vector multiply.
- Comparing various per-thread duty numbers for CUDA-based vector element sum (memcpy).
- Comparing various launch configs for CUDA block-per-vertex PageRank (pull, CSR).
- Studying the effect of sorting vertices and/or edges by in-degree for CUDA block-per-vertex PageRank ...
- Comparing various launch configs for CUDA thread-per-vertex PageRank (pull, CSR).
- Studying the effect of sorting vertices and/or edges by in-degree for CUDA thread-per-vertex PageRank ...
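Assuming per-thread "duty" means the number of elements each thread processes (so that fewer, heavier threads are launched), the duty experiments above could be sketched for vector multiply as follows (hypothetical names):

```cuda
#include <cuda_runtime.h>

// Each thread multiplies DUTY elements, one per grid-stride step.
// The host would size the grid as ceil(N / (blockDim.x * DUTY)) blocks.
template <int DUTY>
__global__ void multiplyKernel(float *a, const float *x,
                               const float *y, int N) {
  int stride = gridDim.x * blockDim.x;
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  for (int d = 0; d < DUTY; ++d, i += stride)
    if (i < N) a[i] = x[i] * y[i];  // element-wise map
}
```

A duty of 1 reduces to the plain one-thread-per-element map; larger duties trade thread-launch overhead and occupancy against per-thread work, which is what the comparisons above measure.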