@aokomoriuta
Last active December 21, 2015 05:18
Results of running the blas3 benchmark (dense matrix × dense matrix) https://github.com/viennacl/viennacl-dev/blob/master/examples/benchmarks/blas3.cpp from ViennaCL http://viennacl.sourceforge.net/ on a GeForce Titan. OpenCL turned out to be faster than CUDA.
----------------------------------------------
Device Info
----------------------------------------------
----------------------------------------------
----------------------------------------------
## Benchmark :: Dense Matrix-Matrix product
----------------------------------------------
-------------------------------
# benchmarking single-precision
-------------------------------
------ Benchmark 1: Matrix-Matrix product ------
- Execution time on device (no setup time included): 0.082591
- GFLOPs (counting multiply&add as separate operations): 208.011
------ Benchmark 2: Matrix-Matrix product using ranges ------
- Execution time on device (no setup time included): 0.010358
- GFLOPs (counting multiply&add as separate operations): 207.326
------ Benchmark 3: Matrix-Matrix product using slices ------
- Execution time on device (no setup time included): 0.010391
- GFLOPs (counting multiply&add as separate operations): 206.668
------ Benchmark 4: LU factorization ------
- Execution time on device (no setup time included): 0.775203
- GFLOPs (counting multiply&add as separate operations): 22.1618
-------------------------------
# benchmarking double-precision
-------------------------------
------ Benchmark 1: Matrix-Matrix product ------
- Execution time on device (no setup time included): 0.082417
- GFLOPs (counting multiply&add as separate operations): 208.451
------ Benchmark 2: Matrix-Matrix product using ranges ------
- Execution time on device (no setup time included): 0.010409
- GFLOPs (counting multiply&add as separate operations): 206.31
------ Benchmark 3: Matrix-Matrix product using slices ------
- Execution time on device (no setup time included): 0.012321
- GFLOPs (counting multiply&add as separate operations): 174.295
------ Benchmark 4: LU factorization ------
- Execution time on device (no setup time included): 0.823958
- GFLOPs (counting multiply&add as separate operations): 20.8504
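The GFLOPs lines in the log above appear consistent with counting 2·N³ floating-point operations (multiply and add counted separately) and dividing by the execution time. The following is a minimal sketch of that arithmetic, not the benchmark's actual code. The sizes N = 2048 for the full product and N = 1024 for the range and slice variants are assumptions inferred from the logged numbers; the log itself does not state them. The helper name `gflops` is hypothetical.

```python
def gflops(n: int, seconds: float) -> float:
    """GFLOPs for an n x n dense product, counting 2*n^3 operations."""
    return 2.0 * n**3 / seconds / 1e9


# Times copied from the log above; matrix sizes are assumed, not logged.
full_product = gflops(2048, 0.082591)   # Benchmark 1, single precision
range_product = gflops(1024, 0.010358)  # Benchmark 2, single precision
lu_figure = gflops(2048, 0.775203)      # Benchmark 4, single precision

print(f"full product : {full_product:.3f} GFLOPs")
print(f"range product: {range_product:.3f} GFLOPs")
print(f"LU           : {lu_figure:.3f} GFLOPs")
```

Under these assumptions the computed values agree with the logged 208.011, 207.326, and 22.1618 to within rounding. The LU figure matching the same 2·N³ count suggests the benchmark reuses the matrix-product operation count there, rather than the roughly (2/3)·N³ operations of an actual LU factorization, so the LU GFLOPs are best read as a relative number.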
----------------------------------------------
Device Info
----------------------------------------------
CL Device Vendor ID: 4318
CL Device Name: GeForce GTX TITAN
CL Driver Version: 319.37
--------------------------------
CL Device Max Compute Units: 14
CL Device Max Work Group Size: 1024
CL Device Global Mem Size: 6441730048
CL Device Local Mem Size: 49152
----------------------------------------------
----------------------------------------------
## Benchmark :: Dense Matrix-Matrix product
----------------------------------------------
-------------------------------
# benchmarking single-precision
-------------------------------
------ Benchmark 1: Matrix-Matrix product ------
- Device Name: GeForce GTX TITAN
- Execution time on device (no setup time included): 0.024256
- GFLOPs (counting multiply&add as separate operations): 708.273
------ Benchmark 2: Matrix-Matrix product using ranges ------
- Device Name: GeForce GTX TITAN
- Execution time on device (no setup time included): 0.003088
- GFLOPs (counting multiply&add as separate operations): 695.429
------ Benchmark 3: Matrix-Matrix product using slices ------
- Device Name: GeForce GTX TITAN
- Execution time on device (no setup time included): 0.003666
- GFLOPs (counting multiply&add as separate operations): 585.784
------ Benchmark 4: LU factorization ------
- Device Name: GeForce GTX TITAN
- Execution time on device (no setup time included): 0.775367
- GFLOPs (counting multiply&add as separate operations): 22.1571
-------------------------------
# benchmarking double-precision
-------------------------------
------ Benchmark 1: Matrix-Matrix product ------
- Device Name: GeForce GTX TITAN
- Execution time on device (no setup time included): 0.045137
- GFLOPs (counting multiply&add as separate operations): 380.616
------ Benchmark 2: Matrix-Matrix product using ranges ------
- Device Name: GeForce GTX TITAN
- Execution time on device (no setup time included): 0.005956
- GFLOPs (counting multiply&add as separate operations): 360.558
------ Benchmark 3: Matrix-Matrix product using slices ------
- Device Name: GeForce GTX TITAN
- Execution time on device (no setup time included): 0.007147
- GFLOPs (counting multiply&add as separate operations): 300.473
------ Benchmark 4: LU factorization ------
- Device Name: GeForce GTX TITAN
- Execution time on device (no setup time included): 0.825527
- GFLOPs (counting multiply&add as separate operations): 20.8108
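To put the comparison into numbers, the GFLOPs from the two logs above can be divided pairwise. This is a small sketch under one assumption: the first log (which prints no CL device information) is taken to be the CUDA run and the second log the OpenCL run. The logs themselves are not labeled, and the dictionary names below are hypothetical.

```python
# GFLOPs copied from the two logs above.
# Assumption: first log = CUDA backend, second log = OpenCL backend.
cuda = {
    ("single", "product"): 208.011, ("single", "ranges"): 207.326,
    ("single", "slices"): 206.668,  ("single", "lu"): 22.1618,
    ("double", "product"): 208.451, ("double", "ranges"): 206.31,
    ("double", "slices"): 174.295,  ("double", "lu"): 20.8504,
}
opencl = {
    ("single", "product"): 708.273, ("single", "ranges"): 695.429,
    ("single", "slices"): 585.784,  ("single", "lu"): 22.1571,
    ("double", "product"): 380.616, ("double", "ranges"): 360.558,
    ("double", "slices"): 300.473,  ("double", "lu"): 20.8108,
}

# Ratio > 1 means the OpenCL run reported more GFLOPs than the CUDA run.
speedup = {key: opencl[key] / cuda[key] for key in cuda}

for (precision, bench), ratio in speedup.items():
    print(f"{precision:6s} {bench:8s} OpenCL/CUDA = {ratio:.2f}x")
```

With these figures the matrix products come out roughly 3.4x faster in single precision and roughly 1.7x to 1.8x faster in double precision on the OpenCL run, while the LU factorization is essentially identical between the two runs.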