@gwtaylor
Created March 29, 2012 21:16
Output of check_blas.py under three BLAS configurations: generic -lblas, an empty blas.ldflags (falling back to numpy/scipy's own BLAS), and MKL loaded via DYLD_FALLBACK_LIBRARY_PATH.
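For orientation: check_blas.py times repeated gemm calls through a compiled Theano function. The sketch below is not the script itself, only a minimal reconstruction of that kind of timing loop under the setup described in the banner (10 float64 gemm calls on 2000x2000 C-ordered matrices), assuming the Theano 0.x API:

# Minimal sketch (not check_blas.py itself) of a Theano gemm timing loop.
import time

import numpy
import theano
import theano.tensor as T

M = N = K = 2000
iters = 10

# Shared variables keep the operands inside Theano, so the timed loop
# measures the gemm call itself rather than Python-side data transfer.
a = theano.shared(numpy.ones((M, N), dtype='float64'))
b = theano.shared(numpy.ones((N, K), dtype='float64'))
c = theano.shared(numpy.ones((M, K), dtype='float64'))

# c <- 0.4*c + 0.8*dot(a, b): a form Theano's optimizer turns into a
# single BLAS gemm call.
f = theano.function([], updates=[(c, 0.4 * c + 0.8 * T.dot(a, b))])

t0 = time.time()
for _ in range(iters):
    f()
print('%.2fs for %d gemm calls' % (time.time() - t0, iters))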
(x86_64-apple-darwin10.8.0)dhcp0-176:Theano gwtaylor$ THEANO_FLAGS=blas.ldflags=-lblas python theano/misc/check_blas.py
Some results that you can compare against. They come from 10 executions
of gemm in float64 with matrices of shape 2000x2000 (M=N=K=2000).
All memory was laid out in C order.

CPU tested: Xeon E5345 (2.33GHz, 8M L2 cache, 1333MHz FSB),
            Xeon E5430 (2.66GHz, 12M L2 cache, 1333MHz FSB),
            Xeon E5450 (3GHz, 12M L2 cache, 1333MHz FSB),
            Xeon X5560 (2.8GHz, 12M L2 cache, hyper-threads?),
            Core 2 E8500, Core i7 930 (2.8GHz, hyper-threads enabled),
            Core i7 950 (3.07GHz, hyper-threads enabled),
            Xeon X5550 (2.67GHz, 8M L2 cache?, hyper-threads enabled)

Libraries tested:
    * numpy with ATLAS from the distribution (FC9) package (1 thread)
    * manually compiled numpy and ATLAS with 2 threads
    * goto 1.26 with 1, 2, 4, and 8 threads
    * goto2 1.13 compiled with multiple threads enabled
                   Xeon    Xeon    Xeon    Core2   i7      i7      Xeon    Xeon
lib/nb threads     E5345   E5430   E5450   E8500   930     950     X5560   X5550
numpy 1.3.0 blas                                                           775.92s
numpy_FC9_atlas/1  39.2s   35.0s   30.7s   29.6s   21.5s           19.60s
goto/1             18.7s   16.1s   14.2s   13.7s   16.1s           14.67s
numpy_MAN_atlas/2  12.0s   11.6s   10.2s   9.2s    9.0s
goto/2             9.5s    8.1s    7.1s    7.3s    8.1s            7.4s
goto/4             4.9s    4.4s    3.7s    -       4.1s            3.8s
goto/8             2.7s    2.4s    2.0s    -       4.1s            3.8s
openblas/1                                                 14.04s
openblas/2                                                 7.16s
openblas/4                                                 3.71s
openblas/8                                                 3.70s
mkl 11.0.083/1                                             7.97s
mkl 10.2.2.025/1                                           13.7s
mkl 10.2.2.025/2                                           7.6s
mkl 10.2.2.025/4                                           4.0s
mkl 10.2.2.025/8                                           2.0s
goto2 1.13/1                                                               14.37s
goto2 1.13/2                                                               7.26s
goto2 1.13/4                                                               3.70s
goto2 1.13/8                                                               1.94s
goto2 1.13/16                                                              3.16s
Test time in float32 with CUDA 3.0.14.
(CUDA versions 3.2 RC and up have a faster gemm on Fermi cards,
i.e. the GTX 4xx/5xx series.)

gpu/cuda version    time
GTX580/3.2          0.20s
GTX480/3.2          0.24s
GTX480/3.0          0.27s
GTX470/3.2          0.29s
M2070/3.2           0.32s
GTX470/3.0          0.34s
GTX285/3.0          0.40s
C1060/3.2           0.46s
GTX550Ti/4.0        0.57s
GT220/3.2RC         3.80s
GT210/4.0           6.35s
8500GT/3.0          10.68s
Some Theano flags:
blas.ldflags= -lblas
compiledir= /Users/gwtaylor/.theano/compiledir_Darwin-10.8.0-i386-64bit-i386-2.7.2
floatX= float64
Some environment variables:
MKL_NUM_THREADS= 2
OMP_NUM_THREADS= None
GOTO_NUM_THREADS= None
Numpy config: (used when the Theano flag "blas.ldflags" is empty)
lapack_opt_info:
    libraries = ['mkl_lapack95_lp64', 'mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'mkl_mc', 'mkl_mc3', 'pthread']
    library_dirs = ['/Library/Frameworks/EPD64.framework/Versions/1.4.2/lib']
    define_macros = [('SCIPY_MKL_H', None)]
    include_dirs = ['/Library/Frameworks/EPD64.framework/Versions/1.4.2/include']
blas_opt_info:
    libraries = ['mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'mkl_mc', 'mkl_mc3', 'pthread']
    library_dirs = ['/Library/Frameworks/EPD64.framework/Versions/1.4.2/lib']
    define_macros = [('SCIPY_MKL_H', None)]
    include_dirs = ['/Library/Frameworks/EPD64.framework/Versions/1.4.2/include']
lapack_mkl_info:
    libraries = ['mkl_lapack95_lp64', 'mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'mkl_mc', 'mkl_mc3', 'pthread']
    library_dirs = ['/Library/Frameworks/EPD64.framework/Versions/1.4.2/lib']
    define_macros = [('SCIPY_MKL_H', None)]
    include_dirs = ['/Library/Frameworks/EPD64.framework/Versions/1.4.2/include']
blas_mkl_info:
    libraries = ['mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'mkl_mc', 'mkl_mc3', 'pthread']
    library_dirs = ['/Library/Frameworks/EPD64.framework/Versions/1.4.2/lib']
    define_macros = [('SCIPY_MKL_H', None)]
    include_dirs = ['/Library/Frameworks/EPD64.framework/Versions/1.4.2/include']
mkl_info:
    libraries = ['mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'mkl_mc', 'mkl_mc3', 'pthread']
    library_dirs = ['/Library/Frameworks/EPD64.framework/Versions/1.4.2/lib']
    define_macros = [('SCIPY_MKL_H', None)]
    include_dirs = ['/Library/Frameworks/EPD64.framework/Versions/1.4.2/include']
Numpy dot module: numpy.core._dotblas
Numpy location: /Library/Frameworks/EPD64.framework/Versions/7.2/lib/python2.7/site-packages/numpy/__init__.py
Numpy version: 1.6.1
We executed 10 calls to gemm with a and b matrices of shapes (2000, 2000) and (2000, 2000).
Total execution time: 10.81s on CPU (with direct Theano binding to blas).
Try running this script a few times; experience shows the first run is not as fast as subsequent runs. The difference is not big, but it is consistent.
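To put these totals in perspective: one M=N=K=2000 gemm performs about 2*M*N*K = 1.6e10 floating-point operations, and each reported figure covers 10 calls. A quick conversion to throughput (the GPU comparison assumes the float32 table uses the same 10-call setup, which the banner does not state explicitly):

# Back-of-the-envelope throughput implied by the timings above.
M = N = K = 2000
flops_per_call = 2.0 * M * N * K          # multiplies plus adds in one gemm

for label, seconds in [('this run (-lblas, float64)', 10.81),
                       ('GTX580 reference (float32)', 0.20)]:
    gflops = flops_per_call * 10 / seconds / 1e9
    print('%-28s %6.1f GFLOP/s' % (label, gflops))
# roughly 14.8 GFLOP/s for this CPU run vs ~800 GFLOP/s on the GTX580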
(x86_64-apple-darwin10.8.0)dhcp0-176:Theano gwtaylor$ THEANO_FLAGS=blas.ldflags= python theano/misc/check_blas.py
[Reference CPU and GPU tables identical to the first run omitted.]
Some Theano flags:
blas.ldflags=
compiledir= /Users/gwtaylor/.theano/compiledir_Darwin-10.8.0-i386-64bit-i386-2.7.2
floatX= float64
Some environment variables:
MKL_NUM_THREADS= 2
OMP_NUM_THREADS= None
GOTO_NUM_THREADS= None
[Numpy config, dot module, location, and version identical to the first run omitted.]
We executed 10 calls to gemm with a and b matrices of shapes (2000, 2000) and (2000, 2000).
Total execution time: 8.10s on CPU (without direct Theano binding to blas but with numpy/scipy binding to blas).
Try running this script a few times; experience shows the first run is not as fast as subsequent runs. The difference is not big, but it is consistent.
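Note that this run, with blas.ldflags left empty (8.10s), beats the generic -lblas run above (10.81s): without ldflags Theano falls back to numpy's dot, and EPD's numpy is linked against MKL, as the config dump shows. A small check of what numpy itself is linked against (the numpy.core._dotblas probe holds only for numpy older than 1.10, which matches the 1.6.1 used here):

# Inspect the BLAS that numpy was built against; this is the code path
# Theano falls back to when blas.ldflags is empty.
import numpy

numpy.show_config()   # prints the same *_info blocks as the dump above

# On numpy < 1.10, numpy.dot is BLAS-accelerated only when the optional
# _dotblas extension was built (the module was removed in numpy 1.10).
try:
    import numpy.core._dotblas
    print('numpy.dot goes through an optimized BLAS')
except ImportError:
    print('numpy.dot uses the slow internal fallback')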
(x86_64-apple-darwin10.8.0)dhcp0-176:Theano gwtaylor$ DYLD_FALLBACK_LIBRARY_PATH=/Library/Frameworks/EPD64.framework/Versions/Current/lib python theano/misc/check_blas.py
[Reference CPU and GPU tables identical to the first run omitted.]
Some Theano flags:
blas.ldflags= -L/Users/gwtaylor/virtualenvs/x86_64-apple-darwin10.8.0/bin/../lib -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -lmkl_mc -lmkl_mc3 -lpthread
compiledir= /Users/gwtaylor/.theano/compiledir_Darwin-10.8.0-i386-64bit-i386-2.7.2
floatX= float64
Some environment variables:
MKL_NUM_THREADS= 2
OMP_NUM_THREADS= None
GOTO_NUM_THREADS= None
[Numpy config, dot module, location, and version identical to the first run omitted.]
We executed 10 calls to gemm with a and b matrices of shapes (2000, 2000) and (2000, 2000).
Total execution time: 7.50s on CPU (with direct Theano binding to blas).
Try running this script a few times; experience shows the first run is not as fast as subsequent runs. The difference is not big, but it is consistent.
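The MKL-bound configuration is the fastest of the three CPU runs (7.50s vs 8.10s and 10.81s). To make it stick without retyping the flags each time, the same settings can be placed in the environment (or in an equivalent [blas] ldflags entry in ~/.theanorc) before Theano is imported. A sketch, with the MKL library directory left as a placeholder:

# Sketch: pin Theano to MKL before it is first imported. The -L path is a
# placeholder; point it at the directory that holds the MKL dylibs (the
# run above used the virtualenv's lib directory).
import os

os.environ['THEANO_FLAGS'] = (
    'blas.ldflags=-L/path/to/mkl/lib '
    '-lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -lpthread'
)
os.environ['MKL_NUM_THREADS'] = '2'    # MKL thread count used in these runs

import theano                          # must come after the env vars are set
print(theano.config.blas.ldflags)      # confirm what Theano picked up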