Created May 3, 2017 22:06
(tf-35) c:\dev>python check_blas.py
WARNING (theano.sandbox.cuda): The cuda backend is deprecated and will be removed in the next release (v0.10). Please switch to the gpuarray backend. You can get more information about how to switch at this URL:
 https://github.com/Theano/Theano/wiki/Converting-to-the-new-gpu-back-end%28gpuarray%29
Using gpu device 0: GeForce GTX 1080 Ti (CNMeM is disabled, cuDNN 6021)
C:\dev\Anaconda2\envs\tf-35\lib\site-packages\theano\sandbox\cuda\__init__.py:631: UserWarning: Your cuDNN version is more recent than the one Theano officially supports. If you see any problems, try updating Theano or downgrading cuDNN to version 5.1.
  warnings.warn(warn)
Some results that you can compare against. They were 10 executions
of gemm in float64 with matrices of shape 2000x2000 (M=N=K=2000).
All memory layout was in C order.
CPU tested: Xeon E5345(2.33Ghz, 8M L2 cache, 1333Mhz FSB),
            Xeon E5430(2.66Ghz, 12M L2 cache, 1333Mhz FSB),
            Xeon E5450(3Ghz, 12M L2 cache, 1333Mhz FSB),
            Xeon X5560(2.8Ghz, 12M L2 cache, hyper-threads?)
            Core 2 E8500, Core i7 930(2.8Ghz, hyper-threads enabled),
            Core i7 950(3.07GHz, hyper-threads enabled)
            Xeon X5550(2.67GHz, 8M l2 cache?, hyper-threads enabled)

Libraries tested:
    * numpy with ATLAS from distribution (FC9) package (1 thread)
    * manually compiled numpy and ATLAS with 2 threads
    * goto 1.26 with 1, 2, 4 and 8 threads
    * goto2 1.13 compiled with multiple threads enabled
                   Xeon   Xeon   Xeon   Core2  i7     i7      Xeon   Xeon
lib/nb threads     E5345  E5430  E5450  E8500  930    950     X5560  X5550

numpy 1.3.0 blas                                                     775.92s
numpy_FC9_atlas/1  39.2s  35.0s  30.7s  29.6s  21.5s  19.60s
goto/1             18.7s  16.1s  14.2s  13.7s  16.1s  14.67s
numpy_MAN_atlas/2  12.0s  11.6s  10.2s  9.2s   9.0s
goto/2             9.5s   8.1s   7.1s   7.3s   8.1s   7.4s
goto/4             4.9s   4.4s   3.7s   -      4.1s   3.8s
goto/8             2.7s   2.4s   2.0s   -      4.1s   3.8s
openblas/1                                     14.04s
openblas/2                                     7.16s
openblas/4                                     3.71s
openblas/8                                     3.70s
mkl 11.0.083/1                                 7.97s
mkl 10.2.2.025/1                               13.7s
mkl 10.2.2.025/2                               7.6s
mkl 10.2.2.025/4                               4.0s
mkl 10.2.2.025/8                               2.0s
goto2 1.13/1                                   14.37s
goto2 1.13/2                                   7.26s
goto2 1.13/4                                   3.70s
goto2 1.13/8                                   1.94s
goto2 1.13/16                                  3.16s
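The float64 reference numbers above can be reproduced with plain NumPy, which dispatches `dot` on float64 C-order matrices to the linked BLAS's dgemm. A minimal sketch (same 2000x2000 shape and 10 calls as the benchmark, variable names are my own):

```python
import time

import numpy as np

# Time 10 float64 gemm calls, as in the benchmark above.
M = N = K = 2000
a = np.random.rand(M, K)   # float64, C order by default
b = np.random.rand(K, N)

start = time.time()
for _ in range(10):
    c = a.dot(b)           # dispatches to the linked BLAS's dgemm
elapsed = time.time() - start
print("10 x float64 gemm (2000x2000): %.2fs" % elapsed)
```

Comparing the printed time against the table gives a rough idea of where a given CPU/BLAS combination lands.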
Test time in float32. There were 10 executions of gemm in
float32 with matrices of shape 5000x5000 (M=N=K=5000).
All memory layout was in C order.
cuda version      8.0    7.5    7.0
gpu
M40               0.45s  0.47s
k80               0.92s  0.96s
K6000/NOECC       0.71s         0.69s
P6000/NOECC       0.25s
Titan X (Pascal)  0.28s
GTX Titan X       0.45s  0.45s  0.47s
GTX Titan Black   0.66s  0.64s  0.64s
GTX 1080          0.35s
GTX 980 Ti               0.41s
GTX 970                  0.66s
GTX 680                  1.57s
GTX 750 Ti               2.01s  2.01s
GTX 750                  2.46s  2.37s
GTX 660                  2.32s  2.32s
GTX 580                         2.42s
GTX 480                         2.87s
TX1                             7.6s (float32 storage and computation)
GT 610                          33.5s
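The float32 measurement that produced the GPU table can be sketched on the CPU with NumPy as well. A hedged sketch (the benchmark uses 5000x5000 matrices; a smaller size is used here purely so the example finishes quickly):

```python
import time

import numpy as np

# Time 10 float32 gemm calls; float32 is what the GPU timings above measure
# and halves memory traffic relative to float64.
n = 1000   # the actual benchmark uses n = 5000
a32 = np.random.rand(n, n).astype(np.float32)
b32 = np.random.rand(n, n).astype(np.float32)

start = time.time()
for _ in range(10):
    c32 = a32 @ b32        # sgemm via the linked BLAS
elapsed32 = time.time() - start
print("10 x float32 gemm (%dx%d): %.2fs" % (n, n, elapsed32))
```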
Some Theano flags:
    blas.ldflags= -LC:\dev\Anaconda2\Library\bin -lmkl_rt
    compiledir= C:\Users\darth\AppData\Local\Theano\compiledir_Windows-10-10.0.14393-SP0-Intel64_Family_6_Model_63_Stepping_2_GenuineIntel-3.5.3-64
    floatX= float32
    device= gpu
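Flags like these can also be set persistently in a `.theanorc` file (in the user's home directory) instead of via `THEANO_FLAGS`. A hypothetical fragment mirroring the values reported above:

```ini
; ~/.theanorc -- example only, mirroring the flags printed above
[global]
floatX = float32
device = gpu

[blas]
ldflags = -LC:\dev\Anaconda2\Library\bin -lmkl_rt
```

Dotted flag names map to sections: `blas.ldflags` becomes option `ldflags` under `[blas]`.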
Some OS information:
    sys.platform= win32
    sys.version= 3.5.3 |Continuum Analytics, Inc.| (default, Feb 22 2017, 21:28:42) [MSC v.1900 64 bit (AMD64)]
    sys.prefix= C:\dev\Anaconda2\envs\tf-35
Some environment variables:
    MKL_NUM_THREADS= None
    OMP_NUM_THREADS= None
    GOTO_NUM_THREADS= None
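All three thread-count variables are unset here, so the BLAS picks its own thread count. A hedged sketch of capping them before a run (which variable takes effect depends on the BLAS actually linked, MKL in this case):

```shell
# Cap BLAS threading via the variables the script inspects.
export MKL_NUM_THREADS=4     # Intel MKL
export OMP_NUM_THREADS=4     # OpenMP-based BLAS, e.g. OpenBLAS
export GOTO_NUM_THREADS=4    # GotoBLAS / goto2
env | grep _NUM_THREADS
```

With these exported, re-running `python check_blas.py` would report the values instead of None, and multi-threaded CPU timings become reproducible.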
Numpy config: (used when the Theano flag "blas.ldflags" is empty)
lapack_mkl_info:
    libraries = ['mkl_core_dll', 'mkl_intel_lp64_dll', 'mkl_intel_thread_dll']
    include_dirs = ['C:/dev/Anaconda2/envs/tf-35\\Library\\include']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    library_dirs = ['C:/dev/Anaconda2/envs/tf-35\\Library\\lib']
lapack_opt_info:
    libraries = ['mkl_core_dll', 'mkl_intel_lp64_dll', 'mkl_intel_thread_dll']
    include_dirs = ['C:/dev/Anaconda2/envs/tf-35\\Library\\include']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    library_dirs = ['C:/dev/Anaconda2/envs/tf-35\\Library\\lib']
blas_mkl_info:
    libraries = ['mkl_core_dll', 'mkl_intel_lp64_dll', 'mkl_intel_thread_dll']
    include_dirs = ['C:/dev/Anaconda2/envs/tf-35\\Library\\include']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    library_dirs = ['C:/dev/Anaconda2/envs/tf-35\\Library\\lib']
blas_opt_info:
    libraries = ['mkl_core_dll', 'mkl_intel_lp64_dll', 'mkl_intel_thread_dll']
    include_dirs = ['C:/dev/Anaconda2/envs/tf-35\\Library\\include']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    library_dirs = ['C:/dev/Anaconda2/envs/tf-35\\Library\\lib']
Numpy dot module: numpy.core.multiarray
Numpy location: C:\dev\Anaconda2\envs\tf-35\lib\site-packages\numpy\__init__.py
Numpy version: 1.12.1
We executed 10 calls to gemm with a and b matrices of shapes (5000, 5000) and (5000, 5000).
Total execution time: 0.00s on GPU.
Try running this script a few times. Experience shows that the first run is not as fast as the following calls. The difference is not big, but consistent.
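The warm-up effect described above can be sketched directly: do one untimed "throwaway" call first so one-time costs (thread-pool spin-up, code paths being paged in, or, in the Theano GPU case, kernel compilation) are excluded from the measurement. A minimal sketch with hypothetical sizes:

```python
import time

import numpy as np

n = 1000
a = np.random.rand(n, n)
b = np.random.rand(n, n)

a.dot(b)                   # warm-up call, result discarded

start = time.time()
for _ in range(10):
    a.dot(b)
timed = time.time() - start
print("warmed-up 10-call time: %.2fs" % timed)
```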