darthdeus/gist:d96cd056ab7725d42676f2882dc3918b

## gistfile1.txt
(tf-35) c:\dev>python check_blas.py
WARNING (theano.sandbox.cuda): The cuda backend is deprecated and will be removed in the next release (v0.10).  Please switch to the gpuarray backend. You can get more information about how to switch at this URL:
 https://github.com/Theano/Theano/wiki/Converting-to-the-new-gpu-back-end%28gpuarray%29

Using gpu device 0: GeForce GTX 1080 Ti (CNMeM is disabled, cuDNN 6021)
C:\dev\Anaconda2\envs\tf-35\lib\site-packages\theano\sandbox\cuda\__init__.py:631: UserWarning: Your cuDNN version is more recent than the one Theano officially supports. If you see any problems, try updating Theano or downgrading cuDNN to version 5.1.
  warnings.warn(warn)

        Some results that you can compare against. They were 10 executions
        of gemm in float64 with matrices of shape 2000x2000 (M=N=K=2000).
        All memory layout was in C order.

        CPU tested: Xeon E5345(2.33Ghz, 8M L2 cache, 1333Mhz FSB),
                    Xeon E5430(2.66Ghz, 12M L2 cache, 1333Mhz FSB),
                    Xeon E5450(3Ghz, 12M L2 cache, 1333Mhz FSB),
                    Xeon X5560(2.8Ghz, 12M L2 cache, hyper-threads?)
                    Core 2 E8500, Core i7 930(2.8Ghz, hyper-threads enabled),
                    Core i7 950(3.07GHz, hyper-threads enabled)
                    Xeon X5550(2.67GHz, 8M l2 cache?, hyper-threads enabled)


        Libraries tested:
            * numpy with ATLAS from distribution (FC9) package (1 thread)
            * manually compiled numpy and ATLAS with 2 threads
            * goto 1.26 with 1, 2, 4 and 8 threads
            * goto2 1.13 compiled with multiple threads enabled

                          Xeon   Xeon   Xeon  Core2 i7    i7     Xeon   Xeon
        lib/nb threads    E5345  E5430  E5450 E8500 930   950    X5560  X5550

        numpy 1.3.0 blas                                                775.92s
        numpy_FC9_atlas/1 39.2s  35.0s  30.7s 29.6s 21.5s 19.60s
        goto/1            18.7s  16.1s  14.2s 13.7s 16.1s 14.67s
        numpy_MAN_atlas/2 12.0s  11.6s  10.2s  9.2s  9.0s
        goto/2             9.5s   8.1s   7.1s  7.3s  8.1s  7.4s
        goto/4             4.9s   4.4s   3.7s  -     4.1s  3.8s
        goto/8             2.7s   2.4s   2.0s  -     4.1s  3.8s
        openblas/1                                        14.04s
        openblas/2                                         7.16s
        openblas/4                                         3.71s
        openblas/8                                         3.70s
        mkl 11.0.083/1            7.97s
        mkl 10.2.2.025/1                                         13.7s
        mkl 10.2.2.025/2                                          7.6s
        mkl 10.2.2.025/4                                          4.0s
        mkl 10.2.2.025/8                                          2.0s
        goto2 1.13/1                                                     14.37s
        goto2 1.13/2                                                      7.26s
        goto2 1.13/4                                                      3.70s
        goto2 1.13/8                                                      1.94s
        goto2 1.13/16                                                     3.16s

        Test time in float32. There were 10 executions of gemm in
        float32 with matrices of shape 5000x5000 (M=N=K=5000)
        All memory layout was in C order.


        cuda version      8.0    7.5    7.0
        gpu
        M40               0.45s  0.47s
        k80               0.92s  0.96s
        K6000/NOECC       0.71s         0.69s
        P6000/NOECC       0.25s

        Titan X (Pascal)  0.28s
        GTX Titan X       0.45s  0.45s  0.47s
        GTX Titan Black   0.66s  0.64s  0.64s
        GTX 1080          0.35s
        GTX 980 Ti               0.41s
        GTX 970                  0.66s
        GTX 680                         1.57s
        GTX 750 Ti               2.01s  2.01s
        GTX 750                  2.46s  2.37s
        GTX 660                  2.32s  2.32s
        GTX 580                  2.42s
        GTX 480                  2.87s
        TX1                             7.6s (float32 storage and computation)
        GT 610                          33.5s
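
To compare the reference timings above across matrix sizes, it can help to convert them into throughput. The sketch below is not part of `check_blas.py`; the helper name `gemm_gflops` is hypothetical, and it assumes the usual approximation of about 2·n³ floating-point operations per square n x n gemm call.

```python
def gemm_gflops(n, seconds, calls=10):
    """Approximate GFLOP/s for `calls` square n x n gemm calls taking `seconds` in total."""
    # Assumption: one multiply and one add per inner-product term, so ~2 * n^3 per call.
    flops = calls * 2 * n ** 3
    return flops / seconds / 1e9


# Timings copied from the float32 table above (M=N=K=5000, CUDA 8.0 column).
reference = {"P6000/NOECC": 0.25, "Titan X (Pascal)": 0.28, "GTX 1080": 0.35}
for gpu, seconds in sorted(reference.items()):
    print("{}: {:.0f} GFLOP/s".format(gpu, gemm_gflops(5000, seconds)))
```

The same helper applies to the float64 CPU table by passing `n=2000`.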

Some Theano flags:
    blas.ldflags= -LC:\dev\Anaconda2\Library\bin -lmkl_rt
    compiledir= C:\Users\darth\AppData\Local\Theano\compiledir_Windows-10-10.0.14393-SP0-Intel64_Family_6_Model_63_Stepping_2_GenuineIntel-3.5.3-64
    floatX= float32
    device= gpu
Some OS information:
    sys.platform= win32
    sys.version= 3.5.3 |Continuum Analytics, Inc.| (default, Feb 22 2017, 21:28:42) [MSC v.1900 64 bit (AMD64)]
    sys.prefix= C:\dev\Anaconda2\envs\tf-35
Some environment variables:
    MKL_NUM_THREADS= None
    OMP_NUM_THREADS= None
    GOTO_NUM_THREADS= None
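
All three variables report `None`, meaning they were unset for this run. A minimal sketch of pinning the CPU BLAS thread count from Python follows; the value `"4"` and the helper name `thread_settings` are illustrative assumptions, and the variables only affect CPU BLAS calls, not the GPU gemm timed here.

```python
import os

# These must be set before NumPy (and the BLAS library it links to) is first imported.
# "4" is an arbitrary example value, not a recommendation.
os.environ.setdefault("MKL_NUM_THREADS", "4")
os.environ.setdefault("OMP_NUM_THREADS", "4")


def thread_settings():
    """Return the thread-related variables in the same form the script prints them."""
    names = ("MKL_NUM_THREADS", "OMP_NUM_THREADS", "GOTO_NUM_THREADS")
    return {name: os.environ.get(name) for name in names}


for name, value in sorted(thread_settings().items()):
    print("{}= {}".format(name, value))
```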

Numpy config: (used when the Theano flag "blas.ldflags" is empty)
lapack_mkl_info:
    libraries = ['mkl_core_dll', 'mkl_intel_lp64_dll', 'mkl_intel_thread_dll']
    include_dirs = ['C:/dev/Anaconda2/envs/tf-35\\Library\\include']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    library_dirs = ['C:/dev/Anaconda2/envs/tf-35\\Library\\lib']
lapack_opt_info:
    libraries = ['mkl_core_dll', 'mkl_intel_lp64_dll', 'mkl_intel_thread_dll']
    include_dirs = ['C:/dev/Anaconda2/envs/tf-35\\Library\\include']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    library_dirs = ['C:/dev/Anaconda2/envs/tf-35\\Library\\lib']
blas_mkl_info:
    libraries = ['mkl_core_dll', 'mkl_intel_lp64_dll', 'mkl_intel_thread_dll']
    include_dirs = ['C:/dev/Anaconda2/envs/tf-35\\Library\\include']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    library_dirs = ['C:/dev/Anaconda2/envs/tf-35\\Library\\lib']
blas_opt_info:
    libraries = ['mkl_core_dll', 'mkl_intel_lp64_dll', 'mkl_intel_thread_dll']
    include_dirs = ['C:/dev/Anaconda2/envs/tf-35\\Library\\include']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    library_dirs = ['C:/dev/Anaconda2/envs/tf-35\\Library\\lib']
Numpy dot module: numpy.core.multiarray
Numpy location: C:\dev\Anaconda2\envs\tf-35\lib\site-packages\numpy\__init__.py
Numpy version: 1.12.1

We executed 10 calls to gemm with a and b matrices of shapes (5000, 5000) and (5000, 5000).

Total execution time: 0.00s on GPU.

Try to run this script a few times. Experience shows that the first time is not as fast as followings calls. The difference is not big, but consistent.
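
The advice about the first run being slower can be reproduced within a single process by adding an untimed warm-up call. The sketch below is a CPU-only NumPy illustration, not the Theano script's own code; `time_calls` and `demo` are hypothetical names, and the matrix size is reduced to keep it quick. On a GPU the device would also need to be synchronized before the clock is read.

```python
import time


def time_calls(fn, calls=10, warmup=1):
    """Return total wall-clock seconds for `calls` invocations of fn, after `warmup` untimed ones."""
    for _ in range(warmup):
        fn()  # untimed: absorbs one-off costs such as library loading
    start = time.perf_counter()
    for _ in range(calls):
        fn()
    return time.perf_counter() - start


def demo(n=1000):
    import numpy as np  # imported here so time_calls itself has no dependencies

    rng = np.random.RandomState(0)
    a = rng.rand(n, n).astype("float32")
    b = rng.rand(n, n).astype("float32")
    first = time_calls(lambda: np.dot(a, b), warmup=0)   # includes any first-call cost
    second = time_calls(lambda: np.dot(a, b), warmup=1)  # measured after a warm-up call
    print("first batch: {:.3f}s  second batch: {:.3f}s".format(first, second))


if __name__ == "__main__":
    demo()
```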