@gwtaylor
Created March 29, 2012 21:16
Output of check_blas.py under three BLAS configurations: generic -lblas, an empty blas.ldflags (falling back to numpy/scipy's own BLAS), and MKL loaded via DYLD_FALLBACK_LIBRARY_PATH.
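For orientation: check_blas.py times repeated gemm calls through a compiled Theano function. The sketch below is not the script itself, only a minimal reconstruction of that kind of timing loop under the setup described in the banner (10 float64 gemm calls on 2000x2000 C-ordered matrices), assuming the Theano 0.x API:

# Minimal sketch (not check_blas.py itself) of a Theano gemm timing loop.
import time

import numpy
import theano
import theano.tensor as T

M = N = K = 2000
iters = 10

# Shared variables keep the operands inside Theano, so the timed loop
# measures the gemm call itself rather than Python-side data transfer.
a = theano.shared(numpy.ones((M, N), dtype='float64'))
b = theano.shared(numpy.ones((N, K), dtype='float64'))
c = theano.shared(numpy.ones((M, K), dtype='float64'))

# c <- 0.4*c + 0.8*dot(a, b): a form Theano's optimizer turns into a
# single BLAS gemm call.
f = theano.function([], updates=[(c, 0.4 * c + 0.8 * T.dot(a, b))])

t0 = time.time()
for _ in range(iters):
    f()
print('%.2fs for %d gemm calls' % (time.time() - t0, iters))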
(x86_64-apple-darwin10.8.0)dhcp0-176:Theano gwtaylor$ THEANO_FLAGS=blas.ldflags=-lblas python theano/misc/check_blas.py
Some results that you can compare against. They come from 10 executions
of gemm in float64 with matrices of shape 2000x2000 (M=N=K=2000).
All memory was laid out in C order.

CPU tested: Xeon E5345 (2.33GHz, 8M L2 cache, 1333MHz FSB),
            Xeon E5430 (2.66GHz, 12M L2 cache, 1333MHz FSB),
            Xeon E5450 (3GHz, 12M L2 cache, 1333MHz FSB),
            Xeon X5560 (2.8GHz, 12M L2 cache, hyper-threads?),
            Core 2 E8500, Core i7 930 (2.8GHz, hyper-threads enabled),
            Core i7 950 (3.07GHz, hyper-threads enabled),
            Xeon X5550 (2.67GHz, 8M L2 cache?, hyper-threads enabled)

Libraries tested:
    * numpy with ATLAS from the distribution (FC9) package (1 thread)
    * manually compiled numpy and ATLAS with 2 threads
    * goto 1.26 with 1, 2, 4, and 8 threads
    * goto2 1.13 compiled with multiple threads enabled
                   Xeon    Xeon    Xeon    Core2   i7      i7      Xeon    Xeon
lib/nb threads     E5345   E5430   E5450   E8500   930     950     X5560   X5550
numpy 1.3.0 blas                                                           775.92s
numpy_FC9_atlas/1  39.2s   35.0s   30.7s   29.6s   21.5s           19.60s
goto/1             18.7s   16.1s   14.2s   13.7s   16.1s           14.67s
numpy_MAN_atlas/2  12.0s   11.6s   10.2s   9.2s    9.0s
goto/2             9.5s    8.1s    7.1s    7.3s    8.1s            7.4s
goto/4             4.9s    4.4s    3.7s    -       4.1s            3.8s
goto/8             2.7s    2.4s    2.0s    -       4.1s            3.8s
openblas/1                                                 14.04s
openblas/2                                                 7.16s
openblas/4                                                 3.71s
openblas/8                                                 3.70s
mkl 11.0.083/1                                             7.97s
mkl 10.2.2.025/1                                           13.7s
mkl 10.2.2.025/2                                           7.6s
mkl 10.2.2.025/4                                           4.0s
mkl 10.2.2.025/8                                           2.0s
goto2 1.13/1                                                               14.37s
goto2 1.13/2                                                               7.26s
goto2 1.13/4                                                               3.70s
goto2 1.13/8                                                               1.94s
goto2 1.13/16                                                              3.16s
Test time in float32 with CUDA 3.0.14.
(CUDA versions 3.2 RC and up have a faster gemm on Fermi cards,
i.e. the GTX 4xx/5xx series.)

gpu/cuda version    time
GTX580/3.2          0.20s
GTX480/3.2          0.24s
GTX480/3.0          0.27s
GTX470/3.2          0.29s
M2070/3.2           0.32s
GTX470/3.0          0.34s
GTX285/3.0          0.40s
C1060/3.2           0.46s
GTX550Ti/4.0        0.57s
GT220/3.2RC         3.80s
GT210/4.0           6.35s
8500GT/3.0          10.68s
Some Theano flags:
blas.ldflags= -lblas
compiledir= /Users/gwtaylor/.theano/compiledir_Darwin-10.8.0-i386-64bit-i386-2.7.2
floatX= float64
Some environment variables:
MKL_NUM_THREADS= 2
OMP_NUM_THREADS= None
GOTO_NUM_THREADS= None
Numpy config: (used when the Theano flag "blas.ldflags" is empty)
lapack_opt_info:
    libraries = ['mkl_lapack95_lp64', 'mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'mkl_mc', 'mkl_mc3', 'pthread']
    library_dirs = ['/Library/Frameworks/EPD64.framework/Versions/1.4.2/lib']
    define_macros = [('SCIPY_MKL_H', None)]
    include_dirs = ['/Library/Frameworks/EPD64.framework/Versions/1.4.2/include']
blas_opt_info:
    libraries = ['mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'mkl_mc', 'mkl_mc3', 'pthread']
    library_dirs = ['/Library/Frameworks/EPD64.framework/Versions/1.4.2/lib']
    define_macros = [('SCIPY_MKL_H', None)]
    include_dirs = ['/Library/Frameworks/EPD64.framework/Versions/1.4.2/include']
lapack_mkl_info:
    libraries = ['mkl_lapack95_lp64', 'mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'mkl_mc', 'mkl_mc3', 'pthread']
    library_dirs = ['/Library/Frameworks/EPD64.framework/Versions/1.4.2/lib']
    define_macros = [('SCIPY_MKL_H', None)]
    include_dirs = ['/Library/Frameworks/EPD64.framework/Versions/1.4.2/include']
blas_mkl_info:
    libraries = ['mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'mkl_mc', 'mkl_mc3', 'pthread']
    library_dirs = ['/Library/Frameworks/EPD64.framework/Versions/1.4.2/lib']
    define_macros = [('SCIPY_MKL_H', None)]
    include_dirs = ['/Library/Frameworks/EPD64.framework/Versions/1.4.2/include']
mkl_info:
    libraries = ['mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'mkl_mc', 'mkl_mc3', 'pthread']
    library_dirs = ['/Library/Frameworks/EPD64.framework/Versions/1.4.2/lib']
    define_macros = [('SCIPY_MKL_H', None)]
    include_dirs = ['/Library/Frameworks/EPD64.framework/Versions/1.4.2/include']
Numpy dot module: numpy.core._dotblas
Numpy location: /Library/Frameworks/EPD64.framework/Versions/7.2/lib/python2.7/site-packages/numpy/__init__.py
Numpy version: 1.6.1
We executed 10 calls to gemm with a and b matrices of shapes (2000, 2000) and (2000, 2000).
Total execution time: 10.81s on CPU (with direct Theano binding to blas).
Try running this script a few times; experience shows the first run is not as fast as subsequent runs. The difference is not big, but it is consistent.
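To put these totals in perspective: one M=N=K=2000 gemm performs about 2*M*N*K = 1.6e10 floating-point operations, and each reported figure covers 10 calls. A quick conversion to throughput (the GPU comparison assumes the float32 table uses the same 10-call setup, which the banner does not state explicitly):

# Back-of-the-envelope throughput implied by the timings above.
M = N = K = 2000
flops_per_call = 2.0 * M * N * K          # multiplies plus adds in one gemm

for label, seconds in [('this run (-lblas, float64)', 10.81),
                       ('GTX580 reference (float32)', 0.20)]:
    gflops = flops_per_call * 10 / seconds / 1e9
    print('%-28s %6.1f GFLOP/s' % (label, gflops))
# roughly 14.8 GFLOP/s for this CPU run vs ~800 GFLOP/s on the GTX580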
(x86_64-apple-darwin10.8.0)dhcp0-176:Theano gwtaylor$ THEANO_FLAGS=blas.ldflags= python theano/misc/check_blas.py
[Reference CPU and GPU tables identical to the first run omitted.]
Some Theano flags:
blas.ldflags=
compiledir= /Users/gwtaylor/.theano/compiledir_Darwin-10.8.0-i386-64bit-i386-2.7.2
floatX= float64
Some environment variables:
MKL_NUM_THREADS= 2
OMP_NUM_THREADS= None
GOTO_NUM_THREADS= None
[Numpy config, dot module, location, and version identical to the first run omitted.]
We executed 10 calls to gemm with a and b matrices of shapes (2000, 2000) and (2000, 2000).
Total execution time: 8.10s on CPU (without direct Theano binding to blas but with numpy/scipy binding to blas).
Try running this script a few times; experience shows the first run is not as fast as subsequent runs. The difference is not big, but it is consistent.
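Note that this run, with blas.ldflags left empty (8.10s), beats the generic -lblas run above (10.81s): without ldflags Theano falls back to numpy's dot, and EPD's numpy is linked against MKL, as the config dump shows. A small check of what numpy itself is linked against (the numpy.core._dotblas probe holds only for numpy older than 1.10, which matches the 1.6.1 used here):

# Inspect the BLAS that numpy was built against; this is the code path
# Theano falls back to when blas.ldflags is empty.
import numpy

numpy.show_config()   # prints the same *_info blocks as the dump above

# On numpy < 1.10, numpy.dot is BLAS-accelerated only when the optional
# _dotblas extension was built (the module was removed in numpy 1.10).
try:
    import numpy.core._dotblas
    print('numpy.dot goes through an optimized BLAS')
except ImportError:
    print('numpy.dot uses the slow internal fallback')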
(x86_64-apple-darwin10.8.0)dhcp0-176:Theano gwtaylor$ DYLD_FALLBACK_LIBRARY_PATH=/Library/Frameworks/EPD64.framework/Versions/Current/lib python theano/misc/check_blas.py
[Reference CPU and GPU tables identical to the first run omitted.]
Some Theano flags:
blas.ldflags= -L/Users/gwtaylor/virtualenvs/x86_64-apple-darwin10.8.0/bin/../lib -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -lmkl_mc -lmkl_mc3 -lpthread
compiledir= /Users/gwtaylor/.theano/compiledir_Darwin-10.8.0-i386-64bit-i386-2.7.2
floatX= float64
Some environment variables:
MKL_NUM_THREADS= 2
OMP_NUM_THREADS= None
GOTO_NUM_THREADS= None
[Numpy config, dot module, location, and version identical to the first run omitted.]
We executed 10 calls to gemm with a and b matrices of shapes (2000, 2000) and (2000, 2000).
Total execution time: 7.50s on CPU (with direct Theano binding to blas).
Try running this script a few times; experience shows the first run is not as fast as subsequent runs. The difference is not big, but it is consistent.
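The MKL-bound configuration is the fastest of the three CPU runs (7.50s vs 8.10s and 10.81s). To make it stick without retyping the flags each time, the same settings can be placed in the environment (or in an equivalent [blas] ldflags entry in ~/.theanorc) before Theano is imported. A sketch, with the MKL library directory left as a placeholder:

# Sketch: pin Theano to MKL before it is first imported. The -L path is a
# placeholder; point it at the directory that holds the MKL dylibs (the
# run above used the virtualenv's lib directory).
import os

os.environ['THEANO_FLAGS'] = (
    'blas.ldflags=-L/path/to/mkl/lib '
    '-lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -lpthread'
)
os.environ['MKL_NUM_THREADS'] = '2'    # MKL thread count used in these runs

import theano                          # must come after the env vars are set
print(theano.config.blas.ldflags)      # confirm what Theano picked up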