rmcgibbo/atlas.md

## atlas.md

      
Compiling an optimized ATLAS on Ubuntu 12.04


Download the latest ATLAS source from http://sourceforge.net/projects/math-atlas/files/. I'm using version 3.10.1.
Download the latest Netlib LAPACK from http://www.netlib.org/lapack/. I'm using version 3.4.2.
Turn off frequency scaling on your chip so that you can get reliable timings. This is essential for a good ATLAS build, because ATLAS tunes itself by timing candidate kernels during the build.

  sudo apt-get install cpufreq-info cpuspeed cpufrequtils sysfsutils
  # set each core to the "performance" governor, so that the clock frequency doesn't go down when idle
  # I have 8 cores, which is why I need to do this 8 times
  sudo cpufreq-selector -c 0 -g performance
  sudo cpufreq-selector -c 1 -g performance
  sudo cpufreq-selector -c 2 -g performance
  sudo cpufreq-selector -c 3 -g performance
  sudo cpufreq-selector -c 4 -g performance
  sudo cpufreq-selector -c 5 -g performance
  sudo cpufreq-selector -c 6 -g performance
  sudo cpufreq-selector -c 7 -g performance
  
  # check to make sure the scaling was set correctly
  sudo cat  /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor


Compile ATLAS.

tar -xjvf atlas3.10.1.tar.bz2
cd ATLAS
mkdir build
cd build
../configure -Fa alg '-fPIC' --with-netlib-lapack-tarfile=<PATH_TO_NETLIB_LAPACK_TARBALL> --prefix=$HOME/opt/atlas --shared
make
make test
make install

export LD_LIBRARY_PATH=$HOME/opt/atlas/lib:$LD_LIBRARY_PATH

In ATLAS 3.10.1, the two shared libraries that get compiled are named libsatlas.so and libtatlas.so. As configured above, each contains a full (C)BLAS+LAPACK interface. The difference is that the first is serial and the second is threaded. If you find that confusing and want to be able to link with -latlas or -lcblas, go into the install directory and create some symlinks.
cd $HOME/opt/atlas/lib
ln -s libtatlas.so libatlas.so  # make "libatlas" point to the threaded library
ln -s libtatlas.so libcblas.so  # make "libcblas" point to the threaded atlas

Linking numpy/scipy to your optimized ATLAS

Get the numpy source distribution. Move site.cfg.example to site.cfg, and set the following entries in site.cfg
[DEFAULT]
library_dirs = <YOUR_HOME_DIRECTORY>/opt/atlas/lib
include_dirs = <YOUR_HOME_DIRECTORY>/opt/atlas/include

[blas_opt]
libraries = tatlas

[lapack_opt]
libraries = tatlas

Now run python setup.py install. Compile scipy from source as well, and it will automatically build against the same ATLAS, since it detects its build configuration using numpy.distutils.

Timing your BLAS implementation

Download the file time_dgemm.c from this gist. It does a big matrix multiply via the cblas_dgemm function. You can try linking the program against different BLAS implementations to test your speed.
# link against the default system cblas.
$ gcc time_dgemm.c -lcblas && ./a.out
10.997839 s

You can see that this is linked against the system cblas/atlas, installed through the package manager.
$ ldd a.out
    linux-vdso.so.1 =>  (0x00007fffbddff000)
    libcblas.so.3gf => /usr/lib/libcblas.so.3gf (0x00007f5aa9498000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f5aa90d8000)
    libatlas.so.3gf => /usr/lib/libatlas.so.3gf (0x00007f5aa8ba7000)
    libgfortran.so.3 => /usr/lib/x86_64-linux-gnu/libgfortran.so.3 (0x00007f5aa8890000)
    libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f5aa8679000)
    libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f5aa845c000)
    libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f5aa8160000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f5aa96d1000)
    libquadmath.so.0 => /usr/lib/x86_64-linux-gnu/libquadmath.so.0 (0x00007f5aa7f29000)

Linking the same code against our threaded BLAS, I get a roughly 8x speedup (10.998 s down to 1.303 s).
$ gcc time_dgemm.c -L$HOME/opt/atlas/lib -ltatlas && ./a.out
1.302789 s

The single-threaded optimized version gets a ~3x speedup too.
$ gcc time_dgemm.c -L$HOME/opt/atlas/lib -lsatlas && ./a.out
3.237809 s


## openblas.md

      
Compiling OpenBLAS


Download the latest OpenBLAS from http://xianyi.github.io/OpenBLAS/.
Untar the package, and compile it.

$ tar -xvf v0.2.8
$ cd xianyi-OpenBLAS-9c51cdf
$ make
$ make PREFIX=$HOME/opt/openblas-0.2.8 install

Using OpenBLAS 0.2.8, I get somewhat better performance than the threaded ATLAS build (1.03 s versus 1.30 s).
$ gcc time_dgemm.c -L$HOME/opt/openblas-0.2.8/lib -lopenblas && ./a.out
1.031470 s

Timing MKL

Using the (nonfree) MKL, performance is similar to OpenBLAS, but the link line is much more complex.
$ gcc -fopenmp -m64 -I$MKLROOT/include time_dgemm.c -L$MKLROOT/lib/intel64 -lmkl_intel_lp64 -lmkl_gnu_thread -lmkl_core -ldl -lpthread -lm && ./a.out
0.951390 s


## time_dgemm.c
#define min(x,y) (((x) < (y)) ? (x) : (y))

#include <stdio.h>
#include <stdlib.h>
#include "cblas.h"
#include <time.h>
#include <sys/time.h>
#include <sys/resource.h>

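/* Wall-clock time in seconds, with microsecond resolution. */
double get_time(void) {
  struct timeval t;
  /* The timezone argument is obsolete and should be NULL. */
  gettimeofday(&t, NULL);
  return t.tv_sec + t.tv_usec*1e-6;
}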

int main()
{
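  /* Start from NULL so the free() calls in the error path are always safe. */
  double *A = NULL, *B = NULL, *C = NULL;
  int m, n, k, i;
  double alpha, beta;
  double start, end;

  m = 20000, k = 2000, n = 1000;

  alpha = 1.0; beta = 0.0;

  /* posix_memalign signals failure through its return value, so check that
     rather than the pointers. The size_t cast keeps the byte count out of
     int arithmetic. */
  if (posix_memalign((void**) &A, 64, (size_t) m*k*sizeof( double )) != 0 ||
      posix_memalign((void**) &B, 64, (size_t) k*n*sizeof( double )) != 0 ||
      posix_memalign((void**) &C, 64, (size_t) m*n*sizeof( double )) != 0) {
    printf( "\n ERROR: Can't allocate memory for matrices. Aborting... \n\n");
    free(A);
    free(B);
    free(C);
    return 1;
  }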

  for (i = 0; i < (m*k); i++) {
    A[i] = (double)(i+1);
  }

  for (i = 0; i < (k*n); i++) {
    B[i] = (double)(-i-1);
  }

  for (i = 0; i < (m*n); i++) {
    C[i] = 0.0;
  }

  start = get_time();
  cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
        m, n, k, alpha, A, k, B, n, beta, C, n);
  end = get_time();

  printf("%f s\n",  (end - start));


  free(A);
  free(B);
  free(C);
  return 0;
}