Jutho/TensorOperations efficiency

## TensorOperations efficiency
{
 "metadata": {
  "language": "Julia",
  "name": "TensorOperations"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": "Efficiency of TensorOperations versus Base algorithms"
    },
    {
     "cell_type": "raw",
     "metadata": {},
     "source": "Comparison of the efficiency between some standard functionality in Base and similar functionality in TensorOperations.\n\nThe algorithms in TensorOperations where developed in a cache friendly manner by blocking data into blocks that fit into cache. The size of the blocks along the several dimension is even chosen such that, if the stride along that dimension is small (e.g. one or two) and a given number of elements are expected to fit in a single cache line, the block size along this dimension is chosen to be an integer times that number. This way, the blocks are expected to fit in the cache in a nice manner.\n\nSome of the algorithms in Base use blocking but without a precarious choice of blocking dimensions (tranpose!, general_matmatmul), whereas other don't use blocking at all (permutedims!).\n\nThe functionality in TensorOperations is often more general or formulated differently and therefore has some overhead for small data sizes. "
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "using TensorOperations",
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 1
    },
    {
     "cell_type": "heading",
     "level": 3,
     "metadata": {},
     "source": "Permutations and matrix transpose"
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "function worstcasepermute(D::Int,N::Int)\n    dim=ntuple(n->D,N);\n    A=rand(dim);\n    B1=similar(A);\n    B2=similar(A);\n    p=N:-1:1\n    print(\"permutedims!: \")\n    @time Base.permutedims!(B1,A,p);\n    print(\"tensorcopy! : \")\n    @time TensorOperations.tensorcopy!(A,1:N,B2,p);\nend",
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 2,
       "text": "worstcasepermute (generic function with 1 method)"
      }
     ],
     "prompt_number": 2
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "# warm up\nfor N=1:8;worstcasepermute(1,N);end",
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "permutedims!: elapsed time: "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "0.022153291 seconds (658308 bytes allocated)\ntensorcopy! : elapsed time: "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "0.09835909 seconds (2601808 bytes allocated)\npermutedims!: elapsed time: 0.02009878 seconds (693684 bytes allocated)\ntensorcopy! : elapsed time: "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "0.09171668 seconds (2406376 bytes allocated)\npermutedims!: elapsed time: 0.029229126 seconds (949192 bytes allocated)\ntensorcopy! : elapsed time: "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "0.085342752 seconds (2982788 bytes allocated)\npermutedims!: elapsed time: 0.036884769 seconds (1143696 bytes allocated)\ntensorcopy! : elapsed time: "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "0.126020709 seconds (3453528 bytes allocated)\npermutedims!: elapsed time: "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "0.112746523 seconds (3729668 bytes allocated)\ntensorcopy! : elapsed time: "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "0.136430573 seconds (5547188 bytes allocated)\npermutedims!: elapsed time: 0.057547326 seconds (2258872 bytes allocated)\ntensorcopy! : elapsed time: "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "0.158325483 seconds (6449384 bytes allocated)\npermutedims!: elapsed time: 0.066185252 seconds (2546336 bytes allocated)\ntensorcopy! : elapsed time: "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "0.198973563 seconds (7476564 bytes allocated)\npermutedims!: elapsed time: 0.072756989 seconds (2839544 bytes allocated)\ntensorcopy! : elapsed time: "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "0.228635707 seconds (8394504 bytes allocated)\n"
      }
     ],
     "prompt_number": 3
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "# example\nworstcasepermute(50,4);\nworstcasepermute(20,6);\nworstcasepermute(10,8);",
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "permutedims!: elapsed time: "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "0.123536414 seconds (248 bytes allocated)\ntensorcopy! : elapsed time: 0.054885578 seconds (2264 bytes allocated)\npermutedims!: "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "elapsed time: "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "1.712432803 seconds (280 bytes allocated)\ntensorcopy! : elapsed time: "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "0.544470776 seconds (2824 bytes allocated)\npermutedims!: "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "elapsed time: "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "3.238086901 seconds (312 bytes allocated)\ntensorcopy! : elapsed time: "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "1.179742243 seconds (3544 bytes allocated)\n"
      }
     ],
     "prompt_number": 3
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "function averagepermute(D::Int,N::Int,numsamples::Int)\n    dim=ntuple(n->D,N);\n    A=rand(dim);\n    B1=similar(A);\n    B2=similar(A);\n    \n    timings1=zeros(Float64,numsamples)\n    timings2=zeros(Float64,numsamples)\n    for i=1:numsamples\n        p=randperm(N)\n        timings1[i]=@elapsed Base.permutedims!(B1,A,p)\n        timings2[i]=@elapsed TensorOperations.tensorcopy!(A,1:N,B2,p)\n    end\n    return timings1,timings2\nend",
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 4,
       "text": "averagepermute (generic function with 1 method)"
      }
     ],
     "prompt_number": 4
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "timingspermutedims,timingstensorcopy=averagepermute(10,6,100);\nprintln(\"permutedims: $(mean(timingspermutedims)) +- $(std(timingspermutedims))\")\nprintln(\"tensorcopy : $(mean(timingstensorcopy)) +- $(std(timingstensorcopy))\")",
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "permutedims: 0.004075591729999999 +- 0.005327664609953273"
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "\ntensorcopy : 0.0032746996699999998 +- 0.002292183078064184\n"
      }
     ],
     "prompt_number": 5
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "timingspermutedims,timingstensorcopy=averagepermute(6,8,100);\nprintln(\"permutedims: $(mean(timingspermutedims)) +- $(std(timingspermutedims))\")\nprintln(\"tensorcopy : $(mean(timingstensorcopy)) +- $(std(timingstensorcopy))\")",
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "permutedims: 0.0064732256 +- 0.006754916836422047"
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "\ntensorcopy : 0.0054266709699999985 +- 0.001835678005478869\n"
      }
     ],
     "prompt_number": 6
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "function matrixtranspose(D::Int)\n    A=rand(D,D);\n    B1=similar(A);\n    B2=similar(A);\n    print(\"tranpose!  : \")\n    @time Base.transpose!(B1,A);\n    print(\"tensorcopy!: \")\n    @time TensorOperations.tensorcopy!(A,[:a,:b],B2,[:b,:a]);\nend",
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 7,
       "text": "matrixtranspose (generic function with 1 method)"
      }
     ],
     "prompt_number": 7
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "# some random dimension between 500 and 2000 to show that it is not for specific values\n# there is of course some overhead in the tensorcopy function that could be avoided for\n# the specific case of tranpose, such that it would also win for smaller dimensions\nfor i=1:10;matrixtranspose(rand(1000:2000));println(\"---------------\");end",
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "tranpose!  : elapsed time: "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "0.018669895 seconds (128 bytes allocated)\ntensorcopy!: elapsed time: 0.016927011 seconds (1856 bytes allocated)\n---------------\ntranpose!  : "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "elapsed time: 0.019956547 seconds (128 bytes allocated)\ntensorcopy!: elapsed time: 0.016011333 seconds (1856 bytes allocated)\n---------------\ntranpose!  : elapsed time: 0.016562626 seconds (128 bytes allocated)\ntensorcopy!: elapsed time: "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "0.012454475 seconds (1856 bytes allocated)\n---------------\ntranpose!  : elapsed time: 0.007284891 seconds (128 bytes allocated)\ntensorcopy!: elapsed time: 0.006009112 seconds (1856 bytes allocated)\n---------------\ntranpose!  : elapsed time: 0.010966287 seconds (128 bytes allocated)\ntensorcopy!: elapsed time: 0.008998863 seconds (1856 bytes allocated)\n---------------\ntranpose!  : "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "elapsed time: 0.007626615 seconds (128 bytes allocated)\ntensorcopy!: elapsed time: 0.008822935 seconds (1856 bytes allocated)\n---------------\ntranpose!  : elapsed time: 0.007161113 seconds (128 bytes allocated)\ntensorcopy!: elapsed time: 0.006069167 seconds (1856 bytes allocated)\n---------------\ntranpose!  : elapsed time: 0.00571354 seconds (128 bytes allocated)\ntensorcopy!: elapsed time: 0.005225507 seconds (1856 bytes allocated)\n---------------\ntranpose!  : elapsed time: 0.007529222 seconds (128 bytes allocated)\ntensorcopy!: elapsed time: "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "0.006764039 seconds (1856 bytes allocated)\n---------------\ntranpose!  : elapsed time: 0.014858091 seconds (128 bytes allocated)\ntensorcopy!: elapsed time: 0.015782425 seconds (1856 bytes allocated)\n---------------\n"
      }
     ],
     "prompt_number": 8
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "# a particularly big value\nmatrixtranspose(10000);",
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "tranpose!  : "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "elapsed time: "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "0.829178508 seconds (128 bytes allocated)\ntensorcopy!: elapsed time: "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "0.686677353 seconds (1856 bytes allocated)\n"
      }
     ],
     "prompt_number": 8
    },
    {
     "cell_type": "heading",
     "level": 3,
     "metadata": {},
     "source": "Matrix multiplication"
    },
    {
     "cell_type": "raw",
     "metadata": {},
     "source": "Comparison of matrix multiplication with tensorcontract's native method and Base.LinAlg.generic_matmatmul. The method generic_matmatmul uses blocking but first copies every block to a seperate statically allocated memory buffer. Tensorcontract just accesses the data in blocks, but doesn't copy it. Tensorcontract reaches comparable speed (see below) despite a more general implementation and without the copying step of generic_matmatmul."
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "function matrixmultiplication(D::Int)\n    A=rand(D,D);\n    B=rand(D,D);\n    C1=similar(A);\n    C2=similar(A);\n    print(\"generic_matmatmul: \")\n    @time Base.LinAlg.generic_matmatmul(C1,'N','N',A,B);\n    print(\"tensorcontract!  : \")\n    @time TensorOperations.tensorcontract!(1.0,A,[:a,:b],'N',B,[:b,:c],'N',0.0,C2,[:a,:c];method=:native);\nend",
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 9,
       "text": "matrixmultiplication (generic function with 1 method)"
      }
     ],
     "prompt_number": 9
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "# warm up\nmatrixmultiplication(10);",
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "generic_matmatmul: "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "elapsed time: 4.344e-6 seconds (160 bytes allocated)\ntensorcontract!  : elapsed time: "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "1.37352808 seconds (74575836 bytes allocated)\n"
      }
     ],
     "prompt_number": 9
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "for i=1:10;matrixmultiplication(rand(500:1500));println(\"----------------------------\");end",
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "generic_matmatmul: elapsed time: "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "0.145732128 seconds (240 bytes allocated)\ntensorcontract!  : elapsed time: "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "0.130812397 seconds (8240 bytes allocated)\n----------------------------\ngeneric_matmatmul: elapsed time: "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "2.032037738 seconds (240 bytes allocated)\ntensorcontract!  : elapsed time: "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "1.890627482 seconds (8720 bytes allocated)\n----------------------------\ngeneric_matmatmul: elapsed time: "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "0.804559917 seconds (240 bytes allocated)\ntensorcontract!  : elapsed time: "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "0.826860556 seconds (8720 bytes allocated)\n----------------------------\ngeneric_matmatmul: elapsed time: "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "1.632195851 seconds (240 bytes allocated)\ntensorcontract!  : elapsed time: "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "1.386213196 seconds (8720 bytes allocated)\n----------------------------\ngeneric_matmatmul: elapsed time: "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "1.967131335 seconds (240 bytes allocated)\ntensorcontract!  : elapsed time: "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "1.827796812 seconds (8720 bytes allocated)\n----------------------------\ngeneric_matmatmul: elapsed time: "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "1.225873973 seconds (240 bytes allocated)\ntensorcontract!  : elapsed time: "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "1.067422735 seconds (8720 bytes allocated)\n----------------------------\ngeneric_matmatmul: elapsed time: "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "2.791405456 seconds (240 bytes allocated)\ntensorcontract!  : elapsed time: "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "2.692359793 seconds (8720 bytes allocated)\n----------------------------\ngeneric_matmatmul: elapsed time: "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "3.691674764 seconds (240 bytes allocated)\ntensorcontract!  : elapsed time: "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "3.62659834 seconds (8720 bytes allocated)\n----------------------------\ngeneric_matmatmul: elapsed time: "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "1.86821728 seconds (240 bytes allocated)\ntensorcontract!  : elapsed time: "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "1.526435759 seconds (8720 bytes allocated)\n----------------------------\ngeneric_matmatmul: elapsed time: "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "2.501729686 seconds (240 bytes allocated)\ntensorcontract!  : elapsed time: "
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "2.260456268 seconds (8720 bytes allocated)\n----------------------------\n"
      }
     ],
     "prompt_number": 10
    }
   ],
   "metadata": {}
  }
 ]
}
	{
	"metadata": {
	"language": "Julia",
	"name": "TensorOperations"
	},
	"nbformat": 3,
	"nbformat_minor": 0,
	"worksheets": [
	{
	"cells": [
	{
	"cell_type": "heading",
	"level": 1,
	"metadata": {},
	"source": "Efficiency of TensorOperations versus Base algorithms"
	},
	{
	"cell_type": "raw",
	"metadata": {},
	"source": "Comparison of the efficiency between some standard functionality in Base and similar functionality in TensorOperations.\n\nThe algorithms in TensorOperations where developed in a cache friendly manner by blocking data into blocks that fit into cache. The size of the blocks along the several dimension is even chosen such that, if the stride along that dimension is small (e.g. one or two) and a given number of elements are expected to fit in a single cache line, the block size along this dimension is chosen to be an integer times that number. This way, the blocks are expected to fit in the cache in a nice manner.\n\nSome of the algorithms in Base use blocking but without a precarious choice of blocking dimensions (tranpose!, general_matmatmul), whereas other don't use blocking at all (permutedims!).\n\nThe functionality in TensorOperations is often more general or formulated differently and therefore has some overhead for small data sizes. "
	},
	{
	"cell_type": "code",
	"collapsed": false,
	"input": "using TensorOperations",
	"language": "python",
	"metadata": {},
	"outputs": [],
	"prompt_number": 1
	},
	{
	"cell_type": "heading",
	"level": 3,
	"metadata": {},
	"source": "Permutations and matrix transpose"
	},
	{
	"cell_type": "code",
	"collapsed": false,
	"input": "function worstcasepermute(D::Int,N::Int)\n dim=ntuple(n->D,N);\n A=rand(dim);\n B1=similar(A);\n B2=similar(A);\n p=N:-1:1\n print(\"permutedims!: \")\n @time Base.permutedims!(B1,A,p);\n print(\"tensorcopy! : \")\n @time TensorOperations.tensorcopy!(A,1:N,B2,p);\nend",
	"language": "python",
	"metadata": {},
	"outputs": [
	{
	"metadata": {},
	"output_type": "pyout",
	"prompt_number": 2,
	"text": "worstcasepermute (generic function with 1 method)"
	}
	],
	"prompt_number": 2
	},
	{
	"cell_type": "code",
	"collapsed": false,
	"input": "# warm up\nfor N=1:8;worstcasepermute(1,N);end",
	"language": "python",
	"metadata": {},
	"outputs": [
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": "permutedims!: elapsed time: "
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": "0.022153291 seconds (658308 bytes allocated)\ntensorcopy! : elapsed time: "
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": "0.09835909 seconds (2601808 bytes allocated)\npermutedims!: elapsed time: 0.02009878 seconds (693684 bytes allocated)\ntensorcopy! : elapsed time: "
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": "0.09171668 seconds (2406376 bytes allocated)\npermutedims!: elapsed time: 0.029229126 seconds (949192 bytes allocated)\ntensorcopy! : elapsed time: "
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": "0.085342752 seconds (2982788 bytes allocated)\npermutedims!: elapsed time: 0.036884769 seconds (1143696 bytes allocated)\ntensorcopy! : elapsed time: "
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": "0.126020709 seconds (3453528 bytes allocated)\npermutedims!: elapsed time: "
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": "0.112746523 seconds (3729668 bytes allocated)\ntensorcopy! : elapsed time: "
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": "0.136430573 seconds (5547188 bytes allocated)\npermutedims!: elapsed time: 0.057547326 seconds (2258872 bytes allocated)\ntensorcopy! : elapsed time: "
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": "0.158325483 seconds (6449384 bytes allocated)\npermutedims!: elapsed time: 0.066185252 seconds (2546336 bytes allocated)\ntensorcopy! : elapsed time: "
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": "0.198973563 seconds (7476564 bytes allocated)\npermutedims!: elapsed time: 0.072756989 seconds (2839544 bytes allocated)\ntensorcopy! : elapsed time: "
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": "0.228635707 seconds (8394504 bytes allocated)\n"
	}
	],
	"prompt_number": 3
	},
	{
	"cell_type": "code",
	"collapsed": false,
	"input": "# example\nworstcasepermute(50,4);\nworstcasepermute(20,6);\nworstcasepermute(10,8);",
	"language": "python",
	"metadata": {},
	"outputs": [
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": "permutedims!: elapsed time: "
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": "0.123536414 seconds (248 bytes allocated)\ntensorcopy! : elapsed time: 0.054885578 seconds (2264 bytes allocated)\npermutedims!: "
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": "elapsed time: "
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": "1.712432803 seconds (280 bytes allocated)\ntensorcopy! : elapsed time: "
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": "0.544470776 seconds (2824 bytes allocated)\npermutedims!: "
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": "elapsed time: "
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": "3.238086901 seconds (312 bytes allocated)\ntensorcopy! : elapsed time: "
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": "1.179742243 seconds (3544 bytes allocated)\n"
	}
	],
	"prompt_number": 3
	},
	{
	"cell_type": "code",
	"collapsed": false,
	"input": "function averagepermute(D::Int,N::Int,numsamples::Int)\n dim=ntuple(n->D,N);\n A=rand(dim);\n B1=similar(A);\n B2=similar(A);\n \n timings1=zeros(Float64,numsamples)\n timings2=zeros(Float64,numsamples)\n for i=1:numsamples\n p=randperm(N)\n timings1[i]=@elapsed Base.permutedims!(B1,A,p)\n timings2[i]=@elapsed TensorOperations.tensorcopy!(A,1:N,B2,p)\n end\n return timings1,timings2\nend",
	"language": "python",
	"metadata": {},
	"outputs": [
	{
	"metadata": {},
	"output_type": "pyout",
	"prompt_number": 4,
	"text": "averagepermute (generic function with 1 method)"
	}
	],
	"prompt_number": 4
	},
	{
	"cell_type": "code",
	"collapsed": false,
	"input": "timingspermutedims,timingstensorcopy=averagepermute(10,6,100);\nprintln(\"permutedims: $(mean(timingspermutedims)) +- $(std(timingspermutedims))\")\nprintln(\"tensorcopy : $(mean(timingstensorcopy)) +- $(std(timingstensorcopy))\")",
	"language": "python",
	"metadata": {},
	"outputs": [
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": "permutedims: 0.004075591729999999 +- 0.005327664609953273"
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": "\ntensorcopy : 0.0032746996699999998 +- 0.002292183078064184\n"
	}
	],
	"prompt_number": 5
	},
	{
	"cell_type": "code",
	"collapsed": false,
	"input": "timingspermutedims,timingstensorcopy=averagepermute(6,8,100);\nprintln(\"permutedims: $(mean(timingspermutedims)) +- $(std(timingspermutedims))\")\nprintln(\"tensorcopy : $(mean(timingstensorcopy)) +- $(std(timingstensorcopy))\")",
	"language": "python",
	"metadata": {},
	"outputs": [
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": "permutedims: 0.0064732256 +- 0.006754916836422047"
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": "\ntensorcopy : 0.0054266709699999985 +- 0.001835678005478869\n"
	}
	],
	"prompt_number": 6
	},
	{
	"cell_type": "code",
	"collapsed": false,
	"input": "function matrixtranspose(D::Int)\n A=rand(D,D);\n B1=similar(A);\n B2=similar(A);\n print(\"tranpose! : \")\n @time Base.transpose!(B1,A);\n print(\"tensorcopy!: \")\n @time TensorOperations.tensorcopy!(A,[:a,:b],B2,[:b,:a]);\nend",
	"language": "python",
	"metadata": {},
	"outputs": [
	{
	"metadata": {},
	"output_type": "pyout",
	"prompt_number": 7,
	"text": "matrixtranspose (generic function with 1 method)"
	}
	],
	"prompt_number": 7
	},
	{
	"cell_type": "code",
	"collapsed": false,
	"input": "# some random dimension between 500 and 2000 to show that it is not for specific values\n# there is of course some overhead in the tensorcopy function that could be avoided for\n# the specific case of tranpose, such that it would also win for smaller dimensions\nfor i=1:10;matrixtranspose(rand(1000:2000));println(\"---------------\");end",
	"language": "python",
	"metadata": {},
	"outputs": [
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": "tranpose! : elapsed time: "
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": "0.018669895 seconds (128 bytes allocated)\ntensorcopy!: elapsed time: 0.016927011 seconds (1856 bytes allocated)\n---------------\ntranpose! : "
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": "elapsed time: 0.019956547 seconds (128 bytes allocated)\ntensorcopy!: elapsed time: 0.016011333 seconds (1856 bytes allocated)\n---------------\ntranpose! : elapsed time: 0.016562626 seconds (128 bytes allocated)\ntensorcopy!: elapsed time: "
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": "0.012454475 seconds (1856 bytes allocated)\n---------------\ntranpose! : elapsed time: 0.007284891 seconds (128 bytes allocated)\ntensorcopy!: elapsed time: 0.006009112 seconds (1856 bytes allocated)\n---------------\ntranpose! : elapsed time: 0.010966287 seconds (128 bytes allocated)\ntensorcopy!: elapsed time: 0.008998863 seconds (1856 bytes allocated)\n---------------\ntranpose! : "
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": "elapsed time: 0.007626615 seconds (128 bytes allocated)\ntensorcopy!: elapsed time: 0.008822935 seconds (1856 bytes allocated)\n---------------\ntranpose! : elapsed time: 0.007161113 seconds (128 bytes allocated)\ntensorcopy!: elapsed time: 0.006069167 seconds (1856 bytes allocated)\n---------------\ntranpose! : elapsed time: 0.00571354 seconds (128 bytes allocated)\ntensorcopy!: elapsed time: 0.005225507 seconds (1856 bytes allocated)\n---------------\ntranpose! : elapsed time: 0.007529222 seconds (128 bytes allocated)\ntensorcopy!: elapsed time: "
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": "0.006764039 seconds (1856 bytes allocated)\n---------------\ntranpose! : elapsed time: 0.014858091 seconds (128 bytes allocated)\ntensorcopy!: elapsed time: 0.015782425 seconds (1856 bytes allocated)\n---------------\n"
	}
	],
	"prompt_number": 8
	},
	{
	"cell_type": "code",
	"collapsed": false,
	"input": "# a particularly big value\nmatrixtranspose(10000);",
	"language": "python",
	"metadata": {},
	"outputs": [
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": "tranpose! : "
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": "elapsed time: "
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": "0.829178508 seconds (128 bytes allocated)\ntensorcopy!: elapsed time: "
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": "0.686677353 seconds (1856 bytes allocated)\n"
	}
	],
	"prompt_number": 8
	},
	{
	"cell_type": "heading",
	"level": 3,
	"metadata": {},
	"source": "Matrix multiplication"
	},
	{
	"cell_type": "raw",
	"metadata": {},
	"source": "Comparison of matrix multiplication with tensorcontract's native method and Base.LinAlg.generic_matmatmul. The method generic_matmatmul uses blocking but first copies every block to a seperate statically allocated memory buffer. Tensorcontract just accesses the data in blocks, but doesn't copy it. Tensorcontract reaches comparable speed (see below) despite a more general implementation and without the copying step of generic_matmatmul."
	},
	{
	"cell_type": "code",
	"collapsed": false,
	"input": "function matrixmultiplication(D::Int)\n A=rand(D,D);\n B=rand(D,D);\n C1=similar(A);\n C2=similar(A);\n print(\"generic_matmatmul: \")\n @time Base.LinAlg.generic_matmatmul(C1,'N','N',A,B);\n print(\"tensorcontract! : \")\n @time TensorOperations.tensorcontract!(1.0,A,[:a,:b],'N',B,[:b,:c],'N',0.0,C2,[:a,:c];method=:native);\nend",
	"language": "python",
	"metadata": {},
	"outputs": [
	{
	"metadata": {},
	"output_type": "pyout",
	"prompt_number": 9,
	"text": "matrixmultiplication (generic function with 1 method)"
	}
	],
	"prompt_number": 9
	},
	{
	"cell_type": "code",
	"collapsed": false,
	"input": "# warm up\nmatrixmultiplication(10);",
	"language": "python",
	"metadata": {},
	"outputs": [
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": "generic_matmatmul: "
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": "elapsed time: 4.344e-6 seconds (160 bytes allocated)\ntensorcontract! : elapsed time: "
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": "1.37352808 seconds (74575836 bytes allocated)\n"
	}
	],
	"prompt_number": 9
	},
	{
	"cell_type": "code",
	"collapsed": false,
	"input": "for i=1:10;matrixmultiplication(rand(500:1500));println(\"----------------------------\");end",
	"language": "python",
	"metadata": {},
	"outputs": [
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": "generic_matmatmul: elapsed time: "
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": "0.145732128 seconds (240 bytes allocated)\ntensorcontract! : elapsed time: "
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": "0.130812397 seconds (8240 bytes allocated)\n----------------------------\ngeneric_matmatmul: elapsed time: "
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": "2.032037738 seconds (240 bytes allocated)\ntensorcontract! : elapsed time: "
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": "1.890627482 seconds (8720 bytes allocated)\n----------------------------\ngeneric_matmatmul: elapsed time: "
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": "0.804559917 seconds (240 bytes allocated)\ntensorcontract! : elapsed time: "
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": "0.826860556 seconds (8720 bytes allocated)\n----------------------------\ngeneric_matmatmul: elapsed time: "
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": "1.632195851 seconds (240 bytes allocated)\ntensorcontract! : elapsed time: "
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": "1.386213196 seconds (8720 bytes allocated)\n----------------------------\ngeneric_matmatmul: elapsed time: "
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": "1.967131335 seconds (240 bytes allocated)\ntensorcontract! : elapsed time: "
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": "1.827796812 seconds (8720 bytes allocated)\n----------------------------\ngeneric_matmatmul: elapsed time: "
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": "1.225873973 seconds (240 bytes allocated)\ntensorcontract! : elapsed time: "
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": "1.067422735 seconds (8720 bytes allocated)\n----------------------------\ngeneric_matmatmul: elapsed time: "
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": "2.791405456 seconds (240 bytes allocated)\ntensorcontract! : elapsed time: "
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": "2.692359793 seconds (8720 bytes allocated)\n----------------------------\ngeneric_matmatmul: elapsed time: "
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": "3.691674764 seconds (240 bytes allocated)\ntensorcontract! : elapsed time: "
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": "3.62659834 seconds (8720 bytes allocated)\n----------------------------\ngeneric_matmatmul: elapsed time: "
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": "1.86821728 seconds (240 bytes allocated)\ntensorcontract! : elapsed time: "
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": "1.526435759 seconds (8720 bytes allocated)\n----------------------------\ngeneric_matmatmul: elapsed time: "
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": "2.501729686 seconds (240 bytes allocated)\ntensorcontract! : elapsed time: "
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": "2.260456268 seconds (8720 bytes allocated)\n----------------------------\n"
	}
	],
	"prompt_number": 10
	}
	],
	"metadata": {}
	}
	]
	}