{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Testing CuBLAS and CUDArt for Julia\n",
"After finally getting NVCC to work on OSX, we can start using the CUDA-themed BLAS packages written for Julia. In this notebook we will document how to utilize the necessary datatypes and show comparisons between the CPU and GPU implementations of common BLAS functions."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## I. Calling and using the Libraries\n",
"Lets first make sure that we have updated and built the libraries. Because of the recent changes in Julia between `v0.3` and `v0.4`, we expect quite a number of warnings, and even errors, to pop up during the testing phase. However, the core functionality of the packges should be there."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"INFO: Updating METADATA...\n",
"INFO: Updating cache of IJulia...\n",
"INFO: Updating cache of Gtk...\n",
"INFO: Updating cache of Grid...\n",
"INFO: Updating cache of ProfileView...\n",
"INFO: Updating cache of ImageView...\n",
"INFO: Updating cache of Zlib...\n",
"INFO: Updating cache of Calculus...\n",
"INFO: Updating cache of Nettle...\n",
"INFO: Updating cache of CUDArt...\n",
"INFO: Updating cache of DualNumbers...\n",
"INFO: Updating CUBLAS...\n",
"INFO: Updating Winston...\n",
"INFO: Updating Devectorize...\n",
"INFO: Updating Images...\n",
"INFO: Updating IJulia...\n",
"INFO: Updating DataArrays...\n",
"INFO: Updating Boltzmann...\n",
"INFO: Updating ImageView...\n",
"INFO: Updating DataFrames...\n",
"INFO: Updating MNIST...\n",
"INFO: Updating LowRAMP...\n",
"INFO: Computing changes...\n",
"INFO: Upgrading CUDArt: v0.2.0 => v0.2.1\n",
"INFO: Upgrading Calculus: v0.1.11 => v0.1.12\n",
"INFO: Upgrading ColorTypes: v0.1.5 => v0.1.6\n",
"INFO: Upgrading DualNumbers: v0.1.3 => v0.1.4\n",
"INFO: Upgrading Grid: v0.3.11 => v0.4.0\n",
"INFO: Upgrading Gtk: v0.9.0 => v0.9.2\n",
"INFO: Upgrading Nettle: v0.1.10 => v0.2.0\n",
"INFO: Upgrading ProfileView: v0.1.0 => v0.1.1\n",
"INFO: Upgrading Zlib: v0.1.9 => v0.1.10\n",
"INFO: Building CUDArt\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"make: Nothing to be done for `all'.\n",
"make: Nothing to be done for `all'.\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"INFO: Building Homebrew\n",
"INFO: Recompiling stale cache file /Users/tramel/.julia/lib/v0.4/URIParser.ji for module URIParser.\n",
"INFO: Recompiling stale cache file /Users/tramel/.julia/lib/v0.4/SHA.ji for module SHA.\n",
"From https://github.com/Homebrew/homebrew\n",
" 4257c3d..433aa72 master -> origin/master\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"HEAD is now at 433aa72 pyqt: update 4.11.3_1 bottle.\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"From https://github.com/staticfloat/homebrew-juliadeps\n",
" 73fce5f..b359d09 master -> origin/master\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"HEAD is now at b359d09 gnome-icon-theme needs to be mirrored by us to avoid pkg-config problems\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"INFO: Building Cairo\n",
"INFO: Building Gtk\n",
"INFO: Building Nettle\n",
"INFO: Building CUDArt\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"make: Nothing to be done for `all'.\n",
"make: Nothing to be done for `all'.\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"INFO: Building CUDArt\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"make: Nothing to be done for `all'.\n",
"make: Nothing to be done for `all'.\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"INFO: Building CUDArt\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"make: Nothing to be done for `all'.\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"INFO: Testing CUDArt\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"make: Nothing to be done for `all'.\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"INFO: Recompiling stale cache file /Users/tramel/.julia/lib/v0.4/CUDArt.ji for module CUDArt.\n",
"WARNING: Base.Uint16 is deprecated, use UInt16 instead.\n",
"WARNING: Base.Uint16 is deprecated, use UInt16 instead.\n",
"WARNING: Base.Uint16 is deprecated, use UInt16 instead.\n",
"WARNING: Base.Uint16 is deprecated, use UInt16 instead.\n",
"in anonymous at null:-1\n",
"WARNING: Base.Uint16 is deprecated, use UInt16 instead.\n",
"in anonymous at null:-1\n",
"WARNING: Base.Uint16 is deprecated, use UInt16 instead.\n",
"in anonymous at null:-1\n",
"WARNING: Base.Uint16 is deprecated, use UInt16 instead.\n",
"in anonymous at null:-1\n",
"ERROR: LoadError: LoadError: test failed: !(isempty(CUDArt.cuda_ptrs))\n",
" in expression: !(isempty(CUDArt.cuda_ptrs))\n",
" in error at /Applications/Julia-0.4.0-rc2.app/Contents/Resources/julia/lib/julia/sys.dylib\n",
" in default_handler at test.jl:30\n",
" in do_test at test.jl:53\n",
" in include at /Applications/Julia-0.4.0-rc2.app/Contents/Resources/julia/lib/julia/sys.dylib\n",
" in include_from_node1 at /Applications/Julia-0.4.0-rc2.app/Contents/Resources/julia/lib/julia/sys.dylib\n",
" in include at /Applications/Julia-0.4.0-rc2.app/Contents/Resources/julia/lib/julia/sys.dylib\n",
" in include_from_node1 at /Applications/Julia-0.4.0-rc2.app/Contents/Resources/julia/lib/julia/sys.dylib\n",
" in process_options at /Applications/Julia-0.4.0-rc2.app/Contents/Resources/julia/lib/julia/sys.dylib\n",
" in _start at /Applications/Julia-0.4.0-rc2.app/Contents/Resources/julia/lib/julia/sys.dylib\n",
"while loading /Users/tramel/.julia/v0.4/CUDArt/test/gc.jl, in expression starting on line 57\n",
"while loading /Users/tramel/.julia/v0.4/CUDArt/test/runtests.jl, in expression starting on line 1\n",
"===============================[ ERROR: CUDArt ]================================\n",
"\n",
"failed process: Process(`/Applications/Julia-0.4.0-rc2.app/Contents/Resources/julia/bin/julia --check-bounds=yes --code-coverage=none --color=no /Users/tramel/.julia/v0.4/CUDArt/test/runtests.jl`, ProcessExited(1)) [1]\n"
]
},
{
"ename": "LoadError",
"evalue": "LoadError: CUDArt had test errors\nwhile loading In[14], in expression starting on line 6",
"output_type": "error",
"traceback": [
"LoadError: CUDArt had test errors\nwhile loading In[14], in expression starting on line 6",
"",
" in error at /Applications/Julia-0.4.0-rc2.app/Contents/Resources/julia/lib/julia/sys.dylib",
" in test at pkg/entry.jl:753",
" in anonymous at pkg/dir.jl:31",
" in cd at file.jl:22",
" in cd at pkg/dir.jl:31",
" in test at pkg.jl:71"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n",
"================================================================================\n"
]
}
],
"source": [
"# Update and build\n",
"Pkg.update()\n",
"Pkg.build(\"CUDArt\")\n",
"Pkg.build(\"CUBLAS\")\n",
"Pkg.build(\"CUDNN\")\n",
"Pkg.test(\"CUDArt\")\n",
"Pkg.test(\"CUBLAS\")"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"using CUDArt\n",
"using CUBLAS\n",
"using CUDNN\n",
"using Base.LinAlg.BLAS\n",
"using Devectorize"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## II. Experiment Parameters\n",
"We will focus our comparisons on the BLAS function `gemm` which computes\n",
"$$ \\mathbf{C} \\leftarrow \\alpha \\mathbf{A}\\mathbf{B} + \\beta \\mathbf{C}.$$\n",
"We will assume that all of these matrices are dense and real. For our experiments we will set\n",
"$\\mathbf{A}: (n \\times m)$, $\\mathbf{B}: (m \\times k)$, $\\mathbf{C}: (n \\times k)$, and\n",
"$\\alpha = \\beta = 1.0$."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" A 390 KB 500x100 Array{Float64,2} : [0.8697…\n",
" B 612 KB 100x784 Array{Float64,2} : [-0.770…\n",
" Base 27158 KB Module : Base\n",
" BinDeps 486 KB Module : BinDeps\n",
" C 3062 KB 500x784 Array{Float64,2} : [-0.578…\n",
" CUBLAS 545 KB Module : CUBLAS\n",
" CUDArt 535 KB Module : CUDArt\n",
" CUDNN 162 KB Module : CUDNN\n",
" Compat 58 KB Module : Compat\n",
" Core 3239 KB Module : Core\n",
" DataStructures 381 KB Module : DataStructures\n",
" Devectorize 259 KB Module : Devectorize\n",
" Homebrew 56 KB Module : Homebrew\n",
" IJulia 331 KB Module : IJulia\n",
" IPythonDisplay 26 KB Module : IPythonDisplay\n",
" JSON 214 KB Module : JSON\n",
" Main 36909 KB Module : Main\n",
" MyTimedGemm! 12 KB Function : MyTimedGemm!\n",
" Nettle 187 KB Module : Nettle\n",
" SHA 50 KB Module : SHA\n",
" URIParser 85 KB Module : URIParser\n",
" ZMQ 81 KB Module : ZMQ\n",
" a 8 bytes Float64 : 1.0\n",
" b 8 bytes Float64 : 1.0\n",
" d_A 40 bytes CUDArt.CudaArray{Float64,2}(CUDArt…\n",
" d_B 40 bytes CUDArt.CudaArray{Float64,2}(CUDArt…\n",
" d_C 40 bytes CUDArt.CudaArray{Float64,2}(CUDArt…\n",
" k 8 bytes Int64 : 784\n",
" m 8 bytes Int64 : 100\n",
" n 8 bytes Int64 : 500\n"
]
}
],
"source": [
"# Dimensions\n",
"n = 500\n",
"m = 100\n",
"k = 784\n",
"# Scalings\n",
"a = 1.0\n",
"b = 1.0\n",
"# Initialization\n",
"A = randn(n,m);\n",
"B = randn(m,k);\n",
"C = randn(n,k);\n",
"whos()"
]
},
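{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before timing anything, a quick sanity check (an added aside, not part of the original benchmark): with $\\alpha = \\beta = 1$, `gemm!` should reproduce the naive expression `A*B + C`. A minimal sketch, working on a copy so that `C` itself is left untouched:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Hypothetical correctness check: gemm! overwrites its last argument with\n",
"# a*A*B + b*C, which for a = b = 1.0 must match the naive expression.\n",
"Ccheck = copy(C)                 # work on a copy so C is untouched\n",
"gemm!('N','N',a,A,B,b,Ccheck)\n",
"println(\"max abs error: \", maximum(abs(Ccheck - (A*B + C))))"
]
},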
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## III. Baseline Performance\n",
"We will now look at the timing of the base OpenBLAS implementation of `gemm`, which runs on the CPU, alone."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 0.001017 seconds (4 allocations: 160 bytes)\n",
" 0.001137 seconds (4 allocations: 160 bytes)\n",
" 0.001359 seconds (4 allocations: 160 bytes)\n",
" 0.001426 seconds (4 allocations: 160 bytes)\n",
" 0.001248 seconds (4 allocations: 160 bytes)\n"
]
}
],
"source": [
"# Warmpup\n",
"gemm!('N','N',a,A,B,b,C);\n",
"gemm!('N','N',a,A,B,b,C);\n",
"# Time: 5 runs\n",
"@time gemm!('N','N',a,A,B,b,C);\n",
"@time gemm!('N','N',a,A,B,b,C);\n",
"@time gemm!('N','N',a,A,B,b,C);\n",
"@time gemm!('N','N',a,A,B,b,C);\n",
"@time gemm!('N','N',a,A,B,b,C);"
]
},
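{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sanity check on these numbers (an added aside): `gemm` performs $2nmk$ floating-point operations, so with $n = 500$, $m = 100$, $k = 784$ each call costs $2 \\cdot 500 \\cdot 100 \\cdot 784 = 78.4$ MFLOP. At roughly 1.2 ms per call, that is about 65 GFLOP/s, a plausible figure for OpenBLAS on a desktop CPU. A minimal sketch of the arithmetic:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Back-of-the-envelope throughput from the timings above (hypothetical helper,\n",
"# not part of the original benchmark).\n",
"flop = 2 * n * m * k             # 78.4 MFLOP for n=500, m=100, k=784\n",
"t_cpu = 0.0012                   # representative CPU time from above (seconds)\n",
"println(\"CPU throughput: \", round(flop / t_cpu / 1e9, 1), \" GFLOP/s\")"
]
},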
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## IV. CUDArt Datatypes\n",
"Our first step in being able to use CuBLAS is to initialize our GPU device and make on-device copies of the datastructures we're interested in. Below we detail how to fence off the GPU code and ensure that proper garbage collection is performed on the device via CUDArt."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CUDArt Data Pointer Descriptions\n",
"CUDArt.CudaArray{Float64,2}(CUDArt.CudaPtr{Float64}(Ptr{Float64} @0x0000000d00a80000),(500,100),0)\n",
"CUDArt.CudaArray{Float64,2}(CUDArt.CudaPtr{Float64}(Ptr{Float64} @0x0000000d00ae1c00),(100,784),0)\n",
"CUDArt.CudaArray{Float64,2}(CUDArt.CudaPtr{Float64}(Ptr{Float64} @0x0000000d00b80000),(500,784),0)\n"
]
}
],
"source": [
"# Assign Device\n",
"device(0)\n",
"device_reset(0) \n",
"device(0)\n",
"# Create and Copy \"A\"\n",
"d_A = CudaArray(A)\n",
"copy!(d_A,A)\n",
"# Create and Copy \"B\"\n",
"d_B = CudaArray(B)\n",
"copy!(d_B,B)\n",
"# Create and Copy \"C\"\n",
"d_C = CudaArray(C)\n",
"copy!(d_C,C)\n",
"# Show \n",
"println(\"CUDArt Data Pointer Descriptions\")\n",
"println(d_A)\n",
"println(d_B)\n",
"println(d_C)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## V. CuBLAS Timings\n",
"Now, lets look at the time requirements for just running `gemm`. We note that this **does not** include the time of memory copying to and from device memory. For now, lets limit ourselves to the direct comparison of the BLAS function implementation, alone."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 0.000027 seconds (19 allocations: 960 bytes)\n",
" 0.000052 seconds (19 allocations: 960 bytes)\n",
" 0.000038 seconds (19 allocations: 960 bytes)\n",
" 0.000034 seconds (19 allocations: 960 bytes)\n",
" 0.000034 seconds (19 allocations: 960 bytes)\n"
]
}
],
"source": [
"# Warmpup\n",
"CUBLAS.gemm!('N','N',a,d_A,d_B,b,d_C);\n",
"CUBLAS.gemm!('N','N',a,d_A,d_B,b,d_C);\n",
"# Time: 5 runs\n",
"@time CUBLAS.gemm!('N','N',a,d_A,d_B,b,d_C);\n",
"@time CUBLAS.gemm!('N','N',a,d_A,d_B,b,d_C);\n",
"@time CUBLAS.gemm!('N','N',a,d_A,d_B,b,d_C);\n",
"@time CUBLAS.gemm!('N','N',a,d_A,d_B,b,d_C);\n",
"@time CUBLAS.gemm!('N','N',a,d_A,d_B,b,d_C);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So, we can see form the above that we are looking at an *order of magnitude* improvement in computation time, potentially."
]
},
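{
"cell_type": "markdown",
"metadata": {},
"source": [
"A caveat worth noting (this aside is my addition): CUDA kernel launches are asynchronous, so `@time` around `CUBLAS.gemm!` may be measuring little more than the launch overhead rather than the full kernel execution. A fairer micro-benchmark blocks until the device has finished, e.g. with `device_synchronize()` from CUDArt. A minimal sketch, assuming the device arrays from above are still live:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Hypothetical fairer timing: synchronize so @time covers the whole kernel,\n",
"# not just the asynchronous launch.\n",
"@time begin\n",
"    CUBLAS.gemm!('N','N',a,d_A,d_B,b,d_C)\n",
"    device_synchronize()   # block until the GPU has actually finished\n",
"end;"
]
},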
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# End Session\n",
"device_reset(0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## VI. CuBLAS Timings: With Memory Copying\n",
"We will now look at the situation where we want to declare a local function which will conduct all of the necessary device-to-device memory copying requried for the GPU implemenation. Our goal is to see exactly how much advantage we retain in a realistic comparison."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Warmups============\n",
"(A->d_A) 0.000084 seconds\n",
"(B->d_B) 0.000080 seconds\n",
"(C->d_C) 0.000472 seconds\n",
"(CUBLAS.gemm!) 0.071927 seconds (15 allocations: 800 bytes)\n",
"(d_C->C) 0.004300 seconds\n",
"(A->d_A) 0.000077 seconds\n",
"(B->d_B) 0.000078 seconds\n",
"(C->d_C) 0.000474 seconds\n",
"(CUBLAS.gemm!) 0.000068 seconds (15 allocations: 800 bytes)\n",
"(d_C->C) 0.003919 seconds\n",
"Actual=============\n",
"(A->d_A) 0.000063 seconds\n",
"(B->d_B) 0.000084 seconds\n",
"(C->d_C) 0.000489 seconds\n",
"(CUBLAS.gemm!) 0.000069 seconds (15 allocations: 800 bytes)\n",
"(d_C->C) 0.003869 seconds\n",
" 0.005277 seconds (487 allocations: 23.641 KB)\n"
]
}
],
"source": [
"function MyTimedGemm!(tA,tB,a,A,d_A,B,d_B,b,C,d_C)\n",
" # Copy to device\n",
" @printf \"(A->d_A) \" \n",
" @time copy!(d_A,A)\n",
" @printf \"(B->d_B) \" \n",
" @time copy!(d_B,B)\n",
" @printf \"(C->d_C) \" \n",
" @time copy!(d_C,C)\n",
" # Run device-level BLAS\n",
" @printf \"(CUBLAS.gemm!) \"\n",
" @time CUBLAS.gemm!(tA,tB,a,d_A,d_B,b,d_C)\n",
" # Gather result\n",
" @printf \"(d_C->C) \"\n",
"# @time copy!(C,d_C)\n",
"# @time C = to_host(d_C)\n",
" @time C = CUBLAS.copy!(C,d_C)\n",
"end\n",
"\n",
"device(0)\n",
"device_reset(0)\n",
"device(0)\n",
"\n",
"# These pointers can be pre-allocated\n",
"d_A = CudaArray(A)\n",
"d_B = CudaArray(B)\n",
"d_C = CudaArray(C)\n",
"\n",
"# Warmup\n",
"println(\"Warmups============\")\n",
"MyTimedGemm!('N','N',a,A,d_A,B,d_B,b,C,d_C);\n",
"MyTimedGemm!('N','N',a,A,d_A,B,d_B,b,C,d_C);\n",
"println(\"Actual=============\")\n",
"@time MyTimedGemm!('N','N',a,A,d_A,B,d_B,b,C,d_C);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that the act of reading the matrix $\\mathbf{C}$ back from the device to the CPU actually incurs a huge cost. In fact, the cost is so high as to entirely remove any time advantage we obtain from the CuBLAS implemenation of `gemm`. The workaround for this is, of course, to retain as many computations on the device as possible. \n",
"\n",
"However, with the need to update the matrices on each call, as in the RBM implementation, we are forced to re-copy memory to and from the GPU between each call of `CUBLAS.gemm!`. This means that either we need to find a way to reduce the time requirement for copying $\\mathbf{C}$ back from the device, or we need to find a clever scheduling of GPU calls and memory copies."
]
},
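{
"cell_type": "markdown",
"metadata": {},
"source": [
"To illustrate the \"keep it on the device\" workaround (a minimal sketch of my own, not from the benchmark above; `d_T` is a hypothetical scratch buffer): chain several `CUBLAS.gemm!` calls on device-resident arrays and pay the expensive device-to-host copy only once, for the final result."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Hypothetical on-device chaining: intermediate results never leave the GPU.\n",
"d_T = CudaArray(Float64, (n, k))                 # device scratch buffer\n",
"CUBLAS.gemm!('N', 'N', 1.0, d_A, d_B, 0.0, d_T) # T = A*B (beta=0 ignores old T)\n",
"CUBLAS.gemm!('N', 'N', 1.0, d_A, d_B, 1.0, d_T) # T += A*B, still on device\n",
"copy!(C, d_T)                                    # single host read-back at the end"
]
},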
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## VII. Activation Functions\n",
"It is also of interest to us to make use of on-device activation functions. The `CUDNN.jl` wrapper for the NVIDIA CuDNN library does just this for us. Here, lets make a test of using the CUDNN sigmoid activation function against a direct implementation."
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Using CPU: 0.116879 seconds (3.13 M allocations: 56.808 MB, 4.25% gc time)\n",
"Using CuDNN: 0.101829 seconds (234 allocations: 1.230 MB)\n",
" MSE: 3.193e-34"
]
}
],
"source": [
"# Reset Device\n",
"device(0)\n",
"device_reset(0)\n",
"device(0)\n",
"# Initialize matrix of values to process\n",
"n = 784\n",
"m = 500\n",
"C = randn(m,n)\n",
"Dcpu = zeros(C)\n",
"Dgpu = zeros(C)\n",
"\n",
"# Run Activation: CPU\n",
"@printf \"Using CPU: \"\n",
"@time @devec Dcpu = 1 ./ (1 + exp(-C));\n",
"\n",
"# Run Activation: CUDNN\n",
"d_C = CudaArray(C)\n",
"copy!(d_C,C)\n",
"@printf \"Using CuDNN: \"\n",
"@time cudnnActivationForward(d_C;mode=CUDNN_ACTIVATION_SIGMOID);\n",
"copy!(Dgpu,d_C)\n",
"\n",
"\n",
"# Difference\n",
"mse = sum((vec(Dgpu)-vec(Dcpu)).^2)./(m*n)\n",
"@printf(\" MSE: %0.3e\",mse)\n",
" \n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Julia 0.4.0-rc2",
"language": "julia",
"name": "julia-0.4"
},
"language_info": {
"name": "julia",
"version": "0.4.0"
}
},
"nbformat": 4,
"nbformat_minor": 0
}