SharanSMenon/opencl_matrix_multiply.ipynb

## opencl_matrix_multiply.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "1a530103-f585-4306-8416-2eb423b09ec4",
   "metadata": {},
   "source": [
    "# Matrix Multiplication with OpenCL\n",
    "\n",
    "This notebook demonstrates matrix multiplication on the GPU using OpenCL, driven from Python through PyOpenCL."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ac559786-3ec4-45f2-afb7-b3436a2ea112",
   "metadata": {},
   "source": [
    "> Install the dependencies with `pip install numpy pyopencl`, and make sure an OpenCL runtime is installed on your system.\n",
    "\n",
    "This notebook was run on `macOS 12.0` on an `arm64` system (Apple M1)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "19184a0b-cc2f-41e6-be48-ae501867525b",
   "metadata": {},
   "source": [
    "## Setup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "323c5a1b-e33d-45b1-869d-7830a126e7c4",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pyopencl as cl\n",
    "import pyopencl.array"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "6ae5b58f-df47-4350-b724-8057d8887cba",
   "metadata": {},
   "outputs": [],
   "source": [
    "%load_ext pyopencl.ipython_ext"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "728289ad-f99c-43c6-b270-9b653766a7a7",
   "metadata": {},
   "source": [
    "Create the OpenCL context, which selects the platform and device, and the command queue used to submit work to that device."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "94a25fd1-994b-4adb-a752-7e3305337e15",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Choose platform:\n",
      "[0] <pyopencl.Platform 'Apple' at 0x7fff0000>\n"
     ]
    },
    {
     "name": "stdin",
     "output_type": "stream",
     "text": [
      "Choice [0]: 0\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Set the environment variable PYOPENCL_CTX='0' to avoid being asked again.\n"
     ]
    }
   ],
   "source": [
    "ctx = cl.create_some_context(interactive=True)\n",
    "queue = cl.CommandQueue(ctx)"
   ]
  },
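  {
   "cell_type": "markdown",
   "id": "8915b360-3c95-46b5-a87c-0d7149fbcfd5",
   "metadata": {},
   "source": [
    "## Matrix Multiplication kernel\n",
    "\n",
    "The following cell contains the matrix multiplication kernel. Note that the kernel takes three **one**-dimensional arrays: `a` (n-by-m) and `b` (m-by-p) are passed flattened in row-major order, and each work-item computes one element of the flattened n-by-p result `c`."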
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "03cb2128-21a1-48a6-875c-1cdc468cac3c",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%cl_kernel -o \"-cl-fast-relaxed-math\"\n",
    "\n",
    "// Computes c = a * b for row-major flattened matrices:\n",
    "// a is n x m, b is m x p, c is n x p. One work-item per element of c.\n",
    "__kernel void multiply(ushort n,\n",
    "ushort m, ushort p, __global float *a,\n",
    "__global float *b, __global float *c)\n",
    "{\n",
    "  int gid = get_global_id(0);\n",
    "  int rowC = gid/p;  // row of c handled by this work-item\n",
    "  int colC = gid%p;  // column of c handled by this work-item\n",
    "  float acc = 0.0f;  // accumulate privately instead of in global memory\n",
    "  for(int k=0; k<m; k++)\n",
    "  {\n",
    "     acc += a[rowC*m + k] * b[k*p + colC];\n",
    "  }\n",
    "  c[gid] = acc;\n",
    "}"
   ]
  },
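  {
   "cell_type": "markdown",
   "id": "624d5e22-41c0-490e-b6e8-bfe5a5aea796",
   "metadata": {},
   "source": [
    "## Matrix creation\n",
    "\n",
    "Create two large matrices filled with random integers (stored as `float32`) for the CPU and GPU to multiply: `a` is `ha` x `wa` with values in [0, 950), and `b` is `wa` x `wb` with values in [0, 50)."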
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "id": "8c538f11-4c1d-40b2-afb7-3aba2b03b788",
   "metadata": {},
   "outputs": [],
   "source": [
    "ha = np.int32(1892)\n",
    "wa = np.int32(1742)\n",
    "wb = np.int32(922)\n",
    "a_host = np.random.randint(0, 950, (ha, wa)).astype(np.float32)\n",
    "a = cl.array.to_device(queue, a_host.flatten()) # Flattens a_host and creates a GPU buffer\n",
    "b_host = np.random.randint(0, 50, (wa, wb)).astype(np.float32)\n",
    "b = cl.array.to_device(queue, b_host.flatten()) # Flattens b_host and creates a GPU buffer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "id": "02008b82-734a-4cbd-8c0a-f37f506d0d29",
   "metadata": {},
   "outputs": [],
   "source": [
    "c = cl.array.empty(queue, (ha*wb), dtype=np.float32)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "47b4f3d1-ac4a-40af-b576-049f714e1dec",
   "metadata": {},
   "source": [
    "## CPU Matrix Multiplication\n",
    "\n",
    "The following cell times the same multiplication on the CPU with NumPy. It can be very fast or take a long time, depending on the computer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 76,
   "id": "1805589b-01e2-46d7-b5a1-e3e3644cd7b2",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 301 ms, sys: 8.06 ms, total: 309 ms\n",
      "Wall time: 54.3 ms\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "np_out = np.matmul(a_host, b_host)"
   ]
  },
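  {
   "cell_type": "markdown",
   "id": "4d12c809-c741-4085-8970-ead27f4e04b2",
   "metadata": {},
   "source": [
    "## GPU Matrix Multiplication\n",
    "\n",
    "The following code launches the kernel on the flattened arrays. The first three arguments are PyOpenCL launch arguments: the command queue, the global work size (one work-item per element of `c`), and the local work size (`None` lets the implementation choose). The other six are the kernel's own arguments."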
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 80,
   "id": "ae1bac2f-03bb-4b1d-a155-f1ebb1b4744a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 1.35 ms, sys: 17 µs, total: 1.36 ms\n",
      "Wall time: 1.36 ms\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "resg = multiply(queue, (ha*wb,), None, np.uint16(ha), np.uint16(wa), np.uint16(wb), a.data, b.data, c.data)"
   ]
  },
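  {
   "cell_type": "markdown",
   "id": "16a2854c-c299-4d26-a0d1-61218fbc9326",
   "metadata": {},
   "source": [
    "The kernel call above returns after about 1.36 ms, but it only enqueues the work: PyOpenCL kernel launches are asynchronous and return an event immediately. The 120 ms measured for `c.get()` below therefore covers waiting for any pending kernel to finish as well as copying the result back to the host.\n",
    "\n",
    "Taken at face value, 1.36 ms looks almost 40x faster than the 54.3 ms NumPy needed on the CPU, but the fair comparison is launch plus `c.get()`, about 121 ms. For matrices of this size and this simple kernel, the CPU is still the better choice."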
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 75,
   "id": "11849fcd-2658-4c0a-86c7-5a712296f0ad",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 1.15 ms, sys: 2.79 ms, total: 3.95 ms\n",
      "Wall time: 120 ms\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "c_ans = c.get()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "feaad757-14a7-49d0-9cb1-7db2279d187a",
   "metadata": {},
   "source": [
    "## Checks\n",
    "\n",
    "The following code checks that the CPU and the GPU compute the same values by taking the norm of their difference. If the output is 0, it means the two results are identical."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "id": "40a54ead-5b47-4323-825c-f146d7dbc5c4",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.0"
      ]
     },
     "execution_count": 53,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.linalg.norm(np_out - c_ans.reshape(np_out.shape))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "id": "5ea1dd44-962d-4131-a107-b5fe52856856",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(892, 922)"
      ]
     },
     "execution_count": 54,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "c_ans.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d5d5fc98-6c45-46a7-aa9a-1c800594d7f2",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
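
The flat indexing used by the `multiply` kernel in the notebook above can be checked without a GPU. The following is a minimal pure-Python sketch, not part of the original notebook: `multiply_flat` is a hypothetical helper that mirrors the kernel's arithmetic on row-major flattened lists, with one loop iteration standing in for each OpenCL work-item (`gid`), row `gid // p`, and column `gid % p`.

```python
def multiply_flat(n, m, p, a, b):
    """Emulate the OpenCL kernel on flat, row-major lists.

    a holds an n x m matrix, b holds an m x p matrix; the result is n x p.
    (Hypothetical helper for illustration only.)
    """
    c = [0.0] * (n * p)
    for gid in range(n * p):        # one iteration per OpenCL work-item
        row, col = divmod(gid, p)   # same as gid / p and gid % p in the kernel
        acc = 0.0
        for k in range(m):
            # a[row, k] lives at row*m + k, b[k, col] lives at k*p + col
            acc += a[row * m + k] * b[k * p + col]
        c[gid] = acc
    return c


# A 2 x 3 matrix times a 3 x 2 matrix, both stored row by row
a = [1.0, 2.0, 3.0,
     4.0, 5.0, 6.0]
b = [7.0, 8.0,
     9.0, 10.0,
     11.0, 12.0]
result = multiply_flat(2, 3, 2, a, b)
print(result)  # [58.0, 64.0, 139.0, 154.0]
```

Reading the output two values at a time gives the rows of the 2 x 2 product, which is the same reshaping step needed before comparing the flat GPU result with NumPy's two-dimensional one.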