{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Parallelization with GPU & CUDA"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"CUDA is a platform that allows the use of GPUs to execute and parallelize programs. In python, the numba library translates code into CUDA kernels and functions."
"- To install it with conda::\n",
" 1. Run `conda install numba`\n",
" 2. Install the latest for your platform [NVIDIA graphics drivers](https://www.nvidia.com/Download/index.aspx)\n",
" 3. Install cudatoolkit library: `conda install cudatoolkit`\n",
"- To install it with pip\n",
" 1. Execute `pip install numba`\n",
" 2. Install the [CUDA SDK](https://developer.nvidia.com/cuda-downloads)"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from numba import cuda"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"За проверка на исталацијата извршете:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Found 1 CUDA devices\n",
"id 0 b'GeForce GT 730' [SUPPORTED]\n",
" compute capability: 3.5\n",
" pci device id: 0\n",
" pci bus id: 1\n",
"Summary:\n",
"\t1/1 devices are supported\n"
]
},
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cuda.detect()"
]
},
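{
"cell_type": "markdown",
"metadata": {},
"source": [
"Besides `cuda.detect()`, numba also exposes `cuda.is_available()` and `cuda.get_current_device()`, which can be used to check for a usable GPU programmatically. A minimal sketch, assuming the same installation as above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"if cuda.is_available():\n",
"    # query the device that numba will use for kernel launches\n",
"    device = cuda.get_current_device()\n",
"    print(device.name, device.compute_capability)\n",
"else:\n",
"    print('No usable CUDA GPU found')"
]
},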
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,\n",
" 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31])"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"numbers = np.arange(0,32,1)\n",
"numbers"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"CPU програма за множење на секој елемент со 2"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"def multiply2_CPU(arr):\n",
" res = np.copy(arr)\n",
" for i in range(len(arr)):\n",
" res[i] = arr[i]*2\n",
" return res"
]
},
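{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, the explicit loop above is equivalent to NumPy's vectorized arithmetic, which applies the multiplication to every element in a single call (a side note, not part of the GPU example that follows):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# vectorized CPU equivalent of multiply2_CPU\n",
"numbers * 2"
]
},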
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([ 0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32,\n",
" 34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60, 62])"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"multiply2_CPU(numbers)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"GPU програма за множење на секој елемент со 2\n",
"\n",
"Се инстанцираат 32 threads, при што секоја од нив работи на еден елемент од низата. Threads се организирани во блокови. Во следниот пример, се користат 32 блокови со по 1 thread."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"threadsperblock = 1\n",
"blockspergrid = (numbers.size + (threadsperblock - 1)) // threadsperblock"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Следно, се дефинира кернелот - функцијата која ја извршува секоја thread. Елементот на кој работи секоја од нив се добива со block_id * големина_на_блок + thread_id.\n",
"\n",
"Во дадениот пример, thread 0 во блок 5 ќе работи на елементот 5 * 1 + 0 = 5 од влезната низа."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"@cuda.jit\n",
"def multiply2_GPU(in_arr, out_arr):\n",
" tx = cuda.threadIdx.x # Thread id во блокот\n",
" ty = cuda.blockIdx.x # Block id \n",
" bw = cuda.blockDim.x # број на threads во блокот\n",
" pos = tx + ty * bw # индексот на елементот на кој треба да работи секоја thread се добива \n",
" # со block_id * големина_на_блок + thread_id\n",
" if pos < in_arr.size and pos < out_arr.size: # Check array boundaries\n",
" out_arr[pos] = in_arr[pos]*2"
]
},
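{
"cell_type": "markdown",
"metadata": {},
"source": [
"numba also provides the shorthand `cuda.grid(1)`, which returns the same absolute thread index, so the kernel can be written more compactly. A sketch of an equivalent kernel (the name `multiply2_GPU_v2` is only for illustration):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"@cuda.jit\n",
"def multiply2_GPU_v2(in_arr, out_arr):\n",
"    pos = cuda.grid(1)  # equivalent to cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x\n",
"    if pos < in_arr.size and pos < out_arr.size:\n",
"        out_arr[pos] = in_arr[pos]*2"
]
},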
{
"cell_type": "markdown",
"metadata": {},
"source": [
"На следниот начин се стартува кернелот. \n",
"\n",
"Прво податоците треба да се префрлат од CPU (host) на GPU (device).\n",
"\n",
"Потоа со `име_на_кернел[број_на_блокови, големина_на_блок]` се почнува извршувањето на кернелот. \n",
"\n",
"По завршувањето, треба податоците повторно да се префрлат на CPU."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"res = np.copy(numbers)\n",
"\n",
"d_in = cuda.to_device(numbers)\n",
"d_out = cuda.to_device(res)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"multiply2_GPU[blockspergrid, threadsperblock](d_in, d_out)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"res = d_out.copy_to_host()"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([ 0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32,\n",
" 34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60, 62])"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"res"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Проверката `pos < in_arr.size and pos < out_arr.size` е потребна затоа што е можно големината на низата да не е делива со големината на блокот. Во следниот пример се стартуваат 16 блокови со по 32 threads, или вкупно 512 threads. 12 од нив ќе ја прават само оваа проверка и, затоа што индексот на кој би требало да работат не постои во низите на кои работат, нема да пристапат до тие непостоечки индекси. Доколку во кернелот ја нема оваа проверка, за тие threads би се јавила грешка, затоа што би се обиделе да пристапат до елементи кои не постојат."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([ 0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24,\n",
" 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50,\n",
" 52, 54, 56, 58, 60, 62, 64, 66, 68, 70, 72, 74, 76,\n",
" 78, 80, 82, 84, 86, 88, 90, 92, 94, 96, 98, 100, 102,\n",
" 104, 106, 108, 110, 112, 114, 116, 118, 120, 122, 124, 126, 128,\n",
" 130, 132, 134, 136, 138, 140, 142, 144, 146, 148, 150, 152, 154,\n",
" 156, 158, 160, 162, 164, 166, 168, 170, 172, 174, 176, 178, 180,\n",
" 182, 184, 186, 188, 190, 192, 194, 196, 198, 200, 202, 204, 206,\n",
" 208, 210, 212, 214, 216, 218, 220, 222, 224, 226, 228, 230, 232,\n",
" 234, 236, 238, 240, 242, 244, 246, 248, 250, 252, 254, 256, 258,\n",
" 260, 262, 264, 266, 268, 270, 272, 274, 276, 278, 280, 282, 284,\n",
" 286, 288, 290, 292, 294, 296, 298, 300, 302, 304, 306, 308, 310,\n",
" 312, 314, 316, 318, 320, 322, 324, 326, 328, 330, 332, 334, 336,\n",
" 338, 340, 342, 344, 346, 348, 350, 352, 354, 356, 358, 360, 362,\n",
" 364, 366, 368, 370, 372, 374, 376, 378, 380, 382, 384, 386, 388,\n",
" 390, 392, 394, 396, 398, 400, 402, 404, 406, 408, 410, 412, 414,\n",
" 416, 418, 420, 422, 424, 426, 428, 430, 432, 434, 436, 438, 440,\n",
" 442, 444, 446, 448, 450, 452, 454, 456, 458, 460, 462, 464, 466,\n",
" 468, 470, 472, 474, 476, 478, 480, 482, 484, 486, 488, 490, 492,\n",
" 494, 496, 498, 500, 502, 504, 506, 508, 510, 512, 514, 516, 518,\n",
" 520, 522, 524, 526, 528, 530, 532, 534, 536, 538, 540, 542, 544,\n",
" 546, 548, 550, 552, 554, 556, 558, 560, 562, 564, 566, 568, 570,\n",
" 572, 574, 576, 578, 580, 582, 584, 586, 588, 590, 592, 594, 596,\n",
" 598, 600, 602, 604, 606, 608, 610, 612, 614, 616, 618, 620, 622,\n",
" 624, 626, 628, 630, 632, 634, 636, 638, 640, 642, 644, 646, 648,\n",
" 650, 652, 654, 656, 658, 660, 662, 664, 666, 668, 670, 672, 674,\n",
" 676, 678, 680, 682, 684, 686, 688, 690, 692, 694, 696, 698, 700,\n",
" 702, 704, 706, 708, 710, 712, 714, 716, 718, 720, 722, 724, 726,\n",
" 728, 730, 732, 734, 736, 738, 740, 742, 744, 746, 748, 750, 752,\n",
" 754, 756, 758, 760, 762, 764, 766, 768, 770, 772, 774, 776, 778,\n",
" 780, 782, 784, 786, 788, 790, 792, 794, 796, 798, 800, 802, 804,\n",
" 806, 808, 810, 812, 814, 816, 818, 820, 822, 824, 826, 828, 830,\n",
" 832, 834, 836, 838, 840, 842, 844, 846, 848, 850, 852, 854, 856,\n",
" 858, 860, 862, 864, 866, 868, 870, 872, 874, 876, 878, 880, 882,\n",
" 884, 886, 888, 890, 892, 894, 896, 898, 900, 902, 904, 906, 908,\n",
" 910, 912, 914, 916, 918, 920, 922, 924, 926, 928, 930, 932, 934,\n",
" 936, 938, 940, 942, 944, 946, 948, 950, 952, 954, 956, 958, 960,\n",
" 962, 964, 966, 968, 970, 972, 974, 976, 978, 980, 982, 984, 986,\n",
" 988, 990, 992, 994, 996, 998])"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"numbers2 = np.arange(500)\n",
"\n",
"threadsperblock = 32\n",
"blockspergrid = (numbers2.size + (threadsperblock - 1)) // threadsperblock\n",
"\n",
"res2 = np.copy(numbers2)\n",
"\n",
"d_in = cuda.to_device(numbers2)\n",
"d_out = cuda.to_device(res2)\n",
"\n",
"multiply2_GPU[blockspergrid, threadsperblock](d_in, d_out)\n",
"\n",
"res = d_out.copy_to_host()\n",
"\n",
"res"
]
},
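{
"cell_type": "markdown",
"metadata": {},
"source": [
"For convenience, numba can also transfer NumPy arrays between host and device automatically when they are passed directly to a kernel; the explicit `cuda.to_device` / `copy_to_host` calls used above avoid unnecessary copies. A minimal sketch of the implicit-transfer variant (the name `res3` is only for illustration):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"res3 = np.zeros_like(numbers2)\n",
"\n",
"# passing host arrays directly: numba copies them to the device and back after the launch\n",
"multiply2_GPU[blockspergrid, threadsperblock](numbers2, res3)\n",
"\n",
"res3"
]
},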
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Во следниот пример, една слика во боја се претвора во слика во црно-бело. Во претходниот пример, се користеше само една димензија на блоковите и решетката во која тие се организаирани. Тука ќе се користат дводимензионални блокови (64 threads, по 8 во секоја димензија) и дводимензионална решетка со број на блокови соодветен на големината на сликата."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"@cuda.jit\n",
"def to_grayscale(rgbImage, grayImage):\n",
" idx = cuda.blockDim.x * cuda.blockIdx.x + cuda.threadIdx.x\n",
" idy = cuda.blockDim.y * cuda.blockIdx.y + cuda.threadIdx.y\n",
" grid_width = cuda.gridDim.x * cuda.blockDim.x\n",
" index = idy * grid_width + idx\n",
" if index < rgbImage.size and index < grayImage.size:\n",
" channelSum = 0.299 * rgbImage[index][0] + 0.587 * rgbImage[index][1] + 0.114 * rgbImage[index][2]\n",
" grayImage[index] = channelSum"
]
},
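{
"cell_type": "markdown",
"metadata": {},
"source": [
"The two absolute positions can also be obtained with the shorthand `cuda.grid(2)`. A sketch of an equivalent kernel (the name `to_grayscale_v2` is only for illustration; it is launched in the same way as `to_grayscale`):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"@cuda.jit\n",
"def to_grayscale_v2(rgbImage, grayImage):\n",
"    idx, idy = cuda.grid(2)                        # absolute x and y position of the thread\n",
"    grid_width = cuda.gridDim.x * cuda.blockDim.x  # total number of threads along x\n",
"    index = idy * grid_width + idx                 # flatten the 2-D position into a pixel index\n",
"    if index < rgbImage.size and index < grayImage.size:\n",
"        grayImage[index] = 0.299 * rgbImage[index][0] + 0.587 * rgbImage[index][1] + 0.114 * rgbImage[index][2]"
]
},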
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"numRows = 1024\n",
"numCols = 768\n",
"\n",
"rgb = np.random.randint(0,255,size=(1024*768,3))"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"gray = np.copy(rgb)"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"bx = 8\n",
"by = 8\n",
"gx = numRows // bx +1\n",
"gy = numCols // by +1\n",
"block_size = (bx, by, 1)\n",
"grid_size = (gx, gy, 1)"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"d_in = cuda.to_device(rgb)\n",
"d_outs = cuda.to_device(gray)"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(8, 8, 1)"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"block_size"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(129, 97, 1)"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"grid_size"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"to_grayscale[grid_size, block_size](d_in, d_out)"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"res = d_out.copy_to_host()"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[132, 153, 97],\n",
" [154, 205, 247],\n",
" [119, 203, 61],\n",
" [ 18, 170, 107],\n",
" [ 10, 2, 196],\n",
" [198, 159, 181],\n",
" [ 9, 44, 0],\n",
" [247, 128, 217],\n",
" [194, 100, 49],\n",
" [ 18, 222, 162]])"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"rgb[0:10]"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([140, 194, 161, 117, 26, 173, 28, 173, 122, 154])"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"res[0:10]"
]
}
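,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sanity check, the same grayscale conversion can be computed on the CPU with vectorized NumPy and compared against the GPU result (a sketch; differences of at most one gray level could appear from floating-point rounding before the conversion to integers):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"expected = (0.299 * rgb[:, 0] + 0.587 * rgb[:, 1] + 0.114 * rgb[:, 2]).astype(rgb.dtype)\n",
"\n",
"# the largest per-pixel difference between the CPU and the GPU result\n",
"int(np.max(np.abs(expected.astype(np.int64) - res.astype(np.int64))))"
]
}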
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}