prejithp/numPy Walkthrough.ipynb

## numPy Walkthrough.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# numpy "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The goal of this notebook is to give anyone with knowledge in Python a quick crash course on the capabilities of numpy and the things they should read up more on, depending on their use case. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### What? How? Why?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before we start exploring numpy, let us answer the 3 basic questions that is going to come to anyone's mind:\n",
    "* What is numpy?\n",
    "* Why do people use numpy?\n",
    "* How was numpy written?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### What is numpy?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "numpy, or numerical Python is a library that is used for advanced mathematical computations while maintaining high levels of performance. \n",
    "According to Wikipedia, \"NumPy is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Why do people use numpy?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "numpy is used for a lot of different reasons. Lets list down a few of them:\n",
    "* It has an efficient storage mechanism compared to Python Lists\n",
    "* Ability to specify data type of numpy arrays\n",
    "* Faster operations - Operations on numpy arrays are faster than their siblings on Python Lists, primarily because of its homogenous nature\n",
    "* Creation of n-dimensional arrays and operations on them - Python lists are very primitive in the multi-dimensional and numpy makes life very easy in this department\n",
    "* Creation of random data - numpy can create random data with almost specification that can be thought of and this is extremely useful in a lot of different situations"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### How was numpy written?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "numpy is written in C and was a part of SciPy project but was later separated out of it as users might not want to install the huge SciPy package just for the array operations of numpy"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Before we start, lets ignore any warnings for this notebook\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Setting up numpy"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The sole reason that numpy is imported as **np** is convention. You are free to use another alias but its not recommended as this is what you will find everywhere and its better to stick to standards"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'1.18.1'"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.__version__"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### nd-array"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The primary reason that numpy is fast is because of the nd-array type that it uses to store and manipulate data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "An ndarray is a generic multidimensional container for homogenous data. It provides vectorized arithmetic operations and sophisticated broadcasting capabilities. Every ndarray has 2 properties: shape and dtype. shape is a tuple providing the dimension of the array and dtype provides you the datatype of the array.\n",
    "\n",
    "The dtype of the array can also be explicity specified while defining the array giving you fine-tuned control over the array. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Lets create a numpy array from an array. This is possible by passing the array as input to the *np.array* function"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([1, 2, 3])"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "nparray = np.array([1,2,3])\n",
    "nparray"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A numpy array has numerous properties which gives more information about them"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Datatype of the array"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dtype('int32')"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "nparray.dtype"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Size of the array"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "3"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "nparray.size"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Shape of the array"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(3,)"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "nparray.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The below 2 cells of code warrant a detailed explanation. The *itemsize* parameter returns the size of a single item in the array. In this case, we have an integer array taking up 32 bits of space for a single item and this is equivalent to 4 bytes(1 byte = 8 bits). The following cell showcases the *nbytes* parameter which returns the size in bytes of the entire array, thereby providing a value of 12 bytes(4 bytes * 3 items).\n",
    "\n",
    "To put it short: *itemsize* provides the size of a single item in the array while *nbytes* returns the size of the entire array."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "4"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "nparray.itemsize"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "12"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "nparray.nbytes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There is a reason that numpy's nd-array is faster than Python's native list. Lets look at this in depth in the following cells"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "545 µs ± 24 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n"
     ]
    }
   ],
   "source": [
    "%timeit pythonList = [i for i in range(10000)]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "7.82 µs ± 256 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)\n"
     ]
    }
   ],
   "source": [
    "%timeit npList = np.arange(10000)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "numpy arrays are homogenous and is handled faster in memory. Now compare this to a Python List where you can put anything in; every entry in a Python list is a Python object and this causes overhead in computations. This is the primary reason why numpy arrays are significantly faster than the traditional Python lists"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Lets now look at some of the functions which makes numpy a flexible and handy library."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Generating data with numpy"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*arange* generates a list of numbers within the range of the digit passed"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.arange(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*linspace* returns a set of lineary-spaced items within the range passed as input. In linspace, the starting digit, ending digit along with the number of digits required as passed as input. Basically, it returns an array with the required number of digits in a specified interval"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([ 0. ,  2.5,  5. ,  7.5, 10. ])"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.linspace(0, 10, 5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*ones* creates an array filled with ones. The parameter passed as input is the size of the required array"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([1., 1., 1., 1., 1.])"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.ones(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*zeros* creates an array filled with zeroes. The parameter passed as input is the size of the required array"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([0., 0., 0., 0., 0.])"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.zeros(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*zeros_like* creates an array with the same size as the array passed as input. The generated array will have zeros as elements"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([0, 0, 0, 0, 0])"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.zeros_like(np.arange(5))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*eye* creates an identity matrix. The generated matrix will be of the dimensions of the integer passed as input"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[1., 0., 0., 0., 0.],\n",
       "       [0., 1., 0., 0., 0.],\n",
       "       [0., 0., 1., 0., 0.],\n",
       "       [0., 0., 0., 1., 0.],\n",
       "       [0., 0., 0., 0., 1.]])"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.eye(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*empty* creates an array filled with garbage values, usually zeroes. The parameter passed as input is the size of the required array"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([1., 1., 1., 1., 1.])"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.empty(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Indexing"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "numpy follows the usual rules of Python when it comes to indexing and slicing. I am laying out a few examples below that you can play around with and experiment with:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "2"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "nparray[1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "3"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "nparray[-1]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Slicing"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([2])"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "nparray[1:2]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([1, 2])"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "nparray[:2]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([1, 2, 3])"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "nparray[:]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([2, 3])"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "nparray[1:]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [],
   "source": [
    "largeArray = np.arange(100)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([ 0, 20, 40, 60, 80])"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "largeArray[::20]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([1, 3, 5, 7, 9])"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "largeArray[1:10:2]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([99, 79, 59, 39, 19])"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "largeArray[::-20]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([10,  8,  6,  4,  2])"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "largeArray[10:1:-2]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "An important thing to keep in mind while using slicing in numpy is that the slices are essentially references(views) and hence, any changes that you make to the sliced data will reflect in the parent. Lets look at an example"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [],
   "source": [
    "smallArray = largeArray[:10]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "smallArray"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "largeArray[:10]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [],
   "source": [
    "smallArray[0] = 666"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([666,   1,   2,   3,   4,   5,   6,   7,   8,   9])"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "smallArray"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([666,   1,   2,   3,   4,   5,   6,   7,   8,   9])"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "largeArray[:10]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The copy() method can be used to create copies instead of such views"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Axes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A numpy array can be multi-dimensional. It also gives you the ability to change an existing array to a shape of your liking provided that it meets multiple constraints"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*reshape* lets you do exactly what the name says. It lets you shape the array into the dimensions passed as input. If you do not pass a dimension that the array can be reshaped into, then the function will return an error. Say, the array has 10 elements and you try to reshape it to an array of shape 3x5, then reshape will return an error"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[1, 2, 3],\n",
       "       [4, 5, 6]])"
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.arange(1, 7).reshape((2, 3))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([1, 2, 3])"
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.arange(1, 4)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[1, 2, 3]])"
      ]
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.arange(1, 4).reshape(1,3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*newaxis* is used to create a new axis in the data. It is commonly used when working on modelling techniques as models require the data to be shaped in a certain manner. As you can see below, if the *newaxis* parameter is in the first position, then a new row vector will be generated. If its in the second position, then a column will be created with each of the elements being a separate vector."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[1, 2, 3]])"
      ]
     },
     "execution_count": 39,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.arange(1, 4)[np.newaxis, :]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[1],\n",
       "       [2],\n",
       "       [3]])"
      ]
     },
     "execution_count": 40,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.arange(1, 4)[:, np.newaxis]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Array Concatenation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Arrays can be concatenated in numpy using the *concatenate* method. The list of arrays to be concatenated is to be passed as input to the concatenate function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([666,   1,   2,   3,   4,   5,   6,   7,   8,   9, 666,   1,   2,\n",
       "         3,   4,   5,   6,   7,   8,   9,  10,  11,  12,  13,  14,  15,\n",
       "        16,  17,  18,  19,  20,  21,  22,  23,  24,  25,  26,  27,  28,\n",
       "        29,  30,  31,  32,  33,  34,  35,  36,  37,  38,  39,  40,  41,\n",
       "        42,  43,  44,  45,  46,  47,  48,  49,  50,  51,  52,  53,  54,\n",
       "        55,  56,  57,  58,  59,  60,  61,  62,  63,  64,  65,  66,  67,\n",
       "        68,  69,  70,  71,  72,  73,  74,  75,  76,  77,  78,  79,  80,\n",
       "        81,  82,  83,  84,  85,  86,  87,  88,  89,  90,  91,  92,  93,\n",
       "        94,  95,  96,  97,  98,  99])"
      ]
     },
     "execution_count": 41,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.concatenate([smallArray, largeArray])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([666,   1,   2,   3,   4,   5,   6,   7,   8,   9, 666,   1,   2,\n",
       "         3,   4,   5,   6,   7,   8,   9,  10,  11,  12,  13,  14,  15,\n",
       "        16,  17,  18,  19,  20,  21,  22,  23,  24,  25,  26,  27,  28,\n",
       "        29,  30,  31,  32,  33,  34,  35,  36,  37,  38,  39,  40,  41,\n",
       "        42,  43,  44,  45,  46,  47,  48,  49,  50,  51,  52,  53,  54,\n",
       "        55,  56,  57,  58,  59,  60,  61,  62,  63,  64,  65,  66,  67,\n",
       "        68,  69,  70,  71,  72,  73,  74,  75,  76,  77,  78,  79,  80,\n",
       "        81,  82,  83,  84,  85,  86,  87,  88,  89,  90,  91,  92,  93,\n",
       "        94,  95,  96,  97,  98,  99, 888, 999])"
      ]
     },
     "execution_count": 42,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.concatenate([smallArray, largeArray, [888, 999]])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### numpy Ufuncs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The primary purpose of Ufunc is to be able to speed up repeated operations on values in a numpy Array. It can work both between a scalar value & an array and between 2 arrays"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([ 0,  3,  6,  9, 12, 15, 18, 21, 24, 27])"
      ]
     },
     "execution_count": 43,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "3 * np.arange(0, 10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([20, 22, 24, 26, 28, 30, 32, 34, 36, 38])"
      ]
     },
     "execution_count": 44,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.arange(0, 10) + np.arange(20, 30)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is not just syntactically better & intuitive, Its also faster. Lets try that out below"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1.22 µs ± 61.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)\n"
     ]
    }
   ],
   "source": [
    "%timeit 3 * smallArray"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "7.15 µs ± 182 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)\n"
     ]
    }
   ],
   "source": [
    "%%timeit\n",
    "\n",
    "for i in range(len(smallArray)):\n",
    "    3 * smallArray[i]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As can be seen above, the ufunc is multitudes faster than the looped version of the code. This difference becomes more and more pronounced as the computational logic involved gets more complex"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Aggregation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this section, we will explore the various aggregation functions that numpy provides. numpy ships with a standard *sum()* function which returns the sum of all elements in the array. You may ask what the difference is, between the native Python function and the numpy function! After all, they are doing the same functionality; provide the sum of elements. Let's check it out below"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "711"
      ]
     },
     "execution_count": 47,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.sum(smallArray)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "711"
      ]
     },
     "execution_count": 48,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sum(smallArray)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "526 µs ± 9.48 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n",
      "271 ms ± 11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
     ]
    }
   ],
   "source": [
    "hugeArray = np.random.randint(100000, size=1000000)\n",
    "%timeit np.sum(hugeArray)\n",
    "%timeit sum(hugeArray)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "From the above code block, its pretty evident how fast numpy is, compared to the native functions. This applies to most, if not all, aggregation functions available in numpy"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1"
      ]
     },
     "execution_count": 50,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.min(smallArray)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "666"
      ]
     },
     "execution_count": 51,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.max(smallArray)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "198.31512801599376"
      ]
     },
     "execution_count": 52,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.std(smallArray)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "71.1"
      ]
     },
     "execution_count": 53,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.mean(smallArray)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "5.5"
      ]
     },
     "execution_count": 54,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.median(smallArray)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One thing to keep in note while using the aggregation functions is that these are prone to NaN values while you are using them, i.e., if you have a NaN-value in your array, these aggregation functions will fail. In such cases you can use their NaN-safe alternatives. Just to give you an example: nansum() is the NaN-safe alternative for the sum function"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1.5"
      ]
     },
     "execution_count": 55,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.nanmean(np.array([1, 2, np.nan]))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "nan"
      ]
     },
     "execution_count": 56,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.mean(np.array([1, 2, np.nan]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Broadcasting"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We are now going to look at an operation that you might have used a lot without having a conceptual understanding. In broadcasting, you perform a operation on an entity that has a different shape/dimensions.\n",
    "\n",
    "A simple example for this could be adding a scalar to a numpy array. You can basically think of the values being duplicated to match the dimension of the array followed by the operation that is to be performed. \n",
    "\n",
    "Broadcasting is an operation that can be elaborated on, but that is not my goal here with this post"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([1, 2, 3, 4, 5, 6])"
      ]
     },
     "execution_count": 57,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "a = np.arange(1, 7)\n",
    "a"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([2, 3, 4, 5, 6, 7])"
      ]
     },
     "execution_count": 58,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "a + 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Logical Operations"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There will be times when you would like to perform logical checks on a piece of data. numpy provides all the normal logical operations that you would expect: greater than, less than & equal to checks. The logical functions return an array of results in Boolean indicating if they satisfied the condition or not"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {},
   "outputs": [],
   "source": [
    "x = np.array([1, 2, 3, 4])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([False, False,  True,  True])"
      ]
     },
     "execution_count": 60,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "x > 2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([False,  True, False, False])"
      ]
     },
     "execution_count": 61,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "x == 2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "numpy also provides useful functions such as any() and all() which are used to perform the following checks: if there is any element that fulfilled the condition or if all the elements satisfy the condition, respectively. They provide a single Boolean value as output which indicates the result"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 62,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.any(x == 2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "False"
      ]
     },
     "execution_count": 63,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.all(x == 2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "From the above 2 code blocks: the following can be understood: \n",
    "* any(x == 2): It checks if any of the elements in the array met the condition passed\n",
    "* all(x == 2): It checks if all the elements in the array met the condition passed"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Say we want a count of values that satisfy this particular condition, you can use the sum() method as follows. It counts the number of True values in the array"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1"
      ]
     },
     "execution_count": 64,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.sum(x == 2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This can also be compounded to check for multiple conditions. The same can be done for the *any* and *all* functions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "2"
      ]
     },
     "execution_count": 65,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.sum((x == 2) | (x == 3))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 66,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.any((x == 2) | (x == 3))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Masking"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You might have seen the above arrays which prvide values as a Boolean array and wondered how that's helpful because you still need to provide further operations to make sense of the output. Here is where masking comes in; the True/False array can be passed *into* the array to provide only those values which meet the conditions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 67,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([2])"
      ]
     },
     "execution_count": 67,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "x[x == 2]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 68,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([3, 4])"
      ]
     },
     "execution_count": 68,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "x[x > 2]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Fancy Indexing"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Fancy indexing is nothing but the ability to access multiple elements of the array at once"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 69,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([1, 2, 3, 4])"
      ]
     },
     "execution_count": 69,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "x"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 70,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([1, 3, 4])"
      ]
     },
     "execution_count": 70,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "x[[0, 2, 3]]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As you might have guessed, we can pass in an array as the list of indices too"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 71,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([1, 3, 4])"
      ]
     },
     "execution_count": 71,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "indexList = [0, 2, 3]\n",
    "x[indexList]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The true power of fancy indexing comes out when it is combined with the likes of slicing, indexing and broadcasting"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Sorting"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The sorting algorithm provided by numpy is very efficient. There are 2 ways you can sort a numpy array; in-place and by calling the numpy sort function that returns the sorted array"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's shuffle our array first so that we can sort it and play around with it"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 72,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([1, 3, 4, 2])"
      ]
     },
     "execution_count": 72,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.random.shuffle(x)\n",
    "x"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Calling *np.sort* on the array will return a copy of the array that is sorted and does not sort the array in place as demonstrated below"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 73,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([1, 2, 3, 4])"
      ]
     },
     "execution_count": 73,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.sort(x)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 74,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([1, 3, 4, 2])"
      ]
     },
     "execution_count": 74,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "x"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you call *sort* on the array, then the array will be sorted in-place and there's no need to save it to another array. Both methods perform the same functionality and are used depending on the required use-case"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 75,
   "metadata": {},
   "outputs": [],
   "source": [
    "x.sort()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 76,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([1, 2, 3, 4])"
      ]
     },
     "execution_count": 76,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "x"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Concluding Notes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "numpy has a lot more capabilities, especially with regards to higher-dimensional data. numpy can easily, work with, and manipulate data of higher dimensions. I have not decided to go into such details as that would defeat the purpose of this post. If you want to read more about numpy, I would highly recommend the following resources:\n",
    "* [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/) - The Data Science Handbook is a very popular book in the Dat Science domain and its ebook version is available for free. It will let you go into more details on numpy and, if interested, pandas, matplotlib & scikit-learn\n",
    "* [Python for Data Analysis](http://shop.oreilly.com/product/0636920023784.do) - Written by the creator of Pandas, you can go through this book for in-depth examples of the usage of numpy\n",
    "* And of course, [numpy documentation](https://numpy.org/doc/) - If you are a seasoned developer and know exactly what you are looking for, you could directly go to the numpy documentation as it has a lot of user guides and tutorials and is not just a website that hosts a reference documentation "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}