GraceLoggins/DataFrames.ipynb

## DataFrames.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<center>\n",
    "    <img src=\"https://gitlab.com/ibm/skills-network/courses/placeholder101/-/raw/master/labs/module%201/images/IDSNlogo.png\" width=\"300\" alt=\"cognitiveclass.ai logo\"  />\n",
    "</center>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# **Introduction to DataFrames**\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Estimated time needed: **15** minutes\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![](http://spark.apache.org/images/spark-logo.png)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Objectives\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A DataFrame is two-dimensional. Columns can be of different data types. DataFrames accept many data inputs including series and other DataFrames. You can pass indexes (row labels) and columns (column labels). Indexes can be numbers, dates, or strings/tuples.\n",
    "\n",
    "After completing this lab you will be able to:\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*   Load a data file into a DataFrame\n",
    "*   View the data schema of a DataFrame\n",
    "*   Perform basic data manipulation\n",
    "*   Aggregate data in a DataFrame\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "***\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setup\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For this lab, we are going to be using Python and Spark (PySpark). These libraries should be installed in your lab environment or in SN Labs.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Pandas is a popular data science package for Python. In this lab, we use Pandas to load a CSV file from disc to a pandas dataframe in memory. PySpark is the Spark API for Python. In this lab, we use PySpark to initialize the spark context.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Collecting pyspark\n",
      "  Downloading pyspark-3.2.0.tar.gz (281.3 MB)\n",
      "     |████████████████████████████████| 281.3 MB 14 kB/s              █▎                        | 63.7 MB 11.1 MB/s eta 0:00:20████                | 140.2 MB 19.6 MB/s eta 0:00:08\n",
      "\u001b[?25h  Preparing metadata (setup.py) ... \u001b[?25ldone\n",
      "\u001b[?25hCollecting py4j==0.10.9.2\n",
      "  Downloading py4j-0.10.9.2-py2.py3-none-any.whl (198 kB)\n",
      "     |████████████████████████████████| 198 kB 83.8 MB/s            \n",
      "\u001b[?25hBuilding wheels for collected packages: pyspark\n",
      "  Building wheel for pyspark (setup.py) ... \u001b[?25ldone\n",
      "\u001b[?25h  Created wheel for pyspark: filename=pyspark-3.2.0-py2.py3-none-any.whl size=281805912 sha256=802727a1a5e1294a8850c927f39b1e181c30010acaa651ea41f2ed83231ad8d5\n",
      "  Stored in directory: /home/jupyterlab/.cache/pip/wheels/0b/de/d2/9be5d59d7331c6c2a7c1b6d1a4f463ce107332b1ecd4e80718\n",
      "Successfully built pyspark\n",
      "Installing collected packages: py4j, pyspark\n",
      "Successfully installed py4j-0.10.9.2 pyspark-3.2.0\n",
      "Collecting findspark\n",
      "  Downloading findspark-2.0.0-py2.py3-none-any.whl (4.4 kB)\n",
      "Installing collected packages: findspark\n",
      "Successfully installed findspark-2.0.0\n",
      "Requirement already satisfied: pandas in /home/jupyterlab/conda/envs/python/lib/python3.7/site-packages (1.3.4)\n",
      "Requirement already satisfied: python-dateutil>=2.7.3 in /home/jupyterlab/conda/envs/python/lib/python3.7/site-packages (from pandas) (2.8.2)\n",
      "Requirement already satisfied: pytz>=2017.3 in /home/jupyterlab/conda/envs/python/lib/python3.7/site-packages (from pandas) (2021.3)\n",
      "Requirement already satisfied: numpy>=1.17.3 in /home/jupyterlab/conda/envs/python/lib/python3.7/site-packages (from pandas) (1.21.4)\n",
      "Requirement already satisfied: six>=1.5 in /home/jupyterlab/conda/envs/python/lib/python3.7/site-packages (from python-dateutil>=2.7.3->pandas) (1.16.0)\n"
     ]
    }
   ],
   "source": [
    "# Installing required packages\n",
    "!pip install pyspark\n",
    "!pip install findspark\n",
    "!pip install pandas"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import findspark\n",
    "findspark.init()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "from pyspark import SparkContext, SparkConf\n",
    "from pyspark.sql import SparkSession"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise 1 -  Spark session\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this exercise, you will create and initialize the Spark session needed to load the dataframes and operate on it\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Task 1: Creating the spark session and context\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "SLF4J: Class path contains multiple SLF4J bindings.\n",
      "SLF4J: Found binding in [jar:file:/home/jupyterlab/conda/envs/python/lib/python3.7/site-packages/pyspark/jars/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]\n",
      "SLF4J: Found binding in [jar:file:/home/jupyterlab/hadoop-2.9.2/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]\n",
      "SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.\n",
      "SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]\n",
      "Setting default log level to \"WARN\".\n",
      "To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\n",
      "22/01/21 00:25:48 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\n"
     ]
    }
   ],
   "source": [
    "# Creating a spark context class\n",
    "sc = SparkContext()\n",
    "\n",
    "# Creating a spark session\n",
    "spark = SparkSession \\\n",
    "    .builder \\\n",
    "    .appName(\"Python Spark DataFrames basic example\") \\\n",
    "    .config(\"spark.some.config.option\", \"some-value\") \\\n",
    "    .getOrCreate()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Task 2: Initialize Spark session\n",
    "\n",
    "To work with dataframes we just need to verify that the spark session instance has been created.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "\n",
       "            <div>\n",
       "                <p><b>SparkSession - in-memory</b></p>\n",
       "                \n",
       "        <div>\n",
       "            <p><b>SparkContext</b></p>\n",
       "\n",
       "            <p><a href=\"http://jupyterlab-grace123:4040\">Spark UI</a></p>\n",
       "\n",
       "            <dl>\n",
       "              <dt>Version</dt>\n",
       "                <dd><code>v3.2.0</code></dd>\n",
       "              <dt>Master</dt>\n",
       "                <dd><code>local[*]</code></dd>\n",
       "              <dt>AppName</dt>\n",
       "                <dd><code>pyspark-shell</code></dd>\n",
       "            </dl>\n",
       "        </div>\n",
       "        \n",
       "            </div>\n",
       "        "
      ],
      "text/plain": [
       "<pyspark.sql.session.SparkSession at 0x7fb4f753e390>"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "spark"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise 2 - Load the data and Spark dataframe\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this section, you will first read the CSV file into a Pandas DataFrame and then read it into a Spark DataFrame.\n",
    "Pandas is a library used for data manipulation and analysis. Pandas offers data structures and operations for creating and manipulating Data Series and DataFrame objects. Data can be imported from various data sources, e.g., Numpy arrays, Python dictionaries, and CSV files. Pandas allows you to manipulate, organize and display the data.\n",
    "To create a Spark DataFrame we load an external DataFrame, called mtcars. This DataFrame includes 32 observations on 11 variables:\n",
    "\n",
    "| colIndex | colName | units/description                        |\n",
    "| :------: | :------ | :--------------------------------------- |\n",
    "|   [, 1]  | mpg     | Miles per gallon                         |\n",
    "|   [, 2]  | cyl     | Number of cylinders                      |\n",
    "|   [, 3]  | disp    | Displacement (cu.in.)                    |\n",
    "|   [, 4]  | hp      | Gross horsepower                         |\n",
    "|   [, 5]  | drat    | Rear axle ratio                          |\n",
    "|   [, 6]  | wt      | Weight (lb/1000)                         |\n",
    "|   [, 7]  | qsec    | 1/4 mile time                            |\n",
    "|   [, 8]  | vs      | V/S                                      |\n",
    "|   [, 9]  | am      | Transmission (0 = automatic, 1 = manual) |\n",
    "|   [,10]  | gear    | Number of forward gears                  |\n",
    "|   [,11]  | carb    | Number of carburetors                    |\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Task 1: Loading data into a Pandas DataFrame\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Read the file using `read_csv` function in pandas\n",
    "mtcars = pd.read_csv('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0225EN-SkillsNetwork/labs/data/mtcars.csv')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Unnamed: 0</th>\n",
       "      <th>mpg</th>\n",
       "      <th>cyl</th>\n",
       "      <th>disp</th>\n",
       "      <th>hp</th>\n",
       "      <th>drat</th>\n",
       "      <th>wt</th>\n",
       "      <th>qsec</th>\n",
       "      <th>vs</th>\n",
       "      <th>am</th>\n",
       "      <th>gear</th>\n",
       "      <th>carb</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Mazda RX4</td>\n",
       "      <td>21.0</td>\n",
       "      <td>6</td>\n",
       "      <td>160.0</td>\n",
       "      <td>110</td>\n",
       "      <td>3.90</td>\n",
       "      <td>2.620</td>\n",
       "      <td>16.46</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>4</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Mazda RX4 Wag</td>\n",
       "      <td>21.0</td>\n",
       "      <td>6</td>\n",
       "      <td>160.0</td>\n",
       "      <td>110</td>\n",
       "      <td>3.90</td>\n",
       "      <td>2.875</td>\n",
       "      <td>17.02</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>4</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Datsun 710</td>\n",
       "      <td>22.8</td>\n",
       "      <td>4</td>\n",
       "      <td>108.0</td>\n",
       "      <td>93</td>\n",
       "      <td>3.85</td>\n",
       "      <td>2.320</td>\n",
       "      <td>18.61</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Hornet 4 Drive</td>\n",
       "      <td>21.4</td>\n",
       "      <td>6</td>\n",
       "      <td>258.0</td>\n",
       "      <td>110</td>\n",
       "      <td>3.08</td>\n",
       "      <td>3.215</td>\n",
       "      <td>19.44</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Hornet Sportabout</td>\n",
       "      <td>18.7</td>\n",
       "      <td>8</td>\n",
       "      <td>360.0</td>\n",
       "      <td>175</td>\n",
       "      <td>3.15</td>\n",
       "      <td>3.440</td>\n",
       "      <td>17.02</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          Unnamed: 0   mpg  cyl   disp   hp  drat     wt   qsec  vs  am  gear  \\\n",
       "0          Mazda RX4  21.0    6  160.0  110  3.90  2.620  16.46   0   1     4   \n",
       "1      Mazda RX4 Wag  21.0    6  160.0  110  3.90  2.875  17.02   0   1     4   \n",
       "2         Datsun 710  22.8    4  108.0   93  3.85  2.320  18.61   1   1     4   \n",
       "3     Hornet 4 Drive  21.4    6  258.0  110  3.08  3.215  19.44   1   0     3   \n",
       "4  Hornet Sportabout  18.7    8  360.0  175  3.15  3.440  17.02   0   0     3   \n",
       "\n",
       "   carb  \n",
       "0     4  \n",
       "1     4  \n",
       "2     1  \n",
       "3     1  \n",
       "4     2  "
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Preview a few records\n",
    "mtcars.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Task 2: Loading data into a Spark DataFrame\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "# We use the `createDataFrame` function to load the data into a spark dataframe\n",
    "sdf = spark.createDataFrame(mtcars) "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "root\n",
      " |-- Unnamed: 0: string (nullable = true)\n",
      " |-- mpg: double (nullable = true)\n",
      " |-- cyl: long (nullable = true)\n",
      " |-- disp: double (nullable = true)\n",
      " |-- hp: long (nullable = true)\n",
      " |-- drat: double (nullable = true)\n",
      " |-- wt: double (nullable = true)\n",
      " |-- qsec: double (nullable = true)\n",
      " |-- vs: long (nullable = true)\n",
      " |-- am: long (nullable = true)\n",
      " |-- gear: long (nullable = true)\n",
      " |-- carb: long (nullable = true)\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# Let us look at the schema of the loaded spark dataframe\n",
    "sdf.printSchema()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise 3: Basic data analysis and manipulation\n",
    "\n",
    "In this section, we perform basic data analysis and manipulation. We start with previewing the data and then applying some filtering and columwise operations.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Task 1: Displays the content of the DataFrame\n",
    "\n",
    "We use the `show()` method for this. Here we preview the first 5 records. Compare it to a similar `head()` function in Pandas.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                                \r"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+-----------------+----+---+-----+---+----+-----+-----+---+---+----+----+\n",
      "|       Unnamed: 0| mpg|cyl| disp| hp|drat|   wt| qsec| vs| am|gear|carb|\n",
      "+-----------------+----+---+-----+---+----+-----+-----+---+---+----+----+\n",
      "|        Mazda RX4|21.0|  6|160.0|110| 3.9| 2.62|16.46|  0|  1|   4|   4|\n",
      "|    Mazda RX4 Wag|21.0|  6|160.0|110| 3.9|2.875|17.02|  0|  1|   4|   4|\n",
      "|       Datsun 710|22.8|  4|108.0| 93|3.85| 2.32|18.61|  1|  1|   4|   1|\n",
      "|   Hornet 4 Drive|21.4|  6|258.0|110|3.08|3.215|19.44|  1|  0|   3|   1|\n",
      "|Hornet Sportabout|18.7|  8|360.0|175|3.15| 3.44|17.02|  0|  0|   3|   2|\n",
      "+-----------------+----+---+-----+---+----+-----+-----+---+---+----+----+\n",
      "only showing top 5 rows\n",
      "\n"
     ]
    }
   ],
   "source": [
    "sdf.show(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We use the `select()` function to select a particular column of data. Here we show the `mpg` column.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+----+\n",
      "| mpg|\n",
      "+----+\n",
      "|21.0|\n",
      "|21.0|\n",
      "|22.8|\n",
      "|21.4|\n",
      "|18.7|\n",
      "+----+\n",
      "only showing top 5 rows\n",
      "\n"
     ]
    }
   ],
   "source": [
    "sdf.select('mpg').show(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Task 2: Filtering and Columnar operations\n",
    "\n",
    "Filtering and Column operations are important to select relevant data and apply useful transformations.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We first filter to only retain rows with mpg > 18. We use the `filter()` function for this.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+-----------+----+---+-----+---+----+----+-----+---+---+----+----+\n",
      "| Unnamed: 0| mpg|cyl| disp| hp|drat|  wt| qsec| vs| am|gear|carb|\n",
      "+-----------+----+---+-----+---+----+----+-----+---+---+----+----+\n",
      "| Duster 360|14.3|  8|360.0|245|3.21|3.57|15.84|  0|  0|   3|   4|\n",
      "|  Merc 280C|17.8|  6|167.6|123|3.92|3.44| 18.9|  1|  0|   4|   4|\n",
      "| Merc 450SE|16.4|  8|275.8|180|3.07|4.07| 17.4|  0|  0|   3|   3|\n",
      "| Merc 450SL|17.3|  8|275.8|180|3.07|3.73| 17.6|  0|  0|   3|   3|\n",
      "|Merc 450SLC|15.2|  8|275.8|180|3.07|3.78| 18.0|  0|  0|   3|   3|\n",
      "+-----------+----+---+-----+---+----+----+-----+---+---+----+----+\n",
      "only showing top 5 rows\n",
      "\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                                \r"
     ]
    }
   ],
   "source": [
    "sdf.filter(sdf['mpg'] < 18).show(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Operating on Columns\n",
    "\n",
    "Spark also provides a number of functions that can be directly applied to columns for data processing and aggregation. The example below shows the use of basic arithmetic functions to convert the weight values from `lb` to `metric ton`.\n",
    "We create a new column called `wtTon` that has the weight from the `wt` column converted to metric tons.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+-----------------+----+---+-----+---+----+-----+-----+---+---+----+----+-------+\n",
      "|       Unnamed: 0| mpg|cyl| disp| hp|drat|   wt| qsec| vs| am|gear|carb|  wtTon|\n",
      "+-----------------+----+---+-----+---+----+-----+-----+---+---+----+----+-------+\n",
      "|        Mazda RX4|21.0|  6|160.0|110| 3.9| 2.62|16.46|  0|  1|   4|   4|  1.179|\n",
      "|    Mazda RX4 Wag|21.0|  6|160.0|110| 3.9|2.875|17.02|  0|  1|   4|   4|1.29375|\n",
      "|       Datsun 710|22.8|  4|108.0| 93|3.85| 2.32|18.61|  1|  1|   4|   1|  1.044|\n",
      "|   Hornet 4 Drive|21.4|  6|258.0|110|3.08|3.215|19.44|  1|  0|   3|   1|1.44675|\n",
      "|Hornet Sportabout|18.7|  8|360.0|175|3.15| 3.44|17.02|  0|  0|   3|   2|  1.548|\n",
      "+-----------------+----+---+-----+---+----+-----+-----+---+---+----+----+-------+\n",
      "only showing top 5 rows\n",
      "\n"
     ]
    }
   ],
   "source": [
    "sdf.withColumn('wtTon', sdf['wt'] * 0.45).show(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise 4: Grouping and Aggregation\n",
    "\n",
    "Spark DataFrames support a number of commonly used functions to aggregate data after grouping. In this example we compute the average weight of cars by their cylinders as shown below.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Stage 9:=======>                                                 (2 + 14) / 16]\r"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+---+------------------+\n",
      "|cyl|           avg(wt)|\n",
      "+---+------------------+\n",
      "|  6| 3.117142857142857|\n",
      "|  4|2.2857272727272724|\n",
      "|  8| 3.999214285714286|\n",
      "+---+------------------+\n",
      "\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                                \r"
     ]
    }
   ],
   "source": [
    "sdf.groupby(['cyl'])\\\n",
    ".agg({\"wt\": \"AVG\"})\\\n",
    ".show(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also sort the output from the aggregation to get the most common cars.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Stage 12:============================>                            (8 + 8) / 16]\r"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+---+---------+\n",
      "|cyl|count(wt)|\n",
      "+---+---------+\n",
      "|  8|       14|\n",
      "|  4|       11|\n",
      "|  6|        7|\n",
      "+---+---------+\n",
      "\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                                \r"
     ]
    }
   ],
   "source": [
    "car_counts = sdf.groupby(['cyl'])\\\n",
    ".agg({\"wt\": \"count\"})\\\n",
    ".sort(\"count(wt)\", ascending=False)\\\n",
    ".show(5)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Practice Questions\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Question 1 - DataFrame basics\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Display the first 5 rows of all cars that have atleast 5 cylinders.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+-----------------+----+---+-----+---+----+-----+-----+---+---+----+----+\n",
      "|       Unnamed: 0| mpg|cyl| disp| hp|drat|   wt| qsec| vs| am|gear|carb|\n",
      "+-----------------+----+---+-----+---+----+-----+-----+---+---+----+----+\n",
      "|        Mazda RX4|21.0|  6|160.0|110| 3.9| 2.62|16.46|  0|  1|   4|   4|\n",
      "|    Mazda RX4 Wag|21.0|  6|160.0|110| 3.9|2.875|17.02|  0|  1|   4|   4|\n",
      "|   Hornet 4 Drive|21.4|  6|258.0|110|3.08|3.215|19.44|  1|  0|   3|   1|\n",
      "|Hornet Sportabout|18.7|  8|360.0|175|3.15| 3.44|17.02|  0|  0|   3|   2|\n",
      "|          Valiant|18.1|  6|225.0|105|2.76| 3.46|20.22|  1|  0|   3|   1|\n",
      "+-----------------+----+---+-----+---+----+-----+-----+---+---+----+----+\n",
      "only showing top 5 rows\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# Code block for learners to answer\n",
    "sdf.filter(sdf['cyl'] >= 5).show(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Question 2 - DataFrame aggregation\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using the functions and tables shown above, print out the mean weight of a car in our database in metric tons.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Stage 42:>                                                       (0 + 16) / 16]\r"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+----------+\n",
      "|avg(wtTon)|\n",
      "+----------+\n",
      "| 1.4477625|\n",
      "+----------+\n",
      "\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                                \r"
     ]
    }
   ],
   "source": [
    "# Code block for learners to answer\n",
    "#sdf.withColumn('wtTon', sdf['wt'] * 0.45).show(5)\n",
    "#sdf.agg({\"wtTon\": \"AVG\"}).show()\n",
    "sdf.withColumn('wtTon', sdf['wt'] * 0.45).agg({\"wtTon\": \"AVG\"}).show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Question 3 - DataFrame columnar operations\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the earlier sections of this notebook, we have created a new column called `wtTon` to indicate the weight in metric tons using a standard conversion formula. In this case we have applied this directly to the dataframe column `wt` as it is a linear operation (multiply by 0.45). Similarly, as part of this exercise, create a new column for mileage in `kmpl` (kilometer-per-liter) instead of `mpg`(miles-per-gallon) by using a conversion factor of 0.425.\n",
    "\n",
    "Additionally sort the output in decreasing order of mileage in kmpl.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Stage 47:========================>                                (7 + 9) / 16]\r"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+--------------+----+---+----+---+----+-----+-----+---+---+----+----+------------------+\n",
      "|    Unnamed: 0| mpg|cyl|disp| hp|drat|   wt| qsec| vs| am|gear|carb|              kmpl|\n",
      "+--------------+----+---+----+---+----+-----+-----+---+---+----+----+------------------+\n",
      "|Toyota Corolla|33.9|  4|71.1| 65|4.22|1.835| 19.9|  1|  1|   4|   1|14.407499999999999|\n",
      "|      Fiat 128|32.4|  4|78.7| 66|4.08|  2.2|19.47|  1|  1|   4|   1|             13.77|\n",
      "|  Lotus Europa|30.4|  4|95.1|113|3.77|1.513| 16.9|  1|  1|   5|   2|             12.92|\n",
      "|   Honda Civic|30.4|  4|75.7| 52|4.93|1.615|18.52|  1|  1|   4|   2|             12.92|\n",
      "|     Fiat X1-9|27.3|  4|79.0| 66|4.08|1.935| 18.9|  1|  1|   4|   1|           11.6025|\n",
      "+--------------+----+---+----+---+----+-----+-----+---+---+----+----+------------------+\n",
      "only showing top 5 rows\n",
      "\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                                \r"
     ]
    }
   ],
   "source": [
    "# Code block for learners to answer\n",
    "#sdf.withColumn('kmpl', sdf['mpg'] * 0.425).show(5)\n",
    "sdf.withColumn('kmpl', sdf['mpg'] * 0.425).sort(\"kmpl\", ascending=False).show(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Double-click **here** for a hint.\n",
    "\n",
    "<!-- The hint is below:\n",
    "\n",
    "1. Use the functions `withColumn()` to create a new column with a linear operation of an existing column. \n",
    "2. Use the `sort()` function to order results.\n",
    "\n",
    "-->\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Double-click **here** for the solution.\n",
    "\n",
    "<!-- The answer is below:\n",
    "\n",
    "sdf.withColumn('kmpl', sdf['mpg'] * 0.425).sort('mpg', ascending=False).show()\n",
    "-->\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Authors\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[Karthik Muthuraman](https://www.linkedin.com/in/karthik-muthuraman/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMBD0225ENSkillsNetwork25716109-2021-01-01)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Other Contributors\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[Jerome Nilmeier](https://github.com/nilmeier)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Change Log\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "| Date (YYYY-MM-DD) | Version | Changed By | Change Description |\n",
    "| ----------------- | ------- | ---------- | ------------------ |\n",
    "| 2021-07-02        | 0.2     | Karthik    | Beta launch        |\n",
    "| 2021-06-30        | 0.1     | Karthik    | First Draft        |\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Copyright © 2021 IBM Corporation. All rights reserved.\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python",
   "language": "python",
   "name": "conda-env-python-py"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}