Coursera - Big Data Intro with Spark
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<center>\n",
" <img src=\"https://gitlab.com/ibm/skills-network/courses/placeholder101/-/raw/master/labs/module%201/images/IDSNlogo.png\" width=\"300\" alt=\"cognitiveclass.ai logo\" />\n",
"</center>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# **Getting Started With Spark using Python**\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Estimated time needed: **15** minutes\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![](http://spark.apache.org/images/spark-logo.png)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### The Python API\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Spark is written in Scala, which compiles to Java bytecode, but you can write python code to communicate to the java virtual machine through a library called py4j. Python has the richest API, but it can be somewhat limiting if you need to use a method that is not available, or if you need to write a specialized piece of code. The latency associated with communicating back and forth to the JVM can sometimes cause the code to run slower.\n",
"An exception to this is the SparkSQL library, which has an execution planning engine that precompiles the queries. Even with this optimization, there are cases where the code may run slower than the native scala version.\n",
"The general recommendation for PySpark code is to use the \"out of the box\" methods available as much as possible and avoid overly frequent (iterative) calls to Spark methods. If you need to write high-performance or specialized code, try doing it in scala.\n",
"But hey, we know Python rules, and the plotting libraries are way better. So, it's up to you!\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Objectives\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this lab, we will go over the basics of Apache Spark and PySpark. We will start with creating the SparkContext and SparkSession. We then create an RDD and apply some basic transformations and actions. Finally we demonstrate the basics dataframes and SparkSQL.\n",
"\n",
"After this lab you will be able to:\n",
"\n",
"* Create the SparkContext and SparkSession\n",
"* Create an RDD and apply some basic transformations and actions to RDDs\n",
"* Demonstrate the use of the basics Dataframes and SparkSQL\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"***\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For this lab, we are going to be using Python and Spark (PySpark). These libraries should be installed in your lab environment or in SN Labs.\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Requirement already satisfied: pyspark in /home/jupyterlab/conda/envs/python/lib/python3.7/site-packages (3.2.0)\n",
"Requirement already satisfied: py4j==0.10.9.2 in /home/jupyterlab/conda/envs/python/lib/python3.7/site-packages (from pyspark) (0.10.9.2)\n",
"Requirement already satisfied: findspark in /home/jupyterlab/conda/envs/python/lib/python3.7/site-packages (2.0.0)\n"
]
}
],
"source": [
"# Installing required packages\n",
"!pip install pyspark\n",
"!pip install findspark"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import findspark\n",
"findspark.init()"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"# PySpark is the Spark API for Python. In this lab, we use PySpark to initialize the spark context. \n",
"from pyspark import SparkContext, SparkConf\n",
"from pyspark.sql import SparkSession"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Exercise 1 - Spark Context and Spark Session\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this exercise, you will create the Spark Context and initialize the Spark session needed for SparkSQL and DataFrames.\n",
"SparkContext is the entry point for Spark applications and contains functions to create RDDs such as `parallelize()`. SparkSession is needed for SparkSQL and DataFrame operations.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Task 1: Creating the spark session and context\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"SLF4J: Class path contains multiple SLF4J bindings.\n",
"SLF4J: Found binding in [jar:file:/home/jupyterlab/conda/envs/python/lib/python3.7/site-packages/pyspark/jars/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]\n",
"SLF4J: Found binding in [jar:file:/home/jupyterlab/hadoop-2.9.2/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]\n",
"SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.\n",
"SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]\n",
"Setting default log level to \"WARN\".\n",
"To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\n",
"22/01/20 04:36:11 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\n"
]
}
],
"source": [
"# Creating a spark context class\n",
"sc = SparkContext()\n",
"\n",
"# Creating a spark session\n",
"spark = SparkSession \\\n",
" .builder \\\n",
" .appName(\"Python Spark DataFrames basic example\") \\\n",
" .config(\"spark.some.config.option\", \"some-value\") \\\n",
" .getOrCreate()"
]
},
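{
"cell_type": "markdown",
"metadata": {},
"source": [
"*(Added note, not part of the original lab.)* The `SparkSession` built above wraps the active `SparkContext`, so `spark.sparkContext` refers to the same context created with `SparkContext()`. A minimal check:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional check (added): the SparkSession wraps the active SparkContext\n",
"print(spark.version)\n",
"print(spark.sparkContext.appName)"
]
},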
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Task 2: Initialize Spark session\n",
"\n",
"To work with dataframes we just need to verify that the spark session instance has been created.\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
" <div>\n",
" <p><b>SparkSession - in-memory</b></p>\n",
" \n",
" <div>\n",
" <p><b>SparkContext</b></p>\n",
"\n",
" <p><a href=\"http://jupyterlab-grace123:4040\">Spark UI</a></p>\n",
"\n",
" <dl>\n",
" <dt>Version</dt>\n",
" <dd><code>v3.2.0</code></dd>\n",
" <dt>Master</dt>\n",
" <dd><code>local[*]</code></dd>\n",
" <dt>AppName</dt>\n",
" <dd><code>pyspark-shell</code></dd>\n",
" </dl>\n",
" </div>\n",
" \n",
" </div>\n",
" "
],
"text/plain": [
"<pyspark.sql.session.SparkSession at 0x7f6f2996a210>"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"spark"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Exercise 2: RDDs\n",
"\n",
"In this exercise we work with Resilient Distributed Datasets (RDDs). RDDs are Spark's primitive data abstraction and we use concepts from functional programming to create and manipulate RDDs.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Task 1: Create an RDD.\n",
"\n",
"For demonstration purposes, we create an RDD here by calling `sc.parallelize()`\\\n",
"We create an RDD which has integers from 1 to 30.\n"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1\n"
]
},
{
"data": {
"text/plain": [
"PythonRDD[1] at RDD at PythonRDD.scala:53"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data = range(1,30)\n",
"# print first element of iterator\n",
"print(data[0])\n",
"len(data)\n",
"xrangeRDD = sc.parallelize(data, 4)\n",
"\n",
"# this will let us know that we created an RDD\n",
"xrangeRDD"
]
},
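{
"cell_type": "markdown",
"metadata": {},
"source": [
"*(Added for illustration.)* You can peek at the new RDD without collecting all of it: `take()` returns the first few elements and `getNumPartitions()` confirms how the data was split across partitions.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional inspection (added): peek at the RDD without collecting everything\n",
"print(xrangeRDD.take(5))             # first five elements\n",
"print(xrangeRDD.getNumPartitions())  # 4, as requested in parallelize()"
]
},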
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Task 2: Transformations\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A transformation is an operation on an RDD that results in a new RDD. The transformed RDD is generated rapidly because the new RDD is lazily evaluated, which means that the calculation is not carried out when the new RDD is generated. The RDD will contain a series of transformations, or computation instructions, that will only be carried out when an action is called. In this transformation, we reduce each element in the RDD by 1. Note the use of the lambda function. We also then filter the RDD to only contain elements <10.\n"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"subRDD = xrangeRDD.map(lambda x: x-1)\n",
"filteredRDD = subRDD.filter(lambda x : x<10)\n"
]
},
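{
"cell_type": "markdown",
"metadata": {},
"source": [
"*(Added for illustration.)* Because transformations are lazy, the `map()` and `filter()` calls above have not triggered any computation yet. `toDebugString()` shows the lineage of transformations Spark has recorded so far.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional: inspect the lineage Spark has recorded; nothing has been computed yet\n",
"lineage = filteredRDD.toDebugString()\n",
"# some PySpark versions return bytes here, so decode if needed\n",
"print(lineage.decode(\"utf-8\") if isinstance(lineage, bytes) else lineage)"
]
},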
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Task 3: Actions\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A transformation returns a result to the driver. We now apply the `collect()` action to get the output from the transformation.\n"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
" \r"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]\n"
]
},
{
"data": {
"text/plain": [
"10"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print(filteredRDD.collect())\n",
"filteredRDD.count()"
]
},
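{
"cell_type": "markdown",
"metadata": {},
"source": [
"*(Added for illustration.)* Besides `collect()` and `count()`, other common actions include `first()`, `take()` and `reduce()`; the sketch below applies them to the same RDD.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A few more common actions (illustrative additions)\n",
"print(filteredRDD.first())                     # first element\n",
"print(filteredRDD.take(3))                     # first three elements\n",
"print(filteredRDD.reduce(lambda a, b: a + b))  # sum of 0..9 = 45"
]
},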
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Task 4: Caching Data\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This simple example shows how to create an RDD and cache it. Notice the **10x speed improvement**! If you wish to see the actual computation time, browse to the Spark UI...it's at host:4040. You'll see that the second calculation took much less time!\n"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
" \r"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"dt1: 1.3594059944152832\n",
"dt2: 0.260051965713501\n"
]
}
],
"source": [
"import time \n",
"\n",
"test = sc.parallelize(range(1,50000),4)\n",
"test.cache()\n",
"\n",
"t1 = time.time()\n",
"# first count will trigger evaluation of count *and* cache\n",
"count1 = test.count()\n",
"dt1 = time.time() - t1\n",
"print(\"dt1: \", dt1)\n",
"\n",
"\n",
"t2 = time.time()\n",
"# second count operates on cached data only\n",
"count2 = test.count()\n",
"dt2 = time.time() - t2\n",
"print(\"dt2: \", dt2)\n",
"\n",
"#test.count()"
]
},
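{
"cell_type": "markdown",
"metadata": {},
"source": [
"*(Added for illustration.)* You can check how the RDD is cached with `getStorageLevel()` and release the cached partitions with `unpersist()` once they are no longer needed.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional: inspect and release the cache created above\n",
"print(test.getStorageLevel())  # storage level set by cache()\n",
"test.unpersist()               # release the cached partitions"
]
},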
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Exercise 3: DataFrames and SparkSQL\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In order to work with the extremely powerful SQL engine in Apache Spark, you will need a Spark Session. We have created that in the first Exercise, let us verify that spark session is still active.\n"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
" <div>\n",
" <p><b>SparkSession - in-memory</b></p>\n",
" \n",
" <div>\n",
" <p><b>SparkContext</b></p>\n",
"\n",
" <p><a href=\"http://jupyterlab-grace123:4040\">Spark UI</a></p>\n",
"\n",
" <dl>\n",
" <dt>Version</dt>\n",
" <dd><code>v3.2.0</code></dd>\n",
" <dt>Master</dt>\n",
" <dd><code>local[*]</code></dd>\n",
" <dt>AppName</dt>\n",
" <dd><code>pyspark-shell</code></dd>\n",
" </dl>\n",
" </div>\n",
" \n",
" </div>\n",
" "
],
"text/plain": [
"<pyspark.sql.session.SparkSession at 0x7f6f2996a210>"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"spark"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Task 1: Create Your First DataFrame!\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can create a structured data set (much like a database table) in Spark. Once you have done that, you can then use powerful SQL tools to query and join your dataframes.\n"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" % Total % Received % Xferd Average Speed Time Time Time Current\n",
" Dload Upload Total Spent Left Speed\n",
"100 73 100 73 0 0 695 0 --:--:-- --:--:-- --:--:-- 695\n"
]
}
],
"source": [
"# Download the data first into a local `people.json` file\n",
"!curl https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0225EN-SkillsNetwork/labs/data/people.json >> people.json"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
" \r"
]
}
],
"source": [
"# Read the dataset into a spark dataframe using the `read.json()` function\n",
"df = spark.read.json(\"people.json\").cache()"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
" \r"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"+----+-------+\n",
"| age| name|\n",
"+----+-------+\n",
"|null|Michael|\n",
"| 30| Andy|\n",
"| 19| Justin|\n",
"|null|Michael|\n",
"| 30| Andy|\n",
"| 19| Justin|\n",
"|null|Michael|\n",
"| 30| Andy|\n",
"| 19| Justin|\n",
"|null|Michael|\n",
"| 30| Andy|\n",
"| 19| Justin|\n",
"|null|Michael|\n",
"| 30| Andy|\n",
"| 19| Justin|\n",
"+----+-------+\n",
"\n",
"root\n",
" |-- age: long (nullable = true)\n",
" |-- name: string (nullable = true)\n",
"\n"
]
}
],
"source": [
"# Print the dataframe as well as the data schema\n",
"df.show()\n",
"df.printSchema()"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"# Register the DataFrame as a SQL temporary view\n",
"df.createTempView(\"people\")"
]
},
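{
"cell_type": "markdown",
"metadata": {},
"source": [
"*(Added note.)* `createTempView()` raises an error if a view with the same name already exists (for example, when the cell is re-run). `createOrReplaceTempView()` overwrites the view instead, which makes the cell safe to re-run.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Safe to re-run: replaces the temporary view if it already exists\n",
"df.createOrReplaceTempView(\"people\")"
]
},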
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Task 2: Explore the data using DataFrame functions and SparkSQL\n",
"\n",
"In this section, we explore the datasets using functions both from dataframes as well as corresponding SQL queries using sparksql. Note the different ways to achieve the same task!\n"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+-------+\n",
"| name|\n",
"+-------+\n",
"|Michael|\n",
"| Andy|\n",
"| Justin|\n",
"|Michael|\n",
"| Andy|\n",
"| Justin|\n",
"|Michael|\n",
"| Andy|\n",
"| Justin|\n",
"|Michael|\n",
"| Andy|\n",
"| Justin|\n",
"|Michael|\n",
"| Andy|\n",
"| Justin|\n",
"+-------+\n",
"\n",
"+-------+\n",
"| name|\n",
"+-------+\n",
"|Michael|\n",
"| Andy|\n",
"| Justin|\n",
"|Michael|\n",
"| Andy|\n",
"| Justin|\n",
"|Michael|\n",
"| Andy|\n",
"| Justin|\n",
"|Michael|\n",
"| Andy|\n",
"| Justin|\n",
"|Michael|\n",
"| Andy|\n",
"| Justin|\n",
"+-------+\n",
"\n",
"+-------+\n",
"| name|\n",
"+-------+\n",
"|Michael|\n",
"| Andy|\n",
"| Justin|\n",
"|Michael|\n",
"| Andy|\n",
"| Justin|\n",
"|Michael|\n",
"| Andy|\n",
"| Justin|\n",
"|Michael|\n",
"| Andy|\n",
"| Justin|\n",
"|Michael|\n",
"| Andy|\n",
"| Justin|\n",
"+-------+\n",
"\n"
]
}
],
"source": [
"# Select and show basic data columns\n",
"\n",
"df.select(\"name\").show()\n",
"df.select(df[\"name\"]).show()\n",
"spark.sql(\"SELECT name FROM people\").show()"
]
},
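{
"cell_type": "markdown",
"metadata": {},
"source": [
"*(Added for illustration.)* `select()` also accepts simple column expressions; the sketch below shows the `name` column alongside `age + 1`.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Select a column together with a derived expression (illustrative addition)\n",
"df.select(df[\"name\"], df[\"age\"] + 1).show(5)"
]
},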
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+---+----+\n",
"|age|name|\n",
"+---+----+\n",
"| 30|Andy|\n",
"| 30|Andy|\n",
"| 30|Andy|\n",
"| 30|Andy|\n",
"| 30|Andy|\n",
"+---+----+\n",
"\n",
"+---+----+\n",
"|age|name|\n",
"+---+----+\n",
"| 30|Andy|\n",
"| 30|Andy|\n",
"| 30|Andy|\n",
"| 30|Andy|\n",
"| 30|Andy|\n",
"+---+----+\n",
"\n"
]
}
],
"source": [
"# Perform basic filtering\n",
"\n",
"df.filter(df[\"age\"] > 21).show()\n",
"spark.sql(\"SELECT age, name FROM people WHERE age > 21\").show()"
]
},
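{
"cell_type": "markdown",
"metadata": {},
"source": [
"*(Added for illustration.)* Filters can combine several conditions; in the DataFrame API each condition is wrapped in parentheses and joined with `&` or `|`.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Combine two filter conditions (illustrative addition)\n",
"df.filter((df[\"age\"] > 21) & (df[\"age\"] < 40)).show(5)\n",
"spark.sql(\"SELECT age, name FROM people WHERE age > 21 AND age < 40\").show(5)"
]
},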
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
" \r"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"+----+-----+\n",
"| age|count|\n",
"+----+-----+\n",
"| 19| 5|\n",
"| 30| 5|\n",
"|null| 5|\n",
"+----+-----+\n",
"\n",
"+----+-----+\n",
"| age|count|\n",
"+----+-----+\n",
"| 19| 5|\n",
"| 30| 5|\n",
"|null| 0|\n",
"+----+-----+\n",
"\n"
]
}
],
"source": [
"# Perfom basic aggregation of data\n",
"\n",
"df.groupBy(\"age\").count().show()\n",
"spark.sql(\"SELECT age, COUNT(age) as count FROM people GROUP BY age\").show()"
]
},
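{
"cell_type": "markdown",
"metadata": {},
"source": [
"*(Added note.)* The two results above differ for the `null` group because SQL's `COUNT(age)` ignores nulls, whereas the DataFrame `count()` counts rows. Using `COUNT(*)` in the SQL query matches the DataFrame result.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# COUNT(*) counts rows (including those with null age), matching df.groupBy(\"age\").count()\n",
"spark.sql(\"SELECT age, COUNT(*) AS count FROM people GROUP BY age\").show()"
]
},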
{
"cell_type": "markdown",
"metadata": {},
"source": [
"***\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Question 1 - RDDs\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create an RDD with integers from 1-50. Apply a transformation to multiply every number by 2, resulting in an RDD that contains the first 50 even numbers.\n"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"# starter code\n",
"# numbers = range(1, 50)\n",
"# numbers_RDD = ...\n",
"# even_numbers_RDD = numbers_RDD.map(lambda x: ..)"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"# Code block for learners to answer\n",
"numbers = range(1, 50)\n",
"numbers_RDD = sc.parallelize(numbers, 4)\n",
"even_numbers_RDD = numbers_RDD.map(lambda x:(x%2==0))"
]
},
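{
"cell_type": "markdown",
"metadata": {},
"source": [
"*(Added for illustration.)* A quick check of the answer above: the first few elements should be the first even numbers, and there should be 50 of them.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional check of the answer above (added for illustration)\n",
"print(even_numbers_RDD.take(5))   # expected: [2, 4, 6, 8, 10]\n",
"print(even_numbers_RDD.count())   # expected: 50"
]
},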
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Question 2 - DataFrames and SparkSQL\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Similar to the `people.json` file, now read the `people2.json` file into the notebook, load it into a dataframe and apply SQL operations to determine the average age in our people2 file.\n"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"# starter code\n",
"# !curl https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0225EN-SkillsNetwork/labs/data/people2.json >> people2.json\n",
"# df = spark.read...\n",
"# df.createTempView..\n",
"# spark.sql(\"SELECT ...\")"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" % Total % Received % Xferd Average Speed Time Time Time Current\n",
" Dload Upload Total Spent Left Speed\n",
"100 136 100 136 0 0 358 0 --:--:-- --:--:-- --:--:-- 359\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"22/01/20 04:40:43 WARN execution.CacheManager: Asked to cache already cached data.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"+--------+\n",
"|avg(age)|\n",
"+--------+\n",
"| 24.75|\n",
"+--------+\n",
"\n",
"+--------+\n",
"|avg(age)|\n",
"+--------+\n",
"| 24.75|\n",
"+--------+\n",
"\n"
]
}
],
"source": [
"# Code block for learners to answer\n",
"!curl https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0225EN-SkillsNetwork/labs/people2.json >> people2.json\n",
"df2 = spark.read.json(\"people2.json\").cache()\n",
"#df2.createTempView(\"people2\")\n",
"#dataframe\n",
"df2.agg({\"age\": \"avg\"}).show()\n",
"#sql stmt\n",
"spark.sql(\"SELECT AVG(age) from people2\").show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Double-click **here** for a hint.\n",
"\n",
"<!-- The hint is below:\n",
"\n",
"1. The SQL query \"Select AVG(column_name) from..\" can be used to find the average value of a column. \n",
"2. Another possible way is to use the dataframe operations select() and mean()\n",
"-->\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Double-click **here** for the solution.\n",
"\n",
"<!-- The answer is below:\n",
"df = spark.read('people2.json')\n",
"df.createTempView(\"people2\")\n",
"spark.sql(\"SELECT AVG(age) from people2\")\n",
"-->\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Question 3 - SparkSession\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Close the SparkSession we created for this notebook\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Code block for learners to answer\n",
"spark.stop()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Double-click **here** for the solution.\n",
"\n",
"<!-- The answer is below:\n",
"\n",
"spark.stop() will stop the spark session\n",
"\n",
"-->\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Authors\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[Karthik Muthuraman](https://www.linkedin.com/in/karthik-muthuraman/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMBD0225ENSkillsNetwork25716109-2021-01-01)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Other Contributors\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[Jerome Nilmeier](https://github.com/nilmeier)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Change Log\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"| Date (YYYY-MM-DD) | Version | Changed By | Change Description |\n",
"| ----------------- | ------- | ---------- | ------------------ |\n",
"| 2021-07-02 | 0.2 | Karthik | Beta launch |\n",
"| 2021-06-30 | 0.1 | Karthik | First Draft |\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Copyright © 2021 IBM Corporation. All rights reserved.\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python",
"language": "python",
"name": "conda-env-python-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.12"
}
},
"nbformat": 4,
"nbformat_minor": 4
}