@marcelomgarcia
Created December 27, 2022 13:51
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": "<p style=\"text-align:center\">\n <a href=\"https://skills.network/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMBD0231ENSkillsNetwork26766988-2022-01-01\" target=\"_blank\">\n <img src=\"https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png\" width=\"200\" alt=\"Skills Network Logo\" />\n </a>\n</p>\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "# **Machine Learning with Apache Spark ML**\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Estimated time needed: **15** minutes\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "This lab introduces machine learning with the Spark ML library (sparkml).\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "![](http://spark.apache.org/images/spark-logo.png)\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "## Objectives\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Spark ML Library is also commonly called MLlib and is used to perform machine learning operations using DataFrame-based APIs.\n\nAfter completing this lab you will be able to:\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "* Import the Spark ML and Statistics Libraries\n* Perform basic statistics operations using Spark\n* Build a simple linear regression model using Spark ML\n* Train the model and perform evaluation\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "***\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "## Setup\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "For this lab, we are going to be using Python and Spark (pyspark). These libraries should be installed in your lab environment or in SN Labs.\n"
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": "Collecting pyspark==3.1.2\n Downloading pyspark-3.1.2.tar.gz (212.4 MB)\nCollecting py4j==0.10.9\n Downloading py4j-0.10.9-py2.py3-none-any.whl (198 kB)\nBuilding wheels for collected packages: pyspark\n Building wheel for pyspark (setup.py) ... done\n Created wheel for pyspark: filename=pyspark-3.1.2-py2.py3-none-any.whl size=212880768 sha256=afe4be2e4862afde64faaaf10028c24b21ea6144eb8bcb808d19db8b1eb93272\n Stored in directory: /home/spark/shared/.cache/pip/wheels/11/17/0b/53e7d10fe66ca7647d391cdba323fcf5b2f9dfcb7ebad87aa7\nSuccessfully built pyspark\nInstalling collected packages: py4j, pyspark\n Attempting uninstall: py4j\n Found existing installation: py4j 0.10.9.5\n Uninstalling py4j-0.10.9.5:\nERROR: Could not install packages due to an OSError: [Errno 13] Permission denied: 'LICENSE.txt'\nConsider using the `--user` option or check the permissions.\n\nWARNING: Ignoring invalid distribution -y4j (/opt/ibm/conda/miniconda3.9/lib/python3.9/site-packages)\nWARNING: Ignoring invalid distribution -y4j (/opt/ibm/conda/miniconda3.9/lib/python3.9/site-packages)\nCollecting findspark\n Downloading findspark-2.0.1-py2.py3-none-any.whl (4.4 kB)\nWARNING: Ignoring invalid distribution -y4j (/opt/ibm/conda/miniconda3.9/lib/python3.9/site-packages)\nInstalling collected packages: findspark\nWARNING: Ignoring invalid distribution -y4j (/opt/ibm/conda/miniconda3.9/lib/python3.9/site-packages)\nSuccessfully installed findspark-2.0.1\nWARNING: Ignoring invalid distribution -y4j (/opt/ibm/conda/miniconda3.9/lib/python3.9/site-packages)\nWARNING: Ignoring invalid distribution -y4j (/opt/ibm/conda/miniconda3.9/lib/python3.9/site-packages)\nWARNING: Ignoring invalid distribution -y4j (/opt/ibm/conda/miniconda3.9/lib/python3.9/site-packages)\n"
}
],
"source": "# On SN Labs, run the install commands below, then restart the kernel and run all cells again.\n!pip3 install pyspark==3.1.2\n!pip install findspark\nimport findspark\nfindspark.init()"
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": "# pandas is a popular data science package for Python. In this lab, we use pandas to load a CSV file into an in-memory pandas DataFrame.\nimport pandas as pd\n# matplotlib is used to plot the features.\nimport matplotlib.pyplot as plt\n# pyspark is the Spark API for Python. In this lab, we use pyspark to initialize the Spark context and session.\nfrom pyspark import SparkContext, SparkConf\nfrom pyspark.sql import SparkSession"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "## Exercise 1 - Spark session\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "In this exercise, you will create and initialize the Spark session needed to load the DataFrames and operate on them.\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "#### Task 1: Creating the spark session and context\n"
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": "# Reuse the running Spark context if one exists (for example, one started by the\n# notebook kernel); otherwise create a new one. Calling SparkContext() directly\n# raises a ValueError when a context is already running.\nsc = SparkContext.getOrCreate()\n\n# Creating a spark session\nspark = SparkSession \\\n .builder \\\n .appName(\"Python Spark DataFrames basic example\") \\\n .config(\"spark.some.config.option\", \"some-value\") \\\n .getOrCreate()"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "#### Task 2: Initialize Spark session\n\nTo work with dataframes we just need to verify that the spark session instance has been created.\nFeel free to click on the \"Spark UI\" button to explore the Spark UI elements.\n"
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": "\n <div>\n <p><b>SparkSession - in-memory</b></p>\n \n <div>\n <p><b>SparkContext</b></p>\n\n <p><a href=\"http://172.30.156.29:4040\">Spark UI</a></p>\n\n <dl>\n <dt>Version</dt>\n <dd><code>v3.3.1</code></dd>\n <dt>Master</dt>\n <dd><code>spark://jkg-deployment-807a44de-60ae-4899-9de2-8b31027e7e58-869959lmjkb:7077</code></dd>\n <dt>AppName</dt>\n <dd><code>python3.9</code></dd>\n </dl>\n </div>\n \n </div>\n ",
"text/plain": "<pyspark.sql.session.SparkSession at 0x7f191c5597f0>"
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": "spark"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "#### Task 3: Importing Spark ML libraries\n\nIn this task we import five Spark ML classes.\n\n1. (Feature library) `VectorAssembler`: combines DataFrame columns into a single feature vector column. Feature vectors are required to train an ML model or to compute statistics with Spark ML.\n2. (Stat library) `Correlation`: computes the correlation matrix for a column of feature vectors.\n3. (Feature library) `Normalizer`: rescales each feature vector (row) to unit norm. Normalized features often lead to better model convergence and training results.\n4. (Feature library) `StandardScaler`: scales each feature (column) using statistics computed from the training data.\n5. (Regression library) `LinearRegression`: creates a linear regression model that can then be trained.\n"
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": "from pyspark.ml.feature import VectorAssembler, Normalizer, StandardScaler\nfrom pyspark.ml.stat import Correlation\nfrom pyspark.ml.regression import LinearRegression"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "## Exercise 2 - Loading the data and Creating Feature Vectors\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "In this section, you will first read the CSV file into a pandas DataFrame and then load it into a Spark DataFrame.\n\npandas is a library for data manipulation and analysis. It offers data structures and operations for creating and manipulating Series and DataFrame objects. Data can be imported from various sources, e.g., NumPy arrays, Python dictionaries, and CSV files. pandas lets you manipulate, organize, and display the data.\n\nIn this example we use a dataset that contains information about cars.\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "#### Task 1: Loading data into a Pandas DataFrame\n"
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": "# Read the file using `read_csv` function in pandas\ncars = pd.read_csv('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0225EN-SkillsNetwork/labs/data/cars.csv')"
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>mpg</th>\n <th>cylinders</th>\n <th>displacement</th>\n <th>horsepower</th>\n <th>weight</th>\n <th>acceleration</th>\n <th>model</th>\n <th>origin</th>\n <th>car_name</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>18.0</td>\n <td>8</td>\n <td>307.0</td>\n <td>130.0</td>\n <td>3504.0</td>\n <td>12.0</td>\n <td>70</td>\n <td>1</td>\n <td>chevrolet chevelle malibu</td>\n </tr>\n <tr>\n <th>1</th>\n <td>15.0</td>\n <td>8</td>\n <td>350.0</td>\n <td>165.0</td>\n <td>3693.0</td>\n <td>11.5</td>\n <td>70</td>\n <td>1</td>\n <td>buick skylark 320</td>\n </tr>\n <tr>\n <th>2</th>\n <td>18.0</td>\n <td>8</td>\n <td>318.0</td>\n <td>150.0</td>\n <td>3436.0</td>\n <td>11.0</td>\n <td>70</td>\n <td>1</td>\n <td>plymouth satellite</td>\n </tr>\n <tr>\n <th>3</th>\n <td>16.0</td>\n <td>8</td>\n <td>304.0</td>\n <td>150.0</td>\n <td>3433.0</td>\n <td>12.0</td>\n <td>70</td>\n <td>1</td>\n <td>amc rebel sst</td>\n </tr>\n <tr>\n <th>4</th>\n <td>17.0</td>\n <td>8</td>\n <td>302.0</td>\n <td>140.0</td>\n <td>3449.0</td>\n <td>10.5</td>\n <td>70</td>\n <td>1</td>\n <td>ford torino</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " mpg cylinders displacement horsepower weight acceleration model \\\n0 18.0 8 307.0 130.0 3504.0 12.0 70 \n1 15.0 8 350.0 165.0 3693.0 11.5 70 \n2 18.0 8 318.0 150.0 3436.0 11.0 70 \n3 16.0 8 304.0 150.0 3433.0 12.0 70 \n4 17.0 8 302.0 140.0 3449.0 10.5 70 \n\n origin car_name \n0 1 chevrolet chevelle malibu \n1 1 buick skylark 320 \n2 1 plymouth satellite \n3 1 amc rebel sst \n4 1 ford torino "
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": "# Preview a few records\ncars.head()"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "For this example, we preprocess the data and use only three columns. The preprocessed dataset can be found in the `cars2.csv` file.\n"
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>mpg</th>\n <th>hp</th>\n <th>weight</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>18.0</td>\n <td>130.0</td>\n <td>3504.0</td>\n </tr>\n <tr>\n <th>1</th>\n <td>15.0</td>\n <td>165.0</td>\n <td>3693.0</td>\n </tr>\n <tr>\n <th>2</th>\n <td>18.0</td>\n <td>150.0</td>\n <td>3436.0</td>\n </tr>\n <tr>\n <th>3</th>\n <td>16.0</td>\n <td>150.0</td>\n <td>3433.0</td>\n </tr>\n <tr>\n <th>4</th>\n <td>17.0</td>\n <td>140.0</td>\n <td>3449.0</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " mpg hp weight\n0 18.0 130.0 3504.0\n1 15.0 165.0 3693.0\n2 18.0 150.0 3436.0\n3 16.0 150.0 3433.0\n4 17.0 140.0 3449.0"
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": "cars2 = pd.read_csv('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0225EN-SkillsNetwork/labs/data/cars2.csv', header=None, names=[\"mpg\", \"hp\", \"weight\"])\ncars2.head()"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "#### Task 2: Loading data into a Spark DataFrame\n"
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": "# We use the `createDataFrame` function to load the data into a spark dataframe\nsdf = spark.createDataFrame(cars2)"
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": "root\n |-- mpg: double (nullable = true)\n |-- hp: double (nullable = true)\n |-- weight: double (nullable = true)\n\n"
}
],
"source": "# Let us look at the schema of the loaded spark dataframe\nsdf.printSchema()"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "#### Task 3: Converting data frame columns into feature vectors\n\nIn this task we use the `VectorAssembler` transformer to combine DataFrame columns into a feature vector.\nFor our example, we use the horsepower (\"hp\") and weight of the car as input features and the miles-per-gallon (\"mpg\") as target labels.\n"
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": "assembler = VectorAssembler(\n inputCols=[\"hp\", \"weight\"],\n outputCol=\"features\")\n\noutput = assembler.transform(sdf).select('features','mpg')"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "We now create a train-test split of 75%-25%.\n"
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": "train, test = output.randomSplit([0.75, 0.25])"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "## Exercise 3 - Basic stats and feature engineering\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "In this exercise, we determine the correlation between the features and then normalize and scale the feature vectors.\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "#### Task 1: Correlation\n\nSpark ML has a built-in `Correlation` class as part of the Stat library. We use it to compute the Pearson and Spearman correlations between the two features, \"hp\" and \"weight\".\n"
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": "Pearson correlation matrix:\nDenseMatrix([[1. , 0.86712711],\n [0.86712711, 1. ]])\n"
}
],
"source": "r1 = Correlation.corr(train, \"features\").head()\nprint(\"Pearson correlation matrix:\\n\" + str(r1[0]))"
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": "Spearman correlation matrix:\nDenseMatrix([[1. , 0.89651913],\n [0.89651913, 1. ]])\n"
}
],
"source": "r2 = Correlation.corr(train, \"features\", \"spearman\").head()\nprint(\"Spearman correlation matrix:\\n\" + str(r2[0]))"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "We can see a Pearson correlation of about 0.87 between the two features, which indicates a strong positive linear relationship. That is plausible, as a car with higher horsepower likely has a bigger engine and thus weighs more. We can also plot the two features to see that they are indeed correlated.\n"
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": "<Figure size 432x288 with 1 Axes>"
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": "plt.figure()\nplt.scatter(cars2[\"hp\"], cars2[\"weight\"])\nplt.xlabel(\"horsepower\")\nplt.ylabel(\"weight\")\nplt.title(\"Correlation between Horsepower and Weight\")\nplt.show()"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "#### Task 2: Normalization\n\nTo improve model training and convergence, it is good practice to normalize feature vectors. The `Normalizer` below rescales each feature vector to unit L1 norm (`p=1.0`).\n"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "normalizer = Normalizer(inputCol=\"features\", outputCol=\"features_normalized\", p=1.0)\ntrain_norm = normalizer.transform(train)\nprint(\"Normalized using L^1 norm\")\ntrain_norm.show(5, truncate=False)"
},
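{
"cell_type": "markdown",
"metadata": {},
"source": "#### Task 3: Standard Scaling\n\nIt is standard practice to scale features so that every column has zero mean and unit variance. Note that Spark's `StandardScaler` defaults to `withStd=True` and `withMean=False`, so the cell below scales each column to unit standard deviation without centering it; pass `withMean=True` to also center the data.\n"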
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "standard_scaler = StandardScaler(inputCol=\"features\", outputCol=\"features_scaled\")\ntrain_model = standard_scaler.fit(train)\ntrain_scaled = train_model.transform(train)\ntrain_scaled.show(5, truncate=False)"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "test_scaled = train_model.transform(test)\ntest_scaled.show(5, truncate=False)"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "## Exercise 4 - Building and Training a Linear Regression Model\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "In this exercise, we train a Linear Regression model `lrModel` on our training dataset. We train the model on the standard scaled version of features.\nWe also print the final RMSE and R-Squared metrics.\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "#### Task 1: Create and Train model\n\nWe can create the model using the `LinearRegression` class and train it using the `fit()` method.\n"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "# Create a LR model\nlr = LinearRegression(featuresCol='features_scaled', labelCol='mpg', maxIter=100)\n\n# Fit the model\nlrModel = lr.fit(train_scaled)\n\n# Print the coefficients and intercept for linear regression\nprint(\"Coefficients: %s\" % str(lrModel.coefficients))\nprint(\"Intercept: %s\" % str(lrModel.intercept))\n\n# Summarize the model over the training set and print out some metrics\ntrainingSummary = lrModel.summary\n#trainingSummary.residuals.show()\nprint(\"RMSE: %f\" % trainingSummary.rootMeanSquaredError)\nprint(\"R-squared: %f\" % trainingSummary.r2)"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "We see an RMSE (root mean squared error) of about 4.26; the exact value depends on the random train-test split. Roughly speaking, the model's `mpg` predictions on the training data are off by about 4.3 mpg.\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "#### Task 2: Predict on new data\n\nOnce a model is trained, we can `transform()` new, unseen data (e.g., the test data) to generate predictions.\nIn the cell below, notice the \"prediction\" column that contains the predicted \"mpg\".\n"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "lrModel.transform(test_scaled).show(5)"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "### Question 1 - Correlation\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Print the correlation matrix for the test dataset split we created above.\n"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "# Code block for learners to answer"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Double-click **here** for the solution.\n\n<!-- The answer is below:\n\nr1 = Correlation.corr(test, \"features\").head()\nprint(\"Pearson correlation matrix:\\n\" + str(r1[0]))\n\n-->\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "### Question 2 - Feature Normalization\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Normalize the training features by using the L2 norm of the feature vector.\n"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "# Code block for learners to answer"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Double-click **here** for the solution.\n\n<!-- The answer is below:\n\nnormalizer_l2 = Normalizer(inputCol=\"features\", outputCol=\"features_normalized\", p=2.0)\ntrain_norm_l2 = normalizer_l2.transform(train)\nprint(\"Normalized using L^2 norm\\n\"+str(train_norm_l2))\ntrain_norm_l2.show(5, truncate=False)\n\n-->\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "### Question 3 - Train Model\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Repeat the model training shown above for another 100 iterations and report the coefficients.\n"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "# Code block for Question 3"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "## Authors\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "[Karthik Muthuraman](https://www.linkedin.com/in/karthik-muthuraman/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMBD0231ENSkillsNetwork26766988-2022-01-01)\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "### Other Contributors\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "[Jerome Nilmeier](https://github.com/nilmeier/)\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "## Change Log\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "| Date (YYYY-MM-DD) | Version | Changed By | Change Description |\n| ----------------- | ------- | ------------- | ----------------------- |\n| 2022-07-14 | 0.4 | Lakshmi Holla | Added code for pyspark |\n| 2021-12-22 | 0.3 | Lakshmi Holla | Made changes in scaling |\n| 2021-08-05 | 0.2 | Azim | Beta launch |\n| 2021-07-01 | 0.1 | Karthik | Initial Draft |\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Copyright \u00a9 2021 IBM Corporation. All rights reserved.\n"
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.9 with Spark",
"language": "python3",
"name": "python39"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.13"
}
},
"nbformat": 4,
"nbformat_minor": 4
}