Skip to content

Instantly share code, notes, and snippets.

@GraceLoggins
Created January 9, 2022 04:45
Show Gist options
  • Save GraceLoggins/ed3c73ea305f73135bbc52ede51949c7 to your computer and use it in GitHub Desktop.
Save GraceLoggins/ed3c73ea305f73135bbc52ede51949c7 to your computer and use it in GitHub Desktop.
Coursera - Python Project for Data Engineering - ETL
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<center>\n",
" <img src=\"https://gitlab.com/ibm/skills-network/courses/placeholder101/-/raw/master/labs/module%201/images/IDSNlogo.png\" width=\"300\" alt=\"cognitiveclass.ai logo\" />\n",
"</center>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# **Extract Transform Load (ETL) Lab**\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Estimated time needed: **30** minutes\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Objectives\n",
"\n",
"After completing this lab you will be able to:\n",
"\n",
"* Read CSV and JSON file types.\n",
"* Extract data from the above file types.\n",
"* Transform data.\n",
"* Save the transformed data in a ready-to-load format which data engineers can use to load into an RDBMS.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Import the required modules and functions\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import glob # this module helps in selecting files \n",
"import pandas as pd # this module helps in processing CSV files\n",
"import xml.etree.ElementTree as ET # this module helps in processing XML files.\n",
"from datetime import datetime"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Download Files\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"--2022-01-05 02:52:03-- https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0221EN-SkillsNetwork/labs/module%206/Lab%20-%20Extract%20Transform%20Load/data/source.zip\n",
"Resolving cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)... 169.63.118.104\n",
"Connecting to cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)|169.63.118.104|:443... connected.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: 2707 (2.6K) [application/zip]\n",
"Saving to: ‘source.zip’\n",
"\n",
"source.zip 100%[===================>] 2.64K --.-KB/s in 0s \n",
"\n",
"2022-01-05 02:52:03 (78.5 MB/s) - ‘source.zip’ saved [2707/2707]\n",
"\n"
]
}
],
"source": [
"!wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0221EN-SkillsNetwork/labs/module%206/Lab%20-%20Extract%20Transform%20Load/data/source.zip"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Unzip Files\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Archive: source.zip\n",
" inflating: source3.json \n",
" inflating: source1.csv \n",
" inflating: source2.csv \n",
" inflating: source3.csv \n",
" inflating: source1.json \n",
" inflating: source2.json \n",
" inflating: source1.xml \n",
" inflating: source2.xml \n",
" inflating: source3.xml \n"
]
}
],
"source": [
"!unzip source.zip"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Set Paths"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"tmpfile = \"temp.tmp\" # file used to store all extracted data\n",
"logfile = \"logfile.txt\" # all event logs will be stored in this file\n",
"targetfile = \"transformed_data.csv\" # file where transformed data is stored"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Extract\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### CSV Extract Function\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"def extract_from_csv(file_to_process):\n",
" dataframe = pd.read_csv(file_to_process)\n",
" return dataframe"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### JSON Extract Function\n"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"def extract_from_json(file_to_process):\n",
" dataframe = pd.read_json(file_to_process,lines=True)\n",
" return dataframe"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### XML Extract Function\n"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"def extract_from_xml(file_to_process):\n",
" dataframe = pd.DataFrame(columns=[\"name\", \"height\", \"weight\"])\n",
" tree = ET.parse(file_to_process)\n",
" root = tree.getroot()\n",
" for person in root:\n",
" name = person.find(\"name\").text\n",
" height = float(person.find(\"height\").text)\n",
" weight = float(person.find(\"weight\").text)\n",
" dataframe = dataframe.append({\"name\":name, \"height\":height, \"weight\":weight}, ignore_index=True)\n",
" return dataframe"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Extract Function\n"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"def extract():\n",
" extracted_data = pd.DataFrame(columns=['name','height','weight']) # create an empty data frame to hold extracted data\n",
" \n",
" #process all csv files\n",
" for csvfile in glob.glob(\"*.csv\"):\n",
" extracted_data = extracted_data.append(extract_from_csv(csvfile), ignore_index=True)\n",
" \n",
" #process all json files\n",
" for jsonfile in glob.glob(\"*.json\"):\n",
" extracted_data = extracted_data.append(extract_from_json(jsonfile), ignore_index=True)\n",
" \n",
" #process all xml files\n",
" for xmlfile in glob.glob(\"*.xml\"):\n",
" extracted_data = extracted_data.append(extract_from_xml(xmlfile), ignore_index=True)\n",
" \n",
" return extracted_data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Transform\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The transform function does the following tasks.\n",
"\n",
"1. Convert height which is in inches to millimeter\n",
"2. Convert weight which is in pounds to kilograms\n"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"def transform(data):\n",
" #Convert height which is in inches to millimeter\n",
" #Convert the datatype of the column into float\n",
" #data.height = data.height.astype(float)\n",
" #Convert inches to meters and round off to two decimals(one inch is 0.0254 meters)\n",
" data['height'] = round(data.height * 0.0254,2)\n",
" \n",
" #Convert weight which is in pounds to kilograms\n",
" #Convert the datatype of the column into float\n",
" #data.weight = data.weight.astype(float)\n",
" #Convert pounds to kilograms and round off to two decimals(one pound is 0.45359237 kilograms)\n",
" data['weight'] = round(data.weight * 0.45359237,2)\n",
" return data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Loading\n"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"def load(targetfile,data_to_load):\n",
" data_to_load.to_csv(targetfile)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Logging\n"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"def log(message):\n",
" timestamp_format = '%Y-%h-%d-%H:%M:%S' # Year-Monthname-Day-Hour-Minute-Second\n",
" now = datetime.now() # get current timestamp\n",
" timestamp = now.strftime(timestamp_format)\n",
" with open(\"logfile.txt\",\"a\") as f:\n",
" f.write(timestamp + ',' + message + '\\n')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Running ETL Process\n"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"log(\"ETL Job Started\")"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>name</th>\n",
" <th>height</th>\n",
" <th>weight</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>alex</td>\n",
" <td>65.78</td>\n",
" <td>112.99</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>ajay</td>\n",
" <td>71.52</td>\n",
" <td>136.49</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>alice</td>\n",
" <td>69.40</td>\n",
" <td>153.03</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>ravi</td>\n",
" <td>68.22</td>\n",
" <td>142.34</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>joe</td>\n",
" <td>67.79</td>\n",
" <td>144.30</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>alex</td>\n",
" <td>65.78</td>\n",
" <td>112.99</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>ajay</td>\n",
" <td>71.52</td>\n",
" <td>136.49</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>alice</td>\n",
" <td>69.40</td>\n",
" <td>153.03</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>ravi</td>\n",
" <td>68.22</td>\n",
" <td>142.34</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>joe</td>\n",
" <td>67.79</td>\n",
" <td>144.30</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>alex</td>\n",
" <td>65.78</td>\n",
" <td>112.99</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>ajay</td>\n",
" <td>71.52</td>\n",
" <td>136.49</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>alice</td>\n",
" <td>69.40</td>\n",
" <td>153.03</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td>ravi</td>\n",
" <td>68.22</td>\n",
" <td>142.34</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>joe</td>\n",
" <td>67.79</td>\n",
" <td>144.30</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15</th>\n",
" <td>jack</td>\n",
" <td>68.70</td>\n",
" <td>123.30</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16</th>\n",
" <td>tom</td>\n",
" <td>69.80</td>\n",
" <td>141.49</td>\n",
" </tr>\n",
" <tr>\n",
" <th>17</th>\n",
" <td>tracy</td>\n",
" <td>70.01</td>\n",
" <td>136.46</td>\n",
" </tr>\n",
" <tr>\n",
" <th>18</th>\n",
" <td>john</td>\n",
" <td>67.90</td>\n",
" <td>112.37</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19</th>\n",
" <td>jack</td>\n",
" <td>68.70</td>\n",
" <td>123.30</td>\n",
" </tr>\n",
" <tr>\n",
" <th>20</th>\n",
" <td>tom</td>\n",
" <td>69.80</td>\n",
" <td>141.49</td>\n",
" </tr>\n",
" <tr>\n",
" <th>21</th>\n",
" <td>tracy</td>\n",
" <td>70.01</td>\n",
" <td>136.46</td>\n",
" </tr>\n",
" <tr>\n",
" <th>22</th>\n",
" <td>john</td>\n",
" <td>67.90</td>\n",
" <td>112.37</td>\n",
" </tr>\n",
" <tr>\n",
" <th>23</th>\n",
" <td>jack</td>\n",
" <td>68.70</td>\n",
" <td>123.30</td>\n",
" </tr>\n",
" <tr>\n",
" <th>24</th>\n",
" <td>tom</td>\n",
" <td>69.80</td>\n",
" <td>141.49</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25</th>\n",
" <td>tracy</td>\n",
" <td>70.01</td>\n",
" <td>136.46</td>\n",
" </tr>\n",
" <tr>\n",
" <th>26</th>\n",
" <td>john</td>\n",
" <td>67.90</td>\n",
" <td>112.37</td>\n",
" </tr>\n",
" <tr>\n",
" <th>27</th>\n",
" <td>simon</td>\n",
" <td>67.90</td>\n",
" <td>112.37</td>\n",
" </tr>\n",
" <tr>\n",
" <th>28</th>\n",
" <td>jacob</td>\n",
" <td>66.78</td>\n",
" <td>120.67</td>\n",
" </tr>\n",
" <tr>\n",
" <th>29</th>\n",
" <td>cindy</td>\n",
" <td>66.49</td>\n",
" <td>127.45</td>\n",
" </tr>\n",
" <tr>\n",
" <th>30</th>\n",
" <td>ivan</td>\n",
" <td>67.62</td>\n",
" <td>114.14</td>\n",
" </tr>\n",
" <tr>\n",
" <th>31</th>\n",
" <td>simon</td>\n",
" <td>67.90</td>\n",
" <td>112.37</td>\n",
" </tr>\n",
" <tr>\n",
" <th>32</th>\n",
" <td>jacob</td>\n",
" <td>66.78</td>\n",
" <td>120.67</td>\n",
" </tr>\n",
" <tr>\n",
" <th>33</th>\n",
" <td>cindy</td>\n",
" <td>66.49</td>\n",
" <td>127.45</td>\n",
" </tr>\n",
" <tr>\n",
" <th>34</th>\n",
" <td>ivan</td>\n",
" <td>67.62</td>\n",
" <td>114.14</td>\n",
" </tr>\n",
" <tr>\n",
" <th>35</th>\n",
" <td>simon</td>\n",
" <td>67.90</td>\n",
" <td>112.37</td>\n",
" </tr>\n",
" <tr>\n",
" <th>36</th>\n",
" <td>jacob</td>\n",
" <td>66.78</td>\n",
" <td>120.67</td>\n",
" </tr>\n",
" <tr>\n",
" <th>37</th>\n",
" <td>cindy</td>\n",
" <td>66.49</td>\n",
" <td>127.45</td>\n",
" </tr>\n",
" <tr>\n",
" <th>38</th>\n",
" <td>ivan</td>\n",
" <td>67.62</td>\n",
" <td>114.14</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" name height weight\n",
"0 alex 65.78 112.99\n",
"1 ajay 71.52 136.49\n",
"2 alice 69.40 153.03\n",
"3 ravi 68.22 142.34\n",
"4 joe 67.79 144.30\n",
"5 alex 65.78 112.99\n",
"6 ajay 71.52 136.49\n",
"7 alice 69.40 153.03\n",
"8 ravi 68.22 142.34\n",
"9 joe 67.79 144.30\n",
"10 alex 65.78 112.99\n",
"11 ajay 71.52 136.49\n",
"12 alice 69.40 153.03\n",
"13 ravi 68.22 142.34\n",
"14 joe 67.79 144.30\n",
"15 jack 68.70 123.30\n",
"16 tom 69.80 141.49\n",
"17 tracy 70.01 136.46\n",
"18 john 67.90 112.37\n",
"19 jack 68.70 123.30\n",
"20 tom 69.80 141.49\n",
"21 tracy 70.01 136.46\n",
"22 john 67.90 112.37\n",
"23 jack 68.70 123.30\n",
"24 tom 69.80 141.49\n",
"25 tracy 70.01 136.46\n",
"26 john 67.90 112.37\n",
"27 simon 67.90 112.37\n",
"28 jacob 66.78 120.67\n",
"29 cindy 66.49 127.45\n",
"30 ivan 67.62 114.14\n",
"31 simon 67.90 112.37\n",
"32 jacob 66.78 120.67\n",
"33 cindy 66.49 127.45\n",
"34 ivan 67.62 114.14\n",
"35 simon 67.90 112.37\n",
"36 jacob 66.78 120.67\n",
"37 cindy 66.49 127.45\n",
"38 ivan 67.62 114.14"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"log(\"Extract phase Started\")\n",
"extracted_data = extract()\n",
"log(\"Extract phase Ended\")\n",
"extracted_data"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>name</th>\n",
" <th>height</th>\n",
" <th>weight</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>alex</td>\n",
" <td>1.67</td>\n",
" <td>51.25</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>ajay</td>\n",
" <td>1.82</td>\n",
" <td>61.91</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>alice</td>\n",
" <td>1.76</td>\n",
" <td>69.41</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>ravi</td>\n",
" <td>1.73</td>\n",
" <td>64.56</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>joe</td>\n",
" <td>1.72</td>\n",
" <td>65.45</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>alex</td>\n",
" <td>1.67</td>\n",
" <td>51.25</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>ajay</td>\n",
" <td>1.82</td>\n",
" <td>61.91</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>alice</td>\n",
" <td>1.76</td>\n",
" <td>69.41</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>ravi</td>\n",
" <td>1.73</td>\n",
" <td>64.56</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>joe</td>\n",
" <td>1.72</td>\n",
" <td>65.45</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>alex</td>\n",
" <td>1.67</td>\n",
" <td>51.25</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>ajay</td>\n",
" <td>1.82</td>\n",
" <td>61.91</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>alice</td>\n",
" <td>1.76</td>\n",
" <td>69.41</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td>ravi</td>\n",
" <td>1.73</td>\n",
" <td>64.56</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>joe</td>\n",
" <td>1.72</td>\n",
" <td>65.45</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15</th>\n",
" <td>jack</td>\n",
" <td>1.74</td>\n",
" <td>55.93</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16</th>\n",
" <td>tom</td>\n",
" <td>1.77</td>\n",
" <td>64.18</td>\n",
" </tr>\n",
" <tr>\n",
" <th>17</th>\n",
" <td>tracy</td>\n",
" <td>1.78</td>\n",
" <td>61.90</td>\n",
" </tr>\n",
" <tr>\n",
" <th>18</th>\n",
" <td>john</td>\n",
" <td>1.72</td>\n",
" <td>50.97</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19</th>\n",
" <td>jack</td>\n",
" <td>1.74</td>\n",
" <td>55.93</td>\n",
" </tr>\n",
" <tr>\n",
" <th>20</th>\n",
" <td>tom</td>\n",
" <td>1.77</td>\n",
" <td>64.18</td>\n",
" </tr>\n",
" <tr>\n",
" <th>21</th>\n",
" <td>tracy</td>\n",
" <td>1.78</td>\n",
" <td>61.90</td>\n",
" </tr>\n",
" <tr>\n",
" <th>22</th>\n",
" <td>john</td>\n",
" <td>1.72</td>\n",
" <td>50.97</td>\n",
" </tr>\n",
" <tr>\n",
" <th>23</th>\n",
" <td>jack</td>\n",
" <td>1.74</td>\n",
" <td>55.93</td>\n",
" </tr>\n",
" <tr>\n",
" <th>24</th>\n",
" <td>tom</td>\n",
" <td>1.77</td>\n",
" <td>64.18</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25</th>\n",
" <td>tracy</td>\n",
" <td>1.78</td>\n",
" <td>61.90</td>\n",
" </tr>\n",
" <tr>\n",
" <th>26</th>\n",
" <td>john</td>\n",
" <td>1.72</td>\n",
" <td>50.97</td>\n",
" </tr>\n",
" <tr>\n",
" <th>27</th>\n",
" <td>simon</td>\n",
" <td>1.72</td>\n",
" <td>50.97</td>\n",
" </tr>\n",
" <tr>\n",
" <th>28</th>\n",
" <td>jacob</td>\n",
" <td>1.70</td>\n",
" <td>54.73</td>\n",
" </tr>\n",
" <tr>\n",
" <th>29</th>\n",
" <td>cindy</td>\n",
" <td>1.69</td>\n",
" <td>57.81</td>\n",
" </tr>\n",
" <tr>\n",
" <th>30</th>\n",
" <td>ivan</td>\n",
" <td>1.72</td>\n",
" <td>51.77</td>\n",
" </tr>\n",
" <tr>\n",
" <th>31</th>\n",
" <td>simon</td>\n",
" <td>1.72</td>\n",
" <td>50.97</td>\n",
" </tr>\n",
" <tr>\n",
" <th>32</th>\n",
" <td>jacob</td>\n",
" <td>1.70</td>\n",
" <td>54.73</td>\n",
" </tr>\n",
" <tr>\n",
" <th>33</th>\n",
" <td>cindy</td>\n",
" <td>1.69</td>\n",
" <td>57.81</td>\n",
" </tr>\n",
" <tr>\n",
" <th>34</th>\n",
" <td>ivan</td>\n",
" <td>1.72</td>\n",
" <td>51.77</td>\n",
" </tr>\n",
" <tr>\n",
" <th>35</th>\n",
" <td>simon</td>\n",
" <td>1.72</td>\n",
" <td>50.97</td>\n",
" </tr>\n",
" <tr>\n",
" <th>36</th>\n",
" <td>jacob</td>\n",
" <td>1.70</td>\n",
" <td>54.73</td>\n",
" </tr>\n",
" <tr>\n",
" <th>37</th>\n",
" <td>cindy</td>\n",
" <td>1.69</td>\n",
" <td>57.81</td>\n",
" </tr>\n",
" <tr>\n",
" <th>38</th>\n",
" <td>ivan</td>\n",
" <td>1.72</td>\n",
" <td>51.77</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" name height weight\n",
"0 alex 1.67 51.25\n",
"1 ajay 1.82 61.91\n",
"2 alice 1.76 69.41\n",
"3 ravi 1.73 64.56\n",
"4 joe 1.72 65.45\n",
"5 alex 1.67 51.25\n",
"6 ajay 1.82 61.91\n",
"7 alice 1.76 69.41\n",
"8 ravi 1.73 64.56\n",
"9 joe 1.72 65.45\n",
"10 alex 1.67 51.25\n",
"11 ajay 1.82 61.91\n",
"12 alice 1.76 69.41\n",
"13 ravi 1.73 64.56\n",
"14 joe 1.72 65.45\n",
"15 jack 1.74 55.93\n",
"16 tom 1.77 64.18\n",
"17 tracy 1.78 61.90\n",
"18 john 1.72 50.97\n",
"19 jack 1.74 55.93\n",
"20 tom 1.77 64.18\n",
"21 tracy 1.78 61.90\n",
"22 john 1.72 50.97\n",
"23 jack 1.74 55.93\n",
"24 tom 1.77 64.18\n",
"25 tracy 1.78 61.90\n",
"26 john 1.72 50.97\n",
"27 simon 1.72 50.97\n",
"28 jacob 1.70 54.73\n",
"29 cindy 1.69 57.81\n",
"30 ivan 1.72 51.77\n",
"31 simon 1.72 50.97\n",
"32 jacob 1.70 54.73\n",
"33 cindy 1.69 57.81\n",
"34 ivan 1.72 51.77\n",
"35 simon 1.72 50.97\n",
"36 jacob 1.70 54.73\n",
"37 cindy 1.69 57.81\n",
"38 ivan 1.72 51.77"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"log(\"Transform phase Started\")\n",
"transformed_data = transform(extracted_data)\n",
"log(\"Transform phase Ended\")\n",
"transformed_data"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"log(\"Load phase Started\")\n",
"load(targetfile,transformed_data)\n",
"log(\"Load phase Ended\")"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"log(\"ETL Job Ended\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Exercise\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using the example above complete the exercise below.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Download Files\n"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"--2022-01-05 02:53:16-- https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0221EN-SkillsNetwork/labs/module%206/Lab%20-%20Extract%20Transform%20Load/data/datasource.zip\n",
"Resolving cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)... 169.63.118.104\n",
"Connecting to cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)|169.63.118.104|:443... connected.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: 4249 (4.1K) [application/zip]\n",
"Saving to: ‘datasource.zip’\n",
"\n",
"datasource.zip 100%[===================>] 4.15K --.-KB/s in 0s \n",
"\n",
"2022-01-05 02:53:16 (82.7 MB/s) - ‘datasource.zip’ saved [4249/4249]\n",
"\n"
]
}
],
"source": [
"!wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0221EN-SkillsNetwork/labs/module%206/Lab%20-%20Extract%20Transform%20Load/data/datasource.zip"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Unzip Files\n"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Archive: datasource.zip\n",
" inflating: dealership_data/used_car_prices1.csv \n",
" inflating: dealership_data/used_car_prices2.csv \n",
" inflating: dealership_data/used_car_prices3.csv \n",
" inflating: dealership_data/used_car_prices1.json \n",
" inflating: dealership_data/used_car_prices2.json \n",
" inflating: dealership_data/used_car_prices3.json \n",
" inflating: dealership_data/used_car_prices1.xml \n",
" inflating: dealership_data/used_car_prices2.xml \n",
" inflating: dealership_data/used_car_prices3.xml \n"
]
}
],
"source": [
"!unzip datasource.zip -d dealership_data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## About the Data\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The file `dealership_data` contains CSV, JSON, and XML files for used car data which contain features named `car_model`, `year_of_manufacture`, `price`, and `fuel`.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Set Paths\n"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"tmpfile = \"dealership_temp.tmp\" # file used to store all extracted data\n",
"logfile = \"dealership_logfile.txt\" # all event logs will be stored in this file\n",
"targetfile = \"dealership_transformed_data.csv\" # file where transformed data is stored"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Extract\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Question 1: CSV Extract Function\n"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"# Add the CSV extract function below\n",
"def extract_from_csv(file_to_process):\n",
" dataframe = pd.read_csv(file_to_process)\n",
" return dataframe"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<details><summary>Click here for the solution</summary>\n",
"\n",
"```\n",
" \n",
"def extract_from_csv(file_to_process):\n",
" dataframe = pd.read_csv(file_to_process)\n",
" return dataframe\n",
"```\n",
"\n",
"</details>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Question 2: JSON Extract Function\n"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"# Add the JSON extract function below\n",
"def extract_from_json(file_to_process):\n",
" dataframe = pd.read_json(file_to_process, lines=True)\n",
" return dataframe"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<details><summary>Click here for the solution</summary>\n",
"\n",
"```\n",
" \n",
"def extract_from_json(file_to_process):\n",
" dataframe = pd.read_json(file_to_process,lines=True)\n",
" return dataframe\n",
"```\n",
"\n",
"</details>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Question 3: XML Extract Function\n"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"# Add the XML extract function below, it is the same as the xml extract function above but the column names need to be renamed.\n",
"def extract_from_xml(file_to_process):\n",
" dataframe = pd.DataFrame(columns=[\"car_model\", \"year_of_manufacture\", \"price\", \"fuel\"])\n",
" tree = ET.parse(file_to_process)\n",
" root = tree.getroot()\n",
" for car in root:\n",
" car_model = car.find(\"car_model\").text\n",
" year_of_manufacture = int(car.find(\"year_of_manufacture\").text)\n",
" price = float(car.find(\"price\").text)\n",
" fuel = car.find(\"fuel\").text\n",
" dataframe = dataframe.append({\"car_model\":car_model, \"year_of_manufacture\":year_of_manufacture, \"price\":price, \"fuel\":fuel}, ignore_index = True)\n",
" return dataframe"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<details><summary>Click here for the solution</summary>\n",
"\n",
"```\n",
" \n",
"def extract_from_xml(file_to_process):\n",
" dataframe = pd.DataFrame(columns=['car_model','year_of_manufacture','price', 'fuel'])\n",
" tree = ET.parse(file_to_process)\n",
" root = tree.getroot()\n",
" for person in root:\n",
" car_model = person.find(\"car_model\").text\n",
" year_of_manufacture = int(person.find(\"year_of_manufacture\").text)\n",
" price = float(person.find(\"price\").text)\n",
" fuel = person.find(\"fuel\").text\n",
" dataframe = dataframe.append({\"car_model\":car_model, \"year_of_manufacture\":year_of_manufacture, \"price\":price, \"fuel\":fuel}, ignore_index=True)\n",
" return dataframe\n",
"```\n",
"\n",
"</details>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Question 4: Extract Function\n",
"\n",
"Call the specific extract functions you created above by replacing the `ADD_FUNCTION_CALL` with the proper function call.\n"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
"def extract():\n",
" extracted_data = pd.DataFrame(columns=['car_model','year_of_manufacture','price', 'fuel']) # create an empty data frame to hold extracted data\n",
"\n",
" #process all csv files\n",
" for csvfile in glob.glob(\"dealership_data/*.csv\"):\n",
" extracted_data = extracted_data.append(extract_from_csv(csvfile), ignore_index=True)\n",
"\n",
" #process all json files\n",
" for jsonfile in glob.glob(\"dealership_data/*.json\"):\n",
" extracted_data = extracted_data.append(extract_from_json(jsonfile), ignore_index=True)\n",
"\n",
" #process all xml files\n",
" for xmlfile in glob.glob(\"dealership_data/*.xml\"):\n",
" extracted_data = extracted_data.append(extract_from_xml(xmlfile), ignore_index=True)\n",
"\n",
" return extracted_data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<details><summary>Click here for the solution</summary>\n",
"\n",
"```\n",
" \n",
"def extract():\n",
" extracted_data = pd.DataFrame(columns=['car_model','year_of_manufacture','price', 'fuel']) # create an empty data frame to hold extracted data\n",
" \n",
" #process all csv files\n",
" for csvfile in glob.glob(\"dealership_data/*.csv\"):\n",
" extracted_data = extracted_data.append(extract_from_csv(csvfile), ignore_index=True)\n",
" \n",
" #process all json files\n",
" for jsonfile in glob.glob(\"dealership_data/*.json\"):\n",
" extracted_data = extracted_data.append(extract_from_json(jsonfile), ignore_index=True)\n",
" \n",
" #process all xml files\n",
" for xmlfile in glob.glob(\"dealership_data/*.xml\"):\n",
" extracted_data = extracted_data.append(extract_from_xml(xmlfile), ignore_index=True)\n",
" \n",
" return extracted_data\n",
"```\n",
"\n",
"</details>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Transform\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Question 5: Transform\n",
"\n",
"Round the `price` columns to 2 decimal places\n"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [],
"source": [
"# Add the transform function below\n",
"def transform(data):\n",
" # do conversions\n",
" data['price'] = round(data.price, 2)\n",
" return data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<details><summary>Click here for the solution</summary>\n",
"\n",
"```\n",
"\n",
"def transform(data):\n",
" data['price'] = round(data.price, 2)\n",
" return data\n",
"```\n",
"\n",
"</details>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Loading\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Question 6: Load\n"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [],
"source": [
"# Add the load function below\n",
"def load(targetfile, data_to_load):\n",
" data_to_load.to_csv(targetfile)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<details><summary>Click here for the solution</summary>\n",
"\n",
"```\n",
"\n",
"def load(targetfile,data_to_load):\n",
" data_to_load.to_csv(targetfile) \n",
"```\n",
"\n",
"</details>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Logging\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Question 7: Log\n",
"\n",
"Make sure to change the name of the logfile to the one specified in the set paths section. Change the timestamp order to Hour-Minute-Second-Monthname-Day-Year.\n"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [],
"source": [
"# Add the log function below\n",
"def log(message):\n",
" timestamp_format = '%H:%M:%S-%h-%d-%Y' # HH:MM:SS-Monthname-day-year\n",
" now = datetime.now()\n",
" timestamp = now.strftime(timestamp_format) # restring and format for human readability\n",
" with open(\"dealership_logfile.txt\", \"a\") as f:\n",
" f.write(timestamp + \",\" +message + \"\\n\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<details><summary>Click here for the solution</summary>\n",
"\n",
"```\n",
"\n",
"def log(message):\n",
" timestamp_format = '%H:%M:%S-%h-%d-%Y' #Hour-Minute-Second-MonthName-Day-Year\n",
" now = datetime.now() # get current timestamp\n",
" timestamp = now.strftime(timestamp_format)\n",
" with open(\"dealership_logfile.txt\",\"a\") as f:\n",
" f.write(timestamp + ',' + message + '\\n') \n",
"```\n",
"\n",
"</details>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Running ETL Process\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Question 8: ETL Process\n",
"\n",
"Run all functions to extract, transform, and load the data. Make sure to log all events using the `log` function. Place your code under each comment.\n"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [],
"source": [
"# Log that you have started the ETL process\n",
"log(\"ETL begun\")\n",
"\n",
"# Log that you have started the Extract step\n",
"log(\"Extracting\")\n",
"# Call the Extract function\n",
"extracted_data = extract()\n",
"# Log that you have completed the Extract step\n",
"log(\"Extract complete\")\n",
"\n",
"# Log that you have started the Transform step\n",
"log(\"Transforming\")\n",
"# Call the Transform function\n",
"transformed_data = transform(extracted_data)\n",
"# Log that you have completed the Transform step\n",
"log(\"Transform complete\")\n",
"\n",
"# Log that you have started the Load step\n",
"log(\"Loading data\")\n",
"# Call the Load function\n",
"load(targetfile, transformed_data)\n",
"# Log that you have completed the Load step\n",
"log(\"Load complete\")\n",
"\n",
"# Log that you have completed the ETL process\n",
"log(\"ETL complete\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<details><summary>Click here for the solution</summary>\n",
"\n",
"```\n",
"\n",
"log(\"ETL Job Started\")\n",
"\n",
"log(\"Extract phase Started\")\n",
"extracted_data = extract()\n",
"log(\"Extract phase Ended\")\n",
"\n",
"log(\"Transform phase Started\")\n",
"transformed_data = transform(extracted_data)\n",
"log(\"Transform phase Ended\")\n",
"\n",
"log(\"Load phase Started\")\n",
"load(targetfile,transformed_data)\n",
"log(\"Load phase Ended\")\n",
"\n",
"log(\"ETL Job Ended\")\n",
"```\n",
"\n",
"</details>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Authors\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Ramesh Sannareddy\n",
"\n",
"Joseph Santarcangelo\n",
"\n",
"Azim Hirjani\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Change Log\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"| Date (YYYY-MM-DD) | Version | Changed By | Change Description |\n",
"| ----------------- | ------- | ----------------- | ---------------------------------- |\n",
"| 2020-11-25 | 0.1 | Ramesh Sannareddy | Created initial version of the lab |\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Copyright © 2020 IBM Corporation. This notebook and its source code are released under the terms of the [MIT License](https://cognitiveclass.ai/mit-license?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkPY0221ENSkillsNetwork23455645-2021-01-01&cm_mmc=Email_Newsletter-\\_-Developer_Ed%2BTech-\\_-WW_WW-\\_-SkillsNetwork-Courses-IBMDeveloperSkillsNetwork-PY0221EN-SkillsNetwork-23455645&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ).\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python",
"language": "python",
"name": "conda-env-python-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.12"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment