Skip to content

Instantly share code, notes, and snippets.

@javiicc
Created December 18, 2021 11:07
Show Gist options
  • Save javiicc/7af2e0c789401c3d3800915c84f6d403 to your computer and use it in GitHub Desktop.
Save javiicc/7af2e0c789401c3d3800915c84f6d403 to your computer and use it in GitHub Desktop.
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": "<center>\n <img src=\"https://gitlab.com/ibm/skills-network/courses/placeholder101/-/raw/master/labs/module%201/images/IDSNlogo.png\" width=\"300\" alt=\"cognitiveclass.ai logo\" />\n</center>\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "# Peer Review Assignment - Data Engineer - ETL\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Estimated time needed: **20** minutes\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "## Objectives\n\nIn this final part you will:\n\n* Run the ETL process\n* Extract bank and market cap data from the JSON file `bank_market_cap.json`\n* Transform the market cap currency using the exchange rate data\n* Load the transformed data into a seperate CSV\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "For this lab, we are going to be using Python and several Python libraries. Some of these libraries might be installed in your lab environment or in SN Labs. Others may need to be installed by you. The cells below will install these libraries when executed.\n"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "#!pip install glob\n#!pip install pandas\n#!pip install requests\n#!pip install datetime"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "## Imports\n\nImport any additional libraries you may need here.\n"
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": "import glob\nimport pandas as pd\nfrom datetime import datetime"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "As the exchange rate fluctuates, we will download the same dataset to make marking simpler. This will be in the same format as the dataset you used in the last section\n"
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": "--2021-12-18 10:25:37-- https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0221EN-SkillsNetwork/labs/module%206/Lab%20-%20Extract%20Transform%20Load/data/bank_market_cap_1.json\nResolving cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)... 169.63.118.104\nConnecting to cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)|169.63.118.104|:443... connected.\nHTTP request sent, awaiting response... 200 OK\nLength: 2815 (2.7K) [application/json]\nSaving to: \u2018bank_market_cap_1.json\u2019\n\nbank_market_cap_1.j 100%[===================>] 2.75K --.-KB/s in 0s \n\n2021-12-18 10:25:37 (111 MB/s) - \u2018bank_market_cap_1.json\u2019 saved [2815/2815]\n\n--2021-12-18 10:25:38-- https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0221EN-SkillsNetwork/labs/module%206/Lab%20-%20Extract%20Transform%20Load/data/bank_market_cap_2.json\nResolving cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)... 169.63.118.104\nConnecting to cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)|169.63.118.104|:443... connected.\nHTTP request sent, awaiting response... 200 OK\nLength: 1429 (1.4K) [application/json]\nSaving to: \u2018bank_market_cap_2.json\u2019\n\nbank_market_cap_2.j 100%[===================>] 1.40K --.-KB/s in 0s \n\n2021-12-18 10:25:38 (49.6 MB/s) - \u2018bank_market_cap_2.json\u2019 saved [1429/1429]\n\n--2021-12-18 10:25:39-- https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0221EN-SkillsNetwork/labs/module%206/Final%20Assignment/exchange_rates.csv\nResolving cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)... 169.63.118.104\nConnecting to cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)|169.63.118.104|:443... connected.\nHTTP request sent, awaiting response... 200 OK\nLength: 590 [text/csv]\nSaving to: \u2018exchange_rates.csv\u2019\n\nexchange_rates.csv 100%[===================>] 590 --.-KB/s in 0s \n\n2021-12-18 10:25:40 (20.4 MB/s) - \u2018exchange_rates.csv\u2019 saved [590/590]\n\n"
}
],
"source": "!wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0221EN-SkillsNetwork/labs/module%206/Lab%20-%20Extract%20Transform%20Load/data/bank_market_cap_1.json\n!wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0221EN-SkillsNetwork/labs/module%206/Lab%20-%20Extract%20Transform%20Load/data/bank_market_cap_2.json\n!wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0221EN-SkillsNetwork/labs/module%206/Final%20Assignment/exchange_rates.csv"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "## Extract\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "### JSON Extract Function\n\nThis function will extract JSON files.\n"
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": "def extract_from_json(file_to_process):\n dataframe = pd.read_json(file_to_process)\n return dataframe"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "## Extract Function\n\nDefine the extract function that finds JSON file `bank_market_cap_1.json` and calls the function created above to extract data from them. Store the data in a `pandas` dataframe. Use the following list for the columns.\n"
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": "columns=['Name','Market Cap (US$ Billion)']"
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": "def extract():\n # Write your code here\n extracted_data = pd.DataFrame(columns=['Name','Market Cap (US$ Billion)'])\n \n # Process json file\n for jsonfile in glob.glob(\"*.json\"):\n extracted_data = extracted_data.append(extract_from_json(jsonfile), ignore_index=True)\n \n return extracted_data"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "<b>Question 1</b> Load the file <code>exchange_rates.csv</code> as a dataframe and find the exchange rate for British pounds with the symbol <code>GBP</code>, store it in the variable <code>exchange_rate</code>, you will be asked for the number. Hint: set the parameter <code>index_col</code> to 0.\n"
},
{
"cell_type": "raw",
"metadata": {},
"source": "def log(message):\n timestamp_format = '%Y-%h-%d-%H:%M:%S' # Year-Monthname-Day-Hour-Minute-Second\n now = datetime.now() # get current timestamp\n timestamp = now.strftime(timestamp_format)\n with open(\"logfile.txt\",\"a\") as f:\n f.write(timestamp + ',' + message + '\\n')"
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>Name</th>\n <th>Market Cap (US$ Billion)</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>Banco Bilbao Vizcaya Argentaria</td>\n <td>60.678</td>\n </tr>\n <tr>\n <th>1</th>\n <td>Japan Post Bank</td>\n <td>60.525</td>\n </tr>\n <tr>\n <th>2</th>\n <td>The Bank of New York Mellon</td>\n <td>59.823</td>\n </tr>\n <tr>\n <th>3</th>\n <td>Shanghai Pudong Development Bank</td>\n <td>58.796</td>\n </tr>\n <tr>\n <th>4</th>\n <td>Industrial Bank (China)</td>\n <td>57.388</td>\n </tr>\n <tr>\n <th>...</th>\n <td>...</td>\n <td>...</td>\n </tr>\n <tr>\n <th>100</th>\n <td>Ping An Bank</td>\n <td>37.993</td>\n </tr>\n <tr>\n <th>101</th>\n <td>Standard Chartered</td>\n <td>37.319</td>\n </tr>\n <tr>\n <th>102</th>\n <td>United Overseas Bank</td>\n <td>35.128</td>\n </tr>\n <tr>\n <th>103</th>\n <td>QNB Group</td>\n <td>33.560</td>\n </tr>\n <tr>\n <th>104</th>\n <td>Bank Rakyat</td>\n <td>33.081</td>\n </tr>\n </tbody>\n</table>\n<p>105 rows \u00d7 2 columns</p>\n</div>",
"text/plain": " Name Market Cap (US$ Billion)\n0 Banco Bilbao Vizcaya Argentaria 60.678\n1 Japan Post Bank 60.525\n2 The Bank of New York Mellon 59.823\n3 Shanghai Pudong Development Bank 58.796\n4 Industrial Bank (China) 57.388\n.. ... ...\n100 Ping An Bank 37.993\n101 Standard Chartered 37.319\n102 United Overseas Bank 35.128\n103 QNB Group 33.560\n104 Bank Rakyat 33.081\n\n[105 rows x 2 columns]"
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": "# Write your code here\n#log(\"Extract phase Started\")\nextracted_data = extract()\n#log(\"Extract phase Ended\")\nextracted_data"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "## Transform\n\nUsing <code>exchange_rate</code> and the `exchange_rates.csv` file find the exchange rate of USD to GBP. Write a transform function that\n\n1. Changes the `Market Cap (US$ Billion)` column from USD to GBP\n2. Rounds the Market Cap (US$ Billion)\\` column to 3 decimal places\n3. Rename `Market Cap (US$ Billion)` to `Market Cap (GBP$ Billion)`\n"
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>Name</th>\n <th>Market Cap (GBP$ Billion)</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>Banco Bilbao Vizcaya Argentaria</td>\n <td>45.508</td>\n </tr>\n <tr>\n <th>1</th>\n <td>Japan Post Bank</td>\n <td>45.394</td>\n </tr>\n <tr>\n <th>2</th>\n <td>The Bank of New York Mellon</td>\n <td>44.867</td>\n </tr>\n <tr>\n <th>3</th>\n <td>Shanghai Pudong Development Bank</td>\n <td>44.097</td>\n </tr>\n <tr>\n <th>4</th>\n <td>Industrial Bank (China)</td>\n <td>43.041</td>\n </tr>\n <tr>\n <th>...</th>\n <td>...</td>\n <td>...</td>\n </tr>\n <tr>\n <th>100</th>\n <td>Ping An Bank</td>\n <td>28.495</td>\n </tr>\n <tr>\n <th>101</th>\n <td>Standard Chartered</td>\n <td>27.989</td>\n </tr>\n <tr>\n <th>102</th>\n <td>United Overseas Bank</td>\n <td>26.346</td>\n </tr>\n <tr>\n <th>103</th>\n <td>QNB Group</td>\n <td>25.170</td>\n </tr>\n <tr>\n <th>104</th>\n <td>Bank Rakyat</td>\n <td>24.811</td>\n </tr>\n </tbody>\n</table>\n<p>105 rows \u00d7 2 columns</p>\n</div>",
"text/plain": " Name Market Cap (GBP$ Billion)\n0 Banco Bilbao Vizcaya Argentaria 45.508\n1 Japan Post Bank 45.394\n2 The Bank of New York Mellon 44.867\n3 Shanghai Pudong Development Bank 44.097\n4 Industrial Bank (China) 43.041\n.. ... ...\n100 Ping An Bank 28.495\n101 Standard Chartered 27.989\n102 United Overseas Bank 26.346\n103 QNB Group 25.170\n104 Bank Rakyat 24.811\n\n[105 rows x 2 columns]"
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": "def transform(data):\n # Write your code here\n data['Market Cap (US$ Billion)'] = round(0.75 * data['Market Cap (US$ Billion)'], 3)\n data.rename(columns={'Market Cap (US$ Billion)': 'Market Cap (GBP$ Billion)'}, inplace=True)\n return data\n\ntransformed_data = transform(extracted_data)\ntransformed_data"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "## Load\n\nCreate a function that takes a dataframe and load it to a csv named `bank_market_cap_gbp.csv`. Make sure to set `index` to `False`.\n"
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [],
"source": "def load(target_file, data_to_load):\n # Write your code here\n data_to_load.to_csv(target_file, index=False) "
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": ""
},
{
"cell_type": "markdown",
"metadata": {},
"source": "## Logging Function\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Write the logging function <code>log</code> to log your data:\n"
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [],
"source": "def log(message):\n # Write your code here\n timestamp_format = '%Y-%h-%d-%H:%M:%S' # Year-Monthname-Day-Hour-Minute-Second\n now = datetime.now() # get current timestamp\n timestamp = now.strftime(timestamp_format)\n with open(\"logfile.txt\",\"a\") as f:\n f.write(timestamp + ',' + message + '\\n')"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "## Running the ETL Process\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Log the process accordingly using the following <code>\"ETL Job Started\"</code> and <code>\"Extract phase Started\"</code>\n"
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [],
"source": "# Write your code here\nlog(\"ETL Job Started\")\nlog(\"Extract phase Started\")"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "### Extract\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "<code>Question 2</code> Use the function <code>extract</code>, and print the first 5 rows, take a screen shot:\n"
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>Name</th>\n <th>Market Cap (US$ Billion)</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>Banco Bilbao Vizcaya Argentaria</td>\n <td>60.678</td>\n </tr>\n <tr>\n <th>1</th>\n <td>Japan Post Bank</td>\n <td>60.525</td>\n </tr>\n <tr>\n <th>2</th>\n <td>The Bank of New York Mellon</td>\n <td>59.823</td>\n </tr>\n <tr>\n <th>3</th>\n <td>Shanghai Pudong Development Bank</td>\n <td>58.796</td>\n </tr>\n <tr>\n <th>4</th>\n <td>Industrial Bank (China)</td>\n <td>57.388</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " Name Market Cap (US$ Billion)\n0 Banco Bilbao Vizcaya Argentaria 60.678\n1 Japan Post Bank 60.525\n2 The Bank of New York Mellon 59.823\n3 Shanghai Pudong Development Bank 58.796\n4 Industrial Bank (China) 57.388"
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": "# Call the function here\nextracted_data = extract()\n# Print the rows here\nextracted_data.head()"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Log the data as <code>\"Extract phase Ended\"</code>\n"
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [],
"source": "# Write your code here\nlog(\"Extract phase Ended\")"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "### Transform\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Log the following <code>\"Transform phase Started\"</code>\n"
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [],
"source": "# Write your code here\nlog(\"Transform phase Started\")"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "<code>Question 3</code> Use the function <code>transform</code> and print the first 5 rows of the output, take a screen shot:\n"
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>Name</th>\n <th>Market Cap (GBP$ Billion)</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>Banco Bilbao Vizcaya Argentaria</td>\n <td>45.508</td>\n </tr>\n <tr>\n <th>1</th>\n <td>Japan Post Bank</td>\n <td>45.394</td>\n </tr>\n <tr>\n <th>2</th>\n <td>The Bank of New York Mellon</td>\n <td>44.867</td>\n </tr>\n <tr>\n <th>3</th>\n <td>Shanghai Pudong Development Bank</td>\n <td>44.097</td>\n </tr>\n <tr>\n <th>4</th>\n <td>Industrial Bank (China)</td>\n <td>43.041</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " Name Market Cap (GBP$ Billion)\n0 Banco Bilbao Vizcaya Argentaria 45.508\n1 Japan Post Bank 45.394\n2 The Bank of New York Mellon 44.867\n3 Shanghai Pudong Development Bank 44.097\n4 Industrial Bank (China) 43.041"
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": "# Call the function here\ntransformed_data = transform(extracted_data)\n# Print the first 5 rows here\ntransformed_data.head()"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Log your data <code>\"Transform phase Ended\"</code>\n"
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [],
"source": "# Write your code here\nlog(\"Transform phase Ended\")"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "### Load\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Log the following `\"Load phase Started\"`.\n"
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [],
"source": "# Write your code here\nlog(\"Load phase Started\")"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Call the load function\n"
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [],
"source": "# Write your code here\nload('market_cap.csv', transformed_data)"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Log the following `\"Load phase Ended\"`.\n"
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [],
"source": "# Write your code here\nlog(\"Load phase Ended\")"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "## Authors\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Ramesh Sannareddy, Joseph Santrcangelo and Azim Hirjani\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "### Other Contributors\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Rav Ahuja\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "## Change Log\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "| Date (YYYY-MM-DD) | Version | Changed By | Change Description |\n| ----------------- | ------- | ----------------- | ---------------------------------- |\n| 2020-11-25 | 0.1 | Ramesh Sannareddy | Created initial version of the lab |\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Copyright \u00a9 2020 IBM Corporation. This notebook and its source code are released under the terms of the [MIT License](https://cognitiveclass.ai/mit-license?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkPY0221ENSkillsNetwork23455645-2021-01-01&cm_mmc=Email_Newsletter-\\_-Developer_Ed%2BTech-\\_-WW_WW-\\_-SkillsNetwork-Courses-IBM-DA0321EN-SkillsNetwork-21426264&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ).\n"
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.8",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.12"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment