Created
September 8, 2023 01:35
Webscraping_Engineer_Peer_Review_Assignment.ipynb
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "view-in-github", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"<a href=\"https://colab.research.google.com/gist/dsimanoliveira/603e5cba4dcfb721e24cf68454fa499e/webscraping_engineer_peer_review_assignment.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "733501fd-f799-4dbc-8825-0d1942ddeecb" | |
}, | |
"source": [ | |
"<p style=\"text-align:center\">\n", | |
" <a href=\"https://skills.network/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkPY0221ENSkillsNetwork899-2023-01-01\">\n", | |
" <img src=\"https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png\" width=\"200\" alt=\"Skills Network Logo\" />\n", | |
" </a>\n", | |
"</p>\n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "6475f542-e53f-4d38-b94d-667b0f0b9813" | |
}, | |
"source": [ | |
"# Peer Review Assignment - Data Engineer - Webscraping\n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "5634bd5b-2073-4abf-ac5e-60addf1e8059" | |
}, | |
"source": [ | |
"Estimated time needed: **20** minutes\n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "4cbadd10-bc26-404c-9f0f-8e16390af86c" | |
}, | |
"source": [ | |
"## Objectives\n", | |
"\n", | |
"In this part you will:\n", | |
"\n", | |
"- Use webscraping to get bank information\n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "d1f34239-7d19-4df7-aab3-ac6d88827eda" | |
}, | |
"source": [ | |
"## Imports\n", | |
"\n", | |
"Import any additional libraries you may need here.\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "22f312ab-d5ed-4da3-9016-278ebf7518ad" | |
}, | |
"outputs": [], | |
"source": [ | |
"from bs4 import BeautifulSoup\n", | |
"import html5lib\n", | |
"import requests\n", | |
"import pandas as pd" | |
], | |
"execution_count": null | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "764e78fa-dda1-488b-830d-4c42fa046624" | |
}, | |
"source": [ | |
"## Extract Data Using Web Scraping\n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "68802876-ac84-43e3-b24a-8df5261d80f7" | |
},
"source": [
"The Wikipedia page https://web.archive.org/web/20200318083015/https://en.wikipedia.org/wiki/List_of_largest_banks lists the largest banks in the world by several measures. Scrape the data from the 'By market capitalization' table and store it in a JSON file.\n"
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "bd044e14-2b76-487b-a9fd-6bc9520635ad" | |
},
"source": [
"### Webpage Contents\n",
"\n",
"Gather the contents of the webpage in text format using the `requests` library and assign it to the variable <code>html_data</code>.\n"
] | |
}, | |
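{
"cell_type": "code",
"source": [
"# Fetch the archived Wikipedia page and fail early on an HTTP error.\n",
"response = requests.get('https://web.archive.org/web/20200318083015/https://en.wikipedia.org/wiki/List_of_largest_banks', timeout=60)\n",
"response.raise_for_status()\n",
"html_data = response.content  # raw bytes; response.text holds the decoded string\n"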
], | |
"metadata": { | |
"id": "fzL2q_6M7kId" | |
}, | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "3bdf785c-6c25-413f-87f2-d55335c16a46" | |
},
"source": [
"<b>Question 1</b> Print the output of the following line and note it, as it will be a quiz question:\n"
] | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"html_data[760:783]" | |
], | |
"metadata": { | |
"colab": { | |
"base_uri": "https://localhost:8080/" | |
}, | |
"id": "p_Z3eL3B8OaD", | |
"outputId": "f56820ea-8112-491b-ac4a-84fbfa72b908" | |
}, | |
"execution_count": null, | |
"outputs": [ | |
{ | |
"output_type": "execute_result", | |
"data": { | |
"text/plain": [ | |
"b'\" href=\"/_static/css/ba'" | |
] | |
}, | |
"metadata": {}, | |
"execution_count": 112 | |
} | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "183111fc-9af4-4f7f-96ca-9eab98e4bcba" | |
},
"source": [
"### Scraping the Data\n",
"\n",
"<b>Question 2</b> Using the page contents and `BeautifulSoup`, load the data from the `By market capitalization` table into a `pandas` dataframe. The dataframe should have the bank `Name` and `Market Cap (US$ Billion)` as column names. Display the first five rows using `head`.\n"
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "ce4d7391-008e-4458-84c1-d1fc7b9d0375" | |
},
"source": [
"Parse the contents of the webpage using BeautifulSoup.\n"
] | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"soup = BeautifulSoup(html_data, 'html.parser')" | |
], | |
"metadata": { | |
"id": "gsmVGxWs8ZjT" | |
}, | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "eefd343b-f629-4865-add4-14ce9e9aed51" | |
},
"source": [
"Load the data from the `By market capitalization` table into a pandas dataframe. The dataframe should have the bank `Name` and `Market Cap (US$ Billion)` as column names. Using the empty dataframe `data` and the given loop, extract the necessary data from each row and append it to the dataframe.\n"
] | |
}, | |
{
"cell_type": "code",
"source": [
"data = pd.DataFrame(columns=[\"Name\", \"Market Cap (US$ Billion)\"])\n",
"\n",
"# The third <tbody> on the page holds the 'By market capitalization' table.\n",
"for row in soup.find_all('tbody')[2].find_all('tr'):\n",
"    col = row.find_all('td')\n",
"    # Skip rows without <td> cells (such as the header row).\n",
"    if col:\n",
"        new_df_row = {\n",
"            'Name': col[1].find_all('a')[-1].text,\n",
"            'Market Cap (US$ Billion)': float(col[-1].text)\n",
"        }\n",
"        data = pd.concat([data, pd.DataFrame(new_df_row, index=[0])], ignore_index=True)\n",
"\n"
], | |
"metadata": { | |
"id": "rUwQYeUE_tuK" | |
}, | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "fc5391c0-33f5-4229-b4fe-765495e4ac68" | |
}, | |
"source": [ | |
"**Question 3** Display the first five rows using the `head` function.\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"data.head()" | |
], | |
"metadata": { | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 206 | |
}, | |
"id": "w6HMq1RiMBip", | |
"outputId": "544f1804-350f-48f3-9daa-b9438dcec8da" | |
}, | |
"execution_count": null, | |
"outputs": [ | |
{
"output_type": "execute_result",
"data": {
"text/plain": [
"                                      Name  Market Cap (US$ Billion)\n",
"0                           JPMorgan Chase                   390.934\n",
"1  Industrial and Commercial Bank of China                   345.214\n",
"2                          Bank of America                   325.331\n",
"3                              Wells Fargo                   308.013\n",
"4                  China Construction Bank                   257.399"
]
}, | |
"metadata": {}, | |
"execution_count": 103 | |
} | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "7f03eda7-0d24-48a6-b851-e2e81ec8a1ed" | |
},
"source": [
"\n",
"### Loading the Data\n",
"\n",
"Save the `pandas` dataframe created above to a JSON file named `bank_market_cap.json` using the `to_json()` function.\n"
] | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"data.to_json('bank_market_cap.json')" | |
], | |
"metadata": { | |
"id": "OarwAbrDNIoL" | |
}, | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "c46b715f-9cdf-4db1-8fc6-78cb57f14ff0" | |
}, | |
"source": [ | |
"## Authors\n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "e4f97a2e-adf1-41d6-89fe-e4e63c47f721" | |
}, | |
"source": [ | |
"Ramesh Sannareddy, Joseph Santarcangelo and Azim Hirjani\n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "1589609f-21f1-4438-8edb-f51b0781d4a9" | |
}, | |
"source": [ | |
"### Other Contributors\n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "de7e35ec-5271-4a04-a721-6a2102cde9a7" | |
}, | |
"source": [ | |
"Rav Ahuja\n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "3d790194-f2df-49ee-bc2a-110a812e60a2" | |
}, | |
"source": [ | |
"## Change Log\n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "40f0770b-5380-49e7-b6c3-e7c8d9522fd0" | |
}, | |
"source": [ | |
"| Date (YYYY-MM-DD) | Version | Changed By | Change Description |\n", | |
"| ----------------- | ------- | ----------------- | ---------------------------------- |\n", | |
"| 2022-07-12 | 0.2 | Appalabhaktula Hema | Corrected the code and markdown |\n", | |
"| 2020-11-25 | 0.1 | Ramesh Sannareddy | Created initial version of the lab |\n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "6b5e1afb-c515-43d1-b8d3-60cd03658f45" | |
}, | |
"source": [ | |
"Copyright © 2020 IBM Corporation.\n" | |
] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 3", | |
"name": "python3" | |
}, | |
"language_info": { | |
"name": "python" | |
}, | |
"colab": { | |
"provenance": [], | |
"include_colab_link": true | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 0 | |
} |
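The final code cell of the notebook writes `data` with `to_json()`. As a rough illustration of what ends up in `bank_market_cap.json`, the sketch below assumes pandas' default `orient="columns"` layout, in which each column name maps row labels (as strings) to values. It uses only the standard `json` module. The helper `to_records` is a hypothetical name introduced here, and the three sample rows are copied from the `head()` output in the notebook.

```python
import json

# Column-oriented layout, assumed to match pandas' default orient="columns":
# {column name -> {row label as a string -> value}}.
# The values are the first three rows shown in the notebook's head() output.
column_oriented = {
    "Name": {
        "0": "JPMorgan Chase",
        "1": "Industrial and Commercial Bank of China",
        "2": "Bank of America",
    },
    "Market Cap (US$ Billion)": {"0": 390.934, "1": 345.214, "2": 325.331},
}


def to_records(columns):
    """Turn {column -> {row label -> value}} into a list of row dicts."""
    # Collect every row label and order them numerically ("10" after "9").
    row_labels = sorted(
        {label for col in columns.values() for label in col}, key=int
    )
    return [
        {name: col.get(label) for name, col in columns.items()}
        for label in row_labels
    ]


# Round-trip through JSON text to mimic writing the file and reading it back.
json_text = json.dumps(column_oriented)
records = to_records(json.loads(json_text))

for record in records:
    print(record["Name"], record["Market Cap (US$ Billion)"])
```

If row-wise records are preferred in the file itself, `to_json` also accepts `orient="records"`.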