Webscraping_Engineer_Peer_Review_Assignment.ipynb
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/dsimanoliveira/603e5cba4dcfb721e24cf68454fa499e/webscraping_engineer_peer_review_assignment.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "733501fd-f799-4dbc-8825-0d1942ddeecb"
},
"source": [
"<p style=\"text-align:center\">\n",
" <a href=\"https://skills.network/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkPY0221ENSkillsNetwork899-2023-01-01\">\n",
" <img src=\"https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png\" width=\"200\" alt=\"Skills Network Logo\" />\n",
" </a>\n",
"</p>\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "6475f542-e53f-4d38-b94d-667b0f0b9813"
},
"source": [
"# Peer Review Assignment - Data Engineer - Webscraping\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "5634bd5b-2073-4abf-ac5e-60addf1e8059"
},
"source": [
"Estimated time needed: **20** minutes\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "4cbadd10-bc26-404c-9f0f-8e16390af86c"
},
"source": [
"## Objectives\n",
"\n",
"In this part you will:\n",
"\n",
"- Use webscraping to get bank information\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "d1f34239-7d19-4df7-aab3-ac6d88827eda"
},
"source": [
"## Imports\n",
"\n",
"Import any additional libraries you may need here.\n"
]
},
{
"cell_type": "code",
"metadata": {
"id": "22f312ab-d5ed-4da3-9016-278ebf7518ad"
},
"outputs": [],
"source": [
"from bs4 import BeautifulSoup\n",
"import html5lib\n",
"import requests\n",
"import pandas as pd"
],
"execution_count": null
},
{
"cell_type": "markdown",
"metadata": {
"id": "764e78fa-dda1-488b-830d-4c42fa046624"
},
"source": [
"## Extract Data Using Web Scraping\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "68802876-ac84-43e3-b24a-8df5261d80f7"
},
"source": [
"The Wikipedia page https://web.archive.org/web/20200318083015/https://en.wikipedia.org/wiki/List_of_largest_banks provides information about the largest banks in the world by various parameters. Scrape the data from the 'By market capitalization' table and store it in a JSON file.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "bd044e14-2b76-487b-a9fd-6bc9520635ad"
},
"source": [
"### Webpage Contents\n",
"\n",
"Gather the contents of the webpage in text format using the `requests` library and assign them to the variable <code>html_data</code>.\n"
]
},
{
"cell_type": "code",
"source": [
"response = requests.get('https://web.archive.org/web/20200318083015/https://en.wikipedia.org/wiki/List_of_largest_banks')\n",
"html_data = response.content\n"
],
"metadata": {
"id": "fzL2q_6M7kId"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "3bdf785c-6c25-413f-87f2-d55335c16a46"
},
"source": [
"<b>Question 1</b> Print out the output of the following line, and remember it as it will be a quiz question:\n"
]
},
{
"cell_type": "code",
"source": [
"html_data[760:783]"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "p_Z3eL3B8OaD",
"outputId": "f56820ea-8112-491b-ac4a-84fbfa72b908"
},
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"b'\" href=\"/_static/css/ba'"
]
},
"metadata": {},
"execution_count": 112
}
]
},
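{
"cell_type": "markdown",
"metadata": {},
"source": [
"The slice above prints with a `b'...'` prefix because `response.content` is a `bytes` object, whereas `response.text` would give a decoded `str`. The two can also disagree on offsets when a page contains non-ASCII characters, since one such character may occupy several bytes in UTF-8. A minimal sketch of that difference, using a made-up string in place of the real page (the sample text and all variable names are assumptions for illustration only):\n",
"\n",
"```python\n",
"# Hypothetical stand-in for a downloaded page; not the real Wikipedia HTML.\n",
"sample_text = 'Zürich <a href=/wiki/Bank>'\n",
"\n",
"# Encoding mimics what response.content holds: raw bytes.\n",
"sample_bytes = sample_text.encode('utf-8')\n",
"\n",
"# Slicing bytes yields bytes; slicing str yields str.\n",
"bytes_slice = sample_bytes[0:7]\n",
"text_slice = sample_text[0:7]\n",
"\n",
"print(type(bytes_slice).__name__, bytes_slice)\n",
"print(type(text_slice).__name__, text_slice)\n",
"\n",
"# The non-ASCII letter takes two bytes in UTF-8, so the lengths differ by one.\n",
"print(len(sample_bytes) - len(sample_text))\n",
"```\n",
"\n",
"Because of this, the same index range may select different characters depending on whether the bytes or the decoded text is sliced.\n"
]
},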
{
"cell_type": "markdown",
"metadata": {
"id": "183111fc-9af4-4f7f-96ca-9eab98e4bcba"
},
"source": [
"### Scraping the Data\n",
"\n",
"<b>Question 2</b> Using the page contents and `BeautifulSoup`, load the data from the `By market capitalization` table into a `pandas` dataframe. The dataframe should have `Name` and `Market Cap (US$ Billion)` as column names. Display the first five rows using `head`.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ce4d7391-008e-4458-84c1-d1fc7b9d0375"
},
"source": [
"Parse the contents of the webpage using BeautifulSoup.\n"
]
},
{
"cell_type": "code",
"source": [
"soup = BeautifulSoup(html_data, 'html.parser')"
],
"metadata": {
"id": "gsmVGxWs8ZjT"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "eefd343b-f629-4865-add4-14ce9e9aed51"
},
"source": [
"Load the data from the `By market capitalization` table into a pandas dataframe. The dataframe should have `Name` and `Market Cap (US$ Billion)` as column names. Using the empty dataframe `data` and the given loop, extract the necessary data from each row and append it to the dataframe.\n"
]
},
{
"cell_type": "code",
"source": [
"data = pd.DataFrame(columns=[\"Name\", \"Market Cap (US$ Billion)\"])\n",
"\n",
"# Index 2 selects the third <tbody> on the page.\n",
"for row in soup.find_all('tbody')[2].find_all('tr'):\n",
"    col = row.find_all('td')\n",
"    # Rows without <td> cells (such as a header row) give an empty list and are skipped.\n",
"    if col:\n",
"        new_df_row = {\n",
"            # Take the text of the last link in the second cell as the bank name.\n",
"            'Name': col[1].find_all('a')[-1].text,\n",
"            'Market Cap (US$ Billion)': float(col[-1].text)\n",
"        }\n",
"        data = pd.concat([data, pd.DataFrame(new_df_row, index=[0])], ignore_index=True)\n"
],
"metadata": {
"id": "rUwQYeUE_tuK"
},
"execution_count": null,
"outputs": []
},
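{
"cell_type": "markdown",
"metadata": {},
"source": [
"The loop above relies on two details of the table markup: a row that holds only `<th>` cells makes `find_all('td')` return an empty list, and a name cell may hold more than one link. A minimal sketch on a small hand-written table shows both cases (the HTML snippet, the figures and the names `sample_html`, `sample_soup` and `sample_rows` are made up for illustration and are not taken from the archived page):\n",
"\n",
"```python\n",
"from bs4 import BeautifulSoup\n",
"\n",
"# Hypothetical markup imitating one header row and two data rows.\n",
"sample_html = '''\n",
"<table><tbody>\n",
"<tr><th>Rank</th><th>Bank name</th><th>Market cap</th></tr>\n",
"<tr><td>1</td><td><a href='/flag'></a> <a href='/a'>Bank A</a></td><td>10.5</td></tr>\n",
"<tr><td>2</td><td><a href='/b'>Bank B</a></td><td>7.25</td></tr>\n",
"</tbody></table>\n",
"'''\n",
"\n",
"sample_soup = BeautifulSoup(sample_html, 'html.parser')\n",
"sample_rows = []\n",
"for row in sample_soup.find('tbody').find_all('tr'):\n",
"    col = row.find_all('td')\n",
"    if not col:\n",
"        continue  # header row: it has <th> cells but no <td> cells\n",
"    sample_rows.append({\n",
"        # The last link in the second cell carries the bank name.\n",
"        'Name': col[1].find_all('a')[-1].text,\n",
"        'Market Cap (US$ Billion)': float(col[-1].text),\n",
"    })\n",
"\n",
"print(sample_rows)\n",
"```\n",
"\n",
"Building the dataframe once from such a list, for example with `pd.DataFrame(sample_rows)`, is a common alternative to calling `pd.concat` on every iteration.\n"
]
},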
{
"cell_type": "markdown",
"metadata": {
"id": "fc5391c0-33f5-4229-b4fe-765495e4ac68"
},
"source": [
"**Question 3** Display the first five rows using the `head` function.\n"
]
},
{
"cell_type": "code",
"source": [
"data.head()"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 206
},
"id": "w6HMq1RiMBip",
"outputId": "544f1804-350f-48f3-9daa-b9438dcec8da"
},
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"                                      Name  Market Cap (US$ Billion)\n",
"0                           JPMorgan Chase                   390.934\n",
"1  Industrial and Commercial Bank of China                   345.214\n",
"2                          Bank of America                   325.331\n",
"3                              Wells Fargo                   308.013\n",
"4                  China Construction Bank                   257.399"
]
},
"metadata": {},
"execution_count": 103
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "7f03eda7-0d24-48a6-b851-e2e81ec8a1ed"
},
"source": [
"\n",
"### Loading the Data\n",
"\n",
"Write the `pandas` dataframe created above to a JSON file named `bank_market_cap.json` using the `to_json()` function.\n"
]
},
{
"cell_type": "code",
"source": [
"data.to_json('bank_market_cap.json')"
],
"metadata": {
"id": "OarwAbrDNIoL"
},
"execution_count": null,
"outputs": []
},
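{
"cell_type": "markdown",
"metadata": {},
"source": [
"`to_json()` accepts an `orient` argument that controls the layout of the output. For a dataframe the default is `'columns'`, which nests the values under each column name and row label. A minimal sketch comparing it with `'records'` on a two-row frame (the frame contents and the names `sample`, `by_columns` and `by_records` are made up for illustration):\n",
"\n",
"```python\n",
"import json\n",
"import pandas as pd\n",
"\n",
"# Hypothetical two-row frame standing in for the scraped data.\n",
"sample = pd.DataFrame({\n",
"    'Name': ['Bank A', 'Bank B'],\n",
"    'Market Cap (US$ Billion)': [10.5, 7.25],\n",
"})\n",
"\n",
"# With no path argument, to_json() returns the JSON text instead of writing a file.\n",
"by_columns = json.loads(sample.to_json())\n",
"by_records = json.loads(sample.to_json(orient='records'))\n",
"\n",
"print(by_columns)\n",
"print(by_records)\n",
"```\n",
"\n",
"The default layout can be read back into a dataframe with `pd.read_json('bank_market_cap.json')`.\n"
]
},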
{
"cell_type": "markdown",
"metadata": {
"id": "c46b715f-9cdf-4db1-8fc6-78cb57f14ff0"
},
"source": [
"## Authors\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "e4f97a2e-adf1-41d6-89fe-e4e63c47f721"
},
"source": [
"Ramesh Sannareddy, Joseph Santarcangelo and Azim Hirjani\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "1589609f-21f1-4438-8edb-f51b0781d4a9"
},
"source": [
"### Other Contributors\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "de7e35ec-5271-4a04-a721-6a2102cde9a7"
},
"source": [
"Rav Ahuja\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "3d790194-f2df-49ee-bc2a-110a812e60a2"
},
"source": [
"## Change Log\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "40f0770b-5380-49e7-b6c3-e7c8d9522fd0"
},
"source": [
"| Date (YYYY-MM-DD) | Version | Changed By | Change Description |\n",
"| ----------------- | ------- | ----------------- | ---------------------------------- |\n",
"| 2022-07-12 | 0.2 | Appalabhaktula Hema | Corrected the code and markdown |\n",
"| 2020-11-25 | 0.1 | Ramesh Sannareddy | Created initial version of the lab |\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "6b5e1afb-c515-43d1-b8d3-60cd03658f45"
},
"source": [
"Copyright © 2020 IBM Corporation.\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"name": "python"
},
"colab": {
"provenance": [],
"include_colab_link": true
}
},
"nbformat": 4,
"nbformat_minor": 0
}