Coursera Python Project Webscraping
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<center>\n",
" <img src=\"https://gitlab.com/ibm/skills-network/courses/placeholder101/-/raw/master/labs/module%201/images/IDSNlogo.png\" width=\"300\" alt=\"cognitiveclass.ai logo\" />\n",
"</center>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Peer Review Assignment - Data Engineer - Webscraping\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Estimated time needed: **20** minutes\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Objectives\n",
"\n",
"In this part you will:\n",
"\n",
"* Use web scraping to get bank information\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This lab uses Python and several Python libraries. Some of them may already be installed in your lab environment or in SN Labs; others you may need to install yourself. The cell below installs them when executed.\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Collecting bs4\n",
" Downloading bs4-0.0.1.tar.gz (1.1 kB)\n",
" Preparing metadata (setup.py) ... done\n",
"Collecting beautifulsoup4\n",
" Downloading beautifulsoup4-4.10.0-py3-none-any.whl (97 kB)\n",
"Collecting soupsieve>1.2\n",
" Downloading soupsieve-2.3.1-py3-none-any.whl (37 kB)\n",
"Building wheels for collected packages: bs4\n",
" Building wheel for bs4 (setup.py) ... done\n",
" Created wheel for bs4: filename=bs4-0.0.1-py3-none-any.whl size=1271 sha256=4481f99e3737a5e34dd01b26638420bcbf061afb731a34c69aacb2466efd36f7\n",
" Stored in directory: /home/jupyterlab/.cache/pip/wheels/0a/9e/ba/20e5bbc1afef3a491f0b3bb74d508f99403aabe76eda2167ca\n",
"Successfully built bs4\n",
"Installing collected packages: soupsieve, beautifulsoup4, bs4\n",
"Successfully installed beautifulsoup4-4.10.0 bs4-0.0.1 soupsieve-2.3.1\n"
]
}
],
"source": [
"#!pip install pandas\n",
"!pip install bs4\n",
"#!pip install requests"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Imports\n",
"\n",
"Import any additional libraries you may need here.\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"from bs4 import BeautifulSoup\n",
"import requests\n",
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Extract Data Using Web Scraping\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The Wikipedia page [https://en.wikipedia.org/wiki/List_of_largest_banks](https://en.wikipedia.org/wiki/List_of_largest_banks?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkPY0221ENSkillsNetwork23455645-2021-01-01) lists the largest banks in the world by various parameters. Scrape the data from the 'By market capitalization' table and store it in a JSON file.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Webpage Contents\n",
"\n",
"Gather the contents of the webpage in text format using the `requests` library and assign it to the variable <code>html_data</code>.\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<!DOCTYPE html>\n",
"<html class=\"client-nojs\" lan\n",
"List of largest banks -\n"
]
}
],
"source": [
"# Fetch the page; html_data is a requests.Response and .text holds the decoded HTML\n",
"html_data = requests.get('https://en.wikipedia.org/wiki/List_of_largest_banks')\n",
"print(html_data.text[:45])\n",
"print(html_data.text[101:124])"
]
},
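{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of a more defensive fetch, assuming you want a timeout and an explicit error when the request fails. The helper name `fetch_html` and the 10-second timeout are illustrative choices, not part of the lab.\n",
"\n",
"```python\n",
"import requests\n",
"\n",
"URL = 'https://en.wikipedia.org/wiki/List_of_largest_banks'\n",
"\n",
"\n",
"def fetch_html(url, timeout=10):\n",
"    # Hypothetical helper: return the decoded page text, raising on HTTP errors.\n",
"    response = requests.get(url, timeout=timeout)\n",
"    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses\n",
"    return response.text\n",
"```\n",
"\n",
"Calling `fetch_html(URL)` needs network access and returns a plain string rather than a `Response` object.\n"
]
},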
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<b>Question 1</b> Print out the output of the following line, and remember it as it will be a quiz question:\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"List of largest banks -\n"
]
}
],
"source": [
"# Characters at indices 101 through 123 of the page text (the slice end is exclusive)\n",
"print(html_data.text[101:124])"
]
},
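{
"cell_type": "markdown",
"metadata": {},
"source": [
"Question 1 relies on Python slice semantics: `text[101:124]` returns the 23 characters at indices 101 through 123, because the end index is excluded. A small sketch with a made-up string (an assumption, not the real page) shows the same arithmetic.\n",
"\n",
"```python\n",
"# Hypothetical stand-in for html_data.text; the real page is much longer.\n",
"sample = '<!DOCTYPE html><title>List of largest banks - Wikipedia</title>'\n",
"\n",
"start = sample.index('List')  # index of the first character to keep\n",
"end = start + 23              # the slice stops just before this index\n",
"\n",
"piece = sample[start:end]\n",
"print(piece)\n",
"print(len(piece))\n",
"```\n",
"\n",
"The length of a slice is always `end - start` when both indices fall inside the string, which is why `[101:124]` yields 23 characters.\n"
]
},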
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Scraping the Data\n",
"\n",
"<b>Question 2</b> Using the page contents and `BeautifulSoup`, load the data from the `By market capitalization` table into a `pandas` dataframe. The dataframe should have the bank `Name` and `Market Cap (US$ Billion)` as column names. Display the first five rows using `head`.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using BeautifulSoup, parse the contents of the webpage.\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"# Parse the raw response bytes with Python's built-in HTML parser\n",
"soup = BeautifulSoup(html_data.content, \"html.parser\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Load the data from the `By market capitalization` table into a pandas dataframe. The dataframe should have the bank `Name` and `Market Cap (US$ Billion)` as column names. Using the empty dataframe `data` and the given loop, extract the necessary data from each row and append it to the dataframe.\n"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"# Collect one dict per table row, then build the dataframe in a single step.\n",
"# (DataFrame.append was removed in pandas 2.0.)\n",
"rows = []\n",
"\n",
"# Index 3 selected the market-capitalization table when this notebook was run;\n",
"# the page layout may change over time.\n",
"for row in soup.find_all('tbody')[3].find_all('tr'):\n",
"    col = row.find_all('td')\n",
"    if col:  # skip rows without <td> cells, such as the header row\n",
"        name = col[1].text.strip()\n",
"        marketcap = col[2].text.strip()\n",
"        rows.append({\"Name\": name, \"Market Cap (US$ Billion)\": marketcap})\n",
"\n",
"data = pd.DataFrame(rows, columns=[\"Name\", \"Market Cap (US$ Billion)\"])"
]
},
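{
"cell_type": "markdown",
"metadata": {},
"source": [
"The row-extraction loop can be tried offline on a small inline table. This is a sketch under stated assumptions: the HTML snippet, the bank names and the figures are invented for illustration, and the values are converted to `float`, which the lab does not require.\n",
"\n",
"```python\n",
"from bs4 import BeautifulSoup\n",
"import pandas as pd\n",
"\n",
"# Hypothetical two-row table standing in for the Wikipedia markup.\n",
"html = '''\n",
"<table><tbody>\n",
"<tr><th>Rank</th><th>Bank name</th><th>Market cap</th></tr>\n",
"<tr><td>1</td><td> Bank A </td><td>100.5</td></tr>\n",
"<tr><td>2</td><td> Bank B </td><td>90.25</td></tr>\n",
"</tbody></table>\n",
"'''\n",
"\n",
"soup_example = BeautifulSoup(html, 'html.parser')\n",
"rows = []\n",
"for row in soup_example.find_all('tbody')[0].find_all('tr'):\n",
"    col = row.find_all('td')\n",
"    if col:  # the header row has only <th> cells, so it is skipped\n",
"        rows.append({'Name': col[1].text.strip(),\n",
"                     'Market Cap (US$ Billion)': float(col[2].text.strip())})\n",
"\n",
"example = pd.DataFrame(rows, columns=['Name', 'Market Cap (US$ Billion)'])\n",
"print(example)\n",
"```\n",
"\n",
"Building the dataframe once from a list of dicts avoids growing it row by row and works on pandas versions where `DataFrame.append` is no longer available.\n"
]
},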
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Question 3** Display the first five rows using the `head` function.\n"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Name</th>\n",
" <th>Market Cap (US$ Billion)</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>JPMorgan Chase</td>\n",
" <td>488.470</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Bank of America</td>\n",
" <td>379.250</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Industrial and Commercial Bank of China</td>\n",
" <td>246.500</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Wells Fargo</td>\n",
" <td>308.013</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>China Construction Bank</td>\n",
" <td>257.399</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Name Market Cap (US$ Billion)\n",
"0 JPMorgan Chase 488.470\n",
"1 Bank of America 379.250\n",
"2 Industrial and Commercial Bank of China 246.500\n",
"3 Wells Fargo 308.013\n",
"4 China Construction Bank 257.399"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Write your code here\n",
"data.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Loading the Data\n",
"\n",
"Usually you would load the `pandas` dataframe created above into a JSON file named `bank_market_cap.json` using the `to_json()` function. This time, however, the data will be sent to another team, who will split the data file into two files and inspect it. Saving the data here would interfere with the next part of the assignment, so do not save it.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Write your code here\n",
"# according to the lab notes above, we aren't supposed to put anything here."
]
},
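{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference only, a sketch of what a `to_json()` export could look like. It writes to a temporary directory so that no `bank_market_cap.json` is created; the sample rows, the file name and the `orient='records'` layout are assumptions, since the lab does not specify them.\n",
"\n",
"```python\n",
"import json\n",
"import tempfile\n",
"from pathlib import Path\n",
"\n",
"import pandas as pd\n",
"\n",
"# Hypothetical rows; the real dataframe comes from the scraping step.\n",
"df = pd.DataFrame({'Name': ['Bank A', 'Bank B'],\n",
"                   'Market Cap (US$ Billion)': [100.5, 90.25]})\n",
"\n",
"with tempfile.TemporaryDirectory() as tmp:\n",
"    path = Path(tmp) / 'example_market_cap.json'  # hypothetical file name\n",
"    df.to_json(path, orient='records')            # one JSON object per row\n",
"    records = json.loads(path.read_text())        # read it back to inspect\n",
"\n",
"print(records)\n",
"```\n",
"\n",
"The temporary directory is deleted when the `with` block ends, so nothing is left behind for the next part of the assignment.\n"
]
},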
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Authors\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Ramesh Sannareddy, Joseph Santarcangelo and Azim Hirjani\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Other Contributors\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Rav Ahuja\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Change Log\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"| Date (YYYY-MM-DD) | Version | Changed By | Change Description |\n",
"| ----------------- | ------- | ----------------- | ---------------------------------- |\n",
"| 2020-11-25 | 0.1 | Ramesh Sannareddy | Created initial version of the lab |\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Copyright © 2020 IBM Corporation.\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python",
"language": "python",
"name": "conda-env-python-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.12"
}
},
"nbformat": 4,
"nbformat_minor": 4
}