Extracting Stock Data Using Web Scraping
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<center>\n",
" <img src=\"https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/Logos/organization_logo/organization_logo.png\" width=\"300\" alt=\"cognitiveclass.ai logo\" />\n",
"</center>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h1>Extracting Stock Data Using a Web Scraping</h1>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Not all stock data is available via API in this assignment; you will use web-scraping to obtain financial data. You will be quizzed on your results. \n",
" Using beautiful soup we will extract historical share data from a web-page.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2>Table of Contents</h2>\n",
"<div class=\"alert alert-block alert-info\" style=\"margin-top: 20px\">\n",
" <ul>\n",
" <li>Downloading the Webpage Using Requests Library</li>\n",
" <li>Parsing Webpage HTML Using BeautifulSoup</li>\n",
" <li>Extracting Data and Building DataFrame</li>\n",
" </ul>\n",
"<p>\n",
" Estimated Time Needed: <strong>30 min</strong></p>\n",
"</div>\n",
"\n",
"<hr>\n"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Collecting bs4\n",
" Downloading https://files.pythonhosted.org/packages/10/ed/7e8b97591f6f456174139ec089c769f89a94a1a4025fe967691de971f314/bs4-0.0.1.tar.gz\n",
"Collecting beautifulsoup4 (from bs4)\n",
"\u001b[?25l Downloading https://files.pythonhosted.org/packages/d1/41/e6495bd7d3781cee623ce23ea6ac73282a373088fcd0ddc809a047b18eae/beautifulsoup4-4.9.3-py3-none-any.whl (115kB)\n",
"\u001b[K |████████████████████████████████| 122kB 2.4MB/s eta 0:00:01\n",
"\u001b[?25hCollecting soupsieve>1.2; python_version >= \"3.0\" (from beautifulsoup4->bs4)\n",
" Downloading https://files.pythonhosted.org/packages/36/69/d82d04022f02733bf9a72bc3b96332d360c0c5307096d76f6bb7489f7e57/soupsieve-2.2.1-py3-none-any.whl\n",
"Building wheels for collected packages: bs4\n",
" Building wheel for bs4 (setup.py) ... \u001b[?25ldone\n",
"\u001b[?25h Stored in directory: /home/jupyterlab/.cache/pip/wheels/a0/b0/b2/4f80b9456b87abedbc0bf2d52235414c3467d8889be38dd472\n",
"Successfully built bs4\n",
"Installing collected packages: soupsieve, beautifulsoup4, bs4\n",
"Successfully installed beautifulsoup4-4.9.3 bs4-0.0.1 soupsieve-2.2.1\n"
]
}
],
"source": [
"#!pip install pandas\n",
"#!pip install requests\n",
"!pip install bs4\n",
"#!pip install plotly"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import requests\n",
"from bs4 import BeautifulSoup"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Using Webscraping to Extract Stock Data\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use the `requests` library to download the webpage [https://finance.yahoo.com/quote/AMZN/history?period1=1451606400&period2=1612137600&interval=1mo&filter=history&frequency=1mo&includeAdjustedClose=true](https://finance.yahoo.com/quote/AMZN/history?period1=1451606400&period2=1612137600&interval=1mo&filter=history&frequency=1mo&includeAdjustedClose=true&cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBMDeveloperSkillsNetwork-PY0220EN-SkillsNetwork-23455606&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ&cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBMDeveloperSkillsNetwork-PY0220EN-SkillsNetwork-23455606&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ&cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBMDeveloperSkillsNetwork-PY0220EN-SkillsNetwork-23455606&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ&cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBMDeveloperSkillsNetwork-PY0220EN-SkillsNetwork-23455606&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ). Save the text of the response as a variable named `html_data`.\n"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"url = \"https://finance.yahoo.com/quote/AMZN/history?period1=1451606400&period2=1612137600&interval=1mo&filter=history&frequency=1mo&includeAdjustedClose=true\"\n",
"html_data = requests.get(url).text"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Parse the html data using `beautiful_soup`.\n"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"soup = BeautifulSoup(html_data,\"html5lib\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<b>Question 1</b> what is the content of the title attribute:\n"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<title>Amazon.com, Inc. (AMZN) Stock Historical Prices &amp; Data - Yahoo Finance</title>"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"soup.title"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using beautiful soup extract the table with historical share prices and store it into a dataframe named `amazon_data`. The dataframe should have columns Date, Open, High, Low, Close, Adj Close, and Volume. Fill in each variable with the correct data from the list `col`. \n",
"\n",
"Hint: Print the `col` list to see what data to use\n"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"amazon_data = pd.DataFrame(columns=[\"Date\", \"Open\", \"High\", \"Low\", \"Close\", \"Volume\"])\n",
"\n",
"for row in soup.find(\"tbody\").find_all(\"tr\"):\n",
" col = row.find_all(\"td\")\n",
" date =col[0].text\n",
" Open =col[1].text\n",
" high =col[2].text\n",
" low =col[3].text\n",
" close =col[4].text\n",
" adj_close =col[5].text\n",
" volume =col[6].text\n",
" \n",
" amazon_data = amazon_data.append({\"Date\":date, \"Open\":Open, \"High\":high, \"Low\":low, \"Close\":close, \"Adj Close\":adj_close, \"Volume\":volume}, ignore_index=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Print out the first five rows of the `amazon_data` dataframe you created.\n"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Date</th>\n",
" <th>Open</th>\n",
" <th>High</th>\n",
" <th>Low</th>\n",
" <th>Close</th>\n",
" <th>Volume</th>\n",
" <th>Adj Close</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Jan 01, 2021</td>\n",
" <td>3,270.00</td>\n",
" <td>3,363.89</td>\n",
" <td>3,086.00</td>\n",
" <td>3,206.20</td>\n",
" <td>71,529,900</td>\n",
" <td>3,206.20</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Dec 01, 2020</td>\n",
" <td>3,188.50</td>\n",
" <td>3,350.65</td>\n",
" <td>3,072.82</td>\n",
" <td>3,256.93</td>\n",
" <td>77,567,800</td>\n",
" <td>3,256.93</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Nov 01, 2020</td>\n",
" <td>3,061.74</td>\n",
" <td>3,366.80</td>\n",
" <td>2,950.12</td>\n",
" <td>3,168.04</td>\n",
" <td>90,810,500</td>\n",
" <td>3,168.04</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Oct 01, 2020</td>\n",
" <td>3,208.00</td>\n",
" <td>3,496.24</td>\n",
" <td>3,019.00</td>\n",
" <td>3,036.15</td>\n",
" <td>116,242,300</td>\n",
" <td>3,036.15</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Sep 01, 2020</td>\n",
" <td>3,489.58</td>\n",
" <td>3,552.25</td>\n",
" <td>2,871.00</td>\n",
" <td>3,148.73</td>\n",
" <td>115,943,500</td>\n",
" <td>3,148.73</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Date Open High Low Close Volume Adj Close\n",
"0 Jan 01, 2021 3,270.00 3,363.89 3,086.00 3,206.20 71,529,900 3,206.20\n",
"1 Dec 01, 2020 3,188.50 3,350.65 3,072.82 3,256.93 77,567,800 3,256.93\n",
"2 Nov 01, 2020 3,061.74 3,366.80 2,950.12 3,168.04 90,810,500 3,168.04\n",
"3 Oct 01, 2020 3,208.00 3,496.24 3,019.00 3,036.15 116,242,300 3,036.15\n",
"4 Sep 01, 2020 3,489.58 3,552.25 2,871.00 3,148.73 115,943,500 3,148.73"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"amazon_data.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<b>Question 2</b> What is the name of the columns of the dataframe \n"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index(['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Adj Close'], dtype='object')"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"amazon_data.columns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<b>Question 3</b> What is the `Open` of `Jun 01, 2019` of the dataframe?\n"
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Date</th>\n",
" <th>Open</th>\n",
" <th>High</th>\n",
" <th>Low</th>\n",
" <th>Close</th>\n",
" <th>Volume</th>\n",
" <th>Adj Close</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>19</th>\n",
" <td>Jun 01, 2019</td>\n",
" <td>1,760.01</td>\n",
" <td>1,935.20</td>\n",
" <td>1,672.00</td>\n",
" <td>1,893.63</td>\n",
" <td>74,746,500</td>\n",
" <td>1,893.63</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Date Open High Low Close Volume Adj Close\n",
"19 Jun 01, 2019 1,760.01 1,935.20 1,672.00 1,893.63 74,746,500 1,893.63"
]
},
"execution_count": 62,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"amazon_data.loc[amazon_data[\"Date\"]==\"Jun 01, 2019\"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2>About the Authors:</h2> \n",
"\n",
"<a href=\"https://www.linkedin.com/in/joseph-s-50398b136/\">Joseph Santarcangelo</a> has a PhD in Electrical Engineering, his research focused on using machine learning, signal processing, and computer vision to determine how videos impact human cognition. Joseph has been working for IBM since he completed his PhD.\n",
"\n",
"Azim Hirjani\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Change Log\n",
"\n",
"| Date (YYYY-MM-DD) | Version | Changed By | Change Description |\n",
"| ----------------- | ------- | ------------- | ------------------------- |\n",
"| 2020-11-10 | 1.1 | Malika Singla | Deleted the Optional part |\n",
"| 2020-08-27 | 1.0 | Malika Singla | Added lab to GitLab |\n",
"\n",
"<hr>\n",
"\n",
"## <h3 align=\"center\"> © IBM Corporation 2020. All rights reserved. <h3/>\n",
"\n",
"<p>\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python",
"language": "python",
"name": "conda-env-python-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.12"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
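
The notebook above stores every scraped cell as text (for example `3,270.00` and `Jan 01, 2021`), so numeric comparisons or date arithmetic need a conversion step first. The following is a minimal sketch of one way to do that with only the standard library. It is not part of the original lab: the helper name `parse_row` and the `PRICE_COLUMNS` tuple are hypothetical, the two sample rows are copied from the recorded output, and the `%b` date format assumes an English locale for month abbreviations.

```python
from datetime import datetime

# Two rows copied from the notebook's recorded output; every field is text.
scraped_rows = [
    {"Date": "Jan 01, 2021", "Open": "3,270.00", "High": "3,363.89",
     "Low": "3,086.00", "Close": "3,206.20", "Adj Close": "3,206.20",
     "Volume": "71,529,900"},
    {"Date": "Jun 01, 2019", "Open": "1,760.01", "High": "1,935.20",
     "Low": "1,672.00", "Close": "1,893.63", "Adj Close": "1,893.63",
     "Volume": "74,746,500"},
]

# Hypothetical grouping of the columns that hold prices.
PRICE_COLUMNS = ("Open", "High", "Low", "Close", "Adj Close")


def parse_row(row):
    """Hypothetical helper: convert one scraped row of strings to typed values."""
    # "%b %d, %Y" matches strings such as "Jun 01, 2019" (English month names assumed).
    parsed = {"Date": datetime.strptime(row["Date"], "%b %d, %Y").date()}
    for name in PRICE_COLUMNS:
        # Drop the thousands separator before converting to float.
        parsed[name] = float(row[name].replace(",", ""))
    parsed["Volume"] = int(row["Volume"].replace(",", ""))
    return parsed


typed_rows = [parse_row(row) for row in scraped_rows]

# Look up the row for June 2019 by its ISO date and read the opening price.
june_2019 = next(r for r in typed_rows if r["Date"].isoformat() == "2019-06-01")
print(june_2019["Open"])
```

With typed values, a question such as Question 3 can be answered by comparing dates rather than matching the exact display string.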