@kanungo
Last active December 13, 2017 02:42
Scraping Sample for Final
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Scraping tables\n",
"You are asked to scrape the table at https://www.ncdc.noaa.gov/sotc/global/201613\n",
"\n",
"Your first task is to right-click the table and inspect its HTML code.\n",
"\n",
"You will find that the table is identified as:\n",
"```\n",
"<table id=\"annualtemp\" class=\"records\" style=\"width:650px;\">\n",
"```\n",
"To find this table, pass its identifying attributes (here the id and the class) to the find or findAll command, as shown below.\n",
"\n",
"After that, the for loop searches for all the rows in the table and, within each row, for each cell.\n",
"\n",
"Please note how we create a list of lists. This can be converted to a data frame or written to a csv file (not shown here; see class notes)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Sample\n",
"import urllib.request\n",
"from bs4 import BeautifulSoup as bs\n",
"\n",
"url = \"https://www.ncdc.noaa.gov/sotc/global/201613\"\n",
"request = urllib.request.Request(url)\n",
"response = urllib.request.urlopen(request)\n",
"data = response.read()\n",
"response.close()\n",
"\n",
"# Create the soup\n",
"soup = bs(data, \"html.parser\")\n",
"\n",
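"# Put every identifying attribute in one dict; a third positional argument is read as `recursive`, not as a filter\n",
"table = soup.findAll(\"table\", {\"id\": \"annualtemp\", \"class\": \"records\"})\n",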
"\n",
"print(len(table))\n",
"\n",
"r = []  # list of rows; each row is a list of cell texts\n",
"for year in table:\n",
"    # Select the rows marked valign=\"middle\"\n",
"    rows = year.findAll('tr', {\"valign\" : \"middle\"})\n",
"    for row in rows:\n",
"        cells = row.findAll('td')\n",
"        c = []\n",
"        for cell in cells:\n",
"            c.append(cell.getText())\n",
"        print(c)\n",
"        r.append(c)\n",
"print(r)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
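
The notebook's markdown cell notes that the list of lists can be written to a csv file but does not show how. The following is a minimal sketch of one way to do that with Python's standard `csv` module. The helper name `rows_to_csv`, the column names, and the sample rows are all hypothetical placeholders and are not data from the NOAA page; in the notebook, the scraped list `r` would take the place of `sample_rows`.

```python
import csv
import io


def rows_to_csv(rows, header=None):
    """Serialize a list of lists as CSV text (hypothetical helper)."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer)
    if header is not None:
        writer.writerow(header)  # optional header row
    writer.writerows(rows)       # one CSV line per inner list
    return buffer.getvalue()


# Placeholder rows standing in for the scraped list of lists `r`.
sample_rows = [
    ["a1", "b1", "c1"],
    ["a2", "b2", "c2"],
]

csv_text = rows_to_csv(sample_rows, header=["col_1", "col_2", "col_3"])
print(csv_text)
```

To write a file instead of a string, the same `csv.writer` calls can be used on a file opened with `open(path, "w", newline="")`, where `path` is whatever file name you choose. If pandas is available, `pandas.DataFrame(rows)` also accepts a list of lists directly.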