@carlward
Created May 12, 2016 23:52
Parsing HTML and XML Webcast
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Parsing HTML and XML"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## What is a Parser?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A parser reads text written in a formal syntax, such as HTML or XML, breaks it into its component parts, and arranges those parts into a structure (usually a tree) that a program can navigate."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# HTML"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"HTML is the markup language of the web. It defines the structure and content of a webpage. For our purposes, we are only interested in how to extract data from it programmatically."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### HTML is made up of elements\n",
"Elements are defined by tags:\n",
"- elements can contain content \n",
"```HTML\n",
"<tag>content</tag>\n",
"```\n",
"- elements can contain other elements\n",
"```HTML\n",
"<tag>\n",
" <sub_tag>sub tag 1</sub_tag>\n",
" <sub_tag>sub tag 2</sub_tag>\n",
"</tag>\n",
"```\n",
"- element tags can have attributes \n",
"```HTML\n",
"<tag id=\"tag_id\" style=\"visibility: hidden;\">\n",
"...\n",
"</tag>\n",
"```\n",
"- elements either have an opening and closing tag \n",
"```HTML\n",
"<tag>content</tag>\n",
"```\n",
" or are self-closing\n",
"```HTML\n",
"<tag />\n",
"```"
]
},
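{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough illustration of the points above, the sketch below parses a small made-up snippet with Python's standard `xml.etree.ElementTree` module and inspects the tag name, the attributes, and the content of the child elements. The snippet and the variable names are assumptions for illustration. It works only because the snippet is well-formed; real-world HTML usually needs a more forgiving parser.\n",
"\n",
"```python\n",
"import xml.etree.ElementTree as ET\n",
"\n",
"# Hypothetical snippet: one element with two attributes and two child elements\n",
"snippet = (\n",
"    '<tag id=\"tag_id\" style=\"visibility: hidden;\">'\n",
"    '<sub_tag>sub tag 1</sub_tag>'\n",
"    '<sub_tag>sub tag 2</sub_tag>'\n",
"    '</tag>'\n",
")\n",
"\n",
"element = ET.fromstring(snippet)\n",
"tag_name = element.tag                           # name of the outer tag\n",
"attributes = element.attrib                      # attributes as a dict\n",
"child_texts = [child.text for child in element]  # content of each child element\n",
"\n",
"print(tag_name)\n",
"print(attributes)\n",
"print(child_texts)\n",
"```"
]
},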
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Example HTML Code"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```HTML\n",
"<!DOCTYPE html>\n",
"<html lang=\"en\">\n",
" <head>\n",
" <meta charset=\"utf-8\"/>\n",
" <title>\n",
" Soup Title\n",
" </title>\n",
" <meta content=\"Soup\" name=\"description\"/>\n",
" <meta content=\"Soupy Soup\" name=\"author\"/>\n",
" <style type=\"text/css\">\n",
" .tg {border-collapse:collapse;border-spacing:0;border-color:#999;border-width:1px;border-style:solid;margin:0px auto;}\n",
" .tg td {font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:0px;overflow:hidden;word-break:normal;border-color:#999;color:#444;background-color:#F7FDFA;}\n",
" .tg th {font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:0px;overflow:hidden;word-break:normal;border-color:#999;color:#fff;background-color:#26ADE4;}\n",
" .tg .tg-vn4c {background-color:#D2E4FC}\n",
" </style>\n",
" </head>\n",
" <body>\n",
" <table class=\"tg\">\n",
" <tr>\n",
" <th class=\"tg-031e\">\n",
" Soup\n",
" </th>\n",
" <th class=\"tg-031e\">\n",
" Price\n",
" </th>\n",
" <th class=\"tg-031e\">\n",
" Weight\n",
" </th>\n",
" <th class=\"tg-031e\">\n",
" Rating\n",
" </th>\n",
" <th class=\"tg-031e\">\n",
" Reviews\n",
" </th>\n",
" </tr>\n",
" <tr>\n",
" <td class=\"tg-vn4c\">\n",
" Amy's Kitchen Organic Gluten Free Low Fat Chunky Tomato Soup\n",
" </td>\n",
" <td class=\"tg-vn4c\">\n",
" 1.7\n",
" </td>\n",
" <td class=\"tg-vn4c\">\n",
" .100\n",
" </td>\n",
" <td class=\"tg-vn4c\">\n",
" 5\n",
" </td>\n",
" <td class=\"tg-vn4c\">\n",
" 4\n",
" </td>\n",
" </tr>\n",
" <tr>\n",
" <td class=\"tg-031e\">\n",
" Heinz Classic Cream of Tomato Soup\n",
" </td>\n",
" <td class=\"tg-031e\">\n",
" .95\n",
" </td>\n",
" <td class=\"tg-031e\">\n",
" .400\n",
" </td>\n",
" <td class=\"tg-031e\">\n",
" 5\n",
" </td>\n",
" <td class=\"tg-031e\">\n",
" 14\n",
" </td>\n",
" </tr>\n",
" <tr>\n",
" <td class=\"tg-vn4c\">\n",
" Baxters Favourites Cream of Tomato Soup\n",
" </td>\n",
" <td class=\"tg-vn4c\">\n",
" 1.15\n",
" </td>\n",
" <td class=\"tg-vn4c\">\n",
" .400\n",
" </td>\n",
" <td class=\"tg-vn4c\">\n",
" 2\n",
" </td>\n",
" <td class=\"tg-vn4c\">\n",
" 1\n",
" </td>\n",
" </tr>\n",
" <tr>\n",
" <td class=\"tg-031e\">\n",
" Cross &amp; Blackwell Cream of Tomato Soup\n",
" </td>\n",
" <td class=\"tg-031e\">\n",
" 2.00\n",
" </td>\n",
" <td class=\"tg-031e\">\n",
" 4 x .400\n",
" </td>\n",
" <td class=\"tg-031e\">\n",
" 5\n",
" </td>\n",
" <td class=\"tg-031e\">\n",
" 2\n",
" </td>\n",
" </tr>\n",
" <tr>\n",
" <td class=\"tg-vn4c\">\n",
" Morrisons Cream of Tomato Soup\n",
" </td>\n",
" <td class=\"tg-vn4c\">\n",
" .45\n",
" </td>\n",
" <td class=\"tg-vn4c\">\n",
" .100\n",
" </td>\n",
" <td class=\"tg-vn4c\">\n",
" 4\n",
" </td>\n",
" <td class=\"tg-vn4c\">\n",
" 1\n",
" </td>\n",
" </tr>\n",
" </table>\n",
" </body>\n",
"</html>\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<html lang=\"en\">\n",
"<head>\n",
" <meta charset=\"utf-8\">\n",
"\n",
" <title>Soup Title</title>\n",
" <meta name=\"description\" content=\"Soup\">\n",
" <meta name=\"author\" content=\"Soupy Soup\">\n",
"\n",
" <style type=\"text/css\">\n",
" .tg {border-collapse:collapse;border-spacing:0;border-color:#999;border-width:1px;border-style:solid;margin:0px auto;}\n",
" .tg td {font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:0px;overflow:hidden;word-break:normal;border-color:#999;color:#444;background-color:#F7FDFA;}\n",
" .tg th {font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:0px;overflow:hidden;word-break:normal;border-color:#999;color:#fff;background-color:#26ADE4;}\n",
" .tg .tg-vn4c {background-color:#D2E4FC}\n",
" </style>\n",
"\n",
"</head>\n",
"\n",
"<body>\n",
"\n",
"<table class=\"tg\">\n",
" <tr>\n",
" <th class=\"tg-031e\">Soup</th>\n",
" <th class=\"tg-031e\">Price</th>\n",
" <th class=\"tg-031e\">Weight</th>\n",
" <th class=\"tg-031e\">Rating</th>\n",
" <th class=\"tg-031e\">Reviews</th>\n",
" </tr>\n",
" <tr>\n",
" <td class=\"tg-vn4c\">Amy's Kitchen Organic Gluten Free Low Fat Chunky Tomato Soup</td>\n",
" <td class=\"tg-vn4c\">1.7</td>\n",
" <td class=\"tg-vn4c\">.100</td>\n",
" <td class=\"tg-vn4c\">5</td>\n",
" <td class=\"tg-vn4c\">4</td>\n",
" </tr>\n",
" <tr>\n",
" <td class=\"tg-031e\">Heinz Classic Cream of Tomato Soup</td>\n",
" <td class=\"tg-031e\">.95</td>\n",
" <td class=\"tg-031e\">.400</td>\n",
" <td class=\"tg-031e\">5</td>\n",
" <td class=\"tg-031e\">14</td>\n",
" </tr>\n",
" <tr>\n",
" <td class=\"tg-vn4c\">Baxters Favourites Cream of Tomato Soup</td>\n",
" <td class=\"tg-vn4c\">1.15</td>\n",
" <td class=\"tg-vn4c\">.400</td>\n",
" <td class=\"tg-vn4c\">2</td>\n",
" <td class=\"tg-vn4c\">1</td>\n",
" </tr>\n",
" <tr>\n",
" <td class=\"tg-031e\">Cross &amp; Blackwell Cream of Tomato Soup</td>\n",
" <td class=\"tg-031e\">2.00</td>\n",
" <td class=\"tg-031e\">4 x .400</td>\n",
" <td class=\"tg-031e\">5</td>\n",
" <td class=\"tg-031e\">2</td>\n",
" </tr>\n",
" <tr>\n",
" <td class=\"tg-vn4c\">Morrisons Cream of Tomato Soup</td>\n",
" <td class=\"tg-vn4c\">.45</td>\n",
" <td class=\"tg-vn4c\">.100</td>\n",
" <td class=\"tg-vn4c\">4</td>\n",
" <td class=\"tg-vn4c\">1</td>\n",
" </tr>\n",
"</table>\n",
"\n",
"\n",
"</body>\n",
"</html>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Parsing with Beautiful Soup"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Beautiful Soup is a library of convenience functions that, together with a parser, lets you navigate and manipulate HTML or XML code. Let's parse this table of soup with Beautiful Soup."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from bs4 import BeautifulSoup\n",
"\n",
"with open('soup.html', 'rb') as html_file:\n",
" soup = BeautifulSoup(html_file, 'lxml') # Specify that we want to use the lxml parser"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"print(soup.prettify())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"print(\"The head tag:\")\n",
"print(soup.find('head').prettify())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Different ways to select elements\n",
"print(soup.find('head').find('title'))\n",
"print(soup.find('title'))\n",
"print(soup.title.get_text())\n",
"title = soup.find('title')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"title.attrs"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Select the table"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"tables = soup.find_all('table')\n",
"for table in tables:\n",
"    print(\"Attributes: {}\".format(table.attrs))\n",
"    print(\"Number of rows: {}\".format(len(table.find_all('tr'))))  # table row tags\n",
"    print(\"Number of cells: {}\".format(len(table.find_all('td'))))  # table cells"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"table = soup.find('table', attrs={'class': ['tg']})"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"table"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Convert the table data into standard Python data structures"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"table.find('td').name  # .name is the tag's name: 'td' for the first table cell"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"soup_list = []\n",
"rows = table.find_all('tr')\n",
"header_row = rows.pop(0)\n",
"\n",
"# Loop through each row and extract cells\n",
"for row in rows:\n",
" name, price, weight, rating, reviews = row.find_all('td')\n",
" soup_list.append({\n",
" 'name': name.text,\n",
" 'price': price.text,\n",
" 'weight': weight.text,\n",
" 'rating': rating.text,\n",
" 'reviews': reviews.text \n",
" })\n",
" \n",
"soup_list"
]
},
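{
"cell_type": "markdown",
"metadata": {},
"source": [
"Every value extracted above is still a string. The sketch below shows one possible way to convert the numeric fields, assuming the same field names as above. `to_number` is a hypothetical helper, and the two sample rows are copied by hand from the table. Values that do not parse as a single number, such as the weight `4 x .400`, are left as stripped text.\n",
"\n",
"```python\n",
"def to_number(text):\n",
"    # Return the text as a float if it parses as one, otherwise the stripped text\n",
"    cleaned = text.strip()\n",
"    try:\n",
"        return float(cleaned)\n",
"    except ValueError:\n",
"        return cleaned\n",
"\n",
"# Two rows copied by hand from the soup table, in the same shape as soup_list\n",
"sample_rows = [\n",
"    {'name': 'Heinz Classic Cream of Tomato Soup', 'price': '.95',\n",
"     'weight': '.400', 'rating': '5', 'reviews': '14'},\n",
"    {'name': 'Cross & Blackwell Cream of Tomato Soup', 'price': '2.00',\n",
"     'weight': '4 x .400', 'rating': '5', 'reviews': '2'},\n",
"]\n",
"\n",
"# Keep names as text and try to convert every other field\n",
"cleaned_rows = [\n",
"    {key: (value.strip() if key == 'name' else to_number(value))\n",
"     for key, value in row.items()}\n",
"    for row in sample_rows\n",
"]\n",
"print(cleaned_rows)\n",
"```\n",
"\n",
"Falling back to the original text keeps irregular values visible instead of silently dropping them."
]
},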
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Convert to a Pandas DataFrame"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df = pd.DataFrame(soup_list)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# XML"
]
},
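{
"cell_type": "markdown",
"metadata": {},
"source": [
"XML is a common format for exchanging data on the internet.\n",
"\n",
"XML looks much like HTML, except that it has no predefined set of tags: you define your own. It is also stricter than HTML, since every element must be closed and properly nested."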
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Sample XML Code"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```XML\n",
"<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n",
"<soupdata>\n",
" <soup>\n",
" <name>Amy's Kitchen Organic Gluten Free Low Fat Chunky Tomato Soup</name>\n",
" <price currency=\"GBP\">1.7</price>\n",
" <weight units='g'>.100</weight>\n",
" <rating>5</rating>\n",
" <reviews>4</reviews>\n",
" </soup>\n",
" <soup>\n",
" <name>Heinz Classic Cream of Tomato Soup</name>\n",
" <price currency=\"GBP\">.95</price>\n",
" <weight units='g'>.400</weight>\n",
" <rating>5</rating>\n",
" <reviews>14</reviews>\n",
" </soup>\n",
" <soup>\n",
" <name>Baxters Favourites Cream of Tomato Soup</name>\n",
" <price currency=\"GBP\">1.15</price>\n",
" <weight units='g'>.400</weight>\n",
" <rating>2</rating>\n",
" <reviews>1</reviews>\n",
" </soup>\n",
" <soup>\n",
" <name>Cross &amp; Blackwell Cream of Tomato Soup</name>\n",
" <price currency=\"GBP\">2.00</price>\n",
" <weight units='g'>4 x .400</weight>\n",
" <rating>5</rating>\n",
" <reviews>2</reviews>\n",
" </soup>\n",
" <soup>\n",
" <name>Morrisons Cream of Tomato Soup</name>\n",
" <price currency=\"GBP\">.45</price>\n",
" <weight units='g'>.100</weight>\n",
" <rating>4</rating>\n",
" <reviews>1</reviews>\n",
" </soup>\n",
"</soupdata>\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Typical XML Parsing"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9\n",
"\n",
"with open('soup.xml', 'rb') as f_in:\n",
"    tree = ET.parse(f_in)\n",
"    root = tree.getroot()\n",
"    print(\"Tag: %s\" % root.tag)\n",
"    print(\"Attributes: %s\" % root.attrib)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"for tag in root.findall('soup'):\n",
"    print(tag.tag)\n",
"    # iter() yields the element itself first, then all of its descendants\n",
"    for child_tag in tag.iter():\n",
"        print(\"tag: {0}\\t attributes: {1}\".format(child_tag.tag, child_tag.attrib))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"with open('soup.xml', 'r') as f_in:\n",
" tree = ET.parse(f_in)\n",
" root = tree.getroot()\n",
" soups = []\n",
" for tag in root.findall('soup'):\n",
" soups.append({child_tag.tag: child_tag.text for child_tag in tag})"
]
},
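{
"cell_type": "markdown",
"metadata": {},
"source": [
"The dictionaries built above keep only each element's text, so attributes such as `currency` and `units` are dropped. The sketch below shows one way to keep them, assuming each attribute is stored under a combined `tag_attribute` key (a naming choice made here purely for illustration). The inline XML is a one-record excerpt of the sample.\n",
"\n",
"```python\n",
"import xml.etree.ElementTree as ET\n",
"\n",
"# One-record excerpt of the sample XML\n",
"xml_text = '''<soupdata>\n",
"  <soup>\n",
"    <name>Heinz Classic Cream of Tomato Soup</name>\n",
"    <price currency='GBP'>.95</price>\n",
"    <weight units='g'>.400</weight>\n",
"  </soup>\n",
"</soupdata>'''\n",
"\n",
"records = []\n",
"for soup_elem in ET.fromstring(xml_text).findall('soup'):\n",
"    record = {}\n",
"    for child in soup_elem:\n",
"        record[child.tag] = child.text\n",
"        # Store each attribute under a combined key, e.g. price_currency\n",
"        for attr_name, attr_value in child.attrib.items():\n",
"            record[child.tag + '_' + attr_name] = attr_value\n",
"    records.append(record)\n",
"\n",
"print(records)\n",
"```"
]
},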
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"pd.DataFrame(soups)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Parsing iteratively"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Sometimes an XML document is too big to parse all at once. In these situations, we need to parse it iteratively."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9\n",
"\n",
"with open('soup.xml', 'rb') as f_in:\n",
"    for event, elem in ET.iterparse(f_in, events=(\"start\", \"end\")):\n",
"        print(\"event type: {}\\t element: {}\".format(event, elem))\n",
"\n",
"        # Print some details about elements with tag 'soup'.\n",
"        # An element's children are only guaranteed to be complete at its 'end' event.\n",
"        if elem.tag == 'soup' and event == \"end\":\n",
"            for field in ['name', 'price', 'weight', 'rating', 'reviews']:\n",
"                child = elem.find(field)\n",
"                print(\"\\t{}: {} attribs: {}\".format(field, child.text, child.attrib))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"with open('soup.xml', 'r') as f_in:\n",
" soups = []\n",
" for event, elem in ET.iterparse(f_in, events = (\"start\", \"end\")):\n",
" if elem.tag == 'soup' and event == \"end\":\n",
" soups.append({child.tag: child.text for child in elem})"
]
},
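{
"cell_type": "markdown",
"metadata": {},
"source": [
"When a file really is too large for memory, collecting rows is only half the job, because `iterparse` still builds the tree as it goes. A common pattern, sketched here with a small in-memory excerpt standing in for a large file, is to call `elem.clear()` on each `soup` element once it has been processed. The helper name `iter_soups` is hypothetical.\n",
"\n",
"```python\n",
"import io\n",
"import xml.etree.ElementTree as ET\n",
"\n",
"def iter_soups(source):\n",
"    # Yield one dict per <soup> element, then clear the element\n",
"    for event, elem in ET.iterparse(source, events=('end',)):\n",
"        if elem.tag == 'soup':\n",
"            yield {child.tag: child.text for child in elem}\n",
"            elem.clear()  # drop the element's children and text to free memory\n",
"\n",
"# In-memory excerpt standing in for a large file on disk\n",
"xml_bytes = b'''<soupdata>\n",
"  <soup><name>Baxters Favourites Cream of Tomato Soup</name><rating>2</rating></soup>\n",
"  <soup><name>Morrisons Cream of Tomato Soup</name><rating>4</rating></soup>\n",
"</soupdata>'''\n",
"\n",
"soups = list(iter_soups(io.BytesIO(xml_bytes)))\n",
"print(soups)\n",
"```\n",
"\n",
"Using a generator means the caller can process one record at a time instead of holding the whole list in memory."
]
},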
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"pd.DataFrame(soups)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.11"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Soup Title</title>
<meta name="description" content="Soup">
<meta name="author" content="Soupy Soup">
<style type="text/css">
.tg {border-collapse:collapse;border-spacing:0;border-color:#999;border-width:1px;border-style:solid;margin:0px auto;}
.tg td {font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:0px;overflow:hidden;word-break:normal;border-color:#999;color:#444;background-color:#F7FDFA;}
.tg th {font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:0px;overflow:hidden;word-break:normal;border-color:#999;color:#fff;background-color:#26ADE4;}
.tg .tg-vn4c {background-color:#D2E4FC}
</style>
</head>
<body>
<table class="tg">
<tr>
<th class="tg-031e">Soup</th>
<th class="tg-031e">Price</th>
<th class="tg-031e">Weight</th>
<th class="tg-031e">Rating</th>
<th class="tg-031e">Reviews</th>
</tr>
<tr>
<td class="tg-vn4c">Amy's Kitchen Organic Gluten Free Low Fat Chunky Tomato Soup</td>
<td class="tg-vn4c">1.7</td>
<td class="tg-vn4c">.100</td>
<td class="tg-vn4c">5</td>
<td class="tg-vn4c">4</td>
</tr>
<tr>
<td class="tg-031e">Heinz Classic Cream of Tomato Soup</td>
<td class="tg-031e">.95</td>
<td class="tg-031e">.400</td>
<td class="tg-031e">5</td>
<td class="tg-031e">14</td>
</tr>
<tr>
<td class="tg-vn4c">Baxters Favourites Cream of Tomato Soup</td>
<td class="tg-vn4c">1.15</td>
<td class="tg-vn4c">.400</td>
<td class="tg-vn4c">2</td>
<td class="tg-vn4c">1</td>
</tr>
<tr>
<td class="tg-031e">Cross &amp; Blackwell Cream of Tomato Soup</td>
<td class="tg-031e">2.00</td>
<td class="tg-031e">4 x .400</td>
<td class="tg-031e">5</td>
<td class="tg-031e">2</td>
</tr>
<tr>
<td class="tg-vn4c">Morrisons Cream of Tomato Soup</td>
<td class="tg-vn4c">.45</td>
<td class="tg-vn4c">.100</td>
<td class="tg-vn4c">4</td>
<td class="tg-vn4c">1</td>
</tr>
</table>
</body>
</html>
<?xml version="1.0" encoding="UTF-8"?>
<soupdata>
<soup>
<name>Amy's Kitchen Organic Gluten Free Low Fat Chunky Tomato Soup</name>
<price currency="GBP">1.7</price>
<weight units='g'>.100</weight>
<rating>5</rating>
<reviews>4</reviews>
</soup>
<soup>
<name>Heinz Classic Cream of Tomato Soup</name>
<price currency="GBP">.95</price>
<weight units='g'>.400</weight>
<rating>5</rating>
<reviews>14</reviews>
</soup>
<soup>
<name>Baxters Favourites Cream of Tomato Soup</name>
<price currency="GBP">1.15</price>
<weight units='g'>.400</weight>
<rating>2</rating>
<reviews>1</reviews>
</soup>
<soup>
<name>Cross &amp; Blackwell Cream of Tomato Soup</name>
<price currency="GBP">2.00</price>
<weight units='g'>4 x .400</weight>
<rating>5</rating>
<reviews>2</reviews>
</soup>
<soup>
<name>Morrisons Cream of Tomato Soup</name>
<price currency="GBP">.45</price>
<weight units='g'>.100</weight>
<rating>4</rating>
<reviews>1</reviews>
</soup>
</soupdata>