carlward/Parsing HTML and XML.ipynb

## Parsing HTML and XML.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Parsing HTML and XML"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## What is a Parser?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A parser reads input text, such as markup or source code, breaks it into its component parts, and organizes those parts into a structure (typically a tree) that a program can navigate."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# HTML"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "HTML is the markup language of the web. It defines the structure and content of a webpage. For our purposes, we are only interested in how to extract data from it programmatically."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### HTML is made up of elements\n",
    "Elements are defined by tags:\n",
    "- elements can contain content\n",
    "```HTML\n",
    "<tag>content</tag>\n",
    "```\n",
    "- elements can contain other elements\n",
    "```HTML\n",
    "<tag>\n",
    "  <sub_tag>sub tag 1</sub_tag>\n",
    "  <sub_tag>sub tag 2</sub_tag>\n",
    "</tag>\n",
    "```\n",
    "- opening tags can carry attributes, separated by spaces (not commas)\n",
    "```HTML\n",
    "<tag id=\"tag_id\" style=\"visibility: hidden;\">\n",
    "...\n",
    "</tag>\n",
    "```\n",
    "- elements either have an opening and a closing tag\n",
    "```HTML\n",
    "<tag>content</tag>\n",
    "```\n",
    "    or are self-closing\n",
    "```HTML\n",
    "<tag />\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Example HTML Code"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```HTML\n",
    "<!DOCTYPE html>\n",
    "<html lang=\"en\">\n",
    " <head>\n",
    "  <meta charset=\"utf-8\"/>\n",
    "  <title>\n",
    "   Soup Title\n",
    "  </title>\n",
    "  <meta content=\"Soup\" name=\"description\"/>\n",
    "  <meta content=\"Soupy Soup\" name=\"author\"/>\n",
    "  <style type=\"text/css\">\n",
    "   .tg  {border-collapse:collapse;border-spacing:0;border-color:#999;border-width:1px;border-style:solid;margin:0px auto;}\n",
    "  .tg td {font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:0px;overflow:hidden;word-break:normal;border-color:#999;color:#444;background-color:#F7FDFA;}\n",
    "  .tg th {font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:0px;overflow:hidden;word-break:normal;border-color:#999;color:#fff;background-color:#26ADE4;}\n",
    "  .tg .tg-vn4c {background-color:#D2E4FC}\n",
    "  </style>\n",
    " </head>\n",
    " <body>\n",
    "  <table class=\"tg\">\n",
    "   <tr>\n",
    "    <th class=\"tg-031e\">\n",
    "     Soup\n",
    "    </th>\n",
    "    <th class=\"tg-031e\">\n",
    "     Price\n",
    "    </th>\n",
    "    <th class=\"tg-031e\">\n",
    "     Weight\n",
    "    </th>\n",
    "    <th class=\"tg-031e\">\n",
    "     Rating\n",
    "    </th>\n",
    "    <th class=\"tg-031e\">\n",
    "     Reviews\n",
    "    </th>\n",
    "   </tr>\n",
    "   <tr>\n",
    "    <td class=\"tg-vn4c\">\n",
    "     Amy's Kitchen Organic Gluten Free Low Fat Chunky Tomato Soup\n",
    "    </td>\n",
    "    <td class=\"tg-vn4c\">\n",
    "     1.7\n",
    "    </td>\n",
    "    <td class=\"tg-vn4c\">\n",
    "     .100\n",
    "    </td>\n",
    "    <td class=\"tg-vn4c\">\n",
    "     5\n",
    "    </td>\n",
    "    <td class=\"tg-vn4c\">\n",
    "     4\n",
    "    </td>\n",
    "   </tr>\n",
    "   <tr>\n",
    "    <td class=\"tg-031e\">\n",
    "     Heinz Classic Cream of Tomato Soup\n",
    "    </td>\n",
    "    <td class=\"tg-031e\">\n",
    "     .95\n",
    "    </td>\n",
    "    <td class=\"tg-031e\">\n",
    "     .400\n",
    "    </td>\n",
    "    <td class=\"tg-031e\">\n",
    "     5\n",
    "    </td>\n",
    "    <td class=\"tg-031e\">\n",
    "     14\n",
    "    </td>\n",
    "   </tr>\n",
    "   <tr>\n",
    "    <td class=\"tg-vn4c\">\n",
    "     Baxters Favourites Cream of Tomato Soup\n",
    "    </td>\n",
    "    <td class=\"tg-vn4c\">\n",
    "     1.15\n",
    "    </td>\n",
    "    <td class=\"tg-vn4c\">\n",
    "     .400\n",
    "    </td>\n",
    "    <td class=\"tg-vn4c\">\n",
    "     2\n",
    "    </td>\n",
    "    <td class=\"tg-vn4c\">\n",
    "     1\n",
    "    </td>\n",
    "   </tr>\n",
    "   <tr>\n",
    "    <td class=\"tg-031e\">\n",
    "     Cross &amp; Blackwell Cream of Tomato Soup\n",
    "    </td>\n",
    "    <td class=\"tg-031e\">\n",
    "     2.00\n",
    "    </td>\n",
    "    <td class=\"tg-031e\">\n",
    "     4 x .400\n",
    "    </td>\n",
    "    <td class=\"tg-031e\">\n",
    "     5\n",
    "    </td>\n",
    "    <td class=\"tg-031e\">\n",
    "     2\n",
    "    </td>\n",
    "   </tr>\n",
    "   <tr>\n",
    "    <td class=\"tg-vn4c\">\n",
    "     Morrisons Cream of Tomato Soup\n",
    "    </td>\n",
    "    <td class=\"tg-vn4c\">\n",
    "     .45\n",
    "    </td>\n",
    "    <td class=\"tg-vn4c\">\n",
    "     .100\n",
    "    </td>\n",
    "    <td class=\"tg-vn4c\">\n",
    "     4\n",
    "    </td>\n",
    "    <td class=\"tg-vn4c\">\n",
    "     1\n",
    "    </td>\n",
    "   </tr>\n",
    "  </table>\n",
    " </body>\n",
    "</html>\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<html lang=\"en\">\n",
    "<head>\n",
    "  <meta charset=\"utf-8\">\n",
    "\n",
    "  <title>Soup Title</title>\n",
    "  <meta name=\"description\" content=\"Soup\">\n",
    "  <meta name=\"author\" content=\"Soupy Soup\">\n",
    "\n",
    "  <style type=\"text/css\">\n",
    "  .tg  {border-collapse:collapse;border-spacing:0;border-color:#999;border-width:1px;border-style:solid;margin:0px auto;}\n",
    "  .tg td {font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:0px;overflow:hidden;word-break:normal;border-color:#999;color:#444;background-color:#F7FDFA;}\n",
    "  .tg th {font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:0px;overflow:hidden;word-break:normal;border-color:#999;color:#fff;background-color:#26ADE4;}\n",
    "  .tg .tg-vn4c {background-color:#D2E4FC}\n",
    "  </style>\n",
    "\n",
    "</head>\n",
    "\n",
    "<body>\n",
    "\n",
    "<table class=\"tg\">\n",
    "  <tr>\n",
    "    <th class=\"tg-031e\">Soup</th>\n",
    "    <th class=\"tg-031e\">Price</th>\n",
    "    <th class=\"tg-031e\">Weight</th>\n",
    "    <th class=\"tg-031e\">Rating</th>\n",
    "    <th class=\"tg-031e\">Reviews</th>\n",
    "  </tr>\n",
    "  <tr>\n",
    "    <td class=\"tg-vn4c\">Amy's Kitchen Organic Gluten Free Low Fat Chunky Tomato Soup</td>\n",
    "    <td class=\"tg-vn4c\">1.7</td>\n",
    "    <td class=\"tg-vn4c\">.100</td>\n",
    "    <td class=\"tg-vn4c\">5</td>\n",
    "    <td class=\"tg-vn4c\">4</td>\n",
    "  </tr>\n",
    "  <tr>\n",
    "    <td class=\"tg-031e\">Heinz Classic Cream of Tomato Soup</td>\n",
    "    <td class=\"tg-031e\">.95</td>\n",
    "    <td class=\"tg-031e\">.400</td>\n",
    "    <td class=\"tg-031e\">5</td>\n",
    "    <td class=\"tg-031e\">14</td>\n",
    "  </tr>\n",
    "  <tr>\n",
    "    <td class=\"tg-vn4c\">Baxters Favourites Cream of Tomato Soup</td>\n",
    "    <td class=\"tg-vn4c\">1.15</td>\n",
    "    <td class=\"tg-vn4c\">.400</td>\n",
    "    <td class=\"tg-vn4c\">2</td>\n",
    "    <td class=\"tg-vn4c\">1</td>\n",
    "  </tr>\n",
    "  <tr>\n",
    "    <td class=\"tg-031e\">Cross &amp; Blackwell Cream of Tomato Soup</td>\n",
    "    <td class=\"tg-031e\">2.00</td>\n",
    "    <td class=\"tg-031e\">4 x .400</td>\n",
    "    <td class=\"tg-031e\">5</td>\n",
    "    <td class=\"tg-031e\">2</td>\n",
    "  </tr>\n",
    "  <tr>\n",
    "    <td class=\"tg-vn4c\">Morrisons Cream of Tomato Soup</td>\n",
    "    <td class=\"tg-vn4c\">.45</td>\n",
    "    <td class=\"tg-vn4c\">.100</td>\n",
    "    <td class=\"tg-vn4c\">4</td>\n",
    "    <td class=\"tg-vn4c\">1</td>\n",
    "  </tr>\n",
    "</table>\n",
    "\n",
    "\n",
    "</body>\n",
    "</html>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Parsing with Beautiful Soup"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Beautiful Soup is a library of convenience functions that, together with a parser, lets you navigate and manipulate HTML or XML. Let's parse this table of soup with Beautiful Soup."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from bs4 import BeautifulSoup\n",
    "\n",
    "with open('soup.html', 'rb') as html_file:\n",
    "    soup = BeautifulSoup(html_file, 'lxml')  # Specify that we want to use the lxml parser"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "print(soup.prettify())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "print(\"The head tag:\")\n",
    "print(soup.find('head').prettify())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
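   },
   "outputs": [],
   "source": [
    "# Different ways to select elements\n",
    "print(soup.find('head').find('title'))  # chain find() calls to narrow the search\n",
    "print(soup.find('title'))  # search the whole document\n",
    "print(soup.title.get_text())  # attribute-style access, returning only the text\n",
    "title = soup.find('title')"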
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "title.attrs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Select the table"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "tables = soup.find_all('table')\n",
    "for table in tables:\n",
    "    print(\"Attributes: {}\".format(table.attrs))\n",
    "    print(\"Number of rows: {}\".format(len(table.find_all('tr'))))  # table row tags\n",
    "    print(\"Number of cells: {}\".format(len(table.find_all('td'))))  # table data cells"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "table = soup.find('table', attrs={'class': ['tg']})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "table"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Convert the table data into standard Python data structures"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "table.find('td').name  # tag name of the first data cell ('td')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "soup_list = []\n",
    "rows = table.find_all('tr')\n",
    "header_row = rows.pop(0)\n",
    "\n",
    "# Loop through each row and extract cells\n",
    "for row in rows:\n",
    "    name, price, weight, rating, reviews = row.find_all('td')\n",
    "    soup_list.append({\n",
    "        'name': name.text,\n",
    "        'price': price.text,\n",
    "        'weight': weight.text,\n",
    "        'rating': rating.text,\n",
    "        'reviews': reviews.text \n",
    "    })\n",
    "        \n",
    "soup_list"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Convert to a Pandas DataFrame"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "df = pd.DataFrame(soup_list)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# XML"
   ]
  },
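  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "XML is a markup language widely used to store and exchange data on the internet.\n",
    "\n",
    "XML looks much like HTML, except that it has no predefined set of tags: each document defines its own tag names. XML is also stricter about syntax, so every element must be closed and properly nested."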
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Sample XML Code"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```XML\n",
    "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n",
    "<soupdata>\n",
    "  <soup>\n",
    "    <name>Amy's Kitchen Organic Gluten Free Low Fat Chunky Tomato Soup</name>\n",
    "    <price currency=\"GBP\">1.7</price>\n",
    "    <weight units='g'>.100</weight>\n",
    "    <rating>5</rating>\n",
    "    <reviews>4</reviews>\n",
    "  </soup>\n",
    "  <soup>\n",
    "    <name>Heinz Classic Cream of Tomato Soup</name>\n",
    "    <price currency=\"GBP\">.95</price>\n",
    "    <weight units='g'>.400</weight>\n",
    "    <rating>5</rating>\n",
    "    <reviews>14</reviews>\n",
    "  </soup>\n",
    "  <soup>\n",
    "    <name>Baxters Favourites Cream of Tomato Soup</name>\n",
    "    <price currency=\"GBP\">1.15</price>\n",
    "    <weight units='g'>.400</weight>\n",
    "    <rating>2</rating>\n",
    "    <reviews>1</reviews>\n",
    "  </soup>\n",
    "  <soup>\n",
    "    <name>Cross &amp; Blackwell Cream of Tomato Soup</name>\n",
    "    <price currency=\"GBP\">2.00</price>\n",
    "    <weight units='g'>4 x .400</weight>\n",
    "    <rating>5</rating>\n",
    "    <reviews>2</reviews>\n",
    "  </soup>\n",
    "  <soup>\n",
    "    <name>Morrisons Cream of Tomato Soup</name>\n",
    "    <price currency=\"GBP\">.45</price>\n",
    "    <weight units='g'>.100</weight>\n",
    "    <rating>4</rating>\n",
    "    <reviews>1</reviews>\n",
    "  </soup>\n",
    "</soupdata>\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Typical XML Parsing"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9\n",
    "\n",
    "with open('soup.xml', 'rb') as f_in:  # binary mode lets the parser apply the declared encoding\n",
    "    tree = ET.parse(f_in)\n",
    "    root = tree.getroot()\n",
    "    print(\"Tag: %s\" % root.tag)\n",
    "    print(\"Attributes: %s\" % root.attrib)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "for tag in root.findall('soup'):\n",
    "    print(tag.tag)\n",
    "    for child_tag in tag.iter():  # iter() yields the element itself, then its descendants\n",
    "        print(\"tag: {0}\\t attributes: {1}\".format(child_tag.tag, child_tag.attrib))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "with open('soup.xml', 'r') as f_in:\n",
    "    tree = ET.parse(f_in)\n",
    "    root = tree.getroot()\n",
    "    soups = []\n",
    "    for tag in root.findall('soup'):\n",
    "        soups.append({child_tag.tag: child_tag.text for child_tag in tag})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "pd.DataFrame(soups)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Parsing iteratively"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Sometimes an XML document is too big to parse at once. In these situations, we need to parse it iteratively, handling each element as the parser reaches it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9\n",
    "\n",
    "with open('soup.xml', 'rb') as f_in:\n",
    "    for event, elem in ET.iterparse(f_in, events=(\"start\", \"end\")):\n",
    "        print(\"event type: {}\\t element: {}\".format(event, elem))\n",
    "        \n",
    "        # Print some details about elements with tag 'soup'\n",
    "        if elem.tag == 'soup' and event == \"end\":\n",
    "            for field in ['name', 'price', 'weight', 'rating', 'reviews']:\n",
    "                child = elem.find(field)\n",
    "                print(\"\\t{}: {} attribs: {}\".format(field, child.text, child.attrib))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "with open('soup.xml', 'rb') as f_in:\n",
    "    soups = []\n",
    "    for event, elem in ET.iterparse(f_in, events=(\"end\",)):\n",
    "        # An \"end\" event fires once the element and all of its children are parsed\n",
    "        if elem.tag == 'soup':\n",
    "            soups.append({child.tag: child.text for child in elem})\n",
    "            elem.clear()  # release the element's children to keep memory use low"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "pd.DataFrame(soups)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
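
The iterative parsing cells that close the notebook only save memory if finished elements are released as parsing proceeds. The following is a minimal sketch of that idea, assuming Python 3. The generator name `iter_soups` is hypothetical, and a small in-memory document (`SAMPLE_XML`, a cut-down stand-in for `soup.xml`) replaces a file that would be too large to load at once.

```python
import io
import xml.etree.ElementTree as ET

# Assumption: a cut-down stand-in for soup.xml, held in memory for illustration.
SAMPLE_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<soupdata>
<soup><name>Heinz Classic Cream of Tomato Soup</name><reviews>14</reviews></soup>
<soup><name>Morrisons Cream of Tomato Soup</name><reviews>1</reviews></soup>
</soupdata>
"""


def iter_soups(source):
    """Hypothetical generator: yield one dict per <soup> element once it is complete."""
    # Only "end" events are requested: an element is fully populated
    # when its closing tag has been read.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "soup":
            yield {child.tag: child.text for child in elem}
            elem.clear()  # drop the element's children and text after use


soups = list(iter_soups(io.BytesIO(SAMPLE_XML)))
print(soups)
```

The same generator accepts an open binary file object, for example one returned by `open('soup.xml', 'rb')`, because `ET.iterparse` takes either a filename or a file object.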

## soup.html
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="utf-8">

  <title>Soup Title</title>
  <meta name="description" content="Soup">
  <meta name="author" content="Soupy Soup">

  <style type="text/css">
  .tg  {border-collapse:collapse;border-spacing:0;border-color:#999;border-width:1px;border-style:solid;margin:0px auto;}
  .tg td {font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:0px;overflow:hidden;word-break:normal;border-color:#999;color:#444;background-color:#F7FDFA;}
  .tg th {font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:0px;overflow:hidden;word-break:normal;border-color:#999;color:#fff;background-color:#26ADE4;}
  .tg .tg-vn4c {background-color:#D2E4FC}
  </style>

</head>

<body>

<table class="tg">
  <tr>
    <th class="tg-031e">Soup</th>
    <th class="tg-031e">Price</th>
    <th class="tg-031e">Weight</th>
    <th class="tg-031e">Rating</th>
    <th class="tg-031e">Reviews</th>
  </tr>
  <tr>
    <td class="tg-vn4c">Amy's Kitchen Organic Gluten Free Low Fat Chunky Tomato Soup</td>
    <td class="tg-vn4c">1.7</td>
    <td class="tg-vn4c">.100</td>
    <td class="tg-vn4c">5</td>
    <td class="tg-vn4c">4</td>
  </tr>
  <tr>
    <td class="tg-031e">Heinz Classic Cream of Tomato Soup</td>
    <td class="tg-031e">.95</td>
    <td class="tg-031e">.400</td>
    <td class="tg-031e">5</td>
    <td class="tg-031e">14</td>
  </tr>
  <tr>
    <td class="tg-vn4c">Baxters Favourites Cream of Tomato Soup</td>
    <td class="tg-vn4c">1.15</td>
    <td class="tg-vn4c">.400</td>
    <td class="tg-vn4c">2</td>
    <td class="tg-vn4c">1</td>
  </tr>
  <tr>
    <td class="tg-031e">Cross &amp; Blackwell Cream of Tomato Soup</td>
    <td class="tg-031e">2.00</td>
    <td class="tg-031e">4 x .400</td>
    <td class="tg-031e">5</td>
    <td class="tg-031e">2</td>
  </tr>
  <tr>
    <td class="tg-vn4c">Morrisons Cream of Tomato Soup</td>
    <td class="tg-vn4c">.45</td>
    <td class="tg-vn4c">.100</td>
    <td class="tg-vn4c">4</td>
    <td class="tg-vn4c">1</td>
  </tr>
</table>


</body>
</html>
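
A table like the one in `soup.html` can also be read without third-party packages. The following is a minimal sketch and not the notebook's method: it swaps Beautiful Soup for the standard library's `html.parser`, assuming Python 3. The class name `TableParser` and the trimmed two-row `SAMPLE_HTML` string are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Assumption: a trimmed copy of the soup.html table, inlined so the sketch is self-contained.
SAMPLE_HTML = """
<table class="tg">
  <tr><th>Soup</th><th>Price</th><th>Weight</th><th>Rating</th><th>Reviews</th></tr>
  <tr><td>Heinz Classic Cream of Tomato Soup</td><td>.95</td><td>.400</td><td>5</td><td>14</td></tr>
  <tr><td>Cross &amp; Blackwell Cream of Tomato Soup</td><td>2.00</td><td>4 x .400</td><td>5</td><td>2</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Hypothetical helper: collects the text of every <th>/<td> cell, row by row."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode entities such as &amp;
        self.rows = []          # one list of cell strings per <tr>
        self._in_cell = False   # True while inside a <th> or <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current cell.
        if self._in_cell:
            self.rows[-1][-1] += data


parser = TableParser()
parser.feed(SAMPLE_HTML)
parser.close()

# The first row holds the column headers; the rest are data rows.
header, *body = parser.rows
keys = [name.lower() for name in header]
soup_list = [dict(zip(keys, (cell.strip() for cell in row))) for row in body]

for soup in soup_list:
    print(soup)
```

Keys are taken from the header row, so the first column is called `soup` here rather than `name` as in the notebook. All values are still strings at this point.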

## soup.xml
<?xml version="1.0" encoding="UTF-8"?>
<soupdata>
<soup>
  <name>Amy's Kitchen Organic Gluten Free Low Fat Chunky Tomato Soup</name>
  <price currency="GBP">1.7</price>
  <weight units='g'>.100</weight>
  <rating>5</rating>
  <reviews>4</reviews>
</soup>
<soup>
  <name>Heinz Classic Cream of Tomato Soup</name>
  <price currency="GBP">.95</price>
  <weight units='g'>.400</weight>
  <rating>5</rating>
  <reviews>14</reviews>
</soup>
<soup>
  <name>Baxters Favourites Cream of Tomato Soup</name>
  <price currency="GBP">1.15</price>
  <weight units='g'>.400</weight>
  <rating>2</rating>
  <reviews>1</reviews>
</soup>
<soup>
  <name>Cross &amp; Blackwell Cream of Tomato Soup</name>
  <price currency="GBP">2.00</price>
  <weight units='g'>4 x .400</weight>
  <rating>5</rating>
  <reviews>2</reviews>
</soup>
<soup>
  <name>Morrisons Cream of Tomato Soup</name>
  <price currency="GBP">.45</price>
  <weight units='g'>.100</weight>
  <rating>4</rating>
  <reviews>1</reviews>
</soup>
</soupdata>
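
The records in `soup.xml` carry information in attributes (`currency`, `units`) as well as in element text, and every value arrives as a string. The sketch below, assuming Python 3 and a trimmed two-record copy of the file inlined as `SAMPLE_XML`, shows one possible way to keep the attributes and convert the numeric fields. The helper name `soup_to_record` and the output keys `currency` and `weight_units` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: a trimmed copy of soup.xml, inlined so the sketch runs on its own.
SAMPLE_XML = """<?xml version="1.0" encoding="UTF-8"?>
<soupdata>
<soup>
  <name>Heinz Classic Cream of Tomato Soup</name>
  <price currency="GBP">.95</price>
  <weight units='g'>.400</weight>
  <rating>5</rating>
  <reviews>14</reviews>
</soup>
<soup>
  <name>Cross &amp; Blackwell Cream of Tomato Soup</name>
  <price currency="GBP">2.00</price>
  <weight units='g'>4 x .400</weight>
  <rating>5</rating>
  <reviews>2</reviews>
</soup>
</soupdata>
"""


def soup_to_record(soup_elem):
    """Hypothetical helper: flatten one <soup> element into a dict."""
    price = soup_elem.find("price")
    weight = soup_elem.find("weight")
    return {
        "name": soup_elem.findtext("name"),
        "price": float(price.text),
        "currency": price.get("currency"),
        # Weights such as "4 x .400" are not plain numbers, so keep the raw text.
        "weight": weight.text,
        "weight_units": weight.get("units"),
        "rating": int(soup_elem.findtext("rating")),
        "reviews": int(soup_elem.findtext("reviews")),
    }


# Parse from bytes so the parser applies the encoding named in the XML declaration.
root = ET.fromstring(SAMPLE_XML.encode("utf-8"))
records = [soup_to_record(elem) for elem in root.findall("soup")]

for record in records:
    print(record)
```

Leaving `weight` as text is a deliberate choice: an entry such as `4 x .400` would need its own parsing rule before it could be treated as a number.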
	{
	"cells": [
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"# Parsing HTML and XML"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": false
	},
	"outputs": [],
	"source": [
	"import pandas as pd"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## What is a Parser?"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"A parser reads a series of instructions and breaks them into component parts, then structures them."
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"# HTML"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"HTML is the language of the internet. It is used to define the structure and content of a webpage. For our purposes we are only interested in how to programmatically extract data from it."
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"#### HTML is made up of elements\n",
	"Elements are defined by tags\n",
	"- elements can contain content \n",
	"```HTML\n",
	"<tag>content</tag>\n",
	"```\n",
	"- elements can contain other elements\n",
	"```HTML\n",
	"<tag>\n",
	" <sub_tag>sub tag 1</sub_tag>\n",
	" <sub_tag>sub tag 2</sub_tag>\n",
	"</tag>\n",
	"```\n",
	"- element tags can have attributes \n",
	"```HTML\n",
	"<tag id=\"tag_id\", style=\"visibility: hidden;\">\n",
	"...\n",
	"</tag>\n",
	"```\n",
	"- elements either have an opening and closing tag \n",
	"```HTML\n",
	"<tag>content</tag>\n",
	"```\n",
	" or are self closing\n",
	"```HTML\n",
	"<tag />\n",
	"```"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"#### Example HTML Code"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"```HTML\n",
	"<!DOCTYPE html>\n",
	"<html lang=\"en\">\n",
	" <head>\n",
	" <meta charset=\"utf-8\"/>\n",
	" <title>\n",
	" Soup Title\n",
	" </title>\n",
	" <meta content=\"Soup\" name=\"description\"/>\n",
	" <meta content=\"Soupy Soup\" name=\"author\"/>\n",
	" <style type=\"text/css\">\n",
	" .tg {border-collapse:collapse;border-spacing:0;border-color:#999;border-width:1px;border-style:solid;margin:0px auto;}\n",
	" .tg td {font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:0px;overflow:hidden;word-break:normal;border-color:#999;color:#444;background-color:#F7FDFA;}\n",
	" .tg th {font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:0px;overflow:hidden;word-break:normal;border-color:#999;color:#fff;background-color:#26ADE4;}\n",
	" .tg .tg-vn4c {background-color:#D2E4FC}\n",
	" </style>\n",
	" </head>\n",
	" <body>\n",
	" <table class=\"tg\">\n",
	" <tr>\n",
	" <th class=\"tg-031e\">\n",
	" Soup\n",
	" </th>\n",
	" <th class=\"tg-031e\">\n",
	" Price\n",
	" </th>\n",
	" <th class=\"tg-031e\">\n",
	" Weight\n",
	" </th>\n",
	" <th class=\"tg-031e\">\n",
	" Rating\n",
	" </th>\n",
	" <th class=\"tg-031e\">\n",
	" Reviews\n",
	" </th>\n",
	" </tr>\n",
	" <tr>\n",
	" <td class=\"tg-vn4c\">\n",
	" Amy's Kitchen Organic Gluten Free Low Fat Chunky Tomato Soup\n",
	" </td>\n",
	" <td class=\"tg-vn4c\">\n",
	" 1.7\n",
	" </td>\n",
	" <td class=\"tg-vn4c\">\n",
	" .100\n",
	" </td>\n",
	" <td class=\"tg-vn4c\">\n",
	" 5\n",
	" </td>\n",
	" <td class=\"tg-vn4c\">\n",
	" 4\n",
	" </td>\n",
	" </tr>\n",
	" <tr>\n",
	" <td class=\"tg-031e\">\n",
	" Heinz Classic Cream of Tomato Soup\n",
	" </td>\n",
	" <td class=\"tg-031e\">\n",
	" .95\n",
	" </td>\n",
	" <td class=\"tg-031e\">\n",
	" .400\n",
	" </td>\n",
	" <td class=\"tg-031e\">\n",
	" 5\n",
	" </td>\n",
	" <td class=\"tg-031e\">\n",
	" 14\n",
	" </td>\n",
	" </tr>\n",
	" <tr>\n",
	" <td class=\"tg-vn4c\">\n",
	" Baxters Favourites Cream of Tomato Soup\n",
	" </td>\n",
	" <td class=\"tg-vn4c\">\n",
	" 1.15\n",
	" </td>\n",
	" <td class=\"tg-vn4c\">\n",
	" .400\n",
	" </td>\n",
	" <td class=\"tg-vn4c\">\n",
	" 2\n",
	" </td>\n",
	" <td class=\"tg-vn4c\">\n",
	" 1\n",
	" </td>\n",
	" </tr>\n",
	" <tr>\n",
	" <td class=\"tg-031e\">\n",
	" Cross & Blackwell Cream of Tomato Soup\n",
	" </td>\n",
	" <td class=\"tg-031e\">\n",
	" 2.00\n",
	" </td>\n",
	" <td class=\"tg-031e\">\n",
	" 4 x .400\n",
	" </td>\n",
	" <td class=\"tg-031e\">\n",
	" 5\n",
	" </td>\n",
	" <td class=\"tg-031e\">\n",
	" 2\n",
	" </td>\n",
	" </tr>\n",
	" <tr>\n",
	" <td class=\"tg-vn4c\">\n",
	" Morrisons Cream of Tomato Soup\n",
	" </td>\n",
	" <td class=\"tg-vn4c\">\n",
	" .45\n",
	" </td>\n",
	" <td class=\"tg-vn4c\">\n",
	" .100\n",
	" </td>\n",
	" <td class=\"tg-vn4c\">\n",
	" 4\n",
	" </td>\n",
	" <td class=\"tg-vn4c\">\n",
	" 1\n",
	" </td>\n",
	" </tr>\n",
	" </table>\n",
	" </body>\n",
	"</html>\n",
	"```"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"<html lang=\"en\">\n",
	"<head>\n",
	" <meta charset=\"utf-8\">\n",
	"\n",
	" <title>The HTML5 Herald</title>\n",
	" <meta name=\"description\" content=\"Soup\">\n",
	" <meta name=\"author\" content=\"Soupy Soup\">\n",
	"\n",
	" <style type=\"text/css\">\n",
	" .tg {border-collapse:collapse;border-spacing:0;border-color:#999;border-width:1px;border-style:solid;margin:0px auto;}\n",
	" .tg td {font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:0px;overflow:hidden;word-break:normal;border-color:#999;color:#444;background-color:#F7FDFA;}\n",
	" .tg th {font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:0px;overflow:hidden;word-break:normal;border-color:#999;color:#fff;background-color:#26ADE4;}\n",
	" .tg .tg-vn4c {background-color:#D2E4FC}\n",
	" </style>\n",
	"\n",
	"</head>\n",
	"\n",
	"<body>\n",
	"\n",
	"<table class=\"tg\">\n",
	" <tr>\n",
	" <th class=\"tg-031e\">Soup</th>\n",
	" <th class=\"tg-031e\">Price</th>\n",
	" <th class=\"tg-031e\">Weight</th>\n",
	" <th class=\"tg-031e\">Rating</th>\n",
	" <th class=\"tg-031e\">Reviews</th>\n",
	" </tr>\n",
	" <tr>\n",
	" <td class=\"tg-vn4c\">Amy's Kitchen Organic Gluten Free Low Fat Chunky Tomato Soup</td>\n",
	" <td class=\"tg-vn4c\">1.7</td>\n",
	" <td class=\"tg-vn4c\">.100</td>\n",
	" <td class=\"tg-vn4c\">5</td>\n",
	" <td class=\"tg-vn4c\">4</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <td class=\"tg-031e\">Heinz Classic Cream of Tomato Soup</td>\n",
	" <td class=\"tg-031e\">.95</td>\n",
	" <td class=\"tg-031e\">.400</td>\n",
	" <td class=\"tg-031e\">5</td>\n",
	" <td class=\"tg-031e\">14</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <td class=\"tg-vn4c\">Baxters Favourites Cream of Tomato Soup</td>\n",
	" <td class=\"tg-vn4c\">1.15</td>\n",
	" <td class=\"tg-vn4c\">.400</td>\n",
	" <td class=\"tg-vn4c\">2</td>\n",
	" <td class=\"tg-vn4c\">1</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <td class=\"tg-031e\">Cross & Blackwell Cream of Tomato Soup</td>\n",
	" <td class=\"tg-031e\">2.00</td>\n",
	" <td class=\"tg-031e\">4 x .400</td>\n",
	" <td class=\"tg-031e\">5</td>\n",
	" <td class=\"tg-031e\">2</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <td class=\"tg-vn4c\">Morrisons Cream of Tomato Soup</td>\n",
	" <td class=\"tg-vn4c\">.45</td>\n",
	" <td class=\"tg-vn4c\">.100</td>\n",
	" <td class=\"tg-vn4c\">4</td>\n",
	" <td class=\"tg-vn4c\">1</td>\n",
	" </tr>\n",
	"</table>\n",
	"\n",
	"\n",
	"</body>\n",
	"</html>"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"# Parsing with Beautiful Soup"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Beautiful soup is library of convenience functions that together with a parser will allow you manipulate HMTL or XML code. Lets parse this table of soup with BeautifulSoup"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": false
	},
	"outputs": [],
	"source": [
	"from bs4 import BeautifulSoup\n",
	"\n",
	"with open('soup.html', 'rb') as html_file:\n",
	" soup = BeautifulSoup(html_file, 'lxml') # Specify that we want to use the lxml parser"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": false
	},
	"outputs": [],
	"source": [
	"print soup.prettify()"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": false
	},
	"outputs": [],
	"source": [
	"print \"The head tag:\"\n",
	"print soup.find('head').prettify()"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": false
	},
	"outputs": [],
	"source": [
	"# Different ways to select elements\n",
	"print soup.find('head').find('title')\n",
	"print soup.find('title')\n",
	"print soup.title.get_text()\n",
	"title = soup.find('title')"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": false
	},
	"outputs": [],
	"source": [
	"title.attrs"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"### Select the table"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": false
	},
	"outputs": [],
	"source": [
	"tables = soup.find_all('table')\n",
	"for table in tables:\n",
	" print \"Attributes:\", table.attrs\n",
	" print \"Number of rows:\", len(table.find_all('tr')) # table row tags\n",
	" print \"Number of cells:\", len(table.find_all('td')) # table cells "
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": false
	},
	"outputs": [],
	"source": [
	"table = soup.find('table', attrs={'class': ['tg']})"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": false
	},
	"outputs": [],
	"source": [
	"table"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## Convert the table data into standard Python data structures"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": false
	},
	"outputs": [],
	"source": [
	"price.name"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": false
	},
	"outputs": [],
	"source": [
	"soup_list = []\n",
	"rows = table.find_all('tr')\n",
	"header_row = rows.pop(0)\n",
	"\n",
	"# Loop through each row and extract cells\n",
	"for row in rows:\n",
	" name, price, weight, rating, reviews = row.find_all('td')\n",
	" soup_list.append({\n",
	" 'name': name.text,\n",
	" 'price': price.text,\n",
	" 'weight': weight.text,\n",
	" 'rating': rating.text,\n",
	" 'reviews': reviews.text \n",
	" })\n",
	" \n",
	"soup_list"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"### Convert to a Pandas DataFrame"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": false
	},
	"outputs": [],
	"source": [
	"df = pd.DataFrame(soup_list)"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": false
	},
	"outputs": [],
	"source": [
	"df"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"# XML"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
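	"XML is a markup language widely used to store and exchange data on the internet.\n",
	"\n",
	"XML looks much like HTML, except that there is no predefined set of tags: you define your own element names. It is also stricter than HTML, so every tag must be closed and special characters such as `&` must be escaped (`&amp;`)."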
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"### Sample XML Code"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"```XML\n",
	"<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n",
	"<soupdata>\n",
	" <soup>\n",
	" <name>Amy's Kitchen Organic Gluten Free Low Fat Chunky Tomato Soup</name>\n",
	" <price currency=\"GBP\">1.7</price>\n",
	" <weight units='g'>.100</weight>\n",
	" <rating>5</rating>\n",
	" <reviews>4</reviews>\n",
	" </soup>\n",
	" <soup>\n",
	" <name>Heinz Classic Cream of Tomato Soup</name>\n",
	" <price currency=\"GBP\">.95</price>\n",
	" <weight units='g'>.400</weight>\n",
	" <rating>5</rating>\n",
	" <reviews>14</reviews>\n",
	" </soup>\n",
	" <soup>\n",
	" <name>Baxters Favourites Cream of Tomato Soup</name>\n",
	" <price currency=\"GBP\">1.15</price>\n",
	" <weight units='g'>.400</weight>\n",
	" <rating>2</rating>\n",
	" <reviews>1</reviews>\n",
	" </soup>\n",
	" <soup>\n",
	" <name>Cross &amp; Blackwell Cream of Tomato Soup</name>\n",
	" <price currency=\"GBP\">2.00</price>\n",
	" <weight units='g'>4 x .400</weight>\n",
	" <rating>5</rating>\n",
	" <reviews>2</reviews>\n",
	" </soup>\n",
	" <soup>\n",
	" <name>Morrisons Cream of Tomato Soup</name>\n",
	" <price currency=\"GBP\">.45</price>\n",
	" <weight units='g'>.100</weight>\n",
	" <rating>4</rating>\n",
	" <reviews>1</reviews>\n",
	" </soup>\n",
	"</soupdata>\n",
	"```"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"### Typical XML Parsing"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": false
	},
	"outputs": [],
	"source": [
	"import xml.etree.cElementTree as ET\n",
	"\n",
	"with open('soup.xml', 'r') as f_in:\n",
	"    tree = ET.parse(f_in)\n",
	"    root = tree.getroot()\n",
	"    print \"Tag: %s\" % root.tag\n",
	"    print \"Attributes: %s\" % root.attrib"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": false
	},
	"outputs": [],
	"source": [
	"for tag in root.findall('soup'):\n",
	"    print tag.tag\n",
	"    for child_tag in tag.iter():\n",
	"        print \"tag: {0}\\t attributes: {1}\".format(child_tag.tag, child_tag.attrib)"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": true
	},
	"outputs": [],
	"source": [
	"with open('soup.xml', 'r') as f_in:\n",
	"    tree = ET.parse(f_in)\n",
	"    root = tree.getroot()\n",
	"    soups = []\n",
	"    for tag in root.findall('soup'):\n",
	"        soups.append({child_tag.tag: child_tag.text for child_tag in tag})"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": false
	},
	"outputs": [],
	"source": [
	"pd.DataFrame(soups)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## Parsing iteratively"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Sometimes an XML document is too big to parse at once. In these situations we need to parse iteratively."
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": false
	},
	"outputs": [],
	"source": [
	"import xml.etree.cElementTree as ET\n",
	"\n",
	"with open('soup.xml', 'r') as f_in:\n",
	"    for event, elem in ET.iterparse(f_in, events=(\"start\", \"end\")):\n",
	"        print \"event type: {}\\t element: {}\".format(event, elem)\n",
	"\n",
	"        # Print some details about elements with tag 'soup'\n",
	"        if elem.tag == 'soup' and event == \"end\":\n",
	"            for field in ['name', 'price', 'weight', 'rating', 'reviews']:\n",
	"                value = elem.find(field).text\n",
	"                attribs = elem.find(field).attrib\n",
	"                print \"\\t{}: {} attribs: {}\".format(field, value, attribs)"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": false
	},
	"outputs": [],
	"source": [
	"with open('soup.xml', 'r') as f_in:\n",
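	"    soups = []\n",
	"    for event, elem in ET.iterparse(f_in, events=(\"start\", \"end\")):\n",
	"        # An element is only complete once its 'end' event fires\n",
	"        if elem.tag == 'soup' and event == \"end\":\n",
	"            soups.append({child.tag: child.text for child in elem})\n",
	"            elem.clear()  # drop the children we have already read to save memory"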
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": false
	},
	"outputs": [],
	"source": [
	"pd.DataFrame(soups)"
	]
	}
	],
	"metadata": {
	"kernelspec": {
	"display_name": "Python 2",
	"language": "python",
	"name": "python2"
	},
	"language_info": {
	"codemirror_mode": {
	"name": "ipython",
	"version": 2
	},
	"file_extension": ".py",
	"mimetype": "text/x-python",
	"name": "python",
	"nbconvert_exporter": "python",
	"pygments_lexer": "ipython2",
	"version": "2.7.11"
	}
	},
	"nbformat": 4,
	"nbformat_minor": 0
	}
	<!DOCTYPE html>
	<html lang="en">
	<head>
	<meta charset="utf-8">

	<title>Soup Title</title>
	<meta name="description" content="Soup">
	<meta name="author" content="Soupy Soup">

	<style type="text/css">
	.tg {border-collapse:collapse;border-spacing:0;border-color:#999;border-width:1px;border-style:solid;margin:0px auto;}
	.tg td {font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:0px;overflow:hidden;word-break:normal;border-color:#999;color:#444;background-color:#F7FDFA;}
	.tg th {font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:0px;overflow:hidden;word-break:normal;border-color:#999;color:#fff;background-color:#26ADE4;}
	.tg .tg-vn4c {background-color:#D2E4FC}
	</style>

	</head>

	<body>

	<table class="tg">
	<tr>
	<th class="tg-031e">Soup</th>
	<th class="tg-031e">Price</th>
	<th class="tg-031e">Weight</th>
	<th class="tg-031e">Rating</th>
	<th class="tg-031e">Reviews</th>
	</tr>
	<tr>
	<td class="tg-vn4c">Amy's Kitchen Organic Gluten Free Low Fat Chunky Tomato Soup</td>
	<td class="tg-vn4c">1.7</td>
	<td class="tg-vn4c">.100</td>
	<td class="tg-vn4c">5</td>
	<td class="tg-vn4c">4</td>
	</tr>
	<tr>
	<td class="tg-031e">Heinz Classic Cream of Tomato Soup</td>
	<td class="tg-031e">.95</td>
	<td class="tg-031e">.400</td>
	<td class="tg-031e">5</td>
	<td class="tg-031e">14</td>
	</tr>
	<tr>
	<td class="tg-vn4c">Baxters Favourites Cream of Tomato Soup</td>
	<td class="tg-vn4c">1.15</td>
	<td class="tg-vn4c">.400</td>
	<td class="tg-vn4c">2</td>
	<td class="tg-vn4c">1</td>
	</tr>
	<tr>
	<td class="tg-031e">Cross & Blackwell Cream of Tomato Soup</td>
	<td class="tg-031e">2.00</td>
	<td class="tg-031e">4 x .400</td>
	<td class="tg-031e">5</td>
	<td class="tg-031e">2</td>
	</tr>
	<tr>
	<td class="tg-vn4c">Morrisons Cream of Tomato Soup</td>
	<td class="tg-vn4c">.45</td>
	<td class="tg-vn4c">.100</td>
	<td class="tg-vn4c">4</td>
	<td class="tg-vn4c">1</td>
	</tr>
	</table>


	</body>
	</html>
	<?xml version="1.0" encoding="UTF-8"?>
	<soupdata>
	<soup>
	<name>Amy's Kitchen Organic Gluten Free Low Fat Chunky Tomato Soup</name>
	<price currency="GBP">1.7</price>
	<weight units='g'>.100</weight>
	<rating>5</rating>
	<reviews>4</reviews>
	</soup>
	<soup>
	<name>Heinz Classic Cream of Tomato Soup</name>
	<price currency="GBP">.95</price>
	<weight units='g'>.400</weight>
	<rating>5</rating>
	<reviews>14</reviews>
	</soup>
	<soup>
	<name>Baxters Favourites Cream of Tomato Soup</name>
	<price currency="GBP">1.15</price>
	<weight units='g'>.400</weight>
	<rating>2</rating>
	<reviews>1</reviews>
	</soup>
	<soup>
	<name>Cross &amp; Blackwell Cream of Tomato Soup</name>
	<price currency="GBP">2.00</price>
	<weight units='g'>4 x .400</weight>
	<rating>5</rating>
	<reviews>2</reviews>
	</soup>
	<soup>
	<name>Morrisons Cream of Tomato Soup</name>
	<price currency="GBP">.45</price>
	<weight units='g'>.100</weight>
	<rating>4</rating>
	<reviews>1</reviews>
	</soup>
	</soupdata>