@jonathanmorgan
Last active January 12, 2017 11:31
Scraping_information_from_the_web
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Scraping Information from the web\n",
"\n",
"# Table of Contents\n",
"\n",
"- [Introduction](#Introduction)\n",
"\n",
" - [Polite scraping - terms of service and rate-limiting](#Polite-scraping---terms-of-service-and-rate-limiting)\n",
" \n",
"- [HTML and web pages](#HTML-and-web-pages)\n",
"\n",
" - [Basic HTML and XML syntax](#Basic-HTML-and-XML-syntax)\n",
"\n",
" - [XML](#XML)\n",
" - [HTML](#HTML)\n",
" \n",
" - [HTML and web page structure](#HTML-and-web-page-structure)\n",
" - [Useful HTML elements and attributes](#Useful-HTML-elements-and-attributes)\n",
" \n",
" - [HTML elements](#HTML-elements)\n",
" - [HTML attributes](#HTML-attributes)\n",
" \n",
" - [Dynamic web pages](#Dynamic-web-pages)\n",
" \n",
"- [Getting, parsing and interacting with HTML](#Getting,-parsing-and-interacting-with-HTML)\n",
"\n",
" - [Using Python to get HTML for a page](#Using-Python-to-get-HTML-for-a-page)\n",
" - [Using Beautiful Soup 4 to parse and interact with HTML](#Using-Beautiful-Soup-4-to-parse-and-interact-with-HTML)\n",
" \n",
" - [Parsing HTML with Beautiful Soup 4](#Parsing-HTML-with-Beautiful-Soup-4)\n",
" - [Using Beautiful Soup 4 to interact with HTML](#Using-Beautiful-Soup-4-to-interact-with-HTML)\n",
" \n",
"- [HOWTO - Common scraping tasks](#HOWTO---Common-scraping-tasks)\n",
"\n",
" - [Retrieving information from HTML web page](#Retrieving-information-from-HTML-web-page)\n",
" - [Submitting a form](#Submitting-a-form)\n",
" - [Crawling a site](#Crawling-a-site)\n",
"- [Examples](#Examples)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Introduction\n",
"\n",
"- Back to [Table of Contents](#Table-of-Contents)\n",
"\n",
"APIs are great, but lots of useful information on the Internet is not packaged so neatly as APIs package information. Useful information and information sources are embedded in HTML all over the Internet. You can use Python to scrape and crawl the world wide web, but the programs you write to do this have the potential to be significantly more complicated than API calls.\n",
"\n",
"For example:\n",
"\n",
"- HTML has been around for many years now, and it has changed significantly over that time, so you can't just write an HTML client - you have to tailor your code to each site you are interested in, and sometimes even make custom code per page you access.\n",
"- Much of the Internet is still implemented by hand, and people make mistakes, so code will sometimes not actually be valid, well-formed HTML.\n",
"- If the information you want is stored in a structured data store that consists of multiple structured pages on a web site, you might not only have to build a program to extract data from pages, you also might have to build a program that detects the structure of a site and traverses that structure.\n",
"- Many modern web pages make extensive use of Javascript and CSS to build what you see in your browser dynamically, based on your use of the site. This can be complicated to deal with when you are using some sites - it is even more so when you have to build a program to deal with it.\n",
"\n",
"In this lesson, we'll:\n",
"\n",
"- look at the basics of HTML and web pages.\n",
"- learn about a Python library named BeautifulSoup that makes it easy to interact with HTML.\n",
"- walk through some examples of scraping and crawling tasks you might need to do for research."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Polite scraping - terms of service and rate-limiting\n",
"\n",
"- Back to [Table of Contents](#Table-of-Contents)\n",
"\n",
"Before we begin, once again note that before you scrape a web site, you should read its terms of service and make sure that scraping is permitted. Some sites will not want you to scrape their content, and when a site makes that clear in its terms of service, you should respect their wishes. In addition, even if a site is OK with you scraping their pages, it is polite to spread your requests at least a few seconds apart, so that you don't put too much load on their servers."
]
},
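{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of rate-limiting (the helper name `polite_sleep` and the delay values here are illustrative assumptions, not a requirement of any particular site), you can track when you last made a request and sleep before making the next one:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# import time so we can sleep between requests\n",
"import time\n",
"\n",
"def polite_sleep( last_request_time, min_seconds = 3 ):\n",
"\n",
"    # sleep just long enough that at least min_seconds pass between requests\n",
"    elapsed = time.time() - last_request_time\n",
"    if ( elapsed < min_seconds ):\n",
"        time.sleep( min_seconds - elapsed )\n",
"    #-- END check to see if we need to wait. --#\n",
"\n",
"    return time.time()\n",
"\n",
"#-- END function polite_sleep() --#\n",
"\n",
"# usage - call polite_sleep() before each request in a crawling loop:\n",
"last_request_time = 0\n",
"for page_url in [ 'http://example.com/a', 'http://example.com/b' ]:\n",
"\n",
"    last_request_time = polite_sleep( last_request_time, min_seconds = 1 )\n",
"    # response = requests.get( page_url )\n",
"\n",
"#-- END loop over URLs --#"
]
},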
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# HTML and web pages\n",
"\n",
"- Back to [Table of Contents](#Table-of-Contents)\n",
"\n",
"## Basic HTML and XML syntax\n",
"\n",
"- Back to [Table of Contents](#Table-of-Contents)\n",
"\n",
"HTML stands for ***H***yper***T***ext ***M***arkup ***L***anguage. HTML is for the most part a dialect of XML, which means that it is element-based, where an element can contain other elements, and have attributes assigned to it. In order to understand HTML, you first need to understand XML."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### XML\n",
"\n",
"- Back to [Table of Contents](#Table-of-Contents)\n",
"\n",
"XML is a string data format that is designed to hold hierarchically structured data.\n",
"\n",
"XML documents are made up of elements. An element consists of a start tag, an end tag, and content that is between the start and end tags. Example for an element named \"status\":\n",
"\n",
" <status>Success!</status>\n",
" \n",
"Where:\n",
"\n",
"- `<status>` is the start tag of the status element.\n",
"- \"Success!\" is the content of the status element.\n",
"- `</status>` is the end tag of the status element.\n",
"- You must always close an element that you open (though HTML doesn't always obey this).\n",
"- You can make an empty element by making a single tag that contains the name of the element and ends with \" />\"\n",
"\n",
" - for example, an empty status element would look like `<status />`.\n",
"\n",
"Elements can contain any content, including one or more child elements.\n",
"\n",
"A given element's start tag can also contain attributes - name-value pairs of additional information about a given element and the elements inside it. With our status element, an example where we've added an \"updated_by\" attribute, set to \"Benny\":\n",
"\n",
" <status updated_by=\"Benny\">Success!</status>\n",
" \n",
"Where:\n",
"\n",
"- Attributes must always have a value, even if it is just an empty string ( \"\" ).\n",
"- The attribute name is never in quotation marks and never contains spaces.\n",
"- The attribute value always is in quotation marks.\n",
"- Name and value are separated by an equal sign ( \"=\" ).\n",
"\n",
"An example XML document that models a pizza order:\n",
"\n",
" <order>\n",
" <crust>original</crust>\n",
" <toppings>\n",
" <topping>cheese</topping>\n",
" <topping>pepperoni</topping>\n",
" <topping>garlic</topping>\n",
" </toppings>\n",
" <status updated_by=\"JMo\">cooking</status>\n",
" <customer>\n",
" <name>Brian</name>\n",
" <phone>573-111-1111</phone>\n",
" </customer>\n",
" </order>\n",
" \n",
"Notice that every element has both a start tag AND an end tag, and elements are properly nested - each element is closed before the element that contains it is closed. When an XML document always has matched start and end tags, properly nested like this, it is said to be ***well-formed*** XML. "
]
},
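{
"cell_type": "markdown",
"metadata": {},
"source": [
"The pizza order above can be parsed with Python's built-in `xml.etree.ElementTree` module (a quick sketch - later sections use Beautiful Soup for HTML, but for well-formed XML the standard library is all you need):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# parse the example pizza order using the standard library\n",
"import xml.etree.ElementTree as ET\n",
"\n",
"order_xml = '''<order>\n",
"    <crust>original</crust>\n",
"    <toppings>\n",
"        <topping>cheese</topping>\n",
"        <topping>pepperoni</topping>\n",
"        <topping>garlic</topping>\n",
"    </toppings>\n",
"    <status updated_by=\"JMo\">cooking</status>\n",
"    <customer>\n",
"        <name>Brian</name>\n",
"        <phone>573-111-1111</phone>\n",
"    </customer>\n",
"</order>'''\n",
"\n",
"# fromstring() returns the root element - here, <order>\n",
"order_element = ET.fromstring( order_xml )\n",
"print( order_element.tag )\n",
"\n",
"# find() retrieves the first matching child element\n",
"status_element = order_element.find( 'status' )\n",
"print( status_element.text )\n",
"\n",
"# attributes are stored in a dictionary on the element\n",
"print( status_element.attrib[ 'updated_by' ] )\n",
"\n",
"# findall() retrieves all matching descendants\n",
"for topping_element in order_element.findall( './/topping' ):\n",
"\n",
"    print( topping_element.text )\n",
"\n",
"#-- END loop over toppings --#"
]
},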
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### HTML\n",
"\n",
"- Back to [Table of Contents](#Table-of-Contents)\n",
"\n",
"XML is a very general specification. In XML, you can name elements and attributes whatever you want, and you can put whatever you want inside an element. ***HTML*** is (for the most part) a dialect of XML - a dialect of XML is a Markup Language made up of a particular set of XML elements that have rules about what attributes you can assign to them, what they should contain, and how they are used.\n",
"\n",
"HTML is made up of elements, some of which have attributes. An HTML document has a particular structure, and different elements that have specific functions and so must be used in certain ways. HTML is the language in which the World Wide Web is written, and so it is the base language that developers use to tell browsers how to display web pages.\n",
"\n",
"HTML is only an XML dialect ***\"for the most part\"*** because it doesn't always strictly conform to the XML specification. Modern HTML is mostly valid XML (and XHTML is a version of HTML that is fully XML-compliant, but not too commonly used). Older HTML can sometimes be very much invalid, however, and even inconsistent within documents. In particular, in old code, many HTML elements don't have closing tags (`<p>`, `<br>`, `<hr>`, etc.), attributes sometimes don't have values, and attribute values are often not enclosed in quotation marks (heaven help you if you have an attribute value with a space in it and no quotation marks).\n",
"\n",
"HTML consistency and validity have improved over time, especially as browsers have become more sophisticated and better at coping with invalid markup. The Internet is old, however, and sometimes old pages will be the only source of information you are interested in. When this is the case, you will need to learn to recognize and code to the HTML that contains the information you care about, which means that HTML is effectively a range of languages, not a single language.\n",
"\n",
"There are some things about HTML that are pretty standard regardless of quality or style of HTML:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## HTML and web page structure\n",
"\n",
"- Back to [Table of Contents](#Table-of-Contents)\n",
"\n",
"_Includes information from: Mozilla's Introduction to HTML: [https://developer.mozilla.org/en-US/docs/Web/Guide/HTML/Introduction](https://developer.mozilla.org/en-US/docs/Web/Guide/HTML/Introduction)._\n",
"\n",
"An HTML document has a standard structure that has been consistent since the early days of the web:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"<!DOCTYPE html>\n",
"<html lang=\"en\">\n",
" <head>\n",
" <title>A RMScreenprint</title>\n",
" </head>\n",
" <body>\n",
" <h1>Main heading in my document</h1>\n",
" <!-- Note that it is \"h\" + \"1\", not \"h\" + the letters \"one\" --> \n",
" <p>Look Ma, I am coding <abbr title=\"Hyper Text Markup Language\">HTML</abbr>.</p>\n",
" </body>\n",
"</html>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Where:\n",
"\n",
"- the document always starts with a DOCTYPE declaration that tells the browser what version of HTML the document uses. Its syntax is unusual: it has an exclamation point at the beginning, and it has no closing tag. Some common DOCTYPEs:\n",
"\n",
" - `<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.01//EN\" \"http://www.w3.org/TR/html4/strict.dtd\">`\n",
" - `<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\" \"http://www.w3.org/TR/html4/loose.dtd\">`\n",
" - `<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.01 Frameset//EN\" \"http://www.w3.org/TR/html4/frameset.dtd\">`\n",
" - `<!DOCTYPE html>`\n",
"\n",
" - HTML 5\n",
"\n",
"- An HTML document is always wrapped in an outer `<html>` element that can contain a \"lang\" attribute that specifies the language of the text in the HTML, among other attributes.\n",
"- Inside the `<html>` element there are always two child elements: `<head>`, where meta-information about the document is stored; and `<body>`, where the contents of the document are stored.\n",
"- The `<head>` element contains the `<title>` of a page. It also can contain other meta-data, and tags that tell the browser to load in Javascript (JS) and Cascading Style Sheet (CSS) files that are used by the page.\n",
"- The `<body>` element contains the actual HTML that is rendered by a browser to display a web page to you. It can contain as little or as much HTML as is needed to output a web page's data. The above page is very simple. `<body>` tags for complex web pages often contain thousands of lines of code.\n",
"- Comments are placed in a special tag structure that starts with `<!-- ` and ends with ` -->`. HTML comments can span multiple lines, and can be located anywhere in a document.\n",
"- Indentation can be very helpful for understanding the structure of a document, but it is not required, nor is it required to be consistent like it is in Python (to the detriment of HTML, I'd argue)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Useful HTML elements and attributes\n",
"\n",
"- Back to [Table of Contents](#Table-of-Contents)\n",
"\n",
"HTML is a complicated and rich markup language. An accounting of all the HTML elements is beyond the scope of this lesson. If you are interested in learning more about HTML, the following are good resources:\n",
"\n",
"- The Mozilla Developer Network HTML Developer Guide: [https://developer.mozilla.org/en-US/docs/Web/Guide/HTML](https://developer.mozilla.org/en-US/docs/Web/Guide/HTML)\n",
"- W3Schools HTML Tutorial and Reference: [http://www.w3schools.com/htmL/](http://www.w3schools.com/htmL/)\n",
"- IDocs - For old-school HTML (and a history lesson, if you are interested - this is where I learned HTML): [http://www.idocs.com/tags/](http://www.idocs.com/tags/)\n",
"\n",
"That said, the following attributes and elements tend to be particularly useful when one is scraping or crawling web pages:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### HTML elements\n",
"\n",
"- Back to [Table of Contents](#Table-of-Contents)\n",
"\n",
"It is a best practice to always make HTML element names all lower case, but you often will find older pages where the case is mixed or all caps.\n",
"\n",
"- ***`<div>`*** and ***`<span>`*** - `<div>` and `<span>` elements are the most common containers in modern web pages. They often have \"name\", \"id\" and \"class\" attributes that allow one to easily retrieve them from an HTML document.\n",
"\n",
" Example:\n",
" \n",
" <div name=\"portant_stuff\" id=\"portant_stuff\" class=\"portant_stuff\">\n",
" <!-- Place cat picture here. -->\n",
" </div>\n",
"\n",
"\n",
"- ***h***eading elements (***`<h1>`***, ***`<h2>`***, ***`<h3>`***, and so on) - Heading elements are often placed at the top of sections of a web page, sometimes just inside `<div>` tags that wrap different sections of a document. This makes them useful for targeting a parser to a specific region of an HTML document.\n",
"\n",
"\n",
"- ***`<p>`*** - ***p***aragraph tags ( `<p>` ) are another container that is generally used to contain body text, but in the past have also been used like a `<div>` or `<span>`, wrapping logical sections of a page, not just discrete paragraphs.\n",
"\n",
"\n",
"- ***`<a>`*** - ***a***nchor tags are used to create links within a web page. Any time you see a clickable link on a page, in the HTML source that link is wrapped in an anchor tag. \n",
"\n",
" Example:\n",
"\n",
" <a href=\"http://data.jrn.cas.msu.edu/sourcenet/admin\">sourcenet admin</a>\n",
" \n",
" Where:\n",
" \n",
" - the `href` attribute contains the URL of the web page to which the link should take a user when clicked. This can be:\n",
" \n",
" - a _full URL_, as above: `http://data.jrn.cas.msu.edu/sourcenet/admin`\n",
" - _absolute path_ - the path to a resource from the root of the server, without the domain (assumes that it is on the same server as the current page): `/sourcenet/admin`\n",
" - _relative path_ - the path to a resource relative to the directory in which the current page resides (also assumes same server): `./admin`, `../sourcenet/admin`, `other_page_in_directory.html`\n",
" \n",
" - The contents of the `<a>` element are the link text that is displayed and clickable in the web page.\n",
" \n",
"\n",
"- ***`<img>`*** - `<img>` tags specify an image that should appear on a web page.\n",
"\n",
" Example:\n",
" \n",
" <img src=\"http://cdn.arstechnica.net/wp-content/uploads/2015/03/A1689-zD1-640x665.jpg\" />\n",
" \n",
" Where:\n",
" \n",
" - the `src` attribute contains the URL of the image to be displayed.\n",
" - the `<img>` element is usually empty. In older HTML, it will not have the empty tag syntax (` />` at the end of the tag). There are a number of other attributes that can be assigned to an `<img>` tag.\n",
" \n",
"\n",
"- ***`<form>`***, ***`<input>`***, ***`<select>`***, and ***`<textarea>`*** elements - `<form>`, `<input>`, `<select>`, and `<textarea>` elements are used to create web forms that users can enter information into, then submit to a web page. You will often run into information that you can interact with in part on a web page through forms, but which the site does not provide in its entirety. In this scenario, you can use the contents of the elements that make up the form to reverse-engineer an API of sorts that you can use to interact with the site's underlying data. You'll still need to parse the information you want out of the HTML returned by the form submission, but it is better than having to grab all the data by hand (more details below)."
]
},
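{
"cell_type": "markdown",
"metadata": {},
"source": [
"Building on the form elements above, here is a sketch of mimicking a form submission with `requests` (the URL and field names below are hypothetical - read the real ones out of the `<form>` and `<input>` tags of the page you care about):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# import requests\n",
"import requests\n",
"\n",
"# hypothetical action URL and input names - pull the real values\n",
"# from the <form> and <input> tags in the page's HTML source.\n",
"form_url = 'http://example.com/search'\n",
"form_data = {}\n",
"form_data[ 'query' ] = 'scraping'\n",
"form_data[ 'page' ] = '1'\n",
"\n",
"# build the POST request - requests URL-encodes form_data into the body\n",
"form_request = requests.Request( 'POST', form_url, data = form_data )\n",
"prepared_request = form_request.prepare()\n",
"print( prepared_request.body )\n",
"\n",
"# to actually submit, send the prepared request through a session,\n",
"# then parse response.text just like any other page:\n",
"# response = requests.Session().send( prepared_request )"
]
},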
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### HTML attributes\n",
"\n",
"- Back to [Table of Contents](#Table-of-Contents)\n",
"\n",
"It is a best practice to always make attribute names all lower case, and to always place quotation marks around attribute values, but it is not required by the HTML specification. In older pages you might see mixed-case or all caps attribute names, and attribute values that are not enclosed in quotation marks.\n",
"\n",
"- ***`id`*** - The `id` attribute can be applied to any element, and is a great way to target specific elements within a document when it is present.\n",
"\n",
"- ***`name`*** - The `name` attribute can be applied to any element, and is another great way to target specific elements within a document when it is present.\n",
"\n",
"- ***`class`*** - The class attribute can be applied to any element and is yet another great way to target specific elements within a document when it is used."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Dynamic web pages\n",
"\n",
"- Back to [Table of Contents](#Table-of-Contents)\n",
"\n",
"More and more modern web pages are partially or fully rendered using Javascript and CSS in the viewer's web browser, rather than rendered into HTML on the server and then streamed in their entirety to a browser. If the part of a page you are interested in is rendered in this way, it can make the task of retrieving information from that part of the page programmatically substantially more difficult (perhaps even approaching impossible, though nothing is truly impossible).\n",
"\n",
"If confronted with a page that contains some in-browser rendering, you'll need to:\n",
"\n",
"- learn how to view, filter, and search inside the HTML result of any in-browser rendering so you can figure out what you need to target\n",
"\n",
"\n",
"- then, when programmatically interacting with the page, choose an HTML parsing library that can handle the complexity of the page well enough to get you access to the data you need (more on this in a bit).\n",
"\n",
"For interacting with a dynamically generated web page yourself, there are a few tools for Firefox that can be very helpful:\n",
"\n",
"- ***View selection source*** command - this command, available when you select text in a web page, then right click on the selection, will pull up just the HTML for the selected area of the web page. This doesn't give you context, but it can be extremely useful if you are trying to get an `id`, `class`, or `name` for a particular element.\n",
"\n",
"\n",
"- ***Web Developer*** toolbar - [https://addons.mozilla.org/en-US/firefox/addon/web-developer/](https://addons.mozilla.org/en-US/firefox/addon/web-developer/) - The web developer toolbar provides many tools that can be helpful in analyzing a page you want to scrape or crawl through. In particular, a few essential features:\n",
"\n",
" - In the \"View Source\" menu, \"View Generated Source\" will present for you the source of the page once any in-browser rendering is done. This is the source that you are looking at, not necessarily the HTML that was initially sent over to your browser. You can use this source to plan how you'll interact with the final result of rendering the page.\n",
"    - In the \"Forms\" menu, the command \"Display Form Details\" will show you details on each form on the page, including all the `<input>` elements for each form. This goes a long way toward helping you figure out how a form you want to try to interact with works.\n",
" - The \"Outline\" menu contains numerous tools that allow you to see more information about a part of the current web page when you hover your mouse over it.\n",
"\n",
"\n",
"- ***Firebug*** add-on - [http://getfirebug.com/](http://getfirebug.com/) - Firebug is another web developer tool that helps you better interact with the current page. Once you turn it on and enable all the Panels, Firebug lets you interact with the current page's HTML, CSS, and javascript, including making changes on the fly. It can be confusing at first, but once you get the hang of it, it is another powerful tool."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Getting, parsing and interacting with HTML\n",
"\n",
"- Back to [Table of Contents](#Table-of-Contents)\n",
"\n",
"In order to scrape or crawl a web site, you will need to do the following:\n",
"\n",
"- Make an HTTP request to the server for the resource you want to work with. If you are trying to interact with a form, this will include parameters to match the inputs of the form you are trying to interact with.\n",
"\n",
" - For a refresher on HTTP: [http://nbviewer.ipython.org/gist/jonathanmorgan/e98b308aaf3e25b55d03#HTTP](http://nbviewer.ipython.org/gist/jonathanmorgan/e98b308aaf3e25b55d03#HTTP)\n",
" - For a refresher on using the `requests` library for HTTP requests: [http://nbviewer.ipython.org/gist/jonathanmorgan/e98b308aaf3e25b55d03#HTTP-with-Python-(requests)](http://nbviewer.ipython.org/gist/jonathanmorgan/e98b308aaf3e25b55d03#HTTP-with-Python-(requests))\n",
"\n",
"\n",
"- Take the body of the HTTP response (the HTML for the page you want to scrape or crawl) and use a python library to parse the HTML into a form that is easy to search, filter, and interact with.\n",
"\n",
"\n",
"- Interact with the parsed HTML to retrieve the information you care about, then do with it what you will."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Using Python to get HTML for a page\n",
"\n",
"- Back to [Table of Contents](#Table-of-Contents)\n",
"\n",
"There are a number of different ways to retrieve HTML pages for scraping or crawling. The most common are listed below, in order from simplest to most complicated in terms of both: 1) ease of use; and 2) ability to deal with complex HTML (and eventually javascript and CSS). \n",
"\n",
"The simplest options in this list are quick and easy to use, but don't offer the same features as a full web browser (like javascript engines and cookies). As the options become more complicated, they become somewhat harder to use, but they also become much more capable, up to providing a coding interface that lets you use Python to drive an actual instance of Firefox.\n",
"\n",
"- ***`requests`*** - The relatively easy to use and simple HTTP client we've discussed in class before. Works well for most things, supports sessions and cookies, but does not support Javascript.\n",
"\n",
" - Links:\n",
" \n",
" - _requests documentation_ - [http://docs.python-requests.org/en/latest/](http://docs.python-requests.org/en/latest/)\n",
"\n",
"\n",
" Example:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# import requests\n",
"import requests\n",
"\n",
"# declare variables\n",
"request_url = \"\"\n",
"response = None\n",
"\n",
"# make a simple GET request\n",
"request_url = 'http://google.com'\n",
"response = requests.get( request_url )\n",
"\n",
"# HTML contained in the body of the response\n",
"print( \"Contents of response body = \" + response.text )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- ***`mechanize`*** - `mechanize` is a slightly more advanced HTTP client that includes more browser-like features, including \"forward()\" and \"back()\" methods, a \"click()\" method, and methods to help make it easy to populate and submit forms. This library contains code to interact with an HTML page in addition to just retrieving its HTML. It can make it much easier to build programs that crawl a web site. It also can be a little squirrely with complex or more recent web pages since it is not actively supported anymore. And, when it comes down to interacting with a web page, `BeautifulSoup` is easier to use than `mechanize`.\n",
"\n",
" - Links:\n",
" \n",
" - _Mechanize home_ - [http://wwwsearch.sourceforge.net/mechanize/](http://wwwsearch.sourceforge.net/mechanize/)\n",
"\n",
"\n",
"- ***`PhantomJS (webkit)`*** - `Webkit` is the browser engine used by Apple's OS X and iOS Safari browsers. `PhantomJS` is a headless webkit browser - a full webkit engine that loads pages and runs their javascript, but doesn't actually render or paint anything - and it is relatively easy to install and use. If you need javascript support, a headless webkit browser like `PhantomJS` is the most straightforward way to get it. It can still be a complicated proposition to get set up and working. The easiest way I've seen is to install `PhantomJS`, then use the `Selenium` browser control library to interact with `PhantomJS` (this is much more straightforward than using Selenium with other browsers...).\n",
"\n",
" - Links\n",
" \n",
" - _PhantomJS home_ - [http://phantomjs.org/](http://phantomjs.org/)\n",
" - _Instructions on using Selenium with PhantomJS_ - [http://stackoverflow.com/a/15699761](http://stackoverflow.com/a/15699761)\n",
"\n",
"\n",
"- ***`selenium`*** - Sometimes you might need to use a particular browser to make a page render as you need it to. If a headless webkit browser doesn't work for you, then your next option is to use the `Selenium` browser control package. `Selenium` allows you to write Python code that can control one of the actual browsers on your computer - Firefox, Chrome, IE, or Opera. This is a powerful method, but it also is complicated to set up, and tends to be brittle over time. If you get to this point, you might consider another option for gathering your data... =)\n",
"\n",
" - Links\n",
"\n",
" - _Selenium Home_ - [http://docs.seleniumhq.org/](http://docs.seleniumhq.org/)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Using Beautiful Soup 4 to parse and interact with HTML\n",
"\n",
"- Back to [Table of Contents](#Table-of-Contents)\n",
"\n",
"Beautiful Soup 4 is a Python package that makes it relatively easy to search, filter, and interact with HTML documents. To install it, at your computer's command prompt (NOT IPYTHON), run:\n",
"\n",
"    conda install beautifulsoup4\n",
" \n",
"OR\n",
"\n",
" pip install beautifulsoup4"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Parsing HTML with Beautiful Soup 4\n",
"\n",
"- Back to [Table of Contents](#Table-of-Contents)\n",
"\n",
"Beautiful Soup 4 lets you interact with an HTML document through the BeautifulSoup object. When you create a BeautifulSoup object, you pass the HTML you want to parse in to the constructor:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# import requests and BeautifulSoup\n",
"import requests\n",
"from bs4 import BeautifulSoup\n",
"\n",
"# declare variables\n",
"request_url = \"\"\n",
"response = None\n",
"response_html = \"\"\n",
"soup = None\n",
"\n",
"# make a simple GET request\n",
"request_url = 'http://google.com'\n",
"response = requests.get( request_url )\n",
"\n",
"# HTML contained in the body of the response\n",
"response_html = response.text\n",
"\n",
"# initialize BeautifulSoup instance\n",
"soup = BeautifulSoup( response_html )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If an HTML document is malformed or invalid (as most are in one way or another), different HTML parsers will render that HTML differently, some better than others. For example, when the built-in Python HTML parser, BeautifulSoup's default, encounters invalid HTML, it sometimes fails, resulting in large portions of the document being inaccessible because they are lumped together in a single string.\n",
"\n",
"For cases where a particular parser doesn't work for a given document, BeautifulSoup supports explicitly selecting from multiple parsers, selected by passing an optional second argument to the BeautifulSoup constructor:\n",
"\n",
"- ***html.parser*** - Python’s html.parser - in Python versions 2.7.3 and 3.2 and later, this is a reasonably fast, lenient parser that doesn't require you to install any external packages. In versions earlier than 2.7.3 and 3.2, it is not lenient, so it breaks a lot and is said by the Beautiful Soup maintainer to be unusable.\n",
"\n",
" - Usage:\n",
" \n",
" # built-in Python parser for HTML\n",
" BeautifulSoup( response_html, \"html.parser\" )\n",
"\n",
"\n",
"- ***lxml*** - fast, lenient parser that can process both HTML and XML.\n",
"\n",
" - Installation:\n",
" \n",
" conda install lxml\n",
" \n",
" OR\n",
" \n",
" pip install lxml\n",
" \n",
" - Usage:\n",
" \n",
" # lxml for HTML\n",
" BeautifulSoup( response_html, \"lxml\" )\n",
" \n",
" # lxml for XML\n",
" BeautifulSoup( response_html, [ \"lxml\", \"xml\" ] )\n",
" BeautifulSoup( response_html, \"xml\" )\n",
"\n",
"\n",
"- ***html5lib*** - extremely lenient, parses the way a web browser does, and makes valid HTML 5. But, very slow compared to lxml and Python's built-in HTML parser.\n",
"\n",
" - Installation:\n",
" \n",
" conda install html5lib\n",
" \n",
" OR\n",
" \n",
" pip install html5lib\n",
" \n",
" - Usage:\n",
" \n",
" # html5lib for HTML\n",
" BeautifulSoup( response_html, \"html5lib\" )\n",
" \n",
"- More information: [http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser](http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser)"
]
},
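{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see parser leniency in action, here is a sketch that feeds a deliberately sloppy HTML fragment (unclosed tags, an unquoted attribute value) to the built-in `html.parser`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# import BeautifulSoup\n",
"from bs4 import BeautifulSoup\n",
"\n",
"# deliberately sloppy HTML - unclosed <p> tags, unquoted attribute value\n",
"broken_html = '<div id=main><p>first<p>second</div>'\n",
"\n",
"# html.parser copes: the result is a navigable tree with quoted attributes\n",
"soup = BeautifulSoup( broken_html, 'html.parser' )\n",
"print( soup.prettify() )\n",
"\n",
"# both paragraphs are separate, findable elements...\n",
"paragraph_list = soup.find_all( 'p' )\n",
"print( len( paragraph_list ) )\n",
"\n",
"# ...and the unquoted attribute value is still accessible\n",
"print( soup.find( id = 'main' ).name )\n",
"\n",
"# try swapping in 'lxml' or 'html5lib' (if installed) and compare - those\n",
"# parsers also wrap the fragment in <html> and <body> elements."
]
},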
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Using Beautiful Soup 4 to interact with HTML\n",
"\n",
"- Back to [Table of Contents](#Table-of-Contents)\n",
"\n",
"Once you have your HTML loaded into a BeautifulSoup instance, you can use it to interact with the HTML inside the document, including looking for elements of a particular type or attribute value and pulling the text out of attributes or elements that you find."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# import requests and BeautifulSoup\n",
"import requests\n",
"from bs4 import BeautifulSoup\n",
"\n",
"# declare variables\n",
"request_url = \"\"\n",
"response = None\n",
"response_html = \"\"\n",
"soup = None\n",
"anchor = None\n",
"attributes = {}\n",
"attr_value = \"\"\n",
"text = \"\"\n",
"anchor_list = []\n",
"link = None\n",
"parent_element = None\n",
"sibling = None\n",
"\n",
"# make a simple GET request\n",
"request_url = 'http://google.com'\n",
"response = requests.get( request_url )\n",
"\n",
"# HTML contained in the body of the response\n",
"response_html = response.text\n",
"\n",
"# initialize BeautifulSoup instance\n",
"soup = BeautifulSoup( response_html )\n",
"\n",
"# pretty-print the document\n",
"#print( soup.prettify() )\n",
"\n",
"# get title element - gets first element with the given name.\n",
"print( soup.title )\n",
"\n",
"# get name of title element\n",
"print( soup.title.name )\n",
"\n",
"# get first <a> in document.\n",
"anchor = soup.a\n",
"print( anchor )\n",
"\n",
"# get list of attributes in anchor tag\n",
"attributes = anchor.attrs\n",
"print( \"- Anchor attrs: \" + str( attributes ) )\n",
"\n",
"# get href attribute\n",
"attr_value = anchor[ \"href\" ]\n",
"# OR attr_value = anchor.get( \"href\" )\n",
"print( \"- Anchor href: \" + attr_value )\n",
"\n",
"# Get text inside anchor\n",
"text = anchor.string\n",
"print( \"- Anchor text: \" + text )\n",
"\n",
"# get all anchors in document\n",
"anchor_list = soup.find_all('a')\n",
"for link in anchor_list:\n",
"\n",
" print( \"- Anchor = \" + link.get('href') )\n",
" print( \" - attributes: \" + str( link.attrs ) )\n",
" \n",
"#-- END loop over anchors. --#\n",
"\n",
"# Extract all text from page:\n",
"#print( soup.get_text() )\n",
"\n",
"# get anchor by ID\n",
"anchor = soup.find( \"a\", id=\"gb_70\")\n",
"print( \"- Anchor id=\\\"gb_70\\\": \" + str( anchor ) )\n",
"\n",
"# get parent element of anchor\n",
"parent_element = anchor.parent\n",
"print( \"- Parent of anchor = \" + parent_element.name + \": \" + str( parent_element ) )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- For more information, see: [http://www.crummy.com/software/BeautifulSoup/bs4/doc/#quick-start](http://www.crummy.com/software/BeautifulSoup/bs4/doc/#quick-start)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# HOWTO - Common scraping tasks\n",
"\n",
"- Back to [Table of Contents](#Table-of-Contents)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Retrieving information from HTML web page\n",
"\n",
"- Back to [Table of Contents](#Table-of-Contents)\n",
"\n",
"Steps to get at information in a page:\n",
"\n",
"- Load the page in a browser and look it over, identify information you want.\n",
"- Get source for page.\n",
"- Use traits of page like bits of text displayed around the area of interest to focus in on the part of the page you are interested in.\n",
"- Once you find the information you want, look in the element where it lives to see if there is an identifier (`\"id\"` or `\"name\"` attribute, for example, or sometimes a `\"class\"` attribute).\n",
"\n",
" - if yes, make sure it isn't duplicated elsewhere (do a text search in the document for it).\n",
" - if no, start to move up the hierarchy of elements from there, looking for an element that you can target, then descend from. Write down your findings as you ascend.\n",
" \n",
"- Once you figure out how to target the information, write a program to load the page, then feed the HTML into Beautiful Soup 4, then try out your targeting and see if it works."
]
},
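{
"cell_type": "markdown",
"metadata": {},
"source": [
"The steps above can be sketched in code. This is a minimal sketch under assumptions: the URL and the `\"story-headline\"` `id` are hypothetical placeholders - substitute the page you are scraping and the identifier you found while inspecting it."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# import requests and BeautifulSoup\n",
"import requests\n",
"from bs4 import BeautifulSoup\n",
"\n",
"# declare variables\n",
"request_url = \"\"\n",
"response = None\n",
"soup = None\n",
"target_element = None\n",
"\n",
"# hypothetical page - replace with the page you inspected.\n",
"request_url = \"http://www.example.com/some_article.html\"\n",
"response = requests.get( request_url )\n",
"\n",
"# parse the HTML\n",
"soup = BeautifulSoup( response.text, \"html5lib\" )\n",
"\n",
"# target the element using the identifier you found (hypothetical id here).\n",
"target_element = soup.find( id = \"story-headline\" )\n",
"\n",
"# if there was no identifier, target a known ancestor instead, then descend.\n",
"if target_element is not None:\n",
"\n",
"    print( target_element.get_text() )\n",
"\n",
"else:\n",
"\n",
"    print( \"Could not find target element - re-check your targeting.\" )\n",
"\n",
"#-- END check for target element --#"
]
},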
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Submitting a form\n",
"\n",
"- Back to [Table of Contents](#Table-of-Contents)\n",
"\n",
"Steps for submitting a web form:\n",
"\n",
"- Load the page in a browser and look it over, identify form you want to submit programmatically.\n",
"- Get source for page.\n",
"- In source, find the form. Options:\n",
"\n",
" - use \"web developer\" toolbar in Firefox, go to \"Forms\"-->\"View Form Information\".\n",
" - use the \"web developer\" toolbar --> \"View Source\" --> \"View Generated Source\", then do a text search for text associated with the form.\n",
" - Enable Firebug and all firebug panels, load the page you are interested in, then open up Firebug window, go to the HTML tab, then search for terms you are looking for.\n",
"\n",
"- Once you find the form:\n",
"\n",
" - look for the `<form>` element so you can figure out if the form is expecting a GET or a POST request, and the URL where the form should be submitted.\n",
" \n",
" Example:\n",
" \n",
" <form action=\"action_page.php\" method=\"GET\">\n",
" \n",
" Where:\n",
" \n",
" - action = URL of page where form should be submitted.\n",
" - method = type of request to make (usually will be either \"get\" or \"post\").\n",
"\n",
" - then, look for all the `<input>`s, `<textarea>`s, and `<select>`s to the form, so you can get their names and figure out what information you have to pass to the form in each parameter to get results back.\n",
" \n",
" Example:\n",
" \n",
" First name:<br>\n",
" <input type=\"text\" name=\"firstname\">\n",
" <br>\n",
" Last name:<br>\n",
" <input type=\"text\" name=\"lastname\">\n",
"\n",
"- Once you know names and values you need to pass, then build the code to submit requests to the FORM."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# import requests\n",
"import requests\n",
"\n",
"# declare variables\n",
"response = None\n",
"request_url = \"\"\n",
"parameter_dict = None\n",
"\n",
"# make a simple GET request\n",
"request_url = 'action_page.php'\n",
"\n",
"# pass parameters - names from inputs are names here, too.\n",
"parameter_dict = { 'firstname': 'value1', 'lastname': 'value2' }\n",
"\n",
"# request with parameters as well, passed appropriately for which\n",
"# ever method you call (on URL for GET, in body for POST, etc.).\n",
"response = requests.get( request_url, params = parameter_dict )\n",
"\n",
"# check the status code\n",
"print( \"Status code = \" + str( response.status_code ) )\n",
"\n",
"# Text contained in the body of the response\n",
"print( \"Contents of response body = \" + response.text )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- Once you have the body of the page, then see the step above (\"Getting at information on a page\") for guidance on how to get at the information of interest in the page."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Crawling a site\n",
"\n",
"- Back to [Table of Contents](#Table-of-Contents)\n",
"\n",
"Steps for crawling a site:\n",
"\n",
"- To walk a site, first pick a page as your starting point.\n",
"- Then, look at all the URLs on the page and see which ones you want to crawl. If there is a robots.txt, make sure to check it to make sure you aren't violating any rules for how bots are to behave on the site.\n",
"- Once you identify the URLs you want to follow, use the strategy in the section \"[Retrieving information from HTML web page](#Retrieving-information-from-HTML-web-page)\" above to figure out how you'd target the anchor tags.\n",
"- Once you can target the appropriate anchor tags, then pull each back, parse the document and then start in again with the strategies outlined in \"[Retrieving information from HTML web page](#Retrieving-information-from-HTML-web-page)\" above."
]
},
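{
"cell_type": "markdown",
"metadata": {},
"source": [
"The crawling steps above can be sketched as a short program. This is a minimal sketch, not a full crawler: `http://www.example.com/` is a hypothetical starting point, only the links on the first page are gathered, and a real crawler should also pause between requests (for example, with `time.sleep()`), per the rate-limiting guidance in the Introduction."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# import requests and BeautifulSoup\n",
"import requests\n",
"from bs4 import BeautifulSoup\n",
"\n",
"# Python 2 module names; in Python 3 use urllib.robotparser and urllib.parse.\n",
"import robotparser\n",
"from urlparse import urljoin\n",
"\n",
"# declare variables\n",
"start_url = \"\"\n",
"robots = None\n",
"response = None\n",
"soup = None\n",
"anchor_list = []\n",
"link = None\n",
"link_url = \"\"\n",
"urls_to_crawl = []\n",
"\n",
"# hypothetical starting point - replace with your site.\n",
"start_url = \"http://www.example.com/\"\n",
"\n",
"# read robots.txt so we can honor its rules.\n",
"robots = robotparser.RobotFileParser()\n",
"robots.set_url( urljoin( start_url, \"/robots.txt\" ) )\n",
"robots.read()\n",
"\n",
"# get and parse the starting page.\n",
"response = requests.get( start_url )\n",
"soup = BeautifulSoup( response.text, \"html5lib\" )\n",
"\n",
"# collect the URLs we are allowed to fetch.\n",
"anchor_list = soup.find_all( 'a' )\n",
"for link in anchor_list:\n",
"\n",
"    # resolve relative URLs against the starting page.\n",
"    link_url = link.get( 'href' )\n",
"    if link_url is not None:\n",
"\n",
"        link_url = urljoin( start_url, link_url )\n",
"        if robots.can_fetch( \"*\", link_url ) == True:\n",
"\n",
"            urls_to_crawl.append( link_url )\n",
"\n",
"        #-- END check robots.txt --#\n",
"\n",
"    #-- END check for href --#\n",
"\n",
"#-- END loop over anchors --#\n",
"\n",
"print( urls_to_crawl )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"From here, fetch each URL in `urls_to_crawl`, parse it, and start in again with the strategies in \"[Retrieving information from HTML web page](#Retrieving-information-from-HTML-web-page)\" above."
]
},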
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Examples\n",
"\n",
"- Back to [Table of Contents](#Table-of-Contents)\n",
"\n",
"Examples:\n",
"\n",
"- [http://www.gregreda.com/2013/03/03/web-scraping-101-with-python/](http://www.gregreda.com/2013/03/03/web-scraping-101-with-python/)\n",
"- [http://blog.miguelgrinberg.com/post/easy-web-scraping-with-python](http://blog.miguelgrinberg.com/post/easy-web-scraping-with-python)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.10"
}
},
"nbformat": 4,
"nbformat_minor": 0
}