Scraping the Web

Get started

What is scraping?

Every webpage contains data. Sometimes, this data proves useful in a context other than the user interface of the webpage itself. Happily, with the right tools, we can extract data from any webpage. Indeed, we might think of a webpage as a “really bad” API (credit: Forest Gregg), with which we can interact and collect information.

At DataMade, we use the lxml library to scrape webpages: lxml parses and processes HTML. It’s well-documented, expansive, and popular.

Want to read more?

The lxml documentation provides extensive discussion of the library and all it has to offer. And this guide gives a snappy overview of scraping, in general.

For now, we will cover a few basics and then consider a sophisticated scraper: Open Civic Data.

Fundamentals of lxml: A Tutorial

Keep your project organized with a virtual environment, which safely isolates the requirements of a unique Python app. We recommend using virtualenv and virtualenvwrapper. Learn about how to set up virtualenv. Then, do the following:

mkvirtualenv scraper-tutorial
pip install requests
pip install lxml

Let’s write some code. Create a new Python file (e.g., new_scraper.py), and add these imports at the top of the file:

import requests
import lxml.html
from lxml.etree import tostring

Now, identify a website that holds useful, precious, interesting data. For this tutorial, we will use the calendar of events available in Legistar from the Los Angeles Metro Transportation Authority: https://metro.legistar.com/Calendar.aspx

Get the text of the webpage, using the requests library:

entry = requests.get('https://metro.legistar.com/Calendar.aspx').text

Parse the retrieved page with the lxml library:

page = lxml.html.fromstring(entry)

The #fromstring function returns a parsed HTML element - the root of a full HTML document or of an HTML fragment, depending on the input.
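
To sanity check, you can print the type and tag of the parsed result:

print(type(page))  # <class 'lxml.html.HtmlElement'>
print(page.tag)    # 'html', since we passed in a full document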

Next, we need to locate and isolate elements within the DOM. We can do this with XPath - a query language for selecting nodes within an XML or HTML document.

Let’s first inspect the webpage in question: visit the site, then right-click and select “Inspect” (or press “cmd+opt+i” in Chrome on a Mac). That’s a messy DOM! Notice that the page contains a body (of course) and a form with several nested divs and tables.

<body>
  <form>
    <div id='ctl00_Div1'>
      <div id='ctl00_Div2'>
        <div id='ctl00_divMiddle'>
          <div id='ctl00_ContentPlaceHolder1_MultiPageCalendar'>
            <table>
              ...
              ...
</body>

We want to access a particular table, i.e., the table with information about Metro meetings. To do so, we can write something like this:

div_id = 'ctl00_ContentPlaceHolder1_divGrid'
events_table = page.xpath("//body/form//div[@id='%s']//table" % div_id)[0]

Let’s break this down.

“/” tells xpath to look for direct children: with “body/form,” xpath looks for instances of forms that sit directly within the body, rather than forms nested further within the DOM.

On the other hand, “//” tells xpath to look for all descendants: with “//table,” xpath looks for all table descendants of the specified div.

How do we specify a div? @id allows us to search for a div with a unique identifier. It prevents the hassle of iterating over handfuls of nested divs, which - in this example - could get rather sticky.

Xpath returns a list. In this example, we know that we want the first element in this list - so we conclude the above code snippet by grabbing the zeroth index.
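
Since page layouts can change, a slightly more defensive version of the snippet above checks for an empty result before indexing:

matches = page.xpath("//body/form//div[@id='%s']//table" % div_id)
if not matches:
    raise ValueError('Could not find the events table - did the page change?')
events_table = matches[0]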

Want to see what’s inside the element? You can print your table to the console. Call the #tostring function, and check out the results.

print(tostring(events_table))

We want to get the content inside the table cells. First, get all of the table rows. Then, iterate over each row, grab its tds, and iterate over those tds - isolating the content of each individual cell.

table_rows = events_table.xpath(".//tr")

for row in table_rows:
    tds = row.xpath("./td")
    for td in tds:
        print(td.text_content())

You can do whatever you like with the table cell content. Save it in a list! Create a dict with the headers! Load it into a database! Or simply print it, and enjoy your mastery over the DOM.
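
For instance, here is a minimal sketch of the dict option, assuming the first row of the table holds th header cells (inspect the page to confirm):

headers = [th.text_content().strip() for th in table_rows[0].xpath(".//th")]

events = []
for row in table_rows[1:]:
    cells = [td.text_content().strip() for td in row.xpath("./td")]
    # only keep rows whose shape matches the header row
    if len(cells) == len(headers):
        events.append(dict(zip(headers, cells)))

print(events)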

Open Civic Data

The Open Civic Data (OCD) scrapers feed Councilmatic. They collect data from the Legistar site and its API, then load that data into the OCD API. Subsequently, the Councilmatic apps import this information into a database, which ultimately gets displayed on the frontend of the Councilmatic webpages. Webpage-to-webpage. Sea-to-shining-sea.

Let’s consider how information about government events (e.g., committee meetings, meetings of the board of directors) gets scraped from Legistar and transplanted to Councilmatic. Open Civic Data uses Pupa, a scraping framework for collecting and organizing civic data. Pupa has a “Scrape” module, which contains classes for bills, events, and people. The Pupa Event class looks like this: https://github.com/opencivicdata/pupa/blob/master/pupa/scrape/event.py#L62
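
As a loose illustration, creating an Event might look something like this - the example values are made up, and the exact signature has changed across Pupa versions, so check the linked event.py:

from pupa.scrape import Event

# hypothetical meeting details, for illustration only
event = Event(name='Regular Board Meeting',
              start_date='2017-03-29',
              location_name='One Gateway Plaza, Los Angeles, CA')
event.add_source('https://metro.legistar.com/Calendar.aspx')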

Pupa also has a base Scraper class, upon which Open Civic Data builds its own Scraper: https://github.com/opencivicdata/python-legistar-scraper/blob/master/legistar/base.py

The OCD base Scraper acts as the parent of its two event scrapers - one for the Legistar site, and one for the Legistar API: https://github.com/opencivicdata/python-legistar-scraper/blob/master/legistar/events.py

These scrapers get instantiated per city. For example, for LA Metro, we create a new instance of the site scraper, like so:

web_scraper = LegistarEventsScraper(None, None)

When calling #events on this class instance, the Scraper parses the events page, grabs the events table, and looks at each row. For each row, it returns a dict, wherein the table headers act as the keys and the table content acts as the values. Select data from this dict eventually gets stored inside an instance of an Event, which gets loaded into a database.
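
Sketched loosely, consuming #events might look like this - the EVENTSPAGE attribute and the dict keys are assumptions on our part, so consult the linked events.py for the real interface:

web_scraper = LegistarEventsScraper(None, None)
web_scraper.EVENTSPAGE = 'https://metro.legistar.com/Calendar.aspx'  # assumed attribute

for event in web_scraper.events():
    # each event is a dict keyed by the table headers,
    # e.g. 'Name', 'Meeting Date', 'Meeting Location'
    print(event)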
