
@econchick
Created March 3, 2013 18:56

This tutorial will walk you through how to make a web scraper, save the data to a database, and schedule the scraper to run daily. We will also introduce you to some simple queries to use on the database so you can query the information you scraped at your leisure.

The Project

We will build a web scraper using Scrapy to crawl through LivingSocial and save local deals to our Postgres database. We'll schedule our computer to run this script daily, so that rather than getting those annoying emails, we can query our database whenever we want a deal on skydiving or hot yoga.

Goals

In this tutorial, you will learn how to:

  • Build a web scraper with Scrapy
  • Save the scraped data to a Postgres database using SQLAlchemy
  • Schedule the scraper to run daily with a cron job
  • Query the stored data whenever you want a deal

About Web Scrapers

Web scraping is a technique for gathering data or information from web pages. You could revisit your favorite web site every time it updates with new information. Or you could write a web scraper to do it for you!

A scraper is just a script that parses an HTML site - much like the parser we wrote for our CSV data in our [DataViz]({{ get_url('dataviz/')}}) tutorial.
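
To give a taste of what that looks like outside of any framework, here is a minimal sketch using the requests and lxml libraries. The URL and the XPath expression are just placeholders for illustration:

```python
import requests
from lxml import html

# Fetch a page and parse the HTML into a tree we can query.
# The URL is just a placeholder.
page = requests.get("http://example.com")
tree = html.fromstring(page.content)

# XPath lets us pull pieces out of the page - here, all link text -
# much like pulling fields out of a CSV row.
print(tree.xpath("//a/text()"))
```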

About Scrapy

Scrapy is one of the most popular web scraping frameworks written in Python. It is built on Twisted, a Python networking engine that lets it fetch many pages asynchronously, and lxml, a Python XML and HTML parser.

Note for the curious: the lxml library builds on C libraries for parsing, which gives it its speed. This is why we needed to install a compiler.

Scrapy also has a great tutorial, which this one follows closely but extends with the use of Postgres and a cron job.
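
To give a feel for the framework before we start the setup, here is a minimal sketch of a spider using the current scrapy.Spider API. The spider name, URL, and CSS selectors are placeholders; real selectors always depend on the site's actual markup, which you inspect in the browser:

```python
import scrapy


class DealsSpider(scrapy.Spider):
    name = "livingsocial"
    # Placeholder URL; a real spider would start from an actual city page.
    start_urls = ["https://www.livingsocial.com/cities/15-san-francisco"]

    def parse(self, response):
        # The CSS selectors below are illustrative only.
        for deal in response.css("li.deal"):
            yield {
                "title": deal.css("span.title::text").get(),
                "price": deal.css("span.price::text").get(),
            }
```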

About SQLAlchemy

SQLAlchemy is a Python library that allows developers to interact with databases (Postgres, MySQL, MS SQL, etc.) without needing to write raw SQL code within a database shell. It also provides an ORM (Object Relational Mapper) that abstracts the database further: you can define tables and fields with classes and instance variables instead. If you have worked through the Django tutorial, perhaps you remember its ORM from setting up models.py and querying data within the Django dbshell.
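
As a rough sketch of what the ORM looks like (the Deal table, its columns, and the connection URL are purely illustrative, and the imports assume SQLAlchemy 1.4 or newer):

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class Deal(Base):
    """A database table defined as a plain Python class - no raw SQL."""
    __tablename__ = "deals"

    id = Column(Integer, primary_key=True)
    title = Column(String)
    price = Column(String)


# Placeholder connection URL; adjust user, password, and database name.
engine = create_engine("postgresql://user:password@localhost/scrape")
Base.metadata.create_all(engine)  # issues the CREATE TABLE statements for us

with Session(engine) as session:
    session.add(Deal(title="Hot yoga class", price="$20"))
    session.commit()
    # Querying reads like Python too, not SQL:
    for deal in session.query(Deal).filter(Deal.title.contains("yoga")):
        print(deal.title, deal.price)
```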

About Postgres

Postgres is a very popular object-relational database that is free and open source. Other popular databases include MySQL, MS SQL, and MongoDB. Which database you choose depends on what you'll need it for.

To learn why Postgres is so great, Craig Kerstiens of Heroku wrote up a nice explanation.

As an aside: when I first started learning how to code, the concept of having a database on my own computer blew me away. I assumed databases lived on headless machines that could handle huge amounts of data. It turns out a database is just another program on your computer. Sure, if you're a company, you'd want machines dedicated to serving production-level data. But we're not doing heavy-duty number crunching (yet!).

What is a cronjob?

Cron is a job scheduler for Unix-like systems. It bases its schedule on a crontab (cron table), where each line of the table is a job (a cron job).

You can schedule a cron job to run every minute, every hour, every day, etc. Wikipedia has an overview of the actual syntax to use for a cron job.
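
For example, a crontab line that runs our scraper every day at 8:00 AM might look like the sketch below (the project path is hypothetical, and the spider name matches the earlier sketch). You edit your crontab with `crontab -e`:

```
# min hour day month weekday  command
0 8 * * * cd /path/to/scraper && scrapy crawl livingsocial
```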

If you're on a Windows machine, the cron equivalent is the Windows Task Scheduler. Configuring it is beyond the scope of this tutorial, but you can read up on how to do so, or use the schtasks tool.

[Move onto the Setup →]({{ get_url("/Part-1-Scraper-Setup/")}})

@alecxe commented Mar 3, 2013

A few typos:

  • MySql -> MySQL
  • datbase -> database

Also, I'd note that Postgres is an object-relational DBMS. And maybe you should note why Twisted is used in Scrapy, and what for.
