@bmschmidt
Created November 7, 2013 19:22

Make sure it works

Download one file by hand. See if you can get NLTK running on it to extract places and dates, and check that it works!

Downloading files

Download each HTML file: write a Python script that fetches one file at a time.

Use the urllib2 library, which can connect to the web, to read each file (in Python 3 the same functionality lives in urllib.request). Use the built-in open command to open an output file, for example open("output.html","w"), and write to it from inside Python. Some examples are here.
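The download step might look roughly like the sketch below. It assumes Python 3, so it uses urllib.request rather than urllib2. The function names download_page and download_all and the page_000.html naming scheme are made up for illustration; the list of URLs is whatever you collect yourself.

```python
import os
import urllib.request


def download_page(url, out_path):
    """Fetch a single URL and write the raw bytes to out_path."""
    with urllib.request.urlopen(url) as response:
        data = response.read()
    # "wb" because urlopen returns bytes; decoding can wait until parsing.
    with open(out_path, "wb") as out:
        out.write(data)
    return len(data)


def download_all(urls, out_dir="."):
    """Download each URL in turn, one at a time, into numbered files.

    The page_000.html naming scheme is an assumption for this sketch.
    """
    paths = []
    for i, url in enumerate(urls):
        out_path = os.path.join(out_dir, "page_%03d.html" % i)
        download_page(url, out_path)
        paths.append(out_path)
    return paths
```

Calling download_page(some_url, "output.html") for a single page is the quickest way to check the approach before looping over the full list.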

Once you have files

Automatically extract place names.

  1. Named entity extraction: use NLTK, the Natural Language Toolkit (http://nltk.org/). Install nltk, then run Python on the downloaded HTML files to search for places and dates (probably possible?).

Run through the files with nltk and write out lists of matched names and places.
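One possible shape for this step is sketched below, assuming Python 3 with nltk installed and its tokenizer, tagger, and named-entity chunker models already fetched through nltk.download (the exact resource names vary between NLTK versions). The helper names html_to_text, extract_places, and write_places are hypothetical. The sketch only collects place-like entities (the chunker's GPE and LOCATION labels); how well the default chunker picks up dates is untested here, so dates may need a different approach.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def html_to_text(html):
    """Strip tags from an HTML string and collapse runs of whitespace."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def extract_places(text):
    """Return place-like named entities found by NLTK's default chunker."""
    import nltk  # third-party; imported here so html_to_text works without it

    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    tree = nltk.ne_chunk(tagged)
    places = []
    for subtree in tree.subtrees():
        # GPE = geo-political entity (cities, countries); LOCATION = other places.
        if subtree.label() in ("GPE", "LOCATION"):
            places.append(" ".join(word for word, tag in subtree.leaves()))
    return places


def write_places(html_paths, out_path):
    """Write one 'filename<TAB>place' line per match across all files."""
    with open(out_path, "w", encoding="utf-8") as out:
        for path in html_paths:
            with open(path, encoding="utf-8", errors="replace") as f:
                text = html_to_text(f.read())
            for place in extract_places(text):
                out.write("%s\t%s\n" % (path, place))
```

Stripping the markup first matters because the tokenizer and tagger expect running prose; feeding them raw HTML tends to produce tag names as spurious entities.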

Check and see if it works!
