bmschmidt/gist:7360386

## gistfile1.md

      
Make sure it works

Download one by hand.
See if you can get NLTK (the Natural Language Toolkit) running to extract places and dates from it. And see if it works!
Downloading files

Download each HTML file: write a Python script that fetches one at a time.
Use the urllib2 library, which can connect to the web, to read the files (in Python 3 the same functionality lives in urllib.request).
Use the open command to create output files, e.g. open("output.html","w"), and write the downloaded contents out to them from inside Python.
Some examples are here.
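
As a minimal sketch of the download step, assuming Python 3 (where urllib2 has become urllib.request): the function name `download_page` and the example URL are placeholders, not part of the original notes.

```python
import urllib.request


def download_page(url, out_path):
    """Fetch one URL and save the raw bytes to out_path.

    Returns the number of bytes written.
    """
    # urlopen connects to the web and gives back a file-like response
    with urllib.request.urlopen(url) as response:
        data = response.read()
    # "wb" writes the bytes exactly as received, without guessing an encoding
    with open(out_path, "wb") as out_file:
        out_file.write(data)
    return len(data)


# Example call (placeholder URL):
# download_page("http://example.com/", "output.html")
```

To fetch many pages, call the function once per URL in a loop, with a different output file name each time.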
Once you have files

Automatically extract place names.

Named entity extraction: NLTK, the Natural Language Toolkit. (NLTK is not a Stanford project; the Stanford Named Entity Recognizer is a separate tool.)
Install nltk, then run Python on the downloaded HTML files to search for places and dates (probably possible?).
http://nltk.org/

Run through the files with nltk and write out lists of matched names and places.
Check and see if it works!
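
The extraction step might look roughly like the sketch below. It assumes Python 3, an installed nltk, and NLTK's tokenizer, tagger and chunker data already fetched with `nltk.download` (the resource names vary by NLTK version). All function names here are made up for illustration. It handles places only: whether the default chunker labels dates reliably is something to check by hand.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text between tags, ignoring the tags themselves."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    """Return the text content of an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def collect_places(chunked, wanted=("GPE", "LOCATION")):
    """Pull place names out of one sentence chunked by nltk.ne_chunk.

    Named entities are subtrees with a label; ordinary words are plain
    (word, tag) tuples and are skipped.
    """
    places = []
    for node in chunked:
        if hasattr(node, "label") and node.label() in wanted:
            places.append(" ".join(word for word, tag in node.leaves()))
    return places


def extract_places(text):
    """Run NLTK's default tokenizer, tagger and chunker over plain text."""
    import nltk  # imported here so the helpers above work without NLTK

    places = []
    for sentence in nltk.sent_tokenize(text):
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
        places.extend(collect_places(nltk.ne_chunk(tagged)))
    return places


def write_places(html_path, out_path):
    """Read one downloaded HTML file and write its places, one per line."""
    with open(html_path, encoding="utf-8", errors="replace") as html_file:
        text = html_to_text(html_file.read())
    with open(out_path, "w", encoding="utf-8") as out_file:
        for place in extract_places(text):
            out_file.write(place + "\n")
```

Calling `write_places("output.html", "places.txt")` on one file first, and reading the result by eye, is a quick way to check whether the matches are good enough before running it over every file.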