@skillachie
Last active September 30, 2015 20:52
Collect news articles related to specific events
from news_corpus_builder import NewsCorpusGenerator

# Location where the news articles will be saved
corpus_dir = '/home/skillachie/Development/event_articles'
category_total = 300

extractor = NewsCorpusGenerator(corpus_dir, 'file')

def get_links(terms, category):
    category_articles = []
    # Split the per-category article budget evenly across the search terms
    article_count = int(category_total / len(terms))
    for term in terms:
        category_articles.extend(
            extractor.google_news_search(term, category, article_count))
    return category_articles

# Hurricane search terms. Can be multiple
storm_terms = ['Hurricane Joaquin']

# 'Storms' is the category assigned to this event. Articles will be saved
# in a Storms folder on your filesystem
article_links = get_links(storm_terms, 'Storms')

# Extract content & create the corpus
print("Total %d links to extract" % len(article_links))
extractor.generate_corpus(article_links)
print(extractor.get_stats())
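The per-term budgeting inside get_links can be illustrated on its own: the overall per-category target is divided evenly across however many search terms you supply. The sketch below is standalone and does not require news_corpus_builder; the name split_budget is illustrative, not part of the library's API.

```python
# Standalone sketch of the budget split used in get_links():
# category_total articles are spread evenly across the search terms.
def split_budget(category_total, terms):
    """Return how many articles to request per search term."""
    per_term = category_total // len(terms)  # integer division, truncates
    return {term: per_term for term in terms}

if __name__ == '__main__':
    counts = split_budget(300, ['Hurricane Joaquin', 'Tropical Storm Joaquin'])
    print(counts)
```

Note that integer division truncates, so with a term count that does not divide category_total evenly (e.g. 300 across 7 terms) the requests will sum to slightly less than the target.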