@jgoodall
Created March 7, 2014 02:11
This script downloads web pages and writes the title, URL, and text of each page to its own file.

Usage:

  1. Create a file named articles containing a newline-separated list of URLs to download
  2. Run python extract.py from the directory that contains the articles file

Prerequisite: newspaper is used for text extraction. Install: pip install newspaper (on Python 3, the package is published as newspaper3k, so use pip install newspaper3k)
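
The articles file from step 1 can be read a little more defensively than a plain line split. A minimal sketch, assuming blank lines and stray whitespace should be ignored; the helper name load_urls is hypothetical and not part of the original script:

```python
def load_urls(path):
    """Return the non-empty lines of `path`, stripped of surrounding whitespace."""
    # load_urls is an illustrative helper, not part of the original gist.
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]
```

Calling load_urls("articles") would then yield one URL per non-blank line, so a trailing empty line in the file does not produce an empty URL.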

http://krebsonsecurity.com/2013/09/adobe-microsoft-push-critical-security-fixes-2/
http://technet.microsoft.com/en-us/security/bulletin/ms13-sep
http://blogs.computerworld.com/windows/22809/its-raining-updates-microsofts-september-patch-tuesday
http://krebsonsecurity.com/2013/10/adobe-microsoft-push-critical-security-fixes-3/
http://technet.microsoft.com/en-us/security/bulletin/ms13-oct
http://blogs.computerworld.com/windows/22956/urgent-fixes-patch-tuesdays-10th-anniversary
http://krebsonsecurity.com/2013/11/zero-days-rule-novembers-patch-tuesday/
http://technet.microsoft.com/en-us/security/bulletin/ms13-nov
http://blogs.computerworld.com/windows/23147/november-patch-tuesday-light-critical-updates
http://krebsonsecurity.com/2013/12/zero-day-fixes-from-adobe-microsoft/
http://technet.microsoft.com/en-us/security/bulletin/ms13-dec
http://blogs.computerworld.com/windows/23267/thats-wrap-one-more-urgent-fix-last-patch-tuesday-2013
http://krebsonsecurity.com/2014/01/security-updates-for-windows-flash-reader/
http://technet.microsoft.com/en-us/security/bulletin/ms14-jan
http://blogs.computerworld.com/windows/23398/microsoft-patch-tuesday-january-easy-start-year
http://krebsonsecurity.com/2014/02/security-updates-for-shockwave-windows/
http://technet.microsoft.com/en-us/security/bulletin/ms14-feb
http://blogs.computerworld.com/windows/23530/microsoft-patch-tuesday-february-date-destiny
import codecs

from newspaper import Article

# File with one URL per line.
articleFile = "articles"

# Read the URLs, skipping blank lines.
with open(articleFile) as urlFile:
    urls = [line.strip() for line in urlFile if line.strip()]

for url in urls:
    article = Article(url=url, memoize_articles=False)
    article.download()
    article.parse()

    # Build the output file name from the URL without its scheme,
    # e.g. "http://host/a/b" becomes "host-a-b.txt".
    u = url.split("//")[1]
    name = u.replace("/", "-") + ".txt"
    print("Got article " + article.url)

    # Write the title, URL, and extracted text as UTF-8.
    with codecs.open(name, "w", "utf-8") as f:
        f.write("Title:\n" + article.title + "\n\n")
        f.write("URL:\n" + article.url + "\n\n")
        f.write("Text:\n" + article.text + "\n\n")
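
The script derives each output file name by dropping the URL scheme and replacing slashes with hyphens, which leaves a trailing hyphen for URLs that end in a slash and passes characters such as ? through unchanged. A minimal sketch of a stricter variant follows; the function name url_to_filename and the choice of allowed characters are assumptions for illustration, not part of the original script:

```python
import re


def url_to_filename(url):
    """Turn a URL into a flat .txt file name (illustrative helper)."""
    # Drop the scheme ("http://"), if present.
    rest = url.split("//", 1)[-1]
    # Trim leading and trailing slashes so the name does not end in "-".
    rest = rest.strip("/")
    # Replace every run of characters other than letters, digits,
    # dots, and hyphens with a single hyphen.
    return re.sub(r"[^A-Za-z0-9.-]+", "-", rest) + ".txt"
```

With this variant, the first URL in the articles list would map to krebsonsecurity.com-2013-09-adobe-microsoft-push-critical-security-fixes-2.txt.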