@jackbandy
Last active May 13, 2020 16:58
'''
headline_scraper.py
A simple scrapy spider to collect web page titles
'''
import scrapy
from pandas import read_csv
from readability.readability import Document

# CSV of sample URLs; the spider reads its `url` column
PATH_TO_DATA = 'https://gist.githubusercontent.com/jackbandy/208028b404d8c6a6f822397e306a5a34/raw/ef7f73357e77c29c63b5b7632d840a923327e179/100_urls_sample.csv'


class HeadlineSpider(scrapy.Spider):
    name = "headline_spider"
    # The CSV is downloaded once, when the class body is evaluated
    start_urls = read_csv(PATH_TO_DATA).url.tolist()

    def parse(self, response):
        # readability extracts the page title from the raw HTML
        doc = Document(response.text)
        yield {
            'short_title': doc.short_title(),
            'full_title': doc.title(),
            'url': response.url
        }
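
The spider builds `start_urls` straight from the `url` column of the CSV, so blank cells, duplicates, or non-HTTP entries would be passed to Scrapy unchanged. The following is a minimal sketch of one way to clean that list first, using only the standard library. The helper name `load_urls` and the sample rows are hypothetical and are not part of the original gist; only the `url` column name comes from the script above.

```python
import csv
import io


def load_urls(csv_text, column="url"):
    """Return unique http(s) URLs from CSV text, in first-seen order.

    Hypothetical helper: `column` defaults to "url", the column name
    the spider reads from its CSV.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    seen = set()
    urls = []
    for row in reader:
        # A missing or empty cell becomes an empty string.
        url = (row.get(column) or "").strip()
        # Skip blanks, non-HTTP entries, and repeats.
        if not url.startswith(("http://", "https://")) or url in seen:
            continue
        seen.add(url)
        urls.append(url)
    return urls


# Made-up sample data, for illustration only.
sample = (
    "url\n"
    "https://example.com/a\n"
    "\n"
    "https://example.com/a\n"
    "ftp://example.com/b\n"
    "https://example.com/c\n"
)
print(load_urls(sample))  # ['https://example.com/a', 'https://example.com/c']
```

Assuming the CSV text has already been fetched, the result could be assigned to `start_urls` in place of the `read_csv(...)` call. Whether that filtering is wanted depends on the data, so treat it as an optional variation rather than a fix.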