
@raphapassini
Created May 17, 2016 18:53
Extract news from BBC given a set of keywords
#!/usr/bin/python
# -*- coding: utf-8 -*-
import scrapy
from scrapy import Request
from scrapy.linkextractors import LinkExtractor

from keywords import ocean_keywords as keywords


class BBCSpider(scrapy.Spider):
    name = 'bbcnews'
    allowed_domains = ['bbc.co.uk']
    start_urls = [
        "http://www.bbc.co.uk",
    ]

    def parse(self, response):
        # BBC story URLs end in a numeric id, e.g. .../world-36310000
        news_le = LinkExtractor(allow=r'-\d+$')
        links = news_le.extract_links(response)
        for link in links:
            yield Request(link.url, callback=self.parse_story)

    def parse_story(self, response):
        # the <article> element seems to always hold the story content
        text = ''.join(response.css('article p::text').extract())
        # build a set of lowercase words found inside the text
        all_words = set(t.lower() for t in text.split(' '))
        tags = all_words.intersection(keywords)
        return {
            'url': response.url,
            'headline': response.xpath("//title/text()").extract_first(),
            'body': text,
            'tags': list(tags),
        }
@raphapassini (Author) commented:
Instead of this in your original code:

    container = response.css("div.container")
    urls = container.css("a ::attr(href)").extract()

    # prefix the site's own links with the domain
    for url in urls:
        if url[0] == '/':
            next_url = start_urls[0] + url
        else:
            next_url = url

I'm using LinkExtractors (http://doc.scrapy.org/en/latest/topics/link-extractors.html).
If you ever need to join URLs, use response.urljoin('path/to/something') instead.
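Scrapy's response.urljoin is essentially the standard library's urljoin applied with response.url as the base (HtmlResponse also honours a page's &lt;base&gt; tag), so the joining behaviour can be sketched with the stdlib alone — the base URL here is a hypothetical response.url:

```python
from urllib.parse import urljoin

# hypothetical response.url of a crawled page
base = "http://www.bbc.co.uk/news"

# relative paths are resolved against the base...
print(urljoin(base, "/sport/football-36310000"))
# ...while absolute URLs pass through unchanged
print(urljoin(base, "http://example.com/story"))
```

This is why urljoin is safer than prepending start_urls[0] by hand: it handles absolute links, root-relative links, and relative links uniformly.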

Instead of this in your original code:

    tags = []
    for keyword in keywords:
        try:
            if keyword in text:
                tags.append(keyword)
        except:
            if str(keyword) in text:
                tags.append(keyword)

I'm using Python's set data structure. Set membership tests and intersections are way faster than scanning the text once per keyword, and that matters in a crawler project :)
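A minimal sketch of the set-based tagging — the keyword set and text here are made up for illustration:

```python
keywords = {"ocean", "coral", "reef"}  # hypothetical keyword set
text = "Scientists warn the Ocean and its coral are warming fast"

# lowercase every word once, then intersect; set lookups are O(1)
# on average, so this beats scanning the text once per keyword
all_words = set(word.lower() for word in text.split(' '))
tags = all_words.intersection(keywords)
print(sorted(tags))  # ['coral', 'ocean']
```

One pass over the words plus one intersection replaces the keyword-by-keyword substring search of the original loop.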

I'm sure you'll have to improve this line a lot: text = ''.join(response.css('article p::text').extract()) — both to fulfil your requirements and to make sure all of the story content actually gets extracted.

Also, instead of using the print function you should use self.logger.info/debug/error/exception. Please refer to the Python logging package; Scrapy uses it under the hood - http://doc.scrapy.org/en/latest/topics/logging.html
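Scrapy names each spider's logger after the spider, so inside the spider you just call self.logger; outside Scrapy the same idea looks like this plain-logging sketch:

```python
import logging

# Scrapy's self.logger inside BBCSpider is logging.getLogger('bbcnews')
logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger('bbcnews')

logger.info('parsed %d links', 42)         # instead of print(...)
logger.debug('raw body length: %d', 1024)  # silenced unless DEBUG is on
try:
    1 / 0
except ZeroDivisionError:
    logger.exception('extraction failed')  # logs the message + traceback
```

Unlike print, log records carry a level and the spider's name, so you can filter them per spider with LOG_LEVEL or a logging config.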
