Skip to content

Instantly share code, notes, and snippets.

@LittleYenMin
Last active July 13, 2019 15:12
Show Gist options
  • Save LittleYenMin/234f32a3fa7e8bf3c64d29fbf7a47492 to your computer and use it in GitHub Desktop.
Save LittleYenMin/234f32a3fa7e8bf3c64d29fbf7a47492 to your computer and use it in GitHub Desktop.
Scrapy爬蟲第六章新聞正文的function
def parse_content(self, response):
for body in response.xpath('//div[contains(@class, "articlebody")]'):
title = body.xpath('./h1/text()').get()
view_time = body.xpath('.//span[contains(@class, "viewtime")]/text()').get()
contents = body.xpath('.//div[contains(@class, "text")]//p//text()').extract()
content = ' '.join(contents)
if len(content) > 300:
content = content[:300] # 如果字長度超過300則擷取前300字
# 確認我們所需要的資料都不為空,如為空則不存入
if response.url and title and view_time and content:
yield {
'url': response.url,
'title': title,
'date': view_time,
'content': content,
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment