Last active July 13, 2019 15:12
Scrapy crawler, chapter 6: the function that parses the news article body
def parse_content(self, response):
    # Each matching <div> holds one article: title, view time, and body text.
    for body in response.xpath('//div[contains(@class, "articlebody")]'):
        title = body.xpath('./h1/text()').get()
        view_time = body.xpath('.//span[contains(@class, "viewtime")]/text()').get()
        contents = body.xpath('.//div[contains(@class, "text")]//p//text()').getall()
        content = ' '.join(contents)
        if len(content) > 300:
            content = content[:300]  # if the text is longer than 300 characters, keep only the first 300
        # Make sure none of the fields we need is empty; if any is empty, do not store the item.
        if response.url and title and view_time and content:
            yield {
                'url': response.url,
                'title': title,
                'date': view_time,
                'content': content,
            }
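
The join, truncate, and validate steps in parse_content can be tried out without running a crawl. The following is only a minimal sketch: build_item and MAX_CONTENT_LENGTH are hypothetical names that do not appear in the original gist, and the URL and field values in the usage lines are made up for illustration.

```python
MAX_CONTENT_LENGTH = 300  # assumed constant; mirrors the 300-character cutoff in parse_content


def build_item(url, title, view_time, paragraphs, max_length=MAX_CONTENT_LENGTH):
    """Hypothetical helper: return an item dict, or None if a required field is empty."""
    # Join the extracted text nodes with spaces, then keep at most max_length characters.
    # Slicing is safe on short strings, so no explicit length check is needed.
    content = ' '.join(paragraphs)[:max_length]
    # Same rule as parse_content: every field must be non-empty, otherwise skip the item.
    if url and title and view_time and content:
        return {
            'url': url,
            'title': title,
            'date': view_time,
            'content': content,
        }
    return None


# Usage with made-up values:
item = build_item(
    'https://example.com/news/1',
    'Sample title',
    '2019-07-13 15:12',
    ['First paragraph.', 'Second paragraph.'],
)
print(item)
```

Inside the spider, parse_content could call such a helper with the values it extracts and yield the result only when it is not None.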