@embiem
Last active August 27, 2020 09:33
Scrapy spider that scrapes the Google Codelabs index page for each codelab's category, description, link, last-updated date, duration, and tags.
import scrapy


class CodelabsSpider(scrapy.Spider):
    name = "codelabs"

    def start_requests(self):
        # Single entry point: the Codelabs index page.
        urls = [
            'https://codelabs.developers.google.com/'
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Each codelab is rendered as an <a class="codelab-card"> element;
        # most of the metadata lives in its data-* attributes.
        for card in response.css('a.codelab-card'):
            yield {
                'description': card.css('a div.description::text').extract_first(),
                # Card hrefs may be relative, so resolve them against the page URL.
                'link': response.urljoin(card.css('a::attr(href)').extract_first()),
                'category': card.css('a::attr(data-category)').extract_first(),
                'updated': card.css('a::attr(data-updated)').extract_first(),
                'duration': card.css('a::attr(data-duration)').extract_first(),
                'tags': card.css('a::attr(data-tags)').extract_first()
            }
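The spider yields every field as a raw string (or None when a selector matches nothing), so some post-processing may be needed before the data is useful. Below is a minimal sketch of that step, not part of the original spider. It assumes that data-duration holds a whole number of minutes and that data-tags is a comma-separated list; neither is confirmed by the spider itself. The helper name normalize_item and the sample values are hypothetical.

```python
from typing import Any, Dict, Optional


def normalize_item(raw: Dict[str, Optional[str]]) -> Dict[str, Any]:
    """Convert one raw scraped item into typed values.

    Assumptions (not confirmed by the spider): 'duration' is a whole
    number of minutes and 'tags' is a comma-separated string.
    """
    item: Dict[str, Any] = dict(raw)

    # extract_first() returns None when nothing matched, so guard each field.
    description = raw.get("description")
    item["description"] = description.strip() if description else None

    duration = raw.get("duration")
    try:
        item["duration"] = int(duration) if duration else None
    except ValueError:
        item["duration"] = None  # leave unparseable durations empty

    # Split the tag string and drop empty entries (e.g. from a trailing comma).
    tags = raw.get("tags") or ""
    item["tags"] = [t.strip() for t in tags.split(",") if t.strip()]
    return item


# Hypothetical item, shaped like the dicts yielded by parse().
sample = {
    "description": "  Build a kiosk web app \n",
    "link": "https://codelabs.developers.google.com/codelabs/example/",
    "category": "web",
    "updated": "2019-05-10T19:23:01Z",
    "duration": "30",
    "tags": "kiosk, web",
}
print(normalize_item(sample))
```

A function like this could be called from a Scrapy item pipeline or applied to the exported file afterwards; the link, category, and updated fields are passed through unchanged.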