@rvth
Created September 2, 2021 10:23
from scrapy.spiders import Spider


class SuperSpider(Spider):
    # Note: the original subclassed CrawlSpider, but overriding parse() on a
    # CrawlSpider without defining any rules breaks its link-following logic;
    # a plain Spider is the right base class for a hand-written parse().
    name = 'follower'
    allowed_domains = ['en.wikipedia.org']
    start_urls = ['https://en.wikipedia.org/wiki/Web_scraping']
    base_url = 'https://en.wikipedia.org'  # unused; kept from the original
    custom_settings = {
        'DEPTH_LIMIT': 1  # only follow links one level deep from the start URL
    }

    def parse(self, response):
        # Follow every link found inside body paragraphs, re-entering parse()
        for next_page in response.xpath('.//div/p/a'):
            yield response.follow(next_page, self.parse)
        # Yield the page heading (the <h1> text) of each page visited
        for quote in response.xpath('.//h1/text()'):
            yield {'quote': quote.extract()}