
@SauravKanchan
Created September 18, 2017 18:13
Crawling web pages with scrapy
import scrapy

# Seed the visited set with the start URL.
# Note: set('https://ves.ac.in/') would build a set of individual
# characters, so the URL must be wrapped in a container literal.
crawled = {'https://ves.ac.in/'}

class MySpider(scrapy.Spider):
    name = 'my_spider'
    # Requests to links outside this domain are filtered out by Scrapy.
    allowed_domains = ['ves.ac.in']

    def start_requests(self):
        yield scrapy.Request('https://ves.ac.in/', self.parse)

    def parse(self, response):
        for url in response.xpath('//a/@href').extract():
            if url not in crawled:
                crawled.add(url)
                # Resolve relative hrefs against the current page URL.
                yield {'url': response.urljoin(url)}
                # Don't follow links to binary assets that aren't HTML pages.
                if url[-4:] not in ('.pdf', '.png', '.jpg', '.gif'):
                    yield scrapy.Request(url=response.urljoin(url), callback=self.parse)
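The deduplication and URL-resolution logic the spider relies on can be sketched without Scrapy itself, using the standard library's `urllib.parse.urljoin` (which mirrors what `response.urljoin` does); the link list below is a hypothetical sample, not output from the actual site:

```python
from urllib.parse import urljoin

base = 'https://ves.ac.in/'

# Sample hrefs as they might appear in <a> tags: relative paths,
# root-relative paths, and absolute URLs.
links = ['about/', '/contact', 'https://ves.ac.in/admissions', 'about/']

crawled = set()
resolved = []
for url in links:
    if url not in crawled:          # skip links already seen
        crawled.add(url)
        resolved.append(urljoin(base, url))  # resolve against the page URL

print(resolved)
# → ['https://ves.ac.in/about/', 'https://ves.ac.in/contact', 'https://ves.ac.in/admissions']
```

The duplicate `'about/'` is dropped by the set check, just as repeated hrefs are in the spider. To run the actual spider, save it to a file and use `scrapy runspider spider.py -o urls.json`.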