@h2rd
Created May 17, 2013 21:09
import os

# Note: Scrapy 1.0 and later moved this module to scrapy.dupefilters.
from scrapy.dupefilter import RFPDupeFilter


class CustomFilter(RFPDupeFilter):
    """A dupe filter that considers specific ids in the url"""

    def __getid(self, url):
        # Keep everything before "&refer"; adjust the separator to your URLs.
        mm = url.split("&refer")[0]
        return mm

    def request_seen(self, request):
        fp = self.__getid(request.url)
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        if self.file:
            self.file.write(fp + os.linesep)
        # Not seen before: tell the scheduler to keep the request.
        return False


"""
Then you need to set the correct DUPEFILTER_CLASS in settings.py
DUPEFILTER_CLASS = 'scraper.duplicate_filter.CustomFilter'
"""
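
To see what the filter does without running a crawl, here is a minimal standalone sketch of the same logic. `url_key` and `SeenUrls` are hypothetical names used only for illustration and are not part of Scrapy, and the example URLs are made up.

```python
def url_key(url):
    """Return the part of the URL before "&refer", like __getid in the filter."""
    return url.split("&refer")[0]


class SeenUrls:
    """Hypothetical stand-in for the filter's fingerprint set (not a Scrapy class)."""

    def __init__(self):
        self.fingerprints = set()

    def request_seen(self, url):
        key = url_key(url)
        if key in self.fingerprints:
            return True
        self.fingerprints.add(key)
        return False


seen = SeenUrls()
# Same item id, different "refer" value: the second URL counts as a duplicate.
first = seen.request_seen("http://example.com/item?id=1&refer=home")
second = seen.request_seen("http://example.com/item?id=1&refer=search")
# Different item id: not a duplicate.
third = seen.request_seen("http://example.com/item?id=2&refer=home")
print(first, second, third)
```

Two URLs that differ only after `&refer` map to the same key, so only the first one would be scheduled.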

murdrae commented Dec 12, 2013

Can we use a regex inside of ("&refer")? Thanks!
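
On the regex question: `str.split` only accepts a literal separator, but `re.split` from the standard library accepts a pattern, so the id extraction could be written with it instead. A minimal sketch, assuming the tracking parameter appears as either `&refer` or `&ref`; the pattern, the name `REFER_RE`, the function `get_id`, and the example URLs are all hypothetical and should be adapted to the URLs actually crawled.

```python
import re

# Hypothetical pattern: cut at "&refer" or "&ref" when it is followed by
# "=" or "&", or when it ends the URL.
REFER_RE = re.compile(r"&ref(?:er)?(?=[=&]|$)")


def get_id(url):
    """Return the part of the URL before the first match of REFER_RE."""
    return REFER_RE.split(url, maxsplit=1)[0]


a = get_id("http://example.com/item?id=1&refer=home")
b = get_id("http://example.com/item?id=1&ref=search")
c = get_id("http://example.com/item?id=1")
print(a == b == c)
```

Inside `CustomFilter`, the body of `__getid` could return the same expression in place of `url.split("&refer")[0]`.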
