@rafikahmed
Last active April 21, 2020 11:11
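from scrapy.exceptions import DropItem


class DuplicatesPipeline(object):
    """Drop any item whose 'email' field has already been seen."""

    def __init__(self):
        # Emails encountered so far during this crawl.
        self.emails_seen = set()

    def process_item(self, item, spider):
        if item['email'] in self.emails_seen:
            raise DropItem("Duplicate item found: %s" % item)
        # First occurrence: remember the email and pass the item on.
        self.emails_seen.add(item['email'])
        return item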
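
A minimal sketch of how the pipeline above behaves when a duplicate arrives. The `run_demo` helper, the sample emails, and the fallback `DropItem` class are assumptions added for illustration; they are not part of the gist. The pipeline class is repeated in compact form only so that the snippet runs on its own.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Assumption: stand-in so the sketch also runs without Scrapy installed.
    class DropItem(Exception):
        """Placeholder for scrapy.exceptions.DropItem."""


class DuplicatesPipeline:
    def __init__(self):
        self.emails_seen = set()

    def process_item(self, item, spider):
        if item['email'] in self.emails_seen:
            raise DropItem("Duplicate item found: %s" % item)
        self.emails_seen.add(item['email'])
        return item


def run_demo(items):
    """Hypothetical helper: feed items through the pipeline, return (kept, dropped)."""
    pipeline = DuplicatesPipeline()
    kept, dropped = [], []
    for item in items:
        try:
            # The spider argument is unused by this pipeline, so None is enough here.
            kept.append(pipeline.process_item(item, spider=None))
        except DropItem:
            dropped.append(item)
    return kept, dropped


kept, dropped = run_demo([
    {'email': 'a@example.com'},
    {'email': 'b@example.com'},
    {'email': 'a@example.com'},  # same email as the first item
])
```

In a real project, Scrapy calls `process_item` itself once the class is listed in the `ITEM_PIPELINES` setting, for example `ITEM_PIPELINES = {'myproject.pipelines.DuplicatesPipeline': 300}`, where `myproject.pipelines` is a hypothetical module path.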