@harrywang
Last active April 11, 2021 10:05
import logging

from scrapy.exceptions import DropItem
from sqlalchemy.orm import sessionmaker

# Project-local helpers from the tutorial's models module (module name assumed)
from .models import Quote, create_table, db_connect


class DuplicatesPipeline(object):
    def __init__(self):
        """
        Initializes the database connection and sessionmaker.
        Creates tables.
        """
        engine = db_connect()
        create_table(engine)
        self.Session = sessionmaker(bind=engine)
        logging.info("****DuplicatesPipeline: database connected****")

    def process_item(self, item, spider):
        session = self.Session()
        exist_quote = session.query(Quote).filter_by(quote_content=item["quote_content"]).first()
        session.close()
        if exist_quote is not None:  # the current quote already exists
            raise DropItem("Duplicate item found: %s" % item["quote_content"])
        return item
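For completeness, a pipeline class like this only takes effect once it is registered in the project's Scrapy settings. A typical entry might look like the following (the module path `tutorial.pipelines` is an assumption; use your own project's path, and pick a priority that orders this pipeline relative to any others):

```python
# settings.py -- module path "tutorial.pipelines" is an assumption
ITEM_PIPELINES = {
    "tutorial.pipelines.DuplicatesPipeline": 100,  # lower number = runs earlier
}
```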
@juananpe

juananpe commented Apr 9, 2021

The session.close() method calls are unreachable.
The first one is placed after a raise clause:
https://gist.github.com/harrywang/74263ad71ca26ca47dbebab9032c0b60#file-duplicates_pipeline-py-L18
The second one comes after a return:
https://gist.github.com/harrywang/74263ad71ca26ca47dbebab9032c0b60#file-duplicates_pipeline-py-L21
These errors will quickly exhaust the connection pool and bring SQLAlchemy down.
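The failure mode described above can be reproduced with a stdlib-only sketch. Here `FakeSession` is a hypothetical stand-in for a SQLAlchemy session; the point is that any statement placed after an unconditional `raise` (or `return`) never executes, while a `try/finally` guarantees cleanup on every exit path:

```python
class FakeSession:
    """Hypothetical stand-in for a SQLAlchemy session."""
    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


def buggy(session):
    raise ValueError("Duplicate item found")
    session.close()  # unreachable: placed after the raise


def fixed(session):
    try:
        raise ValueError("Duplicate item found")
    finally:
        session.close()  # runs even though an exception is propagating


leaked = FakeSession()
try:
    buggy(leaked)
except ValueError:
    pass
print(leaked.closed)  # False -- the session was never closed

safe = FakeSession()
try:
    fixed(safe)
except ValueError:
    pass
print(safe.closed)  # True -- finally ran before the exception propagated
```

The same guarantee is why moving `session.close()` before the `raise`/`return` (as the revised gist does), or wrapping the body in `try/finally`, prevents the connection leak.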

@harrywang
Author

> The session.close() method calls are unreachable.
> The first one is placed after a raise clause:
> https://gist.github.com/harrywang/74263ad71ca26ca47dbebab9032c0b60#file-duplicates_pipeline-py-L18
> The second one comes after a return:
> https://gist.github.com/harrywang/74263ad71ca26ca47dbebab9032c0b60#file-duplicates_pipeline-py-L21
> These errors will quickly exhaust the connection pool and bring SQLAlchemy down.

Thanks for pointing this out. I have changed the gist - does the changed version make sense?

@juananpe

Thanks to you for a great tutorial on Scrapy and your quick reply. Your solution looks good to me.
