Julia Medina curita

  • Scrapinghub
  • Córdoba, Argentina
"""
There are cases where jobs fail so abruptly that Spidermon (or any other
extension that runs when Scrapy shuts down) never runs.
In these situations nobody is alerted: Spidermon did not run, so it
generates no alerts, and Scrapy Cloud does not warn about these jobs either.
This script aims to help identify those jobs.
To use it (either locally or in Scrapy Cloud), put the following script
in your project:
.. code-block:: python
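The script itself is cut off in this preview. Purely as an illustration of the idea described above, and not the author's script, here is a minimal sketch that flags jobs which look like they ended without a clean shutdown. It works on plain dicts rather than a real Scrapy Cloud client. The `key` and `state` fields, the `find_abruptly_failed_jobs` helper, and the `spidermon/` stat prefix are assumptions made for the example. The `scrapystats` name is borrowed from the job metadata used elsewhere in this listing, and `finish_reason` is the stat Scrapy normally records when a spider closes.

```python
# Hypothetical sketch: not the author's script and not a real client API.
# Each job is a plain dict; the field names below are assumptions.

def find_abruptly_failed_jobs(jobs):
    """Return the keys of finished jobs that show no sign of a clean shutdown."""
    suspicious = []
    for job in jobs:
        stats = job.get("scrapystats") or {}
        # Scrapy records 'finish_reason' when the spider closes normally,
        # so its absence hints that end-of-run extensions never executed.
        closed_cleanly = "finish_reason" in stats
        # Assumed convention: Spidermon stat keys start with 'spidermon/'.
        spidermon_ran = any(key.startswith("spidermon/") for key in stats)
        if job.get("state") == "finished" and not (closed_cleanly and spidermon_ran):
            suspicious.append(job["key"])
    return suspicious


jobs = [
    {"key": "1/1/1", "state": "finished",
     "scrapystats": {"finish_reason": "finished", "spidermon/validation/items": 10}},
    {"key": "1/1/2", "state": "finished", "scrapystats": {"item_scraped_count": 5}},
    {"key": "1/1/3", "state": "running", "scrapystats": {}},
]
print(find_abruptly_failed_jobs(jobs))  # prints ['1/1/2']
```

Only the second job is reported: it is marked as finished but has neither a `finish_reason` stat nor any Spidermon stats. The running job is skipped because it has not ended yet.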
@curita
curita / gist:a45abcfc2e19d7474f3bff0ab36ad478
Created October 27, 2023 19:53
Ad-hoc JustWatch job (Oct 27 2023)
https://www.justwatch.com/us/tv-show/the-bear
https://www.justwatch.com/us/tv-show/the-boys
https://www.justwatch.com/us/tv-show/the-wheel-of-time
https://www.justwatch.com/us/movie/no-one-will-save-you
https://www.justwatch.com/us/tv-show/family-guy
https://www.justwatch.com/us/tv-show/wilderness
https://www.justwatch.com/us/tv-show/what-we-do-in-the-shadows
https://www.boxofficemojo.com/title/tt26907957/
https://www.boxofficemojo.com/title/tt10638522/
https://www.boxofficemojo.com/title/tt15837338/
https://www.imdb.com/title/tt26907957/
https://www.imdb.com/title/tt10638522/
https://www.imdb.com/title/tt15837338/
@curita
curita / testing-goodreads-book-urls.txt
Last active October 17, 2023 15:08
Testing crawl_source file for Goodreads
https://www.goodreads.com/book/show/22837718-qualia-the-purple
https://www.goodreads.com/book/show/57916643-the-year-s-midnight
https://www.goodreads.com/book/show/31312596-letters-from-a-shipwreck-in-the-sea-of-suns-and-moons
https://www.goodreads.com/book/show/44539716-the-nothing-within
https://www.goodreads.com/book/show/60286274-the-reyes-incident
https://www.goodreads.com/book/show/42348385-the-narrows
https://www.goodreads.com/book/show/56135545-the-spark
https://www.goodreads.com/book/show/55962500-legacy-of-the-brightwash
https://www.goodreads.com/book/show/33965336-seek-the-throat-from-which-we-sing
https://www.goodreads.com/book/show/60214731-into-the-fire
@curita
curita / crawl_source.txt
Created September 12, 2023 18:24
IMDb crawl_source file
https://www.imdb.com/title/tt0141842/
@curita
curita / check_dataloss_retries.py
Last active June 29, 2017 09:58
Check unsuccessful dataloss retries in ScrapyCloud
from hubstorage import HubstorageClient
hs = HubstorageClient('[REDACTED]')
project = hs.get_project('1887')
def examine_logs(job):
    n_dataloss_requests = 0
    n_failed_dataloss_requests = 0
    crawlera_enabled = int(job.metadata['scrapystats'].get('crawlera/request', 0))
@curita
curita / sum-middle-shelves.py
Last active June 29, 2017 10:07
Calculate total amount of products in middle shelves from [REDACTED] SH job
from collections import defaultdict

from hubstorage import HubstorageClient

hs = HubstorageClient('<API_KEY>')

class Shelf():
    def __init__(self):
        # Child shelves are created on first access.
        self.children = defaultdict(Shelf)
        self.products = 0

    def __iter__(self):
        for child in self.children.values():
@curita
curita / check-mismatching-categories.py
Last active June 29, 2017 10:14
Check mismatching breadcrumbs parsing for same shelf url
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import os
import pprint
import argparse
from itertools import groupby
from operator import itemgetter
from w3lib.url import url_query_cleaner