Julia Medina curita

curita /
Last active Jun 29, 2017
Check unsuccessful dataloss retries in ScrapyCloud
from hubstorage import HubstorageClient
hs = HubstorageClient('[REDACTED]')
project = hs.get_project('1887')
def examine_logs(job):
    n_dataloss_requests = 0
    n_failed_dataloss_requests = 0
    crawlera_enabled = int(job.metadata['scrapystats'].get('crawlera/request', 0))
curita /
Last active Jun 29, 2017
Calculate total amount of products in middle shelves from [REDACTED] SH job
from collections import defaultdict
from hubstorage import HubstorageClient
hs = HubstorageClient('<API_KEY>')
class Shelf():
    def __init__(self):
        # Each shelf maps child names to nested Shelf nodes.
        self.children = defaultdict(Shelf)
        self.products = 0
    def __iter__(self):
        for child in self.children.values():
curita /
Last active Jun 29, 2017
Check mismatching breadcrumbs parsing for same shelf url
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import os
import pprint
import argparse
from itertools import groupby
from operator import itemgetter
from w3lib.url import url_query_cleaner
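The imports above suggest that the check groups scraped items by a cleaned shelf URL and compares the breadcrumbs recorded for each group. The rest of the script is not shown here, so the following is only a minimal sketch of that idea, not the author's implementation. It assumes hypothetical item fields `url` and `breadcrumbs`, and it uses the standard library's `urllib.parse` (Python 3) to drop the query string as a stand-in for `w3lib.url.url_query_cleaner`.

```python
from itertools import groupby
from operator import itemgetter
from urllib.parse import urlsplit, urlunsplit


def clean_url(url):
    """Drop the query string and fragment so variants of a shelf URL compare equal."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def find_mismatches(items):
    """Map each cleaned URL to its breadcrumb variants when more than one was seen.

    `items` is an iterable of dicts with the hypothetical keys
    'url' and 'breadcrumbs' (a list of strings).
    """
    rows = [(clean_url(item["url"]), tuple(item["breadcrumbs"])) for item in items]
    # groupby only merges adjacent rows, so sort by URL first.
    rows.sort(key=itemgetter(0))
    mismatches = {}
    for url, group in groupby(rows, key=itemgetter(0)):
        variants = {crumbs for _, crumbs in group}
        if len(variants) > 1:
            mismatches[url] = variants
    return mismatches
```

Calling `find_mismatches` on a list of item dicts returns only the shelf URLs whose breadcrumbs disagree, which can then be pretty-printed for inspection.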
curita /
Created Mar 19, 2015
item_scraped signal
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import scrapy
from scrapy import signals, log
from scrapy.crawler import Crawler
from scrapy.xlib.pydispatch import dispatcher
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor
curita / log.txt
Created Mar 4, 2015
Output from changing module loggers to their __name__s
➜ scrapy git:(python-logging) ✗ scrapy fetch ''
2015-03-04 15:51:17+0000 [scrapy.utils.log] INFO: Scrapy 0.25.1 started (bot: scrapybot)
2015-03-04 15:51:17+0000 [scrapy.utils.log] INFO: Optional features available: ssl, http11
2015-03-04 15:51:17+0000 [scrapy.utils.log] INFO: Overridden settings: {'LOG_FILE': 'name.log'}
2015-03-04 15:51:22+0000 [scrapy.middleware] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, CoreStats, SpiderState
2015-03-04 15:51:22+0000 [scrapy.middleware] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-03-04 15:51:22+0000 [scrapy.middleware] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-03-04 15:51:22+0000 [scrapy.middleware
curita / gist:f644baf1215ddfe9d313
Last active Aug 29, 2015
Hubstorage script for dumping collections
from hubstorage import HubstorageClient
hsclient = HubstorageClient(auth=<SH_APIKEY>)
project = hsclient.get_project(<SH_PROJECT_ID:6089>)
collections = project.collections
news = collections.new_store(<COLLECTION_NAME:'news'>)
for item in news.iter_values():
    print(item)
curita / gist:3bcd9cd062ca510f2b60
Last active Jan 26, 2017
Hubstorage collections endpoint
> curl -u $SHUB_APIKEY:
> curl -u $SHUB_APIKEY:\?start\=0bfd9919d2503cce
> curl -u $SHUB_APIKEY:\?start\=15d480526db98c72
View gist:daa2f01383231f3f0018
> sudo dmidecode -t memory
# dmidecode 2.12
SMBIOS 2.7 present.
Handle 0x0005, DMI type 5, 20 bytes
Memory Controller Information
Error Detecting Method: None
Error Correcting Capabilities:
curita / Proposal.rst
Last active Jan 26, 2017
Google Summer of Code 2014 Proposal

Scrapy Project's Proposal

This proposal aims to add support for a new Scrapy feature, per-spider settings, which will require a significant cleanup of the core API. It is based on a careful review of the Scrapy Enhancement Proposal `Sep019`_ draft for this project.
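To make the idea of per-spider settings concrete, here is a minimal sketch of layered settings in which values declared by a spider override project-wide values, which in turn override defaults. It is an illustration only and does not reproduce Scrapy's API or the design in the SEP draft: the `LayeredSettings` class, the priority names and numbers, the `custom_settings` attribute, and the `settings_for` helper are all assumptions made for this example.

```python
# Hypothetical priority levels; a higher number overrides a lower one.
PRIORITIES = {"default": 0, "project": 20, "spider": 30}


class LayeredSettings:
    """Stores each setting together with the priority it was set at."""

    def __init__(self):
        self._values = {}  # name -> (priority level, value)

    def set(self, name, value, priority):
        level = PRIORITIES[priority]
        current = self._values.get(name)
        # Only overwrite when the new priority is at least as high.
        if current is None or level >= current[0]:
            self._values[name] = (level, value)

    def update(self, values, priority):
        for name, value in values.items():
            self.set(name, value, priority)

    def get(self, name, default=None):
        entry = self._values.get(name)
        return entry[1] if entry is not None else default


class ExampleSpider:
    """Hypothetical spider that declares its own overrides."""

    name = "example"
    custom_settings = {"DOWNLOAD_DELAY": 2.0}


def settings_for(spider_cls, project_settings):
    """Build the effective settings for one spider class."""
    settings = LayeredSettings()
    settings.update({"DOWNLOAD_DELAY": 0, "LOG_LEVEL": "DEBUG"}, "default")
    settings.update(project_settings, "project")
    settings.update(getattr(spider_cls, "custom_settings", {}), "spider")
    return settings
```

With this layout, `settings_for(ExampleSpider, {"LOG_LEVEL": "INFO"})` yields the spider's own `DOWNLOAD_DELAY` and the project's `LOG_LEVEL`, and because each value carries its priority, the order in which the layers are applied does not change the result.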

