

Julia Medina curita

curita /
Last active Jun 29, 2017
Check unsuccessful dataloss retries in ScrapyCloud
from hubstorage import HubstorageClient
hs = HubstorageClient('[REDACTED]')
project = hs.get_project('1887')
def examine_logs(job):
    n_dataloss_requests = 0
    n_failed_dataloss_requests = 0
    crawlera_enabled = int(job.metadata['scrapystats'].get('crawlera/request', 0))
curita /
Last active Jun 29, 2017
Calculate total number of products in middle shelves from [REDACTED] SH job
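from collections import defaultdict  # needed by Shelf.children below
from hubstorage import HubstorageClient
hs = HubstorageClient('<API_KEY>')
class Shelf():
    def __init__(self):
        # Each child shelf is created on first access.
        self.children = defaultdict(Shelf)
        self.products = 0
    def __iter__(self):
        for child in self.children.values():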
curita /
Last active Jun 29, 2017
Check mismatching breadcrumbs parsing for same shelf url
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import os
import pprint
import argparse
from itertools import groupby
from operator import itemgetter
from w3lib.url import url_query_cleaner
curita /
Created Mar 19, 2015
item_scraped signal
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import scrapy
from scrapy import signals, log
from scrapy.crawler import Crawler
from scrapy.xlib.pydispatch import dispatcher
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor
curita / log.txt
Created Mar 4, 2015
Output from changing module loggers to their __name__s
➜ scrapy git:(python-logging) ✗ scrapy fetch ''
2015-03-04 15:51:17+0000 [scrapy.utils.log] INFO: Scrapy 0.25.1 started (bot: scrapybot)
2015-03-04 15:51:17+0000 [scrapy.utils.log] INFO: Optional features available: ssl, http11
2015-03-04 15:51:17+0000 [scrapy.utils.log] INFO: Overridden settings: {'LOG_FILE': 'name.log'}
2015-03-04 15:51:22+0000 [scrapy.middleware] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, CoreStats, SpiderState
2015-03-04 15:51:22+0000 [scrapy.middleware] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-03-04 15:51:22+0000 [scrapy.middleware] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-03-04 15:51:22+0000 [scrapy.middleware
curita / gist:f644baf1215ddfe9d313
Last active Aug 29, 2015
Hubstorage script for dumping collections
from hubstorage import HubstorageClient
hsclient = HubstorageClient(auth=<SH_APIKEY>)
project = hsclient.get_project(<SH_PROJECT_ID:6089>)
collections = project.collections
news = collections.new_store(<COLLECTION_NAME:'news'>)
# Print every value stored in the collection.
for item in news.iter_values():
    print(item)
curita / gist:3bcd9cd062ca510f2b60
Last active Jan 26, 2017
Hubstorage collections endpoint
> curl -u $SHUB_APIKEY:
> curl -u $SHUB_APIKEY:\?start\=0bfd9919d2503cce
> curl -u $SHUB_APIKEY:\?start\=15d480526db98c72
View gist:daa2f01383231f3f0018
> sudo dmidecode -t memory
# dmidecode 2.12
SMBIOS 2.7 present.
Handle 0x0005, DMI type 5, 20 bytes
Memory Controller Information
Error Detecting Method: None
Error Correcting Capabilities:
curita / Proposal.rst
Last active Jan 26, 2017
Google Summer of Code 2014 Proposal

Scrapy Project's Proposal

This proposal aims to add support for a new Scrapy feature, per-spider settings, which will require a significant cleanup of the core API. It is based on a careful review of the Scrapy Enhancement Proposal `Sep019`_ draft for this project.
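
The excerpt above does not describe how per-spider settings would work, so the following is only an illustrative sketch of the general idea: settings are layered by priority, and values declared on a spider class override project-wide ones. Every name here (``LayeredSettings``, ``PRIORITIES``, ``build_settings``, ``ExampleSpider``, and the ``custom_settings`` attribute as used in this sketch) is an assumption made for the example and is not taken from the proposal text.

```python
# Illustrative sketch only: all names below are hypothetical and this is
# plain Python, not Scrapy's actual settings implementation.

# Higher number wins when the same setting is defined at several levels.
PRIORITIES = {"default": 0, "project": 20, "spider": 30, "cmdline": 40}


class LayeredSettings:
    """Stores each setting together with the priority that set it."""

    def __init__(self):
        self._values = {}  # setting name -> (value, numeric priority)

    def set(self, name, value, priority):
        level = PRIORITIES[priority]
        current = self._values.get(name)
        # Only overwrite when the new source is at least as important.
        if current is None or level >= current[1]:
            self._values[name] = (value, level)

    def update(self, values, priority):
        for name, value in values.items():
            self.set(name, value, priority)

    def get(self, name, default=None):
        entry = self._values.get(name)
        return entry[0] if entry is not None else default


class ExampleSpider:
    """Hypothetical spider that declares its own overrides."""

    name = "example"
    custom_settings = {"DOWNLOAD_DELAY": 2.0}


def build_settings(project_settings, spider_cls, cmdline=None):
    """Merge defaults, project, spider, and command-line values in order."""
    settings = LayeredSettings()
    settings.update({"DOWNLOAD_DELAY": 0, "USER_AGENT": "scrapybot"}, "default")
    settings.update(project_settings, "project")
    settings.update(getattr(spider_cls, "custom_settings", {}), "spider")
    settings.update(cmdline or {}, "cmdline")
    return settings


if __name__ == "__main__":
    merged = build_settings({"DOWNLOAD_DELAY": 0.5}, ExampleSpider)
    print(merged.get("DOWNLOAD_DELAY"))
```

In this sketch the spider-level value takes precedence over the project-level one, while a command-line value would still override both; the ordering is a design choice made for the illustration, not a statement about the proposal.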