Julia Medina (curita)

@curita
curita / check_dataloss_retries.py
Last active Jun 29, 2017
Check unsuccessful dataloss retries in ScrapyCloud
from hubstorage import HubstorageClient

hs = HubstorageClient('[REDACTED]')
project = hs.get_project('1887')

def examine_logs(job):
    n_dataloss_requests = 0
    n_failed_dataloss_requests = 0
    crawlera_enabled = int(job.metadata['scrapystats'].get('crawlera/request', 0))
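The preview ends mid-function. A minimal sketch of how such a scan could continue, reading the job log with hubstorage's job.logs.iter_values(); the 'dataloss' and 'Gave up retrying' message patterns are assumptions for illustration, not the gist's actual logic:

def examine_logs(job):
    n_dataloss_requests = 0
    n_failed_dataloss_requests = 0
    for entry in job.logs.iter_values():
        message = entry.get('message', '')
        if 'dataloss' in message.lower():
            # A retry was triggered by a dataloss error...
            n_dataloss_requests += 1
            if 'Gave up retrying' in message:
                # ...and the retry middleware eventually abandoned it.
                n_failed_dataloss_requests += 1
    return n_dataloss_requests, n_failed_dataloss_requests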
@curita
curita / sum-middle-shelves.py
Last active Jun 29, 2017
Calculate total amount of products in middle shelves from [REDACTED] SH job
from collections import defaultdict

from hubstorage import HubstorageClient

hs = HubstorageClient('<API_KEY>')

class Shelf(object):
    def __init__(self):
        self.children = defaultdict(Shelf)
        self.products = 0

    def __iter__(self):
        for child in self.children.values():
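A guess at the continuation, sketched under two assumptions that are not the gist's code: that __iter__ walks the tree depth-first, and that "middle shelves" means shelves that have both a parent and children.

    def __iter__(self):
        # Depth-first traversal over every descendant shelf.
        for child in self.children.values():
            yield child
            for grandchild in child:
                yield grandchild

root = Shelf()
root.children['books'].products = 10
root.children['books'].children['fiction'].products = 7

# Total products across shelves that themselves have children.
print sum(shelf.products for shelf in root if shelf.children)  # -> 10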
@curita
curita / check-mismatching-categories.py
Last active Jun 29, 2017
Check mismatching breadcrumbs parsing for same shelf url
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import os
import pprint
import argparse
from itertools import groupby
from operator import itemgetter
from w3lib.url import url_query_cleaner
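The preview stops at the imports, which hint at the shape of the check. A minimal sketch continuing from them, with assumed item fields ('url', 'breadcrumbs') and an assumed shelf-identifying query parameter ('id'); none of these names come from the gist:

items = [
    {'url': 'http://example.com/shelf?id=1&ref=a', 'breadcrumbs': ['Home', 'Books']},
    {'url': 'http://example.com/shelf?id=1&ref=b', 'breadcrumbs': ['Home', 'Toys']},
]

def shelf_url(item):
    # Drop everything but the shelf-identifying parameter so the
    # same shelf compares equal across tracking variants.
    return url_query_cleaner(item['url'], parameterlist=['id'])

items.sort(key=shelf_url)
for url, group in groupby(items, key=shelf_url):
    crumbs = set(tuple(item['breadcrumbs']) for item in group)
    if len(crumbs) > 1:
        # Same shelf URL parsed with different breadcrumbs: report it.
        pprint.pprint({url: sorted(crumbs)})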
@curita
curita / scrapy-0.24.py
Created Mar 19, 2015
item_scraped signal
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import scrapy
from scrapy import signals, log
from scrapy.crawler import Crawler
from scrapy.xlib.pydispatch import dispatcher
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor
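The preview stops at the imports. A sketch of how they plausibly fit together, following the Scrapy 0.24 run-from-script recipe with an item_scraped handler; PageItem, MySpider, and the handler body are illustrative assumptions, and crawler.signals.connect() is used here where the gist's dispatcher import suggests the global pydispatch API as an alternative:

class PageItem(scrapy.Item):
    url = scrapy.Field()

class MySpider(scrapy.Spider):
    name = 'example'
    start_urls = ['http://example.com']

    def parse(self, response):
        yield PageItem(url=response.url)

def item_scraped(item, response, spider):
    # Fires once for every item the spider yields.
    log.msg('Scraped %s from %s' % (item, response.url))

crawler = Crawler(get_project_settings())
crawler.signals.connect(item_scraped, signal=signals.item_scraped)
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(MySpider())
crawler.start()
log.start()
reactor.run()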
@curita
curita / log.txt
Created Mar 4, 2015
Output from changing module loggers to their __name__s
➜ scrapy git:(python-logging) ✗ scrapy fetch 'https://scrapinghub.com'
2015-03-04 15:51:17+0000 [scrapy.utils.log] INFO: Scrapy 0.25.1 started (bot: scrapybot)
2015-03-04 15:51:17+0000 [scrapy.utils.log] INFO: Optional features available: ssl, http11
2015-03-04 15:51:17+0000 [scrapy.utils.log] INFO: Overridden settings: {'LOG_FILE': 'name.log'}
2015-03-04 15:51:22+0000 [scrapy.middleware] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, CoreStats, SpiderState
2015-03-04 15:51:22+0000 [scrapy.middleware] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-03-04 15:51:22+0000 [scrapy.middleware] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-03-04 15:51:22+0000 [scrapy.middleware
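The bracketed names ([scrapy.utils.log], [scrapy.middleware]) come from the standard library pattern the branch switches to: each module names its logger after itself. A minimal sketch:

import logging

# A module-level logger named after the module makes its records
# self-identifying, e.g. 'scrapy.middleware' in the output above.
logger = logging.getLogger(__name__)
logger.info('Optional features available: %s', 'ssl, http11')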
@curita
curita / gist:f644baf1215ddfe9d313
Last active Aug 29, 2015
Hubstorage script for dumping collections
from hubstorage import HubstorageClient

hsclient = HubstorageClient(auth=<SH_APIKEY>)
project = hsclient.get_project(<SH_PROJECT_ID:6089>)
collections = project.collections
news = collections.new_store(<COLLECTION_NAME:'news'>)
for item in news.iter_values():
    print item
@curita
curita / gist:3bcd9cd062ca510f2b60
Last active Jan 26, 2017
Hubstorage collections endpoint
> curl -u $SHUB_APIKEY: https://storage.scrapinghub.com/collections/6171/s/news/count
{"nextstart":"0bfd9919d2503cce","count":151564,"scanned":152151}
> curl -u $SHUB_APIKEY: https://storage.scrapinghub.com/collections/6171/s/news/count\?start\=0bfd9919d2503cce
{"nextstart":"15d480526db98c72","count":124668,"scanned":125124}
> curl -u $SHUB_APIKEY: https://storage.scrapinghub.com/collections/6171/s/news/count\?start\=15d480526db98c72
{"nextstart":"2473cab00e8c9c16","count":184477,"scanned":185184}
@curita
curita / gist:daa2f01383231f3f0018
> sudo dmidecode -t memory
# dmidecode 2.12
SMBIOS 2.7 present.

Handle 0x0005, DMI type 5, 20 bytes
Memory Controller Information
    Error Detecting Method: None
    Error Correcting Capabilities:
        None
@curita
curita / Proposal.rst
Last active Jan 26, 2017
Google Summer of Code 2014 Proposal

Scrapy Project's Proposal

This proposal aims to add support for a new Scrapy feature, per-spider settings, which will require a significant core API cleanup. It is based on a careful revision of the Scrapy Enhancement Proposal `Sep019`_ draft for this project.

Motivation
