Julia Medina curita

  • Scrapinghub
  • Córdoba, Argentina
@curita
curita / scrapy-0.24.py
Created March 19, 2015 05:37
item_scraped signal
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import scrapy
from scrapy import signals, log
from scrapy.crawler import Crawler
from scrapy.xlib.pydispatch import dispatcher
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor
@curita
curita / log.txt
Created March 4, 2015 19:01
Output from changing module loggers to their __name__s
➜ scrapy git:(python-logging) ✗ scrapy fetch 'https://scrapinghub.com'
2015-03-04 15:51:17+0000 [scrapy.utils.log] INFO: Scrapy 0.25.1 started (bot: scrapybot)
2015-03-04 15:51:17+0000 [scrapy.utils.log] INFO: Optional features available: ssl, http11
2015-03-04 15:51:17+0000 [scrapy.utils.log] INFO: Overridden settings: {'LOG_FILE': 'name.log'}
2015-03-04 15:51:22+0000 [scrapy.middleware] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, CoreStats, SpiderState
2015-03-04 15:51:22+0000 [scrapy.middleware] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-03-04 15:51:22+0000 [scrapy.middleware] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-03-04 15:51:22+0000 [scrapy.middleware
@curita
curita / gist:f644baf1215ddfe9d313
Last active August 29, 2015 14:16
Hubstorage script for dumping collections
from hubstorage import HubstorageClient

# Replace the <...> placeholders with your own values before running.
hsclient = HubstorageClient(auth=<SH_APIKEY>)
project = hsclient.get_project(<SH_PROJECT_ID:6089>)
collections = project.collections
news = collections.new_store(<COLLECTION_NAME:'news'>)
# Print every value stored in the collection.
for item in news.iter_values():
    print(item)
@curita
curita / gist:3bcd9cd062ca510f2b60
Last active January 26, 2017 06:50
Hubstorage collections endpoint
> curl -u $SHUB_APIKEY: https://storage.scrapinghub.com/collections/6171/s/news/count
{"nextstart":"0bfd9919d2503cce","count":151564,"scanned":152151}
> curl -u $SHUB_APIKEY: https://storage.scrapinghub.com/collections/6171/s/news/count\?start\=0bfd9919d2503cce
{"nextstart":"15d480526db98c72","count":124668,"scanned":125124}
> curl -u $SHUB_APIKEY: https://storage.scrapinghub.com/collections/6171/s/news/count\?start\=15d480526db98c72
{"nextstart":"2473cab00e8c9c16","count":184477,"scanned":185184}
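The requests above page through the count endpoint by hand: each response returns a partial `count` plus a `nextstart` key to pass as `start` on the next call. A minimal Python sketch of that loop follows. The function name `total_count` and the `fetch` callable are hypothetical, and the stopping rule (the last page carries no `nextstart`) is an assumption, since the output above does not show a final page.

```python
import json
from urllib.parse import urlencode


def total_count(fetch, base_url):
    """Sum partial counts, following `nextstart` until it is absent.

    `fetch` is any callable that takes a URL and returns the response
    body as text, e.g. a thin wrapper around an authenticated HTTP GET.
    """
    total = 0
    start = None
    while True:
        # The first request has no query string; later ones pass ?start=<key>.
        if start is None:
            url = base_url
        else:
            url = base_url + "?" + urlencode({"start": start})
        page = json.loads(fetch(url))
        total += page["count"]
        start = page.get("nextstart")
        if not start:
            # Assumption: the final page omits `nextstart`.
            return total
```

Keeping the HTTP call behind `fetch` lets the loop be exercised without network access. In real use, `fetch` would need to send the API key as the basic-auth username with an empty password, as `curl -u $SHUB_APIKEY:` does.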
@curita
curita / gist:daa2f01383231f3f0018
Created January 21, 2015 14:55
Memory output
> sudo dmidecode -t memory
# dmidecode 2.12
SMBIOS 2.7 present.
Handle 0x0005, DMI type 5, 20 bytes
Memory Controller Information
Error Detecting Method: None
Error Correcting Capabilities:
None
@curita
curita / Proposal.rst
Last active January 26, 2017 06:45
Google Summer of Code 2014 Proposal

Scrapy Project's Proposal

This proposal aims to add support for a new Scrapy feature, per-spider settings, which will require a significant cleanup of the core API. It is based on a careful review of the Scrapy Enhancement Proposal draft SEP-019, which covers this project.

Motivation