
Julia Medina (curita)

curita / check-mismatching-categories.py
Last active Jun 29, 2017
Check for mismatching breadcrumb parses of the same shelf URL
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import os
import pprint
import argparse
from itertools import groupby
from operator import itemgetter
from w3lib.url import url_query_cleaner
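
The preview stops at the imports. A minimal sketch of how they might fit together to flag shelf URLs whose breadcrumbs disagree (the items structure and its 'url'/'categories' fields are assumptions, not the gist's actual code):

def find_mismatches(items):
    # Drop query strings so variants of the same shelf URL group together.
    keyed = sorted((url_query_cleaner(item['url']), item['categories'])
                   for item in items)
    for url, group in groupby(keyed, key=itemgetter(0)):
        breadcrumbs = {tuple(cats) for _, cats in group}
        if len(breadcrumbs) > 1:  # one URL, more than one parsed breadcrumb trail
            pprint.pprint({'url': url, 'breadcrumbs': sorted(breadcrumbs)})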
curita / sum-middle-shelves.py
Last active Jun 29, 2017
Calculate the total number of products in middle shelves from a [REDACTED] SH job
from collections import defaultdict

from hubstorage import HubstorageClient

hs = HubstorageClient('<API_KEY>')

class Shelf(object):
    """A node in the shelf tree: child shelves keyed by name, plus this shelf's product count."""
    def __init__(self):
        self.children = defaultdict(Shelf)
        self.products = 0

    def __iter__(self):
        for child in self.children.values():
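            # (the preview cuts off here; a plausible completion, assuming
            # __iter__ is meant to walk the whole subtree -- not the gist's code)
            yield child
            for descendant in child:
                yield descendant

# Illustrative helper (an assumption, not in the gist): sum a subtree's products.
def total_products(shelf):
    return shelf.products + sum(node.products for node in shelf)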
curita / check_dataloss_retries.py
Last active Jun 29, 2017
Check unsuccessful dataloss retries in ScrapyCloud
from hubstorage import HubstorageClient
hs = HubstorageClient('[REDACTED]')
project = hs.get_project('1887')
def examine_logs(job):
    """Count dataloss retry requests in a job's logs and how many ultimately failed."""
    n_dataloss_requests = 0
    n_failed_dataloss_requests = 0
    crawlera_enabled = int(job.metadata['scrapystats'].get('crawlera/request', 0))
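    # (the preview cuts off here; a hedged sketch of how the scan might
    # continue -- the log-message patterns are assumptions, not the gist's code)
    for entry in job.logs.iter_values():
        message = entry.get('message', '')
        if 'data loss' in message or 'dataloss' in message:
            if message.startswith('Retrying'):
                n_dataloss_requests += 1
            elif message.startswith('Gave up retrying'):
                n_failed_dataloss_requests += 1
    return n_dataloss_requests, n_failed_dataloss_requests, crawlera_enabled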
curita / scrapy-0.24.py
Created Mar 19, 2015
item_scrapped signal
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import scrapy
from scrapy import signals, log
from scrapy.crawler import Crawler
from scrapy.xlib.pydispatch import dispatcher
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor
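
The preview stops at the imports. A minimal sketch of how they typically fit together, mirroring the Scrapy 0.24 run-from-script pattern (the item, spider, and handler names are illustrative, not the gist's body):

class DemoItem(scrapy.Item):
    url = scrapy.Field()

class DemoSpider(scrapy.Spider):
    name = 'demo'
    start_urls = ['http://example.com']

    def parse(self, response):
        yield DemoItem(url=response.url)

def on_item_scraped(item, response, spider):
    # Fires once per item the spider yields.
    log.msg('item_scraped fired for %s' % response.url)

# pydispatch was how standalone scripts hooked signals before Scrapy 1.0.
dispatcher.connect(on_item_scraped, signal=signals.item_scraped)

crawler = Crawler(get_project_settings())
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(DemoSpider())
crawler.start()
log.start()
reactor.run()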
curita / gist:3bcd9cd062ca510f2b60
Last active Jan 26, 2017
Hubstorage collections endpoint
> curl -u $SHUB_APIKEY: https://storage.scrapinghub.com/collections/6171/s/news/count
{"nextstart":"0bfd9919d2503cce","count":151564,"scanned":152151}%
> curl -u $SHUB_APIKEY: https://storage.scrapinghub.com/collections/6171/s/news/count\?start\=0bfd9919d2503cce
{"nextstart":"15d480526db98c72","count":124668,"scanned":125124}%
> curl -u $SHUB_APIKEY: https://storage.scrapinghub.com/collections/6171/s/news/count\?start\=15d480526db98c72
{"nextstart":"2473cab00e8c9c16","count":184477,"scanned":185184}%
curita / Proposal.rst
Last active Jan 26, 2017
Google Summer of Code 2014 Proposal

Scrapy Project's Proposal
=========================

This proposal aims to add support for a new Scrapy feature, per-spider settings, which will require a significant cleanup of the core API. It is based on a careful revision of the `Sep019`_ Scrapy Enhancement Proposal draft covering this project.

Motivation
----------

curita / log.txt
Created Mar 4, 2015
Output from changing module loggers to their __name__s
➜ scrapy git:(python-logging) ✗ scrapy fetch 'https://scrapinghub.com'
2015-03-04 15:51:17+0000 [scrapy.utils.log] INFO: Scrapy 0.25.1 started (bot: scrapybot)
2015-03-04 15:51:17+0000 [scrapy.utils.log] INFO: Optional features available: ssl, http11
2015-03-04 15:51:17+0000 [scrapy.utils.log] INFO: Overridden settings: {'LOG_FILE': 'name.log'}
2015-03-04 15:51:22+0000 [scrapy.middleware] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, CoreStats, SpiderState
2015-03-04 15:51:22+0000 [scrapy.middleware] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-03-04 15:51:22+0000 [scrapy.middleware] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-03-04 15:51:22+0000 [scrapy.middleware
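
The bracketed logger names above come from the standard-library pattern the branch switches to: each module logs under its own __name__. A minimal sketch:

import logging

# Records now carry the module path (e.g. 'scrapy.middleware') instead of a
# single shared logger, so output can be filtered per component.
logger = logging.getLogger(__name__)
logger.info('Enabled extensions: %s', ['LogStats', 'TelnetConsole'])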
curita / gist:f644baf1215ddfe9d313
Last active Aug 29, 2015
Hubstorage script for dumping collections
from hubstorage import HubstorageClient
hsclient = HubstorageClient(auth=<SH_APIKEY>)
project = hsclient.get_project(<SH_PROJECT_ID:6089>)
collections = project.collections
news = collections.new_store(<COLLECTION_NAME:'news'>)
for item in news.iter_values():
    print item
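
The preview only prints to stdout; a hedged variation that actually dumps the collection to a jsonlines file (the output path is illustrative):

import json

with open('news.jl', 'w') as fp:
    for item in news.iter_values():
        fp.write(json.dumps(item) + '\n')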
curita / gist:daa2f01383231f3f0018
> sudo dmidecode -t memory
# dmidecode 2.12
SMBIOS 2.7 present.
Handle 0x0005, DMI type 5, 20 bytes
Memory Controller Information
Error Detecting Method: None
Error Correcting Capabilities:
None