Vostretsov Nikita (whalebot-helmsman)

  • Scrapinghub
  • Santa Ana
from collections import defaultdict
import sys

import dask
from dask.distributed import Client


def s0(x):
    print(f's0 {x}')
    return x


def iterate(n, m, j, k):
    for i1 in range(n):
        for i2 in range(m):
            for i3 in range(j):
                for i4 in range(k):
                    yield (i1, i2, i3, i4)


for record in iterate(10, 20, 30, 40):
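The preview stops at the loop header. A hedged sketch of one way the pieces above could be combined, fanning the tuples out to a dask.distributed cluster; the submit/gather pattern and the smaller ranges are assumptions, not part of the gist:

client = Client()  # spins up a local cluster by default

futures = [client.submit(s0, record) for record in iterate(2, 2, 2, 2)]
print(len(client.gather(futures)))  # 16 results, one per tuple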
import scrapy


class Redirect(scrapy.Spider):
    name = 'redirect'
    start_urls = ['http://www.wikipedia.net/wiki/Hello']

    def parse(self, response):
        self.logger.info(response.meta.get('redirect_urls', []))
        self.logger.info(response.url)
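A hedged way to run the spider above from a script (CrawlerProcess is standard Scrapy, not part of the gist): the wikipedia.net URL redirects, so the first log line shows the intermediate URLs accumulated in response.meta['redirect_urls'] and the second the final response.url.

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess()
process.crawl(Redirect)
process.start()  # blocks until the crawl finishes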
<!DOCTYPE html>
<html>
<head>
    <title>Form test</title>
</head>
<body>
    <form action="http://example.com/f?a=1&amp;d=2" method="get">
        Get Form<br>
        a:<br>
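The fixture is cut off mid-form, but it already covers the interesting case: a GET form whose action URL carries its own query string (per HTML form-submission rules, a GET submit replaces that query string with the form data). A hedged sketch of exercising it with Scrapy's FormRequest.from_response; the saved filename and the formdata value are assumptions:

from scrapy.http import FormRequest, HtmlResponse

body = open('form_test.html', 'rb').read()  # the fixture above, saved locally
response = HtmlResponse(url='http://example.com/', body=body)
request = FormRequest.from_response(response, formdata={'a': '3'})
print(request.url)  # shows which of a=1, d=2 and a=3 end up in the URL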
whalebot-helmsman / py27DAPQ_serve
Last active December 5, 2018 10:19
Results of scrapy-bench for new priority queues
Executing scrapy-bench --n-runs 1 broadworm in /home/nikita/ves/scrapy-bench-2.7/
/home/nikita/ves/scrapy-bench-2.7/local/lib/python2.7/site-packages/cryptography/hazmat/primitives/constant_time.py:26: CryptographyDeprecationWarning: Support for your Python version is deprecated. The next version of cryptography will remove support. Please upgrade to a 2.7.x release that supports hmac.compare_digest as soon as possible.
utils.DeprecatedIn23,
2018-12-04 16:16:22 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: broadspider)
2018-12-04 16:16:22 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.1, w3lib 1.19.0, Twisted 18.9.0, Python 2.7.6 (default, Nov 23 2017, 15:49:48) - [GCC 4.8.4], pyOpenSSL 18.0.0 (OpenSSL 1.1.0j 20 Nov 2018), cryptography 2.4.2, Platform Linux-4.4.0-134-generic-x86_64-with-Ubuntu-14.04-trusty
2018-12-04 16:16:22 [scrapy.crawler] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'broad.spiders', 'CLOSESPIDER_ITEMCOUNT': 800, 'FEED_URI': 'items.c
whalebot-helmsman / py27n5bracnchDAPQ
Created November 29, 2018 10:31
scrapy-bench for the round-robin and downloader-aware priority queues
Executing scrapy-bench --n-runs 5 --book_url http://localhost:8000/ bookworm in /home/ec2-user/ves/scrapy-bench-2.7/
2018-11-29 10:17:09 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: books)
2018-11-29 10:17:09 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.1, w3lib 1.19.0, Twisted 18.9.0, Python 2.7.5 (default, Sep 12 2018, 05:31:16) - [GCC 4.8.5 20150623 (Red Hat 4.8.5-36)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0j 20 Nov 2018), cryptography 2.4.2, Platform Linux-3.10.0-693.el7.x86_64-x86_64-with-redhat-7.4-Maipo
2018-11-29 10:17:09 [scrapy.crawler] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'books.spiders', 'CLOSESPIDER_ITEMCOUNT': 1000, 'FEED_URI': 'items.csv', 'LOG_LEVEL': 'INFO', 'MEMDEBUG_ENABLED': True, 'CONCURRENT_REQUESTS': 120, 'RETRY_ENABLED': False, 'SPIDER_MODULES': ['books.spiders'], 'BOT_NAME': 'books', 'LOGSTATS_INTERVAL': 3, 'FEED_FORMAT': 'csv', 'SCHEDULER_PRIORITY_QUEUE': 'scrapy.pqueues.DownloaderAwarePriorityQueue'}
2018-11-29 10:17:09 [scr
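The queue being benchmarked is visible in the overridden settings above; a minimal settings.py fragment selecting it, with the class path copied verbatim from the log:

# settings.py: opt into the downloader-aware priority queue
SCHEDULER_PRIORITY_QUEUE = 'scrapy.pqueues.DownloaderAwarePriorityQueue'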
whalebot-helmsman / asyncio_ex.py
Last active September 22, 2018 07:53
Mix asyncio with Twisted
import asyncio
import logging


async def start():
    logging.warning('started')
    await asyncio.sleep(2)
    logging.warning('finished')
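The preview shows only the asyncio half. A hedged sketch of the Twisted side of the mix (the wiring is an assumption, not from the gist): install the asyncio-backed reactor, wrap the coroutine in a Deferred, and stop the reactor when it finishes.

from twisted.internet import asyncioreactor
asyncioreactor.install()  # must run before anything else imports the reactor

from twisted.internet import defer, reactor

d = defer.Deferred.fromFuture(asyncio.ensure_future(start()))
d.addCallback(lambda _: reactor.stop())
reactor.run()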
whalebot-helmsman / branch.log
Created August 28, 2018 09:56
Performance comparison
Executing scrapy-bench --n-runs 10 --book_url http://localhost:8080/books.toscrape.com/ bookworm in /home/nikita/ves/scrapy-bench
2018-08-28 09:07:03 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: books)
2018-08-28 09:07:03 [scrapy.utils.log] INFO: Versions: lxml 4.2.4.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.7.0, Python 3.6.3 (default, Jun 4 2018, 10:24:41) - [GCC 4.8.4], pyOpenSSL 18.0.0 (OpenSSL 1.1.0i 14 Aug 2018), cryptography 2.3.1, Platform Linux-4.4.0-96-generic-x86_64-with-debian-jessie-sid
2018-08-28 09:07:03 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'books', 'CLOSESPIDER_ITEMCOUNT': 1000, 'CONCURRENT_REQUESTS': 120, 'FEED_FORMAT': 'csv', 'FEED_URI': 'items.csv', 'LOGSTATS_INTERVAL': 3, 'LOG_LEVEL': 'INFO', 'MEMDEBUG_ENABLED': True, 'NEWSPIDER_MODULE': 'books.spiders', 'RETRY_ENABLED': False, 'SPIDER_MODULES': ['books.spiders']}
2018-08-28 09:07:03 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.e
This file has been truncated.
0
SECTION
2
HEADER
9
$ACADVER
1
AC1015
9
$ACADMAINTVER
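The preview above is the start of an ASCII DXF drawing: the format is a flat stream of tag pairs, an integer group code line followed by a value line (0/SECTION opens a section, 2/HEADER names it, 9/$ACADVER introduces a header variable, and AC1015 is the AutoCAD 2000 file version). A minimal sketch of walking such pairs in Python; the filename is hypothetical, and real parsers such as ezdxf handle far more:

def read_dxf_pairs(path):
    # ASCII DXF alternates a group-code line with a value line
    with open(path) as f:
        while True:
            code = f.readline()
            value = f.readline()
            if not value:  # EOF
                return
            yield int(code), value.strip()

for code, value in read_dxf_pairs('drawing.dxf'):
    print(code, value)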
import multiprocessing

import twisted.web.http
from twisted.internet import reactor  # importing the reactor module installs the default one


def run_mock(q):
    # Mock HTTP server; port 0 lets the OS pick a free ephemeral port
    factory = twisted.web.http.HTTPFactory()
    port = reactor.listenTCP(0, factory)
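    # A hedged completion of the truncated preview (everything past
    # listenTCP is an assumption): report the ephemeral port back through
    # the queue, then run the reactor in this process.
    q.put(port.getHost().port)  # IListeningPort -> address -> port number
    reactor.run()


# Hypothetical parent-side usage: start the mock in a child process and
# learn which port it picked.
if __name__ == '__main__':
    q = multiprocessing.Queue()
    multiprocessing.Process(target=run_mock, args=(q,)).start()
    print('mock HTTP server on port', q.get())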