Scrapy configuration notes
from pybloom import BloomFilter
from scrapy.utils.job import job_dir
from scrapy.dupefilter import BaseDupeFilter


class BLOOMDupeFilter(BaseDupeFilter):
    """Request fingerprint duplicates filter backed by a Bloom filter."""

    def __init__(self, path=None):
        self.file = None
        # Capacity of 2,000,000 URLs with a 0.001% false-positive rate.
        self.fingerprints = BloomFilter(2000000, 0.00001)

    @classmethod
    def from_settings(cls, settings):
        return cls(job_dir(settings))

    def request_seen(self, request):
        # Use the raw URL as the fingerprint.
        fp = request.url
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)

    def close(self, reason):
        self.fingerprints = None
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeue.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeue.FifoMemoryQueue'
DUPEFILTER_CLASS = "project_name.bloom_filter.BLOOMDupeFilter"
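With DEPTH_PRIORITY = 1 and the FIFO queues above, the scheduler crawls breadth-first, and DUPEFILTER_CLASS registers the Bloom-filter dupefilter defined above. A minimal sanity check of the filter outside of a crawl might look like the sketch below; it assumes pybloom and Scrapy are installed, and the import path is the placeholder project_name.bloom_filter used in the setting.

from scrapy.http import Request
from project_name.bloom_filter import BLOOMDupeFilter  # placeholder module path

df = BLOOMDupeFilter()
req = Request("http://example.com/page/1")
assert not df.request_seen(req)  # first visit: not filtered (returns None)
assert df.request_seen(req)      # second visit: filtered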
1. For higher crawl quality, prefer a breadth-first strategy: set SCHEDULER_ORDER = 'BFO' in the settings.
2. Adjust the maximum number of concurrent requests per spider, CONCURRENT_REQUESTS_PER_SPIDER.
3. Increase Twisted's thread pool size; the default is 10 (see "Using Threads in Twisted").
Before the spider starts in scrapy/core/manage.py, add
reactor.suggestThreadPoolSize(poolsize)
4. Enable the DNS cache to improve performance:
add EXTENSIONS = {'scrapy.contrib.resolver.CachingResolver': 0,} to the settings.
5. If you implement your own duplicate filter, make sure it keeps working at all times: exceptions raised inside the dupefilter never show up in the log file. It looks like the caller wraps it in a try/except (I haven't read that part closely), so a broken filter fails silently. See the defensive sketch after this list.
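Since dupefilter errors are swallowed, one option is to log them yourself before they disappear. A sketch, assuming the BLOOMDupeFilter class defined above; the SafeBLOOMDupeFilter name is only for illustration:

import logging

logger = logging.getLogger(__name__)


class SafeBLOOMDupeFilter(BLOOMDupeFilter):
    """Wraps request_seen so internal errors are logged instead of vanishing."""

    def request_seen(self, request):
        try:
            return super(SafeBLOOMDupeFilter, self).request_seen(request)
        except Exception:
            # Log the failure and treat the request as unseen rather than failing silently.
            logger.exception("dupefilter failed for %s", request.url)
            return False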
This is a follow-up to part (one), which was mainly about approaches to performance problems. Almost half a year has passed since then, and the pages we have crawled number well over a million. We have also made some small improvements to the crawler, such as a better link extractor, and the four problems mentioned in part (one) have all improved to varying degrees, but a few issues still remain unsolved.
To start a spider with persistence support enabled, run it like this:
scrapy crawl somespider -s JOBDIR=crawls/somespider-1
Then, you can stop the spider safely at any time (by pressing Ctrl-C or sending a signal), and resume it later by issuing the same command:
scrapy crawl somespider -s JOBDIR=crawls/somespider-1
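Besides the scheduler queue and dupefilter state, JOBDIR also lets a spider keep its own data between runs through its state dict. A small sketch; the spider name and URL are placeholders:

import scrapy


class SomeSpider(scrapy.Spider):
    name = "somespider"
    start_urls = ["http://example.com"]

    def parse(self, response):
        # self.state is persisted to JOBDIR between runs of the same job.
        self.state["pages_seen"] = self.state.get("pages_seen", 0) + 1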
import re

from scrapy.contrib.spidermiddleware.offsite import OffsiteMiddleware


class DomainMiddleware(OffsiteMiddleware):
    """Offsite middleware whose allowed_domains entries are treated as regex fragments."""

    def get_host_regex(self, spider):
        allowed_domains = getattr(spider, 'allowed_domains', None)
        if not allowed_domains:
            # No restriction: allow every host.
            return re.compile('')
        # Unlike the stock OffsiteMiddleware, dots are NOT escaped here,
        # so entries may contain regex syntax.
        domains = allowed_domains
        regex = r'^(.*\.)?(%s)$' % '|'.join(domains)
        return re.compile(regex)
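To use this in place of the stock offsite middleware, it has to be registered in the spider middlewares. A sketch, assuming the class lives at project_name.middlewares.DomainMiddleware (the module path is a placeholder):

# settings.py
SPIDER_MIDDLEWARES = {
    'scrapy.contrib.spidermiddleware.offsite.OffsiteMiddleware': None,  # disable the default
    'project_name.middlewares.DomainMiddleware': 500,  # placeholder module path
}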
import scrapy
from bs4 import BeautifulSoup, Comment


def extract_text(response):
    """Strip comments, scripts and styles and return the visible text of an HTML page."""
    if not isinstance(response, scrapy.http.HtmlResponse):
        return None
    soup = BeautifulSoup(response.body, 'html.parser')
    # Remove HTML comments, then <script> and <style> blocks.
    for comment in soup.find_all(text=lambda text: isinstance(text, Comment)):
        comment.extract()
    for s in soup('script'):
        s.extract()
    for s in soup('style'):
        s.extract()
    return soup.get_text()
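A sketch of how this could be called from a spider callback; the extract_text wrapper above, the spider name and the URL are placeholders, not part of the original snippet:

import scrapy


class TextSpider(scrapy.Spider):
    name = "textspider"
    start_urls = ["http://example.com"]

    def parse(self, response):
        text = extract_text(response)
        if text:
            yield {"url": response.url, "text": text}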