Scrapy configuration
from pybloom import BloomFilter
from scrapy.utils.job import job_dir
from scrapy.dupefilter import BaseDupeFilter


class BLOOMDupeFilter(BaseDupeFilter):
    """Request duplicates filter backed by a Bloom filter."""

    def __init__(self, path=None):
        self.file = None
        # Capacity of 2,000,000 URLs with a 0.001% false-positive rate.
        self.fingerprints = BloomFilter(2000000, 0.00001)

    @classmethod
    def from_settings(cls, settings):
        return cls(job_dir(settings))

    def request_seen(self, request):
        # Use the plain URL as the fingerprint.
        fp = request.url
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)

    def close(self, reason):
        self.fingerprints = None
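A quick standalone sanity check of the filter (a sketch; pybloom must be installed and the URLs are arbitrary):

from scrapy.http import Request

dupefilter = BLOOMDupeFilter()
print(dupefilter.request_seen(Request('http://example.com/a')))   # None: first time seen
print(dupefilter.request_seen(Request('http://example.com/a')))   # True: duplicate
print(dupefilter.request_seen(Request('http://example.com/b')))   # None: new URL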
Example settings enabling breadth-first crawling and the Bloom dupefilter:
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeue.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeue.FifoMemoryQueue'
DUPEFILTER_CLASS = "project_name.bloom_filter.BLOOMDupeFilter"
1. For higher crawl quality, prefer a breadth-first strategy: set SCHEDULER_ORDER = 'BFO' in the settings.
2. Adjust the maximum number of concurrent requests per spider with CONCURRENT_REQUESTS_PER_SPIDER.
3. Increase the size of Twisted's thread pool; the default is 10 (see "Using Threads in Twisted"). In scrapy/core/manage.py, before the spider starts, add:
reactor.suggestThreadPoolSize(poolsize)
4. Enable the DNS cache to improve performance. Add to the settings: EXTENSIONS = {'scrapy.contrib.resolver.CachingResolver': 0,}
5. If you implement your own duplicate filter, make sure it can never fail: exceptions raised inside the dupefilter do not show up in the log file. It looks like the calling code wraps it in a try/except; I haven't examined that part closely.
The settings-related tips above are gathered into a single sketch right after this list.
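A minimal settings sketch collecting tips 1 to 4, assuming an old (scrapy.contrib-era) Scrapy release; the setting names are copied from the notes and may have been renamed or removed in newer versions:

# settings.py sketch for the tuning tips above (old Scrapy releases).
SCHEDULER_ORDER = 'BFO'                   # tip 1: breadth-first order
CONCURRENT_REQUESTS_PER_SPIDER = 32       # tip 2: hypothetical value, tune per site
EXTENSIONS = {
    'scrapy.contrib.resolver.CachingResolver': 0,  # tip 4: DNS caching resolver
}

# Tip 3 is not a setting: per the notes, add this in scrapy/core/manage.py
# before the spider starts (reactor.suggestThreadPoolSize is a Twisted API).
#   from twisted.internet import reactor
#   reactor.suggestThreadPoolSize(30)     # hypothetical pool size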
This follows up on my earlier post (part 1), which mainly laid out some ideas for solving performance problems. Almost half a year has passed since then, and the pages we have crawled now number well over a million. We have also made some small improvements to the crawler, such as a better link extractor, and the four problems mentioned in part 1 have each improved to varying degrees, but a few problems have still gone unsolved.
To start a spider with persistence support enabled, run it like this:
scrapy crawl somespider -s JOBDIR=crawls/somespider-1
Then you can stop the spider safely at any time (by pressing Ctrl-C or sending a signal) and resume it later by issuing the same command:
scrapy crawl somespider -s JOBDIR=crawls/somespider-1
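Per the Scrapy jobs documentation, when JOBDIR is set the spider's state dict is also persisted between runs, so simple bookkeeping survives a restart. A minimal sketch (the spider and the 'pages_seen' key are made up for illustration):

import scrapy

class SomeSpider(scrapy.Spider):
    name = 'somespider'
    start_urls = ['http://example.com/']

    def parse(self, response):
        # self.state is saved to JOBDIR on shutdown and restored on resume.
        self.state['pages_seen'] = self.state.get('pages_seen', 0) + 1
        self.log('pages seen so far: %d' % self.state['pages_seen'])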
A spider middleware that lets allowed_domains contain regular-expression patterns:
import re

from scrapy.contrib.spidermiddleware.offsite import OffsiteMiddleware


class DomainMiddleware(OffsiteMiddleware):
    """OffsiteMiddleware variant that treats allowed_domains entries as
    regex fragments instead of escaping the dots."""

    def get_host_regex(self, spider):
        allowed_domains = getattr(spider, 'allowed_domains', None)
        if not allowed_domains:
            # No restriction: the empty pattern matches every host.
            return re.compile('')
        # Unlike the stock middleware, do not escape '.', so entries such as
        # r'example\.(com|org)' can be used directly.
        regex = r'^(.*\.)?(%s)$' % '|'.join(allowed_domains)
        return re.compile(regex)
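A usage sketch for the middleware, assuming it is saved as project_name/middlewares.py (the module path and the spider are made up); the stock OffsiteMiddleware is disabled so the regex-aware version takes its default slot:

# settings.py
SPIDER_MIDDLEWARES = {
    'scrapy.contrib.spidermiddleware.offsite.OffsiteMiddleware': None,
    'project_name.middlewares.DomainMiddleware': 500,
}

# spider module: allowed_domains entries may now be regex fragments.
import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = [r'example\.(com|org)']
    start_urls = ['http://example.com/']

    def parse(self, response):
        pass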
Extracting the visible text of a page with BeautifulSoup, stripping comments, scripts, and styles:
import scrapy
from bs4 import BeautifulSoup, Comment


def extract_visible_text(response):
    """Return the text of an HTML response with comments, scripts, and styles removed."""
    if not isinstance(response, scrapy.http.HtmlResponse):
        return None
    soup = BeautifulSoup(response.body, 'html.parser')
    # Drop HTML comments.
    for comment in soup.findAll(text=lambda text: isinstance(text, Comment)):
        comment.extract()
    # Drop <script> and <style> blocks.
    for s in soup('script'):
        s.extract()
    for s in soup('style'):
        s.extract()
    return soup.get_text()
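A usage sketch calling the helper from a spider's parse callback, assuming it sits in the same module as the helper above (the spider itself is made up):

class TextSpider(scrapy.Spider):
    name = 'text_spider'
    start_urls = ['http://example.com/']

    def parse(self, response):
        text = extract_visible_text(response)
        if text:
            self.log('extracted %d characters from %s' % (len(text), response.url))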