AlloVince/scrapy.md

## scrapy.md

      
How to Write a Web Crawler, Seriously

2018.8 @AlloVince

What are the well-known crawler frameworks?


Scrapy (Python) 
pyspider (Python) 
Nutch (Java) 
colly (Go)


More:

awesome-crawler  

Furthermore:


Shenjianshou (神箭手)
Scrapinghub


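What problems does a framework solve?


Asynchronous task scheduling
Crawl-ordering algorithms
URL deduplication and management
Scaffolding for data ETL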
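
As a rough illustration of two of these concerns (crawl ordering and URL deduplication), the sketch below walks a tiny in-memory link graph. `LINKS`, `normalize` and `crawl` are hypothetical names for this example only; a real framework fetches pages over the network and schedules the requests asynchronously.

```python
from collections import deque
from urllib.parse import urldefrag

# Hypothetical link graph standing in for real pages: URL -> outgoing links.
LINKS = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b#top"],
    "https://example.com/a": ["https://example.com/b", "https://example.com/"],
    "https://example.com/b": ["https://example.com/c"],
    "https://example.com/c": [],
}


def normalize(url):
    """Drop the #fragment so trivially different URLs dedupe to one entry."""
    return urldefrag(url).url


def crawl(start, breadth_first=True):
    """Return the visit order; a FIFO frontier gives BFS, a LIFO one gives DFS."""
    frontier = deque([normalize(start)])
    seen = {normalize(start)}  # URL dedup: every URL is enqueued at most once
    order = []
    while frontier:
        url = frontier.popleft() if breadth_first else frontier.pop()
        order.append(url)
        for link in LINKS.get(url, []):
            link = normalize(link)
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order


bfs_order = crawl("https://example.com/")
dfs_order = crawl("https://example.com/", breadth_first=False)
```

Only the end of the queue that is popped changes between the two orders; the `seen` set is what keeps the cycle between `/` and `/a` from looping forever.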


Scrapy

An open source and collaborative framework for extracting the data you need from websites.

Crawling Douban Movie in about 8 lines of code

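from scrapy import Field, Item
from scrapy.http import Response
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class RawHtmlItem(Item):  # item that carries the unparsed page
    url = Field()
    html = Field()

class DoubanMovieSpider(CrawlSpider):
    name = 'douban_movie'  # Scrapy requires every spider to have a name
    start_urls = ['https://movie.douban.com/chart']
    rules = (
        Rule(LinkExtractor(allow=r'^https://movie\.douban\.com/(typerank|celebrity|tag)/.*'), follow=True),
        Rule(LinkExtractor(allow=r'^https://movie\.douban\.com/subject/\d+/$'), follow=True, callback='handle_item'),
    )
    def handle_item(self, response: Response) -> RawHtmlItem:
        return RawHtmlItem(url=response.url, html=response.text)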

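Scrapy's built-in features


Crawling

Event-driven asynchronous task management
Automatic handling of duplicate URLs
Depth-first / breadth-first crawl ordering
XPath-based DOM parsing


Architecture

Rich set of middlewares / pipelines
Shell console for debug & dev
Automatic encoding handling
Telnet console and signals support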
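
On crawl ordering: Scrapy's scheduler queues are LIFO by default, so it crawls roughly depth-first. The Scrapy FAQ describes switching to breadth-first order through settings; the fragment below is a sketch of such a `settings.py`, and the `DEPTH_LIMIT` value is an arbitrary assumption for illustration.

```python
# settings.py (fragment): crawl breadth-first instead of the default depth-first.
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = "scrapy.squeues.PickleFifoDiskQueue"
SCHEDULER_MEMORY_QUEUE = "scrapy.squeues.FifoMemoryQueue"

# Assumed value: stop following links more than 3 hops away from the start URLs.
DEPTH_LIMIT = 3
```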


How Scrapy works


Twisted

An event-driven networking framework.

event loop

core: reactor
event dispatcher: deferred
communication: TCP/UDP/Process
I/O handler: thread pool


Notes


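Event loop: a program architecture; the concrete way event-driven programming is implemented
Reactor: a design pattern that exposes one common interface for asynchronous events; underneath, it picks a different asynchronous API per operating system (epoll by default on Linux)
Thread pool: handles blocking I/O operations; at most 10 threads by default, with the work deferred off the event loop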
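
To make the reactor idea concrete, here is a toy version built on the standard library's `selectors` module rather than on Twisted. `MiniReactor` and its methods are hypothetical names for this sketch; Twisted's real reactor does far more (timers, writers, the thread pool, Deferreds).

```python
import selectors
import socket


class MiniReactor:
    """Toy reactor: maps readable sockets to callbacks and dispatches events."""

    def __init__(self):
        # DefaultSelector picks the most efficient mechanism the OS offers
        # (epoll on Linux, kqueue on BSD/macOS, select elsewhere).
        self.selector = selectors.DefaultSelector()
        self.running = False

    def add_reader(self, sock, callback):
        self.selector.register(sock, selectors.EVENT_READ, callback)

    def remove_reader(self, sock):
        self.selector.unregister(sock)

    def stop(self):
        self.running = False

    def run(self):
        """The event loop: wait for events, then call the registered handler."""
        self.running = True
        while self.running:
            for key, _mask in self.selector.select(timeout=1):
                key.data(key.fileobj)  # key.data is the callback we registered
        self.selector.close()


received = []
reactor = MiniReactor()
left, right = socket.socketpair()  # two connected sockets inside one process


def on_readable(sock):
    received.append(sock.recv(1024))
    reactor.remove_reader(sock)
    reactor.stop()


reactor.add_reader(right, on_readable)
left.sendall(b"ping")
reactor.run()  # returns once on_readable has called stop()
left.close()
right.close()
```

The calling code never blocks on `recv` itself; it only registers interest and lets the loop call back when data is ready, which is the same inversion of control the Twisted reactor provides.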


Synchronous, blocking

import time
time.sleep(3)
print('Hello world')

Asynchronous with Twisted

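import time

from twisted.internet import reactor

def aSillyBlockingMethod(x):
    time.sleep(3)  # blocks a pool thread, not the reactor's event loop
    print(x)
    reactor.callFromThread(reactor.stop)  # stop the loop so reactor.run() returns

reactor.callInThread(aSillyBlockingMethod, "3 seconds have passed")
reactor.run()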

Python 3.4 asynchronous I/O

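import asyncio

# Generator-based coroutines: @asyncio.coroutine was deprecated in Python 3.8
# and removed in 3.11, so this form only runs on older interpreters.
@asyncio.coroutine
def hello():
    yield from asyncio.sleep(1)
    print("hello world")

loop = asyncio.get_event_loop()
loop.run_until_complete(hello())
loop.close()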

Python 3.5 async / await

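import asyncio

async def hello():
    await asyncio.sleep(1)  # yields to the event loop instead of blocking
    print("Hello world!")

# asyncio.run() needs Python 3.7+; on 3.5/3.6 use loop.run_until_complete(hello()).
asyncio.run(hello())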
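
Why this matters for a crawler: while one request waits on the network, the event loop can make progress on others. The sketch below simulates three requests with `asyncio.sleep`; `fake_fetch` and the page names are hypothetical stand-ins, and no real network I/O happens.

```python
import asyncio
import time


async def fake_fetch(url, delay):
    """Stand-in for a network request; asyncio.sleep simulates waiting on I/O."""
    await asyncio.sleep(delay)
    return url


async def main():
    urls = ["page-1", "page-2", "page-3"]  # hypothetical page names
    # All three "requests" wait at the same time on a single thread.
    return await asyncio.gather(*(fake_fetch(u, 0.2) for u in urls))


started = time.perf_counter()
pages = asyncio.run(main())
elapsed = time.perf_counter() - started
print(pages, round(elapsed, 1))
```

The three waits overlap, so the total should stay close to a single 0.2 s delay rather than the 0.6 s a sequential loop would need.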

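Strengths of Scrapy


High level of abstraction, with very well-designed components
Covers roughly 80% of commonly needed crawler features
Mature community, easy to extend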


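Problems with Scrapy


Concurrency is confined to one process, so out of the box it only runs on a single machine
Crawling and ETL are coupled
Twisted adds complexity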


Problems frequently met in production


API / Ajax crawling
Anti-crawling measures
Full and incremental crawling
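
For the full versus incremental distinction, one possible approach (an assumption for illustration, not a description of any particular project) is to remember a content fingerprint per URL and reprocess only the pages whose fingerprint changed. `fingerprint` and `select_changed` are hypothetical helpers.

```python
import hashlib


def fingerprint(html):
    """Stable digest of the page body, used to detect changes between runs."""
    return hashlib.sha1(html.encode("utf-8")).hexdigest()


def select_changed(pages, known, full=False):
    """Return URLs to (re)process and update the `known` fingerprints in place.

    full=True reprocesses everything; otherwise only new or changed pages.
    """
    changed = []
    for url, html in pages.items():
        digest = fingerprint(html)
        if full or known.get(url) != digest:
            changed.append(url)
        known[url] = digest
    return changed


known = {}  # in production this would live in a database or key-value store
first_run = select_changed({"/subject/1/": "<p>v1</p>", "/subject/2/": "<p>v1</p>"}, known)
second_run = select_changed({"/subject/1/": "<p>v2</p>", "/subject/2/": "<p>v1</p>"}, known)
full_run = select_changed({"/subject/1/": "<p>v2</p>", "/subject/2/": "<p>v1</p>"}, known, full=True)
```

The first run sees everything as new, the second picks up only `/subject/1/`, and the full run ignores the stored fingerprints.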


Making Scrapy distributed

scrapy-redis

How it works


Scrapy Engine

Dupefilter <--> Redis Dupefilter


Scheduler <--> Redis Scheduler
Downloader
Spiders <--> RedisCrawlSpider
Item Pipeline <--> Redis Item Pipeline
Downloader middlewares
Spider middlewares
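
The component swaps above are done mostly through settings. The fragment below follows the settings names documented in the scrapy-redis README; the Redis URL is an assumption for a local setup, and the pipeline priority of 300 is just a conventional choice.

```python
# settings.py (fragment) for scrapy-redis.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"              # shared request queue in Redis
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"  # shared request fingerprints
SCHEDULER_PERSIST = True                                    # keep the queue between runs
ITEM_PIPELINES = {
    "scrapy_redis.pipelines.RedisPipeline": 300,            # push scraped items to Redis
}
REDIS_URL = "redis://localhost:6379"                        # assumed local Redis instance
```

With the queue and the dupefilter both living in Redis, several Scrapy processes on different machines can pull from the same frontier without fetching the same URL twice.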


Separating crawling from ETL

When the two are coupled:

Debugging is hard
Retrying is hard
The database becomes the bottleneck
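
A minimal sketch of the separation, assuming raw pages are written to local disk first and parsed by a separate step later. `store_raw`, `extract_titles`, the JSON file layout and the example URLs are all hypothetical; the regex title extraction is only a placeholder for real parsing.

```python
import hashlib
import json
import re
import tempfile
from pathlib import Path


def store_raw(root, url, html):
    """Crawl step: persist the untouched response so it can be re-parsed later."""
    name = hashlib.sha1(url.encode("utf-8")).hexdigest() + ".json"
    path = Path(root) / name
    path.write_text(json.dumps({"url": url, "html": html}), encoding="utf-8")
    return path


def extract_titles(root):
    """ETL step: runs separately, reads stored pages, never touches the network."""
    titles = {}
    for path in sorted(Path(root).glob("*.json")):
        record = json.loads(path.read_text(encoding="utf-8"))
        match = re.search(r"<title>(.*?)</title>", record["html"], re.S)
        titles[record["url"]] = match.group(1).strip() if match else None
    return titles


with tempfile.TemporaryDirectory() as root:
    store_raw(root, "https://example.com/subject/1/", "<html><title>First</title></html>")
    store_raw(root, "https://example.com/subject/2/", "<html><body>no title</body></html>")
    titles = extract_titles(root)
```

Because the ETL step works from stored files, a parsing bug can be fixed and rerun without fetching anything again, and the database only sees the finished records.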


link

EvaScrapy


Distributed crawling: scrapy-redis
Full and incremental crawling
Storage on local disk or OSS
Message-queue notifications
Docker support

TODO


How to deal with anti-crawling measures

Rate limiting
Proxy pool
Tunneling
Anything else?


Ajax content
API crawling
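
For the rate-limiting and proxy-pool items, a rough sketch under stated assumptions: Scrapy's built-in `HttpProxyMiddleware` honours `request.meta['proxy']`, so a small downloader middleware can pick a proxy per request. The class name `RandomProxyMiddleware` and the `PROXY_POOL` setting are hypothetical, and the throttling values are arbitrary.

```python
import random


class RandomProxyMiddleware:
    """Downloader-middleware sketch: route each request through a random proxy."""

    def __init__(self, proxies):
        self.proxies = list(proxies)

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_POOL is an assumed custom setting, not a Scrapy built-in.
        return cls(crawler.settings.getlist("PROXY_POOL"))

    def process_request(self, request, spider):
        # Leave requests alone if a proxy was already chosen for them.
        if self.proxies and "proxy" not in request.meta:
            request.meta["proxy"] = random.choice(self.proxies)
        return None  # None tells Scrapy to keep processing the request


# settings.py (fragment): built-in rate-limiting knobs; the values are assumptions.
DOWNLOAD_DELAY = 1.0                  # seconds between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 2
AUTOTHROTTLE_ENABLED = True           # adapt the delay to observed latency
```

To take effect, such a middleware would be registered in `DOWNLOADER_MIDDLEWARES` with an order below 750, the default slot of `HttpProxyMiddleware`, so the proxy is set before the built-in middleware reads it.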


Q & A