gustavorps/awesome-crawler.md Secret

## awesome-crawler.md

      
    Awesome-crawler 

A collection of awesome web crawlers, spiders, and resources in different languages.
Contents


Python
Java
C#
JavaScript
PHP
C++
C
Ruby
R
Erlang
Perl
Go
Scala

Python


| Name | Description |
| --- | --- |
| Scrapy | A fast high-level screen scraping and web crawling framework. |
| django-dynamic-scraper | Creating Scrapy scrapers via the Django admin interface. |
| Scrapy-Redis | Redis-based components for Scrapy. |
| scrapy-cluster | Uses Redis and Kafka to create a distributed on-demand scraping cluster. |
| distribute_crawler | Uses Scrapy, Redis, MongoDB, and Graphite to create a distributed spider. |
| pyspider | A powerful spider system. |
| CoCrawler | A versatile web crawler built using modern tools and concurrency. |
| cola | A distributed crawling framework. |
| Demiurge | PyQuery-based scraping micro-framework. |
| Scrapely | A pure-Python HTML screen-scraping library. |
| feedparser | Feed parser. |
| you-get | Dumb downloader that scrapes the web. |
| Grab | Site scraping framework. |
| MechanicalSoup | A Python library for automating interaction with websites. |
| portia | Visual scraping for Scrapy. |
| crawley | Pythonic crawling / scraping framework based on non-blocking I/O operations. |
| RoboBrowser | A simple, Pythonic library for browsing the web without a standalone web browser. |
| MSpider | A simple, easy spider using gevent and JS rendering. |
| brownant | A lightweight web data extracting framework. |
| PSpider | A simple spider framework in Python 3. |
| Gain | Web crawling framework based on asyncio for everyone. |
| sukhoi | Minimalist and powerful web crawler. |
| spidy | The simple, easy-to-use command-line web crawler. |
| newspaper | News, full-text, and article metadata extraction in Python 3. |
| aspider | An async web scraping micro-framework based on asyncio. |


Java


| Name | Description |
| --- | --- |
| ACHE Crawler | An easy-to-use web crawler for domain-specific search. |
| Apache Nutch | Highly extensible, highly scalable web crawler for production environments. |
| anthelion | A plugin for Apache Nutch to crawl semantic annotations within HTML pages. |
| Crawler4j | Simple and lightweight web crawler. |
| JSoup | Scrapes, parses, manipulates and cleans HTML. |
| websphinx | Website-Specific Processors for HTML information extraction. |
| Open Search Server | A full set of search functions. Build your own indexing strategy. Parsers extract full-text data. The crawlers can index everything. |
| Gecco | An easy-to-use, lightweight web crawler. |
| WebCollector | Simple interfaces for crawling the web; you can set up a multi-threaded web crawler in less than 5 minutes. |
| Webmagic | A scalable crawler framework. |
| Spiderman | A scalable, extensible, multi-threaded web crawler. |
| Spiderman2 | A distributed web crawler framework that supports JS rendering. |
| Heritrix3 | Extensible, web-scale, archival-quality web crawler project. |
| SeimiCrawler | An agile, distributed crawler framework. |
| StormCrawler | An open source collection of resources for building low-latency, scalable web crawlers on Apache Storm. |
| Spark-Crawler | Evolving Apache Nutch to run on Spark. |
| webBee | A DFS web spider. |
| spider-flow | A visual spider framework that lets you crawl websites without writing any code. |


C#


| Name | Description |
| --- | --- |
| ccrawler | Built in C# 3.5. It contains a simple web content categorizer extension, which can separate web pages depending on their content. |
| SimpleCrawler | Simple spider based on multithreading and regular expressions. |
| DotnetSpider | A cross-platform, lightweight spider developed in C#. |
| Abot | C# web crawler built for speed and flexibility. |
| Hawk | Advanced crawler and ETL tool written in C#/WPF. |
| SkyScraper | An asynchronous web scraper / web crawler using async / await and Reactive Extensions. |
| Infinity Crawler | A simple but powerful web crawler library in C#. |


JavaScript


| Name | Description |
| --- | --- |
| scraperjs | A complete and versatile web scraper. |
| scrape-it | A Node.js scraper for humans. |
| simplecrawler | Event-driven web crawler. |
| node-crawler | Node-crawler has a clean, simple API. |
| js-crawler | Web crawler for Node.js; both HTTP and HTTPS are supported. |
| webster | A reliable web crawling framework which can scrape AJAX- and JS-rendered content in a web page. |
| x-ray | Web scraper with pagination and crawler support. |
| node-osmosis | HTML/XML parser and web scraper for Node.js. |
| web-scraper-chrome-extension | Web data extraction tool implemented as a Chrome extension. |
| supercrawler | Define custom handlers to parse content. Obeys robots.txt, rate limits and concurrency limits. |
| headless-chrome-crawler | Headless Chrome crawls with jQuery support. |
| Squidwarc | High-fidelity, user-scriptable, archival crawler that uses Chrome or Chromium with or without a head. |


PHP


| Name | Description |
| --- | --- |
| Goutte | A screen scraping and web crawling library for PHP. |
| laravel-goutte | Laravel 5 Facade for Goutte. |
| dom-crawler | The DomCrawler component eases DOM navigation for HTML and XML documents. |
| QueryList | The progressive PHP crawler framework. |
| pspider | Parallel web crawler written in PHP. |
| php-spider | A configurable and extensible PHP web spider. |
| spatie/crawler | An easy-to-use, powerful crawler implemented in PHP. Can execute JavaScript. |
| crawlzone/crawlzone | Crawlzone is a fast asynchronous internet crawling framework for PHP. |


C++


| Name | Description |
| --- | --- |
| open-source-search-engine | A distributed open source search engine and spider/crawler written in C/C++. |


C


| Name | Description |
| --- | --- |
| httrack | Copy websites to your computer. |


Ruby


| Name | Description |
| --- | --- |
| Nokogiri | A Rubygem providing HTML, XML, SAX, and Reader parsers with XPath and CSS selector support. |
| upton | A batteries-included framework for easy web-scraping. Just add CSS (or do more). |
| wombat | Lightweight Ruby web crawler/scraper with an elegant DSL which extracts structured data from pages. |
| RubyRetriever | RubyRetriever is a web crawler, scraper & file harvester. |
| Spidr | Spider a site, multiple domains, certain links, or infinitely. |
| Cobweb | Web crawler with very flexible crawling options, standalone or using Sidekiq. |
| mechanize | Automated web interaction & crawling. |


R


| Name | Description |
| --- | --- |
| rvest | Simple web scraping for R. |


Erlang


| Name | Description |
| --- | --- |
| ebot | A scalable, distributed, and highly configurable web crawler. |


Perl


| Name | Description |
| --- | --- |
| web-scraper | Web scraping toolkit using HTML and CSS selectors or XPath expressions. |


Go


| Name | Description |
| --- | --- |
| pholcus | A distributed, high-concurrency, and powerful web crawler. |
| gocrawl | Polite, slim and concurrent web crawler. |
| fetchbot | A simple and flexible web crawler that follows the robots.txt policies and crawl delays. |
| go_spider | An awesome Go concurrent crawler (spider) framework. |
| dht | BitTorrent DHT Protocol && DHT Spider. |
| ants-go | An open-source, distributed, RESTful crawler engine in Go. |
| scrape | A simple, higher-level interface for Go web scraping. |
| creeper | The Next Generation Crawler Framework (Go). |
| colly | Fast and elegant scraping framework for Gophers. |
| ferret | Declarative web scraping. |
| Dataflow kit | Extract structured data from web pages. Website scraping. |
| Hakrawler | Simple, fast web crawler designed for easy, quick discovery of endpoints and assets within a web application. |


Scala


| Name | Description |
| --- | --- |
| crawler | Scala DSL for web crawling. |
| scrala | Scala crawler (spider) framework, inspired by Scrapy. |
| ferrit | Ferrit is a web crawler service written in Scala using Akka, Spray and Cassandra. |