A collection of awesome web crawlers, spiders, and resources in different languages.
Name | Description | Last commit |
---|---|---|
Scrapy | A fast high-level screen scraping and web crawling framework. | |
django-dynamic-scraper | Creating Scrapy scrapers via the Django admin interface. | |
Scrapy-Redis | Redis-based components for Scrapy. | |
scrapy-cluster | Uses Redis and Kafka to create a distributed on demand scraping cluster. | |
distribute_crawler | Uses Scrapy, Redis, MongoDB, and Graphite to create a distributed spider. | |
pyspider | A powerful spider system. | |
CoCrawler | A versatile web crawler built using modern tools and concurrency. | |
cola | A distributed crawling framework. | |
Demiurge | PyQuery-based scraping micro-framework. | |
Scrapely | A pure-python HTML screen-scraping library. | |
feedparser | Feed parser. | |
you-get | Dumb downloader that scrapes the web. | |
Grab | Site scraping framework. | |
MechanicalSoup | A Python library for automating interaction with websites. | |
portia | Visual scraping for Scrapy. | |
crawley | Pythonic Crawling / Scraping Framework based on Non Blocking I/O operations. | |
RoboBrowser | A simple, Pythonic library for browsing the web without a standalone web browser. | |
MSpider | A simple, easy spider using gevent and JS rendering. | |
brownant | A lightweight web data extracting framework. | |
PSpider | A simple spider frame in Python3. | |
Gain | Web crawling framework based on asyncio for everyone. | |
sukhoi | Minimalist and powerful Web Crawler. | |
spidy | The simple, easy to use command line web crawler. | |
newspaper | News, full-text, and article metadata extraction in Python 3. | |
aspider | An async web scraping micro-framework based on asyncio. |
Name | Description | Last commit |
---|---|---|
ACHE Crawler | An easy to use web crawler for domain-specific search. | |
Apache Nutch | Highly extensible, highly scalable web crawler for production environment. | |
anthelion | A plugin for Apache Nutch to crawl semantic annotations within HTML pages. | |
Crawler4j | Simple and lightweight web crawler. | |
JSoup | Scrapes, parses, manipulates and cleans HTML. | |
websphinx | Website-Specific Processors for HTML information extraction. | |
Open Search Server | A full set of search functions. Build your own indexing strategy. Parsers extract full-text data. The crawlers can index everything. | |
Gecco | An easy-to-use, lightweight web crawler. | |
WebCollector | Simple interfaces for crawling the Web; you can set up a multi-threaded web crawler in less than 5 minutes. | |
Webmagic | A scalable crawler framework. | |
Spiderman | A scalable, extensible, multi-threaded web crawler. | |
Spiderman2 | A distributed web crawler framework that supports JS rendering. | |
Heritrix3 | Extensible, web-scale, archival-quality web crawler project. | |
SeimiCrawler | An agile, distributed crawler framework. | |
StormCrawler | An open source collection of resources for building low-latency, scalable web crawlers on Apache Storm. | |
Spark-Crawler | Evolving Apache Nutch to run on Spark. | |
webBee | A DFS web spider. | |
spider-flow | A visual spider framework; it's so good that you don't need to write any code to crawl a website. |
Name | Description | Last commit |
---|---|---|
ccrawler | Built in C# 3.5. It contains a simple extension of a web content categorizer, which can separate web pages depending on their content. | |
SimpleCrawler | Simple spider based on multithreading and regular expressions. | |
DotnetSpider | A cross-platform, lightweight spider developed in C#. | |
Abot | C# web crawler built for speed and flexibility. | |
Hawk | Advanced Crawler and ETL tool written in C#/WPF. | |
SkyScraper | An asynchronous web scraper / web crawler using async / await and Reactive Extensions. | |
Infinity Crawler | A simple but powerful web crawler library in C#. |
Name | Description | Last commit |
---|---|---|
scraperjs | A complete and versatile web scraper. | |
scrape-it | A Node.js scraper for humans. | |
simplecrawler | Event driven web crawler. | |
node-crawler | Node-crawler has a clean, simple API. | |
js-crawler | Web crawler for Node.js; both HTTP and HTTPS are supported. | |
webster | A reliable web crawling framework which can scrape AJAX- and JS-rendered content in a web page. | |
x-ray | Web scraper with pagination and crawler support. | |
node-osmosis | HTML/XML parser and web scraper for Node.js. | |
web-scraper-chrome-extension | Web data extraction tool implemented as a Chrome extension. | |
supercrawler | Define custom handlers to parse content. Obeys robots.txt, rate limits and concurrency limits. | |
headless-chrome-crawler | Headless Chrome crawls with jQuery support. | |
Squidwarc | High fidelity, user scriptable, archival crawler that uses Chrome or Chromium with or without a head |
Name | Description | Last commit |
---|---|---|
Goutte | A screen scraping and web crawling library for PHP. | |
laravel-goutte | Laravel 5 Facade for Goutte. | |
dom-crawler | The DomCrawler component eases DOM navigation for HTML and XML documents. | |
QueryList | The progressive PHP crawler framework. | |
pspider | Parallel web crawler written in PHP. | |
php-spider | A configurable and extensible PHP web spider. | |
spatie/crawler | An easy to use, powerful crawler implemented in PHP. Can execute JavaScript. | |
crawlzone/crawlzone | Crawlzone is a fast asynchronous internet crawling framework for PHP. |
Name | Description | Last commit |
---|---|---|
open-source-search-engine | A distributed open source search engine and spider/crawler written in C/C++. |
Name | Description | Last commit |
---|---|---|
httrack | Copy websites to your computer. |
Name | Description | Last commit |
---|---|---|
Nokogiri | A Rubygem providing HTML, XML, SAX, and Reader parsers with XPath and CSS selector support. | |
upton | A batteries-included framework for easy web-scraping. Just add CSS (or do more). | |
wombat | Lightweight Ruby web crawler/scraper with an elegant DSL which extracts structured data from pages. | |
RubyRetriever | RubyRetriever is a Web Crawler, Scraper & File Harvester. | |
Spidr | Spider a site, multiple domains, certain links or infinitely. | |
Cobweb | Web crawler with very flexible crawling options, standalone or using sidekiq. | |
mechanize | Automated web interaction & crawling. |
Name | Description | Last commit |
---|---|---|
rvest | Simple web scraping for R. |
Name | Description | Last commit |
---|---|---|
ebot | A scalable, distributed and highly configurable web crawler. |
Name | Description | Last commit |
---|---|---|
web-scraper | Web Scraping Toolkit using HTML and CSS Selectors or XPath expressions. |
Name | Description | Last commit |
---|---|---|
pholcus | A distributed, high concurrency and powerful web crawler. | |
gocrawl | Polite, slim and concurrent web crawler. | |
fetchbot | A simple and flexible web crawler that follows the robots.txt policies and crawl delays. | |
go_spider | An awesome Go concurrent crawler (spider) framework. | |
dht | BitTorrent DHT Protocol && DHT Spider. | |
ants-go | An open source, distributed, RESTful crawler engine in Golang. | |
scrape | A simple, higher level interface for Go web scraping. | |
creeper | The Next Generation Crawler Framework (Go). | |
colly | Fast and Elegant Scraping Framework for Gophers. | |
ferret | Declarative web scraping. | |
Dataflow kit | Extract structured data from web pages. Web site scraping. | |
Hakrawler | Simple, fast web crawler designed for easy, quick discovery of endpoints and assets within a web application. |
Name | Description | Last commit |
---|---|---|
crawler | Scala DSL for web crawling. | |
scrala | Scala crawler (spider) framework, inspired by Scrapy. | |
ferrit | Ferrit is a web crawler service written in Scala using Akka, Spray and Cassandra. |