@gustavorps
Created March 31, 2022 17:59

Awesome-crawler

A collection of awesome web crawlers, spiders, and resources in different languages.

Python

Name Description
Scrapy A fast high-level screen scraping and web crawling framework.
django-dynamic-scraper Creating Scrapy scrapers via the Django admin interface.
Scrapy-Redis Redis-based components for Scrapy.
scrapy-cluster Uses Redis and Kafka to create a distributed on demand scraping cluster.
distribute_crawler Uses Scrapy, Redis, MongoDB, and Graphite to create a distributed spider.
pyspider A powerful spider system.
CoCrawler A versatile web crawler built using modern tools and concurrency.
cola A distributed crawling framework.
Demiurge PyQuery-based scraping micro-framework.
Scrapely A pure-python HTML screen-scraping library.
feedparser Feed parser.
you-get Dumb downloader that scrapes the web.
Grab Site scraping framework.
MechanicalSoup A Python library for automating interaction with websites.
portia Visual scraping for Scrapy.
crawley Pythonic crawling/scraping framework based on non-blocking I/O operations.
RoboBrowser A simple, Pythonic library for browsing the web without a standalone web browser.
MSpider A simple, easy spider using gevent and JS rendering.
brownant A lightweight web data extracting framework.
PSpider A simple spider framework in Python 3.
Gain Web crawling framework based on asyncio for everyone.
sukhoi Minimalist and powerful Web Crawler.
spidy The simple, easy to use command line web crawler.
newspaper News, full-text, and article metadata extraction in Python 3.
aspider An async web scraping micro-framework based on asyncio.

Java

Name Description
ACHE Crawler An easy-to-use web crawler for domain-specific search.
Apache Nutch Highly extensible, highly scalable web crawler for production environments.
anthelion A plugin for Apache Nutch to crawl semantic annotations within HTML pages.
Crawler4j Simple and lightweight web crawler.
JSoup Scrapes, parses, manipulates and cleans HTML.
websphinx Website-Specific Processors for HTML information extraction.
Open Search Server A full set of search functions. Build your own indexing strategy. Parsers extract full-text data. The crawlers can index everything.
Gecco An easy-to-use, lightweight web crawler.
WebCollector Simple interfaces for crawling the Web; you can set up a multi-threaded web crawler in less than 5 minutes.
Webmagic A scalable crawler framework.
Spiderman A scalable, extensible, multi-threaded web crawler.
Spiderman2 A distributed web crawler framework that supports JS rendering.
Heritrix3 Extensible, web-scale, archival-quality web crawler project.
SeimiCrawler An agile, distributed crawler framework.
StormCrawler An open source collection of resources for building low-latency, scalable web crawlers on Apache Storm.
Spark-Crawler Evolving Apache Nutch to run on Spark.
webBee A DFS web spider.
spider-flow A visual spider framework that lets you crawl a website without writing any code.

C#

Name Description
ccrawler Built with C# 3.5. It contains a simple web content categorizer extension, which can separate web pages depending on their content.
SimpleCrawler Simple spider based on multithreading and regular expressions.
DotnetSpider A cross-platform, lightweight spider developed in C#.
Abot C# web crawler built for speed and flexibility.
Hawk Advanced Crawler and ETL tool written in C#/WPF.
SkyScraper An asynchronous web scraper / web crawler using async / await and Reactive Extensions.
Infinity Crawler A simple but powerful web crawler library in C#.

JavaScript

Name Description
scraperjs A complete and versatile web scraper.
scrape-it A Node.js scraper for humans.
simplecrawler Event driven web crawler.
node-crawler Node-crawler has a clean, simple API.
js-crawler Web crawler for Node.js; both HTTP and HTTPS are supported.
webster A reliable web crawling framework which can scrape Ajax- and JS-rendered content in a web page.
x-ray Web scraper with pagination and crawler support.
node-osmosis HTML/XML parser and web scraper for Node.js.
web-scraper-chrome-extension Web data extraction tool implemented as a Chrome extension.
supercrawler Define custom handlers to parse content. Obeys robots.txt, rate limits and concurrency limits.
headless-chrome-crawler Headless Chrome crawls with jQuery support.
Squidwarc High fidelity, user scriptable, archival crawler that uses Chrome or Chromium with or without a head.

PHP

Name Description
Goutte A screen scraping and web crawling library for PHP.
laravel-goutte Laravel 5 Facade for Goutte.
dom-crawler The DomCrawler component eases DOM navigation for HTML and XML documents.
QueryList The progressive PHP crawler framework.
pspider Parallel web crawler written in PHP.
php-spider A configurable and extensible PHP web spider.
spatie/crawler An easy-to-use, powerful crawler implemented in PHP. Can execute JavaScript.
crawlzone/crawlzone Crawlzone is a fast asynchronous internet crawling framework for PHP.

C++

Name Description
open-source-search-engine A distributed open source search engine and spider/crawler written in C/C++.

C

Name Description
httrack Copy websites to your computer.

Ruby

Name Description
Nokogiri A Rubygem providing HTML, XML, SAX, and Reader parsers with XPath and CSS selector support.
upton A batteries-included framework for easy web-scraping. Just add CSS (or do more).
wombat Lightweight Ruby web crawler/scraper with an elegant DSL which extracts structured data from pages.
RubyRetriever RubyRetriever is a Web Crawler, Scraper & File Harvester.
Spidr Spider a site, multiple domains, certain links, or infinitely.
Cobweb Web crawler with very flexible crawling options, standalone or using sidekiq.
mechanize Automated web interaction & crawling.

R

Name Description
rvest Simple web scraping for R.

Erlang

Name Description
ebot A scalable, distributed, and highly configurable web crawler.

Perl

Name Description
web-scraper Web Scraping Toolkit using HTML and CSS Selectors or XPath expressions.

Go

Name Description
pholcus A distributed, high concurrency and powerful web crawler.
gocrawl Polite, slim and concurrent web crawler.
fetchbot A simple and flexible web crawler that follows the robots.txt policies and crawl delays.
go_spider An awesome Go concurrent crawler (spider) framework.
dht BitTorrent DHT Protocol && DHT Spider.
ants-go An open source, distributed, RESTful crawler engine in Golang.
scrape A simple, higher level interface for Go web scraping.
creeper The Next Generation Crawler Framework (Go).
colly Fast and Elegant Scraping Framework for Gophers.
ferret Declarative web scraping.
Dataflow kit Extract structured data from web pages. Web site scraping.
Hakrawler Simple, fast web crawler designed for easy, quick discovery of endpoints and assets within a web application.

Scala

Name Description
crawler Scala DSL for web crawling.
scrala Scala crawler (spider) framework, inspired by Scrapy.
ferrit Ferrit is a web crawler service written in Scala using Akka, Spray and Cassandra.