Scraping Resources

Python Modules for Scraping:

Scraping and Parsing

  • selectolax
  • AdvancedHTMLParser
  • grequests
  • parsel
  • mechanicalsoup
  • beautifulsoup4
  • gazpacho
  • cloudscraper
  • cfscrape
  • ipwhois
  • saas
  • parse-utils
  • looter
  • xlseries
  • sriram-twitter-scraper

Scrapy

  • scrapy
  • scrapyrt
  • scrapy-splash
  • scrapy-autoextract
  • scrapy-pagestorage
  • scrapy-jsonschema
  • scrapy-wayback-middleware
  • scrapy-rss
  • scrapy-rotating-proxies
  • django-dynamic-scraper

Specific

  • yt-videos-list
  • twint
  • play-scraper
  • instagramscraper
  • instalooter
  • instabotnet
  • linkedin-scraper
  • google-search-results-serpwow
  • youtubedata
  • TikTokApi
  • imgur-scraper
  • tropescraper
  • google-search-results
  • pastepwn
  • wikitablescrape
  • recipe-scrapers
  • name-scraper
  • lyrics-extractor
  • newsman
  • ludoj-scraper

JSON

  • python-rapidjson
  • orjson
  • jsonslicer
  • nujson
  • yapic.json

Text & Data Manipulation

  • htmldate
  • newspaper3k
  • acora
  • hext
  • boltons (boltons.strutils)
  • w3lib
  • textnormaliser
  • hyperlink
  • shorttext
  • postal
  • readability
  • cypunct
  • justext
  • iso4217parse
  • isbnlib

Image, Audio & File Manipulation

  • tesserocr
  • imagecodecs
  • imagecodecs-lite
  • miniaudio
  • pysndfile
  • pdfquery

Performance

  • cnamedtuples
  • pybase64
  • lz4 and zstd
  • pikepdf and PyMuPDF
  • fortuna, pyewacket, and rng
  • cytoolz
  • psutil
  • libuuid
  • hoedown

PS: The obvious ones, like requests, pandas, and numpy, are deliberately left out.
