Common Crawl Corpus parsing in Go

Common Crawl Corpus

Why crawl the web when someone already does it for us?

A demo walking through the Common Crawl data: https://tech.marksblogg.com/petabytes-of-website-data-spark-emr.html

Steps

  1. Several times a year a new crawl of the "whole" web is completed. Each crawl gets an index named CC-MAIN-YYYY-WW (a year + week-of-year combination).
  2. Each index has a master paths file listing every file that belongs to that crawl.
  3. That paths file contains the 50k+ URLs of the warc.gz files which make up the data in the index.
  4. For each warc.gz file: stream it, unzip it, and parse it using go-warc, as shown by ccrawl (see the sketch after this list).
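
The sketch below walks steps 2–4 using only the Go standard library rather than go-warc: it fetches the warc.paths.gz listing for one crawl, streams the first warc.gz it names, and counts records by their `WARC/1.x` version lines. The crawl ID, the early-exit cap, and the line-counting shortcut are illustrative assumptions; a real parser such as go-warc reads each record's Content-Length and body properly.

```go
package main

import (
	"bufio"
	"compress/gzip"
	"fmt"
	"io"
	"log"
	"net/http"
	"strings"
)

const (
	base    = "https://data.commoncrawl.org/"
	crawlID = "CC-MAIN-2021-49" // example index (YYYY-WW); substitute the latest one
)

// gzipLines streams a gzipped resource over HTTP and hands each line to fn.
// Returning false from fn stops the stream early.
func gzipLines(url string, fn func(string) bool) error {
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	gz, err := gzip.NewReader(resp.Body)
	if err != nil {
		return err
	}
	defer gz.Close()

	r := bufio.NewReader(gz)
	for {
		line, err := r.ReadString('\n')
		if line != "" && !fn(strings.TrimRight(line, "\r\n")) {
			return nil
		}
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
	}
}

func main() {
	// Steps 2-3: the paths file lists every warc.gz that makes up this crawl.
	var first string
	err := gzipLines(base+"crawl-data/"+crawlID+"/warc.paths.gz", func(line string) bool {
		first = line
		return false // only the first path is needed for this demo
	})
	if err != nil || first == "" {
		log.Fatalf("reading warc.paths.gz: %v", err)
	}
	fmt.Println("first WARC file:", base+first)

	// Step 4: stream that warc.gz. Every record begins with a "WARC/1.x"
	// version line, so counting those is a rough stand-in for real parsing.
	records := 0
	err = gzipLines(base+first, func(line string) bool {
		if strings.HasPrefix(line, "WARC/1.") {
			records++
		}
		return records < 1000 // stop early; full files are large
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("WARC records seen (capped):", records)
}
```

Streaming keeps memory flat: each warc.gz is on the order of a gigabyte compressed, so nothing here buffers a whole file before parsing it.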

Simple Demo

There is a simple example of crawling all indexes, looking for all *.au domains and listing when they were first and last seen according to the Common Crawl corpus: https://gist.github.com/Xeoncross/020c283e334a94539676f029e3039c86
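
As a rough sketch of the same idea, the snippet below asks one crawl's CDX index server for captures matching a URL pattern and prints their timestamps. The crawl ID, the example pattern, and the response field names are assumptions based on the public endpoint at index.commoncrawl.org.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"net/url"
)

func main() {
	crawlID := "CC-MAIN-2021-49" // pick one from collinfo.json
	q := url.Values{
		"url":    {"*.example.com.au"}, // hypothetical domain pattern
		"output": {"json"},
		"limit":  {"5"},
	}
	resp, err := http.Get("https://index.commoncrawl.org/" + crawlID + "-index?" + q.Encode())
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// Each response line is one capture: a JSON object with url, timestamp, ...
	sc := bufio.NewScanner(resp.Body)
	for sc.Scan() {
		var capture struct {
			URL       string `json:"url"`
			Timestamp string `json:"timestamp"`
		}
		if err := json.Unmarshal(sc.Bytes(), &capture); err != nil {
			continue
		}
		fmt.Println(capture.Timestamp, capture.URL)
	}
}
```

Running this against each crawl index in turn and keeping the earliest and latest timestamps per domain is the gist of the first-seen/last-seen demo linked above.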

Most recent crawl index

aws s3 ls s3://commoncrawl/crawl-data/ | grep "CC-MAIN-$(date +%Y)" | tail -n 1

# or via JSON at: http://index.commoncrawl.org/collinfo.json
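
For a scripted version of the same lookup, here is a small sketch that reads collinfo.json over HTTP instead of listing the S3 bucket; the JSON field names and the newest-first ordering are assumptions about that endpoint's output.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

func main() {
	resp, err := http.Get("https://index.commoncrawl.org/collinfo.json")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// collinfo.json is a JSON array of crawl indexes, assumed newest first.
	var crawls []struct {
		ID   string `json:"id"`
		Name string `json:"name"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&crawls); err != nil {
		log.Fatal(err)
	}
	if len(crawls) > 0 {
		fmt.Println("most recent index:", crawls[0].ID, "-", crawls[0].Name)
	}
}
```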