Common Crawl Corpus parsing in Go

Common Crawl Corpus

Why crawl the web when someone already does it for us?

A demo walking through the Common Crawl data: https://tech.marksblogg.com/petabytes-of-website-data-spark-emr.html

Steps

  1. Several times a year a new crawl of the "whole" web is completed. Each crawl gets an index named CC-MAIN-YYYY-WW (a year + week-of-year combination).
  2. Each index has a master paths file listing every file that belongs to that crawl.
  3. That paths file contains the 50k+ URLs of the warc.gz files which make up the data in the index.
  4. For each warc.gz file: stream it, unzip it, and parse it using go-warc, as shown by ccrawl (see the sketch after this list).
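
The sketch below walks steps 2–4 using only the Go standard library rather than go-warc: it fetches the warc.paths.gz listing for one crawl, streams the first warc.gz it names, and counts records by their `WARC/1.x` version lines. The crawl ID, the early-exit cap, and the line-counting shortcut are illustrative assumptions; a real parser such as go-warc reads each record's Content-Length and body properly.

```go
package main

import (
	"bufio"
	"compress/gzip"
	"fmt"
	"io"
	"log"
	"net/http"
	"strings"
)

const (
	base    = "https://data.commoncrawl.org/"
	crawlID = "CC-MAIN-2021-49" // example index (YYYY-WW); substitute the latest one
)

// gzipLines streams a gzipped resource over HTTP and hands each line to fn.
// Returning false from fn stops the stream early.
func gzipLines(url string, fn func(string) bool) error {
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	gz, err := gzip.NewReader(resp.Body)
	if err != nil {
		return err
	}
	defer gz.Close()

	r := bufio.NewReader(gz)
	for {
		line, err := r.ReadString('\n')
		if line != "" && !fn(strings.TrimRight(line, "\r\n")) {
			return nil
		}
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
	}
}

func main() {
	// Steps 2-3: the paths file lists every warc.gz that makes up this crawl.
	var first string
	err := gzipLines(base+"crawl-data/"+crawlID+"/warc.paths.gz", func(line string) bool {
		first = line
		return false // only the first path is needed for this demo
	})
	if err != nil || first == "" {
		log.Fatalf("reading warc.paths.gz: %v", err)
	}
	fmt.Println("first WARC file:", base+first)

	// Step 4: stream that warc.gz. Every record begins with a "WARC/1.x"
	// version line, so counting those is a rough stand-in for real parsing.
	records := 0
	err = gzipLines(base+first, func(line string) bool {
		if strings.HasPrefix(line, "WARC/1.") {
			records++
		}
		return records < 1000 // stop early; full files are large
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("WARC records seen (capped):", records)
}
```

Streaming keeps memory flat: each warc.gz is on the order of a gigabyte compressed, so nothing here buffers a whole file before parsing it.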

Simple Demo

There is a simple example of crawling all indexes, looking for all *.au domains and listing when they were first and last seen according to the Common Crawl corpus: https://gist.github.com/Xeoncross/020c283e334a94539676f029e3039c86
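
As a rough sketch of the same idea, the snippet below asks one crawl's CDX index server for captures matching a URL pattern and prints their timestamps. The crawl ID, the example pattern, and the response field names are assumptions based on the public endpoint at index.commoncrawl.org.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"net/url"
)

func main() {
	crawlID := "CC-MAIN-2021-49" // pick one from collinfo.json
	q := url.Values{
		"url":    {"*.example.com.au"}, // hypothetical domain pattern
		"output": {"json"},
		"limit":  {"5"},
	}
	resp, err := http.Get("https://index.commoncrawl.org/" + crawlID + "-index?" + q.Encode())
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// Each response line is one capture: a JSON object with url, timestamp, ...
	sc := bufio.NewScanner(resp.Body)
	for sc.Scan() {
		var capture struct {
			URL       string `json:"url"`
			Timestamp string `json:"timestamp"`
		}
		if err := json.Unmarshal(sc.Bytes(), &capture); err != nil {
			continue
		}
		fmt.Println(capture.Timestamp, capture.URL)
	}
}
```

Running this against each crawl index in turn and keeping the earliest and latest timestamps per domain is the gist of the first-seen/last-seen demo linked above.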

Most recent crawl index

aws s3 ls s3://commoncrawl/crawl-data/ | grep "CC-MAIN-$(date +%Y)" | tail -n 1

# or via JSON at: http://index.commoncrawl.org/collinfo.json
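
For a scripted version of the same lookup, here is a small sketch that reads collinfo.json over HTTP instead of listing the S3 bucket; the JSON field names and the newest-first ordering are assumptions about that endpoint's output.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

func main() {
	resp, err := http.Get("https://index.commoncrawl.org/collinfo.json")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// collinfo.json is a JSON array of crawl indexes, assumed newest first.
	var crawls []struct {
		ID   string `json:"id"`
		Name string `json:"name"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&crawls); err != nil {
		log.Fatal(err)
	}
	if len(crawls) > 0 {
		fmt.Println("most recent index:", crawls[0].ID, "-", crawls[0].Name)
	}
}
```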