@tfmorris
Created April 24, 2015 18:55
Analyze Common Crawl index - http://index.commoncrawl.org/
# -*- coding: utf-8 -*-
"""
common-crawl-cdx.py
A simple example program to analyze the Common Crawl index.
This is implemented as a single stream job which accesses S3 via HTTP,
so that it can easily be run from any laptop, but it could easily be
converted to an EMR job which processes the 300 index files in parallel.
If you are only interested in a certain set of TLDs or PLDs, the program
could be enhanced to use cluster.idx to figure out the correct subset of
the 300 shards to process, e.g. the .us TLD is entirely contained in
cdx-00298 & cdx-00299 (see the sketch after the script).
Created on Wed Apr 22 15:05:54 2015
@author: Tom Morris <tfmorris@gmail.com>
"""
from collections import Counter
import json
import requests
import zlib
BASEURL = 'https://aws-publicdatasets.s3.amazonaws.com/'
INDEX1 = 'common-crawl/cc-index/collections/CC-MAIN-2015-11/indexes/'
INDEX2 = 'common-crawl/cc-index/collections/CC-MAIN-2015-14/indexes/'
SPLITS = 300
def process_index(index):
    total_length = 0
    total_processed = 0
    total_urls = 0
    mime_types = Counter()
    for i in range(SPLITS):
        unconsumed_text = ''
        filename = 'cdx-%05d.gz' % i
        url = BASEURL + index + filename
        response = requests.get(url, stream=True)
        length = int(response.headers['content-length'].strip())
        decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
        total = 0
        for chunk in response.iter_content(chunk_size=2048):
            total += len(chunk)
            if len(decompressor.unused_data) > 0:
                # Previous gzip member finished mid-chunk; restart the
                # decompressor on its leftover bytes plus the new chunk.
                to_decompress = decompressor.unused_data + chunk
                decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
            else:
                to_decompress = decompressor.unconsumed_tail + chunk
            s = unconsumed_text + decompressor.decompress(to_decompress)
            unconsumed_text = ''
            if len(s) == 0:
                # Not sure why this happens, but doesn't seem to affect things
                print 'Decompressed nothing %2.2f%%' % (total * 100.0 / length),\
                    length, total, len(chunk), filename
            for l in s.split('\n'):
                # Each index line is '<SURT key> <timestamp> <JSON metadata>'
                pieces = l.split(' ')
                if len(pieces) < 3 or l[-1] != '}':
                    # Incomplete trailing line - carry it over to the next chunk
                    unconsumed_text = l
                else:
                    json_string = ' '.join(pieces[2:])
                    try:
                        metadata = json.loads(json_string)
                    except:
                        print 'JSON load failed: ', total, l
                        assert False
                    url = metadata['url']
                    if 'mime' in metadata:
                        mime_types[metadata['mime']] += 1
                    else:
                        mime_types['<none>'] += 1
                        # print 'No mime type for ', url
                    total_urls += 1
                    # print url
        print 'Done with ', filename
        total_length += length
        total_processed += total
    print 'Processed %2.2f %% of %d bytes (compressed). Found %d URLs' %\
        ((total_processed * 100.0 / total_length), total_length, total_urls)
    print "Mime types:"
    for k, v in mime_types.most_common():
        print '%5d %s' % (v, k)


for index in [INDEX2]:
    print 'Processing index: ', index
    process_index(index)
    print 'Done processing index: ', index
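
As suggested in the docstring, a caller interested in a single TLD could consult cluster.idx first and restrict the shard loop to the relevant files. The sketch below is not part of the original gist: the helper name shards_for_tld is made up, and it assumes cluster.idx lines are tab-separated as '<SURT key> <timestamp>', '<cdx file>', '<offset>', '<length>', '<sequence>' (the usual zipnum secondary-index layout). Verify the layout against the actual file before relying on it.

def shards_for_tld(index, tld):
    """Rough sketch: return the cdx-XXXXX.gz shards whose key ranges cover
    the given TLD, by scanning cluster.idx for matching SURT prefixes."""
    prefix = tld + ','  # SURT keys reverse the host, so .us URLs start with 'us,'
    shards = set()
    previous_shard = None
    response = requests.get(BASEURL + index + 'cluster.idx', stream=True)
    for line in response.iter_lines():
        fields = line.split('\t')
        if len(fields) < 2:
            continue
        surt_key, shard = fields[0], fields[1]
        if surt_key.startswith(prefix):
            if not shards and previous_shard:
                # The block starting just before the first matching key may
                # also contain records for this TLD, so include its shard too.
                shards.add(previous_shard)
            shards.add(shard)
        previous_shard = shard
    return sorted(shards)

# For the .us TLD this should roughly yield ['cdx-00298.gz', 'cdx-00299.gz'],
# per the docstring above.
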
@sebastian-nagel

Regarding the 'Decompressed nothing' warnings: gzip has an 8-byte footer containing a checksum, and the cdx format contains multiple gzipped blocks, each block holding 3000 lines/records. If a chunk read from HTTP ends exactly before the footer of one block (or inside it), calling decompressor.decompress(...) just finishes the current block and returns nothing, because the block's data has already been emitted. The next call to decompress() will start with the next block. In short, "it doesn't seem to affect things", as stated.
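
A minimal, self-contained illustration of that behaviour (not from the gist; it builds its own two-member gzip blob with the gzip module): once a member's deflate data has been decoded, a decompress() call that only delivers the remaining footer bytes returns an empty string, and a fresh decompressobj is needed for the next member.

import gzip
import zlib
from StringIO import StringIO

def gzip_member(data):
    # Write one complete gzip member into an in-memory buffer.
    buf = StringIO()
    gz = gzip.GzipFile(fileobj=buf, mode='wb')
    gz.write(data)
    gz.close()
    return buf.getvalue()

member1 = gzip_member('a' * 3000 + '\n')
member2 = gzip_member('b' * 3000 + '\n')

d = zlib.decompressobj(16 + zlib.MAX_WBITS)
# Everything except the last 4 bytes of the 8-byte footer: the whole payload
# comes out here, because the deflate stream ends before the footer.
print len(d.decompress(member1[:-4]))   # 3001
# A "chunk" containing only the rest of the footer decodes to nothing.
print repr(d.decompress(member1[-4:]))  # ''
# The next member needs a fresh decompressor, as the gist does.
d2 = zlib.decompressobj(16 + zlib.MAX_WBITS)
print len(d2.decompress(member2))       # 3001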

@sebastian-nagel

Index lines at the end of a cdx-00xxx.gz file may sometimes get lost if the last chunk of the HTTP response contains a second gzipped block. E.g., cdx-00071.gz contains 7293007 lines / URLs / WARC records according to

aws s3 cp s3://aws-publicdatasets/common-crawl/cc-index/collections/CC-MAIN-2016-07/indexes/cdx-00071.gz - | gunzip | wc -l
7293007

If common-crawl-cdx.py is modified to process only this cdx file, it logs:

Done with  cdx-00071.gz
Processed 100.00 % of 476417933 bytes (compressed).  Found 7293000 URLs

7 URLs are missing. This happens when all chunks of the HTTP response have been processed but uncompressed data is still held in the decompressor object. Staying in the inner processing loop until decompressor.unused_data is empty fixes this (see https://gist.github.com/sebastian-nagel/4790f1a7c5fed813ec2a700370cb73b9/revisions):

Done with  cdx-00071.gz
Processed 100.00 % of 476417933 bytes (compressed).  Found 7293007 URLs
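
One way to write that fix (a sketch only; the exact change is in the linked revision, and the generator name iter_decompressed is made up here): move the gzip handling into a generator that, after the last HTTP chunk, keeps restarting the decompressor on decompressor.unused_data until nothing is left, so a trailing gzip member delivered entirely inside the final chunk is still decoded.

import zlib

def iter_decompressed(chunks):
    """Yield decompressed strings for an iterable of gzip-compressed chunks,
    draining any gzip members still buffered after the last chunk."""
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    for chunk in chunks:
        if len(decompressor.unused_data) > 0:
            # Previous member finished mid-chunk: restart on the leftovers.
            chunk = decompressor.unused_data + chunk
            decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
        yield decompressor.decompress(chunk)
    # Drain: the final chunk may have completed one member while a whole
    # additional member is still sitting in unused_data.
    while len(decompressor.unused_data) > 0:
        leftover = decompressor.unused_data
        decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
        yield decompressor.decompress(leftover)

process_index() could then loop with "for s in iter_decompressed(response.iter_content(chunk_size=2048)):" and keep its existing line splitting unchanged; the byte counting (total += len(chunk)) would have to move into the generator or be tracked some other way.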
