@sebastian-nagel
Forked from tfmorris/common-crawl-cdx.py
Last active May 16, 2016 13:20
Analyze Common Crawl index - http://index.commoncrawl.org/
# -*- coding: utf-8 -*-
"""
common-crawl-cdx.py

A simple example program to analyze the Common Crawl index.

This is implemented as a single stream job which accesses S3 via HTTP,
so that it can easily be run from any laptop, but it could easily be
converted to an EMR job which processes the 300 index files in parallel.

If you are only interested in a certain set of TLDs or PLDs, the program
could be enhanced to use cluster.idx to figure out the correct subset of
the 300 shards to process, e.g. the .us TLD is entirely contained in
cdx-00298 & cdx-00299.

Created on Wed Apr 22 15:05:54 2015

@author: Tom Morris <tfmorris@gmail.com>
"""

from collections import Counter
import json
import requests
import zlib

BASEURL = 'https://aws-publicdatasets.s3.amazonaws.com/'
INDEX1 = 'common-crawl/cc-index/collections/CC-MAIN-2015-11/indexes/'
INDEX2 = 'common-crawl/cc-index/collections/CC-MAIN-2015-14/indexes/'
SPLITS = 300


def process_index(index):
    total_length = 0
    total_processed = 0
    total_urls = 0
    mime_types = Counter()
    for i in range(SPLITS):
        unconsumed_text = ''
        filename = 'cdx-%05d.gz' % i
        url = BASEURL + index + filename
        response = requests.get(url, stream=True)
        length = int(response.headers['content-length'].strip())
        decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
        total = 0
        response_iter = response.iter_content(chunk_size=2048)
        chunk = next(response_iter, '')
        while len(chunk) > 0 or len(decompressor.unused_data) > 0:
            total += len(chunk)
            if len(decompressor.unused_data) > 0:
                # restart decompressor at the start of the next gzip block
                to_decompress = decompressor.unused_data + chunk
                decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
            else:
                to_decompress = decompressor.unconsumed_tail + chunk
            s = unconsumed_text + decompressor.decompress(to_decompress)
            unconsumed_text = ''
            if len(s) == 0:
                # Not sure why this happens, but doesn't seem to affect things
                print 'Decompressed nothing %2.2f%%' % (total * 100.0 / length),\
                    length, total, len(chunk), filename
            for l in s.split('\n'):
                pieces = l.split(' ')
                if len(pieces) < 3 or l[-1] != '}':
                    # incomplete line: keep it for the next round
                    unconsumed_text = l
                else:
                    json_string = ' '.join(pieces[2:])
                    try:
                        metadata = json.loads(json_string)
                    except:
                        print 'JSON load failed: ', total, l
                        assert False
                    url = metadata['url']
                    if 'mime' in metadata:
                        mime_types[metadata['mime']] += 1
                    else:
                        mime_types['<none>'] += 1
                        # print 'No mime type for ', url
                    total_urls += 1
                    # print url
            chunk = next(response_iter, '')
        print 'Done with ', filename
        total_length += length
        total_processed += total
    print 'Processed %2.2f %% of %d bytes (compressed). Found %d URLs' %\
        ((total_processed * 100.0 / total_length), total_length, total_urls)
    print "Mime types:"
    for k, v in mime_types.most_common():
        print '%5d %s' % (v, k)


for index in [INDEX2]:
    print 'Processing index: ', index
    process_index(index)
    print 'Done processing index: ', index
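
As mentioned in the docstring, cluster.idx could be used to restrict processing to the shards that can contain a given TLD. Below is a minimal sketch of that idea (not part of the original gist), assuming the usual zipnum cluster.idx layout of tab-separated fields (SURT key plus timestamp, shard filename, offset, length, line number) and reusing BASEURL and INDEX2 from the script above; the prefix 'us,' and the function name are illustrative only.

import requests

CLUSTER_IDX_URL = BASEURL + INDEX2 + 'cluster.idx'

def shards_for_surt_prefix(prefix):
    # collect the names of all cdx-*.gz shards whose cluster.idx entries
    # start with the given SURT prefix (e.g. 'us,' for the .us TLD)
    shards = set()
    prev_shard = None
    response = requests.get(CLUSTER_IDX_URL, stream=True)
    for line in response.iter_lines():
        if not line:
            continue
        fields = line.split('\t')
        key, shard = fields[0], fields[1]
        if key.startswith(prefix):
            if not shards and prev_shard is not None:
                # the zipnum block holding the first matching records may
                # begin under the preceding cluster.idx entry
                shards.add(prev_shard)
            shards.add(shard)
        prev_shard = shard
    return sorted(shards)

print shards_for_surt_prefix('us,')  # per the docstring, roughly ['cdx-00298.gz', 'cdx-00299.gz']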
@sebastian-nagel (Author) commented:

Regarding the 'Decompressed nothing' warnings: gzip has an 8-byte footer containing a checksum. The cdx format consists of multiple gzipped blocks, each block containing 3000 lines/records. If a chunk read from HTTP ends exactly before the footer of one block (or inside it), the call to decompressor.decompress(...) only finishes the current block and returns nothing, because that block's data has already been returned. The next call to decompress() will start with the next block. In short: "it doesn't seem to affect things", as stated.
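
A minimal, self-contained sketch of this behaviour (hypothetical test data, same Python 2 style as the script above): two gzip members are concatenated and fed to a zlib decompressor in two chunks, with the first chunk cut inside the footer of member one, so the second decompress() call returns an empty string.

import gzip
import io
import os
import zlib

def gzip_member(data):
    # compress one chunk of data as a standalone gzip member
    out = io.BytesIO()
    with gzip.GzipFile(fileobj=out, mode='wb') as f:
        f.write(data)
    return out.getvalue()

first = gzip_member(os.urandom(4000))
second = gzip_member(os.urandom(4000))
stream = first + second

# cut the first chunk inside the 8-byte footer of member one
chunks = [stream[:len(first) - 4], stream[len(first) - 4:]]

decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
for chunk in chunks:
    if len(decompressor.unused_data) > 0:
        # the previous member ended inside an earlier chunk: start a new one
        chunk = decompressor.unused_data + chunk
        decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    data = decompressor.decompress(chunk)
    # the second call prints len(data) == 0: it only consumed the footer
    print len(chunk), len(data), len(decompressor.unused_data)

# member two now sits complete in decompressor.unused_data and is only
# recovered by a later chunk, or by the drain step discussed in the next comment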

@sebastian-nagel (Author) commented:

Index lines at the end of a cdx-00xxx.gz file may sometimes get lost if the last chunk of the HTTP response contains a second gzipped block. E.g., cdx-00071.gz contains 7293007 lines / URLs / WARC records according to

aws s3 cp s3://aws-publicdatasets/common-crawl/cc-index/collections/CC-MAIN-2016-07/indexes/cdx-00071.gz - | gunzip | wc -l
7293007

If common-crawl-cdx.py is modified to process only this cdx file, it logs:

Done with  cdx-00071.gz
Processed 100.00 % of 476417933 bytes (compressed).  Found 7293000 URLs

7 URLs are missing. This happens when all chunks of the HTTP response have been processed but data that has not yet been decompressed is still held in the decompressor object (decompressor.unused_data). Staying in the inner processing loop until decompressor.unused_data is empty fixes this (see revision #2, 2016-05-13):

Done with  cdx-00071.gz
Processed 100.00 % of 476417933 bytes (compressed).  Found 7293007 URLs
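
A self-contained sketch of the failure mode and the fix (hypothetical two-member test data, Python 2 as above): the single chunk ends with a complete second gzip member, so with the old loop condition that member is left behind in unused_data, while looping until unused_data is empty, as revision #2 does, recovers it.

import gzip
import io
import zlib

def gzip_member(data):
    out = io.BytesIO()
    with gzip.GzipFile(fileobj=out, mode='wb') as f:
        f.write(data)
    return out.getvalue()

# one HTTP-style chunk whose tail is a complete second gzip member
chunks = [gzip_member('a' * 50) + gzip_member('b' * 50)]

def decompress_all(chunks, drain_unused_data):
    chunks = iter(chunks)
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    out = ''
    chunk = next(chunks, '')
    while len(chunk) > 0 or (drain_unused_data and
                             len(decompressor.unused_data) > 0):
        if len(decompressor.unused_data) > 0:
            # restart on the member left over from the previous call
            chunk = decompressor.unused_data + chunk
            decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
        out += decompressor.decompress(chunk)
        chunk = next(chunks, '')
    return out

print len(decompress_all(chunks, False))  # 50  - the second member is lost
print len(decompress_all(chunks, True))   # 100 - revision #2 behaviour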
