
Stephen Merity (Smerity)

@Smerity
Smerity / gist:3c79ff41d154053f014a
Created September 26, 2014 23:20
Common Crawl Server List (single WARC file) -- from https://github.com/commoncrawl/cc-mrjob
"Apache/2.2.11" 35
"nginx/1.0.5" 35
"Nginx" 35
"Apache/2.2.27 (Unix) mod_ssl/2.2.27 OpenSSL/0.9.8e-fips-rhel5 mod_bwlimited/1.4" 36
"eBay Server" 36
"Polyvore Web Server" 36
"WWW" 37
"YTS/1.20.29" 37
"Apache/2.2.15 (Scientific Linux)" 38
"Apache/2.2.22 (Unix)" 39
@Smerity
Smerity / s3cmd.log
Created September 4, 2014 10:47
Testing access to a Common Crawl file via Amazon S3
smerity@pegasus:/tmp$ s3cmd info s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-23/segments/1404776400583.60/warc/CC-MAIN-20140707234000-00023-ip-10-180-212-248.ec2.internal.warc.gz
s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-23/segments/1404776400583.60/warc/CC-MAIN-20140707234000-00023-ip-10-180-212-248.ec2.internal.warc.gz (object):
File size: 765658891
Last mod: Thu, 17 Jul 2014 11:47:18 GMT
MIME type: application/octet-stream
MD5 sum: 5ce4fcbbb18442ce218e75a1d0f6d3dc
ACL: gil: FULL_CONTROL
ACL: *anon*: READ
URL: http://aws-publicdatasets.s3.amazonaws.com/common-crawl/crawl-data/CC-MAIN-2014-23/segments/1404776400583.60/warc/CC-MAIN-20140707234000-00023-ip-10-180-212-248.ec2.internal.warc.gz
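
Since the object is anonymously readable (note the *anon*: READ ACL above), it can also be fetched straight from S3 with boto. A minimal sketch using an anonymous connection and the same key as above:

import boto

# The aws-publicdatasets bucket allows anonymous access.
conn = boto.connect_s3(anon=True)
bucket = conn.get_bucket('aws-publicdatasets')
key = bucket.get_key('common-crawl/crawl-data/CC-MAIN-2014-23/segments/'
                     '1404776400583.60/warc/'
                     'CC-MAIN-20140707234000-00023-ip-10-180-212-248.ec2.internal.warc.gz')
print(key.size)  # should match the 765658891 bytes reported by s3cmd above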
@Smerity
Smerity / gist:ab4c2e8e87cb715e1f0b
Created August 14, 2014 18:02
Open Graph tags
<meta name="description" content="How do you design a property so that it compliments its surrounding so perfectly that it&#039;s virtually invisible to the naked eye? 2014&#039;s Australian House of the Year did just that."/>
<link rel="canonical" href="http://www.realestate.com.au/blog/invisible-house-takes-australian-house-year-australian-house-year/" />
<link rel="publisher" href="https://plus.google.com/+realestatecomau"/>
<meta property="og:locale" content="en_US" />
<meta property="og:type" content="article" />
<meta property="og:title" content="Australian house of the year is ... invisible?" />
<meta property="og:description" content="How do you design a property so that it compliments its surrounding so perfectly that it&#039;s virtually invisible to the naked eye? 2014&#039;s Australian House of the Year did just that." />
<meta property="og:url" content="http://www.realestate.com.au/blog/invisible-house-takes-australian-house-year-australian-house-year/" />
<meta property="og:site_name" content="Th
@Smerity
Smerity / mrcc.py
Last active August 29, 2015 14:05
mrjob example for Common Crawl
import re
#
from collections import Counter
#
import boto
import warc
#
from boto.s3.key import Key
from gzipstream import GzipStreamFile
from mrjob.job import MRJob
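
The preview above shows only the imports. A minimal sketch of how such an mrjob job over Common Crawl WARC files might continue, relying on the imports shown above (the class name and the mapper/reducer bodies are illustrative, not the gist's actual code):

class MRCommonCrawlJob(MRJob):
    def mapper(self, _, line):
        # Each input line names one WARC file in the aws-publicdatasets bucket.
        conn = boto.connect_s3(anon=True)
        bucket = conn.get_bucket('aws-publicdatasets')
        key = Key(bucket, line.strip())
        # Stream and decompress the WARC file without saving it to disk.
        for record in warc.WARCFile(fileobj=GzipStreamFile(key)):
            yield record['WARC-Type'], 1

    def reducer(self, key, values):
        yield key, sum(values)

if __name__ == '__main__':
    MRCommonCrawlJob.run()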
@Smerity
Smerity / calc_size.py
Created August 8, 2014 23:28
Estimating the average page size of text/html in a portion of the Common Crawl July 2014 dataset
import sys
###
from gzipstream import GzipStreamFile
import warc
if __name__ == '__main__':
    k = open('CC-MAIN-20140728011800-00009-ip-10-146-231-18.ec2.internal.warc.gz')
    f = warc.WARCFile(fileobj=GzipStreamFile(k))
    html_sizes = []
    for num, record in enumerate(f):
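        # The gist preview ends at the line above. A hedged sketch of how the
        # loop body might continue; the filtering and reporting below are
        # assumptions, not the gist's actual code.
        if record['WARC-Type'] != 'response':
            continue
        payload = record.payload.read()
        headers, _, body = payload.partition('\r\n\r\n')
        if 'content-type: text/html' in headers.lower():
            html_sizes.append(len(body))
    print('Average text/html size: %.1f bytes over %d responses' %
          (sum(html_sizes) / float(len(html_sizes)), len(html_sizes)))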
@Smerity
Smerity / gist:2704d3d65aa191ff5f27
Last active May 1, 2017 19:45
About the data

Data Location

The Common Crawl dataset lives on Amazon S3 as part of the Amazon Public Datasets program. Downloading the data is free from any instance on Amazon EC2, via both S3 and HTTP; a minimal listing sketch follows the list of crawl locations below.

As the Common Crawl Foundation has evolved over the years, so have the format and the metadata that accompany the crawls themselves.

  • [ARC] Archived Crawl #1 - s3://aws-publicdatasets/common-crawl/crawl-001/ - crawl data from 2008/2010
  • [ARC] Archived Crawl #2 - s3://aws-publicdatasets/common-crawl/crawl-002/ - crawl data from 2009/2010
  • [ARC] Archived Crawl #3 - s3://aws-publicdatasets/common-crawl/parse-output/ - crawl data from 2012
  • [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2013-20/
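
As mentioned above, a minimal sketch of listing the WARC files under one of these prefixes with boto, assuming anonymous access to the aws-publicdatasets bucket (the prefix and the cut-off are illustrative):

import boto

conn = boto.connect_s3(anon=True)
bucket = conn.get_bucket('aws-publicdatasets')

# Print the first thousand keys under the 2013-20 crawl prefix.
for i, key in enumerate(bucket.list(prefix='common-crawl/crawl-data/CC-MAIN-2013-20/')):
    if key.name.endswith('.warc.gz'):
        print(key.name)
    if i >= 1000:
        break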
@Smerity
Smerity / stream_warc.py
Created July 22, 2014 22:51
WARC stream decompress
import boto
from boto.s3.key import Key
import zlib
def stream_decompress_multi(stream):
    dec = zlib.decompressobj(16 + zlib.MAX_WBITS)
    while True:
        chunk = stream.read(1024 * 8)
        if not chunk:
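            # The gist preview stops at the line above; the rest of this
            # generator is a hedged completion, not the gist's actual code.
            break
        out = dec.decompress(chunk)
        if out:
            yield out
        # Common Crawl WARC files are many gzip members concatenated together,
        # so reset the decompressor whenever one member ends mid-stream.
        while dec.unused_data:
            leftover = dec.unused_data
            dec = zlib.decompressobj(16 + zlib.MAX_WBITS)
            out = dec.decompress(leftover)
            if out:
                yield out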
@Smerity
Smerity / intro.py
Created July 18, 2014 20:00
Preview of the "Introduction to the Common Crawl dataset"
import re
#
from collections import Counter
from glob import glob
from urlparse import urlparse
#
import warc
# Extract the names and total usage count of all the opening HTML tags in the document
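
The preview stops at the comment above. A minimal sketch of the tag counting that comment describes, relying on the re and Counter imports shown (the regex and helper are illustrative, not the gist's actual code):

# Matches the tag name of opening tags only; closing tags ('</a>') and
# comments are skipped since '/' and '!' fail the first character class.
TAG_RE = re.compile(r'<([a-zA-Z][a-zA-Z0-9]*)')

def get_tag_count(data, tag_counts=None):
    # Tally opening HTML tag names in a document, e.g. {'a': 42, 'div': 18}.
    if tag_counts is None:
        tag_counts = Counter()
    tag_counts.update(tag.lower() for tag in TAG_RE.findall(data))
    return tag_counts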
@Smerity
Smerity / warc
Created July 16, 2014 18:52
Example of WARC S3 listing
common-crawl/crawl-data/CC-MAIN-2014-15/segments/1397609521512.15/warc/CC-MAIN-20140416005201-00000-ip-10-147-4-33.ec2.internal.warc.gz
common-crawl/crawl-data/CC-MAIN-2014-15/segments/1397609521512.15/warc/CC-MAIN-20140416005201-00001-ip-10-147-4-33.ec2.internal.warc.gz
common-crawl/crawl-data/CC-MAIN-2014-15/segments/1397609521512.15/warc/CC-MAIN-20140416005201-00002-ip-10-147-4-33.ec2.internal.warc.gz
common-crawl/crawl-data/CC-MAIN-2014-15/segments/1397609521512.15/warc/CC-MAIN-20140416005201-00003-ip-10-147-4-33.ec2.internal.warc.gz
common-crawl/crawl-data/CC-MAIN-2014-15/segments/1397609521512.15/warc/CC-MAIN-20140416005201-00004-ip-10-147-4-33.ec2.internal.warc.gz
common-crawl/crawl-data/CC-MAIN-2014-15/segments/1397609521512.15/warc/CC-MAIN-20140416005201-00005-ip-10-147-4-33.ec2.internal.warc.gz
common-crawl/crawl-data/CC-MAIN-2014-15/segments/1397609521512.15/warc/CC-MAIN-20140416005201-00006-ip-10-147-4-33.ec2.internal.warc.gz
common-crawl/crawl-data/CC-MAIN-2014-15/segments/1397609521512.15/warc/C
@Smerity
Smerity / keybase.md
Created July 9, 2014 00:35
Keybase Proof

Keybase proof

I hereby claim:

  • I am smerity on github.
  • I am smerity (https://keybase.io/smerity) on keybase.
  • I have a public key whose fingerprint is 56A2 5996 3078 B205 1053 883A 6615 0186 B74F 858B

To claim this, I am signing this object: