
Stephen Merity (Smerity)

@Smerity
Smerity / gist:3c79ff41d154053f014a
Created September 26, 2014 23:20
Common Crawl Server List (single WARC file) -- from https://github.com/commoncrawl/cc-mrjob
"Apache/2.2.11" 35
"nginx/1.0.5" 35
"Nginx" 35
"Apache/2.2.27 (Unix) mod_ssl/2.2.27 OpenSSL/0.9.8e-fips-rhel5 mod_bwlimited/1.4" 36
"eBay Server" 36
"Polyvore Web Server" 36
"WWW" 37
"YTS/1.20.29" 37
"Apache/2.2.15 (Scientific Linux)" 38
"Apache/2.2.22 (Unix)" 39
@Smerity
Smerity / s3cmd.log
Created September 4, 2014 10:47
Testing access to a Common Crawl file via Amazon S3
smerity@pegasus:/tmp$ s3cmd info s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-23/segments/1404776400583.60/warc/CC-MAIN-20140707234000-00023-ip-10-180-212-248.ec2.internal.warc.gz
s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-23/segments/1404776400583.60/warc/CC-MAIN-20140707234000-00023-ip-10-180-212-248.ec2.internal.warc.gz (object):
File size: 765658891
Last mod: Thu, 17 Jul 2014 11:47:18 GMT
MIME type: application/octet-stream
MD5 sum: 5ce4fcbbb18442ce218e75a1d0f6d3dc
ACL: gil: FULL_CONTROL
ACL: *anon*: READ
URL: http://aws-publicdatasets.s3.amazonaws.com/common-crawl/crawl-data/CC-MAIN-2014-23/segments/1404776400583.60/warc/CC-MAIN-20140707234000-00023-ip-10-180-212-248.ec2.internal.warc.gz
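
Since the object is anonymously readable (note the *anon*: READ ACL above), it can also be fetched straight from S3 with boto. A minimal sketch using an anonymous connection and the same key as above:

import boto

# The aws-publicdatasets bucket allows anonymous access.
conn = boto.connect_s3(anon=True)
bucket = conn.get_bucket('aws-publicdatasets')
key = bucket.get_key('common-crawl/crawl-data/CC-MAIN-2014-23/segments/'
                     '1404776400583.60/warc/'
                     'CC-MAIN-20140707234000-00023-ip-10-180-212-248.ec2.internal.warc.gz')
print(key.size)  # should match the 765658891 bytes reported by s3cmd above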
@Smerity
Smerity / gist:ab4c2e8e87cb715e1f0b
Created August 14, 2014 18:02
Open Graph tags
<meta name="description" content="How do you design a property so that it compliments its surrounding so perfectly that it&#039;s virtually invisible to the naked eye? 2014&#039;s Australian House of the Year did just that."/>
<link rel="canonical" href="http://www.realestate.com.au/blog/invisible-house-takes-australian-house-year-australian-house-year/" />
<link rel="publisher" href="https://plus.google.com/+realestatecomau"/>
<meta property="og:locale" content="en_US" />
<meta property="og:type" content="article" />
<meta property="og:title" content="Australian house of the year is ... invisible?" />
<meta property="og:description" content="How do you design a property so that it compliments its surrounding so perfectly that it&#039;s virtually invisible to the naked eye? 2014&#039;s Australian House of the Year did just that." />
<meta property="og:url" content="http://www.realestate.com.au/blog/invisible-house-takes-australian-house-year-australian-house-year/" />
<meta property="og:site_name" content="Th
@Smerity
Smerity / mrcc.py
Last active August 29, 2015 14:05
mrjob example for Common Crawl
import re
#
from collections import Counter
#
import boto
import warc
#
from boto.s3.key import Key
from gzipstream import GzipStreamFile
from mrjob.job import MRJob
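
The preview above shows only the imports. A minimal sketch of how such an mrjob job over Common Crawl WARC files might continue, relying on the imports shown above (the class name and the mapper/reducer bodies are illustrative, not the gist's actual code):

class MRCommonCrawlJob(MRJob):
    def mapper(self, _, line):
        # Each input line names one WARC file in the aws-publicdatasets bucket.
        conn = boto.connect_s3(anon=True)
        bucket = conn.get_bucket('aws-publicdatasets')
        key = Key(bucket, line.strip())
        # Stream and decompress the WARC file without saving it to disk.
        for record in warc.WARCFile(fileobj=GzipStreamFile(key)):
            yield record['WARC-Type'], 1

    def reducer(self, key, values):
        yield key, sum(values)

if __name__ == '__main__':
    MRCommonCrawlJob.run()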
@Smerity
Smerity / calc_size.py
Created August 8, 2014 23:28
Estimating the average page size of text/html in a portion of the Common Crawl July 2014 dataset
import sys
###
from gzipstream import GzipStreamFile
import warc
if __name__ == '__main__':
    k = open('CC-MAIN-20140728011800-00009-ip-10-146-231-18.ec2.internal.warc.gz')
    f = warc.WARCFile(fileobj=GzipStreamFile(k))
    html_sizes = []
    for num, record in enumerate(f):
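        # The gist preview ends at the line above. A hedged sketch of how the
        # loop body might continue; the filtering and reporting below are
        # assumptions, not the gist's actual code.
        if record['WARC-Type'] != 'response':
            continue
        payload = record.payload.read()
        headers, _, body = payload.partition('\r\n\r\n')
        if 'content-type: text/html' in headers.lower():
            html_sizes.append(len(body))
    print('Average text/html size: %.1f bytes over %d responses' %
          (sum(html_sizes) / float(len(html_sizes)), len(html_sizes)))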
@Smerity
Smerity / gist:2704d3d65aa191ff5f27
Last active May 1, 2017 19:45
About the data

Data Location

The Common Crawl dataset lives on Amazon S3 as part of the Amazon Public Datasets program. Downloading the data is free from any instance on Amazon EC2, via both S3 and HTTP; a minimal listing sketch follows the list of crawl locations below.

As the Common Crawl Foundation has evolved over the years, so have the format and the metadata that accompany the crawls themselves.

  • [ARC] Archived Crawl #1 - s3://aws-publicdatasets/common-crawl/crawl-001/ - crawl data from 2008/2010
  • [ARC] Archived Crawl #2 - s3://aws-publicdatasets/common-crawl/crawl-002/ - crawl data from 2009/2010
  • [ARC] Archived Crawl #3 - s3://aws-publicdatasets/common-crawl/parse-output/ - crawl data from 2012
  • [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2013-20/
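
As mentioned above, a minimal sketch of listing the WARC files under one of these prefixes with boto, assuming anonymous access to the aws-publicdatasets bucket (the prefix and the cut-off are illustrative):

import boto

conn = boto.connect_s3(anon=True)
bucket = conn.get_bucket('aws-publicdatasets')

# Print the first thousand keys under the 2013-20 crawl prefix.
for i, key in enumerate(bucket.list(prefix='common-crawl/crawl-data/CC-MAIN-2013-20/')):
    if key.name.endswith('.warc.gz'):
        print(key.name)
    if i >= 1000:
        break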
@Smerity
Smerity / stream_warc.py
Created July 22, 2014 22:51
WARC stream decompress
import boto
from boto.s3.key import Key
import zlib
def stream_decompress_multi(stream):
    dec = zlib.decompressobj(16 + zlib.MAX_WBITS)
    while True:
        chunk = stream.read(1024 * 8)
        if not chunk:
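            # The gist preview stops at the line above; the rest of this
            # generator is a hedged completion, not the gist's actual code.
            break
        out = dec.decompress(chunk)
        if out:
            yield out
        # Common Crawl WARC files are many gzip members concatenated together,
        # so reset the decompressor whenever one member ends mid-stream.
        while dec.unused_data:
            leftover = dec.unused_data
            dec = zlib.decompressobj(16 + zlib.MAX_WBITS)
            out = dec.decompress(leftover)
            if out:
                yield out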
@Smerity
Smerity / intro.py
Created July 18, 2014 20:00
Preview of the "Introduction to the Common Crawl dataset"
import re
#
from collections import Counter
from glob import glob
from urlparse import urlparse
#
import warc
# Extract the names and total usage count of all the opening HTML tags in the document
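
The preview stops at the comment above. A minimal sketch of the tag counting that comment describes, relying on the re and Counter imports shown (the regex and helper are illustrative, not the gist's actual code):

# Matches the tag name of opening tags only; closing tags ('</a>') and
# comments are skipped since '/' and '!' fail the first character class.
TAG_RE = re.compile(r'<([a-zA-Z][a-zA-Z0-9]*)')

def get_tag_count(data, tag_counts=None):
    # Tally opening HTML tag names in a document, e.g. {'a': 42, 'div': 18}.
    if tag_counts is None:
        tag_counts = Counter()
    tag_counts.update(tag.lower() for tag in TAG_RE.findall(data))
    return tag_counts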
@Smerity
Smerity / warc
Created July 16, 2014 18:52
Example of WARC S3 listing
common-crawl/crawl-data/CC-MAIN-2014-15/segments/1397609521512.15/warc/CC-MAIN-20140416005201-00000-ip-10-147-4-33.ec2.internal.warc.gz
common-crawl/crawl-data/CC-MAIN-2014-15/segments/1397609521512.15/warc/CC-MAIN-20140416005201-00001-ip-10-147-4-33.ec2.internal.warc.gz
common-crawl/crawl-data/CC-MAIN-2014-15/segments/1397609521512.15/warc/CC-MAIN-20140416005201-00002-ip-10-147-4-33.ec2.internal.warc.gz
common-crawl/crawl-data/CC-MAIN-2014-15/segments/1397609521512.15/warc/CC-MAIN-20140416005201-00003-ip-10-147-4-33.ec2.internal.warc.gz
common-crawl/crawl-data/CC-MAIN-2014-15/segments/1397609521512.15/warc/CC-MAIN-20140416005201-00004-ip-10-147-4-33.ec2.internal.warc.gz
common-crawl/crawl-data/CC-MAIN-2014-15/segments/1397609521512.15/warc/CC-MAIN-20140416005201-00005-ip-10-147-4-33.ec2.internal.warc.gz
common-crawl/crawl-data/CC-MAIN-2014-15/segments/1397609521512.15/warc/CC-MAIN-20140416005201-00006-ip-10-147-4-33.ec2.internal.warc.gz
common-crawl/crawl-data/CC-MAIN-2014-15/segments/1397609521512.15/warc/C
@Smerity
Smerity / keybase.md
Created July 9, 2014 00:35
Keybase Proof

Keybase proof

I hereby claim:

  • I am smerity on github.
  • I am smerity (https://keybase.io/smerity) on keybase.
  • I have a public key whose fingerprint is 56A2 5996 3078 B205 1053 883A 6615 0186 B74F 858B

To claim this, I am signing this object: