A Common Crawl Experiment

Introduction

At my company, we are building infrastructure that enables us to perform computations involving large bodies of text data.

To get familiar with the tech involved, I started with a simple experiment: using the Common Crawl metadata corpus, count crawled URLs grouped by top-level domain (TLD).

Here is a sample of what output from this "TLD query" might look like:

com     19298
org     2595
net     2228
uk      1674
de      1227
...

It's not a very exciting query. To be blunt, it's a pretty boring one. But that's the point: this is new ground for me, so I started simple and captured what I learned.

This gist is a fully fledged Git repo with all the code necessary to run this query, so if you want to play with the code yourself, go ahead and clone this thing. You'll also need to install Leiningen.

OS X Homebrew users, here's a shortcut for you: brew install --devel leiningen

Results

I've downloaded the job output files (part-00000 ... part-00261) and sorted them by TLD counts:

cat part* | sort -k2nr > results.txt

Here are the top 10 results:

com     2673598481
org     304265520
net     267384802
de      205656928
uk      168846878
pl      82716496
ru      76587934
info    73012083
nl      63865755
fr      57607765

All results are here.

Hadoop Cluster Details

I ran this query on an Amazon EMR cluster of 50 cc1.4xlarge machines, which took about 80 minutes to execute against the entire corpus of Common Crawl metadata. Overall, the query saw 4,781,766,325 crawled URLs that had domain names with valid TLDs.

I was also able to confirm that simple linear equations can predict how long a job will run based on the results of the same job run against a smaller data corpus and with fewer data processing nodes. I ran it against 1 Common Crawl segment, then against 10 segments, and finally against all 56 segments.
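
For the curious, here is a minimal sketch of the kind of linear extrapolation I mean. The function name and the numbers in the example are hypothetical, purely for illustration; they are not measurements from this experiment:

(defn predict-minutes
  "Rough estimate of a full run's duration from a smaller trial run,
   assuming run time scales linearly with data size and inversely
   with the number of core nodes."
  [trial-minutes trial-segments trial-nodes target-segments target-nodes]
  (* trial-minutes
     (/ target-segments trial-segments)
     (/ trial-nodes target-nodes)))

;; e.g. a hypothetical 30-minute trial over 10 segments on 10 nodes,
;; extrapolated to 56 segments on 50 nodes:
;; (predict-minutes 30.0 10 10 56 50)  ;=> ~33.6 minutes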

Here are the monitoring graphs for the time period when the cluster was active:

Here's the command to launch an EMR cluster similar to the one described here (using the Amazon EMR Command Line Interface):

./elastic-mapreduce --create --name "CommonCrawl (TLD: all segments)" \
--jar s3://your.bucket/ccooo-standalone.jar \
--args 'ccooo.commoncrawl.TldQueryExe,s3://aws-publicdatasets/common-crawl/parse-output/segment,s3://your.bucket/tlds/all' \
--instance-group master --instance-type m2.2xlarge  --instance-count 1  --bid-price 0.111 \
--instance-group core   --instance-type cc1.4xlarge --instance-count 50 --bid-price 0.301 \
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configurations/latest/memory-intensive \
--availability-zone us-east-1b

NOTE: the master node needs to be configured with the memory-intensive bootstrap script because various JVM processes need a lot of heap space to deal with 245,000+ Hadoop tasks.

I picked the us-east-1b zone because spot prices looked lowest there at the time.

cc1.4xlarge nodes were chosen because, well, they are badass! Specifically, they have 10G Ethernet NICs, which is good when you're about to slurp gobs of data from S3.

Spot Instances Are Awesome

The cluster described above consisted of 1 m2.2xlarge master node, spot priced at $0.07 per hour plus an EMR hourly fee of $0.21, and 50 cc1.4xlarge core nodes, spot priced at $0.21 per hour plus an EMR hourly fee of $0.27.

For 80 minutes of run time (rounded up to 2 billed hours), this works out to:

cost = 2 * (1 * (0.07 + 0.21) + 50 * (0.21 + 0.27)) = $48.56

Not too expensive! Let me know if I botched my math. I haven't seen a drastic slope increase in our AWS billing curve, so I can't be that far off.

Code

I've implemented the TLD Query using Cascalog, which strikes me as one of the most eloquent ways to express Hadoop map/reduce logic.

Let's dive in.

Common Crawl metadata files are organized into segment directories, each containing several thousand files named metadata-NNNNN. As it turns out, not all segments are ready for consumption. If you naively use a glob like s3://aws-publicdatasets/common-crawl/parse-output/segment/*/metadata*, your Hadoop job will crash when it tries to iterate/read the S3 objects matched by that glob, because some objects are not public. This manifests itself as a ... Status Code: 403, AWS Service: Amazon S3 ... error.

The Common Crawl team publishes a file containing the segment IDs that are valid. The rationale for this approach is discussed here.

Lesson 1. Use a hand-crafted glob with valid segments only. The metadata-path function, shown below, produces such a glob based on the hard-coded vector valid-segments.

(def valid-segments [1346823845675 ... 54 more elided ... 1346981172268])

(defn ^String metadata-path
  "Produces the glob of paths to metadata files organized
   according to CommonCrawl docs: http://tinyurl.com/common-crawl-about-dataset
   Here's an example (assuming valid segments are 1, 2 and 3):
   (metadata-path 's3://path/to/commoncrawl/segments')
   => 's3://path/to/commoncrawl/segments/{1,2,3}/metadata*'"
  [prefix]
  (->> valid-segments
       (s/join ",")
       (format "%s/{%s}/metadata*" prefix)))

The resulting glob will look as follows (assuming you specified the prefix s3://aws-publicdatasets/common-crawl/parse-output/segment):

"s3://aws-publicdatasets/common-crawl/parse-output/segment/{seg1,seg2,seg3,...}/metadata*"

Moving on.

Each metadata file contains a sequence of records, each comprising two items: a crawled URL and metadata JSON. This query does not care about the metadata JSON; it only cares about the hostname in each URL. The following bits of Clojure take care of parsing the TLD out of the URL's hostname (assuming it's a domain name).
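
For example, a single record looks roughly like this pair (mirroring the test fixture at the end of this gist; the JSON body is elided here):

["http://abc.example.com/123" "{...metadata JSON...}"]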

The valid-tld set will be used by the query to check whether a string is in fact a TLD.

Look at the test code to see parse-tld in action.

;; A set of valid TLDs obtained from:
;; http://data.iana.org/TLD/tlds-alpha-by-domain.txt
(def valid-tld #{"ac" ... "com" ... "zw"})

(defn ^String parse-hostname
  "Extracts the hostname from the specified URL."
  [^Text url]
  (-> url str URI. .getHost (or "")))


(defn ^String parse-tld
  "Returns a piece of the URL's hostname after the last '.'
   Note: this may or may not be an actual TLD; we'll validate
   elsewhere."
  [^Text url]
  (-> url parse-hostname (s/split #"\.") peek))

Now let's look at the Cascalog query, query-tlds that does all the work.

The bit right after <- specifies that the query produces a sequence of tuples [?tld ?n] (e.g. [com 3], [uk 77], [net 2]).

Then it specifies that it only cares about ?url values coming out of the source tap metadata-tap (remember: each metadata tuple consists of two values, the crawled URL and the metadata JSON; the second value is not important, so we indicate that with the _ placeholder).

Each ?url value is parsed into a ?tld value using the parse-tld function we discussed earlier.

The predicate (valid-tld ?tld) checks if ?tld is valid. If the result is nil, the ?url in question is not counted.

The aggregator (c/count :> ?n) counts crawled ?url-s grouped by ?tld.

When your Hadoop job is about 73% through several billion URLs, you wouldn't want it to be aborted because of some malformed tuples, would you? That's why (:trap ...) is specified: it traps tuple values that cause exceptions and sends them to the trap-tap sink.

(defn query-tlds
  "Counts site URLs from the metadata corpus grouped by TLD of each URL."
  [metadata-tap trap-tap]
  (<- [?tld ?n]
      (metadata-tap :> ?url _)
      (parse-tld ?url :> ?tld)
      (valid-tld ?tld)
      (c/count :> ?n)
      (:trap trap-tap)))
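
If you want to poke at the query without a cluster, something like this at the REPL should work. This is a minimal sketch: it uses an in-memory vector of tuples as the source, just as the test code does, and a hypothetical /tmp path for the trap:

(?- (stdout)
    (query-tlds [["http://abc.example.com/123" "{}"]
                 ["https://wow.co.uk/hello"    "{}"]]
                (hfs-seqfile "/tmp/tld-query-trap")))
;; prints something like:
;; com    1
;; uk     1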

All Hadoop applications need to expose one or more entry-point classes with a main method that defines the pertinent job details. In Cascalog, this is done without much fuss using defmain.

The tricky part: I discovered I couldn't use the default hfs-seqfile source tap for processing Common Crawl metadata files. It turns out Cascalog's hfs-seqfile creates a SequenceFile Cascading tap, and Cascading sequence files store a sequence of serialized Cascading Tuple objects, whereas Common Crawl sequence files store a sequence of serialized key/value pairs (both key and value are of the org.apache.hadoop.io Text type). So initially this query crashed with serialization errors as soon as it touched Common Crawl data. I found that I needed to use a WritableSequenceFile source tap instead. Handily, the more-taps library already provides such a tap-creation helper: hfs-wrtseqfile. And this was Lesson 2.

(defmain TldQueryExe
  "Defines 'main' method that will execute our query."
  [prefix output-dir]
  (let [metadata-tap (hfs-wrtseqfile (metadata-path prefix) Text Text :outfields ["key" "value"])
        trap-tap (hfs-seqfile (str output-dir ".trap"))]
    (?- (hfs-textline output-dir)
        (query-tlds metadata-tap trap-tap))))

You may invoke the TldQueryExe entry point from the command line like so:

hadoop jar ccooo-standalone.jar ccooo.commoncrawl.TldQueryExe \
 s3://aws-publicdatasets/common-crawl/parse-output/segment \
 s3://bucket-you-own/some/path

FYI, malformed tuples will automatically be written to s3://bucket-you-own/some/path.trap thanks to the trap-tap (hfs-seqfile (str output-dir ".trap")) incantation above.

To run locally, you may want to download a couple of metadata files and then use local file system paths instead of s3:// URLs. Say you have one metadata file at this path: ~/Downloads/1346823845675/metadata-00000, and you want to write results to ./tld-results; the command-line invocation would then be:

hadoop jar ccooo-standalone.jar ccooo.commoncrawl.TldQueryExe  ~/Downloads ./tld-results
(ns ccooo.commoncrawl
  (:require [clojure.string :as s]
            [cascalog.ops :as c])
  (:use cascalog.api
        [cascalog.more-taps :only (hfs-wrtseqfile)])
  (:import [org.apache.hadoop.io Text]
           [org.apache.commons.httpclient URI]))
;; Discussion about valid segments:
;; https://groups.google.com/forum/#!topic/common-crawl/QYTmnttZZyo/discussion
;;
;; I was led to this discussion because when naively globbing across
;; all segments, my hadoop job failed with: 2012-10-10 21:44:09,895
;; INFO org.apache.hadoop.io.retry.RetryInvocationHandler
;; (pool-1-thread-1): Exception while invoking retrieveMetadata of
;; class org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.
;; Not retrying.Status Code: 403, AWS Service: Amazon S3, AWS Request
;; ID: ..., AWS Error Code: null, AWS Error Message: Forbidden, S3
;; Extended Request ID: ...
;;
;; I found this discussion related to such errors:
;; https://groups.google.com/d/topic/common-crawl/e07aej71GLU/discussion
;; Obtained from:
;; https://s3.amazonaws.com/aws-publicdatasets/common-crawl/parse-output/valid_segments.txt
(def valid-segments [1346823845675 1346823846036 1346823846039 1346823846110 1346823846125 1346823846150 1346823846176 1346876860445 1346876860454 1346876860467 1346876860493 1346876860565 1346876860567 1346876860596 1346876860609 1346876860611 1346876860614 1346876860648 1346876860765 1346876860767 1346876860774 1346876860777 1346876860779 1346876860782 1346876860786 1346876860789 1346876860791 1346876860795 1346876860798 1346876860804 1346876860807 1346876860817 1346876860819 1346876860828 1346876860835 1346876860838 1346876860840 1346876860843 1346876860877 1346981172137 1346981172142 1346981172155 1346981172184 1346981172186 1346981172229 1346981172231 1346981172234 1346981172239 1346981172250 1346981172253 1346981172255 1346981172258 1346981172261 1346981172264 1346981172266 1346981172268])
;; A set of valid TLDs obtained from:
;; http://data.iana.org/TLD/tlds-alpha-by-domain.txt
(def valid-tld #{"ac" "ad" "ae" "aero" "af" "ag" "ai" "al" "am" "an" "ao" "aq" "ar" "arpa" "as" "asia" "at" "au" "aw" "ax" "az" "ba" "bb" "bd" "be" "bf" "bg" "bh" "bi" "biz" "bj" "bm" "bn" "bo" "br" "bs" "bt" "bv" "bw" "by" "bz" "ca" "cat" "cc" "cd" "cf" "cg" "ch" "ci" "ck" "cl" "cm" "cn" "co" "com" "coop" "cr" "cu" "cv" "cw" "cx" "cy" "cz" "de" "dj" "dk" "dm" "do" "dz" "ec" "edu" "ee" "eg" "er" "es" "et" "eu" "fi" "fj" "fk" "fm" "fo" "fr" "ga" "gb" "gd" "ge" "gf" "gg" "gh" "gi" "gl" "gm" "gn" "gov" "gp" "gq" "gr" "gs" "gt" "gu" "gw" "gy" "hk" "hm" "hn" "hr" "ht" "hu" "id" "ie" "il" "im" "in" "info" "int" "io" "iq" "ir" "is" "it" "je" "jm" "jo" "jobs" "jp" "ke" "kg" "kh" "ki" "km" "kn" "kp" "kr" "kw" "ky" "kz" "la" "lb" "lc" "li" "lk" "lr" "ls" "lt" "lu" "lv" "ly" "ma" "mc" "md" "me" "mg" "mh" "mil" "mk" "ml" "mm" "mn" "mo" "mobi" "mp" "mq" "mr" "ms" "mt" "mu" "museum" "mv" "mw" "mx" "my" "mz" "na" "name" "nc" "ne" "net" "nf" "ng" "ni" "nl" "no" "np" "nr" "nu" "nz" "om" "org" "pa" "pe" "pf" "pg" "ph" "pk" "pl" "pm" "pn" "post" "pr" "pro" "ps" "pt" "pw" "py" "qa" "re" "ro" "rs" "ru" "rw" "sa" "sb" "sc" "sd" "se" "sg" "sh" "si" "sj" "sk" "sl" "sm" "sn" "so" "sr" "st" "su" "sv" "sx" "sy" "sz" "tc" "td" "tel" "tf" "tg" "th" "tj" "tk" "tl" "tm" "tn" "to" "tp" "tr" "travel" "tt" "tv" "tw" "tz" "ua" "ug" "uk" "us" "uy" "uz" "va" "vc" "ve" "vg" "vi" "vn" "vu" "wf" "ws" "xn--0zwm56d" "xn--11b5bs3a9aj6g" "xn--3e0b707e" "xn--45brj9c" "xn--80akhbyknj4f" "xn--80ao21a" "xn--90a3ac" "xn--9t4b11yi5a" "xn--clchc0ea0b2g2a9gcd" "xn--deba0ad" "xn--fiqs8s" "xn--fiqz9s" "xn--fpcrj9c3d" "xn--fzc2c9e2c" "xn--g6w251d" "xn--gecrj9c" "xn--h2brj9c" "xn--hgbk6aj7f53bba" "xn--hlcj6aya9esc7a" "xn--j6w193g" "xn--jxalpdlp" "xn--kgbechtv" "xn--kprw13d" "xn--kpry57d" "xn--lgbbat1ad8j" "xn--mgb9awbf" "xn--mgbaam7a8h" "xn--mgbayh7gpa" "xn--mgbbh1a71e" "xn--mgbc0a9azcg" "xn--mgberp4a5d4ar" "xn--mgbx4cd0ab" "xn--o3cw4h" "xn--ogbpf8fl" "xn--p1ai" "xn--pgbs0dh" "xn--s9brj9c" "xn--wgbh1c" "xn--wgbl6a" "xn--xkc2al3hye2a" "xn--xkc2dl3a5ee0h" "xn--yfro4i67o" "xn--ygbi2ammx" "xn--zckzah" "xxx" "ye" "yt" "za" "zm" "zw"})
(defn ^String metadata-path
  "Produces the glob of paths to metadata files organized
   according to CommonCrawl docs: http://tinyurl.com/common-crawl-about-dataset
   Here's an example (assuming valid segments are 1, 2 and 3):
   (metadata-path 's3://path/to/commoncrawl/segments')
   => 's3://path/to/commoncrawl/segments/{1,2,3}/metadata*'"
  [prefix]
  (->> valid-segments
       (s/join ",")
       (format "%s/{%s}/metadata*" prefix)))
(defn ^String parse-hostname
  "Extracts the hostname from the specified URL."
  [^Text url]
  (-> url str URI. .getHost (or "")))
(defn ^String parse-tld
  "Returns a piece of the URL's hostname after the last '.'
   Note: this may or may not be an actual TLD; we'll validate
   elsewhere."
  [^Text url]
  (-> url parse-hostname (s/split #"\.") peek))
(defn query-tlds
  "Counts site URLs from the metadata corpus grouped by TLD of each URL."
  [metadata-tap trap-tap]
  (<- [?tld ?n]
      (metadata-tap :> ?url _)
      (parse-tld ?url :> ?tld)
      (valid-tld ?tld)
      (c/count :> ?n)
      (:trap trap-tap)))
(defmain TldQueryExe
  "Defines 'main' method that will execute our query."
  [prefix output-dir]
  (let [metadata-tap (hfs-wrtseqfile (metadata-path prefix)
                                     Text Text
                                     :outfields ["key" "value"])
        trap-tap (hfs-seqfile (str output-dir ".trap"))]
    (?- (hfs-textline output-dir)
        (query-tlds metadata-tap trap-tap))))
# Collect a list of all Clojure source code file names.
sources := $(shell find . -type f -name "*.clj" -not -wholename "*/target/*")

target/ccooo-standalone.jar: $(sources)
	lein do compile, uberjar

.PHONY: test
test:
	lein do compile, midje
(defproject org.clojars.paxan/ccooo "0.0.1-SNAPSHOT"
  :description "Common Crawl One Oh One (CCOOO)"
  :jar-name "ccooo.jar"
  :uberjar-name "ccooo-standalone.jar"
  :dependencies [[cascalog "1.10.0"]
                 [cascalog-more-taps "0.3.0"]
                 [org.clojure/data.json "0.1.3"]
                 [commons-httpclient "3.0.1"]]
  :profiles {:dev
             {:dependencies [[org.apache.hadoop/hadoop-core "1.0.3"]
                             [midje "1.4.0"]
                             [midje-cascalog "0.4.0" :exclusions [org.clojure/clojure]]]}}
  :aot [ccooo.commoncrawl]
  :jvm-opts ["-Xmx1024m" "-server"])
com 2673598481
org 304265520
net 267384802
de 205656928
uk 168846878
pl 82716496
ru 76587934
info 73012083
nl 63865755
fr 57607765
it 57169697
jp 51352485
au 47644772
br 38397470
ca 34885440
edu 33770216
cz 31919454
es 30234103
ro 24314778
ch 23773429
se 23523429
cn 22098310
eu 21176891
us 20979439
biz 19670384
ua 18765163
be 17547133
dk 17237868
at 16700746
hu 15915580
gov 13441484
tv 11257982
nz 10990233
no 10720914
ar 10643506
in 10105676
fi 8057061
za 7807622
tk 6972693
sk 6711391
mx 6649092
tw 5633140
ie 5568317
tr 5339761
cc 5296120
gr 5269706
pt 4917728
kr 4664994
me 4466363
cl 4268607
lt 4211638
ws 3839074
il 3809111
vn 3491126
nu 3198196
fm 3113789
ir 3005073
hr 2972707
id 2915921
lv 2644739
si 2632041
sg 2517856
my 2480529
ee 2289856
hk 2223043
cat 2222285
co 2168680
name 1956669
ph 1806710
bg 1690155
rs 1687111
is 1623369
by 1551484
mobi 1480094
pe 1457104
su 1376915
to 1322537
th 1200424
kz 966996
lu 946862
pk 788239
ve 755158
md 752997
am 742657
uy 740998
travel 713272
int 593946
ae 533864
asia 499624
ec 488155
ba 472716
bz 465933
sa 459688
ma 414386
mil 393451
cr 391149
cx 375813
pro 354241
li 346502
eg 345288
la 336107
ge 334942
do 333321
np 329747
st 326570
mk 322710
cd 319637
lk 317232
coop 288483
uz 285619
az 283443
im 258522
py 249463
io 234936
cu 234508
mn 227103
ly 221035
vu 214927
hn 206804
ag 206739
ms 205234
gt 201512
ni 184440
bo 179374
tc 177118
cy 173733
pa 172407
ke 172375
tl 169478
ps 159170
aero 140006
mu 139950
ac 138494
bd 137211
jobs 136030
mt 134373
lb 127906
tel 127160
kg 124102
sv 120203
tn 113409
gd 108777
gs 107345
tt 98235
jo 96747
as 94761
ng 94631
re 93903
vg 92082
sc 90877
vc 90865
gg 89063
mz 82784
dj 79888
sh 79521
nr 79145
ug 78194
pr 71365
al 70276
tm 70163
museum 69595
sm 69013
jm 64123
gl 62177
va 61439
mo 60957
dz 59240
tz 57658
gm 56658
ad 55421
kh 54710
kw 53366
bm 52620
fo 51881
na 51738
nf 47086
qa 47069
om 45477
nc 45195
ky 44865
tj 44394
fj 43361
mp 43108
mc 42636
cm 42248
pg 41567
zw 41064
af 40424
pf 39768
sn 35878
lc 34734
mv 34301
sy 34067
bw 32829
ax 32473
gh 32253
je 31861
bn 30730
gp 30691
tf 29246
mg 28988
bs 26639
bh 25251
bt 24050
bb 24012
ao 23557
vi 20809
ci 20759
ht 20093
et 19840
gi 19441
zm 18859
ne 18770
ai 18131
an 16804
sd 16648
bf 15792
rw 15670
sl 15336
gy 15196
xn--p1ai 15036
mr 14426
ki 14247
sr 13590
mw 13021
iq 12683
cv 11521
ls 11458
mm 11282
pn 10651
ck 9083
hm 9032
ye 8738
bi 7941
dm 7384
ml 7265
sz 7048
aw 6521
bj 5822
sb 5494
xxx 5442
tg 4883
gf 4449
aq 3648
so 3638
mq 2755
lr 2686
ga 2641
fk 2249
kn 2137
cg 1778
yt 1431
gn 1330
cf 1112
td 1041
er 1009
gu 906
gq 564
km 469
pw 406
xn--fiqs8s 297
cw 170
gw 83
arpa 77
pm 51
mh 41
tp 37
wf 9
xn--o3cw4h 9
bv 8
post 5
xn--90a3ac 3
xn--j6w193g 3
gb 2
sx 1
xn--mgbaam7a8h 1
(ns ccooo.test-ccooo
  (:use [midje sweet cascalog]
        ccooo.commoncrawl
        cascalog.api
        cascalog.testing)
  (:import [org.apache.hadoop.io Text]))

(fact "Hostname parsing"
  (-> "http://zorg.me/give/me/the/stones?q=foo bar" Text. parse-hostname) => "zorg.me"
  (-> "" Text. parse-hostname) => ""
  (-> "()" Text. parse-hostname) => ""
  (-> "abc" Text. parse-hostname) => ""
  (-> "what://wsx/qaz?q=a%20b" Text. parse-hostname) => "wsx")

(fact "TLD parsing"
  (-> "http://foo.bar.baz/1/2/3.txt" Text. parse-tld) => "baz"
  (-> "https://oh.really.xn--fiqs8s:443/yeah" Text. parse-tld) => "xn--fiqs8s")

(with-expected-sink-sets [nothing-trapped []]
  (let [metadata-tap [["http://abc.example.com/123" "{}"]
                      ["http://foo.bogus/" "{}"] ;; this one should be skipped!
                      ["http://xyz.example.com/890" "{}"]
                      ["https://wow.co.uk/hello" "{}"]]]
    (fact?- "the TLD query works all right"
            [["com" 2]
             ["uk" 1]]
            (query-tlds metadata-tap nothing-trapped))))