@vihari
Created July 4, 2017 12:57
A Perl script to grep the Common Crawl dataset on Amazon S3. It shells out to the AWS CLI, so configure your AWS account (see http://tech.marksblogg.com/petabytes-of-website-data-spark-emr.html) before running it.
#!/usr/bin/perl -w
# regex to search for; slashes are escaped because it is spliced into /.../i below
$query = "www.google.com\\\/maps\\\/embed";
# path to CommonCrawl dataset
$S3_URL = "s3://commoncrawl/crawl-data/CC-MAIN-2017-26/segments/";
$all = `aws s3 ls $S3_URL|perl -ane 'print "\$F[1]\n"'`;
print "Launching search for: $query...\n";
@segs = split(/[\n\s]+/, $all);
$nf=0;
for ($i=0;$i<=$#segs;$i++){
    # list the WARC files of this segment (4th column of `aws s3 ls` output)
    $all = "aws s3 ls $S3_URL"."$segs[$i]"."warc/";
    $all = `$all|perl -ane 'print "\$F[3]\n"'`;
    @seg_files = split(/[\n\s]+/, $all);
    # rough estimate: assumes every segment holds as many files as this one
    $total = ($#segs+1)*($#seg_files+1);
    for ($j=0;$j<=$#seg_files;$j++){
        # $S3_URL and $segs[$i] already end with "/"; index the file list with $j, not $i
        $f = $S3_URL.$segs[$i]."warc/".$seg_files[$j];
        # print "searching $f...\n";
        # stream the file, decompress it, and scan it line by line;
        # \$cur tracks the byte offset, \$prev the offset of the current record
        $cmd = "aws s3 cp $S3_URL".$segs[$i]."warc/$seg_files[$j] -|zcat| perl -ane 'BEGIN{\$prev=0;\$pr_end=0;\$cur=0;\$prev_loc=0} \$cur+=length();if(/$query/i){print \"Doc: $f -- [\$prev-]\$_\"; \$pr_end=1;}if(/^Content-Location:/){\$prev_loc=\$_;} if(/^WARC\\\/1\.0/){\$prev=\$cur; if(\$pr_end>0){print \"Location: \$prev_loc End: \$cur\n\"; \$pr_end=0}}'";
        print "$cmd\n";
        print `$cmd`;
        if ($nf%100 == 0){
            print STDOUT "Progress ~ $nf\/$total\n";
        }
        $nf += 1;
    }
}
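
The per-file scan in the script is packed into a single shell-quoted one-liner, which makes it hard to read and to test. Below is a minimal sketch of the same idea as a standalone Perl subroutine: it walks an already-decompressed WARC stream, tracks the byte offset of the current record, and collects lines that match a pattern. The names scan_warc, record_start and line are hypothetical and not part of the original script, and the sample data is made up for illustration. Unlike the one-liner, this sketch records the offset at the start of the "WARC/1.0" line rather than just after it, and it does not track Content-Location.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Scan a decompressed WARC stream line by line (hypothetical helper).
# Returns one hashref per matching line:
#   record_start => byte offset of the enclosing record's "WARC/1.0" line
#   line         => the matching line with its line ending stripped
sub scan_warc {
    my ($fh, $pattern) = @_;
    my ($offset, $record_start) = (0, 0);
    my @hits;
    while (my $line = <$fh>) {
        # a version line marks the start of a new record
        $record_start = $offset if $line =~ m{^WARC/1\.0};
        # counts bytes as long as the handle has no decoding layer
        $offset += length $line;
        if ($line =~ $pattern) {
            (my $text = $line) =~ s/\r?\n\z//;
            push @hits, { record_start => $record_start, line => $text };
        }
    }
    return @hits;
}

# Made-up two-record sample, read through an in-memory filehandle.
my $sample = join '',
    "WARC/1.0\r\n",
    "WARC-Type: response\r\n",
    "\r\n",
    "<p>nothing here</p>\n",
    "WARC/1.0\r\n",
    "WARC-Type: response\r\n",
    "\r\n",
    "<iframe src=\"https://www.google.com/maps/embed?pb=1\"></iframe>\n";

open my $fh, '<', \$sample or die "cannot open in-memory handle: $!";
my @hits = scan_warc($fh, qr{www\.google\.com/maps/embed}i);
close $fh;

printf "[%d-] %s\n", $_->{record_start}, $_->{line} for @hits;
```

To run this against real data, one option is to pass \*STDIN instead of the in-memory handle and feed the script from the same "aws s3 cp ... - | zcat" pipeline the original uses. Keeping the pattern as a qr// object also avoids the layered backslash escaping that $query needs when it is spliced into a shell command.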