@jonaslejon
Created September 7, 2014 20:00
Commoncrawl worker
#!/bin/sh
#JAVA_HOME=/ebs/jdk1.7.0_09/
export JAVA_HOME=/ebs/jdk1.7.0_09/
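# Segment ID to process, passed as the first argument.
SEGMENT=$1
[ -n "$SEGMENT" ] || { echo "usage: $0 SEGMENT" >&2; exit 1; }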
# List the segment's files (requester-pays bucket) and keep only the metadata entries.
s3cmd --add-header=x-amz-request-payer:requester ls "s3://aws-publicdatasets/common-crawl/parse-output/segment/$SEGMENT/" | grep metadata > "$SEGMENT.txt"
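# Download every listed metadata file; the S3 URL is column 4 of the s3cmd listing.
mkdir -p "metadata/$SEGMENT/"
cd "metadata/$SEGMENT" || exit 1
for a in $(awk '{print $4}' "../../$SEGMENT.txt"); do s3cmd --continue --add-header=x-amz-request-payer:requester get "$a"; done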
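# Stage 1: dump each file as text with "hadoop fs -text" and pipe it through exe2.pyc, one CSV per input file.
mkdir -p "../../stage1exe/$SEGMENT/"
for a in *; do echo "$a"; /ebs/hadoop-0.23.5/bin/hadoop fs -text "$a" 2>&1|python ../../exe2.pyc > "../../stage1exe/$SEGMENT/$a.csv"; done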
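cd ../../ || exit 1
# Merge the per-file CSVs into one file for the segment, then drop the downloaded metadata.
find "stage1exe/$SEGMENT/" -type f -exec cat {} + >> "stagefiles/$SEGMENT.csv"
rm -rf "metadata/$SEGMENT/"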
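cd stagefiles || exit 1
# Deduplicate the rows, then shuffle them into random order.
sort "$SEGMENT.csv" | uniq | shuf > foo
mv foo "$SEGMENT.csv"
# Remove any archive left by an earlier run; bzip2 refuses to overwrite an existing file.
rm -f "$SEGMENT.csv.bz2"
# Read from stdin so wc prints only the count, without the file name.
WC=$(wc -l < "$SEGMENT.csv")
bzip2 "$SEGMENT.csv"
cd ..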
scp -C -P 4010 stagefiles/$SEGMENT.csv.bz2 je@hybo:stagefiles/
echo "Done stage $SEGMENT with $WC lines"|mail -s "Commoncrawl worker done with $SEGMENT (filename $SEGMENT.csv.bz2)" ##redacted##@triop.se
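
The deduplicate, shuffle, and count steps of the script can be tried in isolation. The following is a minimal sketch, not part of the original worker: the function name `dedupe_shuffle` and the sample rows are hypothetical, and it assumes `shuf` and `mktemp` are available (as on a typical GNU/Linux system).

```shell
#!/bin/sh
# Hypothetical helper (not in the original script): deduplicate and shuffle
# a CSV file in place, then print the number of lines that remain.
dedupe_shuffle() {
    file=$1
    tmp=$(mktemp) || return 1
    # sort groups identical lines so uniq can drop the repeats;
    # shuf then puts the surviving lines in random order.
    sort "$file" | uniq | shuf > "$tmp" && mv "$tmp" "$file" || return 1
    # Redirecting into wc makes it print the count alone, with no file name.
    wc -l < "$file" | tr -d ' '
}

# Usage with throwaway sample data (made up for illustration):
sample=$(mktemp)
printf 'a,1\nb,2\na,1\nc,3\n' > "$sample"
count=$(dedupe_shuffle "$sample")
echo "kept $count unique lines"
```

With the four sample rows above, one of which is a duplicate, the count is 3. Writing to a temporary file and renaming it afterwards means the input is never read and overwritten at the same time.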