@samg
Created December 21, 2010 07:02
cat_scrape_data.sh
#!/bin/bash
# Both arguments are required.
if [ -z "$1" ] || [ -z "$2" ]; then
  echo "usage: $0 [scrape_id] [output file]" >&2
  exit 1
fi
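# Marker file; the script only checks that it can be created.
imported_list=~/.imported_weaver_files.txt
# Make the output prefix absolute: the loop below changes into $workarea and
# later deletes it, so a relative path would be written there and lost.
case "$2" in
  /*) big_file=$2 ;;
  *)  big_file=$PWD/$2 ;;
esac
workarea=/tmp/workarea
touch "$imported_list"
if [ ! -f "$imported_list" ] ; then
  echo "can't create $imported_list: refusing to run" >&2
  exit 1
fi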
# Prefix matching every tarball produced for this scrape.
bucket="s3://johnny5-production-scrape-output/j5-$1*"
times_through=0
while true; do
  i=0
  # Once runs of spaces become tabs, the fourth column of `s3cmd ls`
  # is expected to be the object URL.
  for file in $(s3cmd ls "$bucket" | perl -pe "s/ +/\t/g" | cut -f 4); do
    mkdir -p "$workarea"
    cd "$workarea" || exit 1
    echo "untarring $file"
    s3cmd get "$file" - | tar xf -
    echo "concatenating it"
    # One record per page: name, 11 x \035 (GS), page body, 11 x \036 (RS).
    for homepage in $(find . -type f | grep '\w\.\w'); do
      echo -n "$(echo "$homepage" | cut -d/ -f4)"
      echo -ne '\035\035\035\035\035\035\035\035\035\035\035'
      zcat "$homepage"
      echo -ne '\036\036\036\036\036\036\036\036\036\036\036'
    done >> "$big_file-$i.txt"
    i=$((i + 1))
    rm -rf "$workarea"
    echo "w00t. moving on."
    times_through=0
  done
  times_through=$((times_through + 1))
  # Give up after roughly 100 passes (about 50 minutes) with nothing to import.
  if [ "$times_through" = "100" ]; then
    echo "Import done or bot is dead."
    exit
  fi
  echo "no files to import"
  echo "waiting 30 seconds, then I'll try again"
  sleep 30
done