@abicky
Created August 7, 2011 05:25
Run a wc-like command on data stored in HDFS
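#!/bin/bash
# Count lines, words, and characters of files on HDFS with Hadoop Streaming,
# optionally restricted to lines that match an awk condition.
condition=""
fs="\t"
while getopts c:F: OPT; do
    case $OPT in
        c ) condition=$OPTARG;;  # awk pattern selecting the lines to count
        F ) fs=$OPTARG;;         # awk field separator (default: tab)
    esac
done
shift $((OPTIND - 1))
if [ $# -ne 1 ]; then
    echo "usage: wc_hdfs [-c condition [-F fs]] inputdir" >&2
    exit 1
fi
inputdir=$1
if [ -z "$condition" ]; then
    mapper=wc
else
    # Each mapper prints one "lines words chars" triple for its input split.
    mapper="awk -F $fs '$condition {lines++; words += NF; chars += length(\$0) + 1} END {print lines, words, chars}'"
fi
# mktemp -u only generates a unique name; it is used as the HDFS output path.
tempfile=$(mktemp -u)
# A single reducer sums the per-mapper triples into one total.
# Note: on Hadoop 2 and later, "hadoop dfs" and "-rmr" are deprecated in
# favor of "hdfs dfs" and "-rm -r".
hadoop jar hadoop-streaming.jar -D mapred.reduce.tasks=1 \
    -mapper "$mapper" \
    -reducer "awk '{sum1 += \$1; sum2 += \$2; sum3 += \$3;} END {print sum1,sum2,sum3}'" \
    -input "$inputdir" -output "$tempfile" >/dev/null &&
    hadoop dfs -cat "$tempfile/*" &&
    hadoop dfs -rmr "$tempfile" >/dev/null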
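
The mapper and reducer are plain awk programs, so their logic can be checked locally before submitting a streaming job. The following is a minimal sketch of that idea, not part of the original gist: the function name `simulate_wc_hdfs` is hypothetical, and each local file passed to it stands in for one input split handled by one mapper. Unlike the script above, the sketch adds `+ 0` to the printed values so that an empty match prints zeros instead of blanks.

```shell
# Local dry run of the map and reduce steps, without Hadoop.
# simulate_wc_hdfs is a hypothetical helper, not part of the gist.
# usage: simulate_wc_hdfs CONDITION FS FILE...
simulate_wc_hdfs() {
    condition=$1   # awk pattern, e.g. '$2 > 10' (empty matches every line)
    fs=$2          # awk field separator, e.g. '\t'
    shift 2
    # Map phase: one awk process per file, like one mapper per input split.
    for split in "$@"; do
        awk -F "$fs" "$condition"' {lines++; words += NF; chars += length($0) + 1}
            END {print lines + 0, words + 0, chars + 0}' "$split"
    done |
    # Reduce phase: sum the per-mapper partial counts column by column.
    awk '{sum1 += $1; sum2 += $2; sum3 += $3}
         END {print sum1 + 0, sum2 + 0, sum3 + 0}'
}
```

For example, given one file with the tab-separated lines `a 5` and `b 20` and a second file with the line `c 30`, calling `simulate_wc_hdfs '$2 > 10' '\t' file1 file2` prints `2 4 10`: two matching lines, four fields, and ten characters including newlines.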