@bepcyc
Forked from abicky/wc_hdfs
Last active December 25, 2015 15:29
wc for Hadoop HDFS files
#!/bin/bash
# Set this to your Hadoop installation directory
HADOOP_HOME="/usr/lib/hadoop"
condition=""  # optional awk condition used to filter input lines
fs="\t"       # field separator for the condition mapper (-F)
lines=""      # set to mapper output column 1 if -l is given
words=""      # set to mapper output column 2 if -w is given
chars=""      # set to mapper output column 3 if -m is given
# "-cnd expr" is parsed by getopts as -c -n -d expr, so the condition
# arrives as the argument of -d; -c and -n are consumed as no-ops
# (the original "cnd )" pattern never matched, since getopts yields
# one option character at a time)
while getopts cnd:F:lwm OPT; do
    case $OPT in
        c | n ) ;;
        d ) condition=$OPTARG;;
        F ) fs=$OPTARG;;
        l ) lines="\$1";;
        w ) words="\$2";;
        m ) chars="\$3";;
    esac
done
shift $(($OPTIND - 1))
FIELDS=("$lines")
FIELDS+=("$words")
FIELDS+=("$chars")
printf -v FIELDS "%s,%s" ${FIELDS[@]}
FIELDS=$(echo $FIELDS | sed 's/,\+/,/g' | sed 's/^,//g' | sed 's/,$//g')
FILTER="awk '{print $FIELDS}'"
if [ $# -ne 1 ]; then
    # character count is -m (as in wc), matching the getopts string above
    echo "usage: wc_hdfs [-l -w -m] [-cnd condition [-F fs]] inputdir"
    exit 1
fi
inputdir=$1
if [ -z "$condition" ]; then
    # No filter: plain wc already emits "lines words chars" per split
    mapper=wc
else
    # Count only lines matching the condition; "+ 1" accounts for the newline,
    # and "+0" forces numeric output even when nothing matches. Quoting $fs
    # keeps the default "\t" from being eaten by the shell that runs the mapper.
    mapper="awk -F \"$fs\" '$condition {lines++; words += NF; chars += length(\$0) + 1} END {print lines+0, words+0, chars+0}'"
fi
# mktemp -u only generates a unique name; it is used as the HDFS output dir
tempfile=$(mktemp -u)
# A single reducer (mapred.reduce.tasks=1) sums the per-mapper triples
hadoop jar ${HADOOP_HOME}/contrib/streaming/hadoop-streaming*.jar -D mapred.reduce.tasks=1 \
    -mapper "$mapper" \
    -reducer "awk '{sum1 += \$1; sum2 += \$2; sum3 += \$3} END {print sum1, sum2, sum3}'" \
    -input "$inputdir" -output "$tempfile" >/dev/null &&
hadoop dfs -cat "$tempfile"/* | eval ${FILTER} &&
hadoop dfs -rmr "$tempfile" >/dev/null
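
Example invocations (a sketch; the HDFS path and the "error" condition are hypothetical, not from the gist):

# count lines and words for everything under /user/logs
./wc_hdfs -l -w /user/logs

# count only records whose second tab-separated field is "error";
# the condition is an ordinary awk expression evaluated per line
./wc_hdfs -l -cnd '$2 == "error"' /user/logs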