In a Hadoop cluster, if you would like to count the lines in a set of files, one easy way is to do the following:
hadoop fs -cat inputdir/* | wc -l
However, this streams the content from all the machines to the single machine that performs the counting.
It would be nice if "hadoop fs" had a subcommand for this, for example "hadoop fs -wc -l", but that is not the case.
An alternative is to use Hadoop streaming to parallelize the line counting across the nodes, with a single reducer summing up the partial counts. Something like the following:
hadoop jar ${HADOOP_HOME}/hadoop-streaming.jar \
  -Dmapred.reduce.tasks=1 \
  -input inputdir \
  -output outputdir \
  -mapper "bash -c 'paste <(echo count) <(wc -l)'" \
  -reducer "bash -c 'cut -f2 | paste -sd+ | bc'"
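Each mapper emits one line of the form "count<TAB>N", where N is the number of lines in its input split; the reducer cuts out the second field, joins the values with "+", and lets bc do the addition. You can sanity-check the reducer pipeline locally, without a cluster, by feeding it some fake mapper output (the numbers here are made up for illustration; the "-" names stdin explicitly for paste):

printf 'count\t12\ncount\t30\n' | cut -f2 | paste -sd+ - | bc   # prints 42

With a single reducer, the total lands in one part file under outputdir, which you can print with:

hadoop fs -cat outputdir/part-*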