In a Hadoop cluster, if you would like to count the lines in a set of files, one easy way is to do the following:
hadoop fs -cat inputdir/* | wc -l
However, this streams the content from all the machines to the single machine that performs the counting.
It would be nice if "hadoop fs" had a subcommand for this, for example "hadoop fs -wc -l", but that is not the case.
An alternative is to use Hadoop streaming to parallelize the line counting across the nodes, with a single reducer summing up the partial counts. Something like the following:
hadoop jar ${HADOOP_HOME}/hadoop-streaming.jar \
  -Dmapred.reduce.tasks=1 \
  -input inputdir \
  -output outputdir \
  -mapper "bash -c 'paste <(echo count) <(wc -l)'" \
  -reducer "bash -c 'cut -f2 | paste -sd+ | bc'"
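Each mapper emits one line of the form "count<TAB>N", where N is the number of lines in its input split; the reducer cuts out the second field, joins the values with "+", and lets bc do the addition. You can sanity-check the reducer pipeline locally, without a cluster, by feeding it some fake mapper output (the numbers here are made up for illustration; the "-" names stdin explicitly for paste):

printf 'count\t12\ncount\t30\n' | cut -f2 | paste -sd+ - | bc   # prints 42

With a single reducer, the total lands in one part file under outputdir, which you can print with:

hadoop fs -cat outputdir/part-*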