In a Hadoop cluster, if you would like to get a count of the lines in some files, one easy way is to do the following:

hadoop fs -cat inputdir/* | wc -l
However, this streams the content from all machines to the single machine that performs the counting.
It would be nice if "hadoop fs" had a subcommand for this, for example "hadoop fs -wc -l", but that is not the case.
An alternative is to use Hadoop streaming to parallelize the line counting and then use a single reducer to sum up the results from all the nodes. Something like the following:
hadoop jar ${HADOOP_HOME}/hadoop-streaming.jar \
    -Dmapred.reduce.tasks=1 \
    -input inputdir \
    -output outputdir \
    -mapper "bash -c 'paste <(echo count) <(wc -l)'" \
    -reducer "bash -c 'cut -f2 | paste -sd+ | bc'"