In a Hadoop cluster, if you would like to get a count of lines in some files, one easy way is to do the following:
hadoop fs -cat inputdir/* | wc -l
However, this streams the content from all machines to the single machine that performs the counting.
It would be nice if "hadoop fs" had a subcommand for this, for example "hadoop fs -wc -l", but that is not the case.
An alternative is to use Hadoop streaming to parallelize the line-counting task and then use a single reducer to sum the results from all the nodes. Something like the following:
hadoop jar ${HADOOP_HOME}/hadoop-streaming.jar \
-Dmapred.reduce.tasks=1 \
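The single-reducer step described above can be sketched in Python. This is only an illustration under stated assumptions: each mapper is assumed to emit one partial line count per output line (for example by running `wc -l` on its split), and the name `sum_counts` is hypothetical, not part of the original command.

```python
def sum_counts(lines):
    """Sum partial line counts, one count per input line.

    Assumption: every non-empty line starts with an integer, such as the
    output of `wc -l` run by a mapper. Anything after the first
    whitespace-separated field is ignored.
    """
    total = 0
    for line in lines:
        fields = line.split()
        if fields:  # skip blank lines
            total += int(fields[0])
    return total


if __name__ == "__main__":
    # Stand-in for the partial counts produced by three mappers.
    partial_counts = ["120\n", "  75\t\n", "305\n"]
    print(sum_counts(partial_counts))
```

In an actual streaming job the reducer would pass `sys.stdin` to `sum_counts` instead of the in-memory sample.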
I was getting the following error in spark-shell when trying to load LZO files (CDH 5.4):
ERROR GPLNativeCodeLoader: Could not load native gpl library
java.lang.UnsatisfiedLinkError: no gplcompression in java.library.path
The fix involves making spark-shell aware of this dependency in either of the following ways:
1. export LD_LIBRARY_PATH=/opt/cloudera/parcels/GPLEXTRAS/lib/hadoop/lib/native/
2. Add a safety valve value to the Spark configuration in Cloudera Manager:
Spark (Standalone) Service Advanced Configuration Snippet (Safety Valve) for spark-conf/
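Before restarting spark-shell, it can help to confirm that the native library is actually visible on the path you exported. The sketch below is an assumption-laden diagnostic, not part of the fix itself: `find_native_lib` is a hypothetical helper, and it assumes the JVM looks for a file named `libgplcompression.*` in the directories listed in `LD_LIBRARY_PATH`.

```python
import os


def find_native_lib(lib_name, search_path):
    """Return the paths of files named lib<lib_name>.* in a path list.

    `search_path` is a string of directories joined by os.pathsep,
    in the same format as LD_LIBRARY_PATH.
    """
    hits = []
    for directory in search_path.split(os.pathsep):
        if not directory or not os.path.isdir(directory):
            continue  # ignore empty or missing entries
        for entry in sorted(os.listdir(directory)):
            if entry.startswith(f"lib{lib_name}."):
                hits.append(os.path.join(directory, entry))
    return hits


if __name__ == "__main__":
    found = find_native_lib("gplcompression",
                            os.environ.get("LD_LIBRARY_PATH", ""))
    print(found or "libgplcompression not found on LD_LIBRARY_PATH")
```

An empty result suggests the export did not take effect in the shell that launches spark-shell.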
# Day 3:
## Task1: Apply the wordcount we did yesterday
curl -O
Print out the 10 most used words in the Quran.
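As a local reference for what the job should produce, here is a minimal single-machine sketch of the "top N words" step. It assumes words are separated by whitespace and counts them case-sensitively; `top_words` is a hypothetical name, and the workshop's own wordcount may tokenize differently.

```python
from collections import Counter


def top_words(text, n=10):
    """Return the n most frequent whitespace-separated words with counts."""
    return Counter(text.split()).most_common(n)


if __name__ == "__main__":
    # Replace the sample with the contents of the downloaded file.
    sample = "one two two three three three"
    for word, count in top_words(sample, n=2):
        print(f"{word}\t{count}")
```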
## Task2: Submit Hadoop Job
# Day 4:
## Task1: Word Count MapReduce in Python using Hadoop Streaming
cd hadoopworkshop
git pull
echo -e "this is a line\nthis is another line\nand one more"
echo -e "this is a line\nthis is another line\nand one more" | python src/main/python/
echo -e "this is a line\nthis is another line\nand one more" | python src/main/python/ | sort -k1,1
echo -e "this is a line\nthis is another line\nand one more" | python src/main/python/ | sort -k1,1 | python src/main/python/
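The pipeline above has the shape mapper, then sort, then reducer. The following sketch shows what the two Python stages might look like. It is not the workshop's actual code: the function names `map_words` and `reduce_counts` are hypothetical, and the record format (`word<TAB>count`) is an assumption based on the usual Hadoop Streaming convention.

```python
from itertools import groupby


def map_words(lines):
    """Mapper: emit one "word<TAB>1" record per whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_counts(sorted_records):
    """Reducer: sum counts of consecutive records that share a key.

    The input must already be sorted by key, which is what the
    `sort -k1,1` step (or Hadoop's shuffle) guarantees.
    """
    parsed = (record.rstrip("\n").split("\t", 1) for record in sorted_records)
    for word, group in groupby(parsed, key=lambda pair: pair[0]):
        yield word, sum(int(count) for _, count in group)


if __name__ == "__main__":
    sample = ["this is a line", "this is another line", "and one more"]
    shuffled = sorted(map_words(sample))  # stands in for `sort -k1,1`
    for word, total in reduce_counts(shuffled):
        print(f"{word}\t{total}")
```

In a streaming job each stage would live in its own script and read from `sys.stdin`; keeping them as functions here makes the data flow easy to test locally before submitting to the cluster.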
