pd.set_option('display.max_colwidth', 600)
[^\\p{InArabic}]+
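The pattern above is a Java/ICU-style character class: \p{InArabic} is the Arabic Unicode block, so the expression matches runs of anything that is not Arabic text and can serve as a word delimiter for the Quran task below. A rough Python equivalent, assuming the third-party regex module and the quran-simple.txt file downloaded in the Day 3 section:

```python
# Sketch: split Arabic text on runs of non-Arabic characters.
# Uses the third-party "regex" module (pip install regex); the standard
# "re" module does not understand \p{InArabic}. The file name is an example.
import regex

non_arabic = regex.compile(r'[^\p{InArabic}]+')

with open('quran-simple.txt', encoding='utf-8') as f:
    for line in f:
        # Every run of non-Arabic characters acts as a delimiter.
        words = [w for w in non_arabic.split(line) if w]
        if words:
            print(' '.join(words))
```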
# Day 4:
## Task1: Word Count MapReduce in Python using Hadoop Streaming
cd hadoopworkshop
git pull
echo -e "this is a line\nthis is another line\nand one more"
echo -e "this is a line\nthis is another line\nand one more" | python src/main/python/mapper.py
echo -e "this is a line\nthis is another line\nand one more" | python src/main/python/mapper.py | sort -k1,1
echo -e "this is a line\nthis is another line\nand one more" | python src/main/python/mapper.py | sort -k1,1 | python src/main/python/reducer.py
# Day 3:
## Task1: apply the word count we did yesterday
curl -O https://raw.githubusercontent.com/rizaumami/quran-epub/master/Source/quran-simple.txt
Print out the top 10 used words in the Quran.
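One way to do it, reusing the mapper and reducer from the Day 4 section above (a sketch; the second sort orders by count, descending):

cat quran-simple.txt | python src/main/python/mapper.py | sort -k1,1 | python src/main/python/reducer.py | sort -t$'\t' -k2,2nr | head -10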
## Task2: Submit Hadoop Job
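A typical Hadoop Streaming submission for the same word count (a sketch; the streaming jar location, HDFS paths, and output directory are assumptions that vary by distribution):

hadoop fs -put quran-simple.txt quran/
hadoop jar ${HADOOP_HOME}/hadoop-streaming.jar \
-input quran \
-output quran-wordcount \
-mapper 'python mapper.py' \
-reducer 'python reducer.py' \
-file src/main/python/mapper.py \
-file src/main/python/reducer.py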
I was getting the following error in the spark-shell when trying to load LZO files (CDH 5.4):
ERROR GPLNativeCodeLoader: Could not load native gpl library
java.lang.UnsatisfiedLinkError: no gplcompression in java.library.path
The fix involves making spark-shell aware of this dependency by either of the following:
1. export LD_LIBRARY_PATH=/opt/cloudera/parcels/GPLEXTRAS/lib/hadoop/lib/native/
2. add a safety valve in the Spark configuration in Cloudera Manager:
Spark (Standalone) Service Advanced Configuration Snippet (Safety Valve) for spark-conf/spark-env.sh
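The snippet to paste into that safety valve is the same export as option 1, e.g.:
export LD_LIBRARY_PATH=/opt/cloudera/parcels/GPLEXTRAS/lib/hadoop/lib/native/:$LD_LIBRARY_PATH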
{
"created_at": "Sun May 17 05:07:58 +0000 2015",
"id": 599803522842918912,
"id_str": "599803522842918912",
"text": "Trying to send a tweet from my Apple Watch",
"source": "\u003ca href=\"https:\/\/twitter.com\/download\/iphone\" rel=\"nofollow\"\u003eTwitter for Apple Watch\u003c\/a\u003e",
"truncated": false,
"in_reply_to_status_id": null,
"in_reply_to_status_id_str": null,
"in_reply_to_user_id": null,
In a Hadoop cluster, if you would like to get a count of lines in some files, one easy way is to do the following:
hadoop fs -cat inputdir/* | wc -l
However, this streams the content from all machines to the single machine that performs the counting.
It would be nice if "hadoop fs" had a subcommand for this, for example "hadoop fs -wc -l", but that is not the case.
An alternative is to use Hadoop Streaming to parallelize the line-counting task and then use a single reducer to sum up the results from all the nodes. Something like the following:
hadoop jar ${HADOOP_HOME}/hadoop-streaming.jar \
-Dmapred.reduce.tasks=1 \
-input inputdir \
-output inputdir-linecount \
-mapper 'wc -l' \
-reducer "awk '{total += \$1} END {print total}'"
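Each map task emits the line count of its input split and the single reducer adds them up, so the total lands in one output file (the output path above is only an example):

hadoop fs -cat inputdir-linecount/part-*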