test
# Show up to 600 characters per pandas cell instead of truncating the output
pd.set_option('display.max_colwidth', 600)
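For context, a minimal sketch of the effect (the DataFrame here is hypothetical): by default pandas truncates long cell values to 50 characters when printing.

import pandas as pd

# Hypothetical DataFrame with a long text column
df = pd.DataFrame({'text': ['x' * 200]})
print(df)                                    # cell is truncated to 50 chars by default
pd.set_option('display.max_colwidth', 600)
print(df)                                    # now the full 200-character value is shown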
[^\\p{InArabic}]+
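This Java-syntax regex matches runs of characters outside the Arabic Unicode block, e.g. for stripping everything but Arabic text. Python's built-in re module lacks Unicode block properties; as a rough sketch, the third-party regex module with the Arabic script property (a close but not identical character set) can stand in:

import regex  # third-party module: pip install regex

# Replace every run of non-Arabic characters with a single space.
# \p{Arabic} is the Arabic *script* property; Java's \p{InArabic} is the
# Arabic *block* -- similar, but not the same set of characters.
text = 'surah 1: \u0628\u0633\u0645 \u0627\u0644\u0644\u0647'
print(regex.sub(r'[^\p{Arabic}]+', ' ', text))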
# Day 4:
## Task1: Word Count MapReduce in Python using Hadoop Streaming
# Get the latest workshop code
cd hadoopworkshop
git pull

# Sample input: three lines of text
echo -e "this is a line\nthis is another line\nand one more"

# Step 1: pipe the sample input through the mapper
echo -e "this is a line\nthis is another line\nand one more" | python src/main/python/mapper.py

# Step 2: sort the mapper output by key, as Hadoop's shuffle phase would
echo -e "this is a line\nthis is another line\nand one more" | python src/main/python/mapper.py | sort -k1,1

# Step 3: feed the sorted output to the reducer to produce the word counts
echo -e "this is a line\nthis is another line\nand one more" | python src/main/python/mapper.py | sort -k1,1 | python src/main/python/reducer.py
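The repo's mapper.py and reducer.py aren't reproduced here; as a rough sketch (the actual workshop files may differ), a minimal Hadoop Streaming word-count pair looks like this:

#!/usr/bin/env python
# mapper.py (sketch): emit "word<TAB>1" for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print('%s\t%d' % (word, 1))

#!/usr/bin/env python
# reducer.py (sketch): sum the counts per word; relies on input sorted by key
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip('\n').split('\t', 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print('%s\t%d' % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print('%s\t%d' % (current_word, current_count))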
# Day 3:
## Task1: Apply the word count we did yesterday
curl -O https://raw.githubusercontent.com/rizaumami/quran-epub/master/Source/quran-simple.txt
Print out the 10 most used words in the Quran.
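One way to rank the results once the word count has run, as a minimal sketch (counts.txt is a hypothetical local copy of the reducer's "word<TAB>count" output):

# Sketch: print the 10 most frequent words from "word<TAB>count" lines
import heapq

pairs = []
with open('counts.txt', encoding='utf-8') as f:
    for line in f:
        word, count = line.rstrip('\n').split('\t')
        pairs.append((int(count), word))

for count, word in heapq.nlargest(10, pairs):
    print(word, count)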
## Task2: Submit Hadoop Job
I was getting the following error in the spark-shell when trying to load LZO files (CDH 5.4):

ERROR GPLNativeCodeLoader: Could not load native gpl library
java.lang.UnsatisfiedLinkError: no gplcompression in java.library.path

The fix involves making spark-shell aware of this dependency in either of the following ways:

1. export LD_LIBRARY_PATH=/opt/cloudera/parcels/GPLEXTRAS/lib/hadoop/lib/native/
2. Add a safety valve in the Spark configuration in Cloudera Manager:
   Spark (Standalone) Service Advanced Configuration Snippet (Safety Valve) for spark-conf/spark-env.sh
{
  "created_at": "Sun May 17 05:07:58 +0000 2015",
  "id": 599803522842918912,
  "id_str": "599803522842918912",
  "text": "Trying to send a tweet from my Apple Watch",
  "source": "\u003ca href=\"https:\/\/twitter.com\/download\/iphone\" rel=\"nofollow\"\u003eTwitter for Apple Watch\u003c\/a\u003e",
  "truncated": false,
  "in_reply_to_status_id": null,
  "in_reply_to_status_id_str": null,
  "in_reply_to_user_id": null,
In a Hadoop cluster, if you would like to get a count of lines in some files, one easy way is to do the following:

hadoop fs -cat inputdir/* | wc -l
However, this streams the content from all machines to the single machine that performs the counting.
It would be nice if "hadoop fs" had a subcommand for this, for example "hadoop fs -wc -l", but that is not the case.
An alternative is to use Hadoop Streaming to parallelize the line-counting task, with a single reducer summing up the partial results from all the nodes. Something like the following:
hadoop jar ${HADOOP_HOME}/hadoop-streaming.jar \
  -Dmapred.reduce.tasks=1 \