pd.set_option('display.max_colwidth', 600)
[^\\p{InArabic}]+
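The pattern above is a Java/ICU-style character class: \p{InArabic} is the Arabic Unicode block, so the expression matches runs of anything that is not Arabic text and can serve as a word delimiter for the Quran task below. A rough Python equivalent, assuming the third-party regex module and the quran-simple.txt file downloaded in the Day 3 section:

```python
# Sketch: split Arabic text on runs of non-Arabic characters.
# Uses the third-party "regex" module (pip install regex); the standard
# "re" module does not understand \p{InArabic}. The file name is an example.
import regex

non_arabic = regex.compile(r'[^\p{InArabic}]+')

with open('quran-simple.txt', encoding='utf-8') as f:
    for line in f:
        # Every run of non-Arabic characters acts as a delimiter.
        words = [w for w in non_arabic.split(line) if w]
        if words:
            print(' '.join(words))
```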
# Day 4:
## Task1: Word Count MapReduce in Python using Hadoop Streaming
cd hadoopworkshop
git pull
echo -e "this is a line\nthis is another line\nand one more"
echo -e "this is a line\nthis is another line\nand one more" | python src/main/python/mapper.py
echo -e "this is a line\nthis is another line\nand one more" | python src/main/python/mapper.py | sort -k1,1
echo -e "this is a line\nthis is another line\nand one more" | python src/main/python/mapper.py | sort -k1,1 | python src/main/python/reducer.py
# Day 3:
## Task1: apply the word count we did yesterday
curl -O https://raw.githubusercontent.com/rizaumami/quran-epub/master/Source/quran-simple.txt
Print out the top 10 used words in the Quran.
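One way to do it, reusing the mapper and reducer from the Day 4 section above (a sketch; the second sort orders by count, descending):

cat quran-simple.txt | python src/main/python/mapper.py | sort -k1,1 | python src/main/python/reducer.py | sort -t$'\t' -k2,2nr | head -10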
## Task2: Submit Hadoop Job
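A typical Hadoop Streaming submission for the same word count (a sketch; the streaming jar location, HDFS paths, and output directory are assumptions that vary by distribution):

hadoop fs -put quran-simple.txt quran/
hadoop jar ${HADOOP_HOME}/hadoop-streaming.jar \
-input quran \
-output quran-wordcount \
-mapper 'python mapper.py' \
-reducer 'python reducer.py' \
-file src/main/python/mapper.py \
-file src/main/python/reducer.py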
I was getting the following error in the spark-shell when trying to load LZO files (CDH 5.4):
ERROR GPLNativeCodeLoader: Could not load native gpl library
java.lang.UnsatisfiedLinkError: no gplcompression in java.library.path
The fix involves making spark-shell aware of this dependency by either of the following:
1. export LD_LIBRARY_PATH=/opt/cloudera/parcels/GPLEXTRAS/lib/hadoop/lib/native/
2. add a safety valve in the Spark configuration in Cloudera Manager:
Spark (Standalone) Service Advanced Configuration Snippet (Safety Valve) for spark-conf/spark-env.sh
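The snippet to paste into that safety valve is the same export as option 1, e.g.:
export LD_LIBRARY_PATH=/opt/cloudera/parcels/GPLEXTRAS/lib/hadoop/lib/native/:$LD_LIBRARY_PATH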
{
"created_at": "Sun May 17 05:07:58 +0000 2015",
"id": 599803522842918912,
"id_str": "599803522842918912",
"text": "Trying to send a tweet from my Apple Watch",
"source": "\u003ca href=\"https:\/\/twitter.com\/download\/iphone\" rel=\"nofollow\"\u003eTwitter for Apple Watch\u003c\/a\u003e",
"truncated": false,
"in_reply_to_status_id": null,
"in_reply_to_status_id_str": null,
"in_reply_to_user_id": null,
In a Hadoop cluster, if you would like to get a count of lines in some files, one easy way is to do the following:
hadoop fs -cat inputdir/* | wc -l
However, this streams the content from all machines to the single machine that performs the counting.
It would be nice if "hadoop fs" had a subcommand for this, for example "hadoop fs -wc -l", but that is not the case.
An alternative is to use Hadoop Streaming to parallelize the line-counting task and then use a single reducer to sum up the results from all the nodes. Something like the following:
hadoop jar ${HADOOP_HOME}/hadoop-streaming.jar \
-Dmapred.reduce.tasks=1 \
-input inputdir \
-output inputdir-linecount \
-mapper 'wc -l' \
-reducer "awk '{total += \$1} END {print total}'"
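Each map task emits the line count of its input split and the single reducer adds them up, so the total lands in one output file (the output path above is only an example):

hadoop fs -cat inputdir-linecount/part-*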