majidalfifi/hadoop-training4

## hadoop-training4
Day 4:

Task 1: Word Count MapReduce in Python using Hadoop Streaming
cd hadoopworkshop
git pull

Test the mapper and reducer locally first, adding one stage of the pipeline at a time (`sort -k1,1` stands in for Hadoop's shuffle and sort):

echo -e "this is a line\nthis is another line\nand one more"
echo -e "this is a line\nthis is another line\nand one more" | python src/main/python/mapper.py
echo -e "this is a line\nthis is another line\nand one more" | python src/main/python/mapper.py | sort -k1,1
echo -e "this is a line\nthis is another line\nand one more" | python src/main/python/mapper.py | sort -k1,1 | python src/main/python/reducer.py
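
The contents of `mapper.py` and `reducer.py` are not shown in this README. As a rough guide to what each stage of the pipeline above does, here is a minimal sketch of a streaming-style word count, assuming tab-separated `word<TAB>count` records. The function names `map_lines` and `reduce_sorted` are illustrative and may not match the workshop's scripts.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper step: emit one "word<TAB>1" record per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_sorted(records):
    """Reducer step: sum the counts of consecutive records that share a key.

    This only works if the input is sorted by key, which `sort -k1,1`
    does locally and the shuffle phase does on the cluster.
    """
    pairs = (record.rstrip("\n").split("\t", 1) for record in records)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Same three lines as in the echo commands above.
sample = ["this is a line", "this is another line", "and one more"]
mapped = sorted(map_lines(sample))  # stands in for `sort -k1,1`
for record in reduce_sorted(mapped):
    print(record)
```

In an actual streaming script, each function would be fed `sys.stdin` and its records printed to standard output, since Hadoop Streaming talks to the mapper and reducer through stdin and stdout.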

Now run it on Hadoop:
yarn jar /opt/cloudera/parcels/CDH/jars/hadoop-streaming*.jar \
-file src/main/python/mapper.py \
-file src/main/python/reducer.py \
-mapper src/main/python/mapper.py \
-reducer src/main/python/reducer.py \
-input file1.txt \
-output results66


Task 2: Run the top tweeters job
mvn clean compile assembly:single
hdfs dfs -cp /tmp/tweets-sample.json .
yarn jar target/mapreduce-helloworld-1.0-SNAPSHOT-jar-with-dependencies.jar edu.kfupm.hadoop.TopTweeters tweets-sample.json results


Task 3: Write a job to find the most mentioned users
1. Copy TopTweeters.java to TopMentions.java
2. Make the appropriate changes :)

mvn clean compile assembly:single
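# Task 2 already wrote to "results"; MapReduce refuses to write to an existing output directory, so remove it first (unless the job deletes it itself)
hdfs dfs -rm -r results
yarn jar target/mapreduce-helloworld-1.0-SNAPSHOT-jar-with-dependencies.jar edu.kfupm.hadoop.TopMentions /tmp/tweets-sample.json results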
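
The jobs themselves are written in Java, but the difference between counting tweeters and counting mentions is mainly which key is extracted from each tweet. The sketch below illustrates that idea in Python. It assumes the sample file holds one tweet per line in the standard Twitter API JSON layout (`user.screen_name` and `entities.user_mentions`), which this README does not confirm. The helper names and the three sample tweets are made up for illustration.

```python
import json
from collections import Counter


def tweet_author(tweet):
    """Key for a TopTweeters-style count: the author of the tweet."""
    return [tweet["user"]["screen_name"]]


def tweet_mentions(tweet):
    """Keys for a TopMentions-style count: every user mentioned in the tweet."""
    mentions = tweet.get("entities", {}).get("user_mentions", [])
    return [mention["screen_name"] for mention in mentions]


def top_keys(json_lines, extract, n=10):
    """Count the keys that `extract` returns for each tweet and keep the top n."""
    counts = Counter()
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        counts.update(extract(json.loads(line)))
    return counts.most_common(n)


# Hypothetical sample tweets, reduced to the fields used above.
sample = [
    '{"user": {"screen_name": "alice"}, "entities": {"user_mentions": [{"screen_name": "bob"}]}}',
    '{"user": {"screen_name": "alice"}, "entities": {"user_mentions": [{"screen_name": "bob"}, {"screen_name": "carol"}]}}',
    '{"user": {"screen_name": "bob"}, "entities": {"user_mentions": []}}',
]
print(top_keys(sample, tweet_author))
print(top_keys(sample, tweet_mentions))
```

Note that a tweet has one author but can mention several users, so a mentions mapper may emit zero, one, or many keys per input record.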