Anjaiah Methuku (anjijava16)
Scala Environment:
1. Install Scala
1.1 Download Scala (latest version) from http://www.scala-lang.org/
1.2 Uncompress it
1.3 Add the Scala bin folder to the PATH variable (see the sketch below)
2. Install Eclipse Mars or Luna
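A minimal sketch of step 1.3 on Linux, assuming the archive was unpacked to /opt/scala (the path is hypothetical):

export PATH=$PATH:/opt/scala/bin   # make scala/scalac available on the command line
scala -version                     # verify the installation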
anjijava16 / SkipMapper.java (created December 26, 2016; forked from amalgjose/SkipMapper.java)
A MapReduce program that removes stop words from the given text files; the Hadoop distributed cache and counters are used in this program.
package com.hadoop.skipper;
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;
import org.apache.hadoop.fs.Path;
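The gist listing is truncated after the imports. Assuming the job's driver uses ToolRunner and ships the stop-word list through the distributed cache (the jar name, driver class, file, and paths below are all hypothetical), such a job would typically be launched like this:

# -files places stopwords.txt in the distributed cache of every mapper
hadoop jar skipper.jar com.hadoop.skipper.Driver -files stopwords.txt /app/input /app/output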
http://www.geoinsyssoft.com/rdds-vs-dataframes-apache-spark/
GIT
https://www.codeschool.com/learn/git
https://cursos.alura.com.br/course/git
https://www.codecademy.com/learn/learn-git
A Git project is composed of three parts:
- Working Directory: where editing/deletion happens
- Staging Area: where we add the files to be committed
- Repository: where the commit happens and the latest version is stored
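A minimal command sequence showing a file moving through the three areas (the file name is hypothetical):

echo "hello" > notes.txt          # Working Directory: create/edit the file
git add notes.txt                 # Staging Area: stage it for the next commit
git commit -m "Add notes file"    # Repository: record the new version
git status                        # report which area each file currently sits in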
• DISTINCT and GROUP BY - Use these only when necessary; avoid them where possible, as they degrade performance.
• PARTITION - Partition the table; using the partition column in the filter improves performance.
• Rewrite - Do not reuse a query exactly as written for an RDBMS; rewrite it for Hive to improve performance.
• Map Split Size - Reduce the map split size; more (smaller) splits mean more mappers and a shorter-running query.
• Map Join - Map-join small tables so that joining them with a large table takes less time.
• Memory - Tune the memory settings based on the queries being run.
• Format ORC - Keep tables in ORC format, which speeds up queries against them.
• Hive Execute parallel - Enable parallel execution so that independent job stages run concurrently (see the sketch after this list).
• CTAS - Prefer creating managed tables (e.g., with CREATE TABLE AS SELECT) over external tables.
• Data Explosion - Join on filtered data sets, and make sure there is no cross join between large data sets.
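A minimal sketch of a few of these settings applied from the shell; the property values and the query/table are illustrative only:

hive -e "
SET hive.exec.parallel=true;                                  -- run independent job stages in parallel
SET hive.auto.convert.join=true;                              -- convert small-table joins into map joins
SET mapreduce.input.fileinputformat.split.maxsize=67108864;   -- smaller splits => more mappers
SELECT COUNT(*) FROM sales WHERE dt = '2017-08-21';"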
Procedure to install Google Chrome on a RHEL/CentOS/Fedora Linux box in four easy steps (the URL always fetches the current stable release):
1. Open the Terminal application.
2. Download the 64-bit Google Chrome RPM:
wget https://dl.google.com/linux/direct/google-chrome-stable_current_x86_64.rpm
3. Install Google Chrome and its dependencies (CentOS/RHEL):
sudo yum install ./google-chrome-stable_current_*.rpm
4. Start Google Chrome from the CLI:
google-chrome &
i) Copy pom.xml into HDFS with a 64 MB block size and a replication factor of 4:
hadoop fs -Ddfs.block.size=67108864 -Ddfs.replication=4 -copyFromLocal pom.xml /app/data
ii) Inspect the file's blocks, replicas, and locations with fsck:
hdfs fsck /app/data/pom.xml -files -blocks -locations
Output: Connecting to namenode via http://localhost:50070
FSCK started by hadoop (auth:SIMPLE) from /127.0.0.1 for path /app/data/pom.xml at Mon Aug 21 23:36:58 IST 2017
/app/data/pom.xml 2617 bytes, 1 block(s): Under replicated BP-806356112-127.0.0.1-1489344343967:blk_1073742850_2027. Target Replicas is 4 but found 1 replica(s).
0. BP-806356112-127.0.0.1-1489344343967:blk_1073742850_2027 len=2617 repl=1 [127.0.0.1:50010]
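The file is flagged under-replicated because a single-node cluster can never hold four replicas. The replication target can be reset to what the cluster can actually satisfy (the value 1 is illustrative):

hdfs dfs -setrep -w 1 /app/data/pom.xml   # -w waits until the file reaches the new factor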
http://www3.cs.stonybrook.edu/~youngkwon/cse535/Lecture13_Hadoop_HDFS.pdf
http://www3.cs.stonybrook.edu/~youngkwon/cse535/
http://www3.cs.stonybrook.edu/~youngkwon/cse535/Lecture12_Hadoop_MapReduce.pdf
HDFS Tutorial (blog post, March 11, 2015)
Run the Hive JDBC query utility from the command line, with its jar and the lib folder's dependencies on the classpath:
java -cp "hiveJdbcQueryUtils-0.1.jar:lib/*" com.iwinner.hive.select.hive.main.MainProcess