env # to get all env variables
*********to work as root*************
su -
**************ifconfig synonyms**************
ip address show or ip a s or ip a s eth0
************formatted file name************
cp a.txt a_$(date +%F).txt
run MR jobs locally instead of on the cluster:
hive> set mapreduce.framework.name=local;
display hive database name: set hive.cli.print.current.db=true;
DESCRIBE EXTENDED husn_small;   -- to get statistics
ANALYZE TABLE husn_small COMPUTE STATISTICS;
create table snpn(sn String, pn String);
LOAD DATA INPATH 'hdfs://127200813master.eap.g4ihos.itcs.hpecorp.net:8020/user/centos7/test_data/snpn' INTO TABLE snpn;   -- appends by default; use OVERWRITE INTO TABLE to replace
Scala examples
map:
val l = List(1,2,3,4,5)
l.map(x => x + 3) or l.map(_ + 3)
pass a function as param to map:
def f(x: Int) = if (x > 3) Some(x) else None   // Option[Int], so the result list stays well-typed
l.map(x => f(x)) or l.map(f(_)) or simply l.map(f)   // List(None, None, None, Some(4), Some(5))
flatMap example:
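(a minimal sketch, not from the original notes, reusing l and f from above; flatMap maps each element to a collection and flattens the results)
l.flatMap(x => List(x, x * 10))   // List(1, 10, 2, 20, 3, 30, 4, 40, 5, 50)
l.flatMap(f)                      // List(4, 5) - Option acts as a 0/1-element collection, so None drops out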
**************maven pom for the hbase module (fragment)**************
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <parent>
    <groupId>nosql</groupId>
    <artifactId>gettingstarted</artifactId>
    <version>0.0.1-SNAPSHOT</version>
  </parent>
  <groupId>com</groupId>
  <artifactId>hbase</artifactId>
  <version>0.0.1-hbase-SNAPSHOT</version>
</project>
**************flume**************
Source --> Channel --> Sink
set the source, channel and sink in the agent's .conf file (a sample sketch follows the commands below).
source type e.g. exec, for shell commands like tail -F
flume-ng agent --conf conf --conf-file /usr/hdp/2.5.0.0-1245/flume/conf/flume-hdfs-sink.conf --name agent1
flume-ng agent --conf conf --conf-file /usr/hdp/2.5.0.0-1245/flume/conf/flume-hdfs-sink_file.conf --name agent2
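a minimal sketch of what agent1's flume-hdfs-sink.conf could contain (the source/channel/sink names and the HDFS path are assumptions, not taken from the actual file):
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1

# exec source: run a shell command and turn each output line into an event
agent1.sources.src1.type = exec
agent1.sources.src1.command = tail -F /var/log/messages
agent1.sources.src1.channels = ch1

# buffer events in memory between source and sink
agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 1000

# write events to HDFS as plain text
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = /user/centos7/flume_out
agent1.sinks.sink1.hdfs.fileType = DataStream
agent1.sinks.sink1.channel = ch1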
Small files problem
1. MR - use CombineFileInputFormat to pack many small files into a single split.
   Hive - compact tables by copying them through fewer reducers (a sketch follows this list).
2. set the input split size (defaults to the block size) to control the number of mappers - raise it to get fewer mappers.
   each mapper uses one JVM - the fewer the mappers, the fewer JVMs get created and destroyed.
   smaller split size => more mappers; bigger split size => fewer mappers, each doing more work.
3. allocate a proper number of reducers.
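a sketch of item 1's copy-through-fewer-reducers idea, assuming Hive on MapReduce (snpn_compact is a hypothetical table with snpn's schema):
set mapred.reduce.tasks=1;            -- force one reducer => one output file (pick a small, sensible number)
insert overwrite table snpn_compact
select * from snpn
distribute by sn;                     -- DISTRIBUTE BY forces a reduce phase, so file count = reducer count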
ACID - Atomicity, Consistency, Isolation, Durability
Atomicity: all or none (multiple DMLs execute as one unit).
Consistency: a transaction either creates a new, valid state of the data or, on failure, leaves it in its previous state (commit/rollback).
Isolation: a transaction in progress (uncommitted inserts) should not be visible to other transactions.
Durability: in the event of failure or restart, committed data should be recoverable.
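a generic SQL sketch of atomicity in action (the accounts table and values are made up for illustration):
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;   -- both updates become durable together; ROLLBACK (or a crash before COMMIT) undoes both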
CAP theorem: Consistency, Availability, Partition tolerance
Consistency - every read receives the most recent write or an error.
Availability - every request receives a non-error response, without a guarantee that it contains the most recent write.
Partition tolerance - the system continues to operate despite an arbitrary number of messages being dropped/delayed between the nodes.
**************kafka**************
sudo ssh 127200813data00.eap.g4ihos.itcs.hpecorp.net
cd /usr/hdp/2.5.0.0-1245/kafka/bin/
./kafka-topics.sh --create --zookeeper 127200813master.eap.g4ihos.itcs.hpecorp.net:2181,127200813data02.eap.g4ihos.itcs.hpecorp.net:2181,127200813data01.eap.g4ihos.itcs.hpecorp.net:2181,127200813data00.eap.g4ihos.itcs.hpecorp.net:2181 --replication-factor 1 --partitions 1 --topic test
./kafka-topics.sh --list --zookeeper 127200813master.eap.g4ihos.itcs.hpecorp.net:2181,127200813data02.eap.g4ihos.itcs.hpecorp.net:2181,127200813data01.eap.g4ihos.itcs.hpecorp.net:2181,127200813data00.eap.g4ihos.itcs.hpecorp.net:2181
./kafka-console-producer.sh --broker-list 127200813data00.eap.g4ihos.itcs.hpecorp.net:9092 --topic test
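to read the test messages back, the matching console consumer (Kafka 0.10-era flags; same broker as above):
./kafka-console-consumer.sh --bootstrap-server 127200813data00.eap.g4ihos.itcs.hpecorp.net:9092 --topic test --from-beginning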
If you own the source code, make all methods final - they can never be accidentally overridden.
Arrays and collections should never be null - return an empty array/collection instead.
avoid state - HTTP does this - statelessness makes parallel/distributed execution easier.
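a small Scala sketch of the empty-over-null rule (ordersFor and its data are hypothetical):
// hypothetical lookup - returns an empty list instead of null when nothing matches
def ordersFor(userId: Int): List[String] =
  if (userId == 42) List("order-1", "order-2") else Nil

ordersFor(7).foreach(println)   // safe to iterate, no null check needed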
eventHandling:
https://www.youtube.com/watch?v=ZUe1Xz7DAcY#t=17.905495
add a user
useradd etl_user -g hadoop
identify user
id etl_user
login as a different user
sudo su - etl_user
hdfs is the Hadoop superuser in distributions like Hortonworks/Cloudera, so run HDFS admin tasks as hdfs
sudo su - hdfs
create the new user's hdfs home directory and hand over ownership
hadoop fs -mkdir /user/etl_user
hadoop fs -chown etl_user:supergroup /user/etl_user