env # to get all env variables
*********to work as root*************
su -
**************ifconfig synonyms**************
ip address show or ip a s or ip a s eth0
************formatted file name************
cp a.txt a_$(date +%F).txt
run MR jobs locally instead of on the cluster:
hive> set mapreduce.framework.name=local;
display hive database name: set hive.cli.print.current.db=true;
DESCRIBE EXTENDED husn_small;   -- to get statistics
ANALYZE TABLE husn_small COMPUTE STATISTICS;
create table snpn(sn String, pn String);
LOAD DATA INPATH 'hdfs://127200813master.eap.g4ihos.itcs.hpecorp.net:8020/user/centos7/test_data/snpn' INTO TABLE snpn;   -- appends by default; use OVERWRITE INTO TABLE to replace
Scala examples
map:
val l = List(1,2,3,4,5)
l.map(x => x + 3) or l.map(_ + 3)
pass a function as param to map:
def f(x: Int) = if (x > 3) Some(x) else None   // Option[Int], so the result list stays well-typed
l.map(x => f(x)) or l.map(f(_)) or simply l.map(f)   // List(None, None, None, Some(4), Some(5))
flatMap example:
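(a minimal sketch, not from the original notes, reusing l and f from above; flatMap maps each element to a collection and flattens the results)
l.flatMap(x => List(x, x * 10))   // List(1, 10, 2, 20, 3, 30, 4, 40, 5, 50)
l.flatMap(f)                      // List(4, 5) - Option acts as a 0/1-element collection, so None drops out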
**************maven pom for the hbase module (fragment)**************
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <parent>
    <groupId>nosql</groupId>
    <artifactId>gettingstarted</artifactId>
    <version>0.0.1-SNAPSHOT</version>
  </parent>
  <groupId>com</groupId>
  <artifactId>hbase</artifactId>
  <version>0.0.1-hbase-SNAPSHOT</version>
</project>
**************flume**************
Source --> Channel --> Sink
set the source, channel and sink in the agent's .conf file (a sample sketch follows the commands below).
source type e.g. exec, for shell commands like tail -F
flume-ng agent --conf conf --conf-file /usr/hdp/2.5.0.0-1245/flume/conf/flume-hdfs-sink.conf --name agent1
flume-ng agent --conf conf --conf-file /usr/hdp/2.5.0.0-1245/flume/conf/flume-hdfs-sink_file.conf --name agent2
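a minimal sketch of what agent1's flume-hdfs-sink.conf could contain (the source/channel/sink names and the HDFS path are assumptions, not taken from the actual file):
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1

# exec source: run a shell command and turn each output line into an event
agent1.sources.src1.type = exec
agent1.sources.src1.command = tail -F /var/log/messages
agent1.sources.src1.channels = ch1

# buffer events in memory between source and sink
agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 1000

# write events to HDFS as plain text
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = /user/centos7/flume_out
agent1.sinks.sink1.hdfs.fileType = DataStream
agent1.sinks.sink1.channel = ch1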
Small files problem
1. MR - use CombineFileInputFormat to pack many small files into a single split.
   Hive - compact tables by copying them through fewer reducers (a sketch follows this list).
2. set the input split size (defaults to the block size) to control the number of mappers - raise it to get fewer mappers.
   each mapper uses one JVM - the fewer the mappers, the fewer JVMs get created and destroyed.
   smaller split size => more mappers; bigger split size => fewer mappers, each doing more work.
3. allocate a proper number of reducers.
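a sketch of item 1's copy-through-fewer-reducers idea, assuming Hive on MapReduce (snpn_compact is a hypothetical table with snpn's schema):
set mapred.reduce.tasks=1;            -- force one reducer => one output file (pick a small, sensible number)
insert overwrite table snpn_compact
select * from snpn
distribute by sn;                     -- DISTRIBUTE BY forces a reduce phase, so file count = reducer count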
ACID - Atomicity, Consistency, Isolation, Durability
Atomicity: all or none (multiple DMLs execute as one unit).
Consistency: a transaction either creates a new, valid state of the data or, on failure, leaves it in its previous state (commit/rollback).
Isolation: a transaction in progress (uncommitted inserts) should not be visible to other transactions.
Durability: in the event of failure or restart, committed data should be recoverable.
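a generic SQL sketch of atomicity in action (the accounts table and values are made up for illustration):
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;   -- both updates become durable together; ROLLBACK (or a crash before COMMIT) undoes both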
CAP theorem: Consistency, Availability, Partition tolerance
Consistency - every read receives the most recent write or an error.
Availability - every request receives a non-error response, without a guarantee that it contains the most recent write.
Partition tolerance - the system continues to operate despite an arbitrary number of messages being dropped/delayed between the nodes.
**************kafka**************
sudo ssh 127200813data00.eap.g4ihos.itcs.hpecorp.net
cd /usr/hdp/2.5.0.0-1245/kafka/bin/
./kafka-topics.sh --create --zookeeper 127200813master.eap.g4ihos.itcs.hpecorp.net:2181,127200813data02.eap.g4ihos.itcs.hpecorp.net:2181,127200813data01.eap.g4ihos.itcs.hpecorp.net:2181,127200813data00.eap.g4ihos.itcs.hpecorp.net:2181 --replication-factor 1 --partitions 1 --topic test
./kafka-topics.sh --list --zookeeper 127200813master.eap.g4ihos.itcs.hpecorp.net:2181,127200813data02.eap.g4ihos.itcs.hpecorp.net:2181,127200813data01.eap.g4ihos.itcs.hpecorp.net:2181,127200813data00.eap.g4ihos.itcs.hpecorp.net:2181
./kafka-console-producer.sh --broker-list 127200813data00.eap.g4ihos.itcs.hpecorp.net:9092 --topic test
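to read the test messages back, the matching console consumer (Kafka 0.10-era flags; same broker as above):
./kafka-console-consumer.sh --bootstrap-server 127200813data00.eap.g4ihos.itcs.hpecorp.net:9092 --topic test --from-beginning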
If you own the source code, make all methods final - they can never be accidentally overridden.
Arrays and collections should never be null - return an empty array/collection instead.
avoid state - HTTP does this - statelessness makes parallel/distributed execution easier.
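a small Scala sketch of the empty-over-null rule (ordersFor and its data are hypothetical):
// hypothetical lookup - returns an empty list instead of null when nothing matches
def ordersFor(userId: Int): List[String] =
  if (userId == 42) List("order-1", "order-2") else Nil

ordersFor(7).foreach(println)   // safe to iterate, no null check needed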
eventHandling:
https://www.youtube.com/watch?v=ZUe1Xz7DAcY#t=17.905495
add a user
useradd etl_user -g hadoop
identify user
id etl_user
login as a different user
sudo su - etl_user
hdfs is the Hadoop superuser in distributions like Hortonworks/Cloudera, so run HDFS admin tasks as hdfs
sudo su - hdfs
create the new user's hdfs home directory and hand over ownership
hadoop fs -mkdir /user/etl_user
hadoop fs -chown etl_user:supergroup /user/etl_user