Wednesday, March 11, 2015
HDFS Tutorial
Posted by pramod narayana at 6:27 AM
Monday, October 20, 2014
Multi-Node (4-Node) Hadoop Cluster on Ubuntu using Oracle VirtualBox and Cloudera CDH5
This article is divided into 2 parts:
Installation of Ubuntu on Oracle VirtualBox and Cloning
Clustering and Cloudera (CDH5) installation
Reference: http://blog.cloudera.com/blog/2014/01/how-to-create-a-simple-hadoop-cluster-with-virtualbox/
nameNode.cdh-cluster.com   192.168.1.100   8 GB RAM     40 GB disk space
dn-01.cdh-cluster.com      192.168.1.101   2-4 GB RAM   40 GB disk space
dn-02.cdh-cluster.com      192.168.1.102   2-4 GB RAM   40 GB disk space
dn-03.cdh-cluster.com      192.168.1.103   2-4 GB RAM   40 GB disk space
Hostnames: nameNode, dn-01, dn-02, dn-03
domain name: cdh-cluster.com
host machine: Windows 8
guest machine/virtual machine: ubuntu-12.04.4-server
Download the latest Oracle VirtualBox for Windows:
https://www.virtualbox.org/wiki/Downloads
Install VirtualBox.
Download ubuntu-12.04.4-server-amd64.iso from the Ubuntu website:
http://releases.ubuntu.com/precise/ubuntu-12.04.4-server-amd64.iso
Creation of Virtual Machine:
Install Oracle VirtualBox if not already installed.
After the VirtualBox installation,
click the "New" button in the top left corner.
1. Name and Operating System
Name: nameNode
Type: Linux
Version: Ubuntu (64 bit)
2. Memory Size
Set memory to 8192 MB (8 GB)
3. Hard drive
Choose Create a virtual hard drive now -- default
4. Hard drive file type
Choose VDI (Virtual Disk Image) -- default
5. Storage on physical hard drive
Choose Dynamically allocated -- default
6. File location and size
Choose a location where you have enough disk space.
Set the hard disk size to 40 GB
Now the virtual machine is created.
Ubuntu Installation on the Virtual Machine:
Keep the newly created VM highlighted.
1. Network Configuration
Click the "Settings" button next to the "New" button
Select Network and then the Adapter 1 tab
Check Enable Network Adapter (by default it will be checked and attached to NAT)
Attached to: Bridged Adapter
Name: choose your working internet adapter (Wi-Fi or Ethernet)
2. Boot Configuration
Select Settings -> Storage
Click on the Empty disk icon
Click on the disk icon at the right end corner
Choose ubuntu-12.04.4-server-amd64.iso from your local disk
Check the Live CD/DVD checkbox
All the configuration for installation is done.
Now click the "Start" button next to the "Settings" button
Installation Begins...
Events during the installation process:
Keyboard detection: press No
Hostname: give any name and press Continue (the hostname will be changed later)
Full Name for the new user: Give any name and press continue
Username for your account: hduser
Choose a password for the new user: <your password>
Re-enter password to verify: <your password>
Encrypt your home directory? No
Partitioning method: Guided - use entire disk
Select disk to partition: accept the default
Write the changes to disk? Yes
HTTP proxy information (blank for none): Keep it blank and press continue
How do you want to manage upgrades on this system?
Select Install security updates automatically
Choose software to install: select OpenSSH server
Install the GRUB boot loader to the master boot record? Yes
Installation complete: Press continue
The system will reboot.
Log in with the user configured during installation.
Your current working directory will be your home directory.
Hostname, Domain, and IP Address Configuration
1. Configure hostname
sudo vi /etc/hostname
Overwrite the content with nameNode
2. Configure IP address and domain name
sudo vi /etc/network/interfaces and add the following lines:
auto eth0
iface eth0 inet static
address 192.168.1.100
netmask 255.255.255.0
network 192.168.1.0
broadcast 192.168.1.255
gateway 192.168.1.1 # This is my gateway IP, i.e., the modem IP
dns-nameservers <dns ip provided by your serivce provider>
dns-domain cdh-cluster.com # You can give your own domain name
dns-search cdh-cluster.com
Restart network
sudo /etc/init.d/networking restart
Try pinging google.com to check whether this VM is connected to the network. This should work.
Installing the Linux Guest Additions
sudo apt-get update
sudo apt-get install dkms
My VirtualBox version is 4.3.10. Download the Guest Additions matching your version:
wget http://download.virtualbox.org/virtualbox/4.3.10/VBoxGuestAdditions_4.3.10.iso
sudo mount -o loop VBoxGuestAdditions_4.3.10.iso /mnt/
cd /mnt
sudo ./VBoxLinuxAdditions.run
3. For resolving FQDNs to IP addresses
sudo vi /etc/hosts
add the following lines
192.168.1.100 nameNode.cdh-cluster.com nameNode
192.168.1.101 dn-01.cdh-cluster.com dn-01
192.168.1.102 dn-02.cdh-cluster.com dn-02
192.168.1.103 dn-03.cdh-cluster.com dn-03
Comment out the line containing 127.0.1.1
Setup SSH
sudo apt-get install openssh-client
ssh-keygen (run as hduser; press Enter at each prompt to accept the defaults)
cd ~/.ssh
cp id_rsa.pub authorized_keys
sudo vi /etc/ssh/ssh_config and set:
StrictHostKeyChecking no
Shut down the virtual machine
sudo init 0
Cloning the Virtual Machine
In VirtualBox, right-click on the virtual machine and select Clone
New Machine Name: provide a name
Clone Type: choose Linked clone
Do this 2 more times so that there are 4 virtual machines
Now change the IP address and hostname of these newly cloned virtual machines
/etc/hostname
dn-0[n], where n = 1, 2, 3
/etc/network/interfaces
address 192.168.1.[n] # where n = 101, 102, 103
Restart the network
sudo /etc/init.d/networking restart
Now the 4 virtual machines are ready for the Hadoop cluster.
Download Cloudera Manager on nameNode (you could use a separate machine to install Cloudera Manager; I am doing it on nameNode)
$> curl -O http://archive.cloudera.com/cm4/installer/latest/cloudera-manager-installer.bin
$> chmod +x cloudera-manager-installer.bin
$> ./cloudera-manager-installer.bin
After installation, use a web browser on your host machine or any other machine on the same network and connect to http://namenode.cdh-cluster.com:7180 or http://<ip-address>:7180
To continue the installation, you will have to select the Cloudera free license version. You will then enter the IP addresses of the nodes that will be used in the cluster: just enter all the nodes' IP addresses separated by spaces and click the "Search" button. You can then use the configured user's (hduser) password (or the SSH keys you have generated) to automate the connectivity to the different nodes.
In Ubuntu, root privileges are required for any package installation.
sudo vi /etc/sudoers and add the following line:
hduser ALL=(ALL) NOPASSWD: ALL # Here hduser is the user configured on my machine
Once this is done, you will select additional service components; just accept the defaults to select everything. The installation will continue and complete.
You can also customize and add service components one at a time.
By default, PostgreSQL is used for the Hive metastore and Derby DB for Oozie.
Posted by pramod narayana at 10:17 AM
Friday, September 26, 2014
Analyse StackOverflow Dataset using Pig with Filter UDF
-- Problem: Given a list of users' comments, filter out the majority of comments that do not contain a particular keyword
-- Reference: MapReduce Design Patterns, Pig Programming, http://pig.apache.org/docs/r0.11.1/index.html
-- Input: StackOverflow comments dataset. A small subset of the comments dataset is used.
register '/opt/cloudera/parcels/CDH/lib/pig/piggybank.jar';
register '/home/hduser/Documents/mytutorial/Pig/SOFilter.jar';
socomments = LOAD 'hive/stackoverflow/comments/comments' USING org.apache.pig.piggybank.storage.XMLLoader('row') AS (record:chararray);
-- tuples are filtered using the custom Filter UDF
filtered_comments = FILTER socomments BY SOFilter(record);
STORE filtered_comments INTO 'hive/stackoverflow/comments/comments/output';
Filter UDF
Keywords are hardcoded here for simplicity; they could instead be read from a file through the distributed cache.
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

import org.apache.pig.FilterFunc;
import org.apache.pig.data.Tuple;

public class SOFilter extends FilterFunc {
    @Override
    public Boolean exec(Tuple input) throws IOException {
        try {
            String xmlRecord = (String) input.get(0);
            if (null == xmlRecord) {
                return false;
            }
            // Strip the leading "<row " and trailing "/>", then split the remaining
            // attribute string on double quotes, yielding alternating names and values.
            Map<String, String> parsed = new HashMap<String, String>();
            String[] tokens = xmlRecord.trim().substring(5, xmlRecord.trim().length() - 3).split("\"");
            for (int i = 0; i < tokens.length - 1; i += 2) {
                String key = tokens[i].trim();
                String val = tokens[i + 1];
                // Drop the trailing '=' from the attribute name.
                parsed.put(key.substring(0, key.length() - 1), val);
            }
            String text = parsed.get("Text");
            if (null != text) {
                text = text.toLowerCase();
                // Keep the comment only if it mentions one of the hardcoded keywords.
                if (Pattern.compile("\\bhadoop\\b|\\bhbase\\b|\\bhive\\b|\\bmapreduce\\b|\\boozie\\b|\\bsqoop\\b|\\bflume\\b").matcher(text).find()) {
                    return true;
                }
            }
        } catch (Exception e) {
            System.err.println("Failed to process input; error - " + e.getMessage());
        }
        return false;
    }
}
Posted by pramod narayana at 7:29 AM
Friday, September 19, 2014
Analyse StackOverflow Dataset using Pig with Eval UDF
-- Problem: Given a list of users' comments, determine the first and last time a user commented and the total number of comments from that user
-- Reference: MapReduce Design Patterns, Pig Programming, http://pig.apache.org/docs/r0.11.1/index.html
-- Input: StackOverflow comments dataset. A small subset of the comments dataset is used.
register '/opt/cloudera/parcels/CDH/lib/pig/piggybank.jar';
register '/home/hduser/Documents/mytutorial/Pig/XMLMap.jar';
-- Here the comments dataset is loaded using XMLLoader. However, TextLoader could also be used since there is one XML element per line.
comments = LOAD 'hive/stackoverflow/comments/comments' USING org.apache.pig.piggybank.storage.XMLLoader('row') AS (record:chararray);
-- Using a custom UDF, parse the XML tuple and return a map.
commentsMap = FOREACH comments GENERATE XMLMap(record);
-- From the map, extract UserId and CreationDate; convert CreationDate from a string to a datetime object.
A = foreach commentsMap generate $0#'UserId' as userId:chararray, ToDate($0#'CreationDate', 'yyyy-MM-dd\'T\'HH:mm:ss.SSS') as date;
-- Group the tuples by userId. All comments of a user are grouped together.
B = GROUP A BY userId;
-- For each group, count the number of comments and calculate the min (first comment) and max (last comment) creation dates.
C = FOREACH B GENERATE group as userId, COUNT(A) as totalcomments, MIN(A.date) as first, MAX(A.date) as last;
STORE C into 'hive/stackoverflow/comments/output';
// Custom UDF XMLMap. The input is a tuple containing the XML element. This function converts the tuple into a map of attribute names to values and returns that map.
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class XMLMap extends EvalFunc<Map<String, String>> {
    @Override
    public Map<String, String> exec(Tuple input) throws IOException {
        try {
            String xmlRecord = (String) input.get(0);
            // Strip the leading "<row " and trailing "/>", then split the remaining
            // attribute string on double quotes, yielding alternating names and values.
            Map<String, String> map = new HashMap<String, String>();
            String[] tokens = xmlRecord.trim().substring(5, xmlRecord.trim().length() - 3).split("\"");
            for (int i = 0; i < tokens.length - 1; i += 2) {
                String key = tokens[i].trim();
                String val = tokens[i + 1];
                // Drop the trailing '=' from the attribute name.
                map.put(key.substring(0, key.length() - 1), val);
            }
            return map;
        } catch (Exception e) {
            System.err.println("Failed to process input; error - " + e.getMessage());
            return null;
        }
    }
}
Posted by pramod narayana at 6:27 AM
Friday, September 12, 2014
Reduce Side Join in MapReduce.
Reduce Side Join
As the name indicates, the join is done in the reducer. The join is done on 2 or more large datasets.
It supports inner join, left outer join, right outer join, full outer join, and anti-join.
High network bandwidth is required because a huge amount of data is sent to the reduce phase.
All the datasets are very large.
DataSetA (UserDataSet) contains user details
DataSetB (CommentDataSet) contains comment details
Here a simplified version of the dataset is used.
Here the join is done on userId.
2 Mapper classes are defined:
UserMapper and CommentMapper
UserMapper outputs userId and Name as key and value respectively from UserDataSet. Here 'A' is prefixed to the value.
CommentMapper outputs userId and Comment as key and value respectively from CommentDataSet. Here 'B' is prefixed to the value.
'A' and 'B' are prefixed so that the reducer knows, for a given key, which value came from which dataset.
The reducer receives values from both datasets for a given key.
The reducer builds listA for values from DataSetA and listB for values from DataSetB.
Using listA and listB, inner join, left outer join, and right outer join can be done.
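The sketch below illustrates this approach (a minimal illustration, not the exact code from the post). It assumes tab-separated text input in which the user ID is the first field of both datasets, the name is the second field of a user record, and the comment text is the second field of a comment record; the class names and field layout are assumptions. Only the inner join is shown.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ReduceSideJoin {

    // UserMapper: emits (userId, "A" + name) for every user record.
    public static class UserMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t");
            if (fields.length >= 2) {
                context.write(new Text(fields[0]), new Text("A" + fields[1]));
            }
        }
    }

    // CommentMapper: emits (userId, "B" + comment) for every comment record.
    public static class CommentMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t");
            if (fields.length >= 2) {
                context.write(new Text(fields[0]), new Text("B" + fields[1]));
            }
        }
    }

    // JoinReducer: for each userId, separates the values back into listA (users)
    // and listB (comments) using the prefix, then performs an inner join.
    public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text userId, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            List<String> listA = new ArrayList<String>();
            List<String> listB = new ArrayList<String>();
            for (Text value : values) {
                String v = value.toString();
                if (v.startsWith("A")) {
                    listA.add(v.substring(1));
                } else {
                    listB.add(v.substring(1));
                }
            }
            // Inner join: emit a record only when the userId appears in both datasets.
            for (String name : listA) {
                for (String comment : listB) {
                    context.write(userId, new Text(name + "\t" + comment));
                }
            }
        }
    }
}

In the driver, each mapper is wired to its own input path, for example with MultipleInputs.addInputPath(job, usersPath, TextInputFormat.class, UserMapper.class). A left or right outer join changes only the reducer's emit logic, e.g., emitting the user with an empty comment when listB is empty.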
Posted by pramod narayana at 9:19 AM
Tuesday, September 9, 2014
Bloom Filter
A Bloom filter is a data structure used to check whether an object exists in a set. The result of the check operation is either a definitive no (the object does not exist for sure) or a maybe (the object may exist in the set, i.e., there is a probability that the object exists).
There are 2 operations on a Bloom filter:
Adding objects to the Bloom filter (building the Bloom filter)
Checking objects against the Bloom filter
The following parameters are used in a Bloom filter:
m: Number of bits
n: Number of objects that will be added to the set
p: Desired false positive rate (e.g., 1 in 1,000,000 checks could be a false positive)
k: Number of different hash functions
For both Adding and Checking operations, a bit array is used with all the bits initially set to zero.
Adding an object involves hashing the object multiple times. A modulo against the bit array size (mod m) is taken of each hash result, giving k integers with values between 0 and m-1. These integers are used as indexes into the bit array, and the bit at each index is set to 1. Two or more objects may set the same bit.
Checking an object involves the same hashing and modulo. All of the object's bit positions are looked up in the bit array: if every one of those bits is set, the object may exist, since there is a chance those bits were set by other objects; if any bit is unset, the object definitely does not exist.
Consider a Bloom filter bit array (size 16, i.e., m = 16). Initially all bits are set to zero:
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Consider 4 objects A, B, C and D and 2 hash functions.
Adding objects A and B to the Bloom filter: since there are 2 hash functions, each object is hashed twice, and a modulo of m is taken of every hash result.
const c1 = hash1(Object A);
int x = mod_m(c1); // x is between 0 and 15, i.e., mod 16
const c2 = hash2(Object A);
int y = mod_m(c2); // y is between 0 and 15, i.e., mod 16
Suppose x and y are 0 and 15 respectively.
Bloom filter bit array after adding object A (x = 0, y = 15):
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
Bloom filter bit array after adding object B (x = 2, y = 3):
1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1
Checking objects C and D in the Bloom filter:
After doing the hash and modulo, let's suppose x = 0, y = 2 for C, and x = 7, y = 10 for D.
For object C the result is true, but C was actually never added; bits 0 and 2 were set by objects A and B. This is called a false positive.
For object D the result is false (bits 7 and 10 are not set), so it is certain that object D does not exist.
A Bloom filter uses very little storage: the objects themselves need not be stored, only a few bits per object.
Once the bit array is created its size cannot be changed; it can only be recreated.
Adding more objects to the Bloom filter will not increase the size of the bit array, but it will set more bits to 1, which increases the false positive rate.
Bloom filters are used in HBase to filter out HFile blocks that do not contain the key.
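To make the adding and checking operations concrete, below is a minimal, self-contained Java sketch (not the Hadoop or HBase implementation). The sizing formulas m = -n*ln(p)/(ln 2)^2 and k = (m/n)*ln 2 are the standard ones; the double-hashing scheme used to derive the k hash functions and the class name are illustrative assumptions.

import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.BitSet;

public class SimpleBloomFilter {
    private final BitSet bits; // the bit array of size m
    private final int m;       // number of bits
    private final int k;       // number of hash functions

    // n: expected number of objects, p: desired false positive rate.
    public SimpleBloomFilter(int n, double p) {
        this.m = (int) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
        this.k = Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
        this.bits = new BitSet(m);
    }

    // Derive the i-th hash as h1 + i*h2 (double hashing), then take it modulo m.
    private int index(byte[] data, int i) {
        int h1 = Arrays.hashCode(data);
        int h2 = new String(data, StandardCharsets.UTF_8).hashCode();
        return Math.floorMod(h1 + i * h2, m);
    }

    // Adding: set the k bits the object hashes to.
    public void add(String object) {
        byte[] data = object.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < k; i++) {
            bits.set(index(data, i));
        }
    }

    // Checking: true means the object may exist (possible false positive);
    // false means the object definitely does not exist.
    public boolean mightContain(String object) {
        byte[] data = object.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < k; i++) {
            if (!bits.get(index(data, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        SimpleBloomFilter filter = new SimpleBloomFilter(1000, 0.000001);
        filter.add("A");
        filter.add("B");
        System.out.println(filter.mightContain("B")); // true: B was added
        System.out.println(filter.mightContain("D")); // false (or, rarely, a false positive)
    }
}

A production implementation would use better-distributed hash functions, but the add and check methods are exactly the two operations described above.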
Posted by pramod narayana at 2:46 AM
Monday, September 8, 2014
HADOOP FAQ
Why is the block size 64/128 MB?
The NameNode holds the filesystem (HDFS) metadata in memory, so the number of files in HDFS is limited by the amount of memory on the NameNode.
For each file, directory, and block, around 150 bytes of metadata is required. This metadata includes the filename, block ID, replication factor, location of each block, etc.
Suppose the block size is 4 KB. A 1 GB file will result in about 250,000 blocks; with a replication factor of 3, that is about 750,000 block replicas, and for each block the metadata has to be stored in the NameNode's memory. But with a 128 MB block size there will be only 8 blocks, and 24 with replication. So a smaller block size consumes far more NameNode memory than a larger one.
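The arithmetic above can be verified with a small throwaway calculation (a sketch; the 150-byte figure is the rough per-object estimate quoted above, not an exact HDFS constant):

public class BlockMath {
    public static void main(String[] args) {
        long fileSize = 1L << 30;       // 1 GB
        long smallBlock = 4L << 10;     // 4 KB block size
        long largeBlock = 128L << 20;   // 128 MB block size
        int replication = 3;
        long bytesPerBlock = 150;       // rough NameNode metadata per block

        long smallBlocks = fileSize / smallBlock; // 262,144 (~250,000)
        long largeBlocks = fileSize / largeBlock; // 8

        System.out.println("4 KB blocks:   " + smallBlocks
                + " (" + smallBlocks * replication + " with replication)");
        System.out.println("128 MB blocks: " + largeBlocks
                + " (" + largeBlocks * replication + " with replication)");
        // The smaller block size tracks roughly 32,000 times as many blocks,
        // and therefore costs roughly 32,000 times as much NameNode metadata.
        System.out.println("Approx. metadata: " + smallBlocks * bytesPerBlock / 1024
                + " KB vs " + largeBlocks * bytesPerBlock + " bytes");
    }
}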
Why millions of large files rather than billions of small files?
The NameNode holds the filesystem (HDFS) metadata in memory, so the number of files in HDFS is limited by the amount of memory on the NameNode.
For each file, directory, and block, around 150 bytes of metadata is required. This metadata includes the filename, block ID, replication factor, location of each block, etc. Consider 1000 files, each of size 1 MB. Each file requires a single block of size 1 MB, so in total there are 1000 blocks, and with a replication factor of 3 there are 3000. Now consider 1 file of size 1 GB: with a 128 MB block size it will take 8 blocks, and 24 after replication. So 1 large file's metadata takes far less NameNode memory than 1000 small files'.
From a MapReduce perspective, if there are a lot of small files, the cost of starting the worker processes can be disproportionately high compared to the amount of data they process.