Wednesday, March 11, 2015
HDFS Tutorial
Monday, October 20, 2014
Multi-Node (4) Hadoop Cluster on Ubuntu using Oracle VirtualBox and Cloudera CDH5
This article is divided into 2 parts:
1. Installation of Ubuntu on Oracle VirtualBox and cloning
2. Clustering and Cloudera (CDH5) installation
Reference: http://blog.cloudera.com/blog/2014/01/how-to-create-a-simple-hadoop-cluster-with-virtualbox/
Cluster nodes:
nameNode.cdh-cluster.com   192.168.1.100   8 GB RAM     40 GB disk space
dn-01.cdh-cluster.com      192.168.1.101   2-4 GB RAM   40 GB disk space
dn-02.cdh-cluster.com      192.168.1.102   2-4 GB RAM   40 GB disk space
dn-03.cdh-cluster.com      192.168.1.103   2-4 GB RAM   40 GB disk space

Hostnames: nameNode, dn-01, dn-02, dn-03
Domain name: cdh-cluster.com
Host machine: Windows 8
Guest machine / virtual machine: ubuntu-12.04.4-server
Download the latest Oracle VirtualBox for Windows:
https://www.virtualbox.org/wiki/Downloads
Install VirtualBox.
Download ubuntu-12.04.4-server-amd64.iso from the Ubuntu website:
http://releases.ubuntu.com/precise/ubuntu-12.04.4-server-amd64.iso
Creation of the virtual machine:
Install Oracle VirtualBox if it is not already installed.
After the VirtualBox installation:
Click the "New" button in the top left corner.
1. Name and Operating System
Name: nameNode
Type: Linux
Version: Ubuntu (64 bit)
2. Memory Size
Set memory to 8192 MB (8 GB)
3. Hard drive
Choose "Create a virtual hard drive now" -- default
4. Hard drive file type
Choose VDI (VirtualBox Disk Image) -- default
5. Storage on physical hard drive
Choose "Dynamically allocated" -- default
6. File location and size
Choose a location with enough free disk space.
Set the hard disk size to 40 GB.
The virtual machine is now created.
Ubuntu installation on the virtual machine:
Keep the newly created VM highlighted.
1. Network Configuration
Click the "Settings" button next to the "New" button.
Select Network and then the Adapter 1 tab.
Check "Enable Network Adapter" (by default it is checked and attached to NAT).
Attached to: Bridged Adapter
Name: choose your working internet adapter (Wi-Fi or Ethernet)
2. Boot Configuration
Select Settings -> Storage.
Click on the Empty disk icon.
Click on the disk icon at the right end.
Choose ubuntu-12.04.4-server-amd64.iso from your local disk.
Check the Live CD/DVD check box.
All the configuration needed for installation is done.
Now click the "Start" button next to the "Settings" button.
Installation begins...
Events during the installation process:
Keyboard detection: press No
Hostname: give any name and press Continue (the hostname will be changed later)
Full name for the new user: give any name and press Continue
Username for your account: hduser
Choose a password for the new user: <your password>
Re-enter password to verify: <your password>
Encrypt your home directory? No
Partitioning method: Guided - use entire disk
Select disk to partition: whatever default is chosen
Write the changes to disk? Yes
HTTP proxy information (blank for none): keep it blank and press Continue
How do you want to manage upgrades on this system?
Select "Install security updates automatically"
Choose software to install: select OpenSSH server
Install the GRUB boot loader to the master boot record? Yes
Installation complete: press Continue
The system will reboot.
Log in with the user configured during installation.
Your current working directory will be your home directory.
Hostname, domain and IP address configuration
1. Configure the hostname
sudo vi /etc/hostname
Overwrite the content with: nameNode
2. Configure the IP address and domain name
sudo vi /etc/network/interfaces and add the following lines:
auto eth0
iface eth0 inet static
address 192.168.1.100
netmask 255.255.255.0
network 192.168.1.0
broadcast 192.168.1.255
gateway 192.168.1.1 # This is my gateway IP, i.e. the modem IP
dns-nameservers <DNS IP provided by your service provider>
dns-domain cdh-cluster.com # You can give your own domain name
dns-search cdh-cluster.com
Restart the network:
sudo /etc/init.d/networking restart
Ping google.com to check whether this VM is connected to the network; this should work.
Installing the Linux Guest Additions:
sudo apt-get update
sudo apt-get install dkms
My VirtualBox version is 4.3.10; download the Guest Additions matching your version.
wget http://download.virtualbox.org/virtualbox/4.3.10/VBoxGuestAdditions_4.3.10.iso
sudo mount -o loop VBoxGuestAdditions_4.3.10.iso /mnt/
cd /mnt
sudo ./VBoxLinuxAdditions.run
3. Resolving FQDNs to IP addresses
sudo vi /etc/hosts
Add the following lines:
192.168.1.100 nameNode.cdh-cluster.com nameNode
192.168.1.101 dn-01.cdh-cluster.com dn-01
192.168.1.102 dn-02.cdh-cluster.com dn-02
192.168.1.103 dn-03.cdh-cluster.com dn-03
Comment out the line containing 127.0.1.1.
Set up SSH:
sudo apt-get install openssh-clients
ssh-keygen (press Enter to accept the default file, then press Enter twice for an empty passphrase)
cd ~/.ssh
cp id_rsa.pub authorized_keys
sudo vi /etc/ssh/ssh_config
StrictHostKeyChecking no
Shut down the virtual machine:
sudo init 0
Cloning the virtual machine:
In VirtualBox, right-click on the virtual machine and select Clone.
New machine name: provide a name
Clone type: choose Linked clone
Do this two more times so that there are 4 virtual machines in total.
Now change the IP address and hostname of the newly cloned virtual machines:
/etc/hostname
dn-0[n] where n = 1, 2, 3
/etc/network/interfaces
address 192.168.1.n # where n = 101, 102, 103
Restart the network:
sudo /etc/init.d/networking restart
Now the 4 virtual machines are ready for the Hadoop cluster.
Download Cloudera Manager on nameNode (you could use a separate machine to install Cloudera Manager; I am doing it on nameNode):
$> curl -O http://archive.cloudera.com/cm4/installer/latest/cloudera-manager-installer.bin
$> chmod +x cloudera-manager-installer.bin
$> ./cloudera-manager-installer.bin
After installation, use a web browser on your host machine or any other machine on the same network and connect to http://namenode.cdh-cluster.com:7180 or http://<ip-address>:7180
To continue the installation, you will have to select the Cloudera free license version. You will then enter the IP addresses of the nodes that will be used in the cluster; just enter all the node IP addresses separated by a space and click the "Search" button. You can then use the configured user's (hduser) password (or the SSH keys you have generated) to automate the connectivity to the different nodes.
In Ubuntu, root privileges are required for any package installation.
sudo vi /etc/sudoers and add the following line:
hduser ALL=(ALL) NOPASSWD: ALL # Here hduser is the user configured on my machine
Once this is done, you will select the additional service components; just select everything by default. The installation will then continue and complete.
You can also customize and add service components one at a time.
By default:
PostgreSQL is used for the Hive metastore
Derby DB is used for Oozie
Friday, September 26, 2014
Analyse StackOverflow Dataset using Pig with Filter UDF
-- Problem: Given a list of users' comments, filter out the majority of comments that do not contain a particular keyword
-- Reference: MapReduce Design Patterns, Pig programming, http://pig.apache.org/docs/r0.11.1/index.html
-- Input: StackOverflow comments dataset. A small subset of the comments dataset is used.
register '/opt/cloudera/parcels/CDH/lib/pig/piggybank.jar';
register '/home/hduser/Documents/mytutorial/Pig/SOFilter.jar';
socomments = LOAD 'hive/stackoverflow/comments/comments' USING org.apache.pig.piggybank.storage.XMLLoader('row') AS (record:chararray);
-- tuples are filtered using the custom Filter UDF
filtered_comments = FILTER socomments BY SOFilter(record);
STORE filtered_comments INTO 'hive/stackoverflow/comments/comments/output';
Filter UDF
The keywords are hardcoded here for simplicity; they could instead be read from a file through the distributed cache.
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

import org.apache.pig.FilterFunc;
import org.apache.pig.data.Tuple;

public class SOFilter extends FilterFunc {

    @Override
    public Boolean exec(Tuple input) throws IOException {
        try {
            String xmlRecord = (String) input.get(0);
            if (null == xmlRecord) {
                return false;
            }
            // Parse the attributes of the <row .../> element into a key/value map.
            Map<String, String> parsed = new HashMap<String, String>();
            String[] tokens = xmlRecord.trim().substring(5, xmlRecord.trim().length() - 3).split("\"");
            for (int i = 0; i < tokens.length - 1; i += 2) {
                String key = tokens[i].trim();
                String val = tokens[i + 1];
                parsed.put(key.substring(0, key.length() - 1), val);
            }
            // Keep the tuple only if the comment text contains one of the keywords.
            String text = parsed.get("Text");
            if (null != text) {
                text = text.toLowerCase();
                if (Pattern.compile("\\bhadoop\\b|\\bhbase\\b|\\bhive\\b|\\bmapreduce\\b|\\boozie\\b|\\bsqoop\\b|\\bflume\\b").matcher(text).find()) {
                    return true;
                }
            }
        } catch (Exception e) {
            System.err.println("Failed to process input; error - " + e.getMessage());
        }
        return false;
    }
}
Friday, September 19, 2014
Analyse StackOverflow Dataset using Pig with Eval UDF
-- Problem: Given a list of users' comments, determine the first and last time a user commented and the total number of comments from that user
-- Reference: MapReduce Design Patterns, Pig programming, http://pig.apache.org/docs/r0.11.1/index.html
-- Input: StackOverflow comments dataset. A small subset of the comments dataset is used.
register '/opt/cloudera/parcels/CDH/lib/pig/piggybank.jar';
register '/home/hduser/Documents/mytutorial/Pig/XMLMap.jar';
-- Here the comments dataset is loaded using XMLLoader. However, TextLoader could also be used, as there is one XML element per line.
comments = LOAD 'hive/stackoverflow/comments/comments' USING org.apache.pig.piggybank.storage.XMLLoader('row') AS (record:chararray);
-- Using a custom UDF, parse the XML tuple and return a Map.
commentsMap = FOREACH comments GENERATE XMLMap(record);
-- From the Map extract UserId and CreationDate, converting CreationDate from a string to a datetime object.
A = FOREACH commentsMap GENERATE $0#'UserId' AS userId:chararray, ToDate($0#'CreationDate', 'yyyy-MM-dd\'T\'HH:mm:ss.SSS') AS date;
-- Group the tuples by userId. All comments of a user are grouped together.
B = GROUP A BY userId;
-- For each group, count the number of comments and calculate the min (first comment) and max (last comment) creation dates.
C = FOREACH B GENERATE group AS userId, COUNT(A) AS totalcomments, MIN(A.date) AS first, MAX(A.date) AS last;
STORE C INTO 'hive/stackoverflow/comments/output';
// Custom UDF XMLMap. The input is a tuple containing the XML element. This function converts the tuple into a Map of key/value pairs and returns that Map.
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class XMLMap extends EvalFunc<Map<String, String>> {

    @Override
    public Map<String, String> exec(Tuple input) throws IOException {
        try {
            String xmlRecord = (String) input.get(0);
            // Parse the attributes of the <row .../> element into a key/value map.
            Map<String, String> map = new HashMap<String, String>();
            String[] tokens = xmlRecord.trim().substring(5, xmlRecord.trim().length() - 3).split("\"");
            for (int i = 0; i < tokens.length - 1; i += 2) {
                String key = tokens[i].trim();
                String val = tokens[i + 1];
                map.put(key.substring(0, key.length() - 1), val);
            }
            return map;
        } catch (Exception e) {
            System.err.println("Failed to process input; error - " + e.getMessage());
            return null;
        }
    }
}
Friday, September 12, 2014
Reduce Side Join in MapReduce
Reduce Side Join
As the name indicates, the join is done in the reducer. The join is done on 2 or more large datasets.
It supports inner join, left outer join, right outer join, full outer join and anti-join.
High network bandwidth is required because a huge amount of data is sent to the reduce phase.
All the datasets are very large.
DataSetA (UserDataSet) contains user details.
DataSetB (CommentDataSet) contains comment details.
A simplified version of the dataset is used here.
The join is done on userId.
Two Mapper classes are defined: UserMapper and CommentMapper.
UserMapper outputs userId and Name as key and value respectively from UserDataSet. Here 'A' is prefixed to the value.
CommentMapper outputs userId and Comment as key and value respectively from CommentDataSet. Here 'B' is prefixed to the value.
'A' and 'B' are prefixed so that the Reducer knows, for a given key, which value came from which dataset.
The Reducer receives values from both datasets for a given key.
The Reducer builds listA for values from DataSetA and listB for values from DataSetB, as sketched in the code below.
Using listA and listB, inner join, left outer join and right outer join are done.
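The sketch below is a minimal, illustrative version of the reducer logic described above; the class name UserCommentJoinReducer and the join.type configuration key are assumptions made for this example, not the original code.

// Illustrative reduce-side join reducer: values tagged 'A' come from the
// user mapper, values tagged 'B' from the comment mapper.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class UserCommentJoinReducer extends Reducer<Text, Text, Text, Text> {

    // Hypothetical configuration key; "inner" is assumed as the default.
    private String joinType = "inner";

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        joinType = context.getConfiguration().get("join.type", "inner");
    }

    @Override
    protected void reduce(Text userId, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        List<Text> listA = new ArrayList<>(); // user records for this userId
        List<Text> listB = new ArrayList<>(); // comment records for this userId

        // Split the tagged values back into their source lists.
        for (Text value : values) {
            String v = value.toString();
            if (v.startsWith("A")) {
                listA.add(new Text(v.substring(1)));
            } else if (v.startsWith("B")) {
                listB.add(new Text(v.substring(1)));
            }
        }

        if (joinType.equals("inner")) {
            // Emit the cross product only when both sides have records for this key.
            for (Text a : listA) {
                for (Text b : listB) {
                    context.write(a, b);
                }
            }
        } else if (joinType.equals("leftouter")) {
            // Every record from A is emitted, paired with B records when present.
            for (Text a : listA) {
                if (listB.isEmpty()) {
                    context.write(a, new Text(""));
                } else {
                    for (Text b : listB) {
                        context.write(a, b);
                    }
                }
            }
        } else if (joinType.equals("rightouter")) {
            // Every record from B is emitted, paired with A records when present.
            for (Text b : listB) {
                if (listA.isEmpty()) {
                    context.write(new Text(""), b);
                } else {
                    for (Text a : listA) {
                        context.write(a, b);
                    }
                }
            }
        }
    }
}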
Tuesday, September 9, 2014
Bloom Filter
A Bloom filter is a data structure used to check whether an object exists in a set. The result of the check operation is either a definitive no, i.e. the object does not exist for sure, or a "maybe", i.e. there is a probability that the object exists in the set.
There are 2 operations on a Bloom filter:
Adding objects to the Bloom filter (building the Bloom filter)
Checking whether an object is in the Bloom filter
The following parameters are used in a Bloom filter:
m: number of bits
n: number of objects that will be added to the set
p: desired false positive rate (e.g. 1 in 1,000,000 checks could be a false positive)
k: number of different hash functions (the sketch right below shows how m and k are typically derived from n and p)
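As a side note (not part of the original post), m and k are usually computed from n and p with the standard Bloom filter sizing formulas; a small illustrative sketch:

// Standard Bloom filter sizing: given n objects and a desired false-positive
// rate p, compute the number of bits m and the number of hash functions k.
public class BloomSizing {
    public static void main(String[] args) {
        int n = 1_000_000;   // expected number of objects (example value)
        double p = 0.000001; // desired false positive rate (1 in 1,000,000)

        // m = ceil(-n * ln(p) / (ln 2)^2), k = round((m / n) * ln 2)
        long m = (long) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
        int k = (int) Math.round((double) m / n * Math.log(2));

        System.out.println("bits m = " + m + ", hash functions k = " + k);
    }
}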
For both the adding and checking operations, a bit array is used, with all bits initially set to zero.
Adding an object involves hashing the object multiple times. A modulo against the bit array size (mod m) is applied to each hash result. This gives k integers with values between 0 and m-1. These integers are used as indexes into the bit array, and the bit at each index is set to 1. Two or more objects may set the same bit.
Checking an object involves the same hashing and modulo. All of the object's bit positions are looked up in the bit array. If all of those bits are set, then the object may exist; there is a probability that those bits were set by other objects.
Consider a Bloom filter bit array (size 16, i.e. m = 16). Initially all bits are set to zero:
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Consider 4 objects A, B, C and D, and 2 hash functions.
Adding objects A and B to the Bloom filter: since there are 2 hash functions, the object is hashed twice, and a modulo of m is applied to every hash result.
const c1 = hash1(Object A);
int x = mod_m(c1); // x is between 0 and 15, i.e. mod 16
const c2 = hash2(Object A);
int y = mod_m(c2); // y is between 0 and 15, i.e. mod 16
Suppose x and y are 0 and 15 respectively.
Bloom filter bit array after adding Object A:
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
Bloom filter bit array after adding Object B (x = 2, y = 3):
1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1
Checking objects C and D in the Bloom filter:
After doing the hash and modulo, let's suppose x = 0, y = 2 for C, and x = 7, y = 10 for D.
For object C the result is true, but C was never actually added; its bits were set by objects A and B. This is called a false positive.
For object D the result is false, so it is certain that object D does not exist.
A Bloom filter uses little storage: the objects themselves are not stored, only a few bits per object.
Once the bit array is created it cannot be modified; it can only be recreated.
Adding more objects to a Bloom filter will not increase the size of the bit array, but will set more bits to 1. Setting more bits to 1 increases the false positive rate.
A Bloom filter is used in HBase to filter out HFile blocks that do not contain a given key. A minimal sketch of the add and check operations follows below.
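Below is a minimal, illustrative Bloom filter mirroring the add and check operations described above. Deriving the k positions from Java's hashCode mixed with a per-hash seed is an assumption made for this sketch; it is not how HBase implements its Bloom filters.

import java.util.BitSet;

// Illustrative Bloom filter: k hash positions per object are derived from the
// object's hashCode combined with a per-hash seed, then reduced mod m.
public class SimpleBloomFilter {
    private final BitSet bits;
    private final int m;   // number of bits
    private final int k;   // number of hash functions

    public SimpleBloomFilter(int m, int k) {
        this.bits = new BitSet(m);
        this.m = m;
        this.k = k;
    }

    private int position(Object o, int seed) {
        int h = o.hashCode() * 31 + seed * 0x9E3779B9;  // mix the seed into the hash
        return Math.floorMod(h, m);                     // index in 0..m-1
    }

    // Adding: set the k bit positions for this object.
    public void add(Object o) {
        for (int i = 0; i < k; i++) {
            bits.set(position(o, i));
        }
    }

    // Checking: the object may exist only if all k bits are set.
    public boolean mightContain(Object o) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(position(o, i))) {
                return false;  // definitely not in the set
            }
        }
        return true;           // possibly in the set (false positives possible)
    }

    public static void main(String[] args) {
        SimpleBloomFilter filter = new SimpleBloomFilter(16, 2);
        filter.add("A");
        filter.add("B");
        System.out.println(filter.mightContain("B")); // true
        System.out.println(filter.mightContain("D")); // usually false
    }
}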
Monday, September 8, 2014
HADOOP FAQ
Why is the block size 64/128 MB?
The NameNode holds the filesystem (HDFS) metadata in memory, so the number of files in HDFS is limited by the amount of memory on the NameNode.
For each file, directory and block, around 150 bytes of metadata is required. This metadata includes the filename, block id, replication factor, location of each block, etc.
Suppose the block size were 4 KB. A 1 GB file would result in roughly 250,000 blocks, and with a replication factor of 3 that is roughly 750,000 block replicas. For each block, metadata has to be stored in the NameNode's memory. With a 128 MB block size, however, there are only 8 blocks, or 24 with replication. So a smaller block size consumes far more NameNode memory than a larger one, as the rough calculation below illustrates.
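A rough, illustrative back-of-the-envelope calculation of the arithmetic above, assuming (as the post does) roughly 150 bytes of NameNode metadata per block:

// Rough NameNode memory estimate for a 1 GB file at different block sizes.
public class BlockMetadataEstimate {
    public static void main(String[] args) {
        long fileSize = 1L << 30;       // 1 GB
        int replication = 3;
        long bytesPerBlockMeta = 150;   // approximate metadata per block (assumption)

        long[] blockSizes = {4L << 10, 128L << 20};   // 4 KB and 128 MB
        for (long blockSize : blockSizes) {
            long blocks = (fileSize + blockSize - 1) / blockSize;  // ceiling division
            long replicas = blocks * replication;
            System.out.println(blockSize + " B blocks: " + blocks + " blocks, "
                    + replicas + " replicas, ~" + (blocks * bytesPerBlockMeta)
                    + " bytes of NameNode metadata");
        }
    }
}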
Why millions of large files rather than billions of small files?
The NameNode holds the filesystem (HDFS) metadata in memory, so the number of files in HDFS is limited by the amount of memory on the NameNode.
For each file, directory and block, around 150 bytes of metadata is required; this metadata includes the filename, block id, replication factor, location of each block, etc. Consider 1000 files, each 1 MB in size. Each file requires a single 1 MB block, so in total there are 1000 blocks, and with a replication factor of 3 that is 3000 block replicas. Now consider a single 1 GB file: it takes 8 blocks, or 24 after replication. So the metadata for one large file takes much less NameNode memory than the metadata for 1000 small files.
From a MapReduce perspective, if there are a lot of small files, the cost of starting the worker processes can be disproportionately high compared to the amount of data being processed.