Wednesday, March 11, 2015
HDFS Tutorial
Posted by pramod narayana at 6:27 AM
Monday, October 20, 2014
Multi-Node (4-Node) Hadoop Cluster on Ubuntu using Oracle VirtualBox and Cloudera CDH5
This article is divided into 2 parts:
Installation of Ubuntu on Oracle VirtualBox and Cloning
Clustering and Cloudera (CDH5) installation
Reference: http://blog.cloudera.com/blog/2014/01/how-to-create-a-simple-hadoop-cluster-with-virtualbox/
nameNode.cdh-cluster.com   192.168.1.100   8 GB RAM     40 GB disk space
dn-01.cdh-cluster.com      192.168.1.101   2-4 GB RAM   40 GB disk space
dn-02.cdh-cluster.com      192.168.1.102   2-4 GB RAM   40 GB disk space
dn-03.cdh-cluster.com      192.168.1.103   2-4 GB RAM   40 GB disk space
Hostnames: nameNode, dn-01, dn-02, dn-03
domain name: cdh-cluster.com
host machine: Windows 8
guest machine/virtual machine: ubuntu-12.04.4-server
Download the latest Oracle VirtualBox for Windows:
https://www.virtualbox.org/wiki/Downloads
Install VirtualBox.
Download ubuntu-12.04.4-server-amd64.iso from the Ubuntu website:
http://releases.ubuntu.com/precise/ubuntu-12.04.4-server-amd64.iso
Creation of Virtual Machine:
Install Oracle VirtualBox if not already installed.
After the VirtualBox installation,
click the "New" button in the top left corner.
1. Name and Operating System
Name: nameNode
Type: Linux
Version: Ubuntu (64 bit)
2. Memory Size
Set memory to 8192 MB (8 GB)
3. Hard drive
Choose Create a virtual hard drive now -- default
4. Hard drive file type
Choose VDI (Virtual Disk Image) -- default
5. Storage on physical hard drive
Choose Dynamically allocated -- default
6. File location and size
Choose a location where you have enough disk space.
Set the hard disk size to 40 GB
Now the virtual machine is created.
Ubuntu Installation on the Virtual Machine:
Keep the newly created VM highlighted.
1. Network Configuration
Click the "Settings" button next to the "New" button
Select Network and then the Adapter 1 tab
Check Enable Network Adapter (by default it will be checked and attached to NAT)
Attached to: Bridged Adapter
Name: choose your working internet adapter (Wi-Fi or Ethernet)
2. Boot Configuration
Select Settings -> Storage
Click on the Empty disk icon
Click on the disk icon at the right end corner
Choose ubuntu-12.04.4-server-amd64.iso from your local disk
Check the Live CD/DVD checkbox
All the configuration for installation is done.
Now click the "Start" button next to the "Settings" button
Installation Begins...
Events during the installation process:
Keyboard detection: press No
Hostname: give any name and press Continue (the hostname will be changed later)
Full Name for the new user: Give any name and press continue
Username for your account: hduser
Choose a password for the new user: <your password>
Re-enter password to verify: <your password>
Encrypt your home directory? No
Partitioning method: Guided - use entire disk
Select disk to partition: accept the default
Write the changes to disk? Yes
HTTP proxy information (blank for none): Keep it blank and press continue
How do you want to manage upgrades on this system?
Select Install security updates automatically
Choose software to install: select OpenSSH server
Install the GRUB boot loader to the master boot record? Yes
Installation complete: Press continue
The system will reboot.
Log in with the user configured during installation.
Your current working directory will be your home directory.
Hostname, Domain, and IP Address Configuration
1. Configure hostname
sudo vi /etc/hostname
Overwrite the content with nameNode
2. Configure IP address and domain name
sudo vi /etc/network/interfaces and add the following lines:
auto eth0
iface eth0 inet static
address 192.168.1.100
netmask 255.255.255.0
network 192.168.1.0
broadcast 192.168.1.255
gateway 192.168.1.1 # This is my gateway IP, i.e., the modem IP
dns-nameservers <dns ip provided by your serivce provider>
dns-domain cdh-cluster.com # You can give your own domain name
dns-search cdh-cluster.com
Restart network
sudo /etc/init.d/networking restart
Try pinging google.com to check whether this VM is connected to the network. This should work.
Installing the Linux Guest Additions
sudo apt-get update
sudo apt-get install dkms
My VirtualBox version is 4.3.10. Download the Guest Additions matching your version:
wget http://download.virtualbox.org/virtualbox/4.3.10/VBoxGuestAdditions_4.3.10.iso
sudo mount -o loop VBoxGuestAdditions_4.3.10.iso /mnt/
cd /mnt
sudo ./VBoxLinuxAdditions.run
3. For resolving FQDNs to IP addresses
sudo vi /etc/hosts
add the following lines
192.168.1.100 nameNode.cdh-cluster.com nameNode
192.168.1.101 dn-01.cdh-cluster.com dn-01
192.168.1.102 dn-02.cdh-cluster.com dn-02
192.168.1.103 dn-03.cdh-cluster.com dn-03
Comment out the line containing 127.0.1.1
Setup SSH
sudo apt-get install openssh-client
ssh-keygen (run as hduser; press Enter at each prompt to accept the defaults)
cd ~/.ssh
cp id_rsa.pub authorized_keys
sudo vi /etc/ssh/ssh_config and set:
StrictHostKeyChecking no
Shut down the virtual machine
sudo init 0
Cloning the Virtual Machine
In VirtualBox, right-click on the virtual machine and select Clone
New Machine Name: provide a name
Clone Type: choose Linked clone
Do this 2 more times so that there are 4 virtual machines
Now change the IP address and hostname of these newly cloned virtual machines
/etc/hostname
dn-0[n], where n = 1, 2, 3
/etc/network/interfaces
address 192.168.1.[n] # where n = 101, 102, 103
Restart the network
sudo /etc/init.d/networking restart
Now the 4 virtual machines are ready for the Hadoop cluster.
Download Cloudera Manager on nameNode (you could use a separate machine to install Cloudera Manager; I am doing it on nameNode)
$> curl -O http://archive.cloudera.com/cm4/installer/latest/cloudera-manager-installer.bin
$> chmod +x cloudera-manager-installer.bin
$> ./cloudera-manager-installer.bin
After installation, use a web browser on your host machine or any other machine on the same network and connect to http://namenode.cdh-cluster.com:7180 or http://<ip-address>:7180
To continue the installation, you will have to select the Cloudera free license version. You will then enter the IP addresses of the nodes that will be used in the cluster: just enter all the nodes' IP addresses separated by spaces and click the "Search" button. You can then use the configured user's (hduser) password (or the SSH keys you have generated) to automate the connectivity to the different nodes.
In Ubuntu, root privileges are required for any package installation.
sudo vi /etc/sudoers and add the following line:
hduser ALL=(ALL) NOPASSWD: ALL # Here hduser is the user configured on my machine
Once this is done, you will select additional service components; just accept the defaults to select everything. The installation will continue and complete.
You can also customize and add service components one at a time.
By default, PostgreSQL is used for the Hive metastore and Derby DB for Oozie.
Posted by pramod narayana at 10:17 AM
Friday, September 26, 2014
Analyse StackOverflow Dataset using Pig with Filter UDF
-- Problem: Given a list of users' comments, filter out the majority of comments that do not contain a particular keyword
-- Reference: MapReduce Design Patterns, Pig Programming, http://pig.apache.org/docs/r0.11.1/index.html
-- Input: StackOverflow comments dataset. A small subset of the comments dataset is used.
register '/opt/cloudera/parcels/CDH/lib/pig/piggybank.jar';
register '/home/hduser/Documents/mytutorial/Pig/SOFilter.jar';
socomments = LOAD 'hive/stackoverflow/comments/comments' USING org.apache.pig.piggybank.storage.XMLLoader('row') AS (record:chararray);
-- tuples are filtered using the custom Filter UDF
filtered_comments = FILTER socomments BY SOFilter(record);
STORE filtered_comments INTO 'hive/stackoverflow/comments/comments/output';
Filter UDF
Keywords are hardcoded here for simplicity; they could instead be read from a file through the distributed cache.
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

import org.apache.pig.FilterFunc;
import org.apache.pig.data.Tuple;

public class SOFilter extends FilterFunc {
    @Override
    public Boolean exec(Tuple input) throws IOException {
        try {
            String xmlRecord = (String) input.get(0);
            if (null == xmlRecord) {
                return false;
            }
            // Strip the leading "<row " and trailing "/>", then split the remaining
            // attribute string on double quotes, yielding alternating names and values.
            Map<String, String> parsed = new HashMap<String, String>();
            String[] tokens = xmlRecord.trim().substring(5, xmlRecord.trim().length() - 3).split("\"");
            for (int i = 0; i < tokens.length - 1; i += 2) {
                String key = tokens[i].trim();
                String val = tokens[i + 1];
                // Drop the trailing '=' from the attribute name.
                parsed.put(key.substring(0, key.length() - 1), val);
            }
            String text = parsed.get("Text");
            if (null != text) {
                text = text.toLowerCase();
                // Keep the comment only if it mentions one of the hardcoded keywords.
                if (Pattern.compile("\\bhadoop\\b|\\bhbase\\b|\\bhive\\b|\\bmapreduce\\b|\\boozie\\b|\\bsqoop\\b|\\bflume\\b").matcher(text).find()) {
                    return true;
                }
            }
        } catch (Exception e) {
            System.err.println("Failed to process input; error - " + e.getMessage());
        }
        return false;
    }
}
Posted by pramod narayana at 7:29 AM
Friday, September 19, 2014
Analyse StackOverflow Dataset using Pig with Eval UDF
-- Problem: Given a list of users' comments, determine the first and last time a user commented and the total number of comments from that user
-- Reference: MapReduce Design Patterns, Pig Programming, http://pig.apache.org/docs/r0.11.1/index.html
-- Input: StackOverflow comments dataset. A small subset of the comments dataset is used.
register '/opt/cloudera/parcels/CDH/lib/pig/piggybank.jar';
register '/home/hduser/Documents/mytutorial/Pig/XMLMap.jar';
-- Here the comments dataset is loaded using XMLLoader. However, TextLoader could also be used since there is one XML element per line.
comments = LOAD 'hive/stackoverflow/comments/comments' USING org.apache.pig.piggybank.storage.XMLLoader('row') AS (record:chararray);
-- Using a custom UDF, parse the XML tuple and return a map.
commentsMap = FOREACH comments GENERATE XMLMap(record);
-- From the map, extract UserId and CreationDate; convert CreationDate from a string to a datetime object.
A = foreach commentsMap generate $0#'UserId' as userId:chararray, ToDate($0#'CreationDate', 'yyyy-MM-dd\'T\'HH:mm:ss.SSS') as date;
-- Group the tuples by userId. All comments of a user are grouped together.
B = GROUP A BY userId;
-- For each group, count the number of comments and calculate the min (first comment) and max (last comment) creation dates.
C = FOREACH B GENERATE group as userId, COUNT(A) as totalcomments, MIN(A.date) as first, MAX(A.date) as last;
STORE C into 'hive/stackoverflow/comments/output';
// Custom UDF XMLMap. The input is a tuple containing the XML element. This function converts the tuple into a map of attribute names to values and returns that map.
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class XMLMap extends EvalFunc<Map<String, String>> {
    @Override
    public Map<String, String> exec(Tuple input) throws IOException {
        try {
            String xmlRecord = (String) input.get(0);
            // Strip the leading "<row " and trailing "/>", then split the remaining
            // attribute string on double quotes, yielding alternating names and values.
            Map<String, String> map = new HashMap<String, String>();
            String[] tokens = xmlRecord.trim().substring(5, xmlRecord.trim().length() - 3).split("\"");
            for (int i = 0; i < tokens.length - 1; i += 2) {
                String key = tokens[i].trim();
                String val = tokens[i + 1];
                // Drop the trailing '=' from the attribute name.
                map.put(key.substring(0, key.length() - 1), val);
            }
            return map;
        } catch (Exception e) {
            System.err.println("Failed to process input; error - " + e.getMessage());
            return null;
        }
    }
}
Posted by pramod narayana at 6:27 AM
Friday, September 12, 2014
Reduce Side Join in MapReduce.
Reduce Side Join
As the name indicates, the join is done in the reducer. The join is done on 2 or more large datasets.
It supports inner join, left outer join, right outer join, full outer join, and anti-join.
High network bandwidth is required because a huge amount of data is sent to the reduce phase.
All the datasets are very large.
DataSetA (UserDataSet) contains user details
DataSetB (CommentDataSet) contains comment details
Here a simplified version of the dataset is used.
Here the join is done on userId.
2 Mapper classes are defined:
UserMapper and CommentMapper
UserMapper outputs userId and Name as key and value respectively from UserDataSet. Here 'A' is prefixed to the value.
CommentMapper outputs userId and Comment as key and value respectively from CommentDataSet. Here 'B' is prefixed to the value.
'A' and 'B' are prefixed so that the reducer knows, for a given key, which value came from which dataset.
The reducer receives values from both datasets for a given key.
The reducer builds listA for values from DataSetA and listB for values from DataSetB.
Using listA and listB, inner join, left outer join, and right outer join can be done.
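The sketch below illustrates this approach (a minimal illustration, not the exact code from the post). It assumes tab-separated text input in which the user ID is the first field of both datasets, the name is the second field of a user record, and the comment text is the second field of a comment record; the class names and field layout are assumptions. Only the inner join is shown.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ReduceSideJoin {

    // UserMapper: emits (userId, "A" + name) for every user record.
    public static class UserMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t");
            if (fields.length >= 2) {
                context.write(new Text(fields[0]), new Text("A" + fields[1]));
            }
        }
    }

    // CommentMapper: emits (userId, "B" + comment) for every comment record.
    public static class CommentMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t");
            if (fields.length >= 2) {
                context.write(new Text(fields[0]), new Text("B" + fields[1]));
            }
        }
    }

    // JoinReducer: for each userId, separates the values back into listA (users)
    // and listB (comments) using the prefix, then performs an inner join.
    public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text userId, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            List<String> listA = new ArrayList<String>();
            List<String> listB = new ArrayList<String>();
            for (Text value : values) {
                String v = value.toString();
                if (v.startsWith("A")) {
                    listA.add(v.substring(1));
                } else {
                    listB.add(v.substring(1));
                }
            }
            // Inner join: emit a record only when the userId appears in both datasets.
            for (String name : listA) {
                for (String comment : listB) {
                    context.write(userId, new Text(name + "\t" + comment));
                }
            }
        }
    }
}

In the driver, each mapper is wired to its own input path, for example with MultipleInputs.addInputPath(job, usersPath, TextInputFormat.class, UserMapper.class). A left or right outer join changes only the reducer's emit logic, e.g., emitting the user with an empty comment when listB is empty.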
Posted by pramod narayana at 9:19 AM
Tuesday, September 9, 2014
Bloom Filter
A Bloom filter is a data structure used to check whether an object exists in a set. The result of the check operation is either a definitive no (the object does not exist for sure) or a maybe (the object may exist in the set, i.e., there is a probability that the object exists).
There are 2 operations on a Bloom filter:
Adding objects to the Bloom filter (building the Bloom filter)
Checking objects against the Bloom filter
The following parameters are used in a Bloom filter:
m: Number of bits
n: Number of objects that will be added to the set
p: Desired false positive rate (e.g., 1 in 1,000,000 checks could be a false positive)
k: Number of different hash functions
For both Adding and Checking operations, a bit array is used with all the bits initially set to zero.
Adding an object involves hashing the object multiple times. A modulo against the bit array size (mod m) is taken of each hash result, giving k integers with values between 0 and m-1. These integers are used as indexes into the bit array, and the bit at each index is set to 1. Two or more objects may set the same bit.
Checking an object involves the same hashing and modulo. All of the object's bit positions are looked up in the bit array: if every one of those bits is set, the object may exist, since there is a chance those bits were set by other objects; if any bit is unset, the object definitely does not exist.
Consider a Bloom filter bit array (size 16, i.e., m = 16). Initially all bits are set to zero:
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Consider 4 objects A, B, C and D and 2 hash functions.
Adding objects A and B to the Bloom filter: since there are 2 hash functions, each object is hashed twice, and a modulo of m is taken of every hash result.
const c1 = hash1(Object A);
int x = mod_m(c1); // x is between 0 and 15, i.e., mod 16
const c2 = hash2(Object A);
int y = mod_m(c2); // y is between 0 and 15, i.e., mod 16
Suppose x and y are 0 and 15 respectively.
Bloom filter bit array after adding object A (x = 0, y = 15):
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
Bloom filter bit array after adding object B (x = 2, y = 3):
1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1
Checking objects C and D in the Bloom filter:
After doing the hash and modulo, let's suppose x = 0, y = 2 for C, and x = 7, y = 10 for D.
For object C the result is true, but C was actually never added; bits 0 and 2 were set by objects A and B. This is called a false positive.
For object D the result is false (bits 7 and 10 are not set), so it is certain that object D does not exist.
A Bloom filter uses very little storage: the objects themselves need not be stored, only a few bits per object.
Once the bit array is created its size cannot be changed; it can only be recreated.
Adding more objects to the Bloom filter will not increase the size of the bit array, but it will set more bits to 1, which increases the false positive rate.
Bloom filters are used in HBase to filter out HFile blocks that do not contain the key.
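To make the adding and checking operations concrete, below is a minimal, self-contained Java sketch (not the Hadoop or HBase implementation). The sizing formulas m = -n*ln(p)/(ln 2)^2 and k = (m/n)*ln 2 are the standard ones; the double-hashing scheme used to derive the k hash functions and the class name are illustrative assumptions.

import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.BitSet;

public class SimpleBloomFilter {
    private final BitSet bits; // the bit array of size m
    private final int m;       // number of bits
    private final int k;       // number of hash functions

    // n: expected number of objects, p: desired false positive rate.
    public SimpleBloomFilter(int n, double p) {
        this.m = (int) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
        this.k = Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
        this.bits = new BitSet(m);
    }

    // Derive the i-th hash as h1 + i*h2 (double hashing), then take it modulo m.
    private int index(byte[] data, int i) {
        int h1 = Arrays.hashCode(data);
        int h2 = new String(data, StandardCharsets.UTF_8).hashCode();
        return Math.floorMod(h1 + i * h2, m);
    }

    // Adding: set the k bits the object hashes to.
    public void add(String object) {
        byte[] data = object.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < k; i++) {
            bits.set(index(data, i));
        }
    }

    // Checking: true means the object may exist (possible false positive);
    // false means the object definitely does not exist.
    public boolean mightContain(String object) {
        byte[] data = object.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < k; i++) {
            if (!bits.get(index(data, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        SimpleBloomFilter filter = new SimpleBloomFilter(1000, 0.000001);
        filter.add("A");
        filter.add("B");
        System.out.println(filter.mightContain("B")); // true: B was added
        System.out.println(filter.mightContain("D")); // false (or, rarely, a false positive)
    }
}

A production implementation would use better-distributed hash functions, but the add and check methods are exactly the two operations described above.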
Posted by pramod narayana at 2:46 AM
Monday, September 8, 2014
HADOOP FAQ
Why is the block size 64/128 MB?
The NameNode holds the filesystem (HDFS) metadata in memory, so the number of files in HDFS is limited by the amount of memory on the NameNode.
For each file, directory, and block, around 150 bytes of metadata is required. This metadata includes the filename, block ID, replication factor, location of each block, etc.
Suppose the block size is 4 KB. A 1 GB file will result in about 250,000 blocks; with a replication factor of 3, that is about 750,000 block replicas, and for each block the metadata has to be stored in the NameNode's memory. But with a 128 MB block size there will be only 8 blocks, and 24 with replication. So a smaller block size consumes far more NameNode memory than a larger one.
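The arithmetic above can be verified with a small throwaway calculation (a sketch; the 150-byte figure is the rough per-object estimate quoted above, not an exact HDFS constant):

public class BlockMath {
    public static void main(String[] args) {
        long fileSize = 1L << 30;       // 1 GB
        long smallBlock = 4L << 10;     // 4 KB block size
        long largeBlock = 128L << 20;   // 128 MB block size
        int replication = 3;
        long bytesPerBlock = 150;       // rough NameNode metadata per block

        long smallBlocks = fileSize / smallBlock; // 262,144 (~250,000)
        long largeBlocks = fileSize / largeBlock; // 8

        System.out.println("4 KB blocks:   " + smallBlocks
                + " (" + smallBlocks * replication + " with replication)");
        System.out.println("128 MB blocks: " + largeBlocks
                + " (" + largeBlocks * replication + " with replication)");
        // The smaller block size tracks roughly 32,000 times as many blocks,
        // and therefore costs roughly 32,000 times as much NameNode metadata.
        System.out.println("Approx. metadata: " + smallBlocks * bytesPerBlock / 1024
                + " KB vs " + largeBlocks * bytesPerBlock + " bytes");
    }
}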
Why millions of large files rather than billions of small files?
The NameNode holds the filesystem (HDFS) metadata in memory, so the number of files in HDFS is limited by the amount of memory on the NameNode.
For each file, directory, and block, around 150 bytes of metadata is required. This metadata includes the filename, block ID, replication factor, location of each block, etc. Consider 1000 files, each of size 1 MB. Each file requires a single block of size 1 MB, so in total there are 1000 blocks, and with a replication factor of 3 there are 3000. Now consider 1 file of size 1 GB: with a 128 MB block size it will take 8 blocks, and 24 after replication. So 1 large file's metadata takes far less NameNode memory than 1000 small files'.
From a MapReduce perspective, if there are a lot of small files, the cost of starting the worker processes can be disproportionately high compared to the amount of data they process.