Running Hadoop on Ubuntu Linux (Single Node Cluster):
1. Prerequisites
   Java
   User for hadoop
   SSH
   IPv6
1.1 Setup and Check Java :
# Install the OpenJDK 7 JDK
$ sudo apt-get install openjdk-7-jdk
# The full JDK will be placed in /usr/lib/jvm/java-7-openjdk-i386
# After installation, check whether Java is installed correctly
$ java -version
1.2. Add and Check a dedicated Hadoop System User :
# Add a group named hadoop
$ sudo addgroup hadoop
# Add user hduser in group hadoop
$ sudo adduser --ingroup hadoop hduser
# Switch to the newly added user
$ su - hduser
1.3. Set up SSH :
# Hadoop requires SSH access to manage its nodes.
# For our single-node setup we therefore need to configure SSH access to localhost for the hduser user created in the previous section.
# If SSH is not running or not configured to allow public key authentication, follow the steps in http://ubuntuguide.org/wiki/Ubuntu_Quantal_Remote_Access#Remote_Access
# Switch user to hduser
user@ubuntu:~$ su - hduser
# Generate an SSH key for hduser with an empty passphrase
hduser@ubuntu:~$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
Created directory '/home/hduser/.ssh'.
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
The key fingerprint is:
9b:82:ea:58:b4:e0:35:d7:ff:19:66:a6:ef:ae:0e:d2 hduser@ubuntu
The key's randomart image is:
[...snipp...]
hduser@ubuntu:~$
# Change directory to .ssh
$ cd ~/.ssh
# Enable SSH access to your local machine with the newly created key
hduser@ubuntu:~$ cat id_rsa.pub >> authorized_keys
# Install the SSH server if it is not already present
$ sudo apt-get install openssh-server
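sshd is picky about file permissions: if key-based login to localhost still prompts for a password, tightening the .ssh directory often fixes it. A minimal sketch, assuming the default paths created by ssh-keygen above:

```shell
# Restrict the key directory and the authorized_keys file; many sshd
# configurations refuse keys whose files are group- or world-readable.
SSH_DIR="$HOME/.ssh"
mkdir -p "$SSH_DIR"
touch "$SSH_DIR/authorized_keys"
chmod 700 "$SSH_DIR"
chmod 600 "$SSH_DIR/authorized_keys"
echo "permissions set on $SSH_DIR"
```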
# If during this step you get the error "hduser is not in the sudoers file. This incident will be reported.", do the following:
$ cd /etc
$ sudo gedit sudoers
# In that file, add the following line below "# User privilege specification"
hduser ALL=(ALL) ALL
# Save the file and exit.
# The final step is to test the SSH setup by connecting to your local machine as the hduser user.
# This step is also needed to save your local machine's host key fingerprint to the hduser user's known_hosts file.
# If you have any special SSH configuration for your local machine, such as a non-standard SSH port, you can define host-specific SSH options in $HOME/.ssh/config (see man ssh_config for more information).
hduser@ubuntu:~$ ssh localhost
The authenticity of host 'localhost (::1)' can't be established.
RSA key fingerprint is d7:87:25:47:ae:02:00:eb:1d:75:4f:bb:44:f9:36:26.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
Linux ubuntu 2.6.32-22-generic #33-Ubuntu SMP Wed Apr 28 13:27:30 UTC 2010 i686 GNU/Linux
Ubuntu 10.04 LTS
[...snipp...]
hduser@ubuntu:~$
# If the SSH connection fails, these general tips might help:
# Enable debugging with ssh -vvv localhost and investigate the error in detail.
# Check the SSH server configuration in /etc/ssh/sshd_config, in particular the options PubkeyAuthentication (which should be set to yes) and AllowUsers (if this option is active, add the hduser user to it). If you made any changes to the SSH server configuration file, you can force a configuration reload with sudo /etc/init.d/ssh reload.
1.4. Disabling IPv6 :
# You can disable IPv6 only for Hadoop by adding the following line to conf/hadoop-env.sh
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true
# Or disable IPv6 system-wide:
# Open the sysctl.conf file in the pico editor
$ sudo pico /etc/sysctl.conf
# Add the following lines at the end of the file
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
# Reboot the system
$ sudo reboot
# Check whether IPv6 is disabled
$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6
# 0 = enabled; 1 = disabled
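The check above can be wrapped in a small sketch that prints a human-readable status (it reads the same /proc flag, which exists only on Linux):

```shell
# "1" in this kernel flag means IPv6 is disabled for all interfaces.
FLAG_FILE=/proc/sys/net/ipv6/conf/all/disable_ipv6
if [ "$(cat "$FLAG_FILE" 2>/dev/null)" = "1" ]; then
  echo "IPv6 is disabled"
else
  echo "IPv6 is still enabled (or the flag file is missing)"
fi
```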
2. Hadoop
   Installation
   Configuration
   Start & Stop
   Run MapReduce
2.1. Installation :
# Download Hadoop from the Apache Download Mirrors
# Change to the target directory
$ cd /usr/local
# Extract the Hadoop tar file to /usr/local
$ sudo tar xzf hadoop-1.0.3.tar.gz
# Rename the extracted directory to hadoop
$ sudo mv hadoop-1.0.3 hadoop
# Change the ownership to hduser
$ sudo chown -R hduser:hadoop hadoop
2.2.1. Configure ~/.bashrc :
# Switch to hduser
$ su - hduser
# Open .bashrc in the pico editor
$ pico ~/.bashrc
# Add the following to the end of ~/.bashrc
# Set Hadoop-related environment variables
export HADOOP_HOME=/usr/local/hadoop
# Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386
# Some convenient aliases and functions for running Hadoop-related commands
unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"
# If you have LZO compression enabled in your Hadoop cluster and
# compress job outputs with LZOP (not covered in this tutorial):
# Conveniently inspect an LZOP compressed file from the command
# line; run via:
#
# $ lzohead /hdfs/path/to/lzop/compressed/file.lzo
#
# Requires installed 'lzop' command.
#
lzohead () {
    hadoop fs -cat "$1" | lzop -dc | head -1000 | less
}
# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin
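After reloading ~/.bashrc you can confirm that the hadoop binary is reachable. A minimal sketch, assuming Hadoop was unpacked to /usr/local/hadoop as above:

```shell
# Mirror the two exports added to ~/.bashrc above.
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
# Print the Hadoop version if the binary is on the PATH, otherwise say so.
if command -v hadoop >/dev/null 2>&1; then
  hadoop version
else
  echo "hadoop not found on PATH - check HADOOP_HOME"
fi
```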
2.2.2. Setting environment variable for JAVA_HOME :
# Find the path where Java is installed
$ whereis java
$ ls -l /usr/bin/java
$ ls -l /etc/alternatives/java
# Open the hadoop-env.sh file to set the environment variable JAVA_HOME
$ pico /usr/local/hadoop/conf/hadoop-env.sh
Change
#export JAVA_HOME=/usr/lib/j2sdk1.5-sun
To
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386
2.2.3. Create directory and set ownership and permissions :
# Make the directory and change its ownership to hduser
$ sudo mkdir -p /app/hadoop/tmp
$ sudo chown hduser:hadoop /app/hadoop/tmp
# ...and if you want to tighten up security, chmod from 755 to 750...
$ sudo chmod 750 /app/hadoop/tmp
2.2.4. Configuration - conf/*-site.xml :
# Add the following snippets between the <configuration> ... </configuration> tags in the conf/core-site.xml file.
<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system. A URI whose
  scheme and authority determine the FileSystem implementation. The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class. The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>
# In file conf/mapred-site.xml:
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at. If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>
# In file conf/hdfs-site.xml:
<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
</property>
2.2.5. Formatting HDFS via Name Node :
# To format the filesystem (which simply initializes the directory specified by the dfs.name.dir variable), run the command
hduser@ubuntu:~$ /usr/local/hadoop/bin/hadoop namenode -format
The output will look like this:
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop namenode -format
10/05/08 16:59:56 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = ubuntu/127.0.1.1
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 0.20.2
STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
************************************************************/
10/05/08 16:59:56 INFO namenode.FSNamesystem: fsOwner=hduser,hadoop
10/05/08 16:59:56 INFO namenode.FSNamesystem: supergroup=supergroup
10/05/08 16:59:56 INFO namenode.FSNamesystem: isPermissionEnabled=true
10/05/08 16:59:56 INFO common.Storage: Image file of size 96 saved in 0 seconds.
10/05/08 16:59:57 INFO common.Storage: Storage directory .../hadoop-hduser/dfs/name has been successfully formatted.
10/05/08 16:59:57 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at ubuntu/127.0.1.1
************************************************************/
hduser@ubuntu:/usr/local/hadoop$
2.3.1. Starting Single Node Cluster :
# To start the Namenode, Datanode, Jobtracker and a Tasktracker on your machine, run this command
hduser@ubuntu:~$ /usr/local/hadoop/bin/start-all.sh
The output will look like this:
hduser@ubuntu:/usr/local/hadoop$ bin/start-all.sh
starting namenode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-namenode-ubuntu.out
localhost: starting datanode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-datanode-ubuntu.out
localhost: starting secondarynamenode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-secondarynamenode-ubuntu.out
starting jobtracker, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-jobtracker-ubuntu.out
localhost: starting tasktracker, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-tasktracker-ubuntu.out
hduser@ubuntu:/usr/local/hadoop$
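A quick way to check that all five daemons came up is jps, which ships with the JDK and lists the Java processes of the current user; after a successful start you should see NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker. A small sketch with a fallback for machines without a JDK on the PATH:

```shell
# jps lists running JVMs by name; fall back to a message if it is missing.
if command -v jps >/dev/null 2>&1; then
  jps
else
  echo "jps not found - make sure the JDK bin directory is on the PATH"
fi
```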
2.3.2. Stopping Single Node Cluster :
hduser@ubuntu:~$ /usr/local/hadoop/bin/stop-all.sh
Output:
hduser@ubuntu:/usr/local/hadoop$ bin/stop-all.sh
stopping jobtracker
localhost: stopping tasktracker
stopping namenode
localhost: stopping datanode
localhost: stopping secondarynamenode
hduser@ubuntu:/usr/local/hadoop$
2.4 Running MapReduce Job :
# Copy a local test file into HDFS (run from /usr/local/hadoop/bin)
$ ./hadoop dfs -put /path-of-os-filesystem/test /test
$ ./hadoop dfs -ls /test
# Run the wordcount example: input directory /test, output directory /test/out
$ ./hadoop jar ../hadoop-examples-1.0.3.jar wordcount /test /test/out
$ ./hadoop dfs -lsr /test/out
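To inspect the job result, the part files under the output directory can be printed or copied back to the local filesystem. A sketch, assuming the /test/out path used above, that hadoop is on the PATH, and that wordcount wrote its reducer output to a part-r-00000 file (the usual name for the new-API examples):

```shell
# Print the wordcount output and copy the whole directory to the local
# filesystem; fall back to a message when no Hadoop installation is reachable.
if command -v hadoop >/dev/null 2>&1; then
  hadoop dfs -cat /test/out/part-r-00000
  hadoop dfs -get /test/out /tmp/wordcount-out
else
  echo "hadoop not found on PATH - start the cluster first"
fi
```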