Running Hadoop on Ubuntu Linux (Single Node Cluster):
1. Prerequisites
   Java
   User for hadoop
   SSH
   IPv6
1.1 Setup and Check Java :
# Install the OpenJDK 7 JDK
$ sudo apt-get install openjdk-7-jdk
# The full JDK will be placed in /usr/lib/jvm/java-7-openjdk-i386
# After installation, check whether Java is installed correctly
$ java -version
1.2. Add and Check a dedicated Hadoop System User :
# Add a group named hadoop
$ sudo addgroup hadoop
# Add user hduser in group hadoop
$ sudo adduser --ingroup hadoop hduser
# Switch to the newly added user
$ su - hduser
1.3. Set up SSH :
# Hadoop requires SSH access to manage its nodes.
# For our single-node setup we therefore need to configure SSH access to localhost for the hduser user created in the previous section.
# If SSH is not running or not configured to allow public key authentication, follow the steps in http://ubuntuguide.org/wiki/Ubuntu_Quantal_Remote_Access#Remote_Access
# Switch user to hduser
user@ubuntu:~$ su - hduser
# Generate an SSH key for hduser with an empty passphrase
hduser@ubuntu:~$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
Created directory '/home/hduser/.ssh'.
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
The key fingerprint is:
9b:82:ea:58:b4:e0:35:d7:ff:19:66:a6:ef:ae:0e:d2 hduser@ubuntu
The key's randomart image is:
[...snipp...]
hduser@ubuntu:~$
# Change directory to .ssh
$ cd ~/.ssh
# Enable SSH access to your local machine with the newly created key
hduser@ubuntu:~$ cat id_rsa.pub >> authorized_keys
# Install the SSH server if it is not already present
$ sudo apt-get install openssh-server
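sshd is picky about file permissions: if key-based login to localhost still prompts for a password, tightening the .ssh directory often fixes it. A minimal sketch, assuming the default paths created by ssh-keygen above:

```shell
# Restrict the key directory and the authorized_keys file; many sshd
# configurations refuse keys whose files are group- or world-readable.
SSH_DIR="$HOME/.ssh"
mkdir -p "$SSH_DIR"
touch "$SSH_DIR/authorized_keys"
chmod 700 "$SSH_DIR"
chmod 600 "$SSH_DIR/authorized_keys"
echo "permissions set on $SSH_DIR"
```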
# If during this step you get the error "hduser is not in the sudoers file. This incident will be reported.", do the following:
$ cd /etc
$ sudo gedit sudoers
# In that file, add the following line below "# User privilege specification"
hduser ALL=(ALL) ALL
# Save the file and exit.
# The final step is to test the SSH setup by connecting to your local machine as the hduser user.
# This step is also needed to save your local machine's host key fingerprint to the hduser user's known_hosts file.
# If you have any special SSH configuration for your local machine, such as a non-standard SSH port, you can define host-specific SSH options in $HOME/.ssh/config (see man ssh_config for more information).
hduser@ubuntu:~$ ssh localhost
The authenticity of host 'localhost (::1)' can't be established.
RSA key fingerprint is d7:87:25:47:ae:02:00:eb:1d:75:4f:bb:44:f9:36:26.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
Linux ubuntu 2.6.32-22-generic #33-Ubuntu SMP Wed Apr 28 13:27:30 UTC 2010 i686 GNU/Linux
Ubuntu 10.04 LTS
[...snipp...]
hduser@ubuntu:~$
# If the SSH connection fails, these general tips might help:
# Enable debugging with ssh -vvv localhost and investigate the error in detail.
# Check the SSH server configuration in /etc/ssh/sshd_config, in particular the options PubkeyAuthentication (which should be set to yes) and AllowUsers (if this option is active, add the hduser user to it). If you made any changes to the SSH server configuration file, you can force a configuration reload with sudo /etc/init.d/ssh reload.
1.4. Disabling IPv6 :
# You can disable IPv6 only for Hadoop by adding the following line to conf/hadoop-env.sh
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true
# Or disable IPv6 system-wide:
# Open the sysctl.conf file in the pico editor
$ sudo pico /etc/sysctl.conf
# Add the following lines at the end of the file
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
# Reboot the system
$ sudo reboot
# Check whether IPv6 is disabled
$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6
# 0 = enabled; 1 = disabled
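The check above can be wrapped in a small sketch that prints a human-readable status (it reads the same /proc flag, which exists only on Linux):

```shell
# "1" in this kernel flag means IPv6 is disabled for all interfaces.
FLAG_FILE=/proc/sys/net/ipv6/conf/all/disable_ipv6
if [ "$(cat "$FLAG_FILE" 2>/dev/null)" = "1" ]; then
  echo "IPv6 is disabled"
else
  echo "IPv6 is still enabled (or the flag file is missing)"
fi
```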
2. Hadoop
   Installation
   Configuration
   Start & Stop
   Run MapReduce
2.1. Installation :
# Download Hadoop from the Apache Download Mirrors
# Change to the target directory
$ cd /usr/local
# Extract the Hadoop tar file to /usr/local
$ sudo tar xzf hadoop-1.0.3.tar.gz
# Rename the extracted directory to hadoop
$ sudo mv hadoop-1.0.3 hadoop
# Change the ownership to hduser
$ sudo chown -R hduser:hadoop hadoop
2.2.1. Configure ~/.bashrc :
# Switch to hduser
$ su - hduser
# Open .bashrc in the pico editor
$ pico ~/.bashrc
# Add the following to the end of ~/.bashrc
# Set Hadoop-related environment variables
export HADOOP_HOME=/usr/local/hadoop
# Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386
# Some convenient aliases and functions for running Hadoop-related commands
unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"
# If you have LZO compression enabled in your Hadoop cluster and
# compress job outputs with LZOP (not covered in this tutorial):
# Conveniently inspect an LZOP compressed file from the command
# line; run via:
#
# $ lzohead /hdfs/path/to/lzop/compressed/file.lzo
#
# Requires installed 'lzop' command.
#
lzohead () {
    hadoop fs -cat "$1" | lzop -dc | head -1000 | less
}
# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin
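After reloading ~/.bashrc you can confirm that the hadoop binary is reachable. A minimal sketch, assuming Hadoop was unpacked to /usr/local/hadoop as above:

```shell
# Mirror the two exports added to ~/.bashrc above.
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
# Print the Hadoop version if the binary is on the PATH, otherwise say so.
if command -v hadoop >/dev/null 2>&1; then
  hadoop version
else
  echo "hadoop not found on PATH - check HADOOP_HOME"
fi
```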
2.2.2. Setting environment variable for JAVA_HOME :
# Find the path where Java is installed
$ whereis java
$ ls -l /usr/bin/java
$ ls -l /etc/alternatives/java
# Open the hadoop-env.sh file to set the environment variable JAVA_HOME
$ pico /usr/local/hadoop/conf/hadoop-env.sh
Change
#export JAVA_HOME=/usr/lib/j2sdk1.5-sun
To
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386
2.2.3. Create directory and set ownership and permissions :
# Make the directory and change its ownership to hduser
$ sudo mkdir -p /app/hadoop/tmp
$ sudo chown hduser:hadoop /app/hadoop/tmp
# ...and if you want to tighten up security, chmod from 755 to 750...
$ sudo chmod 750 /app/hadoop/tmp
2.2.4. Configuration - conf/*-site.xml :
# Add the following snippets between the <configuration> ... </configuration> tags in the conf/core-site.xml file.
<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system. A URI whose
  scheme and authority determine the FileSystem implementation. The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class. The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>
# In file conf/mapred-site.xml:
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at. If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>
# In file conf/hdfs-site.xml:
<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
</property>
2.2.5. Formatting HDFS via Name Node :
# To format the filesystem (which simply initializes the directory specified by the dfs.name.dir variable), run the command
hduser@ubuntu:~$ /usr/local/hadoop/bin/hadoop namenode -format
The output will look like this:
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop namenode -format
10/05/08 16:59:56 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = ubuntu/127.0.1.1
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 0.20.2
STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
************************************************************/
10/05/08 16:59:56 INFO namenode.FSNamesystem: fsOwner=hduser,hadoop
10/05/08 16:59:56 INFO namenode.FSNamesystem: supergroup=supergroup
10/05/08 16:59:56 INFO namenode.FSNamesystem: isPermissionEnabled=true
10/05/08 16:59:56 INFO common.Storage: Image file of size 96 saved in 0 seconds.
10/05/08 16:59:57 INFO common.Storage: Storage directory .../hadoop-hduser/dfs/name has been successfully formatted.
10/05/08 16:59:57 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at ubuntu/127.0.1.1
************************************************************/
hduser@ubuntu:/usr/local/hadoop$
2.3.1. Starting Single Node Cluster :
# To start the Namenode, Datanode, Jobtracker and a Tasktracker on your machine, run this command
hduser@ubuntu:~$ /usr/local/hadoop/bin/start-all.sh
The output will look like this:
hduser@ubuntu:/usr/local/hadoop$ bin/start-all.sh
starting namenode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-namenode-ubuntu.out
localhost: starting datanode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-datanode-ubuntu.out
localhost: starting secondarynamenode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-secondarynamenode-ubuntu.out
starting jobtracker, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-jobtracker-ubuntu.out
localhost: starting tasktracker, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-tasktracker-ubuntu.out
hduser@ubuntu:/usr/local/hadoop$
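A quick way to check that all five daemons came up is jps, which ships with the JDK and lists the Java processes of the current user; after a successful start you should see NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker. A small sketch with a fallback for machines without a JDK on the PATH:

```shell
# jps lists running JVMs by name; fall back to a message if it is missing.
if command -v jps >/dev/null 2>&1; then
  jps
else
  echo "jps not found - make sure the JDK bin directory is on the PATH"
fi
```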
2.3.2. Stopping Single Node Cluster :
hduser@ubuntu:~$ /usr/local/hadoop/bin/stop-all.sh
Output:
hduser@ubuntu:/usr/local/hadoop$ bin/stop-all.sh
stopping jobtracker
localhost: stopping tasktracker
stopping namenode
localhost: stopping datanode
localhost: stopping secondarynamenode
hduser@ubuntu:/usr/local/hadoop$
2.4 Running MapReduce Job :
# Copy a local test file into HDFS (run from /usr/local/hadoop/bin)
$ ./hadoop dfs -put /path-of-os-filesystem/test /test
$ ./hadoop dfs -ls /test
# Run the wordcount example: input directory /test, output directory /test/out
$ ./hadoop jar ../hadoop-examples-1.0.3.jar wordcount /test /test/out
$ ./hadoop dfs -lsr /test/out
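To inspect the job result, the part files under the output directory can be printed or copied back to the local filesystem. A sketch, assuming the /test/out path used above, that hadoop is on the PATH, and that wordcount wrote its reducer output to a part-r-00000 file (the usual name for the new-API examples):

```shell
# Print the wordcount output and copy the whole directory to the local
# filesystem; fall back to a message when no Hadoop installation is reachable.
if command -v hadoop >/dev/null 2>&1; then
  hadoop dfs -cat /test/out/part-r-00000
  hadoop dfs -get /test/out /tmp/wordcount-out
else
  echo "hadoop not found on PATH - start the cluster first"
fi
```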