### Prerequisites
Install the Cloudera repository on all nodes where you'll run Hadoop daemons:
sudo tee /etc/yum.repos.d/cloudera.repo > /dev/null <<EOF
[cloudera-cdh5]
# Packages for Cloudera's Distribution for Hadoop, Version 5, on RedHat or CentOS 6 x86_64
name=Cloudera's Distribution for Hadoop, Version 5
baseurl=http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/5/
gpgkey = http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/RPM-GPG-KEY-cloudera
gpgcheck = 1
EOF
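Once the repo file is written, it's worth confirming that yum can see it and importing the signing key (the repo id and key URL below come from the repo file above):

```shell
# Confirm yum can see the new repository
yum repolist | grep cloudera-cdh5

# Import the GPG key so package signatures can be verified during install
sudo rpm --import http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/RPM-GPG-KEY-cloudera
```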
Disable IPTables:
sudo service iptables stop
sudo chkconfig iptables off
Disable SELinux:
sudo /usr/sbin/setenforce 0
sudo sed -i.old s/SELINUX=enforcing/SELINUX=disabled/ /etc/selinux/config
Set the hostname (replace [name_of_host] with your system's hostname):
sudo hostname [name_of_host]
Make sure the /etc/hosts file on each system contains the IP addresses and fully-qualified domain names (FQDNs) of all members of the cluster. Also make sure the /etc/sysconfig/network file on each system contains the hostname you just set for that system.
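As a sketch, a matching pair of files on the NameNode host might look like the following (the hostnames and addresses are examples only; substitute your own):

```shell
# /etc/hosts -- list every cluster member: IP, FQDN, then short name
192.168.1.10   namenode-host.company.com       namenode-host
192.168.1.11   resourcemanager.company.com     resourcemanager
192.168.1.20   worker1.company.com             worker1

# /etc/sysconfig/network -- hostname of this particular machine
HOSTNAME=namenode-host.company.com
```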
Validate the hostname settings:
- Run uname -a and verify that the output matches the hostname command
- Run /sbin/ifconfig and note the inet addr in the eth0 entry
- Run host -v -t A $(hostname) and make sure the hostname matches the output of the hostname command and has the same IP address as reported by ifconfig
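The checks above can be rolled into one quick script (a sketch only; it assumes the primary interface is eth0 and the EL6-style ifconfig output used elsewhere in this guide):

```shell
#!/bin/sh
# Sanity check: the hostname, its DNS A record, and the interface address
# should all agree before installing any Hadoop daemons.
HN=$(hostname)
IP=$(/sbin/ifconfig eth0 | awk '/inet addr/ { sub("addr:", "", $2); print $2 }')
if host -t A "$HN" | grep -q "$IP"; then
    echo "OK: $HN resolves to $IP"
else
    echo "MISMATCH: $HN does not resolve to $IP -- check /etc/hosts and DNS"
fi
```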
Install the JDK on all nodes (the Oracle JDK is recommended):
wget https://dl.dropboxusercontent.com/u/5756075/jdk-6u45-linux-x64-rpm.bin
chmod +x jdk-6u45-linux-x64-rpm.bin
sudo ./jdk-6u45-linux-x64-rpm.bin
### Installing Packages
Installing daemons on the respective nodes:
On the NameNode (the node that will run the NameNode daemon):
sudo yum clean all; sudo yum install hadoop-hdfs-namenode
On the SecondaryNameNode (the node that will run the SecondaryNameNode daemon):
sudo yum clean all; sudo yum install hadoop-hdfs-secondarynamenode
On the ResourceManager (the node that will run the ResourceManager and the MapReduce JobHistory Server):
sudo yum clean all; sudo yum install hadoop-yarn-resourcemanager \
hadoop-mapreduce-historyserver
On all worker nodes (the nodes that will run the DataNode and NodeManager daemons):
sudo yum clean all; sudo yum install hadoop-yarn-nodemanager \
hadoop-hdfs-datanode hadoop-mapreduce
On the gateway node (optional; if configured, this node acts as an entry point to your cluster. It is recommended to use a separate node where no Hadoop daemons are running):
sudo yum clean all; sudo yum install hadoop-client
NOTE: Update the configuration properties as required, replacing the hostnames in the sample configurations with the hostnames of your environment.
Export JAVA_HOME in /etc/hadoop/conf/hadoop-env.sh:
export JAVA_HOME=/usr/java/jdk1.6.0_45/
Configuring local storage directories:
dfs.namenode.name.dir specifies the URIs of the local filesystem directories where the NameNode stores its metadata and edit logs. dfs.datanode.data.dir specifies the URIs of the directories where the DataNode stores its blocks. It's recommended to configure the disks in a JBOD configuration.
Add the following properties to /etc/hadoop/conf/core-site.xml
(on all machines, try using rsync or scp to keep these configurations in sync across cluster)
<property>
<name>fs.defaultFS</name>
<value>hdfs://namenode-host.company.com:8020</value>
</property>
<property>
<name>hadoop.proxyuser.mapred.groups</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.mapred.hosts</name>
<value>*</value>
</property>
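Since these files must match on every node, one option (as the note above suggests) is to push /etc/hadoop/conf from the node you edit to the rest of the cluster. The host list below is hypothetical; adjust it to your environment:

```shell
# Push the edited configs to every other node (run as a user with ssh access)
for h in secondarynamenode.company.com resourcemanager.company.com \
         worker1.company.com worker2.company.com; do
    rsync -av /etc/hadoop/conf/ "$h":/etc/hadoop/conf/
done
```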
Add the following properties to /etc/hadoop/conf/hdfs-site.xml
(on all nodes in the cluster)
<property>
<name>dfs.permissions.superusergroup</name>
<value>hadoop</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///data/1/dfs/nn</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///data/1/dfs/dn</value>
</property>
<property>
<name>dfs.namenode.http-address</name>
<value>namenode-host.company.com:50070</value>
<description>
The address and the base port on which the dfs NameNode Web UI will listen.
</description>
</property>
Creating and configuring storage directories for use by HDFS & YARN:
NOTE: These paths should be consistent with the ones in the configuration files
sudo mkdir -p /data/1/{dfs,yarn}
sudo mkdir -p /data/1/dfs/{dn,nn}
sudo chown -R hdfs:hdfs /data/1/dfs
sudo mkdir -p /data/1/yarn/{local,logs}
sudo chown -R yarn:yarn /data/1/yarn
Format the namenode (Run this command from where you'll run namenode daemon):
sudo -u hdfs hdfs namenode -format
### Starting HDFS:
- Starting the NameNode (on the node where the hadoop-hdfs-namenode package is installed):
sudo service hadoop-hdfs-namenode start
- Starting the SecondaryNameNode (on the node where the hadoop-hdfs-secondarynamenode package is installed):
sudo service hadoop-hdfs-secondarynamenode start
- Starting DataNode on all worker nodes:
sudo service hadoop-hdfs-datanode start
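With the daemons up, one quick way to confirm the DataNodes have registered with the NameNode:

```shell
# Should list each live DataNode with its capacity and last-contact time
sudo -u hdfs hdfs dfsadmin -report
```

The NameNode web UI on the dfs.namenode.http-address port configured above (50070) shows the same information.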
Add the following properties to /etc/hadoop/conf/mapred-site.xml
(on all nodes in the cluster)
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>historyserver.cw.com:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>historyserver.cw.com:19888</value>
</property>
<property>
<name>yarn.app.mapreduce.am.staging-dir</name>
<value>/user</value>
<description>Staging directory for temporary files created by running jobs</description>
</property>
Add the following properties to yarn-site.xml (on all nodes in the cluster)
<property>
<name>yarn.resourcemanager.hostname</name>
<value>resourcemanager.company.com</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>resourcemanager.company.com:8031</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>resourcemanager.company.com:8032</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>resourcemanager.company.com:8030</value>
</property>
<property>
<name>yarn.resourcemanager.admin.address</name>
<value>resourcemanager.company.com:8033</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>resourcemanager.company.com:8088</value>
</property>
<property>
<description>Classpath for typical applications.</description>
<name>yarn.application.classpath</name>
<value>
$HADOOP_CONF_DIR,
$HADOOP_COMMON_HOME/*,$HADOOP_COMMON_HOME/lib/*,
$HADOOP_HDFS_HOME/*,$HADOOP_HDFS_HOME/lib/*,
$HADOOP_MAPRED_HOME/*,$HADOOP_MAPRED_HOME/lib/*,
$HADOOP_YARN_HOME/*,$HADOOP_YARN_HOME/lib/*
</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.nodemanager.local-dirs</name>
<value>file:///data/1/yarn/local</value>
<description>URIs of the dirs where NodeManagers stores its localized files</description>
</property>
<property>
<name>yarn.nodemanager.log-dirs</name>
<value>file:///data/1/yarn/logs</value>
<description>URIs of the dirs where NodeManagers stores container log files</description>
</property>
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<property>
<name>yarn.nodemanager.remote-app-log-dir</name>
<value>hdfs://namenode-host.company.com:8020/var/log/hadoop-yarn/apps</value>
<description>URI of the directory where container logs are aggregated</description>
</property>
Check that the following line is present in /etc/hadoop/conf/yarn-env.sh; if it is not, add it (if changed, sync this configuration file across all nodes in the cluster using rsync or scp):
export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce
Creating the directories required by YARN/MapReduce in HDFS (run these commands from a node where the HDFS daemons are running, or from the gateway if you have configured one):
sudo -u hdfs hadoop fs -mkdir /tmp
sudo -u hdfs hadoop fs -chmod -R 1777 /tmp
sudo -u hdfs hadoop fs -mkdir -p /user/history
sudo -u hdfs hadoop fs -chmod -R 1777 /user/history
sudo -u hdfs hadoop fs -chown mapred:hadoop /user/history
sudo -u hdfs hadoop fs -mkdir -p /var/log/hadoop-yarn
sudo -u hdfs hadoop fs -chown yarn:mapred /var/log/hadoop-yarn
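To double-check the layout before starting the YARN services:

```shell
# Recursively list what was just created: /tmp and /user/history should be
# mode 1777, /user/history owned by mapred:hadoop, and /var/log/hadoop-yarn
# owned by yarn:mapred
sudo -u hdfs hadoop fs -ls -R /
```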
Starting Services:
- Start the ResourceManager (on the node where the hadoop-yarn-resourcemanager package is installed)
sudo service hadoop-yarn-resourcemanager start
- Start NodeManager on all worker nodes
sudo service hadoop-yarn-nodemanager start
- Start the MapReduce JobHistory Server (on the node where the hadoop-mapreduce-historyserver package is installed)
sudo service hadoop-mapreduce-historyserver start
Finally, create a home directory for each user who will access HDFS & MapReduce.
NOTE: Replace <username> with the Linux username that will access the HDFS filesystem and run MapReduce jobs:
sudo -u hdfs hadoop fs -mkdir /user/<username>
sudo -u hdfs hadoop fs -chown <username> /user/<username>
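As a final smoke test, you can submit the pi example bundled with the CDH5 MapReduce package (the jar path below is where the hadoop-mapreduce package installs it; run the job as a user that has the home directory created above):

```shell
# Submit a small job; it should show up in the ResourceManager UI on port 8088
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 2 100
```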