### Prerequisites
Install the Cloudera repository on all nodes where you'll run Hadoop daemons:
sudo tee /etc/yum.repos.d/cloudera.repo > /dev/null <<EOF
[cloudera-cdh5]
# Packages for Cloudera's Distribution for Hadoop, Version 5, on RedHat or CentOS 6 x86_64
name=Cloudera's Distribution for Hadoop, Version 5
baseurl=http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/5/
gpgkey = http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/RPM-GPG-KEY-cloudera
gpgcheck = 1
EOF
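Once the repo file is written, it's worth confirming that yum can see it and importing the signing key (the repo id and key URL below come from the repo file above):

```shell
# Confirm yum can see the new repository
yum repolist | grep cloudera-cdh5

# Import the GPG key so package signatures can be verified during install
sudo rpm --import http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/RPM-GPG-KEY-cloudera
```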
Disable IPTables:
sudo service iptables stop
sudo chkconfig iptables off
Disable SELinux:
sudo /usr/sbin/setenforce 0
sudo sed -i.old s/SELINUX=enforcing/SELINUX=disabled/ /etc/selinux/config
Set the hostname (replace [name_of_host] with your system's hostname):
sudo hostname [name_of_host]
Make sure the /etc/hosts file on each system contains the IP addresses and fully-qualified domain names (FQDNs) of all members of the cluster. Also make sure the /etc/sysconfig/network file on each system contains the hostname you just set for that system.
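As a sketch, a matching pair of files on the NameNode host might look like the following (the hostnames and addresses are examples only; substitute your own):

```shell
# /etc/hosts -- list every cluster member: IP, FQDN, then short name
192.168.1.10   namenode-host.company.com       namenode-host
192.168.1.11   resourcemanager.company.com     resourcemanager
192.168.1.20   worker1.company.com             worker1

# /etc/sysconfig/network -- hostname of this particular machine
HOSTNAME=namenode-host.company.com
```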
Validate the hostname settings:
- Run uname -a and verify that the output matches the hostname command
- Run /sbin/ifconfig and note the inet addr in the eth0 entry
- Run host -v -t A $(hostname) and make sure the hostname matches the output of the hostname command and has the same IP address as reported by ifconfig
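The checks above can be rolled into one quick script (a sketch only; it assumes the primary interface is eth0 and the EL6-style ifconfig output used elsewhere in this guide):

```shell
#!/bin/sh
# Sanity check: the hostname, its DNS A record, and the interface address
# should all agree before installing any Hadoop daemons.
HN=$(hostname)
IP=$(/sbin/ifconfig eth0 | awk '/inet addr/ { sub("addr:", "", $2); print $2 }')
if host -t A "$HN" | grep -q "$IP"; then
    echo "OK: $HN resolves to $IP"
else
    echo "MISMATCH: $HN does not resolve to $IP -- check /etc/hosts and DNS"
fi
```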
Install the JDK on all nodes (the Oracle JDK is recommended):
wget https://dl.dropboxusercontent.com/u/5756075/jdk-6u45-linux-x64-rpm.bin
chmod +x jdk-6u45-linux-x64-rpm.bin
sudo ./jdk-6u45-linux-x64-rpm.bin
### Installing Packages
Installing daemons on the respective nodes:
On the NameNode (the node that will run the NameNode daemon):
sudo yum clean all; sudo yum install hadoop-hdfs-namenode
On the SecondaryNameNode (the node that will run the SecondaryNameNode daemon):
sudo yum clean all; sudo yum install hadoop-hdfs-secondarynamenode
On the ResourceManager (the node that will run the ResourceManager and the MapReduce JobHistory Server):
sudo yum clean all; sudo yum install hadoop-yarn-resourcemanager \
hadoop-mapreduce-historyserver
On all worker nodes (the nodes that will run the DataNode and NodeManager daemons):
sudo yum clean all; sudo yum install hadoop-yarn-nodemanager \
hadoop-hdfs-datanode hadoop-mapreduce
On the gateway node (optional; if configured, this node acts as an entry point to your cluster. It is recommended to use a separate node where no Hadoop daemons are running):
sudo yum clean all; sudo yum install hadoop-client
NOTE: Update the configuration properties as required, replacing the hostnames in the sample configurations with the hostnames of your environment.
Export JAVA_HOME in /etc/hadoop/conf/hadoop-env.sh:
export JAVA_HOME=/usr/java/jdk1.6.0_45/
Configuring local storage directories:
dfs.namenode.name.dir specifies the URIs of the local filesystem directories where the NameNode stores its metadata and edit logs. dfs.datanode.data.dir specifies the URIs of the directories where the DataNode stores its blocks. It's recommended to configure the disks in a JBOD configuration.
Add the following properties to /etc/hadoop/conf/core-site.xml
(on all machines, try using rsync or scp to keep these configurations in sync across cluster)
<property>
<name>fs.defaultFS</name>
<value>hdfs://namenode-host.company.com:8020</value>
</property>
<property>
<name>hadoop.proxyuser.mapred.groups</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.mapred.hosts</name>
<value>*</value>
</property>
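Since these files must match on every node, one option (as the note above suggests) is to push /etc/hadoop/conf from the node you edit to the rest of the cluster. The host list below is hypothetical; adjust it to your environment:

```shell
# Push the edited configs to every other node (run as a user with ssh access)
for h in secondarynamenode.company.com resourcemanager.company.com \
         worker1.company.com worker2.company.com; do
    rsync -av /etc/hadoop/conf/ "$h":/etc/hadoop/conf/
done
```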
Add the following properties to /etc/hadoop/conf/hdfs-site.xml
(on all nodes in the cluster)
<property>
<name>dfs.permissions.superusergroup</name>
<value>hadoop</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///data/1/dfs/nn</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///data/1/dfs/dn</value>
</property>
<property>
<name>dfs.namenode.http-address</name>
<value>namenode-host.company.com:50070</value>
<description>
The address and the base port on which the dfs NameNode Web UI will listen.
</description>
</property>
Creating and configuring storage directories for use by HDFS & YARN:
NOTE: These paths should be consistent with the ones in the configuration files
sudo mkdir -p /data/1/{dfs,yarn}
sudo mkdir -p /data/1/dfs/{dn,nn}
sudo chown -R hdfs:hdfs /data/1/dfs
sudo mkdir -p /data/1/yarn/{local,logs}
sudo chown -R yarn:yarn /data/1/yarn
Format the namenode (Run this command from where you'll run namenode daemon):
sudo -u hdfs hdfs namenode -format
### Starting HDFS:
- Starting the NameNode (on the node where the hadoop-hdfs-namenode package is installed):
sudo service hadoop-hdfs-namenode start
- Starting the SecondaryNameNode (on the node where the hadoop-hdfs-secondarynamenode package is installed):
sudo service hadoop-hdfs-secondarynamenode start
- Starting DataNode on all worker nodes:
sudo service hadoop-hdfs-datanode start
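With the daemons up, one quick way to confirm the DataNodes have registered with the NameNode:

```shell
# Should list each live DataNode with its capacity and last-contact time
sudo -u hdfs hdfs dfsadmin -report
```

The NameNode web UI on the dfs.namenode.http-address port configured above (50070) shows the same information.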
Add the following properties to /etc/hadoop/conf/mapred-site.xml
(on all nodes in the cluster)
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>historyserver.cw.com:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>historyserver.cw.com:19888</value>
</property>
<property>
<name>yarn.app.mapreduce.am.staging-dir</name>
<value>/user</value>
<description>Staging directory for temporary files created by running jobs</description>
</property>
Add the following properties to yarn-site.xml (on all nodes in the cluster)
<property>
<name>yarn.resourcemanager.hostname</name>
<value>resourcemanager.company.com</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>resourcemanager.company.com:8031</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>resourcemanager.company.com:8032</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>resourcemanager.company.com:8030</value>
</property>
<property>
<name>yarn.resourcemanager.admin.address</name>
<value>resourcemanager.company.com:8033</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>resourcemanager.company.com:8088</value>
</property>
<property>
<description>Classpath for typical applications.</description>
<name>yarn.application.classpath</name>
<value>
$HADOOP_CONF_DIR,
$HADOOP_COMMON_HOME/*,$HADOOP_COMMON_HOME/lib/*,
$HADOOP_HDFS_HOME/*,$HADOOP_HDFS_HOME/lib/*,
$HADOOP_MAPRED_HOME/*,$HADOOP_MAPRED_HOME/lib/*,
$HADOOP_YARN_HOME/*,$HADOOP_YARN_HOME/lib/*
</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.nodemanager.local-dirs</name>
<value>file:///data/1/yarn/local</value>
<description>URIs of the dirs where NodeManagers stores its localized files</description>
</property>
<property>
<name>yarn.nodemanager.log-dirs</name>
<value>file:///data/1/yarn/logs</value>
<description>URIs of the dirs where NodeManagers stores container log files</description>
</property>
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<property>
<name>yarn.nodemanager.remote-app-log-dir</name>
<value>hdfs://namenode-host.company.com:8020/var/log/hadoop-yarn/apps</value>
<description>URI of the directory where container logs are aggregated</description>
</property>
Check that the following line is present in /etc/hadoop/conf/yarn-env.sh; if it is not, add it (if changed, sync this configuration file across all nodes in the cluster using rsync or scp):
export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce
Creating the directories required by YARN/MapReduce in HDFS (run these commands from a node where the HDFS daemons are running, or from the gateway if you have configured one):
sudo -u hdfs hadoop fs -mkdir /tmp
sudo -u hdfs hadoop fs -chmod -R 1777 /tmp
sudo -u hdfs hadoop fs -mkdir -p /user/history
sudo -u hdfs hadoop fs -chmod -R 1777 /user/history
sudo -u hdfs hadoop fs -chown mapred:hadoop /user/history
sudo -u hdfs hadoop fs -mkdir -p /var/log/hadoop-yarn
sudo -u hdfs hadoop fs -chown yarn:mapred /var/log/hadoop-yarn
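To double-check the layout before starting the YARN services:

```shell
# Recursively list what was just created: /tmp and /user/history should be
# mode 1777, /user/history owned by mapred:hadoop, and /var/log/hadoop-yarn
# owned by yarn:mapred
sudo -u hdfs hadoop fs -ls -R /
```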
Starting Services:
- Start the ResourceManager (on the node where the hadoop-yarn-resourcemanager package is installed)
sudo service hadoop-yarn-resourcemanager start
- Start NodeManager on all worker nodes
sudo service hadoop-yarn-nodemanager start
- Start the MapReduce JobHistory Server (on the node where the hadoop-mapreduce-historyserver package is installed)
sudo service hadoop-mapreduce-historyserver start
Finally, create a home directory for each user who will access HDFS & MapReduce.
NOTE: Replace <username> with the Linux username that will access the HDFS filesystem and run MapReduce jobs:
sudo -u hdfs hadoop fs -mkdir /user/<username>
sudo -u hdfs hadoop fs -chown <username> /user/<username>
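As a final smoke test, you can submit the pi example bundled with the CDH5 MapReduce package (the jar path below is where the hadoop-mapreduce package installs it; run the job as a user that has the home directory created above):

```shell
# Submit a small job; it should show up in the ResourceManager UI on port 8088
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 2 100
```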