Setting up Hadoop, Yarn, and Giraph for Distributed Systems Lab at University of Minnesota


Table of Contents

Hadoop

Giraph

Hadoop

Download the distribution from http://hadoop.apache.org/releases.html and unzip it to a location of your choosing.
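
For example, for 2.4.0 (the archive URL pattern here is an assumption; check the release page for the exact link):

wget https://archive.apache.org/dist/hadoop/common/hadoop-2.4.0/hadoop-2.4.0.tar.gz   # older releases are kept on archive.apache.org
tar -xzf hadoop-2.4.0.tar.gz -C /project/cluster15/hadoop                             # unpack to your chosen location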

Setting up single node cluster:

Follow the instructions at https://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-common/SingleCluster.html for whatever version you're running.

Setting up distributed cluster:

(I used 2.4.0 because it is compatible with Giraph.)

Setting up environment variables:

In your ~/.bashrc file, add:

export HADOOP_HOME=/project/cluster15/hadoop/hadoop-2.4.0 # this path is where you unzipped the file downloaded from Apache
export HADOOP_PREFIX=$HADOOP_HOME
export HADOOP_MAPRED_HOME=${HADOOP_HOME}
export HADOOP_COMMON_HOME=${HADOOP_HOME}
export HADOOP_HDFS_HOME=${HADOOP_HOME}
export YARN_HOME=${HADOOP_HOME}
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export PATH=$PATH:${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin # so commands like hdfs and start-dfs.sh are found later

Then run source ~/.bashrc.
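
To confirm the variables took effect (hadoop version assumes ${HADOOP_HOME}/bin is on your PATH, as in the block above):

echo $HADOOP_HOME    # should print the path you unzipped to
hadoop version       # should report 2.4.0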

Setting up Hadoop configuration files:

Run cd $HADOOP_CONF_DIR.

In core-site.xml add:

<property>
	<name>fs.defaultFS</name>
	<value>hdfs://jupiter:9000</value>
</property>

This is where your namenode (and secondary namenode) will run. Replace jupiter with the hostname of whatever machine you are running them on.

In hdfs-site.xml add:

<property>
	<name>dfs.replication</name>
 	<value>1</value>
</property>

This is how many copies of each block HDFS keeps. 1 = no redundancy, 2 = one extra copy, etc.
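
By default HDFS stores its data under hadoop.tmp.dir, which lives in /tmp, so on shared lab machines you may also want to point the storage directories somewhere persistent. A sketch (the paths here are assumptions, pick your own):

<property>
	<name>dfs.namenode.name.dir</name>
	<value>/project/cluster15/hadoop/hdfs/namenode</value>
</property>

<property>
	<name>dfs.datanode.data.dir</name>
	<value>/project/cluster15/hadoop/hdfs/datanode</value>
</property>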

In mapred-site.xml add (you might have to create it, e.g. by copying mapred-site.xml.template):

<property>
	<name>mapreduce.framework.name</name>
	<value>yarn</value>
</property>

<property>
	<name>mapreduce.tasktracker.map.tasks.maximum</name>
	<value>4</value>
</property>

<property>
	<name>mapreduce.shuffle.port</name>
	<value>13564</value>
</property>

The first property tells MapReduce to run on YARN. The second caps the number of concurrent map tasks per node. The third sets the port the shuffle service runs on.

In yarn-site.xml add:

<property>
	<name>yarn.nodemanager.aux-services</name>
	<value>mapreduce_shuffle</value>
</property>

<property>
	<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
	<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>

<property>
	<name>yarn.resourcemanager.resource-tracker.address</name>
	<value>jupiter:8025</value>
</property>

<property>
	<name>yarn.resourcemanager.scheduler.address</name>
	<value>jupiter:8030</value>
</property>

<property>
	<name>yarn.resourcemanager.address</name>
	<value>jupiter:8050</value>
</property>

<property>
	<name>yarn.nodemanager.localizer.address</name>
	<value>${yarn.nodemanager.hostname}:8060</value>
</property>

<property>
	<name>yarn.nodemanager.webapp.address</name>
	<value>${yarn.nodemanager.hostname}:8070</value>
</property>

The first two set up the MapReduce shuffle auxiliary service on each nodemanager. The others specify which ports the resourcemanager and nodemanager services run on.

Wherever I have jupiter, substitute the machine you're running the master daemons (namenode/resourcemanager) on, e.g. mycomputer.cs.umn.edu.

In the slaves file:

Delete localhost and add your slave machines:

nuclear01
nuclear02
nuclear03
nuclear04
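
Note that start-dfs.sh and start-yarn.sh (next section) ssh into every machine listed here, so passwordless SSH must work from the master to each slave (and to the master itself). A minimal sketch if it isn't set up yet:

ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa   # key with no passphrase
ssh-copy-id nuclear01                      # repeat for each slave and the master itself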

Running Hadoop

Execute:

start-dfs.sh
start-yarn.sh

and everything should start. The namenode web interface is at localhost:50070 by default (and the YARN resourcemanager UI at localhost:8088); you can check your node statuses there.

Commands for making directories and copying files are documented at https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-common/FileSystemShell.html
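
For example, to create your HDFS home directory and list it (replace wuxx1045 with your own username):

hdfs dfs -mkdir -p /user/wuxx1045
hdfs dfs -ls /user/wuxx1045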

Error checking

To check which services are running, execute jps.

On your master it should show something like:

11790 ResourceManager
11322 NameNode
11582 SecondaryNameNode
21613 Jps

On your slaves it should show something like:

6198 NodeManager
20821 Jps
5551 DataNode

It's important to check the log files if some services don't start; they are located in $HADOOP_HOME/logs. If a log file says something like

Caused by: java.net.BindException: Port in use: 0.0.0.0:8042

that means one of the ports a Hadoop service is trying to use is already taken.
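
To find out which process is holding a port, something like this usually works (assumes netstat or lsof is installed on the machine):

netstat -tlnp 2>/dev/null | grep 8042   # shows the PID listening on 8042
lsof -i :8042                           # alternative if netstat isn't available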

So check your configuration files and the default configuration at:

https://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-common/core-default.xml https://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml https://hadoop.apache.org/docs/r2.7.3/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml https://hadoop.apache.org/docs/r2.7.3/hadoop-yarn/hadoop-yarn-common/yarn-default.xml

and search for the port number (in this case 8042), then override that <property> in the corresponding XML file with a free port.

For example:

<property>
	<name>yarn.nodemanager.webapp.address</name>
	<value>${yarn.nodemanager.hostname}:9999</value>
</property>

Also, before starting HDFS for the first time, format the namenode (warning: this wipes HDFS metadata, so only do it on initial setup):

hadoop namenode -format
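
If a configuration change doesn't seem to take effect, restart the daemons:

stop-yarn.sh
stop-dfs.sh
start-dfs.sh
start-yarn.sh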

Giraph

Setup

This link has a good tutorial: https://giraph.apache.org/quick_start.html

But here is the quick way on the CS machines.

Make sure you have set up Hadoop already.

Execute git clone https://github.com/apache/giraph.git in the directory where you want your Giraph checkout.

Also download Maven v3 or above from https://maven.apache.org/download.cgi and extract it somewhere.

Add the following to your ~/.bashrc file:

export MAVEN_HOME=/project/cluster15/hadoop/apache-maven-3.3.9
export PATH=$PATH:$MAVEN_HOME/bin
export GIRAPH_HOME=/project/cluster15/hadoop/giraph

and execute

source ~/.bashrc

and then execute

cd $GIRAPH_HOME
mvn -Phadoop_yarn -Dhadoop.version=2.4.0 -DskipTests package

and it should build. It will take a while since it has to download dependencies. For me it took 14 minutes.
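
You can confirm the examples jar was produced with:

ls $GIRAPH_HOME/giraph-examples/target/*-jar-with-dependencies.jar

The filename encodes the Giraph version and the hadoop.version you built against; you'll need it for the commands below.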

If it doesn't work, it might be because you went over the 1GB disk quota on the CS machines. Execute:

cd ~
du -a | sort -n

This lists everything under your home directory sorted by disk usage (largest last), so you can delete whatever is taking the most space.

The ~/.m2 folder holds Maven's downloaded dependencies; you can delete it after the build finishes.

WARNING: some newer versions of Hadoop don't work with Giraph, so I used the older version 2.4.0.

Run sample job

Create a tiny graph with the following content (name it tinygraph.txt). Each line is [source_id, source_value, [[target_id, edge_value], ...]], the format JsonLongDoubleFloatDoubleVertexInputFormat expects:

[0,0,[[1,1],[3,3]]]
[1,0,[[0,1],[2,2],[3,1]]]
[2,0,[[1,2],[4,4]]]
[3,0,[[0,3],[1,1],[4,4]]]
[4,0,[[3,4],[2,4]]]

Copy it to HDFS:

hdfs dfs -mkdir -p /user/wuxx1045/input
hdfs dfs -copyFromLocal tinygraph.txt /user/wuxx1045/input/tinygraph.txt
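
You can verify the copy with:

hdfs dfs -ls /user/wuxx1045/input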

To see the runner's input parameters, execute the following. (The exact jar filename depends on the Giraph version you cloned and the hadoop.version you built against; check giraph-examples/target/ for the actual name.)

hadoop jar $GIRAPH_HOME/giraph-examples/target/giraph-examples-1.3.0-SNAPSHOT-for-hadoop-2.4.0-jar-with-dependencies.jar \
org.apache.giraph.GiraphRunner -h

To run the job, execute:

hadoop jar $GIRAPH_HOME/giraph-examples/target/giraph-examples-1.3.0-SNAPSHOT-for-hadoop-2.4.0-jar-with-dependencies.jar \
org.apache.giraph.GiraphRunner \
org.apache.giraph.examples.SimpleShortestPathsComputation \
-vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat \
-vip /user/wuxx1045/input/tinygraph.txt \
-vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
-op /user/wuxx1045/output \
-w 1 \
-ca giraph.SplitMasterWorker=false
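
Here -w 1 runs the computation with a single worker, and -ca giraph.SplitMasterWorker=false lets the master and the worker share a task; the Giraph quick start uses the same flag when only one map slot is available.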

Check the output by executing:

hdfs dfs -cat /user/wuxx1045/output/*

and it should show something like

0 1.0
2 2.0
1 0.0
3 1.0
4 5.0
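
If you rerun the job, delete the output directory first, since Hadoop refuses to write into an existing one:

hdfs dfs -rm -r /user/wuxx1045/output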