Download the distribution at http://hadoop.apache.org/releases.html
and unzip it to a location of your choosing.
Follow the instructions at https://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-common/SingleCluster.html for whichever version you're running
(I used 2.4.0 because it is compatible with Giraph).
In your ~/.bashrc file add:
export HADOOP_HOME=/project/cluster15/hadoop/hadoop-2.4.0 # this path is where you unzipped the file downloaded from Apache
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin # so hadoop, hdfs, start-dfs.sh etc. are on your PATH
export HADOOP_PREFIX=$HADOOP_HOME
export HADOOP_MAPRED_HOME=${HADOOP_HOME}
export HADOOP_COMMON_HOME=${HADOOP_HOME}
export HADOOP_HDFS_HOME=${HADOOP_HOME}
export YARN_HOME=${HADOOP_HOME}
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
Then execute:
source ~/.bashrc
cd $HADOOP_CONF_DIR
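As a quick sanity check that the derived variables expand the way you expect, you can re-derive and echo one of them (the /project/cluster15 path is just the example path from above; substitute wherever you unzipped Hadoop):

```shell
# Re-derive the config dir from HADOOP_HOME and print it; it should
# point at the etc/hadoop directory inside your unzipped tree.
export HADOOP_HOME=/project/cluster15/hadoop/hadoop-2.4.0
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
echo "${HADOOP_CONF_DIR}"
```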
in core-site.xml add:
<property>
<name>fs.defaultFS</name>
<value>hdfs://jupiter:9000</value>
</property>
This is where your namenode and secondary namenode are going to run. Replace jupiter with the hostname of whatever computer you are running from.
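Note that in this and all of the XML files below, the <property> blocks have to sit inside the file's top-level <configuration> element, so the complete core-site.xml looks like:

```xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://jupiter:9000</value>
  </property>
</configuration>
```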
in hdfs-site.xml add:
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
This is how many times your data is replicated in HDFS: 1 = no backup, 2 = one backup, and so on.
in mapred-site.xml add: (you might have to create it)
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.tasktracker.map.tasks.maximum</name>
<value>4</value>
</property>
<property>
<name>mapreduce.shuffle.port</name>
<value>13564</value>
</property>
The first property specifies the execution framework. The second specifies the maximum number of map tasks per node. The third specifies the port the MapReduce shuffle service runs on.
in yarn-site.xml add:
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>jupiter:8025</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>jupiter:8030</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>jupiter:8050</value>
</property>
<property>
<name>yarn.nodemanager.localizer.address</name>
<value>${yarn.nodemanager.hostname}:8060</value>
</property>
<property>
<name>yarn.nodemanager.webapp.address</name>
<value>${yarn.nodemanager.hostname}:8070</value>
</property>
The first two properties configure the NodeManager's shuffle auxiliary service. The others specify which ports the YARN services run on.
Wherever I have jupiter, replace it with the machine you're running the namenode/ResourceManager on, e.g. mycomputer.cs.umn.edu
In the slaves file, delete localhost
and add your slave machines:
nuclear01
nuclear02
nuclear03
nuclear04
Execute
start-dfs.sh
start-yarn.sh
and everything should start (both scripts live in $HADOOP_HOME/sbin). The web interface is at localhost:50070
by default and you can check your node statuses there.
commands for making directories and copying files are here: https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-common/FileSystemShell.html
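For reference, the FsShell commands I used most look like this (they need the cluster from the previous step running; wuxx1045 and the file names are just examples):

```shell
hdfs dfs -mkdir -p /user/wuxx1045/input        # make a directory (with parents)
hdfs dfs -put data.txt /user/wuxx1045/input/   # copy a local file into HDFS
hdfs dfs -ls /user/wuxx1045/input              # list a directory
hdfs dfs -cat /user/wuxx1045/input/data.txt    # print a file
hdfs dfs -rm -r /user/wuxx1045/output          # remove a directory recursively
```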
To check which services are running, execute jps.
on your master it should show something like
11790 ResourceManager
11322 NameNode
11582 SecondaryNameNode
21613 Jps
on your slave it should show something like
6198 NodeManager
20821 Jps
5551 DataNode
It's important to check the log files if some services don't start. They are located in $HADOOP_HOME/logs
If the log file says something like
Caused by: java.net.BindException: Port in use: 0.0.0.0:8042
That means one of the ports Hadoop is trying to use is already taken.
So check your configuration files and the default configuration at:
https://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-common/core-default.xml
https://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
https://hadoop.apache.org/docs/r2.7.3/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml
https://hadoop.apache.org/docs/r2.7.3/hadoop-yarn/hadoop-yarn-common/yarn-default.xml
and search for the port number (8042 in this case). Then add that <property> with a free port
into the XML file corresponding to that configuration.
For example:
<property>
<name>yarn.nodemanager.webapp.address</name>
<value>${yarn.nodemanager.hostname}:9999</value>
</property>
Also, before starting HDFS for the very first time, format the namenode by executing:
hdfs namenode -format
(hadoop namenode -format still works but is deprecated in Hadoop 2.x.)
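Putting it together, a typical first run looks like this (one-time format, then start the daemons; assumes the bashrc setup from the Hadoop section):

```shell
hdfs namenode -format            # one time only: wipes HDFS metadata
$HADOOP_HOME/sbin/start-dfs.sh   # starts NameNode, SecondaryNameNode, DataNodes
$HADOOP_HOME/sbin/start-yarn.sh  # starts ResourceManager and NodeManagers
jps                              # verify the daemons are up
```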
This link has a good tutorial: https://giraph.apache.org/quick_start.html
But here is the quick way on the cs machines.
make sure you set up hadoop already
In the directory where you want your Giraph files, execute:
git clone https://github.com/apache/giraph.git
Also download Maven v3 or above and extract it somewhere: https://maven.apache.org/download.cgi
Add the following to your ~/.bashrc file:
export MAVEN_HOME=/project/cluster15/hadoop/apache-maven-3.3.9
export PATH=$PATH:$MAVEN_HOME/bin
export GIRAPH_HOME=/project/cluster15/hadoop/giraph
and execute
source ~/.bashrc
and then execute
cd $GIRAPH_HOME
mvn -Phadoop_yarn -Dhadoop.version=2.4.0 -DskipTests package
and it should build. It will take a while since it has to download dependencies; for me it took 14 minutes.
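The jar you need for the examples below lands in giraph-examples/target/; its name encodes the Giraph and Hadoop versions you built against, so list the directory to get the exact name:

```shell
ls $GIRAPH_HOME/giraph-examples/target/*-jar-with-dependencies.jar
```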
If it doesn't work, it might be because you went over the 1GB disk quota on the CS machines. Execute:
cd ~
du -a | sort -n
It will show you the files under your home directory sorted by size, and you can delete the ones taking too much space.
The ~/.m2 folder holds Maven's downloaded dependencies; you can delete it after you finish building.
WARNING: some newer versions of Hadoop don't support Giraph, so I used the older version 2.4.0.
Create a tiny graph with the following contents and name it tinygraph.txt. Each line is [vertex id, vertex value, [[destination id, edge weight], ...]]:
[0,0,[[1,1],[3,3]]]
[1,0,[[0,1],[2,2],[3,1]]]
[2,0,[[1,2],[4,4]]]
[3,0,[[0,3],[1,1],[4,4]]]
[4,0,[[3,4],[2,4]]]
Copy it to HDFS (replace wuxx1045 with your own username):
hdfs dfs -mkdir -p /user/wuxx1045/input
hdfs dfs -copyFromLocal tinygraph.txt /user/wuxx1045/input/tinygraph.txt
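The file-creation step can also be done straight from the shell with a heredoc (the HDFS copy afterwards still needs the cluster running):

```shell
# Write the five-vertex example graph to tinygraph.txt.
cat > tinygraph.txt <<'EOF'
[0,0,[[1,1],[3,3]]]
[1,0,[[0,1],[2,2],[3,1]]]
[2,0,[[1,2],[4,4]]]
[3,0,[[0,3],[1,1],[4,4]]]
[4,0,[[3,4],[2,4]]]
EOF
wc -l < tinygraph.txt   # prints 5
```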
To see the input parameters, run GiraphRunner with -h (the exact jar name depends on the Giraph and Hadoop versions you built with, so check $GIRAPH_HOME/giraph-examples/target/ for yours):
hadoop jar $GIRAPH_HOME/giraph-examples/target/giraph-examples-1.2.0-SNAPSHOT-for-hadoop-0.20.203.0-jar-with-dependencies.jar \
org.apache.giraph.GiraphRunner -h
To run the job, execute the following (-w 1 requests one worker, and giraph.SplitMasterWorker=false lets the master and the worker share that single task, which is required when running with only one worker):
hadoop jar $GIRAPH_HOME/giraph-examples/target/giraph-examples-1.3.0-SNAPSHOT-for-hadoop-2.5.1-jar-with-dependencies.jar \
org.apache.giraph.GiraphRunner \
org.apache.giraph.examples.SimpleShortestPathsComputation \
-vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat \
-vip /user/wuxx1045/input/tinygraph.txt \
-vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
-op /user/wuxx1045/output \
-w 1 \
-ca giraph.SplitMasterWorker=false
Check the output by executing
hdfs dfs -cat /user/wuxx1045/output/*
Each line is a vertex id and its shortest distance from the source vertex (vertex 1 by default), so it should show something like
0 1.0
2 2.0
1 0.0
3 1.0
4 5.0