Setting up Hadoop, Yarn, and Giraph for Distributed Systems Lab at University of Minnesota


Table of Contents

Hadoop

Giraph

Hadoop

Download the distribution from http://hadoop.apache.org/releases.html and unzip it to a location of your choosing.
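
For example, for 2.4.0 (the archive URL pattern here is an assumption; check the release page for the exact link):

wget https://archive.apache.org/dist/hadoop/common/hadoop-2.4.0/hadoop-2.4.0.tar.gz   # older releases are kept on archive.apache.org
tar -xzf hadoop-2.4.0.tar.gz -C /project/cluster15/hadoop                             # unpack to your chosen location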

Setting up single node cluster:

Follow the instructions at https://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-common/SingleCluster.html for whatever version you're running.

Setting up distributed cluster:

(I used 2.4.0 because it is compatible with Giraph.)

Setting up environment variables:

In your ~/.bashrc file, add:

export HADOOP_HOME=/project/cluster15/hadoop/hadoop-2.4.0 # this path is where you unzipped the file downloaded from Apache
export HADOOP_PREFIX=$HADOOP_HOME
export HADOOP_MAPRED_HOME=${HADOOP_HOME}
export HADOOP_COMMON_HOME=${HADOOP_HOME}
export HADOOP_HDFS_HOME=${HADOOP_HOME}
export YARN_HOME=${HADOOP_HOME}
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export PATH=$PATH:${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin # so commands like hdfs and start-dfs.sh are found later

Then run source ~/.bashrc.
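
To confirm the variables took effect (hadoop version assumes ${HADOOP_HOME}/bin is on your PATH, as in the block above):

echo $HADOOP_HOME    # should print the path you unzipped to
hadoop version       # should report 2.4.0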

Setting up Hadoop configuration files:

Run cd $HADOOP_CONF_DIR.

In core-site.xml add:

<property>
	<name>fs.defaultFS</name>
	<value>hdfs://jupiter:9000</value>
</property>

This is where your namenode (and secondary namenode) will run. Replace jupiter with the hostname of whatever machine you are running them on.

In hdfs-site.xml add:

<property>
	<name>dfs.replication</name>
 	<value>1</value>
</property>

This is how many copies of each block HDFS keeps. 1 = no redundancy, 2 = one extra copy, etc.
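
By default HDFS stores its data under hadoop.tmp.dir, which lives in /tmp, so on shared lab machines you may also want to point the storage directories somewhere persistent. A sketch (the paths here are assumptions, pick your own):

<property>
	<name>dfs.namenode.name.dir</name>
	<value>/project/cluster15/hadoop/hdfs/namenode</value>
</property>

<property>
	<name>dfs.datanode.data.dir</name>
	<value>/project/cluster15/hadoop/hdfs/datanode</value>
</property>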

In mapred-site.xml add (you might have to create it, e.g. by copying mapred-site.xml.template):

<property>
	<name>mapreduce.framework.name</name>
	<value>yarn</value>
</property>

<property>
	<name>mapreduce.tasktracker.map.tasks.maximum</name>
	<value>4</value>
</property>

<property>
	<name>mapreduce.shuffle.port</name>
	<value>13564</value>
</property>

The first property tells MapReduce to run on YARN. The second caps the number of concurrent map tasks per node. The third sets the port the shuffle service runs on.

In yarn-site.xml add:

<property>
	<name>yarn.nodemanager.aux-services</name>
	<value>mapreduce_shuffle</value>
</property>

<property>
	<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
	<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>

<property>
	<name>yarn.resourcemanager.resource-tracker.address</name>
	<value>jupiter:8025</value>
</property>

<property>
	<name>yarn.resourcemanager.scheduler.address</name>
	<value>jupiter:8030</value>
</property>

<property>
	<name>yarn.resourcemanager.address</name>
	<value>jupiter:8050</value>
</property>

<property>
	<name>yarn.nodemanager.localizer.address</name>
	<value>${yarn.nodemanager.hostname}:8060</value>
</property>

<property>
	<name>yarn.nodemanager.webapp.address</name>
	<value>${yarn.nodemanager.hostname}:8070</value>
</property>

The first two set up the MapReduce shuffle auxiliary service on each nodemanager. The others specify which ports the resourcemanager and nodemanager services run on.

Wherever I have jupiter, substitute the machine you're running the master daemons (namenode/resourcemanager) on, e.g. mycomputer.cs.umn.edu.

In the slaves file:

Delete localhost and add your slave machines:

nuclear01
nuclear02
nuclear03
nuclear04
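
Note that start-dfs.sh and start-yarn.sh (next section) ssh into every machine listed here, so passwordless SSH must work from the master to each slave (and to the master itself). A minimal sketch if it isn't set up yet:

ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa   # key with no passphrase
ssh-copy-id nuclear01                      # repeat for each slave and the master itself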

Running Hadoop

Execute:

start-dfs.sh
start-yarn.sh

and everything should start. The namenode web interface is at localhost:50070 by default (and the YARN resourcemanager UI at localhost:8088); you can check your node statuses there.

Commands for making directories and copying files are documented at https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-common/FileSystemShell.html
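
For example, to create your HDFS home directory and list it (replace wuxx1045 with your own username):

hdfs dfs -mkdir -p /user/wuxx1045
hdfs dfs -ls /user/wuxx1045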

Error checking

To check which services are running, execute jps.

On your master it should show something like:

11790 ResourceManager
11322 NameNode
11582 SecondaryNameNode
21613 Jps

On your slaves it should show something like:

6198 NodeManager
20821 Jps
5551 DataNode

It's important to check the log files if some services don't start; they are located in $HADOOP_HOME/logs. If a log file says something like

Caused by: java.net.BindException: Port in use: 0.0.0.0:8042

that means one of the ports a Hadoop service is trying to use is already taken.
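
To find out which process is holding a port, something like this usually works (assumes netstat or lsof is installed on the machine):

netstat -tlnp 2>/dev/null | grep 8042   # shows the PID listening on 8042
lsof -i :8042                           # alternative if netstat isn't available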

So check your configuration files and the default configuration at:

https://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-common/core-default.xml https://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml https://hadoop.apache.org/docs/r2.7.3/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml https://hadoop.apache.org/docs/r2.7.3/hadoop-yarn/hadoop-yarn-common/yarn-default.xml

and search for the port number (in this case 8042), then override that <property> in the corresponding XML file with a free port.

For example:

<property>
	<name>yarn.nodemanager.webapp.address</name>
	<value>${yarn.nodemanager.hostname}:9999</value>
</property>

Also, before starting HDFS for the first time, format the namenode (warning: this wipes HDFS metadata, so only do it on initial setup):

hadoop namenode -format
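
If a configuration change doesn't seem to take effect, restart the daemons:

stop-yarn.sh
stop-dfs.sh
start-dfs.sh
start-yarn.sh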

Giraph

Setup

This link has a good tutorial: https://giraph.apache.org/quick_start.html

But here is the quick way on the CS machines.

Make sure you have set up Hadoop already.

Execute git clone https://github.com/apache/giraph.git in the directory where you want your Giraph checkout.

Also download Maven v3 or above from https://maven.apache.org/download.cgi and extract it somewhere.

Add the following to your ~/.bashrc file:

export MAVEN_HOME=/project/cluster15/hadoop/apache-maven-3.3.9
export PATH=$PATH:$MAVEN_HOME/bin
export GIRAPH_HOME=/project/cluster15/hadoop/giraph

and execute

source ~/.bashrc

and then execute

cd $GIRAPH_HOME
mvn -Phadoop_yarn -Dhadoop.version=2.4.0 -DskipTests package

and it should build. It will take a while since it has to download dependencies. For me it took 14 minutes.
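
You can confirm the examples jar was produced with:

ls $GIRAPH_HOME/giraph-examples/target/*-jar-with-dependencies.jar

The filename encodes the Giraph version and the hadoop.version you built against; you'll need it for the commands below.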

If it doesn't work, it might be because you went over the 1GB disk quota on the CS machines. Execute:

cd ~
du -a | sort -n

This lists everything under your home directory sorted by disk usage (largest last), so you can delete whatever is taking the most space.

The ~/.m2 folder holds Maven's downloaded dependencies; you can delete it after the build finishes.

WARNING: some newer versions of Hadoop don't work with Giraph, so I used the older version 2.4.0.

Run sample job

Create a tiny graph with the following content (name it tinygraph.txt). Each line is [source_id, source_value, [[target_id, edge_value], ...]], the format JsonLongDoubleFloatDoubleVertexInputFormat expects:

[0,0,[[1,1],[3,3]]]
[1,0,[[0,1],[2,2],[3,1]]]
[2,0,[[1,2],[4,4]]]
[3,0,[[0,3],[1,1],[4,4]]]
[4,0,[[3,4],[2,4]]]

Copy it to HDFS:

hdfs dfs -mkdir -p /user/wuxx1045/input
hdfs dfs -copyFromLocal tinygraph.txt /user/wuxx1045/input/tinygraph.txt
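
You can verify the copy with:

hdfs dfs -ls /user/wuxx1045/input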

To see the runner's input parameters, execute the following. (The exact jar filename depends on the Giraph version you cloned and the hadoop.version you built against; check giraph-examples/target/ for the actual name.)

hadoop jar $GIRAPH_HOME/giraph-examples/target/giraph-examples-1.3.0-SNAPSHOT-for-hadoop-2.4.0-jar-with-dependencies.jar \
org.apache.giraph.GiraphRunner -h

To run the job, execute:

hadoop jar $GIRAPH_HOME/giraph-examples/target/giraph-examples-1.3.0-SNAPSHOT-for-hadoop-2.4.0-jar-with-dependencies.jar \
org.apache.giraph.GiraphRunner \
org.apache.giraph.examples.SimpleShortestPathsComputation \
-vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat \
-vip /user/wuxx1045/input/tinygraph.txt \
-vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
-op /user/wuxx1045/output \
-w 1 \
-ca giraph.SplitMasterWorker=false
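
Here -w 1 runs the computation with a single worker, and -ca giraph.SplitMasterWorker=false lets the master and the worker share a task; the Giraph quick start uses the same flag when only one map slot is available.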

Check the output by executing:

hdfs dfs -cat /user/wuxx1045/output/*

and it should show something like

0 1.0
2 2.0
1 0.0
3 1.0
4 5.0
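
If you rerun the job, delete the output directory first, since Hadoop refuses to write into an existing one:

hdfs dfs -rm -r /user/wuxx1045/output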