Instructions to setup and use hadoop_g5k tool as of 2020-05-06

Updated Setup and Usage instructions for hadoop_g5k

Sources

https://lipn.univ-paris13.fr/bigdata/index.php/How_to_use_Spark_on_Grid5000

https://github.com/mliroz/hadoop_g5k/wiki

https://github.com/mliroz/hadoop_g5k/wiki/spark_g5k

Updated Tutorial

hadoop_g5k Setup

Prepare the needed files by downloading the Hadoop and Spark binary archives; the commands below use hadoop-2.7.7.tar.gz and spark-2.4.5-bin-hadoop2.7.tgz.

You will need them as archives, so don't extract their contents.
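For example, both archives can be fetched from the Apache archive mirrors (these URLs are an assumption based on the versions used below; adjust them if the versions or the mirror layout change):

(frontend)$ wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.7/hadoop-2.7.7.tar.gz
(frontend)$ wget https://archive.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz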

Install Execo Using Pip

(Unlike what the old tutorial says, no proxy is needed; easy_install no longer seems to be supported on g5k.)

(frontend)$ python -m pip install --user execo
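To quickly check that the install worked, you can try importing the module; if the command below exits without an error, execo is available:

(frontend)$ python -c "import execo"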

Retrieve the hadoop_g5k sources from GitHub, then unzip them.

(frontend)$ wget https://github.com/mliroz/hadoop_g5k/archive/master.zip
(frontend)$ unzip master.zip

Update util.py to avoid a Python error when checking the Java version (whose output format has changed since the package was released).

(frontend)$ nano hadoop_g5k-master/hadoop_g5k/util/util.py

Then make check_java_version return True by commenting out the function's body. The tool works with the version of OpenJDK installed by default anyway, so the check is unnecessary.

... Edit util.py / check_java_version code to return True all the time ...
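A minimal sketch of what the patched function can look like (the parameter list below is a placeholder: keep whatever parameters your copy of util.py already declares and only replace the body):

def check_java_version(*args, **kwargs):
    # Body commented out: the original parsed the `java -version` output,
    # whose format has changed since the package was released.
    # The OpenJDK installed by default on the nodes works fine, so just
    # skip the check.
    return True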

Inside the hadoop_g5k-master folder, run the Python setup command.

(frontend)$ python setup.py install --user

Depending on your Python configuration, the scripts (hg5k, spark_g5k) are installed in a user-specific directory, typically /home/$USER/.local/bin for a --user install. You may add this directory to the PATH in order to be able to call them from any directory.
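If you are unsure where the scripts ended up, the standard Python user base can be printed as shown below; the scripts are placed in its bin subdirectory:

(frontend)$ python -m site --user-base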

To automatically add it to the PATH whenever connecting to g5k, add the following lines to your .bash_profile file.

PATH="/home/$USER/.local/bin:$PATH"
export PATH

hadoop_g5k Usage

From a frontend, reserve your nodes as usual. For example:

$ oarsub -I -t allow_classic_ssh -l nodes=4,walltime=2

Then, from inside your reservation, create and initialize the hadoop cluster.

# --version 2 indicates we are working with a Hadoop 2.x.y release
$ hg5k --create $OAR_NODEFILE --version 2
# Change the Hadoop archive path to yours
$ hg5k --bootstrap /home/$USER/hadoop-2.7.7.tar.gz
$ hg5k --initialize --start
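At this point the Hadoop cluster should be running. You can list the other options hg5k accepts; depending on the hadoop_g5k version, a --state flag may also be available to print the cluster's current status (this is an assumption, so check the help output):

$ hg5k --help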

Now create the Spark cluster in STANDALONE mode (hadoop_g5k no longer works well in YARN mode). The --hid 1 option links it to the Hadoop cluster created above.

$ spark_g5k --create STANDALONE --hid 1

Then install Spark on every cluster node, using a build whose bundled Hadoop dependency matches the Hadoop version installed in the previous steps.

# Change the Spark archive path to yours, ensuring the -hadoopX.Y suffix matches the Hadoop version deployed above
$ spark_g5k --bootstrap /home/$USER/spark-2.4.5-bin-hadoop2.7.tgz

Finally, initialize the Spark cluster and start it to make it available to process jobs.

$ spark_g5k --initialize --start

You are ready to submit a job from its assembly jar. For example:

$ spark_g5k --scala_job /home/$USER/some-spark-assembly.jar --main_class Main
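As a quick smoke test, you can run the SparkPi example shipped with the Spark distribution. The path below assumes you have also unpacked a local copy of the spark-2.4.5-bin-hadoop2.7 archive in your home directory, just to get at the examples jar:

$ spark_g5k --scala_job /home/$USER/spark-2.4.5-bin-hadoop2.7/examples/jars/spark-examples_2.11-2.4.5.jar --main_class org.apache.spark.examples.SparkPi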

After all your jobs are done, you should clean up all the temporary files created during the previous phases.

$ spark_g5k --delete
$ hg5k --delete