Instructions to setup and use hadoop_g5k tool as of 2020-05-06

Updated Setup and Usage instructions for hadoop_g5k

Sources

https://lipn.univ-paris13.fr/bigdata/index.php/How_to_use_Spark_on_Grid5000

https://github.com/mliroz/hadoop_g5k/wiki

https://github.com/mliroz/hadoop_g5k/wiki/spark_g5k

Updated Tutorial

hadoop_g5k Setup

Prepare the needed files by downloading the Hadoop and Spark binary archives; the commands below use hadoop-2.7.7.tar.gz and spark-2.4.5-bin-hadoop2.7.tgz.

You will need them as archives, so don't extract their contents.
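For example, both archives can be fetched from the Apache archive mirrors (these URLs are an assumption based on the versions used below; adjust them if the versions or the mirror layout change):

(frontend)$ wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.7/hadoop-2.7.7.tar.gz
(frontend)$ wget https://archive.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz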

Install Execo Using Pip

(Unlike what the old tutorial says, no proxy is needed; easy_install no longer seems to be supported on g5k.)

(frontend)$ python -m pip install --user execo
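To quickly check that the install worked, you can try importing the module; if the command below exits without an error, execo is available:

(frontend)$ python -c "import execo"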

Retrieve the hadoop_g5k sources from GitHub, then unzip them.

(frontend)$ wget https://github.com/mliroz/hadoop_g5k/archive/master.zip
(frontend)$ unzip master.zip

Update util.py to avoid a Python error when checking the Java version (whose output format has changed since the package was released).

(frontend)$ nano hadoop_g5k-master/hadoop_g5k/util/util.py

Then make check_java_version return True by commenting out the function's body. The tool works with the version of OpenJDK installed by default anyway, so the check is unnecessary.

... Edit util.py / check_java_version code to return True all the time ...
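A minimal sketch of what the patched function can look like (the parameter list below is a placeholder: keep whatever parameters your copy of util.py already declares and only replace the body):

def check_java_version(*args, **kwargs):
    # Body commented out: the original parsed the `java -version` output,
    # whose format has changed since the package was released.
    # The OpenJDK installed by default on the nodes works fine, so just
    # skip the check.
    return True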

Inside the hadoop_g5k-master folder, run the Python setup command.

(frontend)$ python setup.py install --user

Depending on your Python configuration, the scripts (hg5k, spark_g5k) are installed in a user-specific directory, typically /home/$USER/.local/bin for a --user install. You may add this directory to the PATH in order to be able to call them from any directory.
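If you are unsure where the scripts ended up, the standard Python user base can be printed as shown below; the scripts are placed in its bin subdirectory:

(frontend)$ python -m site --user-base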

To automatically add it to the PATH whenever connecting to g5k, add the following lines to your .bash_profile file.

PATH="/home/$USER/.local/bin:$PATH"
export PATH

hadoop_g5k Usage

From a frontend, reserve your nodes as usual. For example:

$ oarsub -I -t allow_classic_ssh -l nodes=4,walltime=2

Then, from inside your reservation, create and initialize the hadoop cluster.

# --version 2 indicates we are working with a Hadoop 2.x.y release
$ hg5k --create $OAR_NODEFILE --version 2
# Change the Hadoop archive path to yours
$ hg5k --bootstrap /home/$USER/hadoop-2.7.7.tar.gz
$ hg5k --initialize --start
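At this point the Hadoop cluster should be running. You can list the other options hg5k accepts; depending on the hadoop_g5k version, a --state flag may also be available to print the cluster's current status (this is an assumption, so check the help output):

$ hg5k --help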

Now create the Spark cluster in STANDALONE mode (hadoop_g5k no longer works well in YARN mode). The --hid 1 option links it to the Hadoop cluster created above.

$ spark_g5k --create STANDALONE --hid 1

Then install Spark on every cluster node, using a build whose bundled Hadoop dependency matches the Hadoop version installed in the previous steps.

# Change the Spark archive path to yours, ensuring the -hadoopX.Y suffix matches the Hadoop version deployed above
$ spark_g5k --bootstrap /home/$USER/spark-2.4.5-bin-hadoop2.7.tgz

Finally, initialize the Spark cluster and start it to make it available to process jobs.

$ spark_g5k --initialize --start

You are ready to submit a job from its assembly jar. For example:

$ spark_g5k --scala_job /home/$USER/some-spark-assembly.jar --main_class Main
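As a quick smoke test, you can run the SparkPi example shipped with the Spark distribution. The path below assumes you have also unpacked a local copy of the spark-2.4.5-bin-hadoop2.7 archive in your home directory, just to get at the examples jar:

$ spark_g5k --scala_job /home/$USER/spark-2.4.5-bin-hadoop2.7/examples/jars/spark-examples_2.11-2.4.5.jar --main_class org.apache.spark.examples.SparkPi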

After all your jobs are done, you should clean up all the temporary files created during the previous phases.

$ spark_g5k --delete
$ hg5k --delete