PySpark on YARN in self-contained environments

Author: https://github.com/seanorama

Why

By default, pyspark jobs use the Python installed on the local system of each YARN NodeManager host, which means:

  • If a different version of Python is needed, it must be installed and maintained on all of the hosts.
  • If additional modules (e.g. from pip) are needed, they must be installed and maintained on all of the hosts.

This creates a lot of overhead and introduces many risks, and Spark developers typically do not have direct access to the YARN NodeManager hosts anyway.

Further, it is a good practice to manage dependencies from the development side, which is only possible if all Python dependencies are "self-contained".
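
To see what the executors pick up by default, a quick check such as the sketch below can help. This is not part of the original write-up; it only assumes a working pyspark client on an edge host.

## Start a shell without any environment configuration
pyspark --master yarn --deploy-mode client

## then, in the shell, ask each executor which interpreter it is running:
sc.parallelize(range(4)).map(lambda _: __import__("sys").executable).distinct().collect()
## This typically returns the NodeManagers' system Python, e.g. ['/usr/bin/python'].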

Overview

There are 2 methods to have self-contained environments:

  • a) Use an archive (e.g. tar.gz) of a Python environment (virtualenv or conda):
    • Benefits:
      • Faster load time than building a virtualenv at runtime, since the packages are already present.
      • Uses the YARN Shared cache: https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/SharedCache.html
        • During 1st use of the environment, it will be cached. Future uses will load very fast.
      • Not tied to the Python interpreter installed on the NodeManager machines, meaning any Python version can be used and nothing ever has to be installed on the NodeManagers.
      • Completely self-contained. No external dependencies.
    • Caveats:
      • Can't install additional packages at run-time.
  • b) Use spark.pyspark.virtualenv, which creates a new virtualenv at Spark runtime:
    • Benefits:
      • Install packages at runtime.
    • Caveats:
      • Not entirely self-contained, since it depends on the interpreter being available on all YARN NodeManager hosts (e.g. at /usr/bin/python3 or wherever you place it), which also makes it harder to change versions.
      • Very SLOW, as every job must download and build the pip packages.
      • Depends on public internet access, or a local network pypi/conda repo.

How to: Use an archive (e.g. tar.gz) of a Python environment (virtualenv or conda):

Overview:

  1. Create an environment with virtualenv or conda
  2. Archive the environment to a .tar.gz or .zip.
  3. Upload the archive to HDFS
  4. Tell Spark (via spark-submit, pyspark, livy, zeppelin) to use this environment
  5. Repeat for each different virtualenv that is required or when the virtualenv needs updating

(Optional) Create a shared HDFS path for storing the environment(s):

This is only necessary if the environment will be shared and a suitable location doesn't already exist.

sudo -u hdfs -i

## if kerberos
keytab=/etc/security/keytabs/hdfs.headless.keytab
kinit -kt ${keytab} $(klist -kt ${keytab}| awk '{print $NF}'|tail -1)
## endif

hdfs dfs -mkdir -p /share/python-envs
hdfs dfs -chmod -R 775 /share

## replace the group with a user group that will be managing the archives
hdfs dfs -chown -R hdfs:hadoop /share

exit
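
As a quick sanity check (not in the original steps), confirm the path and permissions before handing it over to the group that manages the archives:

hdfs dfs -ls -d /share/python-envs
## expect something like: drwxrwxr-x ... hdfs hadoop ... /share/python-envs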

Create archive using Python virtualenv and tar

  1. Install python3 and python-virtualenv on a host with the same operating system and CPU architecture as the cluster, such as an edge host:
    • This can be done in your home directory without root/sudo access, either by:
      • downloading and building Python manually, or
      • using pyenv or a similar tool (see the sketch after this step).
    • Or system-wide:
sudo yum install python3 python-virtualenv
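
For the home-directory route, a minimal sketch using pyenv (one of the options mentioned above) could look like the following; the version number is only an example, and building Python this way still needs the usual build dependencies on the host:

## Install pyenv into ~/.pyenv and build a private Python 3 (example version)
curl -sS https://pyenv.run | bash
~/.pyenv/bin/pyenv install 3.7.17
export PATH="$HOME/.pyenv/versions/3.7.17/bin:$PATH"
python3 --version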
  2. Create the archive:
## Create requirements for `pip`
tee requirements.txt > /dev/null << EOF
arrow
jupyter
numpy
pandas
scikit-learn
EOF

## The name of the environment
env="python3-venv"

## Create the environment, using either the built-in `venv` module:
python3 -m venv ${env} --copies
## ...or `virtualenv`:
# virtualenv --python=/usr/bin/python3 ${env}

source ${env}/bin/activate
pip install -U pip
pip install -r requirements.txt
deactivate

## Archive the environment
cd ${env}
tar -zcf ../${env}.tar.gz *
cd ..
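
Before uploading, a quick look inside the archive (not part of the original steps) confirms the interpreter sits at the relative path Spark will call later (./environment/bin/python):

## The interpreter should appear at bin/python inside the archive
tar -tzf ${env}.tar.gz | grep -m1 '^bin/python'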

Or create archive using conda and conda-pack

  1. Install Anaconda or Conda on a host with the same operating system and CPU architecture as the cluster, such as an edge host.

  2. Create the archive:

## Create requirements for `conda`
tee requirements.txt > /dev/null << EOF
arrow
jupyter
numpy
pandas
scikit-learn
EOF

## The name of the environment
env="python3-venv"

## Create the environment
conda create -y -n ${env} python=3.7 ## May give a file-system error. If so, simply run again. This is due to conda not being initialized
conda activate ${env}
conda install -y -n ${env} --file requirements.txt
conda install -y -n ${env} -c conda-forge conda-pack

## Archive the environment
conda pack -f -n ${env} -o ${env}.tar.gz

conda deactivate
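
Optionally (and only as a rough check, not in the original steps), the packed environment can be smoke-tested locally before uploading; this assumes enough space under /tmp and that the packages do not need the activation scripts in order to import:

## Unpack to a scratch directory and import a couple of the requirements
mkdir -p /tmp/${env}-check
tar -xzf ${env}.tar.gz -C /tmp/${env}-check
/tmp/${env}-check/bin/python -c "import numpy, pandas; print('ok')"
rm -rf /tmp/${env}-check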

Use the archive (same steps for a virtualenv or conda environment):

  1. Put the archive in HDFS:
    • You may need to kinit first. This can be done as your own user.
hdfs dfs -put -f ${env}.tar.gz /share/python-envs/
hdfs dfs -chmod 0664 /share/python-envs/${env}.tar.gz
  2. Test it:
## Create test script
tee test.py > /dev/null << EOF
from pyspark import SparkConf
from pyspark import SparkContext

conf = SparkConf()
conf.setAppName('pyspark-test')
sc = SparkContext(conf=conf)

import numpy
print("Hello World!")
sc.parallelize(range(1,10)).map(lambda x : numpy.__version__).collect()
EOF

## Submit to Spark
## First, make sure no local virtualenv/conda environment is active (run whichever applies):
deactivate
conda deactivate

## Note: `--archives` can be used instead of `--conf spark.yarn.dist.archives`. I prefer to see the full conf statement.
spark-submit \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python \
--conf spark.yarn.dist.archives=hdfs:///share/python-envs/${env}.tar.gz#environment \
--master yarn \
--deploy-mode cluster \
test.py

## Check the logs: Update the `id` to the id of your job from above.
id=application_GetTheIdFromOutputOfCommandAbove
yarn logs -applicationId ${id} | grep "Hello World"
  3. Update Zeppelin %livy2.pyspark
  • Note: This only works with the YARN cluster deploy mode, which is the default in HDP3.
  • In Zeppelin: open the top-right menu -> Interpreters -> add the following to the livy2 interpreter:
livy.spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python
livy.spark.yarn.dist.archives=hdfs:///share/python-envs/python3-venv.tar.gz#environment
  • Test from a notebook:
%livy2.pyspark
import numpy
print("Hello World!")
sc.parallelize(range(1,10)).map(lambda x : numpy.__version__).collect()
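
The same archive can also back an interactive pyspark shell in client mode. A minimal sketch (not from the original gist): the driver uses a local interpreter while the executors use the shipped one, so keep the two at matching major.minor versions.

## Interactive shell in client mode: the driver runs the local python3,
## the executors run the unpacked archive
PYSPARK_DRIVER_PYTHON=python3 \
PYSPARK_PYTHON=./environment/bin/python \
pyspark \
--conf spark.yarn.dist.archives=hdfs:///share/python-envs/${env}.tar.gz#environment \
--master yarn \
--deploy-mode client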

How to: Create a virtualenv at run-time

  1. Install python3 and python-virtualenv

    • For this method, they must be installed on all hosts in the cluster.
    • See the archive instructions above for details on how to install.
  2. Example using pyspark shell:

PYSPARK_PYTHON=/bin/python3 \
pyspark \
--conf spark.pyspark.virtualenv.enabled=true \
--conf spark.pyspark.virtualenv.type=native \
--conf spark.pyspark.virtualenv.bin.path=/usr/bin/virtualenv \
--conf spark.pyspark.virtualenv.python_version=3.6 \
--master yarn \
--deploy-mode client

## then, in the shell, install packages:
sc.install_packages(["numpy"])
  3. Example using spark-submit:
    • Note the addition of a requirements file.
      • This is optional; you could call sc.install_packages inside test.py instead.
      • If referencing it from HDFS, make sure to upload it first (see the sketch after this step).
      • It can also be used with the pyspark shell above, but then the file must be on the local host you are executing pyspark from.
spark-submit \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=/usr/bin/python3 \
--conf spark.pyspark.virtualenv.enabled=true \
--conf spark.pyspark.virtualenv.type=native \
--conf spark.pyspark.virtualenv.bin.path=/usr/bin/virtualenv \
--conf spark.pyspark.virtualenv.python_version=3.6 \
--conf spark.pyspark.virtualenv.requirements=hdfs:///share/python-envs/python3-venv.requirements.txt \
--master yarn \
--deploy-mode cluster \
test.py
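
The requirements file referenced above must actually exist at that HDFS path. A minimal sketch, reusing the requirements.txt created earlier under the name expected by the submit command:

hdfs dfs -put -f requirements.txt /share/python-envs/python3-venv.requirements.txt
hdfs dfs -chmod 0664 /share/python-envs/python3-venv.requirements.txt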
  4. Update Zeppelin %livy2.pyspark

    • In Zeppelin: open the top-right menu -> Interpreters -> add the following to the livy2 interpreter:
livy.spark.yarn.appMasterEnv.PYSPARK_PYTHON=/usr/bin/python3
livy.spark.pyspark.virtualenv.enabled=true
livy.spark.pyspark.virtualenv.type=virtualenv
livy.spark.pyspark.virtualenv.bin.path=/usr/bin/virtualenv
livy.spark.pyspark.virtualenv.python_version=3.7
livy.spark.pyspark.virtualenv.requirements=hdfs:///share/python-envs/python3-venv.requirements.txt ## this is optional