@jjstill
Last active February 4, 2024 15:58
Running a Spark job on local Kubernetes (minikube)
# Starting minikube with 8 GB of memory and 3 CPUs
minikube --memory 8192 --cpus 3 start
# Creating separate Namespace for Spark driver and executor pods
kubectl create namespace spark
# Creating ServiceAccount and ClusterRoleBinding for Spark
kubectl create serviceaccount spark-serviceaccount --namespace spark
kubectl create clusterrolebinding spark-rolebinding --clusterrole=edit --serviceaccount=spark:spark-serviceaccount --namespace=spark
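# Optional sanity check (a sketch, not part of the original gist): confirm the
# service account is allowed to create pods in the spark namespace
kubectl auth can-i create pods --as=system:serviceaccount:spark:spark-serviceaccount --namespace spark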
# Spark home dir
cd $SPARK_HOME
# Pointing the local shell at the Docker daemon inside Minikube
eval $(minikube docker-env)
# Building Docker image from provided Dockerfile
docker build -t spark:latest -f kubernetes/dockerfiles/spark/Dockerfile .
# Submitting SparkPi example job
# $KUBERNETES_MASTER can be taken from output of kubectl cluster-info
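# One possible way to capture it (a sketch, not part of the original gist):
KUBERNETES_MASTER=$(kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}')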
bin/spark-submit \
  --master k8s://$KUBERNETES_MASTER \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.namespace=spark \
  --conf spark.kubernetes.driver.pod.name=spark-pi-driver \
  --conf spark.kubernetes.container.image=spark:latest \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-serviceaccount \
  local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar
# Printing Spark driver's log
kubectl logs spark-pi-driver --namespace spark
# When the application completes, the executor pods terminate and are cleaned up,
# but the driver pod persists logs and remains in "completed" state.
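# To verify (a sketch, not part of the original gist): the driver pod should show
# STATUS "Completed" and no executor pods should remain
kubectl get pods --namespace spark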
# Deleting spark-pi-driver pod
kubectl delete pod spark-pi-driver --namespace spark
@michTalebzadeh

Hi,

this is great stuff thanks.

I am using the following command to create the Python version of the Docker image.

/opt/spark/bin/docker-image-tool.sh -r pytest-repo -t 3.1.1 -p ./kubernetes/dockerfiles/spark/bindings/python/Dockerfile build

The Docker image is created successfully. However, I want to use external packages like pyyaml etc.

I tried this spark-submit command:

        spark-submit --verbose \
           --master k8s://$K8S_SERVER \
           --archives=hdfs://$HDFS_HOST:$HDFS_PORT/minikube/codes/${pyspark_venv}.tar.gz#${pyspark_venv} \
           --deploy-mode cluster \
           --name pytest \
           --conf spark.kubernetes.namespace=spark \
           --conf spark.executor.instances=1 \
           --conf spark.kubernetes.driver.limit.cores=1 \
           --conf spark.executor.cores=1 \
           --conf spark.executor.memory=500m \
           --conf spark.kubernetes.container.image=${IMAGE} \
           --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-serviceaccount \
           --py-files hdfs://$HDFS_HOST:$HDFS_PORT/minikube/codes/DSBQ.zip \
           hdfs://$HDFS_HOST:$HDFS_PORT/minikube/codes/${APPLICATION}

This looks OK, and the Python packages need to be extracted from
--archives=hdfs://$HDFS_HOST:$HDFS_PORT/minikube/codes/${pyspark_venv}.tar.gz#${pyspark_venv} \

This is the unpacking statement:

Unpacking an archive hdfs://50.140.197.220:9000/minikube/codes/pyspark_venv.tar.gz#pyspark_venv from /tmp/spark-09e456aa-334e-4780-a780-80b21d09840a/pyspark_venv.tar.gz to /opt/spark/work-dir/./pyspark_venv

But it does not work because it cannot find any external package like pyyaml or numpy etc.

Traceback (most recent call last):
  File "/tmp/spark-09e456aa-334e-4780-a780-80b21d09840a/testyml.py", line 25, in <module>
    main()
  File "/tmp/spark-09e456aa-334e-4780-a780-80b21d09840a/testyml.py", line 22, in main
    import yaml
ModuleNotFoundError: No module named 'yaml'

Do you have any ideas on how I can make these external packages available inside minikube?

Thanks
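
One possible workaround (a sketch under assumptions, not from the original thread) is to bake the extra packages into the PySpark image itself and point spark-submit at the new tag. It assumes docker-image-tool.sh produced an image named pytest-repo/spark-py:3.1.1 and that the image runs as Spark's default UID 185:

# Hypothetical Dockerfile layering the missing packages on top of the PySpark image
cat > Dockerfile.extra <<'EOF'
FROM pytest-repo/spark-py:3.1.1
USER root
RUN pip3 install pyyaml numpy
USER 185
EOF
docker build -t pytest-repo/spark-py:3.1.1-extra -f Dockerfile.extra .

The job would then be submitted with --conf spark.kubernetes.container.image=pytest-repo/spark-py:3.1.1-extra. Alternatively, the venv unpacked from --archives can be selected by pointing PYSPARK_PYTHON at ./pyspark_venv/bin/python on both driver and executors (for example via spark.kubernetes.driverEnv.PYSPARK_PYTHON and spark.executorEnv.PYSPARK_PYTHON), provided the venv was built against the same Python version as the image.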

@ChenShuai1981

How can SparkLauncher be used to programmatically submit a Spark job to minikube? Any example would be appreciated.
