
@shawnhermans
Last active February 18, 2016 05:09
A script that runs Python-based Spark jobs by bundling the dependencies listed in a requirements.txt file. pip wheel builds (or downloads) wheel archives, which are just zip files, so spark-submit can ship them to the executors via --py-files. I haven't done extensive testing on this yet, but it seems to work.
#!/usr/bin/env bash
set -e  # abort if any step fails (e.g. cd into the temp dir)

# Build wheels for every requirement in a throwaway directory,
# then submit the job with the wheels attached via --py-files.
TMP_DIR=$(mktemp -d)
cp spark_test.py "${TMP_DIR}"
cp requirements.txt "${TMP_DIR}"
cd "${TMP_DIR}"
pip wheel -r requirements.txt

# ls -m prints a comma-separated list; strip spaces and newlines to get
# the single comma-delimited value that --py-files expects.
PY_FILES=$(ls -m *.whl | tr -d ' \n')
spark-submit --py-files "${PY_FILES}" spark_test.py
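
For reference, here's what the two input files might look like. The filenames come from the script above; the contents are hypothetical, using a pure-Python dependency (requests) as a stand-in:

requirements.txt:

requests

spark_test.py:

from pyspark import SparkContext

import requests  # supplied by one of the bundled wheels

sc = SparkContext(appName="spark_test")

# Run the import on the executors, not just the driver, to prove the
# wheels shipped via --py-files are importable cluster-wide.
urls = ["https://example.com"] * 4
status_codes = sc.parallelize(urls) \
    .map(lambda u: requests.get(u).status_code) \
    .collect()

print(status_codes)
sc.stop()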
@shawnhermans

Doesn't seem to play well with NumPy and SciPy, likely because their wheels contain compiled extension modules, which Python can't import from inside a zip archive. Going to do some testing on pure Python packages.
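
One way to spot the problematic packages up front: pure-Python wheels carry the none-any platform tag in their filename, while wheels with compiled extensions are tagged for a specific platform. A minimal sketch, run from the temp directory after pip wheel finishes:

import glob

# Pure-Python wheels end in "-none-any.whl"; anything else is built for a
# specific platform, and its compiled extensions won't import from a zip.
platform_wheels = [w for w in glob.glob("*.whl")
                   if not w.endswith("-none-any.whl")]
if platform_wheels:
    print("warning: platform-specific wheels present:", platform_wheels)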
