@ficolo
Last active May 22, 2019 23:15
SPOT developer's talk [24/05/2019]

Spark 101 - Federico López Gómez

Setting up your environment

Requirements

  • You need to have Java 8 installed. I tried to run Spark using Java 11 and it didn't work, so please make sure that your JAVA_HOME environment variable points to your Java 8 installation directory. On macOS you can check it like this:

    • To see where your JAVA_HOME is pointing:
      echo $JAVA_HOME
    • To find out which Java versions you have installed and where they are:
      /usr/libexec/java_home -V
    • Choose the Java 8 version and set it as your JAVA_HOME:
      export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_152.jdk/Contents/Home
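If you'd rather check from Python, a small sketch like the one below can tell you whether the active `java` is Java 8. The helper names are mine, not part of the talk; the only assumption is the standard Java convention that Java 8 version strings begin with `1.8`:

```python
import re
import subprocess

def is_java8(version_string):
    """Return True for Java 8 version strings, e.g. '1.8.0_152'."""
    return version_string.startswith("1.8")

def active_java_version():
    """Parse the version out of `java -version` (it prints to stderr)."""
    out = subprocess.run(
        ["java", "-version"], capture_output=True, text=True
    ).stderr
    match = re.search(r'version "([^"]+)"', out)
    return match.group(1) if match else ""

print(is_java8("1.8.0_152"))  # True
print(is_java8("11.0.2"))     # False
```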
  • Install Python 3+, if you don't have it already. Friendly reminder: Python 2.7 will not be maintained past 2020.

  • Create a directory to work in:

mkdir spot-dev-talk-spark && cd spot-dev-talk-spark
  • Create a virtual environment using your brand new Python 3 installation:
python3 -m venv .venv
  • Activate your awesome virtual environment:
source .venv/bin/activate
  • Now let's get some nice Python packages to help us get going:
pip install jupyter findspark

Setting up your Spark environment

  • First, download Apache Spark:
wget https://www-eu.apache.org/dist/spark/spark-2.4.3/spark-2.4.3-bin-hadoop2.7.tgz
tar -xzf spark-2.4.3-bin-hadoop2.7.tgz
  • Let's set environment variables for your Spark installation and the Spark local IP address:
export SPARK_HOME=$(pwd)/spark-2.4.3-bin-hadoop2.7/
export SPARK_LOCAL_IP="127.0.0.1"

Downloading the data

wget https://wwwdev.ebi.ac.uk/~federico/DR10.0.tar.gz
tar -xzf DR10.0.tar.gz

Download the tutorial Jupyter notebook:

wget 'https://wwwdev.ebi.ac.uk/~federico/SPOT Dev Talk - Spark 101.ipynb'

Start your Jupyter notebook

jupyter notebook

And open the SPOT Dev Talk - Spark 101.ipynb file. Run the code that is already there to make sure your Spark setup is OK.
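If everything is in place, a minimal smoke test along these lines should run in any notebook cell. This is a sketch, not the notebook's actual code: it assumes the pyspark that ships inside the Spark download, which `findspark.init()` puts on your path by reading SPARK_HOME:

```python
import findspark
findspark.init()  # uses SPARK_HOME to locate pyspark and add it to sys.path

from pyspark.sql import SparkSession

# Start a local Spark session and run a trivial job to confirm the setup works.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("smoke-test")
    .getOrCreate()
)
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])
df.show()
spark.stop()
```

If `df.show()` prints a small two-row table, your Java, Spark, and Python pieces are all talking to each other.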
