@pualien
Created April 13, 2021 06:41
Install Spark/Pyspark 3.1.1 with Google Cloud Storage Connector
#!/bin/bash
export CUSTOM_SPARK_VERSION="3.1.1"
export CUSTOM_PYSPARK_VERSION="3.1.1"
export CUSTOM_HADOOP_VERSION="3.2"
export CUSTOM_HADOOP_VERSION_INDEX="3"
export JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64"
export SPARK_HOME="/content/spark-$CUSTOM_SPARK_VERSION-bin-hadoop$CUSTOM_HADOOP_VERSION"
export CUSTOM_SPARK_JARS="$SPARK_HOME/jars"
apt-get update
# Install the Java 8 JDK, which PySpark and the Hadoop components need in order to run
apt-get install openjdk-8-jdk-headless -qq > /dev/null
# Download and unpack Spark. Run this from /content so the unpacked directory matches SPARK_HOME.
wget -q "https://archive.apache.org/dist/spark/spark-$CUSTOM_SPARK_VERSION/spark-$CUSTOM_SPARK_VERSION-bin-hadoop$CUSTOM_HADOOP_VERSION.tgz"
tar xf "spark-$CUSTOM_SPARK_VERSION-bin-hadoop$CUSTOM_HADOOP_VERSION.tgz"
pip install -q findspark
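# Download the GCS connector built for Hadoop 3 into Spark's jars directory.
# The local file name now matches the Hadoop 3 build that is downloaded.
wget -q -O "$CUSTOM_SPARK_JARS/gcs-connector-hadoop$CUSTOM_HADOOP_VERSION_INDEX-latest.jar" "https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop$CUSTOM_HADOOP_VERSION_INDEX-latest.jar"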
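
Because the script silences most output (`-q`, `-qq`, `> /dev/null`), a failed download can go unnoticed. The following is a minimal sketch of a post-install check, not part of the original gist. The function name `check_spark_install` is hypothetical, and it only assumes the directory layout the script above produces (`bin/spark-submit` and a `gcs-connector-*.jar` under `jars/`).

```shell
#!/bin/bash
# Hypothetical helper (not from the original script): checks that a Spark
# directory contains an executable spark-submit and a GCS connector jar.
check_spark_install() {
  spark_home="$1"

  if [ ! -x "$spark_home/bin/spark-submit" ]; then
    echo "missing spark-submit in $spark_home/bin"
    return 1
  fi

  # If the glob matches nothing it stays literal, so test each candidate with -e.
  found=0
  for jar in "$spark_home"/jars/gcs-connector-*.jar; do
    if [ -e "$jar" ]; then
      found=1
      break
    fi
  done

  if [ "$found" -ne 1 ]; then
    echo "missing GCS connector jar in $spark_home/jars"
    return 1
  fi

  echo "ok"
}
```

After the install script has run, calling `check_spark_install "$SPARK_HOME"` prints `ok` when both pieces are in place and otherwise names the missing one and returns a non-zero status.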