Installing PySpark and Java on CentOS

Install Java OpenJDK

  1. sudo yum install java-1.8.0-openjdk
  2. java -version
  3. sudo /usr/sbin/alternatives --config java
    • Choose >> /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.191.b12-1.el7_6.x86_64
  4. nano ~/.bashrc
  5. Append the following line and save:
    • export JAVA_HOME="/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.191.b12-1.el7_6.x86_64/jre"
  6. source ~/.bashrc
  7. echo $JAVA_HOME (a quick Python-side check follows below)
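
PySpark launches the JVM through JAVA_HOME, so it is worth confirming that the variable is also visible from Python. A minimal sketch, assuming the OpenJDK path chosen above:

import os
import subprocess

# JAVA_HOME should match the path exported in ~/.bashrc; PySpark's
# launcher scripts use it to locate the java binary.
print(os.environ.get("JAVA_HOME"))

# java -version prints to stderr; an error here means the
# alternatives configuration above did not take effect.
subprocess.run(["java", "-version"], check=True)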

Install PySpark

  1. Create a folder for Spark and move into it
    • mkdir ~/spark && cd ~/spark
  2. Download Spark 2.3.3 with wget, curl, or aria2, e.g.:
    • wget https://archive.apache.org/dist/spark/spark-2.3.3/spark-2.3.3-bin-hadoop2.7.tgz
  3. Extract files
    • tar -zxvf spark-2.3.3-bin-hadoop2.7.tgz
  4. Append the following lines to the ~/.bashrc file:
    • export SPARK_HOME="/home/[yourusername]/spark/spark-2.3.3-bin-hadoop2.7"
    • export PATH="$SPARK_HOME/bin:$PATH"
  5. source ~/.bashrc
  6. Check PySpark by starting the shell and inspecting the context (see the smoke test after this list):
    • pyspark
    • sc
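
Typing sc at the pyspark prompt should print a live SparkContext. A one-line job exercises the whole stack; a minimal sketch:

# Run inside the pyspark shell: sum the integers 0..99.
sc.parallelize(range(100)).sum()  # expected result: 4950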

Setting Up Spark in Jupyter Notebook Manually

(This assumes the pyspark module is importable in the notebook's Python environment, e.g. installed with pip or exposed via findspark.)

from pyspark import SparkContext, SparkConf

# Run Spark locally on all available cores; without an explicit master,
# creating a SparkContext outside spark-submit fails.
conf = SparkConf().setAppName('pyspark').setMaster('local[*]')
sc = SparkContext(conf=conf)

and test it:

sc
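
Evaluating sc should display the SparkContext created above. To verify that work is actually distributed, a small job helps; a minimal sketch:

# Distribute a list across the local cores, square each element, and
# collect the results back to the driver.
rdd = sc.parallelize([1, 2, 3, 4])
print(rdd.map(lambda x: x * x).collect())  # [1, 4, 9, 16]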

Setting Up Automatically

  1. nano ~/.bashrc
  2. Append the following lines:

export PYSPARK_SUBMIT_ARGS="pyspark-shell"
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'

  3. source ~/.bashrc
  4. Run Jupyter Notebook with the following command:

pyspark
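
With those variables in place, pyspark starts a Jupyter Notebook server instead of the plain shell, and sc is pre-created in every notebook. A quick sanity check in a new cell; a minimal sketch:

# sc is injected by the pyspark launcher; no manual SparkContext needed.
print(sc.version)  # should print 2.3.3
print(sc.master)   # the master URL the notebook was started with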
