@welly87
Created October 1, 2020 03:17

welly87 commented Oct 1, 2020

!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-3.0.1/spark-3.0.1-bin-hadoop2.7.tgz
!tar xf spark-3.0.1-bin-hadoop2.7.tgz
!pip install -q findspark
!pip install -q pyarrow

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.1-bin-hadoop2.7"
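
Later cells in this thread switch from Spark 3.0.1 to 2.4.7, so the download URL and SPARK_HOME have to change together. A minimal sketch of one way to keep them in sync; the helper name `spark_release` is hypothetical, and the archive.apache.org URL layout and the `/content` root are assumptions based on the cells in this thread:

```python
import os


def spark_release(version, hadoop="2.7", root="/content"):
    """Return (download_url, spark_home) for one Spark binary release."""
    name = f"spark-{version}-bin-hadoop{hadoop}"
    # Assumption: superseded releases are served from archive.apache.org.
    url = f"https://archive.apache.org/dist/spark/spark-{version}/{name}.tgz"
    # The tarball unpacks into a directory named after the release.
    return url, f"{root}/{name}"


url, spark_home = spark_release("3.0.1")
os.environ["SPARK_HOME"] = spark_home  # point Spark tooling at this release
print(url)
print(spark_home)
```

Calling `spark_release("2.4.7")` gives the matching pair for the older release, so SPARK_HOME does not keep pointing at a directory that has been removed.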

welly87 commented Oct 1, 2020

# Check the Spark version
!spark-3.0.1-bin-hadoop2.7/bin/spark-shell --version

welly87 commented Oct 1, 2020

import findspark
findspark.init("spark-3.0.1-bin-hadoop2.7")

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

welly87 commented Oct 1, 2020

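# inferSchema=True parses numeric columns as numbers; without it every column is a string
sdf = spark.read.csv("/content/sample_data/california_housing_train.csv", header=True, inferSchema=True)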

welly87 commented Oct 1, 2020

pdf = sdf.toPandas()
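
pyarrow is installed above, but Spark only uses Arrow for toPandas() when the corresponding setting is switched on, and the key was renamed between Spark 2.x and 3.x. A small sketch for picking the key by version; the helper name `arrow_conf_key` is hypothetical, and the commented lines assume the `spark` and `sdf` objects from the earlier cells:

```python
def arrow_conf_key(spark_version):
    """Return the config key that enables Arrow-based toPandas()."""
    major = int(spark_version.split(".")[0])
    if major >= 3:
        return "spark.sql.execution.arrow.pyspark.enabled"
    # Name used by Spark 2.3 and 2.4
    return "spark.sql.execution.arrow.enabled"


key = arrow_conf_key("3.0.1")
print(key)

# With a live session (assumed to exist from the cells above):
# spark.conf.set(key, "true")
# pdf = sdf.toPandas()
```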

welly87 commented Oct 1, 2020

sdf.createOrReplaceTempView("california_housing")

sqlDF = spark.sql("SELECT sum(population) FROM california_housing WHERE total_rooms > 1000")
sqlDF.head()

welly87 commented Oct 1, 2020

!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-2.4.7/spark-2.4.7-bin-hadoop2.7.tgz
!tar xf spark-2.4.7-bin-hadoop2.7.tgz

!pip install -q findspark
!pip install -q pyarrow

welly87 commented Oct 1, 2020

!rm -rf *.tgz
!rm -rf spark-3.0.1-bin-hadoop2.7/

welly87 commented Oct 1, 2020

!wget https://github.com/welly87/spark-load/raw/master/mysql-connector-java-8.0.14.jar
!mv /content/mysql-connector-java-8.0.14.jar /content/spark-2.4.7-bin-hadoop2.7/jars/mysql-connector-java-8.0.14.jar
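
With the connector jar in Spark's jars directory, a session started afterwards should be able to read MySQL tables over JDBC. A rough sketch of the reader options; the helper name `mysql_jdbc_options` and the host, database, table, and credentials are all placeholders, not values from this gist:

```python
def mysql_jdbc_options(host, database, table, user, password, port=3306):
    """Build the option dict for Spark's JDBC data source."""
    return {
        "url": f"jdbc:mysql://{host}:{port}/{database}",
        # Driver class shipped with MySQL Connector/J 8.x
        "driver": "com.mysql.cj.jdbc.Driver",
        "dbtable": table,
        "user": user,
        "password": password,
    }


# Placeholder connection details, for illustration only
opts = mysql_jdbc_options("db.example.com", "shop", "orders", "reader", "secret")
print(opts["url"])

# With a live session created after the jar was moved into place:
# mysql_df = spark.read.format("jdbc").options(**opts).load()
```

A session that was already running before the jar was copied will not see the driver, so the SparkSession may need to be stopped and recreated first.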