welly87/pyspark.sh

Created October 1, 2020 03:17

pyspark.sh

Author

welly87 commented Oct 1, 2020

!apt-get install openjdk-8-jdk-headless -qq > /dev/null
# Older Spark releases are served from archive.apache.org, not downloads.apache.org
!wget -q https://archive.apache.org/dist/spark/spark-3.0.1/spark-3.0.1-bin-hadoop2.7.tgz
!tar xf spark-3.0.1-bin-hadoop2.7.tgz
!pip install -q findspark
!pip install -q pyarrow

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.1-bin-hadoop2.7"
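Later comments in this thread switch from Spark 3.0.1 to 2.4.7, so it may help to derive the directory name and `SPARK_HOME` from a single version setting. The following is a minimal sketch, not part of the original gist: the helper names `spark_dist_name` and `spark_home` are hypothetical, and `/content` is assumed to be the Colab working directory, as in the paths above.

```python
import os

def spark_dist_name(spark_version, hadoop_version="2.7"):
    """Name of the unpacked Spark directory, e.g. spark-3.0.1-bin-hadoop2.7."""
    return f"spark-{spark_version}-bin-hadoop{hadoop_version}"

def spark_home(spark_version, hadoop_version="2.7", base_dir="/content"):
    """Full path of the unpacked Spark directory under base_dir (assumed /content)."""
    return f"{base_dir}/{spark_dist_name(spark_version, hadoop_version)}"

# Change this one value when switching Spark versions.
SPARK_VERSION = "3.0.1"

os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = spark_home(SPARK_VERSION)
```

With this in place, `findspark.init()` can be called without an argument, since it falls back to the `SPARK_HOME` environment variable.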

Author

welly87 commented Oct 1, 2020

Check the Spark version:
!spark-3.0.1-bin-hadoop2.7/bin/spark-shell --version

Author

welly87 commented Oct 1, 2020

import findspark
findspark.init("spark-3.0.1-bin-hadoop2.7")

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

Author

welly87 commented Oct 1, 2020

sdf = spark.read.csv("/content/sample_data/california_housing_train.csv", header=True)
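With only `header=True`, Spark reads every CSV column as a string. If numeric columns are wanted (for example for the `total_rooms > 1000` comparison further down), one option is to let Spark infer the types. A small sketch, where `read_csv_typed` is a hypothetical helper name and `spark` is assumed to be the session created above:

```python
def read_csv_typed(spark, path):
    """Read a CSV that has a header row and let Spark infer column types."""
    # inferSchema=True costs an extra pass over the file to determine types.
    return spark.read.csv(path, header=True, inferSchema=True)

# Usage in the notebook (assumes the `spark` session from the earlier comment):
# sdf = read_csv_typed(spark, "/content/sample_data/california_housing_train.csv")
# sdf.printSchema()
```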

Author

welly87 commented Oct 1, 2020

# Collect the whole Spark DataFrame to the driver as a pandas DataFrame
pdf = sdf.toPandas()
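`pyarrow` is installed in the setup cell, but Arrow-based conversion is off by default, so `toPandas()` does not use it unless it is switched on. The configuration key differs between the two Spark versions used in this thread. A hedged sketch follows; `arrow_conf_key` and `enable_arrow` are hypothetical helper names, and whether Arrow actually speeds things up for this dataset has not been measured here.

```python
def arrow_conf_key(spark_version):
    """Return the Arrow toggle for a Spark version string such as '3.0.1'."""
    major = int(spark_version.split(".")[0])
    if major >= 3:
        return "spark.sql.execution.arrow.pyspark.enabled"
    # Spark 2.x uses the older key name.
    return "spark.sql.execution.arrow.enabled"

def enable_arrow(spark):
    """Turn on Arrow-based conversion for an existing SparkSession."""
    spark.conf.set(arrow_conf_key(spark.version), "true")

# Usage in the notebook (assumes the `spark` session and `sdf` from above):
# enable_arrow(spark)
# pdf = sdf.toPandas()
```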

Author

welly87 commented Oct 1, 2020 (edited)

sdf.createOrReplaceTempView("california_housing")

sqlDF = spark.sql("SELECT sum(population) FROM california_housing WHERE total_rooms > 1000")
sqlDF.head()
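The same aggregation can also be written with the DataFrame API instead of a temp view and SQL string. This is a sketch of an equivalent, not the author's method; `population_sum` is a hypothetical helper name and `sdf` is assumed to be the DataFrame loaded above.

```python
def population_sum(sdf, min_rooms=1000):
    """DataFrame-API version of:
    SELECT sum(population) FROM california_housing WHERE total_rooms > <min_rooms>
    """
    # filter() accepts a SQL expression string; agg() accepts a {column: function} dict.
    return sdf.filter(f"total_rooms > {min_rooms}").agg({"population": "sum"})

# Usage in the notebook:
# row = population_sum(sdf).head()   # a single Row
# total = row[0]                     # the summed value itself
```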

Author

welly87 commented Oct 1, 2020

http://spark.apache.org/docs/latest/sql-getting-started.html

Author

welly87 commented Oct 1, 2020

http://spark.apache.org/docs/latest/ml-guide.html

Author

welly87 commented Oct 1, 2020

!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-2.4.7/spark-2.4.7-bin-hadoop2.7.tgz
!tar xf spark-2.4.7-bin-hadoop2.7.tgz

!pip install -q findspark
!pip install -q pyarrow

Author

welly87 commented Oct 1, 2020

!rm -rf *.tgz
!rm -rf spark-3.0.1-bin-hadoop2.7/

Author

welly87 commented Oct 1, 2020

https://relational.fit.cvut.cz/dataset/CCS

Author

welly87 commented Oct 1, 2020

https://github.com/hackathonBI/CCS

Author

welly87 commented Oct 1, 2020

!wget https://github.com/welly87/spark-load/raw/master/mysql-connector-java-8.0.14.jar
!mv /content/mysql-connector-java-8.0.14.jar /content/spark-2.4.7-bin-hadoop2.7/jars/mysql-connector-java-8.0.14.jar
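Once the connector JAR sits in Spark's `jars/` directory, a MySQL table can be read through Spark's JDBC data source. The JAR has to be in place before the `SparkSession` (and its JVM) is started, otherwise the driver class is unlikely to be found. The sketch below is an assumption about how the connector is meant to be used here: the helper names are hypothetical, and the host, database, table, user, and password are placeholders to be filled in from the dataset page linked above.

```python
def mysql_jdbc_options(host, database, table, user, password, port=3306):
    """Build the option dict for Spark's JDBC reader against a MySQL server."""
    return {
        "url": f"jdbc:mysql://{host}:{port}/{database}",
        "dbtable": table,
        "user": user,
        "password": password,
        # Driver class shipped with MySQL Connector/J 8.x
        "driver": "com.mysql.cj.jdbc.Driver",
    }

def read_mysql_table(spark, **connection):
    """Load one MySQL table as a Spark DataFrame via JDBC."""
    return spark.read.format("jdbc").options(**mysql_jdbc_options(**connection)).load()

# Usage in the notebook (all values below are placeholders):
# df = read_mysql_table(spark, host="<host>", database="<database>",
#                       table="<table>", user="<user>", password="<password>")
```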
