Skip to content

Instantly share code, notes, and snippets.

@welly87
Created October 1, 2020 06:10
Show Gist options
  • Save welly87/08bf38934df8c2e385f6b4fa8f751b0f to your computer and use it in GitHub Desktop.
Save welly87/08bf38934df8c2e385f6b4fa8f751b0f to your computer and use it in GitHub Desktop.
@welly87
Copy link
Author

welly87 commented Oct 1, 2020

!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://downloads.apache.org/spark/spark-2.4.7/spark-2.4.7-bin-hadoop2.7.tgz
!tar xf spark-2.4.7-bin-hadoop2.7.tgz

!pip install -q findspark
!pip install -q pyarrow

!wget https://repo1.maven.org/maven2/org/apache/kudu/kudu-spark2_2.11/1.12.0/kudu-spark2_2.11-1.12.0.jar
!mv /content/kudu-spark2_2.11-1.12.0.jar /content/spark-2.4.7-bin-hadoop2.7/jars/kudu-spark2_2.11-1.12.0.jar

@welly87
Copy link
Author

welly87 commented Oct 1, 2020

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.7-bin-hadoop2.7"

#  --driver-class-path {driver_path} --jars {driver_path}

os.environ['PYSPARK_SUBMIT_ARGS'] = f'pyspark-shell --packages org.apache.kudu:kudu-spark2_2.11:1.12.0'

import findspark
findspark.init("spark-2.4.7-bin-hadoop2.7")

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Bea Cukai - PySpark/Kudu") \
    .config('spark.driver.bindAddress','localhost') \
    .config('spark.driver.allowMultipleContexts','true') \
    .config('spark.sql.execution.arrow.pyspark.enabled', 'true') \
    .getOrCreate()

@welly87
Copy link
Author

welly87 commented Oct 1, 2020

table = "transactions_1k"
kudu_master = "178.128.112.105:7051,178.128.112.105:7151,178.128.112.105:7251"
sfmta_kudu = spark.read.option("kudu.master", kudu_master).option("kudu.table", table).option("kudu.scanLocality", "leader_only").format("kudu").load()

sfmta_kudu.createOrReplaceTempView(table)

sdf = spark.sql("SELECT count(*) FROM " + table)

@welly87
Copy link
Author

welly87 commented Oct 1, 2020

@welly87
Copy link
Author

welly87 commented Oct 1, 2020

pdf = sdf.select("*").toPandas()
pdf.head()

@welly87
Copy link
Author

welly87 commented Oct 1, 2020

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment