Simple PySpark CSV query example
myfile.csv:

quad,val
nw,0
ne,1
se,2
sw,3

query_csv.py:

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("csv-reader").getOrCreate()
# quiet down logging
spark.sparkContext.setLogLevel("WARN")
# read csv file with the DataFrame column names coming from the first row in the CSV
csv_df = spark.read.csv("/tmp/myfile.csv", header=True)
# you can filter your dataframe with a "where" SQL clause or use
# the select() function to pick out only certain columns
filtered = csv_df.where("quad == 'ne'")
# print the results as a table
filtered.show(truncate=False)
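
As the comments note, you can also pick out specific columns with select(), or run the same filter through Spark SQL. Here is a minimal sketch of both variations, assuming the csv_df DataFrame from query_csv.py above (the view name "quads" is just an illustrative choice):

# keep only the quad column, with the same filter applied
csv_df.select("quad").where("quad == 'ne'").show(truncate=False)

# or register the DataFrame as a temporary view and query it with SQL
csv_df.createOrReplaceTempView("quads")
spark.sql("SELECT quad, val FROM quads WHERE quad = 'ne'").show(truncate=False)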

To run this, install pyspark via:

pip install pyspark

(either globally or in a virtualenv)

You'll also need Java 8 installed and may need to set the JAVA_HOME environment variable.

Once that's done, save myfile.csv (above) to your /tmp directory and the Python script as query_csv.py. Then you can run

spark-submit query_csv.py

to run your Spark application and print the results.
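
By default spark.read.csv reads every column as a string. If you want typed columns (for example, to filter on val numerically), you can ask Spark to infer the schema when reading. A small sketch, again assuming the same /tmp/myfile.csv:

# re-read the file, letting Spark infer column types from the data
typed_df = spark.read.csv("/tmp/myfile.csv", header=True, inferSchema=True)
typed_df.printSchema()  # val should now show up as an integer column
typed_df.where("val >= 2").show(truncate=False)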

