@evanvolgas
Last active October 6, 2023 20:48
from pyspark.sql import SparkSession
import duckdb
spark = SparkSession.builder.appName("csv parser").getOrCreate()
sc = spark.sparkContext
df = spark.read.option("delimiter", ",").option("header", True).csv('AAPL.csv')
df.createOrReplaceTempView("df")
spark.sql("""
SELECT
min(volume_weighted_average_price) as min_price,
max(volume_weighted_average_price) as max_price
FROM df""").show()
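"""
Yields the wrong answers. No schema is given and inferSchema is not enabled,
so Spark reads every CSV column as a string, and min()/max() compare the
values lexicographically ("103.4718" sorts before "97.9313").
Adding .option("inferSchema", True) to the reader, or casting the column
to double in the query, gives numeric results.
+---------+---------+
|min_price|max_price|
+---------+---------+
| 103.4718|  97.9313|
+---------+---------+
"""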
duckdb.sql("""
SELECT
min(volume_weighted_average_price) as min_price,
max(volume_weighted_average_price) as max_price
FROM 'AAPL.csv'""").df()
"""
Yields the right answer, because DuckDB's CSV reader detects the column
as numeric:
   min_price  max_price
0     35.895   196.1507
"""
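
The gap between the two results can be reproduced without Spark or DuckDB. The sketch below is a minimal illustration, not part of the original gist: it takes the four prices that appear in the outputs above (assumed here to stand in for the full column) and compares them first as text, then as floats. The function names are hypothetical.

```python
# Sample values taken from the two outputs above; assumed to stand in
# for the full volume_weighted_average_price column.
prices_as_text = ["35.895", "103.4718", "97.9313", "196.1507"]


def min_max_as_strings(values):
    """Compare values as text, as happens when a CSV column stays untyped."""
    return min(values), max(values)


def min_max_as_numbers(values):
    """Convert to float first so the comparison is numeric."""
    numbers = [float(v) for v in values]
    return min(numbers), max(numbers)


string_result = min_max_as_strings(prices_as_text)
numeric_result = min_max_as_numbers(prices_as_text)

print(string_result)   # lexicographic: '1...' sorts first, '9...' sorts last
print(numeric_result)  # numeric: the true smallest and largest prices
```

The string comparison picks the same pair that Spark printed, and the float comparison picks the pair DuckDB printed, which suggests the difference lies in column typing rather than in the aggregation itself.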