@rikturr
Created July 21, 2020 14:36
init spark
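from pyspark.sql import SparkSession

# Reuse the active SparkSession, or create one if none exists
spark = SparkSession.builder.getOrCreate()

# Read the January 2019 yellow-taxi trips and keep roughly 10% of the rows
taxi = spark.read.csv(
    's3://nyc-tlc/trip data/yellow_tripdata_2019-01.csv',
    header=True,
    inferSchema=True,
    timestampFormat='yyyy-MM-dd HH:mm:ss',
).sample(fraction=0.1, withReplacement=False)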
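Note that `sample(fraction=0.1, withReplacement=False)` does not return exactly 10% of the rows: each row is kept independently with probability `fraction`, so the result size only approximates it. The sketch below illustrates that idea in plain Python, without Spark. The helper name `bernoulli_sample` and the toy `rows` list are hypothetical, introduced only for illustration; this is not Spark's actual implementation.

```python
import random


def bernoulli_sample(rows, fraction, seed=None):
    """Keep each row independently with probability `fraction`.

    Hypothetical helper: a plain-Python illustration of sampling
    without replacement, not Spark's own code.
    """
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


# Toy stand-in for the taxi rows (assumption: 100,000 integer "rows")
rows = list(range(100_000))
sampled = bernoulli_sample(rows, fraction=0.1, seed=42)

# The sample size is close to, but generally not exactly, 10,000
print(len(sampled))
```

If a reproducible sample is needed in Spark itself, `sample` also accepts a `seed` argument.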