@oneryalcin
Last active September 23, 2019 21:07
3 Sparkify Read Data
# Read the data into Spark.
# Note: Ideally the data would be stored in a schema-aware format such as Parquet,
# which also supports partitioning, an important feature when ingesting big data.
# The data could also live in a distributed filesystem such as HDFS, or in a cloud
# storage bucket such as AWS S3 or Google Cloud Storage, for faster reads.
# Here we only read from local disk.
# (`spark` is assumed to be an existing SparkSession.)
data = spark.read.json('mini_sparkify_event_data.json')
# How many user activity rows do we have?
data.count()
#>> 286500
# Have a look at the inferred schema
data.printSchema()
#>>root
# |-- artist: string (nullable = true)
# |-- auth: string (nullable = true)
# |-- firstName: string (nullable = true)
# |-- gender: string (nullable = true)
# |-- itemInSession: long (nullable = true)
# |-- lastName: string (nullable = true)
# |-- length: double (nullable = true)
# |-- level: string (nullable = true)
# |-- location: string (nullable = true)
# |-- method: string (nullable = true)
# |-- page: string (nullable = true)
# |-- registration: long (nullable = true)
# |-- sessionId: long (nullable = true)
# |-- song: string (nullable = true)
# |-- status: long (nullable = true)
# |-- ts: long (nullable = true)
# |-- userAgent: string (nullable = true)
# |-- userId: string (nullable = true)
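
The note at the top suggests a schema-aware, partitioned format such as Parquet. A minimal sketch of that idea follows, assuming the inferred schema printed above is the one we want to keep. The names `FIELDS`, `SCHEMA_DDL`, `read_events`, and `write_partitioned` are hypothetical helpers, and partitioning by `level` is only an illustrative choice, not something taken from the original gist.

```python
# Column names and Spark SQL types, copied from the printSchema() output above.
FIELDS = [
    ("artist", "STRING"), ("auth", "STRING"), ("firstName", "STRING"),
    ("gender", "STRING"), ("itemInSession", "BIGINT"), ("lastName", "STRING"),
    ("length", "DOUBLE"), ("level", "STRING"), ("location", "STRING"),
    ("method", "STRING"), ("page", "STRING"), ("registration", "BIGINT"),
    ("sessionId", "BIGINT"), ("song", "STRING"), ("status", "BIGINT"),
    ("ts", "BIGINT"), ("userAgent", "STRING"), ("userId", "STRING"),
]

# DDL-formatted schema string; backticks guard against keyword clashes.
SCHEMA_DDL = ", ".join(f"`{name}` {dtype}" for name, dtype in FIELDS)


def read_events(spark, path):
    """Read the JSON events with an explicit schema instead of inferring it."""
    # `spark` is assumed to be an existing SparkSession.
    return spark.read.schema(SCHEMA_DDL).json(path)


def write_partitioned(df, out_dir, partition_col="level"):
    """Write the DataFrame as Parquet, partitioned by one column."""
    # Assumption: `level` is a reasonable low-cardinality partition key.
    df.write.mode("overwrite").partitionBy(partition_col).parquet(out_dir)
```

With a session available, this could be used as `write_partitioned(read_events(spark, 'mini_sparkify_event_data.json'), 'events_parquet')`, where the output directory name is again only a placeholder. Supplying the schema up front avoids the extra pass over the file that schema inference needs.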