Last active
September 23, 2019 21:07
3 Sparkify Read Data
# Read data into Spark (`spark` is assumed to be an existing SparkSession).
# Note: Ideally the data would be in a format that stores its own schema, like Parquet,
# which also supports partitioning, something very important when ingesting big data.
# The data could also be placed in a distributed filesystem like HDFS, or in a cloud
# provider storage bucket like AWS S3 / Google Cloud Storage, for faster reads.
# Here we only read from local disk.
data = spark.read.json('mini_sparkify_event_data.json')
# How many user activity rows do we have?
data.count()
#>> 286500
# Have a look at the inferred schema
data.printSchema()
#>>root
# |-- artist: string (nullable = true)
# |-- auth: string (nullable = true)
# |-- firstName: string (nullable = true)
# |-- gender: string (nullable = true)
# |-- itemInSession: long (nullable = true)
# |-- lastName: string (nullable = true)
# |-- length: double (nullable = true)
# |-- level: string (nullable = true)
# |-- location: string (nullable = true)
# |-- method: string (nullable = true)
# |-- page: string (nullable = true)
# |-- registration: long (nullable = true)
# |-- sessionId: long (nullable = true)
# |-- song: string (nullable = true)
# |-- status: long (nullable = true)
# |-- ts: long (nullable = true)
# |-- userAgent: string (nullable = true)
# |-- userId: string (nullable = true)
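
The note at the top recommends a schema-aware format, because reading JSON without a schema makes Spark scan the file to infer one. A minimal sketch of one way to avoid that is shown here: it turns the column names and types printed above into a DDL schema string, which `spark.read.schema(...)` accepts in recent Spark versions. The helper `to_ddl`, the names `SPARKIFY_COLUMNS` and `DDL_TYPES`, the output path, and the choice of `level` as a partition column are all assumptions made for illustration and are not part of the original gist.

```python
# Column names and types copied from the inferred schema printed above.
SPARKIFY_COLUMNS = [
    ("artist", "string"), ("auth", "string"), ("firstName", "string"),
    ("gender", "string"), ("itemInSession", "long"), ("lastName", "string"),
    ("length", "double"), ("level", "string"), ("location", "string"),
    ("method", "string"), ("page", "string"), ("registration", "long"),
    ("sessionId", "long"), ("song", "string"), ("status", "long"),
    ("ts", "long"), ("userAgent", "string"), ("userId", "string"),
]

# Mapping from printSchema() type names to Spark SQL DDL type names.
DDL_TYPES = {"string": "STRING", "long": "BIGINT", "double": "DOUBLE"}


def to_ddl(columns):
    """Build a DDL schema string; names are backtick-quoted to avoid keyword clashes."""
    return ", ".join(f"`{name}` {DDL_TYPES[dtype]}" for name, dtype in columns)


schema_ddl = to_ddl(SPARKIFY_COLUMNS)

# Usage with an existing SparkSession named `spark` (not executed in this sketch):
# data = spark.read.schema(schema_ddl).json('mini_sparkify_event_data.json')
#
# Hypothetical follow-up: store the events as Parquet, partitioned by `level`
# (the output path and the partition column are assumptions):
# data.write.mode('overwrite').partitionBy('level').parquet('sparkify_events.parquet')
```

Passing the schema explicitly skips inference and also fixes the column types up front, so a malformed record cannot silently change an inferred type. Whether `level` is a sensible partition column depends on how the data is queried later.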