Parse a JSON column in PySpark and expand the dict into columns
from pyspark.sql import functions as F
from pyspark.sql import types as T

json_col = 'json_col'
# either infer the features schema (`spark` is an existing SparkSession):
schema = spark.read.json(df.select(json_col).rdd.map(lambda x: x[0])).schema
# parse the features string into a struct with the inferred schema
df = df.withColumn(json_col, F.from_json(F.col(json_col), schema))
# access the feature columns by name
df.select(F.col(json_col)['some_key']).show()
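# or, if you already know the shape of the JSON (a flat dict in our case), declare the schema;
# use this instead of the inferred schema above, on the still-unparsed string column:
schema = T.MapType(T.StringType(), T.FloatType())
df = df.withColumn(json_col, F.from_json(F.col(json_col), schema))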
df.select(F.col(json_col)['some_key']).show()
# get all the feature names as a list (read from the first row only; needs the MapType schema)
current_keys = df.select(F.map_keys(json_col)).take(1)[0][0]
# expand the features into columns
for k in current_keys:
    df = df.withColumn(k, F.col(json_col)[k])
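
The expansion step above reads its keys with `take(1)`, that is, from a single row. The sketch below models that logic in plain Python, with no Spark involved, to show what happens when rows do not all share the same keys. The sample rows and the `expand` helper are hypothetical and exist only for illustration; `None` stands in for a missing value.

```python
import json

# Hypothetical sample rows: each holds a JSON string, like json_col in the snippet.
rows = [
    {"id": 1, "json_col": '{"a": 1.0, "b": 2.0}'},
    {"id": 2, "json_col": '{"a": 3.0, "c": 4.0}'},
]


def expand(rows, json_col, keys):
    """Parse json_col in every row and add one column per key (None if absent)."""
    expanded = []
    for row in rows:
        parsed = json.loads(row[json_col])
        new_row = dict(row)  # copy so the input rows stay untouched
        for k in keys:
            new_row[k] = parsed.get(k)
        expanded.append(new_row)
    return expanded


# Keys taken from the first row only, mirroring take(1) in the snippet.
first_row_keys = list(json.loads(rows[0]["json_col"]))

# Keys collected from every row, so no feature is silently dropped.
all_keys = sorted({k for row in rows for k in json.loads(row["json_col"])})

first_only = expand(rows, "json_col", first_row_keys)  # has no "c" column
complete = expand(rows, "json_col", all_keys)          # has "a", "b" and "c"
```

With first-row keys, the `c` feature of the second row never becomes a column. In Spark, one way to gather the full key set would be to apply `F.explode` to `F.map_keys(json_col)` and collect the distinct values before running the loop.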