Skip to content

Instantly share code, notes, and snippets.

@LeonardAukea
Created November 15, 2017 00:37
Show Gist options
  • Save LeonardAukea/9aaea595fb050591b4285e7f64f0855a to your computer and use it in GitHub Desktop.
Save LeonardAukea/9aaea595fb050591b4285e7f64f0855a to your computer and use it in GitHub Desktop.
Pyspark: dual explode generator
def dualExplode(row):
"""Explode weights and category_ids list elements to separate rows.
Args:
row: Row
Yield:
Row(**newDict)
"""
rowDict = row.asDict()
xList = rowDict.pop('x')
yList = rowDict.pop('y')
for x,y in zip(xList, yList):
newDict = dict(rowDict)
newDict['category_ids'] = x
newDict['weights'] = y
yield Row(**newDict)
# Example usage
exploded_df = sqlContext.createDataFrame(df.rdd.flatMap(dualExplode))
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment