Skip to content

Instantly share code, notes, and snippets.

@korkridake
Created January 3, 2019 07:58
Show Gist options
  • Star 3 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save korkridake/972e315e5ce094096e17c6ad1ef599fd to your computer and use it in GitHub Desktop.
Save korkridake/972e315e5ce094096e17c6ad1ef599fd to your computer and use it in GitHub Desktop.
How to melt Spark DataFrame?
from pyspark.sql.functions import array, col, explode, lit, struct
from pyspark.sql import DataFrame
from typing import Iterable
def melt(
df: DataFrame,
id_vars: Iterable[str], value_vars: Iterable[str],
var_name: str="variable", value_name: str="value") -> DataFrame:
"""
Convert :class:`DataFrame` from wide to long format.
Source: https://stackoverflow.com/questions/41670103/how-to-melt-spark-dataframe
"""
# -------------------------------------------------------------------------------
# Create array<struct<variable: str, value: ...>>
# -------------------------------------------------------------------------------
_vars_and_vals = array(*(
struct(lit(c).alias(var_name), col(c).alias(value_name))
for c in value_vars))
# -------------------------------------------------------------------------------
# Add to the DataFrame and explode
# -------------------------------------------------------------------------------
_tmp = df.withColumn("_vars_and_vals", explode(_vars_and_vals))
cols = id_vars + [
col("_vars_and_vals")[x].alias(x) for x in [var_name, value_name]]
return _tmp.select(*cols)
# -------------------------------------------------------------------------------
# Let's Implement Wide to Long in Pyspark!
# -------------------------------------------------------------------------------
melt(df_web_browsing_full_test,
id_vars=['ID_variable'],
value_vars=['VALUE_variable_1', 'VALUE_variable_2']).show()
@kebab-mai-haddi
Copy link

I am unable to use this solution here. Could you perhaps help me with the logic?

@jwhite-mscience
Copy link

from pyspark.sql.functions import array, col, explode, lit, struct

def melt(df, id_vars, value_vars, var_name, value_name):
    """Convert :class:`DataFrame` from wide to long format."""
    _vars_and_vals = F.array(*[F.struct(F.lit(c).alias(var_name),
                                        F.col(c).alias(value_name)) 
                               for c in value_vars])

    # Add to the DataFrame and explode
    _tmp = df.withColumn("_vars_and_vals", F.explode(_vars_and_vals))

    cols = id_vars + [F.col("_vars_and_vals")[x].alias(x) for x in [var_name, value_name]]
    return _tmp.select(*cols)

@glenlewis
Copy link

This function will fail if the variable has either a full stop in it or two adjacent spaces. Any idea how to exception them out?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment