@mkaranasou
Last active November 2, 2018 11:57
Function to union PySpark DataFrames with different columns
from pyspark.sql import functions as F


def union_uneven(df_base, df_new, default=None):
    """
    Union two DataFrames that have different columns.
    :param pyspark.sql.DataFrame df_base: the dataframe to union onto
    :param pyspark.sql.DataFrame df_new: the dataframe to be appended
    :param default: the value used to fill columns missing from either dataframe
    :return: the union of the two dataframes, with the missing columns filled with the default value
    :rtype: pyspark.sql.DataFrame
    """
    base_columns = set(df_base.columns)
    df_new_columns = set(df_new.columns)
    # add to df_new the columns that exist only in df_base
    for c in base_columns.difference(df_new_columns):
        df_new = df_new.withColumn(c, F.lit(default))
    # add to df_base the columns that exist only in df_new
    for c in df_new_columns.difference(base_columns):
        df_base = df_base.withColumn(c, F.lit(default))
    # union matches columns by position, not by name, so select both in the same order:
    return df_base.select(df_base.columns).union(df_new.select(df_base.columns))