@stefanthoss
Last active January 28, 2020 19:34
Returns a new PySpark DataFrame containing the union of two DataFrames. Unlike a plain union(), this version works even when the two DataFrames have different columns or a different column order. If a column exists in only one of the DataFrames, its values are null in the rows that come from the other DataFrame.
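from pyspark.sql import functions as F


def advanced_dataframe_union(df1, df2):
    # Compare schemas as (column name, data type) pairs.
    df1_fields = set((f.name, f.dataType) for f in df1.schema)
    df2_fields = set((f.name, f.dataType) for f in df2.schema)

    # Add columns that exist only in df1 to df2 as typed nulls.
    df2 = df2.select(
        df2.columns
        + [
            F.lit(None).cast(datatype).alias(name)
            for name, datatype in df1_fields.difference(df2_fields)
        ]
    )

    # Add columns that exist only in df2 to df1 as typed nulls.
    df1 = df1.select(
        df1.columns
        + [
            F.lit(None).cast(datatype).alias(name)
            for name, datatype in df2_fields.difference(df1_fields)
        ]
    )

    # union() matches columns by position, so reorder df1 to df2's column order first.
    return df1.select(df2.columns).union(df2)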
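The schema comparison in the function above can be modeled without Spark. The following is a minimal plain-Python sketch, not the gist's code: rows are dicts, schemas are lists of (name, type) pairs, and the helper names `missing_fields` and `union_rows` as well as the sample columns are hypothetical. It shows which columns get null-filled, and it also illustrates a caveat: because fields are compared as (name, type) pairs, a column with the same name but a different type on each side counts as missing, so the PySpark function would likely try to add a second column with that name.

```python
def missing_fields(own_fields, other_fields):
    """Return the (name, type) pairs found in other_fields but not in own_fields."""
    return sorted(set(other_fields) - set(own_fields))


def union_rows(rows1, fields1, rows2, fields2):
    """Union two lists of dict rows, filling columns a row lacks with None."""
    # Output columns: everything in the first schema, then whatever only the second has.
    columns = [name for name, _ in fields1]
    columns += [name for name, _ in missing_fields(fields1, fields2)]
    return [{col: row.get(col) for col in columns} for row in rows1 + rows2]


# Hypothetical sample schemas with different columns and a different column order.
fields1 = [("id", "bigint"), ("name", "string")]
fields2 = [("score", "double"), ("id", "bigint")]
rows1 = [{"id": 1, "name": "a"}]
rows2 = [{"score": 0.5, "id": 2}]

result = union_rows(rows1, fields1, rows2, fields2)

# Caveat: the same column name with a different type is reported as missing.
conflict = missing_fields([("id", "bigint")], [("id", "string")])
```

Here `result` holds one row per input row, each with the keys `id`, `name`, and `score`, while `conflict` contains `("id", "string")` even though both sides already have an `id` column. If type mismatches are possible in your data, casting the affected columns to a common type before calling the function is one way to avoid the problem.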