Skip to content

Instantly share code, notes, and snippets.

@mlgill
Last active March 8, 2023 18:19
Show Gist options
  • Star 12 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save mlgill/11334821 to your computer and use it in GitHub Desktop.
Save mlgill/11334821 to your computer and use it in GitHub Desktop.
A modified version of pandas merge command that will replace overlapping columns not associated with the join rather than appending a suffix.
import pandas as pa
def rmerge(left,right,**kwargs):
"""Perform a merge using pandas with optional removal of overlapping
column names not associated with the join.
Though I suspect this does not adhere to the spirit of pandas merge
command, I find it useful because re-executing IPython notebook cells
containing a merge command does not result in the replacement of existing
columns if the name of the resulting DataFrame is the same as one of the
two merged DataFrames, i.e. data = pa.merge(data,new_dataframe). I prefer
this command over pandas df.combine_first() method because it has more
flexible join options.
The column removal is controlled by the 'replace' flag which is
'left' (default) or 'right' to remove overlapping columns in either the
left or right DataFrame. If 'replace' is set to None, the default
pandas behavior will be used. All other parameters are the same
as pandas merge command.
Examples
--------
>>> left >>> right
a b c a c d
0 1 4 9 0 1 7 13
1 2 5 10 1 2 8 14
2 3 6 11 2 3 9 15
3 4 7 12
>>> rmerge(left,right,on='a')
a b c d
0 1 4 7 13
1 2 5 8 14
2 3 6 9 15
>>> rmerge(left,right,on='a',how='left')
a b c d
0 1 4 7 13
1 2 5 8 14
2 3 6 9 15
3 4 7 NaN NaN
>>> rmerge(left,right,on='a',how='left',replace='right')
a b c d
0 1 4 9 13
1 2 5 10 14
2 3 6 11 15
3 4 7 12 NaN
>>> rmerge(left,right,on='a',how='left',replace=None)
a b c_x c_y d
0 1 4 9 7 13
1 2 5 10 8 14
2 3 6 11 9 15
3 4 7 12 NaN NaN
"""
# Function to flatten lists from http://rosettacode.org/wiki/Flatten_a_list#Python
def flatten(lst):
return sum( ([x] if not isinstance(x, list) else flatten(x) for x in lst), [] )
# Set default for removing overlapping columns in "left" to be true
myargs = {'replace':'left'}
myargs.update(kwargs)
# Remove the replace key from the argument dict to be sent to
# pandas merge command
kwargs = {k:v for k,v in myargs.iteritems() if k is not 'replace'}
if myargs['replace'] is not None:
# Generate a list of overlapping column names not associated with the join
skipcols = set(flatten([v for k, v in myargs.iteritems() if k in ['on','left_on','right_on']]))
leftcols = set(left.columns)
rightcols = set(right.columns)
dropcols = list((leftcols & rightcols).difference(skipcols))
# Remove the overlapping column names from the appropriate DataFrame
if myargs['replace'].lower() == 'left':
left = left.copy().drop(dropcols,axis=1)
elif myargs['replace'].lower() == 'right':
right = right.copy().drop(dropcols,axis=1)
df = pa.merge(left,right,**kwargs)
return df
@wendysun0315
Copy link

Can this be used in dataframes with millions of rows?

Copy link

ghost commented Dec 19, 2017

i recently found myself in need of this behavior and implemented something similar. do you know if pandas has expressed explicit interest or disinterest in a kwarg with this functionality?

@mc-buckets
Copy link

This is huge...been looking for this exact behavior. Thanks!

The snippet is compatible with Python 3 if iteritems() is changed to items()

@c1carpenter
Copy link

if i use how=outer
it doesn't fill it with either or?

@HASumner
Copy link

I'm actually surprised that this isn't already an option in Panda's 'update' method, using an 'on=' argument or something. I used to use an update function in SAS programming all the time.

@xgerrmann
Copy link

This is a golden nugget. Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment