Skip to content

Instantly share code, notes, and snippets.

@toddbirchard
Last active August 26, 2023 00:48
Helper function to compare two DataFrames and find rows which are unique or shared.
"""Find Symmetric Differences between two Pandas DataFrames."""
def dataframe_difference(df1, df2, which=None):
"""Find rows which are different."""
comparison_df = df1.merge(
df2,
indicator=True,
how='outer'
)
if which is None:
diff_df = comparison_df[comparison_df['_merge'] != 'both']
else:
diff_df = comparison_df[comparison_df['_merge'] == which]
diff_df.to_csv('data/diff.csv')
return diff_df
@99rig
Copy link

99rig commented Mar 5, 2020

Thank you!
excellent work!

@Per48edjes
Copy link

Super useful -- thank you!

@velascog
Copy link

velascog commented May 5, 2020

I am able to get a left or right but not both, what am I doing wrong?
(below is whole script)

import pandas as pd
def dataframe_difference(wk15, wk16, which=None):
"""Find rows which are different between two DataFrames."""
comparison_df = wk15.merge(wk16,indicator=True, how='outer')
if which is None:
diff_df = comparison_df[comparison_df['_merge'] != 'both']
else:
diff_df = comparison_df[comparison_df['_merge'] == which]
diff_df.to_csv('diff.csv')
return diff_df

wk15 = pd.read_excel('GRNI-wk15.xlsx', sheet_name='GRNI-wk15')
wk16 = pd.read_excel('GRNI-wk16.xlsx', sheet_name='GRNI-wk16')

dataframe_difference(wk15,wk16)

@randallscott25
Copy link

Yes, I am receiving the same results as Velascog;
Can you please explain?

@tapendra-joshi
Copy link

I am also getting the same issue. For small data sets its working file, I have tested it for 50 rows but for around 25k rows. It breaks :(

@TimLuijt
Copy link

TimLuijt commented Aug 23, 2023

Thanks for this helper function!

Might as well help with some questions while I am here:

I am able to get a left or right but not both, what am I doing wrong? (below is whole script)

import pandas as pd def dataframe_difference(wk15, wk16, which=None): """Find rows which are different between two DataFrames.""" comparison_df = wk15.merge(wk16,indicator=True, how='outer') if which is None: diff_df = comparison_df[comparison_df['_merge'] != 'both'] else: diff_df = comparison_df[comparison_df['_merge'] == which] diff_df.to_csv('diff.csv') return diff_df

wk15 = pd.read_excel('GRNI-wk15.xlsx', sheet_name='GRNI-wk15') wk16 = pd.read_excel('GRNI-wk16.xlsx', sheet_name='GRNI-wk16')

dataframe_difference(wk15,wk16)

The code is selecting !='both' (not equal to 'both') when not passing in a which paramater value - so you only get left_only and right_only.
To get the values in 'both' dataframes try:
dataframe_difference(wk15,wk16,which='both')

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment