Skip to content

Instantly share code, notes, and snippets.

@toddbirchard
Last active August 26, 2023 00:48
Show Gist options
  • Save toddbirchard/9f6f1508cb24e78315778837a7d31328 to your computer and use it in GitHub Desktop.
Save toddbirchard/9f6f1508cb24e78315778837a7d31328 to your computer and use it in GitHub Desktop.
Helper function to compare two DataFrames and find rows which are unique or shared.
"""Find Symmetric Differences between two Pandas DataFrames."""
def dataframe_difference(df1, df2, which=None):
"""Find rows which are different."""
comparison_df = df1.merge(
df2,
indicator=True,
how='outer'
)
if which is None:
diff_df = comparison_df[comparison_df['_merge'] != 'both']
else:
diff_df = comparison_df[comparison_df['_merge'] == which]
diff_df.to_csv('data/diff.csv')
return diff_df
@Per48edjes
Copy link

Super useful -- thank you!

@velascog
Copy link

velascog commented May 5, 2020

I am able to get a left or right but not both, what am I doing wrong?
(below is whole script)

import pandas as pd
def dataframe_difference(wk15, wk16, which=None):
"""Find rows which are different between two DataFrames."""
comparison_df = wk15.merge(wk16,indicator=True, how='outer')
if which is None:
diff_df = comparison_df[comparison_df['_merge'] != 'both']
else:
diff_df = comparison_df[comparison_df['_merge'] == which]
diff_df.to_csv('diff.csv')
return diff_df

wk15 = pd.read_excel('GRNI-wk15.xlsx', sheet_name='GRNI-wk15')
wk16 = pd.read_excel('GRNI-wk16.xlsx', sheet_name='GRNI-wk16')

dataframe_difference(wk15,wk16)

@randallscott25
Copy link

Yes, I am receiving the same results as Velascog;
Can you please explain?

@tapendra-joshi
Copy link

I am also getting the same issue. For small data sets its working file, I have tested it for 50 rows but for around 25k rows. It breaks :(

@TimLuijt
Copy link

TimLuijt commented Aug 23, 2023

Thanks for this helper function!

Might as well help with some questions while I am here:

I am able to get a left or right but not both, what am I doing wrong? (below is whole script)

import pandas as pd def dataframe_difference(wk15, wk16, which=None): """Find rows which are different between two DataFrames.""" comparison_df = wk15.merge(wk16,indicator=True, how='outer') if which is None: diff_df = comparison_df[comparison_df['_merge'] != 'both'] else: diff_df = comparison_df[comparison_df['_merge'] == which] diff_df.to_csv('diff.csv') return diff_df

wk15 = pd.read_excel('GRNI-wk15.xlsx', sheet_name='GRNI-wk15') wk16 = pd.read_excel('GRNI-wk16.xlsx', sheet_name='GRNI-wk16')

dataframe_difference(wk15,wk16)

The code is selecting !='both' (not equal to 'both') when not passing in a which paramater value - so you only get left_only and right_only.
To get the values in 'both' dataframes try:
dataframe_difference(wk15,wk16,which='both')

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment