Skip to content

Instantly share code, notes, and snippets.

@namoopsoo
Created June 18, 2017 21:43
Show Gist options
  • Save namoopsoo/4d0bf8997c9ae7ba683769f5ca296fa0 to your computer and use it in GitHub Desktop.
Save namoopsoo/4d0bf8997c9ae7ba683769f5ca296fa0 to your computer and use it in GitHub Desktop.
pandas manip

Drop duplicate cols

df = df.loc[:,~df.columns.duplicated()]

consolidate cols with same name,

  • somehow wasnt able to achieve this effect through simple merges...
df1 = pd.DataFrame({'a':[np.nan, 1, 2, np.nan]}, index=[0,1, 2,3])
df2 = pd.DataFrame({'a':[9, np.nan, np.nan, 4]}, index=[0,1, 2,3])
df3 = pd.DataFrame({'a':[9, 1, 2, 4]}, index=[0,1, 2,3])

In [87]: df1.merge(df2, how='outer', left_index=True, right_index=True)
Out[87]: 
   a_x  a_y
0  NaN  9.0
1  1.0  NaN
2  2.0  NaN
3  NaN  4.0


In [104]: df1[df1.a.notnull()]
Out[104]: 
     a
1  1.0
2  2.0

In [105]: df1[df1.a.notnull()].merge(
     ...:         df2[df2.a.notnull()],
     ...:         how='outer', left_index=True, right_index=True
     ...:         )
     ...: 
Out[105]: 
   a_x  a_y
0  NaN  9.0
1  1.0  NaN
2  2.0  NaN
3  NaN  4.0

In [106]: df1[df1.a.notnull()].merge(
     ...:         df2[df2.a.notnull()],
     ...:         how='outer', on=['a']
     ...:         )
     ...: 
Out[106]: 
     a
0  1.0
1  2.0
2  9.0
3  4.0
  • => that seems fine ^^ but in fact, when merging on=['a'], the index is ignored, so there are cases where the above fails.
  • better solution...
def choose_non_null(val1, val2):
    if np.isnan(val1):
        return val2
    else:
        return val1

def selective_merge(df1, df2):
    '''
    If there is a column in common, be greedy about getting more data,
    when there are NaNs, when interleaving, throw out the NaNs.
    '''
    common_cols = list(set(df1.columns.tolist()) & set(df2.columns.tolist()))

    outdf = df1.merge(df2, how='outer', left_index=True, right_index=True
            )

    for col in common_cols:
        left, right = col + '_x', col + '_y'
        outdf[col] = outdf.apply(
                lambda row: choose_non_null(row[left], row[right]), axis=1)
        outdf = outdf.drop(labels=[left, right], axis=1)

    return outdf
  • example...
In [126]: selective_merge(df2, df3)
Out[126]: 
     a
0  9.0
1  1.0
2  2.0
3  4.0


In [130]: selective_merge(df1, df2)
Out[130]: 
     a
0  9.0
1  1.0
2  2.0
3  4.0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment