namoopsoo/pandas manip.md

## pandas manip.md

      
    Raw
  

              pandas manip.md
            
          
    Drop duplicate cols


per https://stackoverflow.com/a/40435354/472876 ,

df = df.loc[:,~df.columns.duplicated()]
consolidate cols with same name,


somehow wasnt able to achieve this effect through simple merges...

df1 = pd.DataFrame({'a':[np.nan, 1, 2, np.nan]}, index=[0,1, 2,3])
df2 = pd.DataFrame({'a':[9, np.nan, np.nan, 4]}, index=[0,1, 2,3])
df3 = pd.DataFrame({'a':[9, 1, 2, 4]}, index=[0,1, 2,3])

In [87]: df1.merge(df2, how='outer', left_index=True, right_index=True)
Out[87]: 
   a_x  a_y
0  NaN  9.0
1  1.0  NaN
2  2.0  NaN
3  NaN  4.0


In [104]: df1[df1.a.notnull()]
Out[104]: 
     a
1  1.0
2  2.0

In [105]: df1[df1.a.notnull()].merge(
     ...:         df2[df2.a.notnull()],
     ...:         how='outer', left_index=True, right_index=True
     ...:         )
     ...: 
Out[105]: 
   a_x  a_y
0  NaN  9.0
1  1.0  NaN
2  2.0  NaN
3  NaN  4.0

In [106]: df1[df1.a.notnull()].merge(
     ...:         df2[df2.a.notnull()],
     ...:         how='outer', on=['a']
     ...:         )
     ...: 
Out[106]: 
     a
0  1.0
1  2.0
2  9.0
3  4.0

=> that seems fine ^^ but in fact, when merging on=['a'], the index is ignored, so there are cases where the above fails.
better solution...

def choose_non_null(val1, val2):
    if np.isnan(val1):
        return val2
    else:
        return val1

def selective_merge(df1, df2):
    '''
    If there is a column in common, be greedy about getting more data,
    when there are NaNs, when interleaving, throw out the NaNs.
    '''
    common_cols = list(set(df1.columns.tolist()) & set(df2.columns.tolist()))

    outdf = df1.merge(df2, how='outer', left_index=True, right_index=True
            )

    for col in common_cols:
        left, right = col + '_x', col + '_y'
        outdf[col] = outdf.apply(
                lambda row: choose_non_null(row[left], row[right]), axis=1)
        outdf = outdf.drop(labels=[left, right], axis=1)

    return outdf

example...

In [126]: selective_merge(df2, df3)
Out[126]: 
     a
0  9.0
1  1.0
2  2.0
3  4.0


In [130]: selective_merge(df1, df2)
Out[130]: 
     a
0  9.0
1  1.0
2  2.0
3  4.0