@liannewriting
Last active January 21, 2020 15:54
data_cleaning_202001
import numpy as np

# Impute missing values and create a missing-value indicator column for each non-numeric column.
# Assumes `df` is an existing pandas DataFrame.
df_non_numeric = df.select_dtypes(exclude=[np.number])
non_numeric_cols = df_non_numeric.columns.values
for col in non_numeric_cols:
    missing = df[col].isnull()
    num_missing = np.sum(missing)
    if num_missing > 0:  # only impute columns that have missing values
        print('imputing missing values for: {}'.format(col))
        df['{}_ismissing'.format(col)] = missing
        top = df[col].describe()['top']  # impute with the most frequent value
        df[col] = df[col].fillna(top)
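
The snippet above depends on a `df` defined elsewhere. As a minimal, self-contained sketch of the same idea, the example below runs on a toy DataFrame. It is not the gist's exact code: it swaps `describe()['top']` for `Series.mode()`, and the helper name `impute_non_numeric` and the column names `city` and `price` are hypothetical, chosen only for illustration.

```python
import numpy as np
import pandas as pd


def impute_non_numeric(df):
    """Return a copy of df where missing values in non-numeric columns are
    filled with the column's most frequent value, plus a boolean
    '<col>_ismissing' indicator for each column that was imputed.
    (Hypothetical helper, not part of the original gist.)"""
    out = df.copy()
    for col in out.select_dtypes(exclude=[np.number]).columns:
        missing = out[col].isnull()
        if missing.any():
            out['{}_ismissing'.format(col)] = missing
            modes = out[col].mode(dropna=True)
            if not modes.empty:  # an all-missing column has no mode; leave it as is
                out[col] = out[col].fillna(modes.iloc[0])
    return out


# Toy data (assumed for illustration): one non-numeric and one numeric column.
toy = pd.DataFrame({
    'city': ['Oslo', None, 'Oslo', 'Rome'],
    'price': [1.0, 2.0, np.nan, 4.0],
})
result = impute_non_numeric(toy)
print(result)
```

Only `city` gets an indicator column and an imputed value here; the numeric `price` column is left untouched, matching the snippet's non-numeric scope. When several values tie for most frequent, `mode()` returns them sorted and this sketch takes the first.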