Last active
January 21, 2020 15:54
data_cleaning_202001
import numpy as np

# Impute missing values in each non-numeric column and add a boolean
# "<col>_ismissing" indicator column for every column that gets imputed.
# Assumes `df` is an existing pandas DataFrame.
df_non_numeric = df.select_dtypes(exclude=[np.number])
non_numeric_cols = df_non_numeric.columns.values

for col in non_numeric_cols:
    missing = df[col].isnull()
    num_missing = np.sum(missing)

    if num_missing > 0:  # only impute columns that actually have missing values
        print('imputing missing values for: {}'.format(col))
        df['{}_ismissing'.format(col)] = missing

        top = df[col].describe()['top']  # most frequent value in the column
        df[col] = df[col].fillna(top)
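
The snippet above assumes a DataFrame named `df` already exists. As a rough illustration of what it does, the sketch below wraps the same logic in a function and runs it on a small made-up DataFrame. The function name `impute_non_numeric`, the toy column names, and the choice to work on a copy are assumptions for this example and are not part of the original gist.

```python
import numpy as np
import pandas as pd


def impute_non_numeric(df):
    """Hypothetical wrapper around the gist's loop; returns a modified copy."""
    df = df.copy()  # assumption: leave the caller's DataFrame untouched
    non_numeric_cols = df.select_dtypes(exclude=[np.number]).columns.values

    for col in non_numeric_cols:
        missing = df[col].isnull()
        if missing.sum() > 0:
            # Record which rows were originally missing.
            df['{}_ismissing'.format(col)] = missing
            # Fill the gaps with the most frequent value.
            top = df[col].describe()['top']
            df[col] = df[col].fillna(top)
    return df


# Made-up data: 'city' has one missing value, 'colour' has none,
# and 'price' is numeric, so the loop skips it.
toy = pd.DataFrame({
    'city': ['Paris', None, 'Paris', 'Rome'],
    'colour': ['red', 'blue', 'green', 'red'],
    'price': [1.0, np.nan, 3.0, 4.0],
})

result = impute_non_numeric(toy)
print(result)
```

Only `city` gets an indicator column here, because it is the only non-numeric column with a gap. The missing value in the numeric `price` column is left as-is, since this snippet handles non-numeric columns only.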