import numpy as np

# for a larger dataset where the visualization takes too long, print the
# percentage of missing values per column instead.
# assumes df is a pandas DataFrame loaded earlier.
for col in df.columns:
    pct_missing = np.mean(df[col].isnull())
    print('{} - {}%'.format(col, round(pct_missing*100)))
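
The per-column loop above can also be written without an explicit loop. The following is a minimal sketch, assuming a small stand-in frame named `toy` (hypothetical data, not the dataset used here); it relies on the fact that the mean of a boolean column is the fraction of `True` values.

```python
import numpy as np
import pandas as pd

# toy frame standing in for the real df (hypothetical data).
toy = pd.DataFrame({
    "life_sq": [30.0, np.nan, 45.0, np.nan],
    "full_sq": [40.0, 50.0, 60.0, 70.0],
})

# isnull() gives a boolean frame; the column-wise mean is the fraction missing.
pct_missing = toy.isnull().mean() * 100

# show the columns with the most missing values first.
print(pct_missing.round().sort_values(ascending=False))
```

Sorting the result makes the worst columns easy to spot when the frame has many columns.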

# life_sq has a lot of missing values.
# life_sq: living area in square meters, excluding loggias, balconies and other non-residential areas.
df['life_sq'].value_counts(dropna=False)

# summary statistics for life_sq (missing values are excluded).
df['life_sq'].describe()

# create an ismissing indicator variable for life_sq.
df['life_sq_ismissing'] = df['life_sq'].isnull()
df['life_sq_ismissing'].value_counts(dropna=False)

# view the original column next to its missing indicator.
df[['life_sq', 'life_sq_ismissing']]

# replace missing values with the median.
med = df['life_sq'].median()
print(med)
df['life_sq'] = df['life_sq'].fillna(med)

# impute the missing values and create the missing value indicator variables for each numeric column.
df_numeric = df.select_dtypes(include=[np.number])
numeric_cols = df_numeric.columns.values
for col in numeric_cols:
    missing = df[col].isnull()
    num_missing = np.sum(missing)
    if num_missing > 0:  # only do the imputation for the columns that have missing values.
        print('imputing missing values for: {}'.format(col))
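
The loop above ends at the `print` call, so the indicator and fill steps named in its comment are not shown. A minimal sketch of how they might look follows, assuming the same `_ismissing` naming and median fill used for `life_sq` earlier; the `toy` frame and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# hypothetical toy data; the real df comes from earlier in the walkthrough.
toy = pd.DataFrame({
    "life_sq": [30.0, np.nan, 50.0, 40.0],
    "full_sq": [40.0, 50.0, 60.0, 70.0],
    "sub_area": ["a", None, "b", "a"],
})

# collect the numeric column names before adding any indicator columns.
numeric_cols = toy.select_dtypes(include=[np.number]).columns
for col in numeric_cols:
    missing = toy[col].isnull()
    if missing.sum() > 0:
        # record which rows were missing before overwriting them.
        toy["{}_ismissing".format(col)] = missing
        # assumption: fill with the column median, as done for life_sq.
        toy[col] = toy[col].fillna(toy[col].median())

print(toy)
```

The indicator is stored before the fill because the missing positions cannot be recovered once the column has been overwritten.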

# impute the missing values and create the missing value indicator variables for each non-numeric column.
df_non_numeric = df.select_dtypes(exclude=[np.number])
non_numeric_cols = df_non_numeric.columns.values
for col in non_numeric_cols:
    missing = df[col].isnull()
    num_missing = np.sum(missing)
    if num_missing > 0:  # only do the imputation for the columns that have missing values.
        print('imputing missing values for: {}'.format(col))
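
The non-numeric loop above also stops at the `print` call, and the fill value is not shown. One common choice for categorical data is the most frequent value. The sketch below assumes that choice, which the code above does not confirm, and uses a hypothetical `toy` frame.

```python
import numpy as np
import pandas as pd

# hypothetical toy data with one categorical column that has a gap.
toy = pd.DataFrame({
    "sub_area": ["north", None, "south", "north"],
    "full_sq": [40.0, 50.0, 60.0, 70.0],
})

non_numeric_cols = toy.select_dtypes(exclude=[np.number]).columns
for col in non_numeric_cols:
    missing = toy[col].isnull()
    if missing.sum() > 0:
        # keep a record of the rows that were missing.
        toy["{}_ismissing".format(col)] = missing
        # assumption: fill with the most frequent value. mode() ignores
        # missing entries and returns sorted modes, so ties resolve to the first.
        top = toy[col].mode().iloc[0]
        toy[col] = toy[col].fillna(top)

print(toy)
```

Filling with a dedicated placeholder category is another option when the most frequent value would hide too much of the missingness.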

# first create missing indicator for features with missing data
for col in df.columns:
    missing = df[col].isnull()
    num_missing = np.sum(missing)
    if num_missing > 0:
        print('created missing indicator for: {}'.format(col))
        df['{}_ismissing'.format(col)] = missing

# drop rows with a lot of missing values (more than 35 here).
# num_missing is expected to hold each row's count of missing values.
ind_missing = df[df['num_missing'] > 35].index
df_less_missing_rows = df.drop(ind_missing, axis=0)
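
The snippet above relies on a `num_missing` column that is not defined in the code shown here. A minimal sketch of one way to derive it, assuming it is meant to be the number of missing cells in each row; the `toy` frame and the threshold of 2 are hypothetical.

```python
import numpy as np
import pandas as pd

# hypothetical toy data with a different number of gaps in each row.
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [np.nan, np.nan, 3.0, 4.0],
    "c": ["x", None, "z", "w"],
})

# assumption: num_missing is the number of missing cells in each row.
toy["num_missing"] = toy.isnull().sum(axis=1)

# the threshold of 2 is arbitrary for this toy frame; the snippet above uses 35.
ind_missing = toy[toy["num_missing"] > 2].index
toy_less_missing_rows = toy.drop(ind_missing, axis=0)

print(toy_less_missing_rows)
```

If the `_ismissing` indicator columns already exist, summing those columns row-wise gives the same count for the features they cover.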