df[['life_sq', 'life_sq_ismissing']]
# drop rows with a lot of missing values.
ind_missing = df[df['num_missing'] > 35].index
df_less_missing_rows = df.drop(ind_missing, axis=0)
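The snippet above relies on a `num_missing` column that holds the number of missing values in each row. A minimal sketch of the whole step on a toy frame follows; the data, the `toy` names, and the threshold of 1 are assumptions made for illustration, and computing `num_missing` with `isnull().sum(axis=1)` is one plausible way to build that column rather than a confirmed detail of the original.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real dataset (hypothetical values).
toy = pd.DataFrame({
    'full_sq': [43.0, np.nan, 50.0],
    'life_sq': [27.0, np.nan, np.nan],
    'floor': [4.0, np.nan, 2.0],
})

# Count missing values per row (assumed construction of 'num_missing').
toy['num_missing'] = toy.isnull().sum(axis=1)

# A threshold of 1 suits this tiny frame; the snippet above uses 35.
ind_missing = toy[toy['num_missing'] > 1].index
toy_less_missing_rows = toy.drop(ind_missing, axis=0)

print(toy.shape)
print(toy_less_missing_rows.shape)
```

Only the row whose missing count exceeds the threshold is dropped; rows with a few gaps stay and can be handled by imputation instead.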
# impute the missing values and create the missing value indicator variables for each numeric column.
df_numeric = df.select_dtypes(include=[np.number])
numeric_cols = df_numeric.columns.values
for col in numeric_cols:
    missing = df[col].isnull()
    num_missing = np.sum(missing)
    if num_missing > 0:  # only do the imputation for the columns that have missing values.
        print('imputing missing values for: {}'.format(col))
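The loop as shown ends at the `print` call, so the indicator and imputation steps named in its comment are not visible here. One way the body might continue is sketched below on a toy frame. The `_ismissing` suffix follows the `life_sq_ismissing` column used earlier; filling with the column median is an assumption, not something this excerpt confirms.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real dataset (hypothetical values).
toy = pd.DataFrame({
    'life_sq': [27.0, np.nan, 33.0, 29.0],
    'floor': [4.0, 2.0, 7.0, 1.0],
})

numeric_cols = toy.select_dtypes(include=[np.number]).columns.values
for col in numeric_cols:
    missing = toy[col].isnull()
    num_missing = np.sum(missing)
    if num_missing > 0:  # only touch columns that have missing values
        print('imputing missing values for: {}'.format(col))
        # Record which rows were missing before they are filled in.
        toy['{}_ismissing'.format(col)] = missing
        # Assumed imputation strategy: replace gaps with the column median.
        med = toy[col].median()
        toy[col] = toy[col].fillna(med)

print(toy)
```

Keeping the indicator column lets a later model distinguish imputed values from observed ones.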
# impute the missing values and create the missing value indicator variables for each non-numeric column.
df_non_numeric = df.select_dtypes(exclude=[np.number])
non_numeric_cols = df_non_numeric.columns.values
for col in non_numeric_cols:
    missing = df[col].isnull()
    num_missing = np.sum(missing)
    if num_missing > 0:  # only do the imputation for the columns that have missing values.
        print('imputing missing values for: {}'.format(col))
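For non-numeric columns a median is not defined, so a different fill value is needed. The sketch below assumes the most frequent value (the mode) is used; that choice, the toy data, and the `toy` names are illustrative assumptions rather than details taken from this excerpt.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real dataset (hypothetical values).
toy = pd.DataFrame({
    'ecology': ['good', None, 'good', 'poor'],
    'full_sq': [43.0, 34.0, 50.0, 61.0],
})

non_numeric_cols = toy.select_dtypes(exclude=[np.number]).columns.values
for col in non_numeric_cols:
    missing = toy[col].isnull()
    num_missing = np.sum(missing)
    if num_missing > 0:  # only touch columns that have missing values
        print('imputing missing values for: {}'.format(col))
        # Record which rows were missing before they are filled in.
        toy['{}_ismissing'.format(col)] = missing
        # Assumed imputation strategy: fill with the most frequent value.
        top = toy[col].mode().iloc[0]
        toy[col] = toy[col].fillna(top)

print(toy)
```

An alternative with the same shape is to fill with an explicit placeholder string, which keeps missingness visible as its own category.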
# we know that column 'id' is unique, but what if we drop it?
df_dedupped = df.drop('id', axis=1).drop_duplicates()
# there were duplicate rows
print(df.shape)
print(df_dedupped.shape)
# drop duplicates based on a subset of variables.
key = ['timestamp', 'full_sq', 'life_sq', 'floor', 'build_year', 'num_room', 'price_doc']
df_dedupped2 = df.drop_duplicates(subset=key)
print(df.shape)
print(df_dedupped2.shape)
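Before dropping rows on a key, it can help to look at the rows that would be treated as duplicates. The sketch below uses `DataFrame.duplicated` with the same `subset` argument on a toy frame; the data and the shortened key are assumptions for illustration.

```python
import pandas as pd

# Toy data standing in for the real dataset (hypothetical values).
toy = pd.DataFrame({
    'id': [1, 2, 3],
    'timestamp': ['2014-01-01', '2014-01-01', '2014-01-02'],
    'full_sq': [43, 43, 50],
    'price_doc': [100, 100, 120],
})

# Shortened key, for illustration only.
key = ['timestamp', 'full_sq', 'price_doc']

# keep=False marks every member of a duplicate group, not just the repeats.
dupes = toy[toy.duplicated(subset=key, keep=False)]
print(dupes)

# drop_duplicates keeps the first row of each group by default.
toy_dedupped = toy.drop_duplicates(subset=key)
print(toy.shape)
print(toy_dedupped.shape)
```

Comparing the two shapes shows how many rows the key collapsed, and `dupes` shows which ones.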
# bar chart - distribution of a categorical variable
df['ecology'].value_counts().plot.bar()
df['sub_area'].value_counts(dropna=False)
# make everything lower case.
df['sub_area_lower'] = df['sub_area'].str.lower()
df['sub_area_lower'].value_counts(dropna=False)
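The effect of lower-casing is easiest to see on values that differ only in capitalization. The sketch below uses a toy column with hypothetical spellings; the counts before and after show the variants collapsing into one category.

```python
import pandas as pd

# Toy data with inconsistent capitalization (hypothetical values).
toy = pd.DataFrame({
    'sub_area': ['Bibirevo', 'bibirevo', 'BIBIREVO', 'Tverskoe', None],
})

# Counts on the raw column treat each spelling as its own category.
raw_counts = toy['sub_area'].value_counts(dropna=False)

# .str.lower() leaves missing values untouched.
toy['sub_area_lower'] = toy['sub_area'].str.lower()
counts = toy['sub_area_lower'].value_counts(dropna=False)

print(raw_counts)
print(counts)
```

Passing `dropna=False` keeps the missing entry in the counts, so it is not silently lost during the comparison.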
# group some categories together.
df['ecology'].value_counts(dropna=False)
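The snippet above only lists the category counts; the grouping itself is not shown in this excerpt. A minimal sketch of one way to merge categories with `Series.replace` follows. The toy values, the `ecology_grouped` column name, and the mapping are hypothetical and chosen only to illustrate the mechanics.

```python
import pandas as pd

# Toy data standing in for the real column (hypothetical values).
toy = pd.DataFrame({
    'ecology': ['good', 'excellent', 'poor', 'satisfactory', 'no data', 'good'],
})

# Hypothetical mapping: fold two categories into 'good'.
grouping = {'excellent': 'good', 'satisfactory': 'good'}

# Values not present in the mapping are left unchanged.
toy['ecology_grouped'] = toy['ecology'].replace(grouping)
counts = toy['ecology_grouped'].value_counts(dropna=False)

print(counts)
```

Writing the result to a new column keeps the original categories available for comparison.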