
Last active January 25, 2024 14:23
Imputation of missing values with knn.

    import numpy as np
    import pandas as pd
    from collections import defaultdict
    from scipy.stats import hmean
    from scipy.spatial.distance import cdist
    from scipy import stats
    import numbers


    def weighted_hamming(data):
        """ Compute weighted hamming distance on categorical variables. For one variable, it is equal to 1 if
            the values between point A and point B are different, else it is equal the relative frequency of the
            distribution of the value across the variable. For multiple variables, the harmonic mean is computed
            up to a constant factor.

            @params:
                - data = a pandas data frame of categorical variables

            @returns:
                - distance_matrix = a distance matrix with pairwise distance for all attributes
        """
        categories_dist = []

        for category in data:
            X = pd.get_dummies(data[category])
            X_mean = X * X.mean()
            X_dot = X_mean.dot(X.transpose())
            X_np = np.asarray(X_dot.replace(0,1,inplace=False))
            categories_dist.append(X_np)
        categories_dist = np.array(categories_dist)
        distances = hmean(categories_dist, axis=0)
        return distances


    def distance_matrix(data, numeric_distance = "euclidean", categorical_distance = "jaccard"):
        """ Compute the pairwise distance attribute by attribute in order to account for different variables type:
            - Continuous
            - Categorical
            For ordinal values, provide a numerical representation taking the order into account.
            Categorical variables are transformed into a set of binary ones.
            If both continuous and categorical distance are provided, a Gower-like distance is computed and the numeric
            variables are all normalized in the process.
            If there are missing values, the mean is computed for numerical attributes and the mode for categorical ones.

            Note: If weighted-hamming distance is chosen, the computation time increases a lot since it is not coded in C
            like other distance metrics provided by scipy.

            @params:
                - data                  = pandas dataframe to compute distances on.
                - numeric_distances     = the metric to apply to continuous attributes.
                                          "euclidean" and "cityblock" available.
                                          Default = "euclidean"
                - categorical_distances = the metric to apply to binary attributes.
                                          "jaccard", "hamming", "weighted-hamming" and "euclidean"
                                          available. Default = "jaccard"

            @returns:
                - the distance matrix
        """
        possible_continuous_distances = ["euclidean", "cityblock"]
        possible_binary_distances = ["euclidean", "jaccard", "hamming", "weighted-hamming"]
        number_of_variables = data.shape[1]
        number_of_observations = data.shape[0]

        # Get the type of each attribute (Numeric or categorical)
        is_numeric = [all(isinstance(n, numbers.Number) for n in data.iloc[:, i]) for i, x in enumerate(data)]
        is_all_numeric = sum(is_numeric) == len(is_numeric)
        is_all_categorical = sum(is_numeric) == 0
        is_mixed_type = not is_all_categorical and not is_all_numeric

        # Check the content of the distances parameter
        if numeric_distance not in possible_continuous_distances:
            print "The continuous distance " + numeric_distance + " is not supported."
            return None
        elif categorical_distance not in possible_binary_distances:
            print "The binary distance " + categorical_distance + " is not supported."
            return None

        # Separate the data frame into categorical and numeric attributes and normalize numeric data
        if is_mixed_type:
            number_of_numeric_var = sum(is_numeric)
            number_of_categorical_var = number_of_variables - number_of_numeric_var
            data_numeric = data.iloc[:, is_numeric]
            data_numeric = (data_numeric - data_numeric.mean()) / (data_numeric.max() - data_numeric.min())
            data_categorical = data.iloc[:, [not x for x in is_numeric]]

        # Replace missing values with column mean for numeric values and mode for categorical ones. With the mode, it
        # triggers a warning: "SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame"
        # but the value are properly replaced
        if is_mixed_type:
            data_numeric.fillna(data_numeric.mean(), inplace=True)
            for x in data_categorical:
                data_categorical[x].fillna(data_categorical[x].mode()[0], inplace=True)
        elif is_all_numeric:
            data.fillna(data.mean(), inplace=True)
        else:
            for x in data:
                data[x].fillna(data[x].mode()[0], inplace=True)

        # "Dummifies" categorical variables in place
        if not is_all_numeric and not (categorical_distance == 'hamming' or categorical_distance == 'weighted-hamming'):
            if is_mixed_type:
                data_categorical = pd.get_dummies(data_categorical)
            else:
                data = pd.get_dummies(data)
        elif not is_all_numeric and categorical_distance == 'hamming':
            if is_mixed_type:
                data_categorical = pd.DataFrame([pd.factorize(data_categorical[x])[0] for x in data_categorical]).transpose()
            else:
                data = pd.DataFrame([pd.factorize(data[x])[0] for x in data]).transpose()

        if is_all_numeric:
            result_matrix = cdist(data, data, metric=numeric_distance)
        elif is_all_categorical:
            if categorical_distance == "weighted-hamming":
                result_matrix = weighted_hamming(data)
            else:
                result_matrix = cdist(data, data, metric=categorical_distance)
        else:
            result_numeric = cdist(data_numeric, data_numeric, metric=numeric_distance)
            if categorical_distance == "weighted-hamming":
                result_categorical = weighted_hamming(data_categorical)
            else:
                result_categorical = cdist(data_categorical, data_categorical, metric=categorical_distance)
            result_matrix = np.array([[1.0*(result_numeric[i, j] * number_of_numeric_var + result_categorical[i, j] *
                                   number_of_categorical_var) / number_of_variables
                                   for j in range(number_of_observations)] for i in range(number_of_observations)])

        # Fill the diagonal with NaN values
        np.fill_diagonal(result_matrix, np.nan)

        return pd.DataFrame(result_matrix)


    def knn_impute(target, attributes, k_neighbors, aggregation_method="mean", numeric_distance="euclidean",
                   categorical_distance="jaccard", missing_neighbors_threshold = 0.5):
        """ Replace the missing values within the target variable based on its k nearest neighbors identified with the
            attributes variables. If more than 50% of its neighbors are also missing values, the value is not modified and
            remains missing. If there is a problem in the parameters provided, returns None.
            If to many neighbors also have missing values, leave the missing value of interest unchanged.

            @params:
                - target                        = a vector of n values with missing values that you want to impute. The length has
                                                  to be at least n = 3.
                - attributes                    = a data frame of attributes with n rows to match the target variable
                - k_neighbors                   = the number of neighbors to look at to impute the missing values. It has to be a
                                                  value between 1 and n.
                - aggregation_method            = how to aggregate the values from the nearest neighbors (mean, median, mode)
                                                  Default = "mean"
                - numeric_distances             = the metric to apply to continuous attributes.
                                                  "euclidean" and "cityblock" available.
                                                  Default = "euclidean"
                - categorical_distances         = the metric to apply to binary attributes.
                                                  "jaccard", "hamming", "weighted-hamming" and "euclidean"
                                                  available. Default = "jaccard"
                - missing_neighbors_threshold   = minimum of neighbors among the k ones that are not also missing to infer
                                                  the correct value. Default = 0.5

            @returns:
                target_completed        = the vector of target values with missing value replaced. If there is a problem
                                          in the parameters, return None
        """

        # Get useful variables
        possible_aggregation_method = ["mean", "median", "mode"]
        number_observations = len(target)
        is_target_numeric = all(isinstance(n, numbers.Number) for n in target)

        # Check for possible errors
        if number_observations < 3:
            print "Not enough observations."
            return None
        if attributes.shape[0] != number_observations:
            print "The number of observations in the attributes variable is not matching the target variable length."
            return None
        if k_neighbors > number_observations or k_neighbors < 1:
            print "The range of the number of neighbors is incorrect."
            return None
        if aggregation_method not in possible_aggregation_method:
            print "The aggregation method is incorrect."
            return None
        if not is_target_numeric and aggregation_method != "mode":
            print "The only method allowed for categorical target variable is the mode."
            return None

        # Make sure the data are in the right format
        target = pd.DataFrame(target)
        attributes = pd.DataFrame(attributes)

        # Get the distance matrix and check whether no error was triggered when computing it
        distances = distance_matrix(attributes, numeric_distance, categorical_distance)
        if distances is None:
            return None

        # Get the closest points and compute the correct aggregation method
        for i, value in enumerate(target.iloc[:, 0]):
            if pd.isnull(value):
                order = distances.iloc[i,:].values.argsort()[:k_neighbors]
                closest_to_target = target.iloc[order, :]
                missing_neighbors = [x for x in closest_to_target.isnull().iloc[:, 0]]
                # Compute the right aggregation method if at least more than 50% of the closest neighbors are not missing
                if sum(missing_neighbors) >= missing_neighbors_threshold * k_neighbors:
                    continue
                elif aggregation_method == "mean":
                    target.iloc[i] = np.ma.mean(np.ma.masked_array(closest_to_target,np.isnan(closest_to_target)))
                elif aggregation_method == "median":
                    target.iloc[i] = np.ma.median(np.ma.masked_array(closest_to_target,np.isnan(closest_to_target)))
                else:
                    target.iloc[i] = stats.mode(closest_to_target, nan_policy='omit')[0][0]

        return target
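
To make the core loop of `knn_impute` easier to follow, here is a minimal, self-contained sketch of the same idea under simplifying assumptions: numeric attributes only, Euclidean distance, and mean aggregation. The function name `knn_impute_numeric` is hypothetical and this is not the gist's full implementation. It keeps the same skip rule, so a value stays missing when the number of missing neighbors is at least `missing_neighbors_threshold * k_neighbors`.

```python
import numpy as np
from scipy.spatial.distance import cdist


def knn_impute_numeric(target, attributes, k_neighbors, missing_neighbors_threshold=0.5):
    """Simplified sketch: numeric target, numeric attributes, mean aggregation."""
    target = np.asarray(target, dtype=float).copy()
    attributes = np.asarray(attributes, dtype=float)

    # Pairwise distances; the diagonal is NaN so a point is never its own neighbor
    distances = cdist(attributes, attributes, metric="euclidean")
    np.fill_diagonal(distances, np.nan)

    for i in np.flatnonzero(np.isnan(target)):
        # argsort places NaN last, so the point itself is picked only if k is too large
        order = np.argsort(distances[i])[:k_neighbors]
        neighbors = target[order]
        n_missing = np.isnan(neighbors).sum()
        # Same rule as the gist: too many missing neighbors, leave the value missing
        if n_missing >= missing_neighbors_threshold * k_neighbors:
            continue
        target[i] = np.nanmean(neighbors)
    return target
```

With `k_neighbors=2` and the default threshold of 0.5, a single missing neighbor is already enough to skip a value, which is one way NaN values can remain after imputation.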

### YohanObadia commented Jan 31, 2017

This is my first shared code. I hope it's going to be useful. Any comment is welcome to help me improve :)

### ayush488 commented Jan 30, 2018

Yes, thank you. This is exactly what I was looking for.

### CassSVY commented Mar 23, 2018

Thank you for posting this! Really helpful! One quick question: when I used the KNN imputation to fill the missing values in both the age and Embarked columns, some NaN values were still left afterwards. Is there something I missed, or are there extra steps I should take to fill them in?

### Sekhar84 commented Apr 15, 2018

Thank you. This is exactly what I need. I hope you don't mind if I use your code.

### kamrankausar commented Jun 21, 2018

Great, and thank you.

### Jackil1993 commented Jun 25, 2018

It helped me a lot. Thank you!

### arushi02 commented Jul 6, 2018 • edited

I have a mixed dataset with both numeric and categorical variables, and my target variable is also categorical. How can I use this in that case?
When I passed the dataset as is, I got the error "TypeError: '<' not supported between instances of 'float' and 'str'". I then tried converting all the independent categorical variables to one-hot encoding, but I still get the same error.

### SaravananStat commented Sep 28, 2018 • edited

> I have a mixed dataset with both numeric and categorical variables, and my target variable is also categorical. How can I use this in that case?
> When I passed the dataset as is, I got the error "TypeError: '<' not supported between instances of 'float' and 'str'". I then tried converting all the independent categorical variables to one-hot encoding, but I still get the same error.

Hi,
You need to work with only one variable at a time. For example, if your target is "Embarked", drop it from your attributes. That way you won't get the error above.
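
A minimal sketch of that split, assuming a pandas data frame. The helper name `split_target` and the sample columns are made up for illustration and are not part of the gist.

```python
import numpy as np
import pandas as pd


def split_target(data, target_column):
    """Return (target, attributes) with the target column removed from the attributes."""
    target = data[target_column]
    attributes = data.drop(columns=target_column)
    return target, attributes


# Illustrative data only
data = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
    "Embarked": ["S", np.nan, "C", "S"],
})
target, attributes = split_target(data, "Embarked")
```

The resulting `target` and `attributes` have the same number of rows, which is what `knn_impute` checks before computing distances.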

### omidbazgirTTU commented Oct 9, 2018

This code only works for a single case; it cannot be used for other problems!

### ghost commented Nov 22, 2018

Dear @Saravanansat, I am having the same problem as @arushi02, and considering only one target doesn't fix the error. The error disappears when I set missing_neighbors_threshold = 0, but even then no imputation is performed. @YohanObadia, I am afraid the function has some problems imputing a categorical target variable.

### ghost commented Nov 22, 2018

@arushi02 I solved the problem with:
target.iloc[i] = stats.mode(closest_to_target.dropna())[0][0]
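
The fix above drops the missing neighbors before taking the mode, so strings and NaN (a float) are never compared with each other. A minimal sketch of the same idea using pandas instead of `stats.mode` is shown here, on the assumption that some SciPy versions may not accept non-numeric input. The helper name `mode_of_neighbors` is hypothetical.

```python
import numpy as np
import pandas as pd


def mode_of_neighbors(neighbor_values):
    """Most frequent non-missing value among the neighbors, or None if all are missing."""
    values = pd.Series(neighbor_values, dtype=object).dropna()
    if values.empty:
        return None
    # Series.mode() returns the modes sorted; take the first one
    return values.mode().iloc[0]


most_common = mode_of_neighbors(["S", np.nan, "C", "S"])
```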

### manojrajpurohit commented Oct 28, 2019

I had problems importing this module into Anaconda (Python version 3.7.3) on Windows 10. I followed the steps below:

1. Copied this module as a Python file (knn_impute.py) into the directory D:\python_external.
2. In the Anaconda site-packages path (C:\Users\manoj\Anaconda3\Lib\site-packages), created a path file python_external.pth.
The path file (python_external.pth) contains the folder location of the knn_impute module.

I then restarted Anaconda and tried importing the module:
import knn_impute

Below is the error message I got:
_Traceback (most recent call last):

File "C:\Users\manoj\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 3325, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)

File "", line 1, in
import knn_impute

File "D:\python_external\knn_impute.py", line 73
print "The continuous distance " + numeric_distance + " is not supported."
^
SyntaxError: Missing parentheses in call to 'print'. Did you mean print("The continuous distance " + numeric_distance + " is not supported.")?_

This shows that the module was found, but the Python interpreter was unable to parse its syntax. After some further searching, I found that this module appears to be written for Python 2.

Can someone please point me toward a Python 3.7 compatible KNN imputer module? Or can this code be converted to be Python 3.7 compatible? If so, please suggest how.
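
The SyntaxError itself suggests the fix: in Python 3, `print` is a function, so each `print "..."` statement in the module needs parentheses. Here is a minimal sketch of one converted check, wrapped in a hypothetical helper `check_numeric_distance` so it can run on its own. Whether the rest of the module runs unchanged on Python 3.7 has not been verified here.

```python
def check_numeric_distance(numeric_distance):
    """Python 3 form of the gist's check on the numeric distance parameter."""
    possible_continuous_distances = ["euclidean", "cityblock"]
    if numeric_distance not in possible_continuous_distances:
        # Python 2: print "The continuous distance " + numeric_distance + " is not supported."
        print("The continuous distance " + numeric_distance + " is not supported.")
        return None
    return numeric_distance
```

Because every `print` in the module takes a single string, the parenthesized form also runs under Python 2.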

### deshwalmahesh commented Dec 24, 2019

What if other columns have missing values too? Which do we fill first, and how?

### YohanObadia commented Feb 25, 2020

@deshwalmahesh The other columns are first filled with the median or mean of their column, and then you run the imputation on the column of interest. You then remove the median (or mean) fill from another column, impute that column with the KNN, and so on.
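
The procedure described above can be sketched as follows, assuming numeric columns only. `knn_mean` is a hypothetical stand-in for `knn_impute` (it averages the k nearest rows whose target is known and assumes each column has at least one known value), and `impute_all_columns` is likewise an illustrative name.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist


def knn_mean(target, attributes, k_neighbors=2):
    """Stand-in imputer: mean of the k nearest rows whose target value is known."""
    values = target.to_numpy(dtype=float).copy()
    features = attributes.to_numpy(dtype=float)
    distances = cdist(features, features)
    known = np.flatnonzero(~np.isnan(values))
    for i in np.flatnonzero(np.isnan(values)):
        nearest = known[np.argsort(distances[i, known])[:k_neighbors]]
        values[i] = values[nearest].mean()
    return values


def impute_all_columns(data, k_neighbors=2):
    """Median-fill everything, then re-impute one column at a time with KNN."""
    missing = data.isna()
    filled = data.fillna(data.median())  # temporary fill for every column
    for column in data.columns[missing.any()]:
        # Put the NaNs back in the column of interest only
        target = filled[column].mask(missing[column])
        attributes = filled.drop(columns=column)
        filled[column] = knn_mean(target, attributes, k_neighbors)
    return filled


data = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": [10.0, np.nan, 30.0, 40.0],
    "c": [1.0, 2.0, 3.0, 4.0],
})
result = impute_all_columns(data)
```

In this sketch, columns imputed later see the KNN values of columns imputed earlier instead of their median fill, so the column order can influence the result.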

### lhicelaytomeow commented Feb 22, 2021

Your code is amazing! I can't find any existing Python library that handles categorical imputation through nearest neighbors. Do you mind if I import your code and use it for my imputation problem? I will just add a part that loops through all attributes with missing data, so that I can use it on my data, which has multiple columns with missing values. I've tested it, and it currently works on only a single column at a time.

### YohanObadia commented Feb 22, 2021

> Your code is amazing! I can't find any existing Python library that handles categorical imputation through nearest neighbors. Do you mind if I import your code and use it for my imputation problem? I will just add a part that loops through all attributes with missing data, so that I can use it on my data, which has multiple columns with missing values. I've tested it, and it currently works on only a single column at a time.

I did it for that purpose. You are more than welcome to reuse it to your heart's content!
It would be best if you mention where it came from, but that's about it ;)