@vishalkuo
Last active March 31, 2021 21:53
Remove outliers using numpy. Normally, an outlier is a value that lies more than 1.5 * the IQR beyond the quartiles, but experimental analysis has shown that a higher or lower multiplier might produce more accurate results. Interestingly, after 1000 runs, removing outliers creates a larger standard deviation between test run results.
import numpy as np

def removeOutliers(x, outlierConstant):
    a = np.array(x)
    upper_quartile = np.percentile(a, 75)
    lower_quartile = np.percentile(a, 25)
    IQR = (upper_quartile - lower_quartile) * outlierConstant
    quartileSet = (lower_quartile - IQR, upper_quartile + IQR)
    resultList = []
    for y in a.tolist():
        if y >= quartileSet[0] and y <= quartileSet[1]:
            resultList.append(y)
    return resultList
@adrian-alberto

adrian-alberto commented Jan 9, 2017

Line 11, you should use >= and <=. Otherwise, a list of mostly the same number (e.g. [0,0,0,0,0,0,0,0,0,0,0,0,5]) will return an empty list.

I'm using this code snippet for some cancer research stuff. Thank you for publishing.
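
The edge case described in this comment can be reproduced with a small sketch. The helper `filter_by_fences` and its `inclusive` flag are hypothetical names introduced only for this illustration; they are not part of the gist. When most values are identical, both quartiles coincide, the scaled IQR is zero, and strict comparisons reject every value.

```python
import numpy as np

def filter_by_fences(x, outlier_constant, inclusive):
    """Hypothetical helper: keep values between the IQR-based bounds."""
    a = np.array(x)
    q1 = np.percentile(a, 25)
    q3 = np.percentile(a, 75)
    margin = (q3 - q1) * outlier_constant
    low, high = q1 - margin, q3 + margin
    if inclusive:
        mask = (a >= low) & (a <= high)   # comparison suggested in the comment
    else:
        mask = (a > low) & (a < high)     # strict comparison, for contrast
    return a[mask].tolist()

# Twelve zeros and a single 5: both quartiles are 0, so the bounds collapse to (0, 0).
data = [0] * 12 + [5]
strict_result = filter_by_fences(data, 1.5, inclusive=False)
inclusive_result = filter_by_fences(data, 1.5, inclusive=True)
print(strict_result)
print(inclusive_result)
```

With the strict comparison nothing survives, while the inclusive comparison keeps the zeros and drops only the 5.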

ghost commented May 1, 2017

I will use this to remove false detections in an object tracking application. Thanks for posting.

@vishalkuo
Author

Thanks, @adrian-alberto! Updated

@NithyaGrace

How can this piece of code be adapted for a dataframe, to drop values across the dataframe?

@lgribeiro

Thanks for posting. I need this code for a dataframe too. I will try to modify it for my case.
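
One possible way to adapt the idea to a pandas DataFrame is sketched below. This is not from the gist: the function name `remove_outlier_rows` and the choice to drop a whole row when any numeric column falls outside its bounds are assumptions made for illustration, and other policies (per-column masking, for example) may suit a given dataset better.

```python
import numpy as np
import pandas as pd

def remove_outlier_rows(df, outlier_constant=1.5):
    """Hypothetical helper: drop rows where any numeric column is an outlier."""
    numeric = df.select_dtypes(include=[np.number])
    q1 = numeric.quantile(0.25)          # per-column lower quartile
    q3 = numeric.quantile(0.75)          # per-column upper quartile
    margin = (q3 - q1) * outlier_constant
    # Comparisons align on column labels; NaN cells compare as False,
    # so rows containing NaN in a numeric column are dropped as well.
    within = (numeric >= q1 - margin) & (numeric <= q3 + margin)
    return df[within.all(axis=1)]

df = pd.DataFrame({
    "a": [1, 2, 3, 4, 100],
    "b": [10, 11, 12, 13, 14],
    "label": ["v", "w", "x", "y", "z"],
})
cleaned = remove_outlier_rows(df)
print(cleaned)
```

Non-numeric columns such as `label` are ignored when computing the bounds but are kept in the returned rows.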

@SwRoy

SwRoy commented Apr 28, 2018

How do I decide what the constant is?
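
As a rough illustration of what the constant controls: it scales the IQR before the result is subtracted from the lower quartile and added to the upper quartile, so a larger constant widens the accepted range and removes fewer values. The gist description mentions 1.5 as the usual choice; 3.0 is sometimes used to flag only extreme values. The sketch below uses made-up sample data and a hypothetical helper named `fences` to show how the bounds move.

```python
import numpy as np

def fences(x, outlier_constant):
    """Hypothetical helper: return the (low, high) bounds for a given constant."""
    a = np.array(x)
    q1, q3 = np.percentile(a, [25, 75])
    margin = (q3 - q1) * outlier_constant
    return q1 - margin, q3 + margin

# Made-up sample data with one moderate (22) and one extreme (40) value.
data = [10, 11, 12, 12, 13, 13, 14, 15, 22, 40]

kept_counts = {}
for k in (1.0, 1.5, 3.0):
    low, high = fences(data, k)
    kept = [y for y in data if low <= y <= high]
    kept_counts[k] = len(kept)
    print(k, (low, high), len(kept))
```

Trying a few constants on representative data and inspecting which values are dropped is one practical way to choose.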

@rpicatoste

rpicatoste commented Feb 26, 2019

Hi, here is my suggestion to take advantage of NumPy's speed instead of a Python loop with a growing list. With big arrays, the difference in time is noticeable.

def removeOutliers(x, outlierConstant):
    a = np.array(x)
    upper_quartile = np.percentile(a, 75)
    lower_quartile = np.percentile(a, 25)
    IQR = (upper_quartile - lower_quartile) * outlierConstant
    quartileSet = (lower_quartile - IQR, upper_quartile + IQR)
    
    result = a[np.where((a >= quartileSet[0]) & (a <= quartileSet[1]))]
    
    return result.tolist()
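
A small variation on the vectorized version above: NumPy arrays can be indexed directly with a boolean mask, so the `np.where` call is optional. The sketch below, with the hypothetical names `remove_outliers_where` and `remove_outliers_mask`, checks that both forms return the same values on made-up data.

```python
import numpy as np

def _bounds(a, outlier_constant):
    # Same bounds as in the gist: quartiles widened by the scaled IQR.
    q1 = np.percentile(a, 25)
    q3 = np.percentile(a, 75)
    margin = (q3 - q1) * outlier_constant
    return q1 - margin, q3 + margin

def remove_outliers_where(x, outlier_constant):
    a = np.array(x)
    low, high = _bounds(a, outlier_constant)
    return a[np.where((a >= low) & (a <= high))].tolist()

def remove_outliers_mask(x, outlier_constant):
    a = np.array(x)
    low, high = _bounds(a, outlier_constant)
    return a[(a >= low) & (a <= high)].tolist()   # boolean mask indexing

data = [1, 2, 3, 4, 100]
via_where = remove_outliers_where(data, 1.5)
via_mask = remove_outliers_mask(data, 1.5)
print(via_where, via_mask)
```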

@marcoruizrueda

Thanks, @adrian-alberto! Updated

Did you mean 0.25 and 0.75 rather than 25 and 75? Percentiles go from 0 to 100. Thanks for the code.

@braindotai

@marcoruizrueda
What you are talking about are quantiles.

0 quartile = 0 quantile = 0 percentile
1 quartile = 0.25 quantile = 25 percentile
2 quartile = 0.5 quantile = 50 percentile (median)
3 quartile = 0.75 quantile = 75 percentile
4 quartile = 1 quantile = 100 percentile
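
The correspondence in the table above can be checked in NumPy, where `np.percentile` takes values from 0 to 100 and `np.quantile` takes values from 0 to 1. The sample array is made up for illustration.

```python
import numpy as np

a = np.array([1, 2, 3, 4, 100])

# Same five cut points, expressed on the percentile and quantile scales.
from_percentile = np.percentile(a, [0, 25, 50, 75, 100])
from_quantile = np.quantile(a, [0, 0.25, 0.5, 0.75, 1])

print(from_percentile)
print(from_quantile)
```

Both calls return the same values, so the gist's use of 25 and 75 with `np.percentile` matches 0.25 and 0.75 with `np.quantile`.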

@prasadsonar2

What is the outlier constant?
