Remove outliers using numpy. Normally, an outlier is a value that lies more than 1.5 * the IQR outside the quartiles, but experimental analysis has shown that a higher or lower multiplier might produce more accurate results. Interestingly, after 1000 runs, removing outliers creates a larger standard deviation between test run results.
import numpy as np

def removeOutliers(x, outlierConstant):
    a = np.array(x)
    upper_quartile = np.percentile(a, 75)
    lower_quartile = np.percentile(a, 25)
    IQR = (upper_quartile - lower_quartile) * outlierConstant
    quartileSet = (lower_quartile - IQR, upper_quartile + IQR)
    resultList = []
    for y in a.tolist():
        if y >= quartileSet[0] and y <= quartileSet[1]:
            resultList.append(y)
    return resultList

@adrian-alberto adrian-alberto commented Jan 9, 2017

Line 11, you should use >= and <=. Otherwise, a list of mostly the same number (e.g. [0,0,0,0,0,0,0,0,0,0,0,0,5]) will return an empty list.
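The edge case described here can be reproduced with a minimal sketch. It assumes the earlier revision of the gist compared with strict `<` and `>`; that revision is not shown on this page, so the strict variant below is an inference from the comment rather than the original code.

```python
import numpy as np

# Twelve zeros and one 5: both quartiles are 0, so the IQR is 0
# and the lower and upper bounds collapse to the same value.
data = [0] * 12 + [5]
a = np.array(data)

q1, q3 = np.percentile(a, [25, 75])
margin = (q3 - q1) * 1.5
lower, upper = q1 - margin, q3 + margin

# Strict comparison (assumed earlier behaviour): nothing survives.
strict = [y for y in a.tolist() if lower < y < upper]

# Inclusive comparison (current gist): the zeros are kept, the 5 is dropped.
inclusive = [y for y in a.tolist() if lower <= y <= upper]

print(strict)     # []
print(inclusive)  # twelve zeros
```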

I'm using this code snippet for some cancer research stuff. Thank you for publishing.

@ghost ghost commented May 1, 2017

I will use this to remove false detections in an object tracking application. Thanks for posting.

Owner Author

@vishalkuo vishalkuo commented Sep 27, 2017

Thanks, @adrian-alberto! Updated

@NithyaGrace NithyaGrace commented Dec 15, 2017

How can this piece of code be adapted for a dataframe, to drop values across the dataframe?
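One possible adaptation, as a rough sketch rather than a definitive answer: it assumes pandas is available, and the function name `remove_outliers_df` and its `columns` parameter are hypothetical, not part of the gist. It applies the same bounds per numeric column and drops any row that has a value outside the bounds in at least one of those columns.

```python
import pandas as pd

def remove_outliers_df(df, outlier_constant=1.5, columns=None):
    """Drop rows with a value outside the IQR bounds in any selected column.

    Hypothetical helper: by default every numeric column is checked.
    Rows containing NaN in a checked column are dropped too, because
    comparisons against NaN evaluate to False.
    """
    if columns is None:
        columns = df.select_dtypes(include="number").columns
    cols = list(columns)

    q1 = df[cols].quantile(0.25)
    q3 = df[cols].quantile(0.75)
    margin = (q3 - q1) * outlier_constant

    # Per-column comparison; the Series of bounds aligns on column labels.
    within = (df[cols] >= q1 - margin) & (df[cols] <= q3 + margin)
    return df[within.all(axis=1)]

example = pd.DataFrame({"a": [1, 2, 3, 4, 100], "b": [10, 11, 12, 13, 14]})
cleaned = remove_outliers_df(example)
print(cleaned)  # the row with a == 100 is dropped
```

Dropping whole rows is only one design choice; replacing out-of-range values with NaN per column would keep the other columns' data.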

@lgribeiro lgribeiro commented Feb 22, 2018

Thanks for posting. I need this code for a dataframe too. I will try to modify it for my case.

@SwRoy SwRoy commented Apr 28, 2018

How do I decide what the constant is?
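There is no single right value; the gist description mentions 1.5 as the usual choice and suggests experimenting. One way to get a feel for it is to see how many points survive for a few candidate constants. The sketch below uses a made-up sample, and the helper name `iqr_fences` is hypothetical.

```python
import numpy as np

def iqr_fences(x, outlier_constant):
    """Return the (lower, upper) bounds used by the gist's filter."""
    a = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(a, [25, 75])
    margin = (q3 - q1) * outlier_constant
    return q1 - margin, q3 + margin

# Made-up data: a tight cluster plus two increasingly distant values.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 12, 20]

counts = {}
for k in (0.5, 1.5, 3.0):
    lower, upper = iqr_fences(sample, k)
    kept = [v for v in sample if lower <= v <= upper]
    counts[k] = len(kept)
    print(f"constant={k}: bounds=({lower}, {upper}), kept {len(kept)} of {len(sample)}")
```

A smaller constant trims more aggressively, and a larger one keeps more of the tail, so the choice depends on how costly it is to discard a genuine value versus keep a bad one.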

@rpicatoste rpicatoste commented Feb 26, 2019

Hi, here is my suggestion to take advantage of NumPy's speed instead of a Python loop with a growing list. With big arrays the difference in time is noticeable.

import numpy as np

def removeOutliers(x, outlierConstant):
    a = np.array(x)
    upper_quartile = np.percentile(a, 75)
    lower_quartile = np.percentile(a, 25)
    # Scaled IQR: the distance allowed beyond each quartile
    IQR = (upper_quartile - lower_quartile) * outlierConstant
    quartileSet = (lower_quartile - IQR, upper_quartile + IQR)

    # Select every value inside the bounds in one vectorized step
    result = a[np.where((a >= quartileSet[0]) & (a <= quartileSet[1]))]

    return result.tolist()
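To check the claim on your own machine, a rough sketch like the following compares the two approaches. The function names `remove_outliers_loop` and `remove_outliers_vectorized` are hypothetical labels for the gist's version and the version in this comment, and the random test data is made up. No timings are quoted here because they depend on hardware and array size.

```python
import timeit
import numpy as np

def remove_outliers_loop(x, outlier_constant):
    """The gist's approach: Python loop appending to a list."""
    a = np.array(x)
    q1, q3 = np.percentile(a, [25, 75])
    margin = (q3 - q1) * outlier_constant
    result = []
    for y in a.tolist():
        if q1 - margin <= y <= q3 + margin:
            result.append(y)
    return result

def remove_outliers_vectorized(x, outlier_constant):
    """The suggestion in this comment: a single boolean-mask selection."""
    a = np.array(x)
    q1, q3 = np.percentile(a, [25, 75])
    margin = (q3 - q1) * outlier_constant
    mask = (a >= q1 - margin) & (a <= q3 + margin)
    return a[mask].tolist()

# Made-up input: 100,000 normally distributed values.
rng = np.random.default_rng(0)
values = rng.normal(size=100_000).tolist()

loop_time = timeit.timeit(lambda: remove_outliers_loop(values, 1.5), number=3)
vec_time = timeit.timeit(lambda: remove_outliers_vectorized(values, 1.5), number=3)
print(f"loop: {loop_time:.3f}s  vectorized: {vec_time:.3f}s")
```

Both functions should return the same list, so the speed difference is the only thing being compared.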

@marcoruizrueda marcoruizrueda commented Dec 2, 2019

> Thanks, @adrian-alberto! Updated

Did you mean 0.25 and 0.75 rather than 25 and 75? Percentiles go from 0 to 100. Thanks for the code.

@braindotai braindotai commented Apr 19, 2020

@marcoruizrueda
What you are talking about are quantiles.

0th quartile = 0 quantile = 0th percentile
1st quartile = 0.25 quantile = 25th percentile
2nd quartile = 0.5 quantile = 50th percentile (median)
3rd quartile = 0.75 quantile = 75th percentile
4th quartile = 1 quantile = 100th percentile
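In NumPy terms, the correspondence above means `np.percentile` takes values from 0 to 100 while `np.quantile` takes values from 0 to 1. A small check on made-up data:

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

# Same cut points expressed on the two scales.
by_percentile = np.percentile(a, [0, 25, 50, 75, 100])
by_quantile = np.quantile(a, [0, 0.25, 0.5, 0.75, 1])

print(by_percentile)  # [1. 3. 5. 7. 9.]
print(by_quantile)    # [1. 3. 5. 7. 9.]
```

So the gist's calls with 25 and 75 are correct for `np.percentile`; 0.25 and 0.75 would be the arguments for `np.quantile`.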

@prasadsonar2 prasadsonar2 commented May 5, 2020

What is the outlier constant?
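Reading the gist's code, `outlierConstant` is the multiplier applied to the interquartile range before it is added to and subtracted from the quartiles. Writing it as $k$ (a symbol chosen here for illustration), with $Q_1$ and $Q_3$ the 25th and 75th percentiles, a value $x$ is kept when:

```latex
\mathrm{IQR} = Q_3 - Q_1,
\qquad
Q_1 - k \cdot \mathrm{IQR} \;\le\; x \;\le\; Q_3 + k \cdot \mathrm{IQR}
```

Note that the variable named `IQR` in the code already holds $k \cdot \mathrm{IQR}$. With $k = 1.5$ this is the rule the gist description calls the normal one.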
