@nickkraakman
Created April 5, 2022 15:31
A Python function that uses Chauvenet's Criterion to filter outliers from a dataset and returns only the reasonable values.
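Written as a formula, the test the function applies to each value looks roughly like this. The symbols are introduced here for illustration and do not appear in the gist: n is the number of datapoints, x̄ their mean, and σ their standard deviation (numpy.std computes the population form by default).

```latex
P_i = \operatorname{erfc}\!\left(\frac{\lvert x_i - \bar{x}\rvert}{\sigma\sqrt{2}}\right),
\qquad
\text{keep } x_i \iff P_i \ge \frac{1}{2n}
```

Here P_i is the two-sided probability that a normally distributed value lies at least as far from the mean as x_i does.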
import numpy
from scipy.special import erfc


def filter_outliers(datapoints):
    """Run Chauvenet's Criterion to remove outliers

    @See: https://www.statisticshowto.com/chauvenets-criterion/
    @See: https://github.com/msproteomicstools/msproteomicstools/blob/master/msproteomicstoolslib/math/chauvenet.py

    Args:
        datapoints (list): Array of datapoints from which to filter the outliers

    Returns:
        list: Valid datapoints with outliers removed
    """
    # Nothing to filter; also avoids dividing by zero in the criterion
    if len(datapoints) == 0:
        return []

    criterion = 1.0/(2*len(datapoints))
    valid_datapoints = []

    # Step 1: Determine sample mean
    mean = numpy.mean(datapoints)

    # Step 2: Calculate standard deviation of sample
    standard_deviation = numpy.std(datapoints)

    # All values are identical, so none of them can be an outlier
    if standard_deviation == 0:
        return list(datapoints)

    # Step 3: For each value, calculate distance to mean in standard deviations
    # Compare to criterion and store those that pass in valid_datapoints
    for datapoint in datapoints:
        distance = abs(datapoint-mean)/standard_deviation  # Distance of a value to mean in stdv's
        distance /= 2.0**0.5  # Scale by sqrt(2) so erfc gives the normal tail area
        probability = erfc(distance)  # Two-sided tail probability under a normal distribution
        if probability >= criterion:
            valid_datapoints.append(datapoint)  # Store only non-outliers

    return valid_datapoints


# Let's use the function to filter some outliers from a list
mylist = [745, 801, 129876, 793, 698]
valid_list = filter_outliers(mylist)
print(valid_list)
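
For environments without NumPy or SciPy, the same test can be sketched with only the standard library. This is a minimal sketch, not part of the original gist: the name `filter_outliers_stdlib` is hypothetical, and returning the input unchanged for fewer than two points or for zero spread is an assumption about the desired behavior.

```python
import math
import statistics


def filter_outliers_stdlib(datapoints):
    """Hypothetical standard-library variant of filter_outliers."""
    datapoints = list(datapoints)
    if len(datapoints) < 2:
        return datapoints  # Assumption: too few points to estimate a spread

    mean = statistics.fmean(datapoints)
    # Population standard deviation, matching the default of numpy.std
    std = statistics.pstdev(datapoints)
    if std == 0:
        return datapoints  # Assumption: identical values are never outliers

    criterion = 1.0 / (2 * len(datapoints))
    # Keep values whose two-sided normal tail probability meets the criterion
    return [
        x for x in datapoints
        if math.erfc(abs(x - mean) / (std * math.sqrt(2))) >= criterion
    ]


print(filter_outliers_stdlib([745, 801, 129876, 793, 698]))  # [745, 801, 793, 698]
```

In the sample list, 129876 lies about two standard deviations from the mean, which gives a tail probability of roughly 0.046. That is below the criterion of 1/(2*5) = 0.1, so the value is dropped.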