Last active
August 23, 2023 15:57
Detect anomalies over time using percentiles and an Isolation Forest model.
---
title: "anomalies"
format: html
jupyter: python3
editor_options:
  chunk_output_type: console
---
Install the required packages:
```{python}
!pip install numpy==1.23 pandas matplotlib scipy pycaret seaborn
```
Import packages and check the NumPy version:
```{python}
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import norm
# Import only the PyCaret functions used below instead of a wildcard
from pycaret.anomaly import setup, create_model, assign_model
print(f'numpy {np.__version__}')
```
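The chunk above only reports the NumPy version. As a hedged sketch, the remaining packages could be checked with the standard library's `importlib.metadata`. The `PACKAGES` list and the `installed_versions` helper are names made up for this example, and the distribution names are assumed to match the `pip install` line.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names assumed to match the pip install line above
PACKAGES = ["numpy", "pandas", "matplotlib", "scipy", "pycaret", "seaborn"]


def installed_versions(packages):
    """Map each distribution name to its installed version, or 'not installed'."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = "not installed"
    return found


versions = installed_versions(PACKAGES)
for name, ver in versions.items():
    print(f"{name} {ver}")
```

Reporting a missing package as a string rather than raising keeps the check usable before the install step has finished.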
Simulate the data used for anomaly detection:
```{python}
# Set random seed for reproducibility
np.random.seed(2)
# Simulate 200 elapsed times (in seconds) centered on 20, floored at 1
elapsed = np.random.normal(size=200) + 20
elapsed = np.maximum(elapsed, 1)
data = pd.DataFrame({
    'x': np.arange(1, 201).astype(float),
    'elapsed': elapsed
})
print(data)
```
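Get the anomalies with an Isolation Forest:
```{python}
# Initialize the PyCaret anomaly-detection experiment
exp_name = setup(data=data)
iforest = create_model('iforest')
# Label every row and attach its anomaly score
anomalies = assign_model(iforest, transformation=True, score=True)
# Keep only the rows flagged as anomalies
anomalies = anomalies[anomalies['Anomaly'] == 1]
print(anomalies)
```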
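To give some intuition for why an Isolation Forest flags a point, here is a toy, standard-library illustration of the underlying idea: unusual values tend to be separated from the rest by fewer random splits. This is not PyCaret's implementation (a real Isolation Forest also subsamples the data and converts path lengths into scores). The helpers `isolation_depth` and `mean_depth` and the injected value `30.0` are assumptions made for this sketch.

```python
import random


def isolation_depth(point, values, rng, depth=0, max_depth=50):
    """Count the random splits needed to separate `point` from `values`."""
    lo, hi = min(values), max(values)
    if len(values) <= 1 or lo == hi or depth >= max_depth:
        return depth
    split = rng.uniform(lo, hi)
    # Keep only the side of the split that still contains the point
    if point < split:
        side = [v for v in values if v < split]
    else:
        side = [v for v in values if v >= split]
    return isolation_depth(point, side, rng, depth + 1, max_depth)


def mean_depth(point, values, n_trees=200, seed=2):
    """Average the isolation depth over many independent random trees."""
    rng = random.Random(seed)
    total = sum(isolation_depth(point, values, rng) for _ in range(n_trees))
    return total / n_trees


rng = random.Random(2)
elapsed = [rng.gauss(20, 1) for _ in range(200)]
typical = sorted(elapsed)[len(elapsed) // 2]  # a run near the middle
outlier = 30.0  # hypothetical slow run, injected for illustration
sample = elapsed + [outlier]

print("typical run:", mean_depth(typical, sample))
print("slow run:   ", mean_depth(outlier, sample))
```

The slow run should need far fewer splits on average than the typical one, which is the signal an Isolation Forest turns into an anomaly score.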
Calculate the 5th and 95th percentile bounds from the actual elapsed values:
```{python}
# Attach anomaly scores to every row (NaN for rows that were not flagged)
quantiles = data.merge(anomalies[['x', 'Anomaly_Score']], on='x', how='left')
quantiles['upper'] = quantiles['elapsed'].quantile(0.95)
quantiles['lower'] = quantiles['elapsed'].quantile(0.05)
# Rows that were not flagged get a score of 0
quantiles['Anomaly_Score'] = quantiles['Anomaly_Score'].fillna(0)
# Confirm that no missing values remain
print(quantiles.isna().any())
```
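The bounds above are only drawn, not used to flag anything. As a minimal sketch of percentile-based detection, the same 5% and 95% bounds can be turned into a flagging rule. It uses only the standard library and regenerates similar (not identical) simulated data so it runs on its own; the rule "outside the band means anomalous" is an assumption for illustration.

```python
import random
from statistics import quantiles

rng = random.Random(2)
elapsed = [max(rng.gauss(20, 1), 1) for _ in range(200)]

# n=20 yields 19 cut points at 5%, 10%, ..., 95%; the 'inclusive' method
# interpolates linearly between observed values
cuts = quantiles(elapsed, n=20, method="inclusive")
lower, upper = cuts[0], cuts[-1]

# Flag runs that fall outside the band, keeping their 1-based execution number
flagged = [(i + 1, t) for i, t in enumerate(elapsed) if t < lower or t > upper]
print(f"band: [{lower:.2f}, {upper:.2f}]")
print(f"flagged {len(flagged)} of {len(elapsed)} runs")
```

Note that a 5%/95% band flags roughly 10% of the runs by construction, even when nothing unusual happened, so it is better read as a reference band than as a detector on its own.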
Plot the anomalies and the percentile band:
```{python}
plt.figure(figsize=(10, 6))
plt.plot(data['x'], quantiles['elapsed'], label='Actual')
# Shade the region between the 5th and 95th percentiles
plt.fill_between(
    data['x'], quantiles['lower'], quantiles['upper'],
    color='grey', alpha=0.3, label='Quantiles 5% and 95%'
)
# Mark the points flagged by the Isolation Forest
plt.scatter(anomalies['x'], anomalies['elapsed'], color='red', label='Anomalies')
plt.title('Isolation Forest with quantiles 5% and 95%')
plt.xlabel('# Execution')
plt.ylabel('Elapsed time (s)')
plt.legend()
plt.show()
```
jrosell (author) commented on Aug 23, 2023:
R version with GAM or Isolation Forest, here: https://gist.github.com/jrosell/959ca3160df1f2658531088b1e922708
```python
elapsed = np.random.normal(size=200) + 20
elapsed = np.maximum(elapsed, 1)
data = pd.DataFrame({
    'x': np.arange(1, 201).astype(float),
    'elapsed': elapsed
})
q75 = data['elapsed'].quantile(0.75)
q25 = data['elapsed'].quantile(0.25)
IQR_1_5_lower = q25 - 1.5*(q75-q25)
IQR_1_5_upper = q75 + 1.5*(q75-q25)
IQR_3_lower = q25 - 3*(q75-q25)
IQR_3_upper = q75 + 3*(q75-q25)
```
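The snippet above computes the IQR fences but stops before using them. A hedged sketch of how the 1.5×IQR and 3×IQR fences might be applied follows; it uses only the standard library, and the `outside_fences` helper and the injected `30.0` value are assumptions made for this example.

```python
import random
from statistics import quantiles

rng = random.Random(2)
elapsed = [max(rng.gauss(20, 1), 1) for _ in range(200)]
elapsed.append(30.0)  # hypothetical slow run, injected for illustration

# Quartiles with linear interpolation between observed values
q25, _, q75 = quantiles(elapsed, n=4, method="inclusive")
iqr = q75 - q25


def outside_fences(values, k):
    """Return the values outside [q25 - k*IQR, q75 + k*IQR]."""
    lower, upper = q25 - k * iqr, q75 + k * iqr
    return [v for v in values if v < lower or v > upper]


mild = outside_fences(elapsed, 1.5)  # also contains the extreme ones
extreme = outside_fences(elapsed, 3)
print(f"outside 1.5*IQR: {len(mild)}, outside 3*IQR: {len(extreme)}")
```

Unlike a fixed percentile band, IQR fences flag nothing when the data has no far-out values, so the wider multiplier of 3 isolates only the most extreme runs.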