@alfred-landrum
Forked from welch/README.md
Last active August 29, 2015 14:14
Smooth

Exponential smoother/forecaster with de-seasonalization

Smooth models an unobserved level and trend component in a noisy signal, along with optional "seasonal" effects at the level of day-over-day variation. The result is a smoothed version of the signal which may be used to forecast future values or detect unexpected variation.
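
The level/trend/seasonal decomposition can be sketched in a few lines of Python. This is a minimal illustration of the additive update rule, not the Juttle implementation below; the name `make_smoother` and the single flat seasonal array are invented for this sketch.

```python
# Minimal additive smoother: level + trend + seasonal, updated per observation.
# slevel, strend, sdaily play the same roles as the Juttle parameters below;
# season_len is the number of seasonal slots (the Juttle code fits a smooth
# cyclic curve instead of discrete slots).

def make_smoother(slevel, strend, sdaily, season_len):
    state = {"level": None, "trend": 0.0, "seasonal": [0.0] * season_len, "t": 0}

    def step(y):
        s = state
        i = s["t"] % season_len
        if s["level"] is None:           # initialize on the first point
            s["level"] = y
            yhat = y
        else:
            yhat = s["level"] + s["trend"] + s["seasonal"][i]  # 1-step forecast
            err = y - yhat                                     # forecast error
            # error-correction updates: each factor moves toward the new point
            s["seasonal"][i] += sdaily * (y - s["level"] - s["seasonal"][i])
            s["trend"] += slevel * strend * err
            s["level"] += s["trend"] + slevel * err
        s["t"] += 1
        return yhat

    return step
```

Each call returns the forecast made *before* seeing the new point, which is what makes the output usable for anomaly detection.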

In this example, we apply our smoother to a sample 10-day timeseries published in Twitter's AnomalyDetection package. The series is 10 days of counts, with a regular daily pattern and occasional surprises (spikes and dips). The Juttle output focuses on the three days around Oct 1.

The first timechart displays the decomposition of the raw data (orange) into a repeating daily component (green curve) and extra variation (yellow curve) to produce a smoothed forecast curve (blue). Each point in the blue curve was estimated from data received prior to that moment in time. The pink curve shows smoothed RMS prediction error.

The daily variation is continually re-estimated as data arrives. It takes a few days' worth of data for the cyclic estimate to settle down. If you were running this smoother as part of a realtime monitoring query, you would issue the query with a timerange that included a few days history prior to realtime. The cyclic component would be estimated immediately using this historic data before rolling into realtime.

The parameters for this forecaster (slevel, sdaily) were selected ad hoc (though without much fuss). In a subsequent post we will show how to use likelihood methods to choose parameter values automatically from historic data.

Detecting Anomalies

The second timechart in the Juttle output displays the forecast curve (blue) along with confidence bands computed from the prediction error (teal, gold). When an observation falls outside the confidence interval, it may indicate an anomaly.
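
A minimal sketch of this banding logic in Python (illustrative names only; unlike forecastPI below, which smooths the current squared error into stderr before banding, this sketch compares each point against the error estimate built from prior points and only then folds the new error in):

```python
from math import sqrt

# Flag points that fall outside yhat +/- z * stderr, where stderr is an
# exponentially smoothed RMS of past 1-step prediction errors. alpha plays
# the role of forecastPI's -alpha parameter.

def make_band_detector(z, alpha):
    mse = {"v": None}

    def check(y, yhat):
        e2 = (y - yhat) ** 2
        if mse["v"] is None:             # first point: just seed the estimate
            mse["v"] = e2
            return yhat, yhat, False
        stderr = sqrt(mse["v"])          # based on errors seen so far
        lo, hi = yhat - z * stderr, yhat + z * stderr
        mse["v"] += alpha * (e2 - mse["v"])   # exponential smoothing of squared error
        return lo, hi, not (lo <= y <= hi)

    return check
```

The closure returns the band endpoints along with the anomaly flag so callers can chart the interval as well as alert on it.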

Quick blips like the one on September 29 show up well using confidence intervals. The big September 30 anomaly shows a shortcoming of confidence intervals computed this way -- if an anomaly persists for more than a blip, it can inflate the estimate of the signal's inherent variation, expanding the confidence interval and possibly masking other anomalies until things have quieted down. In a subsequent post we'll show how to automatically adjust the forecast level and confidence intervals to give less weight to "surprising" observations (using Kalman filtering) and thus recover more quickly when such anomalies are encountered.

Kinks in the error curve (pink, first chart) also indicate sudden changes in the forecasted timeseries. Depending on the kind of behavior you consider anomalous (rapid rise? rapid fall? sustained change from usual daily value?) you may choose to use error thresholds, error change thresholds, or confidence intervals to detect it. Smooth computes all of these for you, so you can use whichever best suits.
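
An error-change detector, for example, can be as simple as thresholding the backward difference of the smoothed error (the analogue of the Y_derr field produced by forecastPI below). This Python sketch is illustrative; the threshold and per-step units are arbitrary choices:

```python
# Flag sudden "kinks" by thresholding the backward difference of the
# smoothed prediction error between successive points.

def make_kink_detector(threshold):
    prev = {"err": None}

    def check(stderr):
        if prev["err"] is None:          # no difference on the first point
            prev["err"] = stderr
            return 0.0, False
        derr = stderr - prev["err"]      # backward difference per step
        prev["err"] = stderr
        return derr, abs(derr) > threshold

    return check
```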

// interpolation functions.
//
// project_time: project time onto a cycle of intervals of duration period/n.
// Return [idx, u] where 0 <= idx < n is the periodic interval this
// time falls into, and 0 <= u < 1 is the coordinate within the
// unit interval
//
export function project_time(time, n, period) {
    var interval = period / n;
    var t0 = Date.quantize(time, interval);
    var u = (time - t0) / interval;
    var idx = Math.floor((t0 - Date.new(0)) / interval) % n;
    return [idx, u];
}
//
// linear: piecewise-linear interpolation (periodic)
// interpolate values to get a continuous periodic function of time.
// A is an array of samples representing an overall cycle of duration period
// with samples uniformly spaced and s.t. A[0] should be the value at t==0.
//
export function linear(time, A, len, period) {
    var IDX = project_time(time, len, period);
    var idx = IDX[0], u = IDX[1];
    return (1 - u) * A[idx] + u * A[(idx + 1) % len];
}
//
// catrom_weights: weighting for successive points for a given 0 <= u < 1
//
function catrom_weights(u) {
    var u2 = u * u;
    var u3 = u2 * u;
    var W = [-u / 2 + u2 - u3 / 2,
             1 - 2.5 * u2 + 1.5 * u3,
             u / 2 + 2 * u2 - 1.5 * u3,
             -u2 / 2 + u3 / 2];
    return W;
}
//
// project_cubic: cubic spline interpolation along with segment index and weights
// Return an array [val, idx, W], where
// val == W[0]*A[idx] + W[1]*A[idx+1] + W[2]*A[idx+2] + W[3]*A[idx+3], indexed modulo len
//
export function project_cubic(time, A, len, period) {
    var IDX = project_time(time, len, period);
    var idx = (IDX[0] + len - 1) % len, u = IDX[1];
    var W = catrom_weights(u);
    var val = W[0]*A[idx] + W[1]*A[(idx+1) % len] + W[2]*A[(idx+2) % len] + W[3]*A[(idx+3) % len];
    return [val, idx, W];
}
//
// cubic: cubic spline interpolation (periodic)
// interpolate values to get a smooth periodic function of time.
// A is an array of samples representing an overall cycle of duration period
// with samples uniformly spaced and s.t. A[0] should be the value at t==0.
//
export function cubic(time, A, len, period) {
    return project_cubic(time, A, len, period)[0];
}
//
// demo: draw two periods of the cycle, interpolating linearly and cubically
//
const values = [10, 5, 20, 15, 30];
const N = 5;
emitter -start 0 -limit 60
| put pwl = linear(time, values, N, :30s:), curve = cubic(time, values, N, :30s:)
| split pwl, curve
| @timechart
// Forecasting and anomaly example using the Twitter anomaly timeseries from
// https://blog.twitter.com/2015/introducing-practical-and-robust-anomaly-detection-in-a-time-series
// (this is a different computational approach, applied to the test data supplied with their R package)
//
import "smooth.juttle" as Sm;
source "https://gist.githubusercontent.com/welch/a34b41e57db2075b5c4a/raw/f215975aa432cc5ea254769f3ebef77806d2aa44/twitter-anomaly.json"
| Sm.forecastPI -Y "count" -z 2 -alpha .5 -slevel .1 -strend null -sdaily .2
|( keep count, count_pred, count_level, count_trend, count_day, count_err, time
| split count, count_pred, count_level, count_trend, count_day, count_err
| filter time < :1980-10-02: | @timechart -display.duration :3d: -display.dataDensity 0
; keep count, count_pred, count_low, count_hi, time
| split count, count_pred, count_low, count_hi
| filter time < :1980-10-02: | @timechart -display.duration :3d: -display.dataDensity 0
)
// Exponential smoother/forecaster with de-seasonalization, based on Holt-Winters.
// This models an unobserved level and trend component in a noisy signal, along with
// optional "seasonal" effects at the level of day-over-day variation.
// The result is a smoothed version of the signal which may be used to forecast
// future values or detect unexpected variation.
//
// Y is the point field to smooth.
// slevel, strend, sdaily are smoothing factors, numbers ranging [0..1].
// They determine how quickly the smoother adjusts to new values as they arrive.
// Setting a factor to 0 freezes the feature at its initial estimate,
// while setting a factor to 1 causes that feature to depend solely
// on the most recent point. Setting strend or sdaily to null
// removes that feature from the model.
//
// smooth assumes arriving points are equally spaced in time.
// smooth returns the forecast of the current value (using state prior to the
// current value) and also updates to internal state estimates, as an array:
// [value, level, trend, daily]
// level is the estimate of the deseasonalized series value, trend the estimate of the
// deseasonalized series slope. daily is the estimated seasonal factor
// to be used at the next occurrence of the current position in the cycle.
//
// This implementation differs from additive Holt-Winters in its handling of
// seasonality. It independently fits a C1 piecewise cubic curve to
// day-over-day values, and maintains this via gradient descent on prediction error
// rather than as part of a recursive Winters update (the small changes to make this
// a Winters-style update are commented in the code below).
//
import "interpolate.juttle" as interp;

export reducer smooth(Y, slevel, strend, sdaily) {
    var level, trend, day, daily, time, yhat;
    var initial = true, n_daily = 6;
    function predict() {
        return level + trend + day;
    }
    function update() {
        var err, Wday, i_daily, W, du;
        if (initial) {
            yhat = *Y;
            level = 0;
            trend = 0;
            day = 0;
            time = *"time";
            daily = [*Y, *Y, *Y, *Y, *Y, *Y];
            initial = false;
        } else {
            // error-correction form of Holt-Winters, eg, https://www.otexts.org/fpp
            // du = 1 ; // Winters-style: don't scale by interval
            du = (*"time" - time) / :hour: ;
            time = *"time";
            if (sdaily != null) {
                // get the previous daily factors for this time
                Wday = interp.project_cubic(*"time", daily, n_daily, :day:);
                day = Wday[0];
                i_daily = Wday[1];
                W = Wday[2];
            }
            yhat = predict(); // 1-step forecast from previous state as reducer return value
            err = *Y - yhat;
            if (sdaily != null) {
                // update day and project the change back onto the daily array
                // var derr = err ; // Winters-style: recursive update sees level and trend error
                var derr = (*Y - day);
                var dday = sdaily * derr * du;
                day = day + dday;
                daily[i_daily] = daily[i_daily] + dday * W[0];
                daily[(i_daily + 1) % n_daily] = daily[(i_daily + 1) % n_daily] + dday * W[1];
                daily[(i_daily + 2) % n_daily] = daily[(i_daily + 2) % n_daily] + dday * W[2];
                daily[(i_daily + 3) % n_daily] = daily[(i_daily + 3) % n_daily] + dday * W[3];
            }
            if (strend != null) {
                trend = trend + slevel * strend * err * du;
            }
            level = level + trend + slevel * err * du;
        }
    }
    function result() {
        return [yhat, level, trend, day];
    }
    function reset() {
        initial = true;
    }
}
// derivative (d/dt) of a timeseries, approximated
// as a backwards difference of Y vs the time field.
reducer ddt(Y, units) {
    var y, time, dy, dt;
    var initial = true;
    function update() {
        if (initial) {
            dy = 0;
            y = *Y;
            dt = :1s:;
            time = *"time";
            initial = false;
        } else {
            dy = *Y - y;
            y = *Y;
            dt = *"time" - time;
            time = *"time";
        }
    }
    function result() {
        return dy * Duration.seconds(units) / Duration.seconds(dt);
    }
    function reset() {
        initial = true;
    }
}
// Compute a forecast along with prediction intervals based
// on a smoothed stderr estimate. Y is the field to forecast. z
// is the number of standard deviations from mean to include in
// the prediction interval (z = 1.96 is the 97.5th percentile and thus
// a 95% confidence bound, if errors were independent and normally
// distributed (often untrue)).
// slevel and strend are smoothing factors for Holt.
//
// For a field name Y, this populates the current point with:
// Y_pred: predicted next y value
// Y_level: de-seasonalized prediction
// Y_trend: predicted next slope
// Y_day: predicted next daily seasonal factor
// Y_err: estimated stderr (RMS prediction error over time)
// Y_hi, Y_low: prediction interval bounds
//
export sub forecastPI(Y, z, alpha, slevel, strend, sdaily) {
    put state = smooth(Y, slevel, strend, sdaily)
    | put err2 = (*Y - state[0]) * (*Y - state[0])
    | put stderr = Math.sqrt(smooth(err2, alpha, null, null)[0])
    | put *(Y+"_pred") = state[0]
    | put *(Y+"_level") = state[1]
    | put *(Y+"_trend") = state[2]
    | put *(Y+"_day") = state[3]
    | put *(Y+"_err") = stderr
    | put *(Y+"_derr") = ddt(stderr, :hour:)
    | put *(Y+"_hi") = state[0] + z * stderr
    | put *(Y+"_low") = state[0] - z * stderr
    | remove state, err2, stderr
}
// Data generation utilities for demos:
sub hourly_cpu(from, to) {
    demo cdn metrics 'cpu' -from from -to to -every :m:
        -nhosts 4 -dos .5 -dos_dur :5m: -ripple .3 -cpu_alpha 0.8
    | filter host ~ 'sjc*'
    | put cpu = value
    | remove type, host, pop, value
}
sub daily_cpu(from, to, every) {
    demo cdn metrics 'cpu' -from from -to to -every every
        -nhosts 4 -dos 0.3 -dos_dur :30m: -cpu_alpha 0.8 -daily 4
    | filter host ~ 'sjc*'
    | put cpu = value
    | remove type, host, pop, value
}
sub bumpy_cpu(from, to) {
    demo cdn metrics 'cpu' -from from -to to -every :5m:
        -nhosts 4 -dos .3 -dos_dur :15s: -daily 3
    | filter host ~ 'sjc*'
    | put cpu = value
    | remove type, host, pop, value
}
// Demo: forecasting for simulated CPU
//
// This demo performs a smoothing forecast including a continuous
// model of repeated day-to-day variation. It displays timeseries of
// the original input, the predicted/smoothed input, a 95% confidence
// interval around this prediction, and the stderr. It also displays
// the prediction components: level, trend and daily variation.
// The seasonal component needs to see a few days' data to settle in.
//
daily_cpu -from :2014-01-01: -to :2014-01-08: -every :10m:
| forecastPI -Y "cpu" -z 1.96 -alpha .5 -slevel .1 -strend null -sdaily .2
|( keep cpu, cpu_pred, cpu_level, cpu_trend, cpu_day, cpu_err, cpu_derr, time
| split cpu, cpu_pred, cpu_level, cpu_trend, cpu_day, cpu_err, cpu_derr
| @timechart -display.duration :3d:
; keep cpu, cpu_pred, cpu_low, cpu_hi, time
| split cpu, cpu_pred, cpu_low, cpu_hi
| @timechart -display.duration :3d:
)