Skip to content

Instantly share code, notes, and snippets.

@vene
Last active August 29, 2015 14:09
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save vene/832ca45cdf9a3cade9d3 to your computer and use it in GitHub Desktop.
Save vene/832ca45cdf9a3cade9d3 to your computer and use it in GitHub Desktop.
# Author: Vlad Niculae <vlad@vene.ro>
# License: 2-clause BSD
"""2D implementation of the robust Siegel Repeated Median slope estimator
This estimator tolerates corruption of up to 50% of the input points in either
the X or the Y dimension.
Vectorized implementation, and a naive implementation for sanity-check.
"""
import numpy as np
def siegel_slow(x, y):
n_obs = len(x)
slopes = []
for i in range(n_obs):
slopes_i = []
for j in range(n_obs):
if i == j:
continue
slopes_i.append((y[j] - y[i]) / (x[j] - x[i]))
slopes.append(np.median(slopes_i))
return np.median(slopes)
def siegel_fast(x, y):
# slopes computation reused from scipy.stats.theilslopes
x = np.asarray(x)
y = np.asarray(y)
deltax = x[:, np.newaxis] - x
deltay = y[:, np.newaxis] - y
olderr = np.seterr(all='ignore')
try:
slopes = deltay / deltax
finally:
np.seterr(**olderr)
return np.median(np.nanmedian(slopes, axis=0))
if __name__ == '__main__':
# no noise
x = np.random.randn(10)
y = 10 + 3 * x
print siegel_slow(x, y) # should be 3
print siegel_fast(x, y)
y_noise = y.copy()
x_noise = x.copy()
x_noise[0] *= 100.0
y_noise[-1] *= -100.0
print siegel_slow(x_noise, y_noise) # should also be 3
print siegel_fast(x_noise, y_noise)
y_noise += 0.1 * np.random.randn(10)
x_noise += 0.1 * np.random.randn(10)
print siegel_slow(x_noise, y_noise) # should also be close to 3
print siegel_fast(x_noise, y_noise)
@eickenberg
Copy link

What type of data would you use this on typically? I ask because siegel_fast is currently O(n_obs ** 2) in memory. In that sense the siegel_slow implementation could be rendered more memory efficient by using some sort of incremental median and a jit compiler.

@vene
Copy link
Author

vene commented Aug 28, 2015

@eickenberg: sorry for catching this so late, the notification probably got buried. I was trying to use this to summarize the trend of a short (3-10 values) but variable-length sequence and use that as a further feature. So the memory concern wasn't an issue. But yeah, something like a numba version might be an interesting exercise.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment