Last active
August 29, 2015 14:09
-
-
Save vene/832ca45cdf9a3cade9d3 to your computer and use it in GitHub Desktop.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
# Author: Vlad Niculae <vlad@vene.ro> | |
# License: 2-clause BSD | |
"""2D implementation of the robust Siegel Repeated Median slope estimator | |
This estimator tolerates corruption of up to 50% of the input points in either | |
the X or the Y dimension. | |
Vectorized implementation, and a naive implementation for sanity-check. | |
""" | |
import numpy as np | |
def siegel_slow(x, y): | |
n_obs = len(x) | |
slopes = [] | |
for i in range(n_obs): | |
slopes_i = [] | |
for j in range(n_obs): | |
if i == j: | |
continue | |
slopes_i.append((y[j] - y[i]) / (x[j] - x[i])) | |
slopes.append(np.median(slopes_i)) | |
return np.median(slopes) | |
def siegel_fast(x, y): | |
# slopes computation reused from scipy.stats.theilslopes | |
x = np.asarray(x) | |
y = np.asarray(y) | |
deltax = x[:, np.newaxis] - x | |
deltay = y[:, np.newaxis] - y | |
olderr = np.seterr(all='ignore') | |
try: | |
slopes = deltay / deltax | |
finally: | |
np.seterr(**olderr) | |
return np.median(np.nanmedian(slopes, axis=0)) | |
if __name__ == '__main__': | |
# no noise | |
x = np.random.randn(10) | |
y = 10 + 3 * x | |
print siegel_slow(x, y) # should be 3 | |
print siegel_fast(x, y) | |
y_noise = y.copy() | |
x_noise = x.copy() | |
x_noise[0] *= 100.0 | |
y_noise[-1] *= -100.0 | |
print siegel_slow(x_noise, y_noise) # should also be 3 | |
print siegel_fast(x_noise, y_noise) | |
y_noise += 0.1 * np.random.randn(10) | |
x_noise += 0.1 * np.random.randn(10) | |
print siegel_slow(x_noise, y_noise) # should also be close to 3 | |
print siegel_fast(x_noise, y_noise) |
@eickenberg: sorry for catching this so late, the notification probably got buried. I was trying to use this to summarize the trend of a short (3-10 values) but variable-length sequence and use that as a further feature. So the memory concern wasn't an issue. But yeah, something like a numba version might be an interesting exercise.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
What type of data would you use this on typically? I ask because
siegel_fast
is currentlyO(n_obs ** 2)
in memory. In that sense thesiegel_slow
implementation could be rendered more memory efficient by using some sort of incremental median and a jit compiler.