Instantly share code, notes, and snippets.

# thomasdarimont/100_redis_median_approx.md Last active Jul 28, 2019

PoC for approximating the median of a Stream via stochastic averaging in Redis with Lua

# Approximating the median of a Stream via stochastic averaging

Often it is useful to have access to the median value for fields of a data stream since they are more robust with respect to outliers. The median is defined as the value of a dataset such that, when sorted, 50% of the data is smaller than the value and 50% of the data is larger then the value. Ordinarily this is difficult to calculate on a stream because it requires the collection and sorting of all data.

The median of a data stream can be approximated with a technique called stochastic averaging. To approximate the median value of a data stream one could use the following approach:

Given the current estimate of the median M. If the next observed value in the stream is larger than M, increase the current estimate by r (= the learning rate). If it is smaller, decrease the estimate by r. When M is close to the median, it increases as often as it decreases, and therefore it stabilizes.

This approach was taken from the book "Real-time Analytics - Techniques to Analyze and Visualize Streaming Data" P. 296 / Byron Ellis / Wiley

# The data structure

We store the data we need for the necessary stats book-keeping in a redis hash.

``````hmset some_stats_value
current NaN -- the current / last seen value, defaults to NaN
median NaN -- the current median, defaults to NaN
learning_rate 0.7 -- the learning rate - how fast to move towards the read median
``````

Paste into redis:

``````hmset some_stats_value current NaN median NaN learning_rate 0.7
``````

# Compute the median estimation

```-- script load "
local key, value = KEYS, tonumber(ARGV);

local values = redis.call('HMGET', key, 'current', 'median', 'learning_rate');
local current, median, learning_rate = tonumber(values), tonumber(values), tonumber(values);

if (value ~= median) then -- only update of median is different than current value

if (median ~= median) then -- check for NaN
median = value; -- initialize median
else
median = median + (median < value and learning_rate or -learning_rate); -- update median
end;

redis.call('HMSET', key, 'current', value, 'median', median);
end;

if (ARGV == 'return_median') then
return {'current', value, 'median', median};
end;
--"```

Paste into redis: Note that I removed the comments from the script above.

``````script load "local key, value = KEYS, tonumber(ARGV); local values = redis.call('HMGET', key, 'current', 'median', 'learning_rate'); local current, median, learning_rate = tonumber(values), tonumber(values), tonumber(values); if (value ~= median) then if (median ~= median) then median = value; else median = median + (median < value and learning_rate or -learning_rate); end; redis.call('HMSET', key, 'current', value, 'median', median); end; if(ARGV=='return_median') then return {'current', value, 'median', median}; end;"
``````

Note that the `script load` command should return the sha hash `f572d4de2b58fd955e2b29ae3c55f6c0c0972553`.

# Usage

• Variant 1 Just updates the median.
``````evalsha SCRIPT_SHA 1 THE_STATS_KEY THE_VALUE
``````
• Variant 2 Update median and return stored values.
``````evalsha SCRIPT_SHA 1 THE_STATS_KEY THE_VALUE return_median
``````

# Example

Initialize stats container `some_stats_value`:

``````hmset some_stats_value current NaN median NaN learning_rate 0.7
``````

Send some data to the script:

``````evalsha f572d4de2b58fd955e2b29ae3c55f6c0c0972553 1 some_stats_value 100
evalsha f572d4de2b58fd955e2b29ae3c55f6c0c0972553 1 some_stats_value 150
evalsha f572d4de2b58fd955e2b29ae3c55f6c0c0972553 1 some_stats_value 200
evalsha f572d4de2b58fd955e2b29ae3c55f6c0c0972553 1 some_stats_value 10
evalsha f572d4de2b58fd955e2b29ae3c55f6c0c0972553 1 some_stats_value 20
evalsha f572d4de2b58fd955e2b29ae3c55f6c0c0972553 1 some_stats_value 30
evalsha f572d4de2b58fd955e2b29ae3c55f6c0c0972553 1 some_stats_value 40
evalsha f572d4de2b58fd955e2b29ae3c55f6c0c0972553 1 some_stats_value 50
evalsha f572d4de2b58fd955e2b29ae3c55f6c0c0972553 1 some_stats_value 90
evalsha f572d4de2b58fd955e2b29ae3c55f6c0c0972553 1 some_stats_value 500
evalsha f572d4de2b58fd955e2b29ae3c55f6c0c0972553 1 some_stats_value 299
evalsha f572d4de2b58fd955e2b29ae3c55f6c0c0972553 1 some_stats_value 344
evalsha f572d4de2b58fd955e2b29ae3c55f6c0c0972553 1 some_stats_value 999
evalsha f572d4de2b58fd955e2b29ae3c55f6c0c0972553 1 some_stats_value 73
evalsha f572d4de2b58fd955e2b29ae3c55f6c0c0972553 1 some_stats_value 89

``````

Print the values:

``````127.0.0.1:6379> hgetall some_stats_value
1) "current"
2) "89"
3) "median"
4) "98.599999999999994"
5) "learning_rate"
6) "0.7"
``````

Note that the actual median in this example is 90.

Owner Author

### thomasdarimont commented Dec 1, 2014 • edited

 Idea: Compute a "stabilization factor" sf -> count how many times we are above / below "the current median". If both counts are roughly equal - then the current median value is near the actual median - otherwise we know whether we are below or above. sf = (count we are to small + 1) / (count we are to big + 1) -> we know that we are near true median once sf approaches 1.0 +- epsilon.