thomasdarimont/100_redis_median_approx.md

## 100_redis_median_approx.md

      
    Raw
  

              100_redis_median_approx.md
            
          
    Approximating the median of a Stream via stochastic averaging

Often it is useful to have access to the median value for fields of a data stream since they are more robust with respect to outliers.
The median is defined as the value of a dataset such that, when sorted, 50% of the data is smaller than the value and 50% of the data is larger then the value. Ordinarily this is difficult to calculate on a stream because it requires the collection and sorting of all data.
The median of a data stream can be approximated with a technique called stochastic averaging. To approximate the median value of a data stream one could use the following approach:
Given the current estimate of the median M. If the next observed value in the stream is larger than M, increase the current estimate by r (= the learning rate). If it is smaller, decrease the estimate by r. When M is close to the median, it increases as often as it decreases, and therefore it stabilizes.
This approach was taken from the book "Real-time Analytics - Techniques to Analyze and Visualize Streaming Data" P. 296 / Byron Ellis / Wiley
The data structure

We store the data we need for the necessary stats book-keeping in a redis hash.
hmset some_stats_value
  current NaN -- the current / last seen value, defaults to NaN
  median NaN -- the current median, defaults to NaN
  learning_rate 0.7 -- the learning rate - how fast to move towards the read median

Paste into redis:
hmset some_stats_value current NaN median NaN learning_rate 0.7

Compute the median estimation

-- script load "
local key, value = KEYS[1], tonumber(ARGV[1]);

local values = redis.call('HMGET', key, 'current', 'median', 'learning_rate');
local current, median, learning_rate = tonumber(values[1]), tonumber(values[2]), tonumber(values[3]);

if (value ~= median) then -- only update of median is different than current value

  if (median ~= median) then -- check for NaN
    median = value; -- initialize median
  else 
    median = median + (median < value and learning_rate or -learning_rate); -- update median 
  end;

  redis.call('HMSET', key, 'current', value, 'median', median);
end;

if (ARGV[2] == 'return_median') then
  return {'current', value, 'median', median};
end;
--"
Paste into redis:
Note that I removed the comments from the script above.
script load "local key, value = KEYS[1], tonumber(ARGV[1]); local values = redis.call('HMGET', key, 'current', 'median', 'learning_rate'); local current, median, learning_rate = tonumber(values[1]), tonumber(values[2]), tonumber(values[3]); if (value ~= median) then if (median ~= median) then median = value; else median = median + (median < value and learning_rate or -learning_rate); end; redis.call('HMSET', key, 'current', value, 'median', median); end; if(ARGV[2]=='return_median') then return {'current', value, 'median', median}; end;"

Note that the script load command should return the sha hash f572d4de2b58fd955e2b29ae3c55f6c0c0972553.
Usage


Variant 1
Just updates the median.

evalsha SCRIPT_SHA 1 THE_STATS_KEY THE_VALUE 


Variant 2
Update median and return stored values.

evalsha SCRIPT_SHA 1 THE_STATS_KEY THE_VALUE return_median

Example

Initialize stats container some_stats_value:
hmset some_stats_value current NaN median NaN learning_rate 0.7

Send some data to the script:
evalsha f572d4de2b58fd955e2b29ae3c55f6c0c0972553 1 some_stats_value 100
evalsha f572d4de2b58fd955e2b29ae3c55f6c0c0972553 1 some_stats_value 150
evalsha f572d4de2b58fd955e2b29ae3c55f6c0c0972553 1 some_stats_value 200
evalsha f572d4de2b58fd955e2b29ae3c55f6c0c0972553 1 some_stats_value 10
evalsha f572d4de2b58fd955e2b29ae3c55f6c0c0972553 1 some_stats_value 20
evalsha f572d4de2b58fd955e2b29ae3c55f6c0c0972553 1 some_stats_value 30
evalsha f572d4de2b58fd955e2b29ae3c55f6c0c0972553 1 some_stats_value 40
evalsha f572d4de2b58fd955e2b29ae3c55f6c0c0972553 1 some_stats_value 50
evalsha f572d4de2b58fd955e2b29ae3c55f6c0c0972553 1 some_stats_value 90
evalsha f572d4de2b58fd955e2b29ae3c55f6c0c0972553 1 some_stats_value 500
evalsha f572d4de2b58fd955e2b29ae3c55f6c0c0972553 1 some_stats_value 299
evalsha f572d4de2b58fd955e2b29ae3c55f6c0c0972553 1 some_stats_value 344
evalsha f572d4de2b58fd955e2b29ae3c55f6c0c0972553 1 some_stats_value 999
evalsha f572d4de2b58fd955e2b29ae3c55f6c0c0972553 1 some_stats_value 73
evalsha f572d4de2b58fd955e2b29ae3c55f6c0c0972553 1 some_stats_value 89


Print the values:
127.0.0.1:6379> hgetall some_stats_value
1) "current"
2) "89"
3) "median"
4) "98.599999999999994"
5) "learning_rate"
6) "0.7"

Note that the actual median in this example is 90.