Skip to content

Instantly share code, notes, and snippets.

Embed
What would you like to do?
PoC for approximating the median of a Stream via stochastic averaging in Redis with Lua

Approximating the median of a Stream via stochastic averaging

Often it is useful to have access to the median value for fields of a data stream since they are more robust with respect to outliers. The median is defined as the value of a dataset such that, when sorted, 50% of the data is smaller than the value and 50% of the data is larger then the value. Ordinarily this is difficult to calculate on a stream because it requires the collection and sorting of all data.

The median of a data stream can be approximated with a technique called stochastic averaging. To approximate the median value of a data stream one could use the following approach:

Given the current estimate of the median M. If the next observed value in the stream is larger than M, increase the current estimate by r (= the learning rate). If it is smaller, decrease the estimate by r. When M is close to the median, it increases as often as it decreases, and therefore it stabilizes.

This approach was taken from the book "Real-time Analytics - Techniques to Analyze and Visualize Streaming Data" P. 296 / Byron Ellis / Wiley

The data structure

We store the data we need for the necessary stats book-keeping in a redis hash.

hmset some_stats_value
  current NaN -- the current / last seen value, defaults to NaN
  median NaN -- the current median, defaults to NaN
  learning_rate 0.7 -- the learning rate - how fast to move towards the read median

Paste into redis:

hmset some_stats_value current NaN median NaN learning_rate 0.7

Compute the median estimation

-- script load "
local key, value = KEYS[1], tonumber(ARGV[1]);

local values = redis.call('HMGET', key, 'current', 'median', 'learning_rate');
local current, median, learning_rate = tonumber(values[1]), tonumber(values[2]), tonumber(values[3]);

if (value ~= median) then -- only update of median is different than current value

  if (median ~= median) then -- check for NaN
    median = value; -- initialize median
  else 
    median = median + (median < value and learning_rate or -learning_rate); -- update median 
  end;

  redis.call('HMSET', key, 'current', value, 'median', median);
end;

if (ARGV[2] == 'return_median') then
  return {'current', value, 'median', median};
end;
--"

Paste into redis: Note that I removed the comments from the script above.

script load "local key, value = KEYS[1], tonumber(ARGV[1]); local values = redis.call('HMGET', key, 'current', 'median', 'learning_rate'); local current, median, learning_rate = tonumber(values[1]), tonumber(values[2]), tonumber(values[3]); if (value ~= median) then if (median ~= median) then median = value; else median = median + (median < value and learning_rate or -learning_rate); end; redis.call('HMSET', key, 'current', value, 'median', median); end; if(ARGV[2]=='return_median') then return {'current', value, 'median', median}; end;"

Note that the script load command should return the sha hash f572d4de2b58fd955e2b29ae3c55f6c0c0972553.

Usage

  • Variant 1 Just updates the median.
evalsha SCRIPT_SHA 1 THE_STATS_KEY THE_VALUE 
  • Variant 2 Update median and return stored values.
evalsha SCRIPT_SHA 1 THE_STATS_KEY THE_VALUE return_median

Example

Initialize stats container some_stats_value:

hmset some_stats_value current NaN median NaN learning_rate 0.7

Send some data to the script:

evalsha f572d4de2b58fd955e2b29ae3c55f6c0c0972553 1 some_stats_value 100
evalsha f572d4de2b58fd955e2b29ae3c55f6c0c0972553 1 some_stats_value 150
evalsha f572d4de2b58fd955e2b29ae3c55f6c0c0972553 1 some_stats_value 200
evalsha f572d4de2b58fd955e2b29ae3c55f6c0c0972553 1 some_stats_value 10
evalsha f572d4de2b58fd955e2b29ae3c55f6c0c0972553 1 some_stats_value 20
evalsha f572d4de2b58fd955e2b29ae3c55f6c0c0972553 1 some_stats_value 30
evalsha f572d4de2b58fd955e2b29ae3c55f6c0c0972553 1 some_stats_value 40
evalsha f572d4de2b58fd955e2b29ae3c55f6c0c0972553 1 some_stats_value 50
evalsha f572d4de2b58fd955e2b29ae3c55f6c0c0972553 1 some_stats_value 90
evalsha f572d4de2b58fd955e2b29ae3c55f6c0c0972553 1 some_stats_value 500
evalsha f572d4de2b58fd955e2b29ae3c55f6c0c0972553 1 some_stats_value 299
evalsha f572d4de2b58fd955e2b29ae3c55f6c0c0972553 1 some_stats_value 344
evalsha f572d4de2b58fd955e2b29ae3c55f6c0c0972553 1 some_stats_value 999
evalsha f572d4de2b58fd955e2b29ae3c55f6c0c0972553 1 some_stats_value 73
evalsha f572d4de2b58fd955e2b29ae3c55f6c0c0972553 1 some_stats_value 89

Print the values:

127.0.0.1:6379> hgetall some_stats_value
1) "current"
2) "89"
3) "median"
4) "98.599999999999994"
5) "learning_rate"
6) "0.7"

Note that the actual median in this example is 90.

@thomasdarimont

This comment has been minimized.

Copy link
Owner Author

thomasdarimont commented Dec 1, 2014

Idea: Compute a "stabilization factor" sf -> count how many times we are above / below "the current median". If both counts are roughly equal - then the current median value is near the actual median - otherwise we know whether we are below or above.

sf = (count we are to small + 1) / (count we are to big + 1) -> we know that we are near true median once sf approaches 1.0 +- epsilon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.