Skip to content

Instantly share code, notes, and snippets.

@dah33
Created July 12, 2021 08:54
Show Gist options
  • Save dah33/79fea4c586b201cb9e798c7383538301 to your computer and use it in GitHub Desktop.
Save dah33/79fea4c586b201cb9e798c7383538301 to your computer and use it in GitHub Desktop.
Mutual information in SQL between discrete variables
with t1 as (
select
column1 as x,
column2 as y
from your_table
),
t as (
select x, y
from t1
where x is not null and y is not null
),
n as ( select count(*)::real as n from t ),
x as ( select x, count(*)::real as cx from t group by 1 ),
y as ( select y, count(*)::real as cy from t group by 1 ),
xy as ( select x,y, count(*)::real as cxy from t group by 1,2 ),
ixy as ( select sum(cxy/n * (ln(n)+ln(cxy)-ln(cx)-ln(cy))) as ixy from xy join x on xy.x = x.x join y on xy.y = y.y, n ),
hxy as ( select -sum(cxy/n * (ln(cxy)-ln(n))) as hxy from xy, n )
--select 1 - ixy/hxy from hxy, ixy; -- Jaccard distance
select ixy from ixy; -- Mutual information: I(X;Y)
@dah33
Copy link
Author

dah33 commented Jul 12, 2021

Removes records where x or y is NULL (i.e. missing observations). To keep these as known "other" values, rather than missing, replace the t CTE with:

t as (
  select
  	dense_rank() over (order by x) as x,
  	dense_rank() over (order by y) as y
  from t1
)

This chooses a sensible value to infill NULLs and works for various data types.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment