Skip to content

Instantly share code, notes, and snippets.

@dlebech
Last active August 15, 2022 20:41
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save dlebech/5e2885e852bdca38d99c27a55fb6636c to your computer and use it in GitHub Desktop.
Save dlebech/5e2885e852bdca38d99c27a55fb6636c to your computer and use it in GitHub Desktop.
Binomial probability calculation function in SQL (BigQuery)
-- Public Domain CC0 license. https://creativecommons.org/publicdomain/zero/1.0/
-- Calculate the probability of k successes for n trials with probability of success k,
-- using the binomial distribution.
-- Calculate the binomial coefficient using the "multiplicative formula"
CREATE OR REPLACE FUNCTION functions.binomial_coef(n INT64, k INT64) AS ((
-- k!/(n!*(n-k)!)
-- We're going to have a hard time doing factorials here,
-- but based on the "multiplicative formula" in Wiki, it should be possible:
-- https://en.wikipedia.org/wiki/Binomial_coefficient#Computing_the_value_of_binomial_coefficients
-- Taken together with the fact that we can do EXP(SUM(LOG(x))) to do product over multiple rows, we have:
SELECT
-- Binomial coefficient is always an integer,
-- but BigQuery is not super happy about EXP(SUM(LN(x))) trick,
-- so the ROUND is added to be safe :-)
ROUND(EXP(SUM(LN((n + 1 - i) / i))))
FROM
-- Take advantage of symmetry (n // k) == (n // (n-k))
UNNEST(GENERATE_ARRAY(1, IF(k > n - k, n - k, k), 1)) i
));
-- Calculate the probability of k successes in n trials where the probability of a success is p.
-- This uses the probability density function for the binomial distribution.
-- See example below.
CREATE OR REPLACE FUNCTION functions.binomial_prob(n INT64, k INT64, p FLOAT64) AS (
functions.binomial_coef(n, k) * POW(p, k) * POW(1-p, n-k)
);
-- The example from Wikipedia: https://en.wikipedia.org/wiki/Binomial_distribution#Example
-- Probability of getting 4 heads in 6 tosses with a coin that is biased and has heads probability 0.3
-- Calculates to 0.05953499999999999
SELECT functions.binomial_prob(6, 4, 0.3);
@benman1
Copy link

benman1 commented Aug 12, 2022

This only works with small numbers. Taking (16000, 8000, 0.5) as argument gives me Floating point error in function: EXP(11085.3)

@benman1
Copy link

benman1 commented Aug 12, 2022

(300, 150, 0.5) gives me 0.0460275144190249

@dlebech
Copy link
Author

dlebech commented Aug 13, 2022

@benman1 thanks for the comment, and nicely spotted. My original use case didn't involve n larger than 20, so I never tested it for very large trial sizes n 😅

The error occurs in the binomial coefficient calculation due to the way the product is calculated over multiple rows. For large trial sizes, the coefficient becomes incredibly high anyway and BigQuery might not be able to calculate it -- even with a different method. For example, the binomial coefficient for 300 and 150 (your second example) is already 9.37 * 10^88 [1].

I can't provide a compelling argument against using binomial probability for larger-than-1000 trial sizes since I'm not a math person, but I can conclude that the current calculation probably won't be feasible to use in any programming language for larger trial sizes -- it needs to be rewritten.

If you have a suggestion on how to get around this, I'm very curious to learn more about it though!

[1] 93,759,702,772,827,452,793,193,754,439,064,084,879,232,655,700,081,358,920,472,352,730,000,000,000,000,000,000,000,000 according to ref

@benman1
Copy link

benman1 commented Aug 13, 2022

Thanks for getting back. I liked the idea, btw. I tried another implementation, which also doesn't work.

My use case is a bit different from this actually - I needed a two sample test. I've implemented this in javascript (based on a Python implementation) and then added an approximation of the normal cdf to get the p-values that I found. I don't think I can share this though (corporate).

@dlebech
Copy link
Author

dlebech commented Aug 15, 2022

Aha, that makes sense @benman1 I'm happy to hear it worked out for you with a different solution 🙂

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment