@CMCDragonkai
Last active August 7, 2019 06:26
Calculate population variance from subpopulation variance #python

Population Variance and Subpopulation Variance

  • A population is the entire set of values.
  • A sample is a subset of the population.
  • A sample value is a single observed value from the sample or population.
  • The population mean is sometimes called the "grand mean".
  • μ = E[X] - notation for mean
  • σ = sqrt(E[(X-μ)^2]) - notation for standard deviation
  • Var(X) = σ^2 = E[(X-μ)^2] - notation for variance
  • σ^2 - notation for population variance
  • s^2 - notation for sample variance
  • The total variation is what is needed to calculate the population variance.
  • The total variation (a.k.a. Sum of Squares Total/SST/TSS) is sum[(x - pop_mean)^2].
  • The total variation is equal to the between group variation plus the within group variation.
  • The between group variation is sum[group_size*((group_mean - pop_mean)^2)], summed over the groups.
  • The within group variation is sum[(group_size - 1) * group_var_sample], summed over the groups, where group_var_sample is the sample variance of the group.
  • The population variance is the total variation divided by the population count.
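
The bullets above can be collected into one worked equation. This is only a sketch of the notation: the symbols n_j (group size), \bar{x}_j (group mean), s_j^2 (group sample variance), k (number of groups) and N (total count) are introduced here for illustration and do not appear in the original notes. The numbers use the two groups {1, 2, 3, 4} and {5, 6}.

```latex
\begin{aligned}
\mathrm{SST} &= \underbrace{\sum_{j=1}^{k} n_j \,(\bar{x}_j - \mu)^2}_{\text{between group}}
              + \underbrace{\sum_{j=1}^{k} (n_j - 1)\, s_j^2}_{\text{within group}},
\qquad N = \sum_{j=1}^{k} n_j \\
\sigma^2 &= \frac{\mathrm{SST}}{N}, \qquad s^2 = \frac{\mathrm{SST}}{N - 1} \\[6pt]
\mu &= 3.5, \quad \bar{x}_1 = 2.5, \quad \bar{x}_2 = 5.5, \quad s_1^2 = \tfrac{5}{3}, \quad s_2^2 = \tfrac{1}{2} \\
\text{between} &= 4\,(2.5 - 3.5)^2 + 2\,(5.5 - 3.5)^2 = 4 + 8 = 12 \\
\text{within} &= 3 \cdot \tfrac{5}{3} + 1 \cdot \tfrac{1}{2} = 5.5 \\
\mathrm{SST} &= 12 + 5.5 = 17.5, \qquad
\sigma^2 = \tfrac{17.5}{6} \approx 2.917, \qquad
s^2 = \tfrac{17.5}{5} = 3.5
\end{aligned}
```
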
import numpy as np
# you can calculate the variance of the population from the size, mean and variance of each subpopulation
# the total variation is the sum of the "between group variation" and the "within group variation"
# the population variance is the total variation divided by the total number of sample values
groups = [np.array([1,2,3,4]), np.array([5,6])]
pop_mean = (1 + 2 + 3 + 4 + 5 + 6) / 6
pop_var = (
    sum([group.size * ((group.mean() - pop_mean)**2) for group in groups]) +
    sum([(group.size - 1) * group.var(ddof=1) for group in groups])
) / sum([group.size for group in groups])
pop_var_sample = (
    sum([group.size * ((group.mean() - pop_mean)**2) for group in groups]) +
    sum([(group.size - 1) * group.var(ddof=1) for group in groups])
) / (sum([group.size for group in groups]) - 1)
pop_std = np.sqrt(pop_var)
print(pop_var)
print(pop_var_sample)
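print(np.var([1,2,3,4,5,6])) # population variance (ddof=0), should match pop_var up to rounding
print(np.var([1,2,3,4,5,6], ddof=1)) # sample variance (ddof=1), should match pop_var_sample up to rounding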
# the above equation can also be used for online variance calculations (when combined with a cumulative moving average)
# just think of there being only 2 groups: the prior group, which is everything seen so far,
# and the successor group, which is the new batch of data
# for different sized batches (like a list of images) this can be more efficient than
# updating one value at a time with Welford's algorithm, because working on whole groups
# is more amenable to vectorised operations over each image
# be careful: the group size for images can be the number of pixels (not the number of images),
# in which case you would be calculating the pixel variance
# more info: https://stats.stackexchange.com/q/10441/198729
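def moving_mean_batch(avg_old, count_old, batch):
    # update a running mean with a whole batch of new values
    sum_old = avg_old * count_old
    sum_new = batch.sum()  # sums every element, also for multi-dimensional batches
    return (sum_old + sum_new) / (count_old + batch.size)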
def online_variance(groups):
    # running statistics of everything seen so far (the "prior group")
    # note: each group needs at least 2 values, since group.var(ddof=1) is undefined for a single value
    pop_mean = 0
    pop_count = 0
    pop_var = 0
    pop_var_sample = 0
    for group in groups:
        pop_mean_new = moving_mean_batch(pop_mean, pop_count, group)
        pop_count_new = pop_count + group.size
        # both the prior group and the new batch are measured against the updated mean
        between_variation = (pop_count * ((pop_mean - pop_mean_new)**2) +
                             group.size * ((group.mean() - pop_mean_new)**2))
        # (pop_count - 1) * pop_var_sample recovers the prior group's total variation
        within_variation = ((pop_count - 1) * pop_var_sample + (group.size - 1) * group.var(ddof=1))
        total_variation = between_variation + within_variation
        pop_var_new = total_variation / pop_count_new
        pop_var_sample_new = total_variation / (pop_count_new - 1)
        pop_mean = pop_mean_new
        pop_count = pop_count_new
        pop_var = pop_var_new
        pop_var_sample = pop_var_sample_new
    return pop_var
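
The two-group update described in the comments can also be written in terms of summary statistics alone. The sketch below is a minimal illustration and not the gist author's code: `batch_stats` and `merge_stats` are hypothetical names, and each group is summarised as (count, mean, total variation). Carrying the total variation instead of the sample variance avoids dividing by n - 1, so single-value batches should work too. This is the pairwise update usually attributed to Chan, Golub and LeVeque.

```python
import numpy as np

def batch_stats(batch):
    """Summarise a batch as (count, mean, total variation about its own mean)."""
    batch = np.asarray(batch, dtype=float)
    return batch.size, batch.mean(), ((batch - batch.mean()) ** 2).sum()

def merge_stats(a, b):
    """Combine two (count, mean, total variation) summaries into one."""
    n_a, mean_a, tv_a = a
    n_b, mean_b, tv_b = b
    n = n_a + n_b
    if n == 0:
        return 0, 0.0, 0.0
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    # within variation (tv_a + tv_b) plus the between variation, which for
    # two groups simplifies to delta**2 * n_a * n_b / n
    tv = tv_a + tv_b + delta ** 2 * n_a * n_b / n
    return n, mean, tv

# batches of different sizes, including a single-value batch
batches = [np.array([1, 2, 3, 4]), np.array([5, 6]), np.array([7])]
running = (0, 0.0, 0.0)  # empty prior group
for batch in batches:
    running = merge_stats(running, batch_stats(batch))

count, mean, total_variation = running
pop_var = total_variation / count
pop_var_sample = total_variation / (count - 1)
print(pop_var, np.var(np.concatenate(batches)))
print(pop_var_sample, np.var(np.concatenate(batches), ddof=1))
```

Each printed pair should agree up to floating point rounding. Only the three summary values need to be kept between batches, so the raw data of earlier batches can be discarded.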
@CMCDragonkai

Note that "pooled variance" is something quite different.

Put simply, the pooled variance is an (unbiased) estimate of the variance within each sample, under the assumption/constraint that those variances are equal.

It does not estimate the variance of a new "meta-sample" formed by concatenating the two individual samples, like you supposed. As you have already discovered, estimating that requires a completely different formula.
https://stats.stackexchange.com/a/302860/198729
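
To make the distinction concrete, here is a small sketch using the same two groups as the gist. The variable names are illustrative only, and the pooled variance is computed with the usual textbook form sum((n_i - 1) * s_i^2) / (N - k), where k is the number of groups.

```python
import numpy as np

groups = [np.array([1, 2, 3, 4]), np.array([5, 6])]
total_count = sum(g.size for g in groups)

# pooled variance: weighted average of the within-group sample variances,
# it ignores how far apart the group means are
within_variation = sum((g.size - 1) * g.var(ddof=1) for g in groups)
pooled_var = within_variation / (total_count - len(groups))

# sample variance of the concatenated "meta-sample":
# it also includes the between group variation
concat_var_sample = np.concatenate(groups).var(ddof=1)

print(pooled_var)         # within-group estimate only
print(concat_var_sample)  # larger here, because the group means differ
```

With these numbers the pooled variance is 1.375 while the sample variance of the concatenated data is 3.5; the gap is the between group variation that the pooled estimate leaves out.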