- A population is the entire set of values.
- A sample is a subset of the population.
- A sample value is a discrete observed value of the sample or population.
- The population mean is sometimes called the "grand mean".
- `μ = E[X]` - notation for mean
- `σ = sqrt(E[(X-μ)^2])` - notation for standard deviation
- `Var(X) = σ^2 = E[(X-μ)^2]` - notation for variance
- `σ^2` - notation for population variance
- `s^2` - notation for sample variance
- The total variation is what is needed to calculate the population variance.
- The total variation (a.k.a. Sum of Squares Total/SST/TSS) is `sum[(x - pop_mean)^2]`.
- The total variation is equal to the between-group variation plus the within-group variation.
- The between-group variation is `sum[group_size*((group_mean - pop_mean)^2)]`.
- The within-group variation is `sum[(group_size - 1) * group_var_sample]`.
- The population variance is the total variation divided by the population count.
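As a quick numerical check of the decomposition above, the following sketch computes the three quantities for two small made-up groups and confirms that the between-group and within-group variation add up to the total variation. The group values and variable names are illustrative assumptions, not part of the notes.

```python
import numpy as np

# Two illustrative groups (made-up values); together they form the population.
groups = [np.array([1.0, 2.0, 3.0, 4.0]), np.array([5.0, 6.0])]
population = np.concatenate(groups)
pop_mean = population.mean()

# Total variation (SST): squared deviation of every value from the population mean.
total_variation = np.sum((population - pop_mean) ** 2)

# Between-group variation: squared deviation of each group mean, weighted by group size.
between_variation = sum(g.size * (g.mean() - pop_mean) ** 2 for g in groups)

# Within-group variation: (n - 1) * sample variance is each group's own sum of squares.
within_variation = sum((g.size - 1) * g.var(ddof=1) for g in groups)

# The two printed values on each line should agree up to floating-point error.
print(total_variation, between_variation + within_variation)
print(total_variation / population.size, np.var(population))
```

Dividing the total variation by the population count gives the population variance; dividing by the count minus one gives the sample variance.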
Last active
August 7, 2019 06:26
Calculate population variance from subpopulation variance #python
import numpy as np

# you can calculate the variance of the population based on the variance of each subpopulation
# the population variation is the sum of the "between group variation" and the "within group variation"
# the population variance is the population variation divided by the total number of sample values

groups = [np.array([1, 2, 3, 4]), np.array([5, 6])]

pop_mean = (1 + 2 + 3 + 4 + 5 + 6) / 6

# population variance: total variation divided by the population count
pop_var = (
    sum([group.size * ((group.mean() - pop_mean)**2) for group in groups]) +
    sum([(group.size - 1) * group.var(ddof=1) for group in groups])
) / sum([group.size for group in groups])

# sample variance: total variation divided by (population count - 1)
pop_var_sample = (
    sum([group.size * ((group.mean() - pop_mean)**2) for group in groups]) +
    sum([(group.size - 1) * group.var(ddof=1) for group in groups])
) / (sum([group.size for group in groups]) - 1)

pop_std = np.sqrt(pop_var)

print(pop_var)
print(pop_var_sample)
print(np.var([1, 2, 3, 4, 5, 6]))  # variance
print(np.var([1, 2, 3, 4, 5, 6], ddof=1))  # sample variance

# it is possible to use the above equation for online variance calculations (when combined with a cumulative moving average)
# just think that there are only 2 groups: the prior group, which is everything before now,
# and the successor group, which is the new batch of data
# in fact for different sized batches (like a list of images) using the above
# equation is far more efficient than using Welford's algorithm, because
# working with groups is more amenable to vectorised operations on each image
# be careful, as the group size for images can be the number of pixels (not the number of images)
# in such a case you would be calculating the pixel variance
# more info: https://stats.stackexchange.com/q/10441/198729

def moving_mean_batch(avg_old, count_old, batch):
    # cumulative moving average updated with a whole batch at once
    sum_old = avg_old * count_old
    sum_new = batch.sum()
    return (sum_old + sum_new) / (count_old + batch.size)


def online_variance(groups):
    # note: every group needs at least 2 values, otherwise group.var(ddof=1) is nan
    pop_mean = 0
    pop_count = 0
    pop_var = 0
    pop_var_sample = 0
    for group in groups:
        pop_mean_new = moving_mean_batch(pop_mean, pop_count, group)
        pop_count_new = pop_count + group.size
        # 2 groups: everything seen so far, and the new batch
        between_variation = (pop_count * ((pop_mean - pop_mean_new)**2) +
                             group.size * ((group.mean() - pop_mean_new)**2))
        within_variation = ((pop_count - 1) * pop_var_sample +
                            (group.size - 1) * group.var(ddof=1))
        total_variation = between_variation + within_variation
        pop_var_new = total_variation / pop_count_new
        pop_var_sample_new = total_variation / (pop_count_new - 1)
        pop_mean = pop_mean_new
        pop_count = pop_count_new
        pop_var = pop_var_new
        pop_var_sample = pop_var_sample_new
    return pop_var
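
The two-group view described in the comments can also be written as a merge of running summaries. The sketch below is not the gist's `online_variance`; it assumes a hypothetical `merge_summary` helper that carries a `(count, mean, total_variation)` triple and folds in one batch at a time. Carrying the total variation directly, rather than the sample variance, avoids dividing by `count - 1` on every step, so single-value batches also work. All names here are illustrative assumptions.

```python
import numpy as np


def merge_summary(summary, batch):
    """Fold a non-empty batch into a running (count, mean, total_variation) summary.

    Hypothetical helper for illustration: the prior data is treated as one
    group and the new batch as the other.
    """
    count_old, mean_old, variation_old = summary
    batch = np.asarray(batch, dtype=float).ravel()
    count_new = count_old + batch.size
    mean_new = (count_old * mean_old + batch.sum()) / count_new

    # Between-group variation of the two groups around the merged mean.
    between = (count_old * (mean_old - mean_new) ** 2
               + batch.size * (batch.mean() - mean_new) ** 2)
    # Within-group variation: the prior sum of squares plus the batch's own.
    within = variation_old + np.sum((batch - batch.mean()) ** 2)
    return count_new, mean_new, between + within


summary = (0, 0.0, 0.0)  # empty prior: no values seen yet
for batch in [np.array([1, 2, 3, 4]), np.array([5]), np.array([6])]:
    summary = merge_summary(summary, batch)

count, mean, variation = summary
pop_var = variation / count               # population variance
pop_var_sample = variation / (count - 1)  # sample variance
print(pop_var, pop_var_sample)
```

The printed values can be compared against `np.var([1, 2, 3, 4, 5, 6])` and `np.var([1, 2, 3, 4, 5, 6], ddof=1)`.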
Note that "pooled variance" is something quite different.
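
To make the difference concrete: the pooled variance, as it is usually defined, averages the within-group sample variances weighted by their degrees of freedom, and it ignores how far apart the group means are. The sketch below uses two small illustrative groups (an assumption for the example) and compares the pooled variance with the sample variance of the combined data.

```python
import numpy as np

# Two illustrative groups with clearly different means.
groups = [np.array([1.0, 2.0, 3.0, 4.0]), np.array([5.0, 6.0])]

# Pooled variance: within-group variation divided by (total count - number of groups).
within_variation = sum((g.size - 1) * g.var(ddof=1) for g in groups)
degrees_of_freedom = sum(g.size - 1 for g in groups)
pooled_var = within_variation / degrees_of_freedom

# Sample variance of all values combined; this also includes between-group variation.
combined_var = np.var(np.concatenate(groups), ddof=1)

print(pooled_var)    # within-group spread only
print(combined_var)  # larger here because the group means differ
```

The two only coincide in special cases; whenever the group means differ noticeably, the combined variance is larger because of the between-group term.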