CMCDragonkai/README.md

## README.md

      
### Population Variance and Subpopulation Variance


A population is the entire set of values.
A sample is a subset of the population.
A sample value is a single observed value from the sample or population.
The population mean is sometimes called the "grand mean".
μ = E[X] - notation for the mean
σ = sqrt(E[(X-μ)^2]) - notation for the standard deviation
Var(X) = σ^2 = E[(X-μ)^2] - notation for the variance
σ^2 - notation for the population variance (divide by the count N)
s^2 - notation for the sample variance (divide by N - 1)
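
As a quick illustration of the notation above, here is a minimal NumPy sketch. The array `x` is made-up example data; `ddof=0` (NumPy's default) corresponds to the population variance σ^2 and `ddof=1` to the sample variance s^2.

```python
import numpy as np

# made-up example data
x = np.array([1, 2, 3, 4, 5, 6])

mu = x.mean()            # μ = E[X]
sigma2 = np.var(x)       # σ^2 = E[(X - μ)^2], divides by N (ddof=0)
sigma = np.std(x)        # σ = sqrt(σ^2)
s2 = np.var(x, ddof=1)   # s^2, divides by N - 1 (ddof=1)

print(mu, sigma2, sigma, s2)
```

The only difference between `sigma2` and `s2` is the divisor, so `s2 == sigma2 * N / (N - 1)` up to rounding.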
The total variation is what is needed to calculate the population variance.
The total variation (a.k.a. Sum of Squares Total/SST/TSS) is sum[(x - pop_mean)^2] over every sample value x.
The total variation is equal to the between group variation plus the within group variation.
The between group variation is sum[group_size * ((group_mean - pop_mean)^2)] over all groups.
The within group variation is sum[(group_size - 1) * group_var_sample] over all groups, where group_var_sample is the sample variance of the group.
The population variance is the total variation divided by the population count.
The sample variance is the total variation divided by (population count - 1).
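
Written out, the decomposition above reads as follows. The symbols are assumptions introduced here for illustration and are not in the original notes: there are k groups, group j has size n_j, mean \bar{x}_j and sample variance s_j^2, N is the sum of the n_j, and μ is the population mean.

```latex
\begin{aligned}
\mathrm{SST} &= \sum_{j=1}^{k} \sum_{i=1}^{n_j} (x_{ij} - \mu)^2
  = \underbrace{\sum_{j=1}^{k} n_j (\bar{x}_j - \mu)^2}_{\text{between group variation}}
  + \underbrace{\sum_{j=1}^{k} (n_j - 1)\, s_j^2}_{\text{within group variation}} \\
\sigma^2 &= \frac{\mathrm{SST}}{N}, \qquad s^2 = \frac{\mathrm{SST}}{N - 1}
\end{aligned}
```

For the groups {1, 2, 3, 4} and {5, 6}: μ = 3.5, the between group variation is 4(2.5 - 3.5)^2 + 2(5.5 - 3.5)^2 = 12, the within group variation is 3(5/3) + 1(0.5) = 5.5, so SST = 17.5 and σ^2 = 17.5 / 6 ≈ 2.917.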


## pop_var_from_subpop_var.py
import numpy as np

# you can calculate the variance of the population from the size, mean and variance of each subpopulation
# the total variation is the sum of the "between group variation" and the "within group variation"
# the population variance is the total variation divided by the total number of sample values

groups = [np.array([1, 2, 3, 4]), np.array([5, 6])]
pop_count = sum(group.size for group in groups)
# the population mean is the size-weighted average of the group means
pop_mean = sum(group.size * group.mean() for group in groups) / pop_count
between_variation = sum(group.size * ((group.mean() - pop_mean)**2) for group in groups)
within_variation = sum((group.size - 1) * group.var(ddof=1) for group in groups)
total_variation = between_variation + within_variation
pop_var = total_variation / pop_count               # divide by N
pop_var_sample = total_variation / (pop_count - 1)  # divide by N - 1
pop_std = np.sqrt(pop_var)

print(pop_var)
print(pop_var_sample)
print(np.var([1,2,3,4,5,6])) # population variance, should match pop_var
print(np.var([1,2,3,4,5,6], ddof=1)) # sample variance, should match pop_var_sample

# it is possible to use the above equation for online variance calculations (when combined with a cumulative moving average)
# just think of there being only 2 groups: the prior group, which is everything seen so far,
# and the successor group, which is the new batch of data
# for different sized batches (like a list of images) the above equation
# can be more efficient than applying Welford's algorithm one value at a time, since
# the mean and variance of each group can be computed with vectorised operations
# be careful: the group size for images can be the number of pixels (not the number of images)
# in such a case you would be calculating the pixel variance

# more info: https://stats.stackexchange.com/q/10441/198729

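def moving_mean_batch(avg_old, count_old, batch):
    # cumulative moving average, updated with a whole batch at once
    sum_old = avg_old * count_old
    # batch.sum() sums every element, the builtin sum() only reduces the first axis
    sum_new = batch.sum()
    return (sum_old + sum_new) / (count_old + batch.size)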

def online_variance(groups):
    # running statistics of the "prior group" (everything seen so far)
    pop_mean = 0.0
    pop_count = 0
    pop_var = 0.0
    for group in groups:
        # merge the prior group and the new batch as if they were 2 subpopulations
        pop_mean_new = moving_mean_batch(pop_mean, pop_count, group)
        pop_count_new = pop_count + group.size
        between_variation = (pop_count * ((pop_mean - pop_mean_new)**2) +
                             group.size * ((group.mean() - pop_mean_new)**2))
        # n * var(ddof=0) equals (n - 1) * var(ddof=1), but it stays finite for
        # groups of size 1, where the sample variance is undefined (nan)
        within_variation = pop_count * pop_var + group.size * group.var()
        total_variation = between_variation + within_variation
        pop_mean = pop_mean_new
        pop_count = pop_count_new
        pop_var = total_variation / pop_count_new
    # for the sample variance, multiply by pop_count / (pop_count - 1)
    return pop_var
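
The "two groups" idea described in the comments above can also be sketched as a single merge step that is applied once per batch. This is only an illustrative sketch, not the code above: `merge_stats` is a hypothetical helper, the "images" are random arrays made up for the example, and every pixel is treated as one sample value, so the result is the pixel variance.

```python
import numpy as np

def merge_stats(count_a, mean_a, var_a, count_b, mean_b, var_b):
    """Combine (count, mean, population variance) of two disjoint groups."""
    count = count_a + count_b
    mean = (count_a * mean_a + count_b * mean_b) / count
    # between group variation: each group mean against the combined mean
    between = count_a * (mean_a - mean) ** 2 + count_b * (mean_b - mean) ** 2
    # within group variation: count * population variance of each group
    within = count_a * var_a + count_b * var_b
    return count, mean, (between + within) / count

rng = np.random.default_rng(0)
# hypothetical "images" of different shapes, including a single-pixel one
images = [rng.random((4, 5)), rng.random((3, 3)), rng.random((1, 1))]

# start from an empty prior group and fold in one image at a time
count, mean, var = 0, 0.0, 0.0
for image in images:
    count, mean, var = merge_stats(count, mean, var,
                                   image.size, image.mean(), image.var())

# reference values computed from all pixels at once
all_pixels = np.concatenate([image.ravel() for image in images])
print(np.isclose(mean, all_pixels.mean()), np.isclose(var, all_pixels.var()))
```

Only the size, mean and variance of each image are needed, and each of those is a vectorised NumPy reduction, so no per-pixel Python loop is required.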