@ruggeri (April 3, 2018)

unit normal = normal(mean = 0, stddev = 1). What does stddev mean?

Mean is the expected value of the random variable: mean = \Int p(x) * x dx. It's not necessarily the most likely value; that would be the mode.

Let's say you sample from the normal distribution. What is the expected value of the squared error between the mean and the sample? In other words: how good is the mean as a predictor of samples drawn from the normal distribution?

The worse the mean is at predicting, the more "random" the normal distribution is. The "wider" the dispersion.

Standard deviation is a metric of how dispersed a distribution is.

variance = expected value of (a sample - the mean)^2
stddev = sqrt(expected value of (a sample - the mean)^2) = sqrt(variance)
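A quick NumPy sketch of these definitions (my addition; the seed and sample size are arbitrary):

```python
import numpy as np

# Estimate mean, variance, and stddev of a unit normal straight from
# the definitions above.
rng = np.random.default_rng(seed=0)
samples = rng.normal(loc=0.0, scale=1.0, size=100_000)  # unit normal

mean = samples.mean()
variance = ((samples - mean) ** 2).mean()  # E[(sample - mean)^2]
stddev = np.sqrt(variance)

print(mean, variance, stddev)  # ~0.0, ~1.0, ~1.0
```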

"Sample distribution" If you have 100 students, and they each take a test, the "sample mean" is the mean score on their hundred tests. If we run a medical experiment on 100 patients, and 60 out of 100 get better, that doesn't mean that if we roll out to the entire population, 60% of people will get better. It could be 65% or 55%. Prolly not 10% though. "Sample mean" and "Population mean".

"Sample std deviation": We take the sqrt of the average of (student score - sample mean score)^2 It approximates (AKA "estimates") the "population std deviation."

For two variables with mean = 0.0: what is the mean of the sum of the two values? Zero. For two variables with standard deviation = 1.0: what is the standard deviation of the sum of the two values? Let's assume that the mean of both variables is 0.0, and that the variables are independent.

E[dice roll] = Sum_{i} P(rolling an i) * i

Var(x + y)

E[ ((x+y) - 0.0)^2 ]

E[ (x+y)^2 ] = E[ x^2 + 2xy + y^2 ]

E[ x^2 ] + 2E[xy] + E[y^2]

Var(x) + 2E[xy] + Var(y)

/// E[xy] = \Int_{x, y} p(x, y) * xy = \Int_{x, y} p(x) * p(y) * xy (using independence: p(x, y) = p(x) * p(y))

\Int_{x, y} (p(x) * x) * (p(y) * y)

\Int_x [ (p(x) * x) * \Int_y (p(y) * y) ]

The inner integral is \Int_y (p(y) * y) = E[y] = 0.0, so the whole thing is 0.

///

Var(x + y)

Var(x) + 2E[xy] + Var(y)

Var(x) + Var(y)

Add variables together, and the dispersion gets wider. Uncertainties combine. The variance of the sum of two independent, mean-zero variables is the sum of the variances.
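We can check the derivation by simulation (my addition; the seed and sample count are arbitrary):

```python
import numpy as np

# Two independent, mean-zero, variance-one variables: the E[xy] cross
# term vanishes, and Var(x + y) = Var(x) + Var(y) = 2.
rng = np.random.default_rng(seed=2)
x = rng.normal(0.0, 1.0, size=1_000_000)
y = rng.normal(0.0, 1.0, size=1_000_000)

print((x * y).mean())  # ~0.0: the E[xy] cross term
print((x + y).var())   # ~2.0 = Var(x) + Var(y)
```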

The mean of the sum of two dice rolls is 7: the sum of the means of each. The mean of the sum of two (dice rolls minus 3.5) is 0. I'm trying to "mean normalize" the dice rolls.

variance of one (dice roll minus 3.5): E[ ((dice roll - 3.5) - 0.0)^2 ] = E[ (dice roll - mean)^2 ] (sort of like the MSE of using the mean as your prediction)

\Sum_{i = 1}^6 P(i) * ((i - 3.5) - 0.0)^2

\Sum_{i = 1}^6 1/6 * (i - 3.5)^2

1/6 * ((-2.5)^2 + (-1.5)^2 + (-0.5)^2 + 0.5^2 + 1.5^2 + 2.5^2)

1/6 * (6.25 + 2.25 + 0.25 + 0.25 + 2.25 + 6.25) = 17.5/6 = 35/12 ≈ 2.92

variance of the sum of two (dice roll minus 3.5): var(x + y) = var(x) + var(y)

~5.83 (= 2 * 35/12 = 35/6)
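Simulating the mean-normalized dice (a sketch; the seed and roll count are arbitrary) to check these numbers:

```python
import numpy as np

# Mean-normalized dice: Var(one die - 3.5) = 35/12 ~ 2.92, and the
# variance of the sum of two should be ~5.83.
rng = np.random.default_rng(seed=3)
d1 = rng.integers(1, 7, size=1_000_000) - 3.5  # integers(1, 7) -> 1..6
d2 = rng.integers(1, 7, size=1_000_000) - 3.5

print(d1.var())         # ~2.92
print((d1 + d2).var())  # ~5.83
```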

var(x) = var(y) = var(z) = 1. What is var((x + y) + z)?

var(x + y) = 2

var((x + y) + z) = var(x + y) + var(z) = 3.0

// Let's return to our 784 pixels, each sampled independently from a unit normal distribution (snow). The mean of (the sum of all pixels in a particular image) is 0. The variance of the sum of the pixels is 784.

Let's remember how h1 is calculated: h1_j = sigmoid(\Sum_{i = 1}^{784} W_{i, j} * x_i)

Let's assume that every W_{i, j} initially has the literal value 1.0. (This isn't true in practice, but let's just say.)

The weighted sum inside the sigmoid formula for h1_j has variance 784 (Var(\Sum_{i = 1}^{784} W_{i, j} * x_i) = 784), so most of the time the sigmoid input will land in the very insensitive (saturated) range. This makes it seem like there's no point in changing W_{i, j}: changes to the input of the sigmoid have almost no effect on h1_j, and the only point of changing W_{i, j} would be to change h1_j in some beneficial way.
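A sketch of the saturation problem (the 10,000 "snow" images and the 0.01/0.99 cutoffs are my arbitrary choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# 784 unit-normal pixels per image, all weights fixed at 1.0.
rng = np.random.default_rng(seed=4)
x = rng.normal(0.0, 1.0, size=(10_000, 784))  # 10,000 "snow" images
z = x.sum(axis=1)                             # weighted sum with W_{i, j} = 1.0

print(z.var())  # ~784: the pre-sigmoid variance blows up

# Most z values sit far from 0, so the sigmoid outputs pile up near 0 or
# 1, where its derivative sigmoid(z) * (1 - sigmoid(z)) is nearly zero.
h = sigmoid(z)
print(np.mean((h < 0.01) | (h > 0.99)))  # large fraction saturated (~0.87)
```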

In gradient descent, the change in a weight is proportional to the partial derivative of the loss with respect to that weight. So weights that impact the cross-entropy (CE) loss more get more of a change each iteration.

For this reason, we want to keep the input to the sigmoid at variance one. That keeps the initial value of z1_j (the weighted sum inside the sigmoid) FAIRLY close to zero, AKA in the sensitive zone.

One way to do that is to set all of the W_{i, j} initially to sqrt(1/784). That will set the variance = 1.0

Var(3x) = 9 * Var(x), because Var(3x) = E[ (3x - 3 * mean)^2 ] = E[ 9 * (x - mean)^2 ] = 9 * E[ (x - mean)^2 ].

h1_j = sigmoid(\Sum_{i = 1}^{784} [W_{i, j} / sqrt(784)] * x_i). That's why we set the standard deviation of the initial W values to 1/sqrt(784): to cancel out the effect of the increased variance from summing the x_i values.

Xavier initialization.
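A minimal sketch of this initialization scheme (fan_out = 128 is an assumed hidden-layer size, not from the notes):

```python
import numpy as np

# Draw the initial weights with stddev 1/sqrt(fan_in) so the weighted
# sum feeding the sigmoid keeps variance ~1, staying in the sensitive zone.
rng = np.random.default_rng(seed=5)
fan_in, fan_out = 784, 128

W = rng.normal(0.0, 1.0 / np.sqrt(fan_in), size=(fan_in, fan_out))

x = rng.normal(0.0, 1.0, size=(10_000, fan_in))  # unit-normal "pixel" inputs
z = x @ W                                        # pre-sigmoid weighted sums

print(z.var())  # ~1.0, instead of ~784 with W_{i, j} = 1.0
```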
