Important ML formulas

Popular activation functions:

  • tanh
tanh(x) = sinh(x)/cosh(x) = ( e^x - e^-x )/( e^x + e^-x ) 
  • Sigmoid
S(x) = 1/(1 + e^-x) = e^x/(e^x + 1)
  • ReLU
f(x) = max(0,x)
  • Noisy ReLU
f(x) = max(0, x + Y), Y ∼ N(0, σ(x))

where N denotes Gaussian noise
  • Leaky ReLU
f(x) = x if x>0 
       0.01x otherwise

or

f(x) = max(x, 0.01x)

  • Parametric ReLU
f(x) = x if x>0 
       ax otherwise
or

f(x) = max(x, ax)

  • ELU
f(x) = x if x > 0
       a(e^x − 1) otherwise

Ref: https://en.wikipedia.org/wiki/Activation_function
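A minimal NumPy sketch of the activations above; the helper names and the default α values are illustrative, not taken from any particular library.

```python
import numpy as np

def tanh(x):
    return np.tanh(x)                                  # (e^x - e^-x) / (e^x + e^-x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))                    # 1 / (1 + e^-x)

def relu(x):
    return np.maximum(0.0, x)                          # max(0, x)

def noisy_relu(x, sigma=0.1, rng=np.random.default_rng(0)):
    # max(0, x + Y) with Y ~ N(0, sigma); a constant sigma stands in for sigma(x)
    return np.maximum(0.0, x + rng.normal(0.0, sigma, size=np.shape(x)))

def leaky_relu(x):
    return np.where(x > 0, x, 0.01 * x)                # slope fixed at 0.01 for x <= 0

def parametric_relu(x, a):
    return np.where(x > 0, x, a * x)                   # a is a learned parameter

def elu(x, a=1.0):
    return np.where(x > 0, x, a * (np.exp(x) - 1.0))   # smooth for x <= 0
```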

Popular optimization techniques:

  • Gradient Descent
θ = θ − η⋅∇θJ(θ)
  • Stochastic Gradient Descent
θ = θ − η⋅∇θJ(θ; x^(i); y^(i))
  • Mini-batch gradient descent
θ = θ − η⋅∇θJ(θ; x^(i:i+n); y^(i:i+n))
  • SGD + Momentum
v_t = γ⋅v_(t−1) + η⋅∇θJ(θ)
θ = θ − v_t
  • Nesterov accelerated gradient
v_t = γ⋅v_(t−1) + η⋅∇θJ(θ − γ⋅v_(t−1))
θ = θ − v_t
  • Adagrad
g_(t,i) = ∇θJ(θ_(t,i))

Per-parameter SGD update:
θ_(t+1,i) = θ_(t,i) − η⋅g_(t,i)

Adagrad scales η by the accumulated squared gradients G_(t,ii):
θ_(t+1,i) = θ_(t,i) − (η/√(G_(t,ii) + ϵ))⋅g_(t,i)

Vectorized, with G_t the diagonal matrix of accumulated squared gradients:
θ_(t+1) = θ_t − (η/√(G_t + ϵ)) ⊙ g_t
  • Adadelta
E[g^2]_t = γ⋅E[g^2]_(t−1) + (1−γ)⋅g_t^2
Δθ_t = −η⋅g_(t,i)
θ_(t+1) = θ_t + Δθ_t
Δθ_t = −(η/√(E[g^2]_t + ϵ))⋅g_t

E[Δθ^2]_t = γ⋅E[Δθ^2]_(t−1) + (1−γ)⋅Δθ_t^2
RMS[Δθ]_t = √(E[Δθ^2]_t + ϵ)
Δθ_t = −(RMS[Δθ]_(t−1)/RMS[g]_t)⋅g_t
θ_(t+1) = θ_t + Δθ_t
  • Adam
m_t = β1⋅m_(t−1) + (1−β1)⋅g_t
v_t = β2⋅v_(t−1) + (1−β2)⋅g_t^2

m̂_t = m_t/(1 − β1^t)
v̂_t = v_t/(1 − β2^t)

θ_(t+1) = θ_t − (η/(√v̂_t + ϵ))⋅m̂_t

Ref: http://ruder.io/optimizing-gradient-descent/
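A minimal NumPy sketch of two of the update rules above, SGD with momentum and Adam; the function names and hyperparameter defaults are illustrative, not from a particular library.

```python
import numpy as np

def sgd_momentum_step(theta, v, grad, eta=0.01, gamma=0.9):
    # v_t = gamma*v_(t-1) + eta*grad ; theta = theta - v_t
    v = gamma * v + eta * grad
    return theta - v, v

def adam_step(theta, m, v, grad, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # first/second moment estimates with bias correction (t starts at 1)
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Example: minimize J(theta) = theta^2 with Adam
theta, m, v = np.array([5.0]), np.zeros(1), np.zeros(1)
for t in range(1, 501):
    grad = 2 * theta                      # dJ/dtheta
    theta, m, v = adam_step(theta, m, v, grad, t, eta=0.1)
```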

Popular Regularization functions

  • L1
J(w) = ∑_i (y_i − f(x_i))^2 + α⋅∑_j |w_j|
  • L2
J(w) = ∑_i (y_i − f(x_i))^2 + α⋅∑_j w_j^2
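A minimal NumPy sketch of both regularized losses; α is the regularization strength and the names are illustrative.

```python
import numpy as np

def l1_loss(y, y_pred, w, alpha=0.01):
    # squared error + alpha * sum_j |w_j|
    return np.sum((y - y_pred) ** 2) + alpha * np.sum(np.abs(w))

def l2_loss(y, y_pred, w, alpha=0.01):
    # squared error + alpha * sum_j w_j^2
    return np.sum((y - y_pred) ** 2) + alpha * np.sum(w ** 2)
```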

Popular Cost Functions

  • Quadratic Cost

aka mean squared error, maximum likelihood, and sum squared error.

C_MST(W, B, S^r, E^r) = 0.5⋅∑_j (a_j^L − E_j^r)^2

∇_a C_MST = (a^L − E^r)
  • Cross-entropy cost

aka Bernoulli negative log-likelihood and Binary Cross-Entropy

C_CE(W, B, S^r, E^r) = −∑_j [ E_j^r⋅ln(a_j^L) + (1 − E_j^r)⋅ln(1 − a_j^L) ]

∇_a C_CE = (a^L − E^r)/((1 − a^L)⋅a^L)
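A minimal NumPy sketch of both costs for a single example, with `a` the output-layer activations and `e` the expected outputs; the small ϵ clamp is added only to keep the log finite.

```python
import numpy as np

def quadratic_cost(a, e):
    # 0.5 * sum_j (a_j - e_j)^2
    return 0.5 * np.sum((a - e) ** 2)

def cross_entropy_cost(a, e, eps=1e-12):
    # -sum_j [ e_j*ln(a_j) + (1 - e_j)*ln(1 - a_j) ]
    a = np.clip(a, eps, 1 - eps)          # avoid log(0)
    return -np.sum(e * np.log(a) + (1 - e) * np.log(1 - a))
```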

Core Layers

  • Dense
  • Conv1D
  • Conv2D
  • Pooling
  • Stride
  • Embedding
  • Recurrent Neural Network
  • LSTM
  • GRU
  • Locally connected layer
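A minimal sketch wiring several of these layers into one model, assuming the Keras Sequential API; the layer sizes and the toy task are arbitrary placeholders.

```python
from tensorflow.keras import layers, models

# Toy sequence-classification model touching several of the layers listed above
model = models.Sequential([
    layers.Embedding(input_dim=10000, output_dim=64),                 # Embedding
    layers.Conv1D(32, kernel_size=3, strides=1, activation="relu"),   # Conv1D; stride is a parameter here
    layers.MaxPooling1D(pool_size=2),                                 # Pooling
    layers.LSTM(32),                                                  # LSTM (GRU / SimpleRNN are drop-in swaps)
    layers.Dense(1, activation="sigmoid"),                            # Dense
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```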

TBC.
