Created February 12, 2018 11:38
Important ML formulas

Popular activation functions:

  • tanh
tanh(x) = sinh(x)/cosh(x) = ( e^x - e^-x )/( e^x + e^-x ) 
  • Sigmoid
S(x) = 1/(1 + e^-x) = e^x/(e^x + 1)
  • ReLU
f(x) = max(0,x)
  • Noisy ReLU
f(x) = max(0, x + Y), Y ∼ N(0, σ(x))

Y is zero-mean Gaussian noise drawn from N(0, σ(x)).
  • Leaky ReLU
f(x) = x if x>0 
       0.01x otherwise

f(x) = max(x, 0.01x)

  • Parametric ReLU
f(x) = x if x>0 
       ax otherwise

f(x) = max(x, ax) (valid when a ≤ 1; a is a learned parameter)

  • ELU
f(x) = x if x>0 
       a(e^x − 1) otherwise
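
The activation functions above can be sketched in plain Python for scalar inputs. This is a minimal illustration, not a library implementation. The function names and default arguments are assumptions, and `noisy_relu` uses a fixed noise scale `sigma` where the notes let σ depend on x. ELU is written piecewise, because max(x, a(e^x − 1)) would pick the exponential branch for positive x.

```python
import math
import random


def tanh(x):
    # (e^x - e^-x) / (e^x + e^-x)
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))


def sigmoid(x):
    # 1 / (1 + e^-x)
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    return max(0.0, x)


def noisy_relu(x, sigma=0.1):
    # Assumption: constant noise scale sigma instead of sigma(x).
    return max(0.0, x + random.gauss(0.0, sigma))


def leaky_relu(x, slope=0.01):
    return x if x > 0 else slope * x


def prelu(x, a):
    # Same shape as leaky ReLU, but the slope a is a learned parameter.
    return x if x > 0 else a * x


def elu(x, a=1.0):
    return x if x > 0 else a * (math.exp(x) - 1.0)
```

For vectors or tensors the same formulas would be applied elementwise.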


Popular optimization techniques:

  • Gradient Descent
θ = θ − η⋅∇_θ J(θ)
  • Stochastic Gradient Descent
θ = θ − η⋅∇_θ J(θ; x^(i); y^(i))
  • Mini-batch gradient descent
θ = θ − η⋅∇_θ J(θ; x^(i:i+n); y^(i:i+n))
  • SGD + Momentum
v_t = γ v_{t−1} + η ∇_θ J(θ)
θ = θ − v_t
  • Nesterov accelerated gradient
v_t = γ v_{t−1} + η ∇_θ J(θ − γ v_{t−1})
θ = θ − v_t
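  • Adagrad
g_{t,i} = ∇_θ J(θ_{t,i})

Per-parameter SGD update, for comparison:
θ_{t+1,i} = θ_{t,i} − η⋅g_{t,i}

Adagrad update, where G_t is the diagonal matrix of summed squared past gradients:
θ_{t+1,i} = θ_{t,i} − η/√(G_{t,ii} + ϵ)⋅g_{t,i}

Vectorized form:
θ_{t+1} = θ_t − η/√(G_t + ϵ) ⊙ g_t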
  • Adadelta
E[g^2]_t = γ E[g^2]_{t−1} + (1−γ) g_t^2
Δθ_t = −η⋅g_{t,i}
θ_{t+1} = θ_t + Δθ_t
Δθ_t = −(η/√(E[g^2]_t + ϵ)) g_t

E[Δθ^2]_t = γ E[Δθ^2]_{t−1} + (1−γ) Δθ_t^2
RMS[Δθ]_t = √(E[Δθ^2]_t + ϵ)
Δθ_t = −(RMS[Δθ]_{t−1} / RMS[g]_t) g_t
θ_{t+1} = θ_t + Δθ_t
  • Adam
m_t = β1 m_{t−1} + (1−β1) g_t
v_t = β2 v_{t−1} + (1−β2) g_t^2

m̂_t = m_t / (1 − β1^t)
v̂_t = v_t / (1 − β2^t)

θ_{t+1} = θ_t − (η/(√v̂_t + ϵ)) m̂_t
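
Several of the update rules above can be tried on a toy problem. The sketch below assumes a one-dimensional objective J(θ) = (θ − 3)^2, chosen only for illustration, so there is no training data and the SGD and mini-batch variants reduce to plain gradient descent. The function names, step counts, and hyperparameter defaults are assumptions, not recommended settings.

```python
import math


def grad(theta):
    # Gradient of the hypothetical toy objective J(theta) = (theta - 3)^2.
    return 2.0 * (theta - 3.0)


def gradient_descent(theta, steps=100, lr=0.1):
    for _ in range(steps):
        theta -= lr * grad(theta)
    return theta


def momentum(theta, steps=500, lr=0.1, gamma=0.9):
    v = 0.0
    for _ in range(steps):
        v = gamma * v + lr * grad(theta)
        theta -= v
    return theta


def nesterov(theta, steps=500, lr=0.1, gamma=0.9):
    v = 0.0
    for _ in range(steps):
        # Gradient is evaluated at the look-ahead point theta - gamma * v.
        v = gamma * v + lr * grad(theta - gamma * v)
        theta -= v
    return theta


def adagrad(theta, steps=1000, lr=0.5, eps=1e-8):
    G = 0.0  # running sum of squared gradients
    for _ in range(steps):
        g = grad(theta)
        G += g * g
        theta -= lr / math.sqrt(G + eps) * g
    return theta


def adam(theta, steps=2000, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
    m = 0.0  # first-moment estimate
    v = 0.0  # second-moment estimate
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)  # bias correction
        v_hat = v / (1.0 - beta2 ** t)
        theta -= lr / (math.sqrt(v_hat) + eps) * m_hat
    return theta
```

Starting from θ = 0, each function should move θ close to the minimizer at 3. How fast that happens depends on the assumed learning rates.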


Popular Regularization functions

  • L1
J(w) = ∑_i (y^(i) − f(x^(i)))^2 + a ∑_j |w_j|
  • L2
J(w) = ∑_i (y^(i) − f(x^(i)))^2 + a ∑_j w_j^2
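
As a small worked example, the two penalized costs can be computed for a linear model. The sketch assumes f(x) = w·x with no bias term, which the notes do not specify. The function names and the toy data in the usage note are made up for illustration.

```python
def squared_error(w, xs, ys):
    # Sum over samples of (y - f(x))^2, assuming a linear model f(x) = w . x.
    total = 0.0
    for x, y in zip(xs, ys):
        prediction = sum(wj * xj for wj, xj in zip(w, x))
        total += (y - prediction) ** 2
    return total


def l1_cost(w, xs, ys, a):
    # Data term plus a * sum of absolute weights.
    return squared_error(w, xs, ys) + a * sum(abs(wj) for wj in w)


def l2_cost(w, xs, ys, a):
    # Data term plus a * sum of squared weights.
    return squared_error(w, xs, ys) + a * sum(wj * wj for wj in w)
```

With w = [1.0, -2.0], xs = [[1.0, 0.0], [0.0, 1.0]], ys = [1.0, 0.0] and a = 0.5, the data term is 4, so the L1 cost is 4 + 0.5⋅3 = 5.5 and the L2 cost is 4 + 0.5⋅5 = 6.5.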

Popular Cost Functions

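  • Quadratic Cost

aka mean squared error, maximum likelihood, and sum squared error.

C_MST(W, B, S^r, E^r) = 0.5 ∑_j (a^L_j − E^r_j)^2

∇_a C_MST = (a^L − E^r)

Here a^L_j is the activation of output neuron j in the last layer L, and E^r_j is the desired output for training sample r.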
  • Cross-entropy cost

aka Bernoulli negative log-likelihood and Binary Cross-Entropy

C_CE(W, B, S^r, E^r) = −∑_j [E^r_j ln(a^L_j) + (1 − E^r_j) ln(1 − a^L_j)]

∇_a C_CE = (a^L − E^r) / ((1 − a^L)(a^L))
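
Both costs and their gradients with respect to the output activations can be sketched as below. The function names are assumptions. Here `a` stands for the output activations a^L and `e` for the targets E^r of a single sample, both as plain lists. The cross-entropy version assumes every activation lies strictly between 0 and 1.

```python
import math


def quadratic_cost(a, e):
    # 0.5 * sum_j (a_j - e_j)^2
    return 0.5 * sum((aj - ej) ** 2 for aj, ej in zip(a, e))


def quadratic_grad(a, e):
    # Gradient with respect to a: (a - e), elementwise.
    return [aj - ej for aj, ej in zip(a, e)]


def cross_entropy_cost(a, e):
    # -sum_j [e_j ln(a_j) + (1 - e_j) ln(1 - a_j)], assuming 0 < a_j < 1.
    return -sum(
        ej * math.log(aj) + (1.0 - ej) * math.log(1.0 - aj)
        for aj, ej in zip(a, e)
    )


def cross_entropy_grad(a, e):
    # Gradient with respect to a: (a - e) / ((1 - a) * a), elementwise.
    return [(aj - ej) / ((1.0 - aj) * aj) for aj, ej in zip(a, e)]
```

A finite-difference check on either cost is a quick way to confirm that the gradient formula matches.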

Core Layers

  • Dense
  • Conv1D
  • Conv2D
  • Pooling
  • Stride
  • Embedding
  • Recurrent Neural Network
  • LSTM
  • GRU
  • Locally connected layer


