@misho-kr
Last active January 8, 2018 09:24
Summary of "Neural Networks for Machine Learning" course at Coursera.Org

Learn about artificial neural networks and how they're being used for machine learning, as applied to speech and object recognition, image segmentation, modeling language and human motion, etc. We'll emphasize both the basic algorithms and the practical tricks needed to get them to work well.

This course contains the same content presented on Coursera beginning in 2013. It is not a continuation or update of the original course. It has been adapted for the new platform.

Taught by:

Geoffrey Hinton, Professor, Department of Computer Science, University of Toronto

Lecture 1: Introduction

  • What is Machine Learning?
    • We don’t know what program to write or it is very complicated
    • Instead we collect lots of examples that specify the correct output for a given input
    • Recognizing patterns and anomalies, prediction, classification
  • What are neural networks?
    • Very different style from sequential computation
    • A typical cortical neuron has one axon that branches, a dendritic tree that collects input from other neurons, and an axon hillock that generates outgoing spikes
    • Neurons are slow but they are small, low-power and they adapt using locally available signals
    • The brain has about 10^11 neurons, each with about 10^4 weights
    • Different bits of the cortex do different things, yet cortex is made of general purpose stuff that has the ability to turn into special purpose hardware in response to experience
  • Some simple models of neurons
    • Linear -- simple but computationally limited: y = b + sum_i( x_i * w_i )
    • Binary threshold: z = b + sum_i( x_i * w_i ), y = ( z > theta ) ? 1 : 0
    • Rectified Linear (aka linear threshold): z = b + sum_i( x_i * w_i ), y = ( z > 0 ) ? z : 0
    • Sigmoid: z = b + sum_i( x_i * w_i ), y = 1 / ( 1 + e^-z )
    • Stochastic binary: These use the same equations as logistic units but they treat the output of the logistic as the probability of producing a spike in a short time window
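The four deterministic neuron models above can be sketched in a few lines of plain Python (an illustrative sketch; the function names are my own):

```python
import math

def weighted_sum(x, w, b):
    # z = b + sum_i( x_i * w_i ), shared by all four models
    return b + sum(xi * wi for xi, wi in zip(x, w))

def linear(x, w, b):
    # Output is the weighted input itself
    return weighted_sum(x, w, b)

def binary_threshold(x, w, b, theta=0.0):
    # Emit a 1 (spike) only if z crosses the threshold
    return 1 if weighted_sum(x, w, b) > theta else 0

def rectified_linear(x, w, b):
    # Linear above zero, silent below
    z = weighted_sum(x, w, b)
    return z if z > 0 else 0.0

def sigmoid(x, w, b):
    # Smooth, bounded output in (0, 1)
    return 1.0 / (1.0 + math.exp(-weighted_sum(x, w, b)))
```

For example, sigmoid([0.0], [1.0], 0.0) gives 0.5, since z = 0 there.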
  • A simple example of learning
    • Neural network with 2 layers of neurons -- the top layer represents known shapes, the bottom layer represents pixel intensities
    • A pixel gets to vote if it has ink on it, each inked pixel can vote for several different shapes; the shape that gets the most votes wins
    • Show the network an image and increment/decrement the weights from active pixels to the correct/incorrect class
    • This simple learning algorithm is insufficient: a 2-layer network is equivalent to having a rigid template for each shape, and the ways in which hand-written digits vary are much too complicated to capture with templates
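Even though it is too weak for real digits, the voting scheme itself is easy to state in code (a toy sketch; the two shape labels and the 4-pixel "images" are invented for illustration):

```python
# Toy pixel-voting network: weights[shape][pixel] holds one weight
# per (shape, pixel) pair.

def predict(weights, image):
    # Every inked pixel votes for each shape with its current weight;
    # the shape with the most votes wins.
    votes = {shape: sum(w * px for w, px in zip(ws, image))
             for shape, ws in weights.items()}
    return max(votes, key=votes.get)

def train_step(weights, image, target):
    guess = predict(weights, image)
    if guess != target:
        for i, px in enumerate(image):
            if px:  # only weights from active (inked) pixels change
                weights[target][i] += 1  # increment toward correct class
                weights[guess][i] -= 1   # decrement from incorrect class
```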
  • Three types of learning
    • Supervised – learn to predict an output when given an input vector
      • Regression: the target output is a real number or a whole vector of real numbers
      • Classification: the target output is a class label
      • We start by choosing a model-class: y = f(x, W) and learn by adjusting the parameters to reduce the discrepancy between the target output t and the actual output y on each training case
    • Reinforcement – learn to select an action to maximize payoff
      • The output is an action or sequence of actions and the only supervisory signal is an occasional scalar reward
      • The goal in selecting each action is to maximize the expected sum of the future rewards
      • We usually use a discount factor for delayed rewards so that we don’t have to look too far into the future
    • Unsupervised – discover a good internal representation of the input
      • For ~40 years it was largely ignored
      • Many thought that clustering was the only form of unsupervised learning
      • One major aim is to create an internal representation of the input that is useful for subsequent supervised or reinforcement learning
      • It provides a compact, low-dimensional representation of the input
      • It provides an economical high-dimensional representation of the input in terms of learned features
      • It finds sensible clusters in the input
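The discount factor mentioned under reinforcement learning can be made concrete (a minimal sketch; the reward list and gamma value are illustrative):

```python
def discounted_return(rewards, gamma=0.9):
    # Sum of future rewards, each discounted by gamma per step of delay,
    # so rewards far in the future contribute almost nothing.
    return sum(gamma ** t * r for t, r in enumerate(rewards))
```

With gamma = 0.5, the reward sequence [1, 1, 1] gives 1 + 0.5 + 0.25 = 1.75.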

Lecture 2: The Perceptron learning procedure

  • Feed-forward neural networks
    • The first layer is input, the last one is output
    • If there is more than one hidden layer we call them deep neural networks
    • The activities of the neurons in each layer are a non-linear function of the activities in the layer below
  • Recurrent neural networks (RNN)
    • These have directed cycles in their connection graph
    • They have the ability to remember information in their hidden state for a long time
    • Very hard to train them to use that potential
  • Symmetrically connected networks
    • Like RNN, but the connections between units are symmetrical (they have the same weight in both directions)
    • John Hopfield (and others) realized that symmetric networks are much easier to analyze than recurrent networks
    • They are also more restricted in what they can do, because they obey an energy function
    • Hopfield nets -- symmetrically connected networks without hidden units
    • Boltzmann machines -- symmetrically connected networks with hidden units; more powerful than Hopfield nets, easier to train than RNNs
  • Perceptrons -- the first generation of neural networks
    • Popularized by Frank Rosenblatt in the early 1960’s
    • Minsky and Papert published a book in 1969 called “Perceptrons” that analysed what they could do and showed their limitations
  • Binary threshold neurons (decision units), McCulloch-Pitts (1943)
    • A bias is exactly equivalent to a weight on an extra input line that always has an activity of 1; learn a bias as if it were a weight
    • Convergence procedure -- training binary output neurons as classifiers
    • Pick training cases using any policy that ensures that every training case will keep getting picked:
      • If the output unit is correct, leave its weights alone
      • If the output unit incorrectly outputs a zero, add the input vector to the weight vector
      • If the output unit incorrectly outputs a 1, subtract the input vector from the weight vector
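The three-case procedure above, combined with the bias-as-extra-weight trick, fits in a short sketch (illustrative; the epoch count and the cycle-through-all-cases picking policy are my own choices):

```python
def train_perceptron(cases, epochs=10):
    """Perceptron convergence procedure on (input_vector, target) pairs.

    The bias is learned as a weight on an extra input that is always 1.
    """
    n = len(cases[0][0])
    w = [0.0] * (n + 1)  # last slot is the bias weight
    for _ in range(epochs):
        for x, t in cases:
            xb = list(x) + [1.0]  # append the always-on input
            y = 1 if sum(wi * xi for wi, xi in zip(w, xb)) > 0 else 0
            if y == t:
                continue                              # correct: leave weights alone
            if t == 1:                                # wrongly output 0:
                w = [wi + xi for wi, xi in zip(w, xb)]  # add the input vector
            else:                                     # wrongly output 1:
                w = [wi - xi for wi, xi in zip(w, xb)]  # subtract the input vector
    return w
```

Trained on the AND function (which is linearly separable), it finds a separating weight vector within a few epochs.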
  • A geometrical view of perceptrons
    • Weight-space has one dimension per weight
    • A point in the space represents a particular setting of all the weights
    • Each training case can be represented as a hyperplane through the origin, assuming that we have eliminated the threshold
    • The weights must lie on one side of this hyper-plane to get the answer correct
  • Why the learning works
    • Hopeful claim: Every time the perceptron makes a mistake, the learning algorithm moves the current weight vector closer to all feasible weight vectors
    • Problem case: a feasible weight vector can lie right on the boundary of the feasible region, so an update may not move the current vector closer to it; the fix is to consider "generously feasible" vectors, which classify every case correctly by at least the length of the input vector
    • Every time the perceptron makes a mistake, the squared distance to all of these generously feasible weight vectors is always decreased by at least the squared length of the update vector
    • So after a finite number of mistakes, the weight vector must lie in the feasible region if this region exists
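The informal argument can be written out (a sketch; here w* is any generously feasible vector, the mistake shown is on a positive case so that w · x ≤ 0 while w* · x ≥ ||x||², and the update is w ← w + x):

```latex
\begin{aligned}
\|w^* - (w+x)\|^2
  &= \|w^* - w\|^2 - 2\,x \cdot (w^* - w) + \|x\|^2 \\
  &= \|w^* - w\|^2 - 2\,w^* \cdot x + 2\,w \cdot x + \|x\|^2 \\
  &\le \|w^* - w\|^2 - 2\|x\|^2 + 0 + \|x\|^2
   = \|w^* - w\|^2 - \|x\|^2 .
\end{aligned}
```

So every mistake shrinks the squared distance to each generously feasible vector by at least the squared length of the input, which is what bounds the total number of mistakes.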
  • What perceptrons can’t do
    • If you are allowed to choose the features by hand and if you use enough features, you can do almost anything
    • But once the hand-coded features have been determined, there are very strong limitations on what a perceptron can learn
    • A binary threshold output unit cannot even tell if two single bit features are the same
    • Minsky and Papert’s “Group Invariance Theorem” says that the part of a Perceptron that learns cannot learn to do this if the transformations form a group
    • Translations with wrap-around form a group
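The "two single bit features" claim can be probed with a brute-force search (a sketch; it checks only small integer weights, so coming up empty is consistent with the theorem rather than a proof of it):

```python
from itertools import product

def separable(targets, grid=range(-3, 4)):
    """Brute-force search for integer weights (w1, w2, b) such that a
    binary threshold unit reproduces the given 2-bit truth table."""
    cases = [(0, 0), (0, 1), (1, 0), (1, 1)]
    for w1, w2, b in product(grid, repeat=3):
        out = [1 if w1 * x1 + w2 * x2 + b > 0 else 0 for x1, x2 in cases]
        if out == targets:
            return True
    return False
```

AND (targets [0, 0, 0, 1]) is found immediately, while "the two bits are equal" (targets [1, 0, 0, 1]) has no solution anywhere in the grid.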
  • Learning with hidden units
    • Networks without hidden units are very limited in the input-output mappings they can learn to model
    • More layers of linear units do not help; the network is still linear
    • Fixed output non-linearities are not enough
    • We need multiple layers of adaptive, non-linear hidden units
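The "more layers of linear units do not help" point can be checked directly: composing two linear layers yields a single linear layer (a sketch with made-up 2x2 matrices):

```python
# Two linear layers compose into one: W2 (W1 x) = (W2 W1) x.
# Plain-Python matrix helpers; the matrices and input are illustrative.

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

W1 = [[1, 2], [3, 4]]
W2 = [[0, 1], [1, 1]]
x = [5, -2]

two_layers = matvec(W2, matvec(W1, x))   # apply W1, then W2
one_layer = matvec(matmul(W2, W1), x)    # apply the single product matrix
assert two_layers == one_layer           # no extra representational power
```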

Lecture 3: The backpropagation learning procedure

Lecture 4: Learning feature vectors for words

Lecture 5: Object recognition with neural nets

Lecture 6: Optimization: How to make the learning go faster

Lecture 7: Recurrent neural networks

Lecture 8: More recurrent neural networks

Lecture 9: Ways to make neural networks generalize better

Lecture 10: Combining multiple neural networks to improve generalization

Lecture 11: Hopfield nets and Boltzmann machines

Lecture 12: Restricted Boltzmann machines (RBMs)

Lecture 13: Stacking RBMs to make Deep Belief Nets

Lecture 14: Deep neural nets with generative pre-training

Lecture 15: Modeling hierarchical structure with neural nets

Lecture 16: Recent applications of deep neural nets (optional videos)
