@misho-kr
Last active January 8, 2018 09:24
Summary of "Neural Networks for Machine Learning" course at Coursera.Org

Learn about artificial neural networks and how they're being used for machine learning, as applied to speech and object recognition, image segmentation, modeling language and human motion, etc. We'll emphasize both the basic algorithms and the practical tricks needed to get them to work well.

This course contains the same content presented on Coursera beginning in 2013. It is not a continuation or update of the original course. It has been adapted for the new platform.

Taught by:

Geoffrey Hinton, Professor, Department of Computer Science, University of Toronto

Lecture 1: Introduction

  • What is Machine Learning?
    • We don’t know what program to write or it is very complicated
    • Instead we collect lots of examples that specify the correct output for a given input
    • Recognizing patterns and anomalies, prediction, classification
  • What are neural networks?
    • Very different style from sequential computation
    • A typical cortical neuron has one axon that branches, a dendritic tree that collects input from other neurons, and an axon hillock that generates outgoing spikes
    • Neurons are slow but they are small, low-power and they adapt using locally available signals
    • The brain has about 10^11 neurons, each with about 10^4 weights
    • Different bits of the cortex do different things, yet cortex is made of general purpose stuff that has the ability to turn into special purpose hardware in response to experience
  • Some simple models of neurons
    • Linear -- simple but computationally limited: y = b + sum_i( x_i * w_i )
    • Binary threshold: z = b + sum_i( x_i * w_i ), y = ( z > theta ) ? 1 : 0
    • Rectified Linear (aka linear threshold): z = b + sum_i( x_i * w_i ), y = ( z > 0 ) ? z : 0
    • Sigmoid: z = b + sum_i( x_i * w_i ), y = 1 / ( 1 + e^-z )
    • Stochastic binary: These use the same equations as logistic units but they treat the output of the logistic as the probability of producing a spike in a short time window
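The four deterministic neuron models above can be sketched in a few lines of plain Python (an illustrative sketch; the function names are my own):

```python
import math

def weighted_sum(x, w, b):
    # z = b + sum_i( x_i * w_i ), shared by all four models
    return b + sum(xi * wi for xi, wi in zip(x, w))

def linear(x, w, b):
    # Output is the weighted input itself
    return weighted_sum(x, w, b)

def binary_threshold(x, w, b, theta=0.0):
    # Emit a 1 (spike) only if z crosses the threshold
    return 1 if weighted_sum(x, w, b) > theta else 0

def rectified_linear(x, w, b):
    # Linear above zero, silent below
    z = weighted_sum(x, w, b)
    return z if z > 0 else 0.0

def sigmoid(x, w, b):
    # Smooth, bounded output in (0, 1)
    return 1.0 / (1.0 + math.exp(-weighted_sum(x, w, b)))
```

For example, sigmoid([0.0], [1.0], 0.0) gives 0.5, since z = 0 there.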
  • A simple example of learning
    • Neural network with 2 layers of neurons -- the top layer represents known shapes, the bottom layer represents pixel intensities
    • A pixel gets to vote if it has ink on it, each inked pixel can vote for several different shapes; the shape that gets the most votes wins
    • Show the network an image and increment/decrement the weights from active pixels to the correct/incorrect class
    • This simple learning algorithm is insufficient: a 2-layer network is equivalent to having a rigid template for each shape, and the ways in which hand-written digits vary are much too complicated to capture with templates
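Even though it is too weak for real digits, the voting scheme itself is easy to state in code (a toy sketch; the two shape labels and the 4-pixel "images" are invented for illustration):

```python
# Toy pixel-voting network: weights[shape][pixel] holds one weight
# per (shape, pixel) pair.

def predict(weights, image):
    # Every inked pixel votes for each shape with its current weight;
    # the shape with the most votes wins.
    votes = {shape: sum(w * px for w, px in zip(ws, image))
             for shape, ws in weights.items()}
    return max(votes, key=votes.get)

def train_step(weights, image, target):
    guess = predict(weights, image)
    if guess != target:
        for i, px in enumerate(image):
            if px:  # only weights from active (inked) pixels change
                weights[target][i] += 1  # increment toward correct class
                weights[guess][i] -= 1   # decrement from incorrect class
```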
  • Three types of learning
    • Supervised – learn to predict an output when given an input vector
      • Regression: the target output is a real number or a whole vector of real numbers
      • Classification: the target output is a class label
      • We start by choosing a model-class: y = f(x, W) and learn by adjusting the parameters to reduce the discrepancy between the target output t and the actual output y on each training case
    • Reinforcement – learn to select an action to maximize payoff
      • The output is an action or sequence of actions and the only supervisory signal is an occasional scalar reward
      • The goal in selecting each action is to maximize the expected sum of the future rewards
      • We usually use a discount factor for delayed rewards so that we don’t have to look too far into the future
    • Unsupervised – discover a good internal representation of the input
      • For ~40 years it was largely ignored
      • Many thought that clustering was the only form of unsupervised learning
      • One major aim is to create an internal representation of the input that is useful for subsequent supervised or reinforcement learning
      • It provides a compact, low-dimensional representation of the input
      • It provides an economical high-dimensional representation of the input in terms of learned features
      • It finds sensible clusters in the input
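The discount factor mentioned under reinforcement learning can be made concrete (a minimal sketch; the reward list and gamma value are illustrative):

```python
def discounted_return(rewards, gamma=0.9):
    # Sum of future rewards, each discounted by gamma per step of delay,
    # so rewards far in the future contribute almost nothing.
    return sum(gamma ** t * r for t, r in enumerate(rewards))
```

With gamma = 0.5, the reward sequence [1, 1, 1] gives 1 + 0.5 + 0.25 = 1.75.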

Lecture 2: The Perceptron learning procedure

  • Feed-forward neural networks
    • The first layer is input, the last one is output
    • If there is more than one hidden layer we call them deep neural networks
    • The activities of the neurons in each layer are a non-linear function of the activities in the layer below
  • Recurrent neural networks (RNN)
    • These have directed cycles in their connection graph
    • They have the ability to remember information in their hidden state for a long time
    • Very hard to train them to use that potential
  • Symmetrically connected networks
    • Like RNN, but the connections between units are symmetrical (they have the same weight in both directions)
    • John Hopfield (and others) realized that symmetric networks are much easier to analyze than recurrent networks
    • They are also more restricted in what they can do, because they obey an energy function
    • Hopfield nets -- symmetrically connected networks without hidden units
    • Boltzmann machines -- symmetrically connected networks with hidden units; more powerful than Hopfield nets, easier to train than RNNs
  • Perceptrons -- the first generation of neural networks
    • Popularized by Frank Rosenblatt in the early 1960’s
    • Minsky and Papert published a book in 1969 called “Perceptrons” that analysed what they could do and showed their limitations
  • Binary threshold neurons (decision units), McCulloch-Pitts (1943)
    • A bias is exactly equivalent to a weight on an extra input line that always has an activity of 1; learn a bias as if it were a weight
    • Convergence procedure -- training binary output neurons as classifiers
    • Pick training cases using any policy that ensures that every training case will keep getting picked:
      • If the output unit is correct, leave its weights alone
      • If the output unit incorrectly outputs a zero, add the input vector to the weight vector
      • If the output unit incorrectly outputs a 1, subtract the input vector from the weight vector
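The three-case procedure above, combined with the bias-as-extra-weight trick, fits in a short sketch (illustrative; the epoch count and the cycle-through-all-cases picking policy are my own choices):

```python
def train_perceptron(cases, epochs=10):
    """Perceptron convergence procedure on (input_vector, target) pairs.

    The bias is learned as a weight on an extra input that is always 1.
    """
    n = len(cases[0][0])
    w = [0.0] * (n + 1)  # last slot is the bias weight
    for _ in range(epochs):
        for x, t in cases:
            xb = list(x) + [1.0]  # append the always-on input
            y = 1 if sum(wi * xi for wi, xi in zip(w, xb)) > 0 else 0
            if y == t:
                continue                              # correct: leave weights alone
            if t == 1:                                # wrongly output 0:
                w = [wi + xi for wi, xi in zip(w, xb)]  # add the input vector
            else:                                     # wrongly output 1:
                w = [wi - xi for wi, xi in zip(w, xb)]  # subtract the input vector
    return w
```

Trained on the AND function (which is linearly separable), it finds a separating weight vector within a few epochs.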
  • A geometrical view of perceptrons
    • Weight-space has one dimension per weight
    • A point in the space represents a particular setting of all the weights
    • Each training case can be represented as a hyperplane through the origin, assuming that we have eliminated the threshold
    • The weights must lie on one side of this hyper-plane to get the answer correct
  • Why the learning works
    • Hopeful claim: Every time the perceptron makes a mistake, the learning algorithm moves the current weight vector closer to all feasible weight vectors
    • Problem case: a feasible weight vector can lie right on the boundary of the feasible region, so an update may not move the current vector closer to it; the fix is to consider "generously feasible" vectors, which classify every case correctly by at least the length of the input vector
    • Every time the perceptron makes a mistake, the squared distance to all of these generously feasible weight vectors is always decreased by at least the squared length of the update vector
    • So after a finite number of mistakes, the weight vector must lie in the feasible region if this region exists
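The informal argument can be written out (a sketch; here w* is any generously feasible vector, the mistake shown is on a positive case so that w · x ≤ 0 while w* · x ≥ ||x||², and the update is w ← w + x):

```latex
\begin{aligned}
\|w^* - (w+x)\|^2
  &= \|w^* - w\|^2 - 2\,x \cdot (w^* - w) + \|x\|^2 \\
  &= \|w^* - w\|^2 - 2\,w^* \cdot x + 2\,w \cdot x + \|x\|^2 \\
  &\le \|w^* - w\|^2 - 2\|x\|^2 + 0 + \|x\|^2
   = \|w^* - w\|^2 - \|x\|^2 .
\end{aligned}
```

So every mistake shrinks the squared distance to each generously feasible vector by at least the squared length of the input, which is what bounds the total number of mistakes.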
  • What perceptrons can’t do
    • If you are allowed to choose the features by hand and if you use enough features, you can do almost anything
    • But once the hand-coded features have been determined, there are very strong limitations on what a perceptron can learn
    • A binary threshold output unit cannot even tell if two single bit features are the same
    • Minsky and Papert’s “Group Invariance Theorem” says that the part of a Perceptron that learns cannot learn to do this if the transformations form a group
    • Translations with wrap-around form a group
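The "two single bit features" claim can be probed with a brute-force search (a sketch; it checks only small integer weights, so coming up empty is consistent with the theorem rather than a proof of it):

```python
from itertools import product

def separable(targets, grid=range(-3, 4)):
    """Brute-force search for integer weights (w1, w2, b) such that a
    binary threshold unit reproduces the given 2-bit truth table."""
    cases = [(0, 0), (0, 1), (1, 0), (1, 1)]
    for w1, w2, b in product(grid, repeat=3):
        out = [1 if w1 * x1 + w2 * x2 + b > 0 else 0 for x1, x2 in cases]
        if out == targets:
            return True
    return False
```

AND (targets [0, 0, 0, 1]) is found immediately, while "the two bits are equal" (targets [1, 0, 0, 1]) has no solution anywhere in the grid.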
  • Learning with hidden units
    • Networks without hidden units are very limited in the input-output mappings they can learn to model
    • More layers of linear units do not help; the network is still linear
    • Fixed output non-linearities are not enough
    • We need multiple layers of adaptive, non-linear hidden units
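The "more layers of linear units do not help" point can be checked directly: composing two linear layers yields a single linear layer (a sketch with made-up 2x2 matrices):

```python
# Two linear layers compose into one: W2 (W1 x) = (W2 W1) x.
# Plain-Python matrix helpers; the matrices and input are illustrative.

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

W1 = [[1, 2], [3, 4]]
W2 = [[0, 1], [1, 1]]
x = [5, -2]

two_layers = matvec(W2, matvec(W1, x))   # apply W1, then W2
one_layer = matvec(matmul(W2, W1), x)    # apply the single product matrix
assert two_layers == one_layer           # no extra representational power
```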

Lecture 3: The backpropagation learning procedure

Lecture 4: Learning feature vectors for words

Lecture 5: Object recognition with neural nets

Lecture 6: Optimization: How to make the learning go faster

Lecture 7: Recurrent neural networks

Lecture 8: More recurrent neural networks

Lecture 9: Ways to make neural networks generalize better

Lecture 10: Combining multiple neural networks to improve generalization

Lecture 11: Hopfield nets and Boltzmann machines

Lecture 12: Restricted Boltzmann machines (RBMs)

Lecture 13: Stacking RBMs to make Deep Belief Nets

Lecture 14: Deep neural nets with generative pre-training

Lecture 15: Modeling hierarchical structure with neural nets

Lecture 16: Recent applications of deep neural nets (optional videos)
