sgoyal1012/AI Nano- Deep Learning.md

## AI Nano- Deep Learning.md

      
Lesson 8- Intro to TensorFlow

Lesson 9- Autoencoders


What is an autoencoder?

Learns a compressed representation of data without any human intervention
Cons: poor at compression and at generalizing to other datasets
Pros: dimensionality reduction and image denoising


A simple autoencoder

Just compresses data, for example images from the MNIST database.
 - Also need a corresponding decoder to reconstruct the image back.
Using TensorFlow to create a simple autoencoder


Convolutional Autoencoder

Why is it better?
Encoder goes from a larger image to a smaller image (using max pooling layers etc)
For decoding- use transpose convolutions, upsampling.

The checkerboard effect with transposed convolution


Making the network learn how to denoise images by providing input as noisy images and output as noise-free images
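
A minimal sketch of how the noisy/clean training pairs could be built. The random array stands in for real images, and `noise_factor` is an assumed value, not one taken from the lesson.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a batch of images scaled to [0, 1]; real MNIST data would go here.
clean = rng.random((4, 28, 28))

noise_factor = 0.5  # assumed value, tune for the dataset
noisy = clean + noise_factor * rng.standard_normal(clean.shape)
noisy = np.clip(noisy, 0.0, 1.0)  # keep pixel values in the valid range

# The autoencoder is trained on (input=noisy, target=clean) pairs.
inputs, targets = noisy, clean
```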


All Gates

Learn Gate

Takes the short term memory and the event and combines it; and then keeps the important part of it.


Lesson 10- Recurrent Neural Networks


The motivation- ordered sequences!

Better with time series data

Eg. stock price, we have to do supervised learning with sequences


Natural language Processing
Language Translation


Vanilla supervised learners

Make no assumption about the structure of the input data, but for sequences the order matters!
Images and video have structure, relations within image.


Modelling sequential data

What is an ordered sequence?

Indexing values by timestamp (the order in which they appeared)
A product of some underlying process/processes

Eg. temperature/stock prices


Model the sequence recursively

Model future values based on the past


Simple recursive examples

Example: odd numbers, or anything that can be expressed as a function of its predecessors
The seed is the first element(s), eg. for Fibonacci it is 1 and 1.
The order is the number of previous elements an element depends upon, eg. for the odd number sequence it is 1, for Fibonacci it is 2.
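
The seed and order ideas can be sketched in a few lines. The helper name `generate` is made up for illustration.

```python
def generate(seed, step, n):
    """Return the first n terms of a recursive sequence.

    seed: list of initial values; its length is the order of the sequence.
    step: function taking the last `order` values and returning the next one.
    """
    values = list(seed)
    order = len(seed)
    while len(values) < n:
        values.append(step(values[-order:]))
    return values[:n]

odds = generate([1], lambda prev: prev[0] + 2, 6)
fib = generate([1, 1], lambda prev: prev[0] + prev[1], 8)
print(odds)  # [1, 3, 5, 7, 9, 11]
print(fib)   # [1, 1, 2, 3, 5, 8, 13, 21]
```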


Thinking about recursivity

The unfolded view vs folded view vs graph view.


Driving a recursive sequence

Savings account example.


Injecting recursivity into a learner, the lazy way

Learning the function to describe a sequence

Learn the weights of a parameterized function by fitting; take the least squares cost function.

Regression! Windowing our data points based on input-output pairs.
Shows how to do this in Keras.
There can be more than one way to describe a sequence..
Applies this to a real financial dataset
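
A rough sketch of the windowing step, followed by a plain least squares fit with NumPy instead of Keras. The function name `window_transform` and the toy series are assumptions for illustration.

```python
import numpy as np

def window_transform(series, window_size):
    """Slide a window over a 1-D series; each window predicts the next value."""
    n_pairs = len(series) - window_size
    X = [series[i:i + window_size] for i in range(n_pairs)]
    y = [series[i + window_size] for i in range(n_pairs)]
    return np.array(X), np.array(y)

series = np.array([1, 3, 5, 7, 9, 11, 13])   # the odd numbers
X, y = window_transform(series, window_size=2)
# first pair: X[0] = [1, 3], y[0] = 5

# Fit y ~ X @ w by least squares (no bias term in this sketch).
w, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Here the fit finds `w = [-1, 2]`, i.e. next = 2 * last - second_to_last. That is a different but equally valid description of the odd numbers than "previous value plus 2".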


Injecting recursivity into a learner, the proper way

Failures of the FNN approach

We assumed no structure and just went pair by pair, but the input-output pairs depend on each other; they are not IID


Basic RNN approach

Force consecutive dependency!
Hidden states - the h variables
Use the least squares loss again, albeit now including the hidden variables as well.
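
As an illustration, a scalar RNN forward pass with a least squares loss might look like this. The weight names and values are arbitrary, and this is a sketch rather than the lesson's exact formulation.

```python
import numpy as np

def rnn_forward(xs, w_x, w_h, b, w_out, h0=0.0):
    """Scalar RNN: h_t = tanh(w_x * x_t + w_h * h_{t-1} + b), y_t = w_out * h_t."""
    h, hs, ys = h0, [], []
    for x in xs:
        h = np.tanh(w_x * x + w_h * h + b)  # hidden state carries the history forward
        hs.append(h)
        ys.append(w_out * h)
    return np.array(hs), np.array(ys)

def least_squares_loss(ys, targets):
    return float(np.sum((ys - targets) ** 2))

xs = np.array([0.1, 0.2, 0.3, 0.4])
targets = np.array([0.2, 0.3, 0.4, 0.5])
hs, ys = rnn_forward(xs, w_x=1.0, w_h=0.5, b=0.0, w_out=1.0)
loss = least_squares_loss(ys, targets)
```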


RNNs and memory

RNNs go much farther back in time to take into account the previous values; whereas FNN just depends on the immediately previous value.
Every level contains a complete history, or in other words, has memory.


Technical issues such as vanishing/exploding gradients exist

Lesson 11 : LSTMs- Long Short term Memory Networks


RNN vs LSTM

Use previous information- the animal NatGeo example.
RNNs generally store only short term memory due to vanishing gradients; LSTMs keep track of both long term memory and short term memory. (GHAJINI)
Combine both forms of memory using 4 gates- forget gate, remember gate, learn gate and use gate - these gates are used to update both long and short term memories.


About all gates: using example of NatGeo science and nature show

Learn Gate

Joins the short term memory and the event, and forgets the unimportant part (the ignore factor)


Forget Gate

The forget factor


Remember Gate

Combines the long term memory from the forget gate and short term memory from the learn gate and SIMPLY ADD THEM; and generate the new long term memory.


Use Gate

Takes whatever is useful from both long term and short term memories; and generate the new short term memory.
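
A loose NumPy sketch of one step through the four gates as described in these notes. It follows the notes rather than any particular library's LSTM cell, and all weight names (`W_n`, `W_i`, `W_f`, `W_u`, `W_v`) are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(ltm, stm, event, p):
    """One step through the four gates; p is a dict of weight matrices and biases."""
    joined = np.concatenate([stm, event])          # short term memory + event

    # Learn gate: propose new information, then scale it by the ignore factor.
    new_info = np.tanh(p["W_n"] @ joined + p["b_n"])
    ignore = sigmoid(p["W_i"] @ joined + p["b_i"])
    learned = new_info * ignore

    # Forget gate: scale the long term memory by the forget factor.
    forget = sigmoid(p["W_f"] @ joined + p["b_f"])
    kept = ltm * forget

    # Remember gate: simply add the two results to get the new long term memory.
    new_ltm = kept + learned

    # Use gate: keep what is useful from both memories as the new short term memory.
    use = sigmoid(p["W_v"] @ joined + p["b_v"])
    new_stm = np.tanh(p["W_u"] @ new_ltm + p["b_u"]) * use
    return new_ltm, new_stm

rng = np.random.default_rng(0)
hidden, n_in = 3, 2
p = {}
for name in ["n", "i", "f", "v"]:
    p["W_" + name] = rng.standard_normal((hidden, hidden + n_in))
    p["b_" + name] = np.zeros(hidden)
p["W_u"] = rng.standard_normal((hidden, hidden))
p["b_u"] = np.zeros(hidden)

ltm, stm = np.zeros(hidden), np.zeros(hidden)
ltm, stm = lstm_step(ltm, stm, np.array([1.0, -1.0]), p)
```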


There are many other architectures for dealing with the

Lesson 12 : Implementing RNNs and LSTMs


Begins with a review of RNNs and LSTMs

RNNs: Google translate improvement example, and the need for RNNs.

Route the output from the previous hidden layer back into the hidden layer.


LSTMs: Begins with the need due to vanishing or exploding gradients

Talks about the four gates.


Character wise RNN

Learn text one character at a time, and produce text one character at a time.

Get a probability distribution for the next character.


Sequence batching

Splitting sequences into batches of some lengths
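
One possible way to cut an encoded text into batches, where each target is the input shifted by one character. This is a sketch and not necessarily the notebook's own `get_batches`; the argument names mirror the hyperparameters `batch_size` and `num_steps`.

```python
import numpy as np

def get_batches(arr, batch_size, num_steps):
    """Yield (x, y) batches of shape (batch_size, num_steps); y is x shifted by one."""
    chars_per_batch = batch_size * num_steps
    n_batches = (len(arr) - 1) // chars_per_batch   # leave one char for the last target
    x_all = arr[:n_batches * chars_per_batch].reshape(batch_size, -1)
    y_all = arr[1:n_batches * chars_per_batch + 1].reshape(batch_size, -1)
    for start in range(0, x_all.shape[1], num_steps):
        yield x_all[:, start:start + num_steps], y_all[:, start:start + num_steps]

encoded = np.arange(21)          # stand-in for integer-encoded characters
batches = list(get_batches(encoded, batch_size=2, num_steps=5))
x0, y0 = batches[0]
```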


Building a character wise RNN - Anna KaRNNa

Builds LSTMs using Tensorflow

Hyperparameters
Here are the hyperparameters for the network.
batch_size - Number of sequences running through the network in one pass.
num_steps - Number of characters in the sequence the network is trained on. Larger is better typically, the network will learn more long range dependencies. But it takes longer to train. 100 is typically a good number here.
lstm_size - The number of units in the hidden layers.
num_layers - Number of hidden LSTM layers to use
learning_rate - Learning rate for training
keep_prob - The dropout keep probability when training. If your network is overfitting, try decreasing this.


Lesson 13: Hyperparameters in RNNs

How to tune hyperparameters, no magic numbers, they depend on the dataset.
Two categories

Optimizer hyperparameters - learning rate, minibatch size, number of epochs
Model hyperparameters - number of layers/units


The learning rate is the MOST IMPORTANT hyperparameter

The learning rate takes us closer to the least error
Compares choosing too big vs too small of a learning rate.
Learning rate controls all weights versus various error curves
Learning rate decay
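
Exponential decay is one common schedule for learning rate decay. The function name and numbers below are illustrative assumptions.

```python
def exponential_decay(initial_lr, decay_rate, decay_steps, step):
    """Learning rate after `step` updates: initial_lr * decay_rate ** (step / decay_steps)."""
    return initial_lr * decay_rate ** (step / decay_steps)

# With decay_rate=0.5 and decay_steps=1000 the rate halves every 1000 steps.
schedule = [exponential_decay(0.1, 0.5, 1000, s) for s in (0, 1000, 2000)]
```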


Minibatch size

Online training (batch size=1) vs batch training (batch size=all the examples)
Small minibatches give noisy gradient estimates, which helps avoid getting stuck in local minima, so they are often preferred.
Generally, 32 to 256 are good starting values.


Number of iterations/epochs

Use Early stopping, to stop when the validation error stops decreasing.
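
Early stopping can be sketched as a patience counter over the validation errors. The `patience` parameter and the error values are assumptions for illustration.

```python
def early_stopping_epoch(val_errors, patience=2):
    """Return the index of the best epoch, stopping once the validation error
    has failed to improve for `patience` consecutive epochs."""
    best_epoch, best_error, waited = 0, float("inf"), 0
    for epoch, err in enumerate(val_errors):
        if err < best_error:
            best_epoch, best_error, waited = epoch, err, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch

val_errors = [0.90, 0.70, 0.60, 0.62, 0.65, 0.55]
print(early_stopping_epoch(val_errors))  # 2
```

With `patience=2` training stops after epoch 4, so the later improvement at epoch 5 is never seen. That is the trade-off a larger patience value buys back.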


Number of hidden units

More units = more capacity to learn, but also more prone to overfitting.
Generally the more the better, but too many may lead to overfitting.


RNNs hyperparameters

No clear winner between GRUs and LSTMs- try both and test


Lesson 14: Sentiment analysis with RNNs


Sentiment analysis with RNNs - movie reviews in this case

Word2Vec- words are encoded as integers and then mapped to dense vectors (word embeddings)
Embedding lookup layer- why do we need to do this?!THINK!
Dropout wrappers for dropout regularization
tf.nn.dynamic_rnn


Project:Recurrent Neural Network Projects


Windowing out the sequence into input/output pairs.
Write a very simple RNN sequence with LSTM.
How can we train a machine learning model to generate text automatically, character-by-character? By showing the model many training examples so it can learn a pattern between input and output.
It is a multiclass classification problem!

Lesson 16: Generative Adversarial networks - Ian Goodfellow!


Uses of GAN - to generate data

Mostly done in the field of images.
Example a description of a bird is used to generate images matching that description.
Imitation Learning


How GANs work?

Generative models

Generate images by running noise through a differentiable function (typically a neural network).
Training process is different from supervised learning. We show the model a bunch of images and ask it to generate new images like them.
Uses a discriminator to classify images as real or fake
Generator tries to fool the discriminator by generating fake images, and is forced to get better at making realistic images.


Generator vs Discriminator- a game between these two

Discusses a bit about game theory.
Equilibrium in the GAN game - has two different players with two different costs

Generator and discriminator compete against each other.
Not always that you find the equilibrium


Tips to train GANs

Need to run two optimizers - one for the generator loss and one for the discriminator loss
For large images, we use convolutional networks.
Different from CNNs as in CNNs we go from a larger image to a smaller image; but in GANs we go from a small feature to large images.
A project to build a GAN.

Generator which generates data and discriminator discriminates (acts as police) to call real as real and fake as fake.
tf.variable_scope and tf.trainable_variables
Calculating losses is tricky.
Labels smoothing
Shows how GANs learn progressively epoch by epoch..


Lesson 17: Deep Convolution GANs


DC-GAN

Transposed Convolution - You upsample the image here
Batch Normalization


https://github.com/udacity/deep-learning/blob/master/batch-norm/Batch_Normalization_Lesson.ipynb


The idea is that, instead of just normalizing the inputs to the network, we normalize the inputs to layers within the network. It's called "batch" normalization because during training, we normalize each layer's inputs by using the mean and variance of the values in the current mini-batch.


Has several benefits, converges faster; can train with higher learning rates.


Notebook : https://github.com/udacity/deep-learning/blob/master/batch-norm/Batch_Normalization_Lesson.ipynb
Batch normalization is a technique for improving the performance and stability of neural networks. The idea is 
to normalize the layer inputs such that they have a mean of zero and variance of one, much like how we standardize
the inputs to networks. Batch normalization is necessary to make DCGANs work.
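
The training-time computation can be sketched in NumPy as follows. `gamma` and `beta` are the learnable scale and shift parameters; at inference time running averages of the mean and variance are normally used instead, which this sketch leaves out.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)   # zero mean, unit variance per feature
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 4))   # mini-batch of layer inputs
out = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
```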


Project: DC Gan: Generate street signs.


Changing the generator and discriminator to be convolutional networks

Use transposed convolution, after transposed convolution, do batch normalization
For each of these layers, the general scheme is convolution > batch norm > leaky ReLU.
GANs are VERY SENSITIVE to hyperparameters.


Lesson 18: Semisupervised learning


An application of GANs

Use GANs to improve classification models.
People vs Deep Learning: people receive a lot of unlabeled data, but deep learning usually only receives labeled data.
Has both labeled and unlabeled data; eg. can leverage the internet to get huge amounts of unlabeled data.
Train both generator and discriminator, then the generator is thrown away and the discriminator is used as a classifier
Feature matching


Notebook on GANs- street view numbers


Turn discriminator into classifier

Three sources, labeled images, unlabeled real images, fake/imaginary images.
Generator is a normal DCGan, Discriminator is a multi class classifier now
More regularization, because there are fewer labeled examples. Also leaky ReLU to allow gradients to pass through.
Feature Matching - Make sure that the average feature values on real data are similar to the ones on data generated by the generator.
Add Loss functions for both supervised and unsupervised.
Moment Matching


Lesson 20: Intro To Computer Vision


What is visual perception?
Role in AI: For example for a self driving car to see and react

In recognizing persons from images etc.
Medical images


Emotional intelligence
Computer Vision pipeline
Affectiva demo

Lesson 21: Intro to Natural Language Processing


Why is it difficult for computers to understand us?

Because human language does not have a fixed structure.
Structured languages have a fixed grammar, and a parser gives up if something is outside that grammar.
Human discourse is unstructured and complex.
Needs context.

We implicitly apply our knowledge of the physical world.


Applications of NLP

Chat bots


Challenges in NLP

Maintaining a context


Lesson 22: Intro to Voice User Interfaces (VUI)


VUI Overview and pipeline

Acoustic Model, Language Model and Accent Model


VUI Applications

Speaking is faster and less distracting than typing


Alexa Demo

Alexa Skills


Computer Vision

Project : Mimic me


Use of Affectiva's API.
Task to recognize a mood and put an emoji next to it.

Image Representation and Analysis


Pre-processing

Why?

To correct images and remove noise
Enhance parts of image that are important


How

RGB to grayscale

Reduces storage size needed
Color is often not required to detect objects and interpret an image
But sometimes color is important, for example when you have to distinguish between yellow and white lane lines.


Images as functions


Color Thresholds

Used to select an area of interest.


OpenCV reads its images as BGR, so always should convert to RGB

The blue screen coding exercise
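
A toy version of the channel swap and a blue-screen color threshold, using plain NumPy on a two-pixel image. The threshold values are assumptions. With OpenCV the same steps are usually done with `cv2.cvtColor(image, cv2.COLOR_BGR2RGB)` and `cv2.inRange`.

```python
import numpy as np

# Tiny stand-in image in BGR channel order: one pure-blue pixel, one pure-red pixel.
bgr = np.array([[[255, 0, 0], [0, 0, 255]]], dtype=np.uint8)

rgb = bgr[..., ::-1]            # reverse the channel axis: BGR -> RGB

# Color threshold: select pixels whose RGB values fall inside a "blue" range.
lower = np.array([0, 0, 200])
upper = np.array([50, 50, 255])
mask = np.all((rgb >= lower) & (rgb <= upper), axis=-1)

masked = rgb.copy()
masked[mask] = 0                # black out the blue-screen pixels
```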


Color Spaces

Different color spaces- HSV for example; sometimes HSV is better to segment out objects (for eg. in the pink balloons case!)
Separate out channels - RGB/HSV


Geometric Transforms

Move pixels based on a mathematical formula, to change the perspective of an image.

Eg. scanning and aligning text.


cvtColor


Transforming Text

Straightening text out from a business card
Map from original image to warped image using geometric transformations
Guesstimate the coordinates, get the transform and apply it


Filters in images

Edge detection filters- high pass filters
Noise removal filters- low pass filters
High frequency components vs low frequency components
High pass filter example- emphasizes edges, where the intensity changes very quickly (high frequency)

Kernel MUST sum to zero (WHY?!)
Kernel and weights
How to create your own filter, using openCV and use inbuilt filters. Eg. Sobel Filter (Ah memories!)
Importance of setting thresholds
High pass filters can enhance noise, so should do low pass before doing edge detection.


Low pass filters

Noise such as speckles, no useful information; for eg. edge detection filters AMPLIFY noise.
Low pass filters use average, and should be normalized so that sum is one.
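
A small experiment that hints at why a high pass kernel sums to zero and a low pass kernel sums to one. `filter2d` is a hand-rolled helper written for this sketch; in OpenCV the usual call is `cv2.filter2D`.

```python
import numpy as np

def filter2d(image, kernel):
    """Valid-mode 2-D correlation of a grayscale image with a small kernel."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)  # high pass, sums to 0
box_blur = np.ones((3, 3)) / 9.0                                        # low pass, sums to 1

flat = np.full((5, 5), 100.0)                                # constant brightness
edge = np.tile([0, 0, 0, 255, 255], (5, 1)).astype(float)    # vertical edge

flat_high = filter2d(flat, sobel_x)    # all zeros: no intensity change, no response
flat_low = filter2d(flat, box_blur)    # all 100: overall brightness is preserved
edge_high = filter2d(edge, sobel_x)    # large values where the edge is
```

A kernel that sums to zero gives no response on a flat region, so only changes in intensity show up. A kernel that sums to one leaves the average brightness unchanged.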


Gaussian blur - the most common used low pass filter.

First pass through gaussian blur (remove noise), then do edge detection.


Canny edge detector! (memories!)

Non maximal suppression (Remember Bobick!)


Image segmentation


To segment an image into areas
Image contouring

Useful to see connected objects and segment objects.
Done on a B&W image, after thresholding.
Contour features

Area, perimeter, orientation (based on the fitted ellipse)


Hough transform

Line detection (BOBICK! CV assignment 1)
Convert into hough space (m and b coordinates).
Better to convert into polar coordinates
cv2 houghlines parameters


K-means clustering

Unsupervised method to break an image into segments.


Feature Extraction and Object Recognition


What is a feature?

A feature is a measurable piece of data in an image
Should be consistent across different scales, lighting etc.
Should be repeatable- very important


Types of features

Edges, corners and blobs
Corners- best repeatable

Are more unique than other features.


Corner detector

Calculate gradient magnitude and direction

A corner has a big variation in direction and magnitude of the gradient


Dilation and erosion (AH MEMORIES!!)

Morphological operations
Remember closing and opening


Feature Vectors

Look at the direction of gradients

Eg. divide into grid and see directions of gradients.


HOG (Histogram of oriented gradients)

Use binning to separate pixels.
Orientation and magnitude of gradients - get via Sobel, then place the data into a histogram (after dividing into cells)
Should be scale and rotation invariant
HOG is also referred to as a type of feature descriptor, which is a simplified representation of an image that is made up of extracted features (that highlight important parts in an image) and that discards extraneous information. In this case the features represent the image gradient -- its magnitude and directions, which describe the shapes and patterns of intensity that make up the image.
Talks about how to create a HOG feature vector
Block normalization


Object recognition

Positive vs negative examples
Extract features and feed to training algorithm (supervised learning)
Eg. use an SVM classifier


Haar cascades

Haar features

Detect rectangular patterns like edges etc. (good for faces).


Haar cascade classifiers reject non-face (irrelevant) data

Fast enough to process in real time.


Motion/Video

Video is a sequence of image frames!
Optical flow (BOBICK!)

For motion and tracking analysis
Assumptions- Pixel values do not change from frame to frame, and neighboring pixels have similar motion


Final Capstone Project: CV Face Detection


## Artifical Intelligence Nanodegree.md

      
Lesson 4: Introduction to AI


Examples of problems to use AI?

Navigation problem: planning a path
Heuristic - some additional info that makes brute force act in a more intelligent manner
A* search


Tic tac toe quiz question

Every board config is a node
Pruning the search tree and adversarial search; anticipate changes in the environment


Monty hall problem

Probability theory


What is intelligence?

Defining intelligence? Should not be based on our perception, but should be defined in the context of a task.
Definition of agent, environment and state
Perception, Action and Cognition. Reactive or behavior based agents.


Classifying AI problems

Based on environment- stochastic-deterministic; adversarial; partially observable etc.
Classifying different problems such as poker, driving on the road etc.
Rational behavior and bounded optimality


Lesson 5: Applying AI to Sudoku


Constraint Propagation and Search, simple concepts applied well
Setting up the board- Defining boxes, units and peers

Representing the sudoku puzzle as a string/dictionary


Strategy 1 : Elimination- eliminate values that cannot be there based on constraints
Strategy 2: Only choice- since every unit must contain each digit exactly once, if a digit fits in only one box of a unit, it must go there
Combine the two strategies: The concept of constraint propagation

Need to stop when the board is solved
Don't go further if there is no progress, i.e. the solver has stalled
Recursively apply the two strategies from above
Does NOT work on hard sudoku puzzles! Which brings us to the third strategy..


Strategy 3: SEARCH

Pick boxes with fewest options- then branch out - use DFS


Lesson 6: Basics of Anaconda

PROJECT: SUDOKU

Lesson 8: Introduction to Game Playing


The game of isolation

Building a game tree - build a tree based on choices of move at every step
Early detection and telling the computer not to lose early, which leads us to...


The minimax algorithm

Computer tries to maximize its score, and the opponent tries to minimize it
Propagate score up the tree
Finding branches where the computer can win
Branching factor and the number of nodes one needs to visit


Depth Limited Search

The average branching factor - just try it out and see the average branching factor for a particular board setting. Even with this, due to the exponential growth of the game tree, there are too many branches!
Need a way for the computer to choose a move quickly; given the limited processing power
The evaluation function, in this example it is comparing the nodes based on the maximum number of moves a node can have; propagating bottom-up
Quiescent Search - after which level the worst branches do not change; i.e. become quiescent


Lesson 9: Advanced Game Playing


Iterative Deepening- return the best answer found within the given time constraint; i.e. search as deep along the tree as time allows.

Number of nodes needed to explore based on the branching factor
Varying the branching factor as you progress along the game
Horizon effect


Using different evaluation functions and find the best one out
Alpha beta pruning algorithm

Pruning the number of nodes to look at using the parameters alpha and beta, reduces the search space
Solving 5 by 5 isolation, use the symmetry of the board to see similar move


3 Player isolation game

Minimax doesn't work anymore. We have triplets at each level and choose values most suitable to a particular player, i.e. where its score can be maximized.
Alpha beta pruning for a 3 player isolation
Deep pruning is not possible-  can do immediate and shallow pruning
Paper by Korf


Probabilistic Games

Sloppy isolation
EXPECTIMAX function - pruning in a probabilistic sense


Lesson 11: Search


What is a problem?

Examples: Route Finding
Definition of a problem

an initial state,
a set of actions in a particular state
a result
Goaltest, i.e. if this state is the GOAL or not (GOALLLLL)
Path Cost - implemented as a Step Cost function


Example Route Finding

The state space is the entire area
Frontier- the farthest part explored, the unexplored region; and the explored region


Tree Search algorithms

Breadth first search
Keeping track of explored (visited) states!
Notes about termination only when you find the best path


Uniform Cost Search

Taking into account the cost
Continue to search until you find a better path; stop when the path can no longer be improved.
Depth First Search is not optimal in this context

Then why would you use depth first search at all?! - Due to less storage requirements!
Depth First Search is also not complete


About uniform cost-  need more knowledge to get to the goal faster

For example, in the route finding problem, having an estimate of the distance to the goal would help.


The A star algorithm!

Minimizing f- the sum of g and h
Minimizing f keeps the path short (g) and keeps the search focused on the goal (h); both components are minimized at the same time.
KEEP EXPANDING until all paths have been explored!
A star finds the best path if the heuristic function never exceeds the true cost.

Should not overestimate
Is optimistic
Is admissible
Why does the optimistic function 'h' work?
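
A compact A* sketch on a tiny hand-made graph, minimizing f = g + h. The graph, step costs and heuristic values are invented for illustration; the heuristic never overestimates the true remaining cost.

```python
import heapq

def a_star(graph, h, start, goal):
    """graph: {node: [(neighbor, step_cost), ...]}; h: admissible estimate to goal."""
    frontier = [(h[start], 0, start, [start])]   # entries are (f, g, node, path)
    best_g = {}
    while frontier:
        f, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return g, path
        if node in best_g and best_g[node] <= g:
            continue                              # already expanded via a cheaper path
        best_g[node] = g
        for neighbor, cost in graph[node]:
            g2 = g + cost
            heapq.heappush(frontier, (g2 + h[neighbor], g2, neighbor, path + [neighbor]))
    return None

graph = {"A": [("B", 1), ("C", 4)], "B": [("C", 2), ("D", 5)], "C": [("D", 1)], "D": []}
h = {"A": 3, "B": 2, "C": 1, "D": 0}
print(a_star(graph, h, "A", "D"))  # (4, ['A', 'B', 'C', 'D'])
```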


State Spaces

The Vacuum Example
The number of states is TOO DAMN HIGH!


Sliding Blocks Example

Finding an appropriate heuristic function
What is an admissible heuristic function? Comparing two good heuristic functions
Can we automatically come up with a heuristic function? Can come up with heuristic by defining the problem in words.
Generating a relaxed problem


Problems with search

Constraints

Domain must be fully observable
Domain must be known
Must be discrete
Deterministic
Static


Notes on implementation

Nodes and Paths


PACMAN project
VERY NICE AND DIFFICULT!


Implementing all search algos on PACMAN


Lesson 12: Simulated annealing

Solving a problem by adding some simple intelligence


Travelling Salesman Problem

NP Hard


N Queens Problem

Arrange in a way to have no more attacks
The heuristic function for N Queens-iterate and reduce number of attacks with each move
Local Minima - you get stuck! (But it is still solvable)


Hill Climbing

Fewer dimensions- Local maximum problem
Random Restart!- take maximum of all local peaks
Tabu Search algo
Step Size- too small vs too large
Start with a large step size, then reduce it to make sure to reach the minima


Simulated Annealing

Introduction to physical annealing
Heating and cooling to get out of local minima
Iterate to find a better position
Start with higher randomness, gradually reducing randomness (vary T from very large to very small)
GUARANTEED to converge to the global maximum, provided T is lowered slowly enough!
Local Beam Search - keeps track of K particles
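
A generic simulated annealing loop for maximization, as a sketch. The geometric cooling schedule, the parameter defaults and the bumpy test function are all assumptions, and a schedule this fast carries no convergence guarantee.

```python
import math
import random

def simulated_annealing(value, neighbor, start, t_start=10.0, t_end=1e-3, cooling=0.95, seed=0):
    """Maximize `value`, returning the best state seen while T cools from t_start to t_end."""
    rng = random.Random(seed)
    current = best = start
    t = t_start
    while t > t_end:
        candidate = neighbor(current, rng)
        delta = value(candidate) - value(current)
        # Always accept improvements; accept worse moves with probability exp(delta / T).
        if delta > 0 or rng.random() < math.exp(delta / t):
            current = candidate
        if value(current) > value(best):
            best = current
        t *= cooling                     # cool down: less randomness over time
    return best

def value(x):
    return -(x - 7) ** 2 + 3 * math.cos(3 * x)   # bumpy function with several local maxima

def neighbor(x, rng):
    return x + rng.uniform(-1.0, 1.0)

best = simulated_annealing(value, neighbor, start=0.0)
```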


Genetic Algorithms

Survival of the fittest - breeding and mutation.
Crossover - children get good aspects of their parents through natural selection.
What if a Critical piece gets eliminated?- Solved by having more randomness
Without mutation, we might NEVER reach the goal!


Simulated annealing lab

Implement simulated annealing
Nice assignment! All functions were kind of challenging.


Constraint Satisfaction


Constraint Graph
Map Coloring Constraint Problem
Constraints can be unary, binary or have even more variables.
Backtracking Search

Improving efficiency, use the least constraining value
Use the most constrained variables - solve more constraints sooner; or the minimum remaining values
Forward Checking - maintain a map of all possible values for a variable
Arc consistency


Structured CSPs

Break into independent subproblems - the Tasmania example


Challenge Question - TWO + TWO = FOUR crypto question
Constraint Satisfaction Lab

Logic and Reasoning


Propositional Logic

Representing events as True or False, or relation between them.
Truth Tables and the symbols used
Valid, Satisfiable and Unsatisfiable
Limitations

Can only handle true and false; not probability
Cannot talk about objects or relationships between them


First order logic

Comparison with propositional logic and probability theory
Represents relationships between objects; more complex models
Defining the models; objects, functions and constants
Talk about syntax

sentences, terms, quantifiers


Representing the vacuum world as first order logic
Questions for First Order Logic: Practice. Representing English statements as First Order Logic. - NICE QUESTIONS


Planning


Just planning is NOT enough, need feedback to rightly execute and finish the task.

Environment is stochastic, and there are other agents too ; can't know this info beforehand
Partial observability
Some unknown
BELIEF states instead of WORLD states
Example with vacuum cleaner, what if the vacuum sensors break down?!
Successful plans!


Mathematical formulation for successful plans

Tracking the predict-update cycle

Describing in terms of variables


Belief state space

Sensorless vacuum example!
Conformant plans - where we do not know everything about the world, but we will still reach our goal!


Partially observable vacuum example

Act-observe cycle
Actions increase uncertainty, and observations bring them down!
Can't guarantee ALWAYS!

Infinite sequences!


Classical Planning

Assign all values to K boolean variables - State Space
World state- complete assignment
Belief state- complete assignment or partial assignment
Actions and preconditions

Example of the fly action schema*


Progression state space search vs Regression state space search

Regression starts from the goal
Progression starts from the initial state
When is it better to search backwards vs forwards?


Plan Space Search

Search through plans


Forward search is the MOST POPULAR
Importance of heuristics
Situation Calculus

Successor state axiom


Lesson 16:Probability - SEBASTIAN THRUN IS BACK!


Intro to Probability and Bayes Network

A network of reasons- the "car won't start" example
Car won't start- battery is dead/battery won't charge- and so on and so forth down the chain of reasons.
Come up with a sort of a dependency graph for various variables

16 variables in this structure, so 2^16 values


Specify..observe..compute
Assumption that every event is discrete/binary


Probability concepts

Complementary and joint probability
Concept of dependence and conditional probability
Total Probability- CANNOT NEGATE the conditional variable!
Some quizzes on these concepts.


Bayes Rule!

Likelihood, prior and marginal likelihood


Lesson 17: Bayes Networks


A and B - A is not observable, B is observable

Diagnostic Reasoning


Computing Bayes Rule

The denominator (total probability for B is HARD to compute)
So we just compute the unnormalized terms, then divide each by their sum to get the normalized version
Nice way to calculate probabilities in the quizzes!
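
The normalization trick in numbers. The probabilities below are illustrative values for a diagnostic test, chosen for this sketch.

```python
p_c = 0.01                 # prior P(C), illustrative
p_pos_given_c = 0.9        # likelihood P(+ | C)
p_pos_given_not_c = 0.2    # P(+ | not C)

# Unnormalized posteriors: the numerator of Bayes rule for each hypothesis.
u_c = p_pos_given_c * p_c
u_not_c = p_pos_given_not_c * (1 - p_c)

# Their sum is the total probability P(+), so it serves as the normalizer.
eta = u_c + u_not_c
post_c = u_c / eta
post_not_c = u_not_c / eta
print(round(post_c, 4))  # 0.0435
```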


Conditionally independent

Absolute independence does not imply conditional independence, and vice versa; neither can be deduced from the other.


Confounding case

Two causes affect an observable variable


Explaining away effect

If an effect has multiple possible causes, observing that one of those causes is true explains away the effect and makes the other causes less likely.
DIFFICULT QUESTIONS IN QUIZ on this effect!


Defining Bayes Networks

A graph explaining probability relationships between various event
Joint probability is defined by factoring in conditional relationships etc.
A node with k inputs requires 2^k variables (parameters) to define
Bayes networks reduce the number of params needed by quite a lot! So very useful


D-separation

Any two variables are independent if they are not linked by just unknown variables
Two independent variables affecting a third variable become dependent once we know that third variable: the explaining away effect


Lesson 18: Inference in Bayes Nets


Evidence variables, Hidden Variables and Query Variables
Output is a joint probability distribution over the query variables, given the evidence variables
Which query variable is the most likely?!

Can also go in opposite direction- reverse evidence variables and query variables.


Inference by Enumeration

Enumerate over all the hidden variables
Speeding up enumeration - Maximize independence (determine through the bayes network)
Causal direction- inference is easier when the graph goes from causes to effects


Variable elimination

Divide into smaller parts..enumerate..then combine
Join factors to form larger factors and then eliminate variables


Approximate Inference and Sampling

Estimating by sampling and performing experiments
Advantage- no complex computations; simulation does not need conditional probability tables
Sprinkler Rain example
With an infinite number of samples, we approach the true probabilities


Rejection Sampling - only keep the samples that match the scenario we want to compute

Can end up rejecting a lot of samples...for eg. Burglaries and earthquakes are very infrequent
Likelihood weighting - add a probabilistic weight to each sample, according to the probability of the conditions
Does not solve all our problems though...so Gibbs Sampling - takes all the evidence into account - MCMC, samples depend on each other


Monty Hall Problem Example

Learning more about a door changes probability
Monty Hall Letter


Lesson 19: Hidden Markov Models


Pattern Recognition through time

Dolphin communication problem
'Delta Frequency'
Time warping- should not matter if a whistle is quick or drawn out longer in time


Dynamic Time Warping

Matching two signals sample-wise
Try to keep to the diagonal as much as possible
Could get false positives- matching signals that are not actually similar
Bound how much we can deviate- The Sakoe Chiba bound
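
A plain dynamic time warping sketch with an optional Sakoe-Chiba band. The toy "whistles" are invented, and the `band` argument is this sketch's way of expressing the bound.

```python
import math

def dtw_distance(a, b, band=None):
    """Dynamic time warping cost between two sequences.

    band: optional Sakoe-Chiba bound, the maximum allowed |i - j| offset
    from the diagonal of the alignment matrix.
    """
    n, m = len(a), len(b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if band is not None and abs(i - j) > band:
                continue                       # outside the band: not allowed
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

fast = [0, 5, 0]                # quick whistle
slow = [0, 0, 5, 5, 0, 0]       # same shape, drawn out in time
print(dtw_distance(fast, slow))          # 0.0
print(dtw_distance(fast, slow, band=3))  # 0.0
print(dtw_distance(fast, slow, band=1))  # inf
```

The last call shows the trade-off: a band that is too tight rules out the warping needed to match signals of very different lengths.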


Hidden Markov Models

Pattern recognition through time
Representing markov models
Self transition
Application: Sign Language Recognition

HMM for "I" vs "We"


Viterbi Trellis

Eliminating by the constraints
Many options in the middle
The Viterbi Path


Theory on HMMs and Phrase recognition- LOST due to Github error
Context Training

Using context in phrases- eg. combine models for I and need - Coarticulation


Statistical Grammar

Record the fraction of words occurring together


State Tying
Segmentally Boosted HMMs
Using HMMs to generate data


## Coursera- Deep learning Specialization.md

      
Course 1 : Neural Networks and Deep Learning

Week 1: Introduction to Deep learning


What is a neural network? Housing price prediction model.
Neural networks and Supervised Learning; and types of neural networks-

Structured Data vs Unstructured Data


Why is deep learning taking off?

Because of Scale! (more and more data)
NNs performance generally increases with more data
Faster Computation


Week 2: Logistic Regression as a Neural Network


Binary Classification


Logistic Regression


Loss Function and the Cost function- The benefits of choosing a convex function for a loss function.


Gradient Descent and finding the minima


A refresher on derivatives


Computation Graph,

derivatives with computation graph- excellent video! - Chain rule


Gradient descent using logistic regression- minimizing the loss function.

Updating the weights using the backward propagation step.


Vectorization

Removing for loops- to improve the run time. Eg. np.dot to get the dot product.
Try to avoid for loops when you can. Many functions in numpy to do so!
A logistic regression without any for loop
Doing the backward and forward propagation steps without any for loops, using numpy
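
A loop-free forward and backward pass for logistic regression, as a sketch. The name `propagate` follows the assignment's function name mentioned in these notes, but the body and the toy data are written here for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def propagate(w, b, X, Y):
    """X: (n_features, m), Y: (1, m). Returns cost and gradients with no explicit loops."""
    m = X.shape[1]
    A = sigmoid(np.dot(w.T, X) + b)                        # forward pass, shape (1, m)
    cost = -np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A))
    dZ = A - Y                                             # backward pass
    dw = np.dot(X, dZ.T) / m
    db = np.sum(dZ) / m
    return cost, dw, db

w, b = np.zeros((2, 1)), 0.0
X = np.array([[1.0, 2.0, -1.0], [3.0, 4.0, -3.2]])
Y = np.array([[1, 0, 1]])
cost, dw, db = propagate(w, b, X, Y)
```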


Broadcasting in python/numpy

how python/numpy treats arrays of different sizes.
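
A small illustration of broadcasting, using made-up numbers: NumPy stretches an array along any axis of size 1 so that the shapes match.

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])            # shape (2, 3)

col_sums = A.sum(axis=0, keepdims=True)    # shape (1, 3)
percent = 100 * A / col_sums               # (2, 3) / (1, 3): the row of sums is reused for each row of A

row_bias = np.array([[10.0], [20.0]])      # shape (2, 1)
shifted = A + row_bias                     # (2, 3) + (2, 1): the column is reused for each column of A
```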


PROJECT: Logistic Regression Model to recognize cats


Preprocessing steps
Use assertions for size and shape of numpy arrays
Nice assignment!- implementing a NN yourself from scratch.
Key Takeaways from the assignment

Preprocessing the dataset is important.
You implemented each function separately: initialize(), propagate(), optimize(). Then you built a model().
Tuning the learning rate (which is an example of a "hyperparameter") can make a big difference to the algorithm. 
You will see more examples of this later in this course!


A discussion (optional exercise) on the importance of choosing a good learning rate!

Different learning rates give different costs and thus different predictions results.
If the learning rate is too large (0.01), the cost may oscillate up and down. It may even diverge (though in this example, using 0.01 still eventually ends up at a good value for the cost).
A lower cost doesn't mean a better model. You have to check if there is possibly overfitting. It happens when the training accuracy is a lot higher than the test accuracy.
In deep learning, we usually recommend that you:
Choose the learning rate that better minimizes the cost function.
If your model overfits, use other techniques to reduce overfitting. (We'll talk about this in later videos.)

Week 3: Shallow Neural Network


Overview of neural networks, comparison with logistic regression.
Neural networks with a single hidden layer.

Introduction to hidden layer.
Superscript notations etc.


Computation using a neural network

Logistic Regression multiple times
Vectorization


Vectorization across multiple examples

Justification for the implementation


Activation functions

Hyperbolic tan ( tanh )- why is this better?
When to use sigmoid as the activation? (Andrew Ng says tanh is almost always superior to sigmoid; the exception is the output layer, where sigmoid is used for binary classification)
So what is this ReLU everyone talks about? ReLU and Leaky ReLU; a network with ReLU usually learns faster


Why do you need an activation function? IMPORTANT

Derivatives of activation functions


Gradient descent for neural networks
Intuition behind backpropagation
The weights of a neural network should be initialized to random values (WHY NOT ZERO? What's the problem?)- Symmetry Breaking Problem

Project: Planar data classification with one hidden layer


Logistic Regression doesn't do well because the data is not linearly separable

Reminder: The general methodology to build a Neural Network is to:
1. Define the neural network structure ( # of input units,  # of hidden units, etc). 
2. Initialize the model's parameters
3. Loop:
  - Implement forward propagation
  - Compute loss
  - Implement backward propagation to get the gradients
  - Update parameters (gradient descent)


BE CAREFUL WITH THE SIZES OF THE MATRICES!!
The importance of a good converging learning rate

The larger models (with more hidden units) are able to fit the training set better, 
until eventually the largest models overfit the data.
The best hidden layer size seems to be around n_h = 5. Indeed, a value around here
seems to fit the data well without also incurring noticeable overfitting.
You will also learn later about regularization, which lets you use very 
large models (such as n_h = 50) without much overfitting.

Week 4: Deep Neural Network


What is a deep neural network? Notations etc.
Forward Propagation
FOCUS on MATRIX DIMENSIONS- Working through the matrix dimensions of a deep neural network. Think about the dimensions of the weight matrix and the bias vector at every step.
Why need a DEEP network? Great video!

Circuit Theory and Deep Learning


Building blocks of deep learning networks - going through a backward and forward propagation, layer by layer
Hyperparameters vs Parameters- deep learning is an empirical process, wash-rinse-repeat.
Is there a relation between the brain and deep learning? (Spoiler Alert: Not a whole lot)

Project: Building your deep neural network

* Implementing a L layer neural network from scratch- both backward and forward

Project: Using above project to detect cats vs non cats

* ALWAYS resize all images to the same size before feeding to the network.
* 2 Layer vs L Layer network, try different values of L
* **Vectorization helps a LOT with the speed**
* Causes of mis-prediction

Course 2: Improving Deep Neural Networks, Hyperparameter tuning, regularization

Week 1: Practical Aspects of Deep Learning


Setting up your train/dev/test sets

It is an iterative LEARNING process!
Why need a test/valid sets?


Bias variance tradeoff

Parameters to analyze bias and variance (Overfitting vs underfitting) - See the error!


Basic recipe for machine learning/deep learning?

Question 1: Does the model have a high bias? See the train set performance.
Question 2: Does the model have a high variance? See the dev/validation set performance.
Rinse and repeat
Through deep learning, it has been possible to somehow reduce bias variance tradeoff, i.e. you can bring one down without affecting the other


Regularization

L2 regularization, L1 regularization, lambda is the regularization parameter, Frobenius norm for w
L2 regularization is also called weight decay (Why?, Remember HDP!)
Why does regularization prevent overfitting?
Dropout regularization - Keep a hidden unit for training with some probability; Inverted Dropout
The intuition behind dropout- great video! Because a node knows that an input (feature) can go away randomly, it spreads out weights across features.
Can change the probability of drop out (keep_prob), by layers, For example, layers with more nodes can have a higher probability of dropout. Drawback- The loss function is a bit undefined here, so hard to debug if it is monotonically decreasing with epochs
Other regularization methods - Data Augmentation, Early Stopping
Orthogonalization - Think of one problem at a time, machine learning funda by Andrew Ng.


Normalization

Normalizing the mean and the variance of the input features: subtract the mean and make the variance of every feature 1. Why normalize?


Vanishing and Exploding gradients problem!
Gradient checking- to check your implementation

Only use for debugging, not for training; does not work with dropout.


Assignments: Initialization


Comparing three types of initialization for weights, zeros initialization vs random vs He initialization
Zero initialization is VERY BAD! The cost function does not even go down with iterations. Why?

In general, initializing all the weights to zero results in the network failing to break symmetry. This means that every neuron in each layer will learn the same thing, and you might as well be training a neural network with $n^{[l]} = 1$ for every layer, and the network is no more powerful than a linear classifier such as logistic regression.
What you should remember:
The weights $W^{[l]}$ should be initialized randomly to break symmetry.
It is however okay to initialize the biases $b^{[l]}$ to zeros. Symmetry is still broken so long as $W^{[l]}$ is initialized randomly.


Random initialization- good, but not great

The cost starts very high. This is because with large random-valued weights, the last activation (sigmoid) outputs results that are very close to 0 or 1 for some examples, and when it gets that example wrong it incurs a very high loss for that example. Indeed, when $\log(a^{[3]}) = \log(0)$, the loss goes to infinity.
Poor initialization can lead to vanishing/exploding gradients, which also slows down the optimization algorithm.
If you train this network longer you will see better results, but initializing with overly large random numbers slows down the optimization.


He initialization

Based on a paper, it does great!


WHAT TO REMEMBER


What you should remember from this notebook:
Different initializations lead to different results
Random initialization is used to break symmetry and make sure different hidden units can learn different things
Don't initialize to values that are too large
He initialization works well for networks with ReLU activations.
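
A minimal sketch of He initialization, assuming the usual scaling of standard-normal weights by sqrt(2 / fan_in) and zero biases. The function name `initialize_he` and the layer sizes are illustrative assumptions.

```python
import numpy as np

def initialize_he(layer_dims, seed=0):
    """layer_dims: list of layer sizes, input layer first."""
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(layer_dims)):
        fan_in = layer_dims[l - 1]
        # Random weights break symmetry; the sqrt(2 / fan_in) factor keeps them from being too large.
        params[f"W{l}"] = rng.standard_normal((layer_dims[l], fan_in)) * np.sqrt(2.0 / fan_in)
        # Biases can safely start at zero.
        params[f"b{l}"] = np.zeros((layer_dims[l], 1))
    return params

params = initialize_he([1000, 500, 1])
```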

Assignments: Regularization

Project to see where the french goalkeeper should kick to reach his team's players.

Implement regularization- overfitting reduces and test set accuracy goes up after regularization

What you should remember -- the implications of L2-regularization on:
The cost computation:
A regularization term is added to the cost
The backpropagation function:
There are extra terms in the gradients with respect to weight matrices
Weights end up smaller ("weight decay"):
Weights are pushed to smaller values.
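
A minimal sketch of the two places where L2 regularization shows up, assuming the common convention of a lambda / (2m) penalty on the squared weights. The helper names and the example matrices are hypothetical.

```python
import numpy as np

def l2_penalty(weights, lambd, m):
    """Term added to the cost: (lambda / 2m) * sum of squared entries of all weight matrices."""
    return (lambd / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)

def l2_grad_term(W, lambd, m):
    """Extra term added to dW in backpropagation: (lambda / m) * W."""
    return (lambd / m) * W

W1 = np.array([[1.0, -2.0], [0.5, 0.0]])
W2 = np.array([[3.0, 1.0]])
penalty = l2_penalty([W1, W2], lambd=0.1, m=10)
extra_dW2 = l2_grad_term(W2, lambd=0.1, m=10)
```

Because the extra gradient term is proportional to `W` itself, every update shrinks the weights a little, which is where the name "weight decay" comes from.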


Implement dropout

Apply a random mask (kept with probability keep_prob) to the activations in forward propagation and to the gradients in backpropagation, and divide by keep_prob to scale the result.

What you should remember about dropout:
Dropout is a regularization technique.
You only use dropout during training. Don't use dropout (randomly eliminate nodes) during test time.
Apply dropout both during forward and backward propagation.
During training time, divide each dropout layer by keep_prob to keep the same expected value for the activations. For example, if keep_prob is 0.5, then we will on average shut down half the nodes, so the output will be scaled by 0.5 since only the remaining half are contributing to the solution. Dividing by 0.5 is equivalent to multiplying by 2. Hence, the output now has the same expected value. You can check that this works even when keep_prob is other values than 0.5.
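
A minimal sketch of inverted dropout for one layer. The function names and the all-ones activations are assumptions chosen so the effect on the expected value is easy to see.

```python
import numpy as np

def dropout_forward(A, keep_prob, rng):
    mask = rng.random(A.shape) < keep_prob     # True where the unit is kept
    return A * mask / keep_prob, mask          # divide by keep_prob to preserve the expected value

def dropout_backward(dA, mask, keep_prob):
    return dA * mask / keep_prob               # same mask and same scaling as the forward pass

rng = np.random.default_rng(1)
A = np.ones((50, 2000))
A_drop, mask = dropout_forward(A, keep_prob=0.8, rng=rng)
dA_drop = dropout_backward(np.ones_like(A), mask, keep_prob=0.8)
```

At test time neither function is called; the activations are used as they are.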


Assignments: Gradient Checking


Small exercise to check and verify the gradient calculation.

What you should remember from this notebook:
Gradient checking verifies closeness between the gradients from backpropagation and the numerical approximation of the gradient (computed using forward propagation).
Gradient checking is slow, so we don't run it in every iteration of training. You would usually run it only to make sure your code is correct, then turn it off and use backprop for the actual learning process.
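
A minimal sketch of gradient checking with a two-sided difference, assuming the parameters are flattened into a 1-D vector `theta`. The toy function and the deliberately wrong gradient are made up for the demo.

```python
import numpy as np

def gradient_check(f, grad_f, theta, eps=1e-7):
    """Relative difference between the analytic gradient and a numerical approximation."""
    num_grad = np.zeros_like(theta)
    for i in range(theta.size):
        plus, minus = theta.copy(), theta.copy()
        plus[i] += eps
        minus[i] -= eps
        num_grad[i] = (f(plus) - f(minus)) / (2 * eps)   # two-sided difference
    ana_grad = grad_f(theta)
    return np.linalg.norm(ana_grad - num_grad) / (np.linalg.norm(ana_grad) + np.linalg.norm(num_grad))

def f(t):
    return np.sum(t ** 3)

def good_grad(t):
    return 3 * t ** 2

def bad_grad(t):
    return 2 * t ** 2          # deliberately wrong, to show what a failed check looks like

theta = np.array([1.0, -2.0, 0.5])
diff_good = gradient_check(f, good_grad, theta)   # tiny: implementation looks right
diff_bad = gradient_check(f, bad_grad, theta)     # large: there is a bug
```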

Week 2: Optimization Algorithms


Used to speed up neural networks and make them practical
Mini-batch gradient descent

The term epochs
The loss function does not necessarily decrease at every iteration, as it does with normal (batch) gradient descent
Choosing the batch size- The extreme case of stochastic gradient descent vs batch gradient descent
What is a small batch size? How to choose a batch size?


Exponentially weighted averages

Approximately how many data points are taken into account, with respect to the value of beta? (Roughly 1 / (1 - beta).)
How to compute?
Bias correction!- Important during the initial phase of learning
Mismatched train/test distribution- maybe training set comes from cat pictures on web pages, however you test on low resolution pics uploaded by people
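
A minimal sketch of an exponentially weighted average with optional bias correction, as discussed in the bullets on computing the average. The function name `ewa` and the constant input are assumptions for illustration.

```python
def ewa(values, beta=0.9, bias_correction=True):
    """Exponentially weighted average of a sequence, one output per input."""
    v, out = 0.0, []
    for t, x in enumerate(values, start=1):
        v = beta * v + (1 - beta) * x
        # Without correction the first outputs are biased towards 0 because v starts at 0.
        out.append(v / (1 - beta ** t) if bias_correction else v)
    return out

raw = ewa([10.0] * 5, bias_correction=False)
corrected = ewa([10.0] * 5, bias_correction=True)
```

On a constant input of 10, `raw` starts near 1 and creeps upwards, while `corrected` sits at 10 from the first step.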


Gradient Descent with Momentum

A new hyperparameter, Beta


Optimization Algorithms

RMSProp
Adam (Adaptive Moment Estimation) - RMSProp + Gradient Descent with momentum
* Learning Rate Decay
Why to decay learning rate(alpha)?
Decay Formulas in terms of epochs- Exponential Decay
* The problem of local optima!
Saddle points, plateaus
In a high dimensional space, it is very unlikely to get stuck in a local optimum (Why? REMEMBER HDP!)
Solving the problem through better initialization - by constraining the mean and the variance
Numerical approximation of gradients- Two side difference vs one side difference


Project: Trying out different optimization algorithms, mini-batch gradient descent


Stochastic Gradient Descent vs Batch Gradient Descent
Shuffling and partitioning to get batches, the size of the batch (power of 2)
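
A minimal sketch of the shuffle-then-partition step, assuming examples are stored as columns. The function name `random_mini_batches` follows the assignment's naming, but this body and the toy data are illustrative.

```python
import numpy as np

def random_mini_batches(X, Y, batch_size=64, seed=0):
    """X: (n_x, m), Y: (1, m). Returns a list of (X_batch, Y_batch) tuples."""
    m = X.shape[1]
    perm = np.random.default_rng(seed).permutation(m)
    X_shuf, Y_shuf = X[:, perm], Y[:, perm]          # same permutation keeps X and Y aligned
    return [(X_shuf[:, k:k + batch_size], Y_shuf[:, k:k + batch_size])
            for k in range(0, m, batch_size)]        # the last batch may be smaller

X = np.arange(2 * 150, dtype=float).reshape(2, 150)
Y = np.arange(150).reshape(1, 150)
batches = random_mini_batches(X, Y, batch_size=64)
```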

The larger the momentum β is, the smoother the update because the more we take the past gradients into
account. But if β is too big, it could also smooth out the updates too much.


Implement Adam yourself, implement correction formulas for s and v params.
Observe the loss decay with/without momentum- Adam is VERY GOOD!
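
A minimal sketch of one Adam update with the bias-correction formulas for `v` and `s`, assuming the standard default hyperparameters. The function name `adam_step` and the quadratic toy objective are assumptions for the demo.

```python
import numpy as np

def adam_step(theta, grad, v, s, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step counter."""
    v = beta1 * v + (1 - beta1) * grad            # momentum part: average of gradients
    s = beta2 * s + (1 - beta2) * grad ** 2       # RMSProp part: average of squared gradients
    v_hat = v / (1 - beta1 ** t)                  # bias corrections
    s_hat = s / (1 - beta2 ** t)
    theta = theta - lr * v_hat / (np.sqrt(s_hat) + eps)
    return theta, v, s

# Minimize f(theta) = sum(theta ** 2), whose gradient is 2 * theta.
theta = np.array([5.0, -3.0])
v, s = np.zeros_like(theta), np.zeros_like(theta)
for t in range(1, 2001):
    theta, v, s = adam_step(theta, 2 * theta, v, s, t)
```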

Momentum usually helps, but given the small learning rate and the simplistic dataset, its impact is almost negligible. Also, the huge oscillations you see in the cost come from the fact that some minibatches are more difficult than others for the optimization algorithm.
Adam on the other hand, clearly outperforms mini-batch gradient descent and Momentum. If you run the model for more epochs on this simple dataset, all three methods will lead to very good results. However, you've seen that Adam converges a lot faster.

Week 3: Hyperparameter Tuning, Batch Normalization, Multi-class classification


A shit ton of hyperparameters! Rank them by importance.

Instead of a grid search for best hyperparameters, do a random search because you do not know what the most important hyperparameter is, and want to try out a lot more values.
Picking an appropriate scale is important - Eg. a log scale for the learning rate.
Hyperparameter tuning in practice: Babysitting one model (Panda) vs training multiple models in parallel (Caviar). Depends on how much computational resources you have.
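
A minimal sketch of random search on a log scale for the learning rate, assuming a search range of 1e-4 to 1. The range and sample count are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample the exponent uniformly, not the learning rate itself,
# so each decade (1e-4..1e-3, 1e-3..1e-2, ...) gets about the same number of trials.
r = rng.uniform(-4, 0, size=1000)
learning_rates = 10.0 ** r
```

Sampling uniformly on a linear scale instead would put about 90% of the trials between 0.1 and 1.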


Batch Normalization

Also normalizing the activations of the hidden units. Implementing this with more parameters to tune the mean and variance of your hidden layer activations.
Fitting batch norm into a deep neural network. Convert Z to a normalized Z, then apply activation. Additional parameters are added to apply normalization at every layer.
Working with mini batches.
Why does Batch Norm Work? IMPORTANT VIDEO!- Eg. if you have a model that detects cats vs non cats on black cats; and now you want to use the same model for colored cats. Covariate shift. Even though the input values change, the mean and variance remains the same. It limits the amount by which the earlier layers' outputs change. Allows each layer to learn by itself independently of the earlier layers.
Also has a slight regularization effect.
Batch norm at test time. Estimate the mu and sigma-squared by estimating exponentially weighted averages across all batches.
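
A minimal sketch of the batch norm forward step on one mini-batch (normalize Z, then scale and shift with the learnable `gamma` and `beta`). The function name, shapes and random data are assumptions for illustration.

```python
import numpy as np

def batchnorm_forward(Z, gamma, beta, eps=1e-8):
    """Z: (n_units, m) pre-activations of one layer for one mini-batch."""
    mu = Z.mean(axis=1, keepdims=True)
    var = Z.var(axis=1, keepdims=True)
    Z_norm = (Z - mu) / np.sqrt(var + eps)       # mean 0, variance 1 per hidden unit
    return gamma * Z_norm + beta, mu, var        # gamma and beta let the network choose another mean/variance

rng = np.random.default_rng(0)
Z = rng.normal(loc=5.0, scale=3.0, size=(4, 256))
gamma, beta = np.ones((4, 1)), np.zeros((4, 1))
Z_tilde, mu, var = batchnorm_forward(Z, gamma, beta)
```

The returned `mu` and `var` are what would be fed into the exponentially weighted averages used at test time.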


Multi Class Classification

Softmax regression, a generalization of logistic regression. Suppose we have n classes; the output layer is then an n × 1 layer, giving an n × 1 vector with the probability that the input belongs to each of the n classes.
Training a softmax classifier - hardmax vs softmax. Defining the loss function.
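
A minimal sketch of softmax, hardmax and the cross-entropy loss for a single example with 4 classes. The function names and the logits are illustrative assumptions.

```python
import numpy as np

def softmax(Z):
    """Z: (n_classes, m). Each column of the result sums to 1."""
    Z_shift = Z - Z.max(axis=0, keepdims=True)    # subtracting the max does not change the result
    expZ = np.exp(Z_shift)
    return expZ / expZ.sum(axis=0, keepdims=True)

def cross_entropy(A, Y):
    """Average of -sum(y * log(a)) over the m examples; Y is one-hot."""
    return -np.sum(Y * np.log(A)) / Y.shape[1]

Z = np.array([[5.0], [2.0], [-1.0], [3.0]])
A = softmax(Z)                                                   # soft: probabilities
hardmax = (A == A.max(axis=0, keepdims=True)).astype(float)      # hard: 1 for the largest, 0 elsewhere
Y = np.array([[1.0], [0.0], [0.0], [0.0]])
loss = cross_entropy(A, Y)
```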


Introduction to Programming Frameworks

How to choose your framework? Ease of programming, running speed and truly open.
Using TensorFlow


Assignments: Tensor Flow to detect signs: multivariate classification


remember to initialize your variables, create a session and run the operations inside the session.


Placeholders- just define the shape now, value later. Defining simple operations, and getting results using session.run()
Pass placeholder values using feed_dict
One Hot Encoding
SIGNS dataset: Normalize and flatten the image dataset. Why use 'None' in the placeholder? Using Xavier initialization to init the parameters.
Running on mini batches, with optimizer defined by TensorFlow.
Think about the session as a block of code to train the model. Each time you run the session on a minibatch, it trains the parameters. In total you have run the session a large number of times (1500 epochs) until you obtained well trained parameters.

What you should remember:
Tensorflow is a programming framework used in deep learning
The two main object classes in tensorflow are Tensors and Operators.
When you code in tensorflow you have to take the following steps:
Create a graph containing Tensors (Variables, Placeholders ...) and Operations (tf.matmul, tf.add, ...)
Create a session
Initialize the session
Run the session to execute the graph
You can execute the graph multiple times as you've seen in model()
The backpropagation and optimization is automatically done when running the session on the "optimizer" object.

Course 3: Structuring your machine learning projects

Week 1: Introduction to ML Strategy


Strategies to analyze a problem and coming up with ideas that one should try to improve the performance.


Orthogonalization - Adjust one knob to adjust one parameter, to solve one problem - The TV knob analogy and the car analogy.

Chain of assumptions in Machine Learning and different knobs to say improve performance on train/dev set.
Andrew Ng does not recommend early stopping, as it is a knob that affects multiple things at once.


Setting up your goal

Set a SINGLE NUMBER for metrics- precision and recall- but these are two numbers, and you ideally need one number. ENTER THE F1 score!
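
A minimal sketch of collapsing precision and recall into one number with the F1 score (their harmonic mean). The two classifiers and their scores are made-up examples.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

f1_a = f1_score(0.95, 0.90)   # classifier A
f1_b = f1_score(0.98, 0.85)   # classifier B: better precision, worse recall
```

With a single number, picking between A and B becomes a direct comparison of `f1_a` and `f1_b`.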


Satisficing and optimizing metrics - metrics that satisfy, for example time

As a general rule of thumb, out of N metrics, pick one to be optimizing and (N-1) to be satisficing.


How to set up your test set and dev sets

Dev and test set should come from the same distribution.
Size of the test and dev set.
Test set should be good enough to give you confidence.


When to change your metrics/dev set/test set

Halfway through you solving the problem, metrics might change based on goals. Defining a new evaluation metric, to tell which algorithm is better for your problem.
Orthogonalization- Defining the metric is one step, and doing well on it is another step.
If a metric that says you are doing well on your dev/test, but does not reflect well on your application; CHANGE THE METRIC!


Comparing to human level performance

Bayes optimal error- the best theoretically possible error, there is no way to surpass this in terms of performance.
Progress is usually fast while your algorithm is still doing worse than human level performance, and slows down once it surpasses it.


Avoidable Bias

Think of human error as an estimate of Bayes error, as a baseline (esp in computer vision tasks). The gap between the training error and this baseline is the avoidable bias; keep working on the training error until that gap is small.


Understanding human level performance

How to define it? The medical image classification example. Reduce bias or reduce variance?
Difference between human baseline and training error = measure of (avoidable) bias; and difference between training error and dev error = measure of variance


 * Surpassing human level error
- Sometimes ambiguity as to whether improve bias, or improve variance.
- Examples where ML kicks humans' ass : online advertising, product recos, logistics, loan approvals. Humans are great at computer vision tasks. Some speech recognition systems can surpass humans.


Improving your model's performance

Assumptions - fit training set well (low avoidable bias)
Generalizes pretty well to dev/test set.


Error analysis

Manually examine the mistakes the model is making; manually making some notes. Find mis-predicted labels, and prioritize based on where you can improve the most.


Mislabel examples

What to do with incorrectly labeled training examples. Deep learning algos are robust to random errors, but not to systematic errors
For incorrectly labeled dev/test set examples, have a column called 'incorrectly labeled'; and analyze if it makes sense to spend time fixing the incorrect labels. Depends on how much of the overall error they contribute.
Correcting labels: Apply same principles to both dev and test sets


Iterating on your algorithm

Build a first system quickly, and then analyse what to do next- Iterate.
Do not overthink initially, and just get a quick and dirty first solution going.


Mismatched training and dev/test set

The cat example- 200,000 images from web crawling, 10,000 from the data from mobile cameras (low quality)
Set your dev/test sets to come from the distribution you want your application to do well on.
Analyzing bias and variance on different distributions of training and dev set. The concept of training-dev sets!
Data mismatch error - The new problem of data mismatch! How to solve the data mismatch problem. Manually analyzing the difference. Eg.: artificial data synthesis- add random noise to clean data


Transfer Learning

Use a model used to identify cats, and apply to identifying X-ray scans. pre training and fine tuning
When does transfer learning make sense? When you have pre-learnt on a lot of data and don't have much data for the new problem.


Multi-task learning

Do multiple tasks at one time; demonstrated with the autonomous driving car example. Detecting if an image has a stop sign, has a human, has another car etc.
Should be done on tasks that share low-level features. Works better if the amount of data is similar, per task. Knowledge of one task should help all the other tasks.


End to end deep learning

Replacing a whole pipeline of feature engineering, extracting features with one neural network.
Needs a lot more data than traditional pipelines.
The turnstile problem example- breaking into steps since you have much more data from the steps than the end-to-end problem.
Can simplify the problem, but does not always work. Think about the amount of data!


Whether to use end-to-end deep learning

PROS: Just lets the data speak; rather than human perceptions. Don't need to hand-design the features.
CONS: Need a LOT of data. Sometimes not available for the entire step. Excludes hand-designed components/features.


Course 4: Convolutional Neural Networks

Week 1: Intro to CNNs


Convolution operation

How to detect edges, defining the filter/kernel. How to do convolution. How does edge detection work really, with convolution?
Horizontal and vertical edge detection. Dark to white edges and white to dark edges. Sobel filter, Scharr filter. Can possibly learn the coefficients of your kernel through deep learning; rather than hand-pick a kernel.


Padding

Why is it needed? Because the image shrinks! Valid convolutions and same convolutions.
How to calculate the padding size?
Rarely even dimension-ed kernels are used.


Strided Convolutions

General formula for dimensions of the output image; for an input image of size n by n, kernel of size f * f, padding = p and stride = s; the dimension of the output image is floor( ((n+2p-f)/s) + 1 )
The filter must lie fully in the image when convolving
Cross correlation vs convolution: We are not doing the mirroring step (as done in maths). What we are essentially doing is cross-correlation, and calling it convolution.
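
A minimal sketch of the output-size formula floor((n + 2p - f) / s) + 1. The helper names `conv_output_size` and `same_padding` are hypothetical.

```python
def conv_output_size(n, f, p=0, s=1):
    """Output height/width for an n x n input, f x f kernel, padding p and stride s."""
    return (n + 2 * p - f) // s + 1      # floor division: the filter must lie fully inside the image

def same_padding(f):
    """Padding that keeps the size unchanged for stride 1 and an odd kernel size f."""
    return (f - 1) // 2

valid_out = conv_output_size(6, 3)                       # "valid" convolution: 6 -> 4
strided_out = conv_output_size(7, 3, s=2)                # stride 2: 7 -> 3
same_out = conv_output_size(6, 3, p=same_padding(3))     # "same" convolution: 6 -> 6
```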


Convolution over volumes- 3D Images

Number of channels in kernel must be equal to the number of channels in the image.
Finding multiple types of edges using multiple convolutions with different filters, each suited to finding a different kind of edge. So the output dimension becomes (n-f+1) * (n-f+1) * (number of convolution filters used)


Building one layer of a CNN

Add bias and non-linearity to the convolution result; analogy with the standard forward propagation
Calculate the total number of params in a layer- the coefficients of each filter and DO NOT FORGET THE BIAS!
Naming conventions- formula in terms of this layer's filters and previous layer's inputs
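
A minimal sketch of the parameter count for one convolutional layer: each filter has f * f * (channels of the previous layer) weights plus one bias. The helper name is hypothetical.

```python
def conv_layer_params(f, n_c_prev, n_filters):
    """Number of learnable parameters in a conv layer with f x f filters."""
    weights_per_filter = f * f * n_c_prev
    return (weights_per_filter + 1) * n_filters     # +1 is the bias of each filter

n_params = conv_layer_params(f=3, n_c_prev=3, n_filters=10)   # ten 3 x 3 x 3 filters
```

The count does not depend on the size of the input image, only on the filters.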


A simple example

The depth keeps increasing, while you reduce the height and width at each layer (Remember Udacity!). Andrew calls depth as the number of channels


Other layers: Pooling and Fully Connected

Pooling layer - eg. find the max in a sub-region of an image (Max Pooling). There is nothing to learn, has a fixed set of parameters (stride and size of kernel)
Average pooling layer; max pooling is used more than global average pooling
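
A minimal sketch of max pooling on a single-channel image, assuming a 2 x 2 window with stride 2. The function name and the 4 x 4 input are illustrative.

```python
import numpy as np

def max_pool(x, f=2, s=2):
    """Max pooling over a 2-D array with window size f and stride s (nothing to learn)."""
    n_h = (x.shape[0] - f) // s + 1
    n_w = (x.shape[1] - f) // s + 1
    out = np.zeros((n_h, n_w))
    for i in range(n_h):
        for j in range(n_w):
            out[i, j] = x[i * s:i * s + f, j * s:j * s + f].max()
    return out

x = np.array([[1, 3, 2, 1],
              [2, 9, 1, 1],
              [1, 3, 2, 3],
              [5, 6, 1, 2]], dtype=float)
pooled = max_pool(x)
```

Replacing `.max()` with `.mean()` would turn this into average pooling.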


Combining all of these together and one example based on LeNet-5

Remember pooling layer has no parameters. As a convention, count only the layers that have weights (parameters)
At the end, flatten and feed into the fully connected layer


Why convolution? - Great video!

Reduces the number of param much more than fully connected layers - Parameter Sharing- a feature detector (such as edge detector) useful in one part of the image is probably useful in another part of the image.
Sparse Connections - one pixel is only connected to its neighbors, and not to everyone else (and does not need to be!)


Project: Step by step convolution model


Implement a CNN yourself

Implement padding, convolution, forward pass etc from scratch
Nice tidy implementation of a single layer!
Optional exercises on back propagation


Implement ConvNet using TensorFlow

Initialize placeholders, weights etc.
Remember the tensor flow sessions! How to run etc.


Week 2: Looking at case studies


Learn from others, why reinvent the wheel?

LeNet -5 : the architecture; used for digit recognition.
AlexNet: bigger, more parameters; better than LeNet as it used ReLU. Uses a layer called local response normalization. Has a lot of hyperparameters
VGG 16: A simpler network; although large. Has 16 layers with weights.
ResNets: Residual block- applying a shortcut as opposed to the main path. Skip connections
Plain network vs residual block networks. In practice, the deeper a plain network gets, the more the training error can go up.


Why do Resnets work so well?

If you make a plain network deeper, it can hurt training error on the training set. Not with ResNets though.
Because ResNets can learn the identity function much more easily. Therefore adding extra layers does not hurt performance, and might even help performance!
Residual layers easily learn the identity function.
Uses a lot of same convolution; as it preserves the dimensions.


A 1 by 1 convolution

What is it? Convolving with a 1 by 1 by d filter. Why is it useful? At each position it takes a weighted sum across the depth and then applies a ReLU activation.
It is like having a fully connected network with depth. Also called Network in Network architecture.
Helps you shrink the depth/ the number of channels!


Inception Network Motivation and Inception Networks

Why not take ALL filters, and ALL types of layers. Just stack all the various outputs (Keep the same convolution)
Do them all; but huge computation cost!
Use a 1 by 1 convolution to reduce depth (volume) and reduce the amount of multiplications (reduce the computation cost)
Use it if you want to TRY THEM ALL!
Padding with max pooling layer, a weird thing...
How to combine: Just concatenate the blocks along the channel (depth)! Height and width are kept the same.
Inception network has a lot of inception blocks. Also has a side branch layer to make predictions; tends to have a regularizing effect.
Inception network's name actually comes from the movie inception.


Practical Advice on Using other networks

How to use open source implementations. A common way to go about in computer vision is to take a known network, and use transfer learning
Use something that has already been done before! Rather than starting from scratch, why reinvent the wheel. Freeze the earlier layers**; pre-compute the earlier layers' activation and just apply softmax on that.
If you have a larger training set, you freeze fewer earlier layers. The more data you have, the more layers you can train.
Data Augmentation for computer vision. Just can't get enough of data for computer vision.

Techniques used are random cropping, mirroring.
Color shifting.
Advanced- PCA color augmentation
Implement distortions during training. Have a thread for distortions, other for training. Distortion can also have hyperparameters.


Computer vision and deep learning

ML problems fall in the spectrum from 'little data' to 'lots of data'. Lots of data means simpler algorithms, less hand-engineering.
Computer vision has relied on hand engineering a lot.
Tips for doing well on benchmarks

Ensembling - average the labels for multiple Neural Networks
Multi crop at test time, the 10 crop technique


Optional keras exercise


Keras is a higher level of abstraction than tensor flow. The happy faces project.
To remember:

What we would like you to remember from this assignment:
Keras is a tool we recommend for rapid prototyping. It allows you to quickly try out different model architectures. Are there any applications of deep learning to your daily life that you'd like to implement using Keras?
Remember how to code a model in Keras and the four steps leading to the evaluation of your model on the test set. Create->Compile->Fit/Train->Evaluate/Test.

ResNets


Deep networks can learn complex functions, however not always the best choice. Remember vanishing gradients!
you can stack on additional ResNet blocks with little risk of harming training set performance. (There is also some evidence that the ease of learning an identity function--even more than skip connections helping with vanishing gradients--accounts for ResNets' remarkable performance.)
ResNets identity block; convolutional block
Using the blocks above to build a DEEP ResNet! The layer naming made a mess of it!

TO REMEMBER:
What you should remember:
Very deep "plain" networks don't work in practice because they are hard to train due to vanishing gradients.
The skip-connections help to address the Vanishing Gradient problem. They also make it easy for a ResNet block to learn an identity function.
There are two main type of blocks: The identity block and the convolutional block.
Very deep Residual Networks are built by stacking these blocks together.

Week 3: Object Detection Algorithms


Object localization- Not only classifying an image with an object, but also localizing (bounding box) the object. Object detection- detect the objects in an image that has many objects.


Have the neural network output the bounding box in the form of four numbers! The output label is now a vector, with values being (is there an object); four values corresponding to the bounding box and also the type of the object. If there is no object, we only care about the fact that y_object_exists = 0; all the other values are don't-cares.


Calculating the loss function based on the two cases: (1) Object Exists (2) Does not exists


Landmark detection- just give the (X,Y) coordinates- a landmark is one point with a (x,y) coordinates.


Sliding windows for object detection - take cropped inputs (of cars for example) and train a NN to output 1/0. In sliding window, you slide/stride a window across the whole image and then have it classify for every such section of the image.

Then do this for a bigger region...rinse..repeat.
HUGE COMPUTATIONAL COST. And the running time of a ConvNet on every window adds to the problem of computational cost.


An efficient implementation using convolution


Replace the fully connected layer with a convolution layer. Implement fully connected layers as convolutional layers.


The benefit of this is that a lot of computations get shared between sliding windows. Instead of running forward propagation independently for each window, you can run it for all of them together.


WARNING!! The bounding boxes are not very accurate in this implementation: YOLO STEPS IN!


YOLO algorithm

Apply the localization algorithm to nine grid cells in an image; assign every grid a vector label. Total volume = Number of grids multiplied by the target vector for each of the boxes. Could use finer/coarser grids.
Careful! An object might appear in more than one grid cell, will address this later.
It is only a single computation, is an efficient algorithm and runs fast!


How to tell if your object detection algorithm is working well

Intersection over union (IoU) function to calculate the efficacy of the bounding boxes. If IoU > 0.5; then it is considered good.
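
A minimal sketch of the IoU computation, assuming boxes are given as corner coordinates (x1, y1, x2, y2). The box format and the example boxes are assumptions for illustration.

```python
def iou(box1, box2):
    """Intersection over union of two boxes (x1, y1, x2, y2) with x1 < x2 and y1 < y2."""
    xi1, yi1 = max(box1[0], box2[0]), max(box1[1], box2[1])
    xi2, yi2 = min(box1[2], box2[2]), min(box1[3], box2[3])
    inter = max(xi2 - xi1, 0) * max(yi2 - yi1, 0)     # 0 if the boxes do not overlap
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    return inter / (area1 + area2 - inter)

iou_overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))     # intersection 1, union 7
iou_disjoint = iou((0, 0, 1, 1), (2, 2, 3, 3))    # no overlap
iou_same = iou((0, 0, 2, 2), (0, 0, 2, 2))        # identical boxes
```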


Non-max suppression

The problem of multiple detections for the same object.
All the ones with high overlap will get suppressed.
Non-max suppression algorithm.

Repeatedly pick boxes with high object probability, and eliminate boxes with high IoU with this one.
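
A minimal sketch of that loop for a single class, assuming boxes as (x1, y1, x2, y2) corners and thresholds of 0.6 (score) and 0.5 (IoU). The function names, thresholds and example boxes are assumptions for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes (x1, y1, x2, y2)."""
    inter_w = max(min(a[2], b[2]) - max(a[0], b[0]), 0)
    inter_h = max(min(a[3], b[3]) - max(a[1], b[1]), 0)
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union

def non_max_suppression(boxes, scores, score_threshold=0.6, iou_threshold=0.5):
    """Return the indices of the boxes that survive, best score first."""
    # Discard low-probability boxes, then process the rest from best to worst.
    candidates = [i for i in range(len(boxes)) if scores[i] >= score_threshold]
    candidates.sort(key=lambda i: scores[i], reverse=True)
    keep = []
    while candidates:
        best = candidates.pop(0)
        keep.append(best)
        # Suppress every remaining box that overlaps too much with the one just kept.
        candidates = [i for i in candidates if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (0, 0, 9, 9)]
scores = [0.9, 0.8, 0.7, 0.3]
kept = non_max_suppression(boxes, scores)
```

Here box 1 is suppressed by box 0 (IoU of about 0.68), box 3 is dropped for its low score, and box 2 survives because it overlaps nothing.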


Anchor Boxes

Use anchor boxes of different shapes, so that one grid cell can detect more than one object. Each object is now assigned to the (grid cell, anchor box) pair that has the highest IoU with the object.
Helps your algorithm specialize better.
Can use k-means to cluster into types of anchor boxes! (neat!)


The generalized YOLO algorithm: combining anchor boxes and non-max suppression into the algorithm


Region Proposals: R-CNN - propose regions via segmentation. Different algorithms to propose regions.


Assignment: Autonomous Driving- Car Detection using YOLO


Need to collect images: Done via a car mounted camera. YOLO - "you only look once": just one look!
YOLO: If the center/midpoint of an object falls into a grid cell, that grid cell is responsible for detecting that object.
Find box scores; apply max supression.

Summary for YOLO:
Input image (608, 608, 3)
The input image goes through a CNN, resulting in a (19,19,5,85) dimensional output.
After flattening the last two dimensions, the output is a volume of shape (19, 19, 425):
Each cell in a 19x19 grid over the input image gives 425 numbers.
425 = 5 x 85 because each cell contains predictions for 5 boxes, corresponding to 5 anchor boxes, as seen in lecture.
85 = 5 + 80 where 5 is because (pc, bx, by, bh, bw) has 5 numbers, and 80 is the number of classes we'd like to detect
You then select only few boxes based on:
Score-thresholding: throw away boxes that have detected a class with a score less than the threshold
Non-max suppression: Compute the Intersection over Union and avoid selecting overlapping boxes
This gives you YOLO's final output.
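The shape bookkeeping and the score-thresholding step in the summary above can be sketched with NumPy. The random array only stands in for a real CNN output, and the 0.6 threshold is an assumed value for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for the CNN output: 19x19 grid, 5 anchor boxes, 85 numbers each.
output = rng.random((19, 19, 5, 85))

# Flattening the last two dimensions gives the (19, 19, 425) volume.
flat = output.reshape(19, 19, 425)

pc = output[..., 0:1]            # objectness, shape (19, 19, 5, 1)
box_coords = output[..., 1:5]    # (bx, by, bh, bw), shape (19, 19, 5, 4)
class_probs = output[..., 5:]    # 80 class probabilities

# Score of each class for each box = objectness * class probability.
box_scores = pc * class_probs
best_class = box_scores.argmax(axis=-1)
best_score = box_scores.max(axis=-1)

# Score-thresholding: keep only boxes whose best class score is high enough.
threshold = 0.6  # assumed value
mask = best_score >= threshold
kept_boxes = box_coords[mask]
kept_classes = best_class[mask]
```

Non-max suppression would then be applied to `kept_boxes` to remove overlapping detections.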

What you should remember:
YOLO is a state-of-the-art object detection model that is fast and accurate
It runs an input image through a CNN which outputs a 19x19x5x85 dimensional volume.
The encoding can be seen as a grid where each of the 19x19 cells contains information about 5 boxes.
You filter through all the boxes using non-max suppression. Specifically:
Score thresholding on the probability of detecting a class to keep only accurate (high probability) boxes
Intersection over Union (IoU) thresholding to eliminate overlapping boxes
Because training a YOLO model from randomly initialized weights is non-trivial and requires a large dataset as well as a lot of computation, we used previously trained model parameters in this exercise. If you wish, you can also try fine-tuning the YOLO model with your own dataset, though this would be a fairly non-trivial exercise.

Week 4: Face Recognition


Face verification vs face recognition


One shot learning- you need to perform well with just one image of the person. Learn from just one example. We compute a similarity function for images.

Use a siamese network architecture.

Learn a function such that the distance between encodings of the same person's images is small, and the distance between encodings of different persons' images is large.


*MISSED SOME NOTES HERE


Neural Style Transfer

Content cost function - choose a layer (neither too shallow nor too deep), and then compare the activations caused by the two images. If the activations are similar, it implies that the images have similar content.
Style cost - how correlated are the activations across different channels? How often do high level features such as textures occur together?

Choose a layer and see how correlated are the activations between different channels.
Degree of correlation is a measure of style; how similar is the style of the generated image with the style image.
Generate a style matrix; a (number of channels) * (number of channels) matrix; see how correlated different channels are. Make pairs of every channel with the other to get this matrix's values.
Compute the style matrix for both the images- cost function is the norm (difference) between the two style matrices.
Then combine the cost function across all layers
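A sketch of the style matrix (Gram matrix) and the per-layer style cost described above, assuming activations of shape `(height, width, channels)`. The normalization constant is one common choice and is an assumption here, as are the function names.

```python
import numpy as np


def gram_matrix(activations):
    """Style matrix of shape (channels, channels) for one layer."""
    h, w, c = activations.shape
    unrolled = activations.reshape(h * w, c)  # one row per spatial position
    # Entry (i, j) sums the product of channel i and channel j over positions.
    return unrolled.T @ unrolled


def layer_style_cost(a_style, a_generated):
    """Squared difference between the two style matrices, normalized."""
    h, w, c = a_style.shape
    diff = gram_matrix(a_style) - gram_matrix(a_generated)
    return np.sum(diff ** 2) / (4 * (c ** 2) * ((h * w) ** 2))
```

The overall style cost would be a weighted sum of `layer_style_cost` over the chosen layers.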


Generalization to 1D and 3D data.

Convolution for 1D data (e.g. a signal over time).
3 Dimensional Data- convolve with a 3D filter


Assignment: Neural Style Transfer Art Generation


Most of the algorithms you've studied optimize a cost function to get a set of parameter values. In Neural Style Transfer, you'll optimize a cost function to get pixel values!


We would like the "generated" image G to have similar content as the input image C. Suppose you have chosen some layer's activations to represent the content of an image. In practice, you'll get the most visually pleasing results if you choose a layer in the middle of the network--neither too shallow nor too deep.


What you should remember about computing the cost function


What you should remember:
The content cost takes a hidden layer activation of the neural network, and measures how different a(C) and a(G) are.
When we minimize the content cost later, this will help make sure G has similar content as C.


Computing the style function

Calculating the Gram matrix for a single layer
Then merging for multiple layers; using lambdas


What you should remember:
The style of an image can be represented using the Gram matrix of a hidden layer's activations. However, we get even better results combining this representation from multiple different layers. This is in contrast to the content representation, where usually using just a single hidden layer is sufficient.
Minimizing the style cost will cause the image G to follow the style of the image S.

What you should remember:
The total cost is a linear combination of the content cost J_content(C, G) and the style cost J_style(S, G)
α and β are hyperparameters that control the relative weighting between content and style


CONCLUSION

What you should remember:
Neural Style Transfer is an algorithm that given a content image C and a style image S can generate an artistic image
It uses representations (hidden layer activations) based on a pretrained ConvNet.
The content cost function is computed using one hidden layer's activations.
The style cost function for one layer is computed using the Gram matrix of that layer's activations. The overall style cost function is obtained using several hidden layers.
Optimizing the total cost function results in synthesizing new images.

Assignment: Face Recognition for the happy house

Face Verification - "is this the claimed person?". For example, at some airports, you can pass through customs by letting a system scan your passport and then verifying that you (the person carrying the passport) are the correct person. A mobile phone that unlocks using your face is also using face verification. This is a 1:1 matching problem.
Face Recognition - "who is this person?". For example, the video lecture showed a face recognition video (https://www.youtube.com/watch?v=wr4rx0Spihs) of Baidu employees entering the office without needing to otherwise identify themselves. This is a 1:K matching problem.


Implement FaceNet

Encode an image into a 128 dimensional vector
implement the triplet loss function
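A minimal NumPy sketch of the triplet loss, assuming each argument is a batch of encodings of shape `(batch, dims)`. The margin of 0.2 and the function name are assumptions for illustration.

```python
import numpy as np


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Sum over the batch of max(d(A, P) - d(A, N) + margin, 0)."""
    # Squared distances between the encodings.
    pos_dist = np.sum((anchor - positive) ** 2, axis=-1)
    neg_dist = np.sum((anchor - negative) ** 2, axis=-1)
    # The loss is zero once the negative is farther than the positive by the margin.
    return float(np.sum(np.maximum(pos_dist - neg_dist + margin, 0.0)))
```

The loss pushes the anchor towards images of the same person and away from images of other people.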


What you should remember:
Face verification solves an easier 1:1 matching problem; face recognition addresses a harder 1:K matching problem.
The triplet loss is an effective loss function for training a neural network to learn an encoding.
Course 5: Sequence Models

Week 1: Recurrent neural networks


Why sequence models are useful- speech recognition, translation, music generation etc.
Named Entity Recognition example

Given a sentence, find the words that correspond to names.
Talks about notations etc.- how to represent individual words- make a vocabulary/dictionary of all the words

One hot encoding the words


It is a supervised learning problem.


Why not use a standard neural network?!

Inputs and outputs can be different lengths- you can have sentences of different lengths (different words)
Does not share features learned across different positions. (Kinda similar to convolutional neural network)


What is a recurrent neural network?

You want things learnt in one part to be used in other parts..
Learning from one time step to the other, passing along the activation

Y3 comes not only from X3, but also from X2 and X1


The case for bidirectional recurrent network; versus a unidirectional neural network.
Explains forward propagation
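Forward propagation through a basic unidirectional RNN can be sketched as below. The parameter names (`Waa`, `Wax`, `Wya`, `ba`, `by`) and the tanh/softmax choices are assumptions for this sketch rather than something these notes pin down.

```python
import numpy as np


def softmax(z):
    e = np.exp(z - z.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)


def rnn_forward(xs, a0, Waa, Wax, Wya, ba, by):
    """Run the RNN over a list of column-vector inputs, one per time step."""
    a = a0
    ys = []
    for x in xs:
        # The new activation depends on the previous activation and the current input.
        a = np.tanh(Waa @ a + Wax @ x + ba)
        # The prediction at this step depends on everything seen so far through a.
        ys.append(softmax(Wya @ a + by))
    return ys, a
```

Because the same weights are reused at every step, the third output depends on the first and second inputs through the passed-along activation.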


Backpropagation through time

Loss defined for a single word
Compute the total loss by summing the loss per word in time


Different types of RNNs

Input length and output length can be different

Many to many RNNs
Sentiment classification- many to one RNNs
One to many RNNs - generate Music
Machine translation- many to many, but of different lengths!


Sequence generation and machine translation

'Pair' vs 'pear'
A language model (as used inside a speech recognition system) tells the probability of a sentence existing.
Tells the probability of a sequence of words existing
See the probability of a word existing in a particular position


Sample novel sequences

Keep sampling until you have hit EOS
Character level language model vs Word level language model

Don't have to worry about unknown words in a character-level model.
Character-level sequences are much longer, so they are more expensive to train!


Vanishing gradients

Hard to propagate information along the sentence: the farther away the word, the smaller its influence
For exploding gradients, use gradient clipping
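A small sketch of element-wise gradient clipping, assuming the gradients are stored in a dictionary of NumPy arrays (the dictionary layout and the function name are assumptions):

```python
import numpy as np


def clip_gradients(gradients, max_value):
    """Clip every gradient entry into the range [-max_value, max_value]."""
    return {name: np.clip(grad, -max_value, max_value)
            for name, grad in gradients.items()}
```

Clipping by the overall gradient norm is another common variant; this sketch only shows the element-wise form.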


Gated Recurrent units

To solve the problems of vanishing gradients
Memory cell, to preserve the information
Memorize the value such as singular/plural; and the gate (Gamma) to see if you need to update the value or not
Can use different bits to remember different things, such as plural/talking about food etc.


Long Short-Term Memory

LSTMs - Have three gates: update gate, forget gate and output gate
LSTM is historically the more common default choice; GRUs are simpler and cheaper to compute


Bidirectional RNNs

Take info from both earlier and later in the sequence
Has a backward recurrent layer, in addition to the forward recurrent layer


Deep RNNs

Stacking a single layer we have learnt so far one over the other.

Because of the temporal dimension, these are usually shallower than traditional neural networks.


Assignment : Building a recurrent neural network Step by Step


Describes how LSTM can be used to solve the vanishing gradients problem

Assignment: The Dinosaur problem


Clipping of gradients and why to do it

Assignment: Improvise a jazz solo


Similar to the dinosaur model, except in Keras

Here's what you should remember:
A sequence model can be used to generate musical values, which are then post-processed into midi music.
Fairly similar models can be used to generate dinosaur names or to generate music, with the major difference being the input fed to the model.
In Keras, sequence generation involves defining layers with shared weights, which are then repeated for the different time steps 1, …, Tx.

Week 2: Natural Language Processing & Word Embeddings


Introduction to word embeddings

How to represent words in a way that makes it easy to learn relations between them?
Featurized representation

Features such as Gender, 'Royal', Age etc.
Take a vector of features

Helps find words that are closely related
Eg. apple and orange are closer to each other than apple and 'man'
Visualizing word embeddings


Using word embeddings

Can analyze a lot of unlabeled text to decipher less common words

Download word embeddings learned from a large text corpus
Transfer embedding to a smaller training set
Continue to fine tune the word embeddings


Similar to face encoding


Properties of word embeddings

INTERESTING! How to find analogies, eg. if man is to woman, king is to what?
The difference would come up in the subtraction; a single property would stand out
Define the similarity function, we use cosine similarity
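The analogy search can be sketched with cosine similarity on a toy set of embeddings. The two-dimensional vectors below are made up for illustration (roughly "gender" and "royal" axes) and are not real learned embeddings.

```python
import numpy as np


def cosine_similarity(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def complete_analogy(a, b, c, embeddings):
    """Find the word w such that a is to b as c is to w."""
    direction = embeddings[b] - embeddings[a]
    best_word, best_sim = None, -np.inf
    for word, vec in embeddings.items():
        if word in (a, b, c):
            continue
        # Compare the a->b difference with the c->w difference.
        sim = cosine_similarity(direction, vec - embeddings[c])
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word


# Hypothetical toy embeddings: [gender, royal]
toy_embeddings = {
    "man": np.array([-1.0, 0.0]),
    "woman": np.array([1.0, 0.0]),
    "king": np.array([-1.0, 1.0]),
    "queen": np.array([1.0, 1.0]),
    "apple": np.array([0.0, -1.0]),
}
```

Here `complete_analogy("man", "woman", "king", toy_embeddings)` picks "queen", because the woman minus man difference points in the same direction as queen minus king.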


Embedding Matrix

A (number of words * dimensions) matrix


Learning Word Embeddings

Take all the embedded vectors and put it into a neural layer followed by a softmax activation

One hyperparameter is the history, i.e. how many previous words you use as context to predict the next word.


Word2Vec Model

Randomly pick the context word and the target word (within some window of the context word)
Hierarchical softmax classifier, like a tree that splits into groups such as (first 5000 words) etc.

In general more common words are at the top of the tree, and less common ones at the bottom
Helps in speeding up the algorithm


How to sample the context word

Don't take it uniformly, else you will always get words like a, then, the etc.


In general softmax is the blocking part, computationally expensive


Negative Sampling

Determine if two words are a context and target pair

Orange and juice are a pair, orange and king are not
Make a table of positive and negative examples; for every positive example, you have K negative examples
We don't train on all the words in the vocabulary at each step, only on K+1 of them (one positive and K negatives), based on the table above.
How to select the negative words, according to what distribution?


GloVe word vectors algo

Very Simple: Global Vectors for Word Representation
Count how many times two words appear in close proximity


Sentiment classification

The challenge is sometimes not having a huge training set.
Average the word vectors and feed to softmax

Use RNN for classification, a many to one architecture


Debiasing word embeddings - SJW stuff!

First find the direction that corresponds to the bias we are trying to solve (eg. Gender Bias)
Remove bias by projecting the word vectors onto the direction orthogonal to the bias we want to remove
Equalize pairs such as grandfather and grandmother; for example, the distance from babysitter to grandfather should equal the distance from babysitter to grandmother
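The neutralizing step can be sketched as removing the component of a word vector along the bias direction. The vector names `e` and `g` are assumptions; `g` would in practice be estimated from differences such as woman minus man.

```python
import numpy as np


def neutralize(e, g):
    """Remove from e its projection onto the bias direction g."""
    # Component of e that lies along g.
    e_bias = (e @ g) / (g @ g) * g
    # What remains is orthogonal to the bias direction.
    return e - e_bias
```

After this step the dot product between the neutralized vector and `g` is zero, so the word no longer leans towards either end of the bias axis.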


Assignment: Debiasing


Cosine similarity

Cosine similarity is a good way to compare similarity between pairs of word vectors. (Though L2 distance works too.)
For NLP applications, using a pre-trained set of word vectors from the internet is often a good way to get started.

Assignment: Emojify


Adding emojis to sentences based on emotion
Emojifier V2 using LSTMs in KERAS

What you should remember:
If you have an NLP task where the training set is small, using word embeddings can help your algorithm significantly. Word embeddings allow your model to work on words in the test set that may not even have appeared in your training set.
Training sequence models in Keras (and in most other deep learning frameworks) requires a few important details:
To use mini-batches, the sequences need to be padded so that all the examples in a mini-batch have the same length.
An Embedding() layer can be initialized with pretrained values. These values can be either fixed or trained further on your dataset. If however your labeled dataset is small, it's usually not worth trying to train a large pre-trained set of embeddings.
LSTM() has a flag called return_sequences to decide if you would like to return every hidden state or only the last one.
You can use Dropout() right after LSTM() to regularize your network.

Week 3: Sequence to sequence architectures


Sequence to sequence models

Language translation for example
Image captioning, caption an image


Picking the most likely model

Machine Translation Model

Split into a model encoding the sentence; and then a language model.
Calculate the probability of an English sentence conditioned on a French sentence.
DON'T SAMPLE RANDOMLY! - Find the sentence that maximizes the conditional probability


BEAM search

Beam Width - maintain a list of the best three words (for example) in a probabilistic sense.
After the first word, you maintain a list of conditional probabilities of say two words together. You hardwire the previous word output into the next. You do it for all the three contenders- then find the top three across all.
And you therefore continue, fragment by fragment.
If beam width = 1, it essentially becomes greedy search.
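A toy sketch of beam search over a hypothetical next-token model. `toy_model` and its probabilities are invented for illustration; a real system would call the decoder network instead.

```python
import math


def beam_search(next_token_probs, beam_width, max_len, eos="<EOS>"):
    """Keep the beam_width most probable partial sequences at every step."""
    beams = [([], 0.0)]  # (tokens so far, log probability)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens and tokens[-1] == eos:
                candidates.append((tokens, score))  # finished, carry over
                continue
            for token, p in next_token_probs(tokens).items():
                candidates.append((tokens + [token], score + math.log(p)))
        # Keep only the best beam_width candidates across all expansions.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams


def toy_model(prefix):
    """Hypothetical conditional distribution over the next token."""
    if not prefix:
        return {"a": 0.6, "b": 0.4}
    if prefix[-1] == "a":
        return {"x": 0.5, "y": 0.5}
    if prefix[-1] == "b":
        return {"z": 0.9, "<EOS>": 0.1}
    return {"<EOS>": 1.0}
```

With `beam_width=1` the search greedily commits to "a" and ends with probability 0.3, while `beam_width=2` keeps "b" alive and finds the more probable sequence b, z (probability 0.36).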


Refinements to BEAM search

Dealing with numerical underflow: multiplying many small probabilities might underflow, so we sum log probabilities instead.
The raw objective also tends to favor shorter translations, because every extra word multiplies in another probability less than 1. Normalizing by the number of words reduces the penalty for longer translations.
Take the top sentences and compute the score- pick the highest!
Choosing B

If B is large, you consider a lot of possibilities, but need more computation power and memory.
If B is small, you consider fewer possibilities, but it is quicker to run.
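The log and length-normalization refinements can be sketched in a few lines. The softening exponent `alpha` and its default of 0.7 are assumptions here; `alpha=0` means no normalization and `alpha=1` means full normalization by length.

```python
import math


def normalized_log_prob(token_probs, alpha=0.7):
    """Sum of log probabilities divided by (number of tokens) ** alpha."""
    # Summing logs avoids the underflow caused by multiplying small numbers.
    log_prob = sum(math.log(p) for p in token_probs)
    return log_prob / (len(token_probs) ** alpha)
```

For example, a 2-word candidate with per-word probability 0.5 beats a 4-word candidate with per-word probability 0.6 on the raw score, but the longer one wins once the score is normalized with `alpha=1`.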


Beam Search Error Analysis

How to analyse where the error lies, is it the network or the Beam Search algo?

Switch into two cases, and you can find who is at fault exactly.
Find such cases, and do an error analysis for all faulty examples, and ascribe the error to either of the two.


Bleu Score- to decide between multiple good answers for a translation.

Stands for 'Bilingual Evaluation Understudy'
Modified Precision - credit each word in the output only up to the maximum number of times it appears in any one of the human provided reference translations.
Look at pairs of words- bigrams - how many times do the bigrams appear?


We do this for unigrams, bigrams, n-grams..

Combined BLEU Score- basically an average of the scores for unigrams, bigrams, n-grams...
Brevity penalty- adjusts the score by penalizing translations that are too short
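Modified n-gram precision, the building block of the BLEU score, can be sketched as below. Sentences are assumed to be pre-tokenized lists of words, and the function names are illustrative.

```python
from collections import Counter


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n):
    """Clipped n-gram count of the candidate divided by its total n-gram count."""
    cand_counts = Counter(ngrams(candidate, n))
    # For every n-gram, the most times it appears in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0
```

With the references "the cat is on the mat" and "there is a cat on the mat", the candidate "the the the the the the the" gets a unigram precision of 2/7 rather than 7/7, because "the" is credited at most twice.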


Attention Model Intuition

A human does not memorize the entire sentence and then translate it, but this is what the plain encoder-decoder architecture is doing.
So it does badly on longish sentences; instead, you work on one part of the sentence at a time.
A set of attention weights - how much attention should you give to words when determining the translation.
Implementation details

At every step, you decide how much context weight to give to the other words.
You input Context vectors at each time step.
Calculate factors for getting the attention weights using a small neural network.
TAKES A LOT OF TIME TO RUN THOUGH!- Is Quadratic
 - You can apply this idea to image captioning as well, just pay attention to parts of the picture.
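The step from attention scores to a context vector can be sketched with NumPy. In a real model the scores come from the small neural network mentioned above; here they are simply passed in, and the names are assumptions.

```python
import numpy as np


def attention_context(scores, encoder_states):
    """scores: (Tx,), encoder_states: (Tx, n). Returns (weights, context)."""
    # Softmax turns the scores into attention weights that sum to 1.
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()
    # The context is the attention-weighted sum of the encoder states.
    context = weights @ encoder_states
    return weights, context
```

This is done once per output time step, which is where the quadratic cost (Tx times Ty weight computations) comes from.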


Speech recognition problem

First you generate a spectrogram of the speech data and then run recognition
Initially speech was broken into phonemes; but now deep learning is showing that phonemes are not required, also because much larger audio datasets are available for training.
CTC cost for speech recognition

You collapse repeated characters not separated by a blank.
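The CTC collapsing rule can be sketched in a few lines, assuming `_` is used as the blank symbol (the symbol choice and function name are assumptions):

```python
from itertools import groupby


def ctc_collapse(path, blank="_"):
    """Merge repeated characters not separated by a blank, then drop blanks."""
    merged = [ch for ch, _ in groupby(path)]
    return "".join(ch for ch in merged if ch != blank)
```

For example, `ctc_collapse("hh_e_ll_lo")` gives "hello": the blank between the two runs of "l" is what keeps the double letter.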


Trigger word detection- TRIGGERED!

Hey Siri, Okay Google etc.
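Just binarize the target label (1 right after the trigger word, 0 otherwise). The labels are then imbalanced, skewed heavily towards 0.

To solve this, you might output 1 for several consecutive time steps after the trigger word.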


CONCLUSION AND thank you!

Assignment: Using a machine translation model to convert dates to human readable dates


Implement an attention model

Here's what you should remember from this notebook:
Machine translation models can be used to map from one sequence to another. They are useful not just for translating human languages (like French->English) but also for tasks like date format translation.
An attention mechanism allows a network to focus on the most relevant parts of the input when producing a specific part of the output.
A network using an attention mechanism can translate from inputs of length Tx to outputs of length Ty, where Tx and Ty can be different.
You can visualize attention weights α⟨t,t′⟩ to see what the network is paying attention to while generating each output.

FINAL ASSIGNMENT - Trigger Word Detection


Converting raw audio to spectrograms
Use a conv layer to convert spectrogram to features
We use unidirectional instead of bidirectional; because we want to detect the word asap (and not wait for the whole sentence!)

Data synthesis is an effective way to create a large training set for speech problems, specifically trigger word detection.
Using a spectrogram and optionally a 1D conv layer is a common pre-processing step prior to passing audio data to an RNN, GRU or LSTM.
An end-to-end deep learning approach can be used to build a very effective trigger word detection system.


## Udacity Nanodegree.md

      
    Part 1

Choosing the right estimator

A Great PPT!

https://docs.google.com/presentation/d/1kSuQyW5DTnkVaZEjGYCkfOxvzCqGEFzWBy4e9Uedd9k/preview?imm_mid=0f9b7e&cmp=em-data-na-na-newsltr_20171213&slide=id.g22aaaf9c33_0_76
The Golden Question


Choosing the right estimator, a cheatsheet from scikit: http://scikit-learn.org/stable/tutorial/machine_learning_map/


https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-algorithm-choice


http://blog.kaggle.com/2016/07/21/approaching-almost-any-machine-learning-problem-abhishek-thakur/


Lesson 1


Refresher on machine learning: bite sized -- https://classroom.udacity.com/nanodegrees/nd009/parts/1d267043-f968-4853-9128-56f88f519d46


Visualizing ML, an intro: http://www.r2d3.us/visual-intro-to-machine-learning-part-1/


Lesson 5 : Training Models


Statistics Refresher - In Extracurricular

Mean, Median, Quartiles, IQR etc., Variability, Standard deviations and distributions

Training Models- small intro to numpy

Lesson 6 : Testing Models


Splitting data into testing and training


GOLDEN RULE: Never use training data for testing (duh..)


Lesson 7: Evaluation Metrics


Confusion matrix: True positive, false positive


Accuracy (Why might it be bad sometimes?)


Precision and Recall (https://en.wikipedia.org/wiki/Precision_and_recall#/media/File:Precisionrecall.svg)


F1 Score (Harmonic Mean), F- beta Score


ROC (Receiver Operating Characteristic) Curve, Regression Metrics (R2 score)
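Precision, recall and the F-beta score from the metrics above can be computed directly from confusion-matrix counts. This is a plain-Python sketch; the function name is illustrative.

```python
def precision_recall_fbeta(tp, fp, fn, beta=1.0):
    """Return (precision, recall, F-beta) from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    b2 = beta ** 2
    denom = b2 * precision + recall
    # beta < 1 weights precision more, beta > 1 weights recall more.
    fbeta = (1 + b2) * precision * recall / denom if denom else 0.0
    return precision, recall, fbeta
```

With `beta=1` this reduces to the F1 score, the harmonic mean of precision and recall.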


Lesson 8: Detecting Errors


Overfitting (High variance) -- (Killing a fly with a bazooka) vs Underfitting (High Bias) (killing a godzilla with a flyswatter)


Cross validation, K fold cross validation


Learning curves


Lesson 9: A short summary - putting it all together


Grid search

Practice Project - Bag of words concept

Part 3

Lesson 2: Introduction to regression


Finding the best fit using calculus - Minimizing the sum of square error


Find best order of polynomial - Polynomial regression


Types of errors in training data


Cross Validation in Regression


Lesson 3: More regression


Parametric regression


Non Parametric Regression -- Instance based methods -- K Nearest neighbour vs Kernel Regression


Lesson 4: Regressions in sklearn


Continuous supervised learning (how does it differ from what you have learnt thus far?)


Continuous (Generally some sort of ordering) vs discrete classifier (No ordering ,even though they might be numbers)


Slope and Intercept


skLearn Practice - R square metric


Errors in Linear Regression- best model minimizes the sum of squared errors (Why Square and Not Absolute?). What is the problem with the sum of squared errors?


Benefits of R-square over Least Squares


Classification vs Regression: Differentiate based on output -- Classification gives discrete labels (yes or no); but regression gives a concrete number from a continuous model


Lesson 5: Decision Trees


Classification vs Regression


Classification Learning Concepts- Hypothesis, Target Concept etc.


Decision Tree Introduction: How to decide which trees are better


Analogy with the 20 questions car games


Best Attributes


Decision Tree Expressiveness


Space complexity of decision trees: how many decision trees are possible?


ID3 algorithm - What does the best attribute mean? (Information Gain) - Formula for entropy


Biases in ID3


Can you repeat an attribute? For continuous values, you can ask a different question about the same attribute. For discrete values, no attribute should be repeated along a path


Dealing with overfitting
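The entropy formula and the information gain that ID3 uses to pick the best attribute can be sketched as follows (plain Python, illustrative function names):

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy in bits: -sum of p * log2(p) over the label values."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(labels, groups):
    """Entropy before the split minus the weighted entropy after it."""
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups)
    return entropy(labels) - remainder
```

Here `groups` is the list of label subsets produced by splitting on one attribute; ID3 picks the attribute whose split yields the largest gain.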


Lesson 6: More decision trees


Multiple linear questions (Think of it as multiple linear questions)


Coding decision trees - Tuning parameters


Regression using decision trees


Data Impurity/Entropy - min_split criteria; tuning skLearn


Lesson 7: Neural Networks


Perceptron Units- Representing basic boolean operations using perceptron units


Perceptron Training -


(1) Perceptron Rule - Works if the dataset is Linearly Separable (The Halting problem) -- half plane, half space
(2) Gradient Descent (For non linear separability) - Sigmoid function- avoiding local minima!


Comparison of the two approaches


Back Propagation- Neural Networks


Restriction Bias, Preference Bias, Occam's Razor


Lesson 8: Support Vector Machines - The Math behind it

Best line is consistent to the training data, while committing to it the least


Derivation for the best line- maximizing the margin


Solving the best line for SVM - Quadratic Programming Problem -- Zero/Non zero alphas for vectors (input data)


Only a few points matter; the one close to the decision boundary- those are our SUPPORT VECTORS!


Linearly married - Kernel Trick
Domain Knowledge is introduced via Kernel Trick- The Mercer Condition

Lesson 10 - SVMs in Practice

Lesson 11 - Instance Based Learning


K Nearest neighbors


Intro


Classification vs Regression


Running times of various algos (Learning vs Querying)


Eager vs Lazy Learners


Different Distance Metrics - IMPORTANT TO HAVE NICE DOMAIN KNOWLEDGE!


KNN Preference Bias - Locality, Smoothness and importance of features


Curse of Dimensionality - Number of data points with respect to the dimensionality of your feature space


Locally weighted regression


Lesson 13: Bayesian Learning

Basic Refresher Video : https://www.youtube.com/watch?v=xw6utjoyMi4

Bayes Rule - Derivation using Probability Chain Rule , Prior - domain knowledge, Priors Matter A LOT!

Many times, we do not have the Prior of Data, but is not needed, as we just need the Maximum among all hypothesis

Maximum Likelihood vs Maximum A priori


With noisy data- General Gaussian Derivation-- Comes up to minimizing sum of squared errors - mind blown!


Minimum Description Length - Entropy! - Minimizing error (mis classification) and getting simplest model


Finding best hypothesis vs finding best label


Lesson 14: Bayesian Inference


Joint Distribution


Conditional Independence


Belief Networks/Bayesian Network


Joint Distribution and Sampling - Conditional Probability


Inference Rules


Lesson 16: Ensemble Bagging and Boosting


Ensemble Classification : Combine simple rules to make a complex rule to classify


Ensemble Bagging - How is better? Relates to avoid overfitting


Ensemble Boosting - Weighted error rate - weak learning - Boosting Code


Increasing weight on the ones getting wrong, and reducing weight on the ones right ; in a particular iteration. Combining how to get the final hypothesis.


Boosting and overfitting - Error vs confidence


Project 2- Charity ML


Data preprocessing - Normalization, scikit minMaxScaler
Data preprocessing - OneHotEncoding for categorical values
F Beta Score
Grid Search
Feature Importance

Part 4 - Unsupervised Learning

Clustering


Trying to guess the data's structure when a data does not come with labels
KMeans and Outliers: https://stats.stackexchange.com/questions/214362/trouble-in-understanding-outliers-influence-on-k-means
KMeans - Assign a center and optimize
Visualization to understand: https://www.naftaliharris.com/blog/visualizing-k-means-clustering/
SkLearn's KMeans - The number of clusters is a VERY important parameter
Limitations: Result not always the same; the problems of local minima
A local hill climbing algorithm

More Clustering


Single Linkage Clustering- Inter Cluster Distance - Big O Running Time


Soft Clustering- A point belongs to a cluster 'probabilistically'


Maximum Likelihood Gaussian


Expectation Maximization


Properties of Clustering - Richness, scale-invariance, consistency - IMPOSSIBILITY THEOREM- Cannot have All three!


Properties of EM (Can get stuck-local optima)

Clustering Mini Project


Feature Scaling

Feature Scaling


Giving equal weightage to all features.
Feature Scaling Formula
SKLearn MinMaxScaler
What algorithms would be affected?
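The min-max feature scaling formula, x' = (x - min) / (max - min), can be sketched in plain Python. Mapping a constant feature to all zeros is a choice made for this sketch.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant feature: no spread to rescale (assumed convention).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

For example, the values [115, 140, 175] become [0.0, 0.4167, 1.0] (rounded), so features measured on very different scales end up comparable.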

Feature Selection


Important for 'Knowledge Discovery' and 'Curse of Dimensionality'
Feature Selection Algorithms: Filtering and Wrapping- Tradeoffs between the two
Filtering: Use something as a decision tree to use information gain and get the subset of the most important features- then these features are passed into another learner
Wrapping: Ways to do wrapping - Searching, forward and backward search - WHAT FEATURES ARE IMPORTANT?
Feature Relevance, Relevance (Strong vs Weak) vs usefulness
Relevance measures effect on the Bayes Optimal Classifier
Usefulness for a feature is defined for a particular algorithm

Principal Component Analysis


A great link to visualize PCA: http://setosa.io/ev/principal-component-analysis/
PCA - focuses on shifting and rotating only - for eg. y = sin (x) will be a 2D system; PCA just does translation and rotation.
The center of the coordinate moves to the center of the data.
Importance of the new axis
Measurable vs Latent Features
Composite Features- Principal Component is NOT regression!
How to get the Principal Component? Maximal Variance! - find the direction with the maximum spread, which also minimizes the information loss
Feature Transformation- PCs can be used as independent features, i.e. they do not overlap in terms of information with each other
When to use PCA? Eg.: Eigenfaces

PCA Mini Project: Eigenfaces


Observation: Using a lot of PCs can lead to overfitting.

Feature Transformation


Transform a set of features into smaller, more compact features while retaining as much info as possible.
Why? An example of the google search problem- Problems such as polysemy and synonymy
Independent Component Analysis- ICA looks for statistical independence!
Cocktail Party problem: http://research.ics.aalto.fi/ica/cocktail/cocktail_en.cgi
PCA vs ICA- very different! Whereas PCA finds global stuff such as eigenfaces, ICA finds more distinct features such as 'nose', 'eyes' etc.
Alternatives: RCA (Random Component Analysis) - deals with curse of dimensionality and LDA (Linear Discriminant Analysis) - cares about the labels

Unsupervised learning project


Good way to find relevant features: try making a feature a label and predicting it from other features.--find R2 score and see if you can model a feature using the others.
Box-Cox transformation
Outlier Detection- Tukey's method
Outliers: To Drop, or not to drop?
Silhouette's coefficient for the effectiveness for k means: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html
GMM (soft clustering) vs KMeans (hard clustering)
PCA in layman terms: https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues

Part 5 - Reinforcement Learning

Markov Decision Process

Think of a world where actions are uncertain with some probabilities

The parameters/variables in a Markov Decision Process: states, models, action, rewards
Markovian Property- The next state only depends on the current state
Delayed Rewards - What was the action that led to the ultimate reward? - Temporal Credit Assignment Problem
Rewards - the hot sand beach walking to the water analogy
Sequences of rewards- Infinite horizons, Utility of Sequences; Relationship between rewards and utilities
Optimal Policy- The policy that maximizes the expected rewards. Reward in a state is not the same as the utility for the state - Reward is short term gratification, while utility is long term gratification
The BellMan Equation - How to solve it? Value Iteration
Finding Policies (pi)- Policy Iterations
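Value iteration for the Bellman equation can be sketched on a tiny hand-made MDP. The data layout (`transitions[state][action]` as a list of `(probability, next_state)` pairs, rewards per state) and the two-state example are assumptions for illustration.

```python
def value_iteration(transitions, rewards, gamma=0.9, tol=1e-9):
    """Iterate U(s) = R(s) + gamma * max_a sum_s' T(s, a, s') * U(s')."""
    utilities = {s: 0.0 for s in transitions}
    while True:
        updated = {}
        delta = 0.0
        for s, actions in transitions.items():
            # Expected utility of the best action (0 for terminal states).
            best = max(
                (sum(p * utilities[s2] for p, s2 in outcomes)
                 for outcomes in actions.values()),
                default=0.0,
            )
            updated[s] = rewards[s] + gamma * best
            delta = max(delta, abs(updated[s] - utilities[s]))
        utilities = updated
        if delta < tol:
            return utilities


# Hypothetical two-state world: "goal" is terminal, "start" costs -1 per step.
toy_transitions = {
    "start": {"go": [(1.0, "goal")], "stay": [(1.0, "start")]},
    "goal": {},
}
toy_rewards = {"start": -1.0, "goal": 10.0}
```

In this toy world the utility of "start" converges to -1 + 0.9 * 10 = 8, which shows how utility (long term) differs from the immediate reward of -1.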

Reinforcement Learning


Managing vs Learning, Modeler and Simulator


Three approaches to reinforcement learning - Policy Search, Value Function Based, Model Based


Focus on Value Function- The Q function- Q Learning- VERY GOOD EXAMPLE: http://mnemstudio.org/path-finding-q-learning-tutorial.htm- Incremental Learning


VERY NICE SLIDES ON RL: http://home.deib.polimi.it/restelli/MyWebSite/pdf/rl5.pdf


Q-Learning: How to choose actions?- Local Min Problem- Simulated Annealing


Epsilon Greedy Exploration- Exploration vs Exploitation dilemma, a tradeoff between two things
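A sketch of epsilon-greedy action selection together with the incremental Q-learning update. The nested-dictionary Q-table (`q[state][action]`) and the default learning rate and discount are assumptions for this example.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon explore, otherwise exploit the best action."""
    if rng.random() < epsilon:
        return rng.choice(list(q_values))
    return max(q_values, key=q_values.get)


def q_update(q, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """Move Q(s, a) towards reward + gamma * max_a' Q(s', a')."""
    best_next = max(q[next_state].values(), default=0.0)
    target = reward + gamma * best_next
    q[state][action] += alpha * (target - q[state][action])
```

Decaying `epsilon` over time is a common way to explore a lot early on and exploit the learned values later.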


Game Theory


The mathematics of conflict of interests- Game theory has multiple agents (as opposed to single agents in the above cases) ;Game Tree
Matrix form of the game, writing strategies of agents against one other and placing rewards.
Material on 2-player zero-sum games: http://www.cs.cmu.edu/~./awm/tutorials/gametheory.html
MiniMax theorem: Minimax is same as Maximin in the 2-player zero-sum game; i.e. maximizing the min is same as minimizing the max. Find the value of the game.
Von Neumann's Theorem - Relevant to non-deterministic games of perfect information as well
Then instead of perfect information, we go to hidden information- Minimax theorem fails!!!
Mixed strategy vs pure strategy; Center Game
Non zero sum game- Prisoner's dilemma
Nash equilibrium - Playing the game multiple times; for a n repeated game, solution is n repeated N.E.

More Game Theory


Stochastic Games and Multi Agent Reinforcement Learning
Zero Sum Stochastic Games and General Sum Stochastic Games - Nash Q Algorithm

Reinforcement Learning Project: Smart Cab


PyGame
Selecting state space
Exploration vs Exploitation

Other times the agent learns a suboptimal policy because it first explores an action which is sub-optimal, but does yield positive rewards, and then repeatedly exploits that action. Later it may randomly explore the optimal policy, but at that point the suboptimal policy will have a higher value in the q-table.
For example, this might be "going forward at a green light" instead of following the waypoint at a green light. We will get some reward for simply moving on green, regardless of the waypoint, but it's not optimal. However it will be regularly exploited until exploration occurs again. During the exploitation period, it will build up a significant lead on the optimal policy.

Deep Learning

Lesson 1: More Deep Learning


Juanito is playing; he has to divide the points, and he is going to draw a line; how will he do it?
Linear boundaries for dividing data points. Then generalized for higher dimensions/features.
Perceptrons in terms of nodes (Neural Networks); perceptrons as logical operators
Perceptron Trick, Learning Rate. Start Random and then try fitting the line iteratively to correctly predict the mis-predicted points.
Error function- log-loss error function- When can you use Gradient descent?
Discrete vs Continuous Predictions- Sigmoid Function; Softmax Function, One Hot Encoding
Maximum Likelihood; Cross Entropy; Multi Class Cross Entropy
Minimizing the error function given by cross entropy formula. Gradient Descent.
Similarities and Comparison between Perceptron and Gradient Descent; A correctly classified point asks the separation line to go away; and a misclassified point asks the line to come closer. (think about it, makes sense!)
Non Linear Models; Combining multiple perceptrons; Hidden Layers, Multi class classification
Feedforward and training Neural Networks; Backpropagation
Keras, Student Admissions Mini Project
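
A minimal sketch of the perceptron trick from the list above, trained on the AND operator. The notes say to start from a random line; this sketch starts from zero weights and uses a learning rate of 1.0 so the run is reproducible, which is a choice made here and not taken from the lesson.

```python
def predict(weights, bias, point):
    """Step-function perceptron: 1 on or above the line, 0 below it."""
    score = sum(w * x for w, x in zip(weights, point)) + bias
    return 1 if score >= 0 else 0


def perceptron_step(weights, bias, point, label, learning_rate):
    """If the point is misclassified, move the line toward it."""
    if predict(weights, bias, point) == label:
        return weights, bias
    sign = 1 if label == 1 else -1
    weights = [w + sign * learning_rate * x for w, x in zip(weights, point)]
    return weights, bias + sign * learning_rate


def train_perceptron(points, labels, epochs=20, learning_rate=1.0):
    weights, bias = [0.0] * len(points[0]), 0.0
    for _ in range(epochs):
        for point, label in zip(points, labels):
            weights, bias = perceptron_step(weights, bias, point, label, learning_rate)
    return weights, bias


# AND as a perceptron: only (1, 1) is labelled 1.
points = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 0, 0, 1]
weights, bias = train_perceptron(points, labels)
predictions = [predict(weights, bias, p) for p in points]
```

Only misclassified points change the line, which is the main contrast with gradient descent on the log-loss, where every point contributes to the update.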

Description of batch size etc:
* one epoch = one forward pass and one backward pass of all the training examples
* batch size = the number of training examples in one forward/backward pass. The higher the batch size, the more memory space you'll need.
* number of iterations = number of passes, each pass using [batch size] number of examples. To be clear, one pass = one forward pass + one backward pass (we do not count the forward pass and backward pass as two different passes).
Example: if you have 1000 training examples, and your batch size is 500, then it will take 2 iterations to complete 1 epoch.
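
The same arithmetic as a short sketch; the helper names `iterations_per_epoch` and `batches` are made up here.

```python
import math


def iterations_per_epoch(num_examples, batch_size):
    """Number of forward/backward passes needed to see every example once."""
    return math.ceil(num_examples / batch_size)


def batches(examples, batch_size):
    """Yield consecutive slices of at most batch_size examples."""
    for start in range(0, len(examples), batch_size):
        yield examples[start:start + batch_size]


examples = list(range(1000))
print(iterations_per_epoch(len(examples), 500))  # 2, as in the example above
```

When the batch size does not divide the dataset evenly, the last batch is simply smaller, so 1000 examples with a batch size of 300 take 4 iterations.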


Training optimization- Stochastic Gradient Descent and Batch Gradient Descent; How to choose the decay of the learning rate; Overfitting vs Underfitting (the exam analogy); and how it is applicable in the neural network setting


Model Complexity Graph- Complexity generally increases with the increasing number of epochs- Early Stopping
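
A sketch of early stopping with a patience window. The validation losses and the name `early_stopping_epoch` are invented for illustration.

```python
def early_stopping_epoch(validation_losses, patience=2):
    """Return the epoch with the best validation loss, stopping the scan
    once the loss has failed to improve for `patience` epochs in a row."""
    best_loss, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_epoch


# Hypothetical losses: they improve, then start rising (overfitting).
losses = [0.9, 0.7, 0.6, 0.65, 0.7, 0.5]
stop_at = early_stopping_epoch(losses, patience=2)
```

With a patience of 2 the scan stops before it reaches the final 0.5, so the weights from epoch 2 would be kept; a larger patience trades extra training time for the chance of catching such late improvements.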

Regularization and Overfitting- Punish high coefficients to avoid them!
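
A small sketch of how the penalty works, assuming L2 regularization (sum of squared weights scaled by a strength `lam`). The weight vectors and the data loss value are made up.

```python
def l2_penalty(weights, lam):
    """Penalty that grows with the square of each coefficient."""
    return lam * sum(w * w for w in weights)


def regularized_loss(data_loss, weights, lam):
    """Total error = error on the data + penalty on large coefficients."""
    return data_loss + l2_penalty(weights, lam)


# Two hypothetical models that draw the same boundary (x1 + x2 = 0),
# one with small coefficients and one with large ones.
small = [1.0, 1.0]
large = [10.0, 10.0]
small_total = regularized_loss(0.3, small, lam=0.01)
large_total = regularized_loss(0.3, large, lam=0.01)
```

Even with the same data loss, the model with large coefficients gets the higher total error, so the optimizer is pushed toward the smaller weights.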


Dropout- Avoid dominance of one part of the neural network to let some of the other weaker parts train- Vanishing Gradient - try other Activation Functions, like hyperbolic tangent (tanh), ReLU etc.
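
A plain-Python sketch of dropout on one layer's activations. This uses the "inverted dropout" variant (scale the kept units during training), which is an assumption here; Keras provides this behaviour as a layer, which this sketch does not use.

```python
import random


def dropout(activations, drop_prob, rng):
    """Zero each activation with probability drop_prob and rescale the rest
    so the expected sum stays the same. Requires drop_prob < 1."""
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


rng = random.Random(0)  # seeded so the run is repeatable
hidden = [1.0, 2.0, 3.0, 4.0]
dropped = dropout(hidden, drop_prob=0.5, rng=rng)
```

Because a different random subset of units is switched off on every pass, no single part of the network can dominate training.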


The problem of local minima! Try random restart or go with more momentum
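
A sketch of the momentum update rule. The objective f(x) = x^2, the starting point, and the values of `learning_rate` and `beta` are hypothetical; this only shows the mechanics of the update and does not demonstrate escaping a local minimum.

```python
def momentum_step(position, velocity, gradient, learning_rate=0.1, beta=0.9):
    """Keep a fraction beta of the previous velocity, then add the new step."""
    velocity = beta * velocity - learning_rate * gradient
    return position + velocity, velocity


# Hypothetical objective f(x) = x^2 with gradient 2x.
x, v = 5.0, 0.0
for _ in range(200):
    x, v = momentum_step(x, v, gradient=2 * x)
```

The velocity term is what lets the update keep moving through a flat region or over a small hump where the plain gradient step would stall.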


A blog on the optimizers available in Keras: http://ruder.io/optimizing-gradient-descent/index.html#rmsprop


Mini Project: IMDB


Convolutional Neural Networks (CNN)


Applications of CNNs- Eg. Image Classification, Text Classification, Pictionary etc.


The MNIST project: recognizing digits from images

One Hot Encoding, Flattening image matrices to vectors, Vanishing Gradient Problem
A great link on activation functions: http://cs231n.github.io/neural-networks-1/#actfun; Categorical cross entropy loss as a loss function
Choosing the best model- split into validation sets!
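
A short sketch of the two preprocessing steps above, one-hot encoding a digit label and flattening an image matrix to a vector. The all-zero 28x28 image is a stand-in for an MNIST digit.

```python
def one_hot(label, num_classes):
    """Vector of zeros with a single 1 at the index of the label."""
    vector = [0] * num_classes
    vector[label] = 1
    return vector


def flatten(image):
    """Turn a 2D image (list of rows) into one long vector, row by row."""
    return [pixel for row in image for pixel in row]


image = [[0] * 28 for _ in range(28)]  # placeholder 28x28 image
features = flatten(image)              # 784 values, as fed to an MLP
target = one_hot(3, num_classes=10)
```

Flattening is what makes the image usable by an MLP, and also what throws away the 2D neighbourhood information that CNNs keep.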


CNNs vs MLPs (Multi layer perceptrons)-
When do MLPs fail? -.-

MLPs use a lot of params (Sparsely (locally) connected vs fully connected layer)
Throwing away 2D neighborhood information (such as in an image!) due to flattening
Color coding


Convolutional layers

Convolutional windows- Use multiple filters to detect multiple patterns
Activation Maps
Color Images!
Stride and Padding


Convolutional Layers in Keras

You are strongly encouraged to add a ReLU activation function to every convolutional layer in your networks.
Formula for number of parameters in a convolutional layer and formulas for shape of a convolutional layer
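
The formulas can be sketched as below, assuming square filters and the usual "same"/"valid" padding conventions. The function names are made up and no Keras call is used.

```python
import math


def conv_param_count(num_filters, kernel_size, input_depth):
    """Weights per filter (kernel * kernel * depth) plus one bias per filter."""
    return num_filters * kernel_size * kernel_size * input_depth + num_filters


def conv_output_size(input_size, kernel_size, stride, padding):
    """Height (or width) of the output of a convolutional layer."""
    if padding == "same":
        return math.ceil(input_size / stride)
    if padding == "valid":
        return math.ceil((input_size - kernel_size + 1) / stride)
    raise ValueError("padding must be 'same' or 'valid'")


# Hypothetical layer: 16 filters of size 2x2 on a 32x32 RGB image.
params = conv_param_count(num_filters=16, kernel_size=2, input_depth=3)
height = conv_output_size(32, kernel_size=2, stride=2, padding="valid")
```

The depth of the output equals the number of filters, so this layer would produce a 16x16x16 volume.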


Pooling Layers

Used for dimensionality reduction and avoiding overfitting.
Take feature maps as input
Max Pooling Layer, Global Average Pooling Layer
Think as a stack of pancakes!
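
A plain-Python sketch of the two pooling layers on a single feature map. The 4x4 values are made up, and a 2x2 window with stride 2 is assumed for max pooling.

```python
def max_pool_2x2(feature_map):
    """Keep the largest value in each non-overlapping 2x2 window."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [max(feature_map[r][c], feature_map[r][c + 1],
             feature_map[r + 1][c], feature_map[r + 1][c + 1])
         for c in range(0, cols - 1, 2)]
        for r in range(0, rows - 1, 2)
    ]


def global_average_pool(feature_map):
    """Reduce a whole feature map to its single average value."""
    values = [v for row in feature_map for v in row]
    return sum(values) / len(values)


feature_map = [[1, 3, 2, 4],
               [5, 6, 7, 8],
               [3, 2, 1, 0],
               [1, 2, 3, 4]]
pooled = max_pool_2x2(feature_map)          # [[6, 8], [3, 4]]
average = global_average_pool(feature_map)  # 3.25
```

Max pooling halves the height and width of each pancake in the stack; global average pooling turns each pancake into one number.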


CNNs for image classification


Resizing the images. Aim is to decrease the width and the height of the image, while increasing the depth of the image. Use max pooling layers to reduce dimensionality, i.e. reduce height and width.
A connected layer at the very end.

When constructing a network for classification, the final layer in the network should be a Dense layer with a softmax activation function. The number of nodes in the final layer should equal the total number of classes in the dataset.


The CIFAR 10 image database project


Keras Cheat Sheet- https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Keras_Cheat_Sheet_Python.pdf


Image Augmentation

Scale invariance, Translation invariance and Rotation invariance. Add random images with a bit of rotation, translation etc to the dataset. Augment with ImageDataGenerator. Note the use of steps_per_epoch, fit_generator and flow in the fit command.
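
A plain-Python illustration of what augmentation does to a single image, using a horizontal flip and a shift to the right. It is a toy stand-in for the idea, not how ImageDataGenerator is implemented, and the tiny 2x3 image is made up.

```python
def flip_horizontal(image):
    """Mirror every row of the image."""
    return [list(reversed(row)) for row in image]


def shift_right(image, pixels, fill=0):
    """Translate the image to the right, padding the left edge with fill."""
    width = len(image[0])
    return [[fill] * pixels + list(row[:width - pixels]) for row in image]


image = [[1, 2, 3],
         [4, 5, 6]]
augmented = [image, flip_horizontal(image), shift_right(image, 1)]
```

Training on the original together with such variants is what pushes the network toward translation and reflection invariance.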


Groundbreaking CNN architectures, eg. ResNet, VGG etc.


Transfer Learning- Using a pre-trained neural network to solve a new problem, i.e. a different dataset.

The initial layers detect more common patterns such as circles, shapes etc; so they can be kept. Then you just train the final layers.

Here is a generalized overview of what the convolutional neural network does:
  the first layer will detect edges in the image
  the second layer will detect shapes
  the third convolutional layer detects higher level features


CNN Dog Recognition project


Haar Wavelet Face Detection, using ResNet50 for dog detection
Classifying breeds is a difficult problem.
Additional links-

http://cs231n.github.io/
http://cs231n.github.io/transfer-learning/


Deep Learning Extracurricular

Lesson 3: Intro to TensorFlow


brings different communities such as speech recognition, computer vision together with a common set of tools to solve the problems.
Intro to tensorflow constants and sessions. placeholder and feed_dict
Supervised Classification

Training a logistic classifier- weights, bias etc.


Some coding quizzes on TensorFlow placeholder, softmax etc.
Activation functions- ReLU; softmax; implementing cross entropy
Practical Aspects of Deep Learning - have variables with zero mean and equal variance - Badly conditioned vs well conditioned- well conditioned makes optimization easier (numerically)
Measuring performance- Have classifiers generalize, not memorize. That's why you use validation sets!
The problem of scaling gradient descent- take a random sample of the training data, compute the gradient on it- do this many times! Stochastic Gradient Descent- Exponential Decay
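
The softmax and cross-entropy items above can be sketched in plain Python (no TensorFlow). The scores are made-up logits.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probabilities, one_hot_label):
    """Negative log of the probability assigned to the correct class."""
    return -sum(y * math.log(p)
                for p, y in zip(probabilities, one_hot_label) if y)


logits = [2.0, 1.0, 0.1]  # hypothetical scores for 3 classes
probabilities = softmax(logits)
loss = cross_entropy(probabilities, [1, 0, 0])
```

The loss is small when the correct class gets a high probability and grows without bound as that probability approaches zero.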

Small Exercise


Nice exercise to see how much RAM space you need.
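
A rough version of that kind of calculation, assuming 32-bit floats (4 bytes per value) and MNIST-like shapes. The shapes are an assumption for illustration; the exercise's actual numbers are not recorded in these notes.

```python
def tensor_bytes(shape, bytes_per_value=4):
    """Memory needed for a dense tensor of the given shape."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_value


# Assumed shapes: 55000 flattened 28x28 images, 10 classes.
shapes = {
    "train_features": (55000, 784),
    "train_labels": (55000, 10),
    "weights": (784, 10),
    "bias": (10,),
}
total_bytes = sum(tensor_bytes(shape) for shape in shapes.values())
```

Almost all of the memory goes to the input features, which is one reason to feed the data in mini-batches.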

Lesson 4: Intro to Neural Networks


Luis is back!!