sgoyal1012/Computer Vision.md

## Computer Vision.md

      
          
    Image Representation and Classification


Emotional intelligence

Ability to read and understand emotions -- Affectiva uses emotions on face, i.e. facial features.


Color Images representation


Depth is the number of channels

If the identification problem is easier in color for us humans, it’s likely easier for an algorithm to see color images too!


Color thresholds

Used to select an area of interest (blue screen example)


Blue Screen example

Uses cv2 library
OpenCV reads images as BGR instead of RGB -- use cvtColor to convert
Apply a mask -- use inRange method
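A minimal sketch of the masking step, written with NumPy only so it runs without OpenCV. `in_range_mask` is a hypothetical stand-in for what `cv2.inRange` does, and the blue thresholds are made-up illustration values. With OpenCV you would first convert the loaded image with `cv2.cvtColor(image, cv2.COLOR_BGR2RGB)`.

```python
import numpy as np

def in_range_mask(image, lower, upper):
    """Return a 0/255 mask that is 255 where every channel lies in [lower, upper]."""
    lower = np.asarray(lower)
    upper = np.asarray(upper)
    inside = np.all((image >= lower) & (image <= upper), axis=2)
    return inside.astype(np.uint8) * 255

# Tiny 1x2 RGB image: one pure-blue "screen" pixel, one reddish foreground pixel.
image = np.array([[[0, 0, 255], [200, 30, 40]]], dtype=np.uint8)

# Hypothetical thresholds for "blue enough" (RGB order).
lower_blue = [0, 0, 220]
upper_blue = [50, 50, 255]

mask = in_range_mask(image, lower_blue, upper_blue)

# Black out the blue-screen area and keep everything else.
masked = image.copy()
masked[mask != 0] = [0, 0, 0]
```

The blacked-out region can then be filled with pixels from a background image of the same size.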


Green screen notebook


Color Spaces

RGB, HSV and HLS

HSV for different lighting conditions


What is not separable in one color space may be separable in another!!


Day and night tasks

Distinguish day and night pictures
Labels -- why are they needed?
Features that can distinguish images

A feature is a measurable component of an image or object that is, ideally, unique and recognizable under varying conditions


One-hot encoding etc for labels
Day and night notebook
MUST STANDARDIZE THE INPUT
Gets average brightness using the HSV space -- V is the brightness value

Sum of v/(width * height)
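A small sketch of the average-brightness feature, assuming an RGB image stored as a NumPy array. It relies on the fact that the HSV value channel is the per-pixel maximum of R, G and B, so no OpenCV call is needed. The function names and the threshold of 100 are hypothetical, not taken from the notebook.

```python
import numpy as np

def average_brightness(rgb_image):
    """Sum of V over all pixels divided by (width * height)."""
    v_channel = rgb_image.max(axis=2).astype(np.float64)  # V = max(R, G, B)
    height, width = v_channel.shape
    return v_channel.sum() / (width * height)

def estimate_label(rgb_image, threshold=100.0):
    """Return 1 for day, 0 for night. The threshold is an assumed value."""
    return 1 if average_brightness(rgb_image) > threshold else 0

# Two synthetic, already standardized (same size) images.
day = np.full((4, 6, 3), 180, dtype=np.uint8)
night = np.full((4, 6, 3), 20, dtype=np.uint8)

print(average_brightness(day), estimate_label(day), estimate_label(night))
```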


Convolutional Transforms and Edge Detection


Edge detection


Frequencies in images

Fourier Transform!
Images have high and low frequency components


High pass filters

Enhances edges
Kernel weights MUST sum to zero (otherwise it will brighten/darken the image)
Padding can have three techniques --> extend, crop and pad
Sobel --> calculate gradient magnitude and direction of gradients
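A rough NumPy sketch of a Sobel high-pass filter, assuming a grayscale image and the "crop" border handling (the output is smaller than the input). `filter2d` is a hypothetical helper written for illustration; in OpenCV the equivalent step is usually done with `cv2.filter2D` or `cv2.Sobel`.

```python
import numpy as np

# Sobel kernels: the weights of each kernel sum to zero.
SOBEL_X = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=np.float64)
SOBEL_Y = SOBEL_X.T

def filter2d(gray, kernel):
    """Slide the kernel over the image without padding (cross-correlation)."""
    kh, kw = kernel.shape
    out_h = gray.shape[0] - kh + 1
    out_w = gray.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            out[r, c] = np.sum(gray[r:r + kh, c:c + kw] * kernel)
    return out

# 5x6 image: dark left half, bright right half, so there is one vertical edge.
gray = np.zeros((5, 6))
gray[:, 3:] = 255.0

gx = filter2d(gray, SOBEL_X)            # responds to vertical edges
gy = filter2d(gray, SOBEL_Y)            # responds to horizontal edges
magnitude = np.sqrt(gx ** 2 + gy ** 2)  # gradient strength
direction = np.arctan2(gy, gx)          # gradient direction in radians
```

Flat regions give zero response because the kernel weights sum to zero; only the columns straddling the edge light up.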


Low pass filters

High pass filters ENHANCE noise
Low pass filters can filter out high frequency noises
Averaging filter (Normalized)


Gaussian Blur

Helps a lot in getting rid of noise during edge detection


Canny edge detector

Uses non-maximum suppression and hysteresis thresholding to find the best edges


Hough Transform

Hough Space!


Haar cascades face detection

Trains on positive (face images) and negative (non-face images) examples
Haar features capture facial features (similar to edge detection)
Cascades the classifiers and keeps throwing away non-face areas


Types of features

Use color and shape features together


Types of features and image segmentation


Types of features are edges, corners and blobs

Corners are most repeatable (SIFT -- good feature!)


Morphological operations - remember them?!!

Closing, opening, erosion, dilation


Image contouring

Continuous curves that follow the boundary
findContours and drawContours


Contour properties

https://opencv-python-tutroals.readthedocs.io/en/latest/py_tutorials/py_imgproc/py_contours/py_contour_properties/py_contour_properties.html
Orientation using the ellipse fitting


K Means Clustering

Breaks into different colors
Criteria on when to stop
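A bare-bones k-means sketch on pixel colors, assuming NumPy. The two stopping criteria here (a maximum number of iterations, or centers moving less than `epsilon`) mirror the kind of criteria `cv2.kmeans` accepts, but the function itself and its parameters are hypothetical. Initial centers are passed in explicitly so the example is deterministic.

```python
import numpy as np

def assign(points, centers):
    """Label each point with the index of its nearest center."""
    distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return distances.argmin(axis=1)

def kmeans(points, init_centers, max_iters=10, epsilon=1.0):
    centers = np.array(init_centers, dtype=np.float64)
    for _ in range(max_iters):                      # criterion 1: iteration cap
        labels = assign(points, centers)
        new_centers = centers.copy()
        for j in range(len(centers)):
            members = points[labels == j]
            if len(members) > 0:
                new_centers[j] = members.mean(axis=0)
        shift = np.linalg.norm(new_centers - centers, axis=1).max()
        centers = new_centers
        if shift < epsilon:                         # criterion 2: centers stopped moving
            break
    return assign(points, centers), centers

# 2x2 RGB image with two reddish and two bluish pixels.
image = np.array([[[250, 0, 0], [255, 5, 5]],
                  [[0, 0, 250], [5, 5, 255]]], dtype=np.float64)
pixels = image.reshape(-1, 3)

labels, centers = kmeans(pixels, init_centers=pixels[[0, 2]])

# Replace every pixel by its cluster color to get the segmented image.
segmented = centers[labels].reshape(image.shape)
```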


Lesson 7: Feature Vectors


Feature Vectors

Combine multiple features (like corners) into a single vector


Real-Time Feature Detection

ORB algorithm

Oriented FAST and Rotated BRIEF
Creates binary feature vectors for all feature keypoints


FAST

Selects keypoints from an image using brightness
Is fast because it first compares only 4 of the 16 pixels on the circle around a candidate keypoint


BRIEF

Converts keypoints to a binary feature vector
Binary is efficient and quick

Can run on smartphones
https://opencv-python-tutroals.readthedocs.io/en/latest/py_tutorials/py_feature2d/py_brief/py_brief.html


Scale and Rotation Invariance


Intensity centroid
Rotation-aware BRIEF (rBRIEF) for rotational invariance
Image pyramids - https://docs.opencv.org/2.4/doc/tutorials/imgproc/pyramids/pyramids.html
Different metrics for match -- Hamming metric is one
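A plain-Python sketch of matching binary descriptors with the Hamming metric (the number of bit positions that differ). The 8-bit descriptors are made up for illustration; real ORB descriptors are 256 bits, and OpenCV does this kind of matching with `cv2.BFMatcher(cv2.NORM_HAMMING)`.

```python
def hamming_distance(desc_a, desc_b):
    """Count the positions where two equal-length binary descriptors differ."""
    return sum(bit_a != bit_b for bit_a, bit_b in zip(desc_a, desc_b))

def best_match(query, candidates):
    """Return (index, distance) of the candidate closest to the query."""
    distances = [hamming_distance(query, c) for c in candidates]
    best_index = min(range(len(candidates)), key=distances.__getitem__)
    return best_index, distances[best_index]

query = [1, 0, 1, 1, 0, 0, 1, 0]
train = [
    [0, 1, 0, 0, 1, 1, 0, 1],  # every bit differs
    [1, 0, 1, 0, 0, 0, 1, 0],  # one bit differs
    [1, 1, 1, 1, 0, 1, 1, 0],  # two bits differ
]

print(best_match(query, train))
```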


ORB in video


ORB properties

Scale Invariance
Rotational Invariance
Illumination Invariance
Noise Invariance


HOG

Histogram of oriented gradients
Use Sobel to get gradient magnitudes and directions, then group them into square cells
Works in cases such as pedestrian detection -- i.e. generalization


CNN Layers and Feature Visualization


Introduction to CNN layers

https://cezannec.github.io/Convolutional_Neural_Networks/


Combine various kernels together to form a convolution layer

Combine different filters
For color images, filters are also 3D


Find patterns with patterns within patterns..


Convolutional layers are locally connected


CNN learns filters and patterns!


PyTorch's the new shizz yo


Pooling layers

Pooling layers reduce the dimensionality of an input
Higher dimensionality means more parameters, means overfitting
The PANCAKE analogy


Fully connected layer

Dropout layer to prevent overfitting


Training, loss functions, cross entropy etc


IMPORTANT -- Output layer size

output_dim = (W-F)/S + 1 -- W is the input width/height, F is the filter kernel size, S is the stride
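The formula above as a small helper. The padding term `p` is an addition not in the notes (with padding the usual form is (W - F + 2P)/S + 1), and the floor division is an assumption that matches how common frameworks round when the stride does not divide evenly.

```python
def conv_output_dim(w, f, s=1, p=0):
    """Output width/height of a conv or pooling layer.

    w: input width/height, f: kernel size, s: stride, p: padding (assumed extra term).
    """
    return (w - f + 2 * p) // s + 1

# 224x224 input through a 5x5 convolution with stride 1, no padding.
after_conv = conv_output_dim(224, 5)

# Then a 2x2 max pool with stride 2.
after_pool = conv_output_dim(after_conv, 2, s=2)

print(after_conv, after_pool)
```

Chaining the helper layer by layer is a quick way to find the input size of the first fully connected layer.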


Dropout

The GYM analogy! -- don't train one body part every day! ;)
Prevent one part from dominating by randomly turning off nodes, with some probability


Momentum

Helps power over a local minimum and move toward a better (ideally the global) minimum
Attach weight to the steps, the most recent step weighted the most
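A toy sketch of gradient descent with momentum on a one-dimensional function. The velocity term is a running sum in which the step from k iterations ago is weighted by beta**k, so the most recent step counts the most. The function name, learning rate and beta are illustrative assumptions.

```python
def momentum_descent(grad_fn, w0, lr=0.1, beta=0.9, steps=300):
    """Minimize a 1D function given its gradient, using momentum."""
    w, velocity = w0, 0.0
    for _ in range(steps):
        # Older gradients decay by a factor of beta each iteration.
        velocity = beta * velocity + grad_fn(w)
        w -= lr * velocity
    return w

# f(w) = (w - 3)^2 has its minimum at w = 3 and gradient 2 * (w - 3).
w_final = momentum_descent(lambda w: 2.0 * (w - 3.0), w0=0.0)
print(w_final)
```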


Learn how to design a network from existing networks that work!

DO NOT REINVENT THE WHEEL


Feature Visualization

Example between wolves and dogs
Model learnt that snow means wolf (because of limited data, network learnt that snow is wolf)


Feature Maps/Activation Maps

Each filter activates some property in an image (Eg. edges)
First convolutional layer can give some idea as to types of features being learnt
As you progress, layers detect more and more complex features


Visualize closeness in feature space


Occlusion, Saliency and Guided Backpropagation

Occlusion occludes parts of image
Saliency Maps -- denote the importance of each pixel; change the pixel value by a little and see how/if it affects the class output
Guided backpropagation


Facial Keypoint Detection Project


PyTorch Dataset class and inheritance
Creating transforms -- rescale, cropping etc
Define a well performing CNN structure -->

* What does well mean?

"Well" means that the model's loss decreases during training and, when applied to test image data, the model produces keypoints that closely match the true keypoints of each face. And you'll see examples of this later in the notebook.


WTF's batch normalization
https://towardsdatascience.com/batch-normalization-in-neural-networks-1ac91516821c

Whole Pipeline


Haar Detector Cascades for the

Advanced Computer Vision & Deep Learning

CNNs and Scene understanding


A real world has multiple objects (not just an image of the object to recognize)
Bounding box around an object
Example problem -- in a basketball game, who has the damn basketball? Localization
Want to see the location of an object in the image (and not just identify it)

Detect bounding boxes


Classification vs regression
What loss function to use?


MSE loss is sensitive to outliers
Smooth L1 loss
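A plain-Python comparison of the two losses on one bounding-box style prediction with a single outlier coordinate. The Smooth L1 definition used here (quadratic below `beta`, linear above) is the common one, and the numbers are made up for illustration.

```python
def mse_loss(pred, target):
    """Mean squared error."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def smooth_l1_loss(pred, target, beta=1.0):
    """Quadratic for small errors, linear for large ones."""
    total = 0.0
    for p, t in zip(pred, target):
        diff = abs(p - t)
        if diff < beta:
            total += 0.5 * diff ** 2 / beta
        else:
            total += diff - 0.5 * beta
    return total / len(pred)

target = [1.5, 2.0, 3.0, 4.0]
pred = [1.0, 2.0, 3.0, 14.0]   # last coordinate is an outlier (off by 10)

print(mse_loss(pred, target), smooth_l1_loss(pred, target))
```

The outlier contributes 100 to the squared error but only 9.5 to Smooth L1, which is why the latter is less dominated by a single bad coordinate.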

Multiple Objects in an image


Can't know how many objects there are beforehand!
One approach is to use a sliding window to crop images
To limit the number of cropped regions, propose regions only where objects are uniform (same texture etc)
Image Proposals -- ROIs

R-CNNs


Works on regions of cropped images
Time intensive as it goes through all regions
Outputs class scores for every region -- background also has a class score

Faster R-CNNs


First proposal (Fast R-CNN)

Runs the CNN once over the whole image to create feature maps -- and then gets the ROIs from them
ROI pooling before fully connected layer


Second proposal (Faster R-CNN)

Learns to come up with own region proposal


Region proposal network


https://towardsdatascience.com/deep-learning-for-object-detection-a-comprehensive-review-73930816d8d9


Real Time Detection -- YOLO


Adds bounding box parameters to the class probabilities


Improves over the normal sliding window approach (which is computationally intensive)

Can improve by choosing a sliding window smartly that causes no overlap (all windows are contained in the output layer)


YOLO uses grid cells -- classes and bounding box -- assign the scores and probabilities to the grid box coordinates


Training on a grid requires special data -- GROUND TRUTH TO EVERY GRID!


Example is that a grid cell has depth of 8 -- so shape is nRowGrids * nColGrids * Depth


Get mid point of an object and assign the ground truth to ONLY THE BOX that contains the middle point; the values are relative to the grid cell coordinate system
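A sketch of building the ground-truth grid described above. The depth of 8 is assumed here to mean one objectness score, three class scores and four box values, in the order `[p_c, c_0, c_1, c_2, x, y, w, h]`; that layout and the function name are assumptions for illustration, not a fixed YOLO convention.

```python
def build_yolo_target(objects, image_w, image_h, n_rows, n_cols, n_classes=3):
    """Return an n_rows x n_cols x depth grid as nested lists.

    objects: list of (class_id, center_x, center_y, width, height) in pixels.
    """
    depth = 1 + n_classes + 4
    target = [[[0.0] * depth for _ in range(n_cols)] for _ in range(n_rows)]
    cell_w = image_w / n_cols
    cell_h = image_h / n_rows
    for class_id, cx, cy, w, h in objects:
        # Only the cell that contains the object's midpoint gets the ground truth.
        col = min(int(cx // cell_w), n_cols - 1)
        row = min(int(cy // cell_h), n_rows - 1)
        cell = target[row][col]
        cell[0] = 1.0                    # p_c: an object is present
        cell[1 + class_id] = 1.0         # one-hot class
        # Box values are expressed relative to the grid cell.
        cell[1 + n_classes:] = [cx / cell_w - col, cy / cell_h - row,
                                w / cell_w, h / cell_h]
    return target

# 300x300 image, 3x3 grid, one object of class 1 centered at (150, 250).
target = build_yolo_target([(1, 150.0, 250.0, 120.0, 60.0)], 300, 300, 3, 3)
print(target[2][1])
```

Note that the width can exceed 1 because a box may be larger than the cell that owns its midpoint.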

Normalization


Non-Maximal Suppression

Find the best bounding box

Compare the intersection/union with the ground truth to find the best box


Filter using high prediction confidence and the IoU score
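A plain-Python sketch of IoU and a greedy non-maximal suppression pass over predicted boxes. Boxes are assumed to be `(x1, y1, x2, y2)` corners, and both thresholds are illustrative values.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def non_max_suppression(boxes, scores, score_threshold=0.5, iou_threshold=0.5):
    """Return indices of the boxes that survive, best score first."""
    # Step 1: drop low-confidence boxes and sort the rest by confidence.
    order = sorted((i for i, s in enumerate(scores) if s >= score_threshold),
                   key=lambda i: scores[i], reverse=True)
    keep = []
    # Step 2: keep a box only if it does not overlap a better box too much.
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_threshold for j in keep):
            keep.append(i)
    return keep

boxes = [(10, 10, 110, 110), (20, 20, 120, 120), (200, 200, 260, 260), (12, 8, 108, 112)]
scores = [0.9, 0.8, 0.7, 0.3]

print(non_max_suppression(boxes, scores))
```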


Overlap of two objects?

Define anchor boxes to store multiple objects -- two anchor boxes for a car and a person -- combine into the output vector


RNNs in Computer Vision


Incorporates memory, CNN is for spatiality

Dependencies over time -- previous and next input matters
Eg. distinguish a cat walking vs running


History of RNNs


Need to model Temporal data


Vanishing gradient problem

Long Short Term Memory (LSTMs) to the rescue!


Used in gesture recognition in a video frame


Feedforward recap

Calculate the hidden component (h) and output (y)
Activation function to make sure within range -- allows for non linearity!
Number of multiplications for a feedforward process


Backpropagation

Weight change is a function of the learning rate and the partial derivative
Overfitting issue

Early stopping
Regularization


Basically stochastic gradient descent with the chain rule


RNNs


Maintaining memory


Eg. predicting the next word in a sentence

includes feedback from memory elements (Elman network)


State -- Ws for weights connecting timesteps

Also include the term from the previous timestep
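A minimal NumPy sketch of the forward pass of a simple (Elman-style) RNN, using the Wx, Ws and Wy naming from these notes. The tanh activation, the sizes and the random weights are assumptions for illustration.

```python
import numpy as np

def rnn_forward(inputs, w_x, w_s, w_y, s0):
    """Run a simple RNN over a sequence and return all outputs plus the final state."""
    state = s0
    outputs = []
    for x_t in inputs:
        # New state depends on the current input AND the previous state.
        state = np.tanh(x_t @ w_x + state @ w_s)
        outputs.append(state @ w_y)
    return np.array(outputs), state

rng = np.random.default_rng(0)
input_size, state_size, output_size = 3, 4, 2
w_x = rng.normal(scale=0.5, size=(input_size, state_size))   # input -> state
w_s = rng.normal(scale=0.5, size=(state_size, state_size))   # state -> next state
w_y = rng.normal(scale=0.5, size=(state_size, output_size))  # state -> output

sequence = rng.normal(size=(5, input_size))                  # 5 timesteps
outputs, final_state = rnn_forward(sequence, w_x, w_s, w_y, np.zeros(state_size))
```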


Folded and unfolded model


Example of the sequence detection for the word udacity


Backpropagation through time

Need to consider the previous timesteps!
Three matrices

Ws, Wy and Wx
Accumulated gradients over time for both Ws and Wx!


RNN Summary

Vanishing Gradient problem!
Gradient clipping for exploding gradients
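A plain-Python sketch of gradient clipping by norm: if the gradient vector is longer than a chosen maximum, scale it down so its direction is kept but its length is capped. The function name and `max_norm` value are illustrative; PyTorch provides this as `torch.nn.utils.clip_grad_norm_`.

```python
import math

def clip_by_norm(grads, max_norm):
    """Rescale the gradient list so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

exploding = [3.0, 4.0]            # L2 norm is 5
clipped = clip_by_norm(exploding, max_norm=1.0)
small = clip_by_norm([0.1, 0.2], max_norm=1.0)   # already small, left unchanged
print(clipped, small)
```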


Enter LSTMs!

Need to remember a longer history!
LSTM cell has four components


Long short term memory cells!


Welcome, Luis!


Excellent example of animals!


RNNs only have short term memory -- LSTMs give both (they also remember stuff from a long time ago)


Three pieces of info -- long term memory, short term memory and the event

Gates are --> Forget Gate, Learn Gate, Remember Gate and Use Gate


Learn Gate

Combines short term and event and ignores a bit of it.

ignore factor


Forget gate

Takes the long term memory and forgets a part of it


Remember gate

Combines (adds) the output of forget gate and learn gate


Use gate

Gets the new output/short term memory by combining (multiplying) long term and short term memory
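A NumPy sketch of one LSTM step, labeled with the four gate names used in these notes. It follows the standard LSTM cell equations; mapping them onto the learn/forget/remember/use names is an interpretation, and the lesson's own use-gate formula may differ in detail. All parameter names and sizes are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(event, short_term, long_term, params):
    combined = np.concatenate([short_term, event])
    # Learn gate: combine short term memory and event, then ignore part of it.
    new_info = np.tanh(params["W_n"] @ combined + params["b_n"])
    ignore = sigmoid(params["W_i"] @ combined + params["b_i"])
    learned = new_info * ignore
    # Forget gate: keep only a fraction of the long term memory.
    forget = sigmoid(params["W_f"] @ combined + params["b_f"])
    kept = long_term * forget
    # Remember gate: add what was kept and what was learned.
    new_long_term = kept + learned
    # Use gate: multiply a filter with the squashed long term memory.
    use = sigmoid(params["W_o"] @ combined + params["b_o"])
    new_short_term = use * np.tanh(new_long_term)
    return new_short_term, new_long_term

rng = np.random.default_rng(0)
hidden_size, event_size = 4, 3
params = {}
for gate in ("n", "i", "f", "o"):
    params["W_" + gate] = rng.normal(scale=0.5, size=(hidden_size, hidden_size + event_size))
    params["b_" + gate] = np.zeros(hidden_size)

short_term = np.zeros(hidden_size)
long_term = np.zeros(hidden_size)
for event in rng.normal(size=(5, event_size)):
    short_term, long_term = lstm_step(event, short_term, long_term, params)
```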


RNN Batching

Can train in parallel by batching


Other architectures -- there are many that work (GRU)


Hyperparameters in RNN


Optimizer hyper parameters are for the training process


Model hyper parameters change the model structure (number of hidden units etc)


Learning Rate

The most important one!
A good starting point is 0.001
Learning rate decay decreases it over time so training can settle into the error minimum


Minibatch size

Goes from 1 (online training) to the whole training set (batch training)


Number of training iterations/early stopping


Number of hidden units/layers

Use dropout/regularization


LSTMs and GRUs vs RNNs

They win over RNNs for sure; but not clear who wins between the two


Attention Model


Add selective focus to sequences, mimicking human behavior -- attention to the most important parts
Decoder learns what to focus on during the attention phase
Attention context for each word -- how much does the other word contribute?

Attention encoder (Machine translation)


Is an RNN
Every word embedding gives a hidden state

Attention decoder


Looks at ALL hidden states

Bahdanau vs Luong Attention


Two types of attention -- Additive (Bahdanau) vs Multiplicative (Luong)
Good notebook! -- https://github.com/spro/practical-pytorch/blob/master/seq2seq-translation/seq2seq-translation.ipynb

Multiplicative Attention


Need to calculate a context vector that amplifies important parts and drowns out unimportant parts
Dot product is a similarity measure
Different types of multiplication

By adding a weight matrix, one can have different embedding sizes for encoder and decoder
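A NumPy sketch of multiplicative attention for a single decoder step. `dot_attention` scores each encoder hidden state by its dot product with the decoder state; `general_attention` inserts a learned weight matrix `w_a` so the two sides can have different sizes. Function names, shapes and values are assumptions for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def dot_attention(decoder_state, encoder_states):
    """Score = dot product; returns (context vector, attention weights)."""
    scores = encoder_states @ decoder_state
    weights = softmax(scores)
    context = weights @ encoder_states   # weighted sum of encoder states
    return context, weights

def general_attention(decoder_state, encoder_states, w_a):
    """Score = h_enc . (W_a h_dec); W_a bridges different hidden sizes."""
    scores = encoder_states @ (w_a @ decoder_state)
    weights = softmax(scores)
    return weights @ encoder_states, weights

encoder_states = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # 3 words, size 2
context, weights = dot_attention(np.array([10.0, 0.0]), encoder_states)

# Decoder state of size 3 with a hypothetical 2x3 weight matrix.
w_a = np.ones((2, 3)) * 0.1
context_g, weights_g = general_attention(np.array([1.0, 2.0, 3.0]), encoder_states, w_a)
```

In the first call the second word is nearly orthogonal to the decoder state, so it gets almost no weight and is drowned out of the context vector.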


Additive attention


Concat attention
CNN Encoder and decoder for image captioning

Attention is all you need - Transformer


No RNNs required <--> self attention
Can parallelize

Image Captioning


Introduces image captioning (CNN + RNN)
COCO dataset -- Common objects in Context -- 5 captions/image
Image ---> CNN ---> RNN

Get features using CNN (Don't need classification, so don't need the later layers)


Need to tokenize -- convert to vocab; also add <start> and <end> tokens
NLTK tokenization

Whitespace tokenization


Decoder is made up of LSTMs

Always starts with the <start> token
Recurrence function used to predict the correct next word


Video Captioning

Feature extraction becomes an array of image frames


Object Tracking and Intro

Introduction to Motion


How to represent Motion -- SLAM
A video stream is just a sequence of image frames
Optical Flow

Pixel intensities do not change between frames
Neighboring pixels have similar motion
Representing motion and direction (u, v)
Get the magnitude and direction of points


Assumption is that colors (intensities) of points do not change over time!

Probability Review


Dependent, Independent events; joint probability
Bayes Rule

Bayes' Rule is extremely important in robotics and it can be summarized in one sentence: 
given an initial prediction, if we gather additional data (data that our initial prediction depends on), 
we can improve that prediction!


Have sensors such as Camera, Lidar, Radar and internal sensors

Add more sensor data to improve estimation of car's location


Probability distribution

Mathematical way to represent probability/uncertainty
Discrete vs continuous probability functions
Prior (before sensor data) vs Posterior (after sensor data) probabilities


Localization problem

Can't localize a car with enough accuracy using GPS, 2-10 m is a LOT of error
Gives the robot example -- prior and posterior belief (before and after measurements)
Probability before and after sense

It is uniform in the beginning, it is more towards the measurement if that is MORE CERTAIN (i.e. has lower variance)
Covariance is smaller -- as there is MORE INFORMATION (MIND BLOWN!)


Parameter update:

Means are updated; weighted by variances
Variances are also given by an equation


Probabilities are normalized

Multiple measurements


Robot Motion

Shifting of probabilities


Inaccurate motion

A robot can overshoot or undershoot the target


Localization is nothing but sense <--> move <--> sense <--> move AND REPEAT..; along with initial belief

Every time it senses, it GAINS information; every time it moves, it LOSES information

A measurement step should decrease entropy whereas a motion step should increase entropy.
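A plain-Python sketch of the sense/move cycle on a small cyclic 1D world, with an entropy check after each step. The world colors and the hit/miss and overshoot/undershoot probabilities are made-up illustration values.

```python
import math

def sense(p, world, measurement, p_hit=0.6, p_miss=0.2):
    """Measurement step: weight each cell by how well it matches, then normalize."""
    q = [prob * (p_hit if cell == measurement else p_miss)
         for prob, cell in zip(p, world)]
    total = sum(q)
    return [v / total for v in q]

def move(p, step, p_exact=0.8, p_overshoot=0.1, p_undershoot=0.1):
    """Motion step: shift the belief, allowing for over- and undershooting."""
    n = len(p)
    return [p_exact * p[(i - step) % n]
            + p_overshoot * p[(i - step - 1) % n]
            + p_undershoot * p[(i - step + 1) % n]
            for i in range(n)]

def entropy(p):
    return -sum(v * math.log(v) for v in p if v > 0)

world = ["green", "red", "red", "green", "green"]
prior = [0.2] * 5                        # uniform initial belief

after_sense = sense(prior, world, "red")
after_move = move(after_sense, 1)

print(entropy(prior), entropy(after_sense), entropy(after_move))
```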


Kalman Filters


Used for localization

Kalman is continuous, monte carlo is discrete
Distribution is gaussian
Measurement -- update -- prediction

Start with a wide gaussian, then shift the mean


Gaussian update motion

Kalman filter in code
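A plain-Python sketch of the 1D Kalman filter cycle, assuming Gaussian beliefs described by a mean and a variance. `update` is the measurement step (means weighted by the other Gaussian's variance) and `predict` is the motion step (means and variances add). The measurement and motion values are illustrative.

```python
def update(mean, var, meas_mean, meas_var):
    """Measurement update: the result is pulled toward the more certain Gaussian."""
    new_mean = (meas_var * mean + var * meas_mean) / (var + meas_var)
    new_var = 1.0 / (1.0 / var + 1.0 / meas_var)
    return new_mean, new_var

def predict(mean, var, motion, motion_var):
    """Motion update: shift the mean and grow the uncertainty."""
    return mean + motion, var + motion_var

measurements = [5.0, 6.0, 7.0, 9.0, 10.0]
motions = [1.0, 1.0, 2.0, 1.0, 1.0]
measurement_var, motion_var = 4.0, 2.0

mean, var = 0.0, 10000.0                 # start with a very wide Gaussian
for z, u in zip(measurements, motions):
    mean, var = update(mean, var, z, measurement_var)
    mean, var = predict(mean, var, u, motion_var)

print(mean, var)
```

The updated variance is always smaller than either input variance, which is the "more information" effect noted earlier.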


Real world is multidimensional


Representing state and motion


Sense and move, sense and move

The beauty of Kalman filters is that they combine somewhat inaccurate sensor measurements with somewhat inaccurate predictions of motion to get a filtered location estimate that is better than any estimates that come from only sensor readings or only knowledge about movement.


Representing state of the car

(Initial position, velocity)


Motion model

Assume constant velocity (Constant velocity model)
Different types -- constant velocity and constant acceleration


Representing states

OOP for programming, Linear Algebra for the math
Creating a car object
__repr__ for string representation; operator overloading etc.


Matrix multiplication, state transformation matrix

State vectors are always column vectors, to have dependencies between x and y
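A NumPy sketch of the constant velocity model as a matrix multiplication on a column state vector `[position, velocity]`. The function name and the numbers are hypothetical.

```python
import numpy as np

def predict_state(state, dt):
    """Constant velocity model: x' = x + v * dt, v' = v."""
    transition = np.array([[1.0, dt],
                           [0.0, 1.0]])
    return transition @ state

# Column vector: position 0 m, velocity 50 m/s.
state = np.array([[0.0],
                  [50.0]])

next_state = predict_state(state, dt=0.1)
print(next_state.ravel())
```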


Matrices and transformation


Multivariate gaussian

Mean is now a vector, covariance matrix


Kalman filter extends the 1D case to a 2D estimate (location + velocity)
Location is correlated with velocity -- velocity has a list of possibilities and you consider them to get the location -- gives you a 2d Gaussian!

Can estimate velocity as well, even though it cannot be measured directly
Location is observable, velocity is hidden; but these two interact, so from the observable I can infer velocity


Kalman filter equations

Simplifies equations

Lower case variables means vectors, upper case means matrices


Goes over coding matrix operations such as matrix multiplication, addition, subtraction, inverse etc

SLAM: Simultaneous Localization and Mapping


Robot needs to build a map
Graph SLAM

Relative motion constraints (like rubber bands)
Relative measurement constraints


Matrices are updated

Omega and xi -- multiply the inverse of omega by xi to get the result (mu)
Adding noisy measurements affects only affected areas, NOT EVERYTHING!
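A NumPy sketch of a tiny 1D Graph SLAM problem with three poses and one landmark. Each constraint touches only the rows and columns of the variables it involves, and the final estimate is the inverse of omega times xi (computed here with `np.linalg.solve`). The helper names, the `weight` parameter (think 1 / noise variance) and the numbers are assumptions for illustration.

```python
import numpy as np

def add_anchor(omega, xi, i, value, weight=1.0):
    """Constraint x_i = value (the initial position)."""
    omega[i, i] += weight
    xi[i] += weight * value

def add_relative_constraint(omega, xi, i, j, delta, weight=1.0):
    """Constraint x_j - x_i = delta (a motion or a landmark measurement)."""
    omega[i, i] += weight
    omega[j, j] += weight
    omega[i, j] -= weight
    omega[j, i] -= weight
    xi[i] -= weight * delta
    xi[j] += weight * delta

n = 4                                   # x0, x1, x2 and one landmark
omega = np.zeros((n, n))
xi = np.zeros(n)

add_anchor(omega, xi, 0, -3.0)                  # start at -3
add_relative_constraint(omega, xi, 0, 1, 5.0)   # move +5
add_relative_constraint(omega, xi, 1, 2, 3.0)   # move +3
add_relative_constraint(omega, xi, 0, 3, 10.0)  # landmark seen 10 ahead of x0
add_relative_constraint(omega, xi, 2, 3, 2.0)   # landmark seen 2 ahead of x2

mu = np.linalg.solve(omega, xi)         # same as inverse(omega) @ xi
print(mu)
```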


Vehicle motion and calculus


Odometer and Inertial Measurement Unit
Position vs time graphs
Rate gyros -- the yaw rate of a vehicle can be measured by a rate gyro.
Bias and errors that integrate and accumulate
Trigonometry for calculating x and y components
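A plain-Python sketch of dead reckoning: integrate the yaw rate to get heading, then use trigonometry to split the distance travelled into x and y components. The function name, speed and time step are made-up values; the second call shows how a small constant gyro bias accumulates into heading and position error.

```python
import math

def dead_reckon(x, y, heading, speed, yaw_rates, dt):
    """Integrate yaw rate and speed over time (simple Euler steps)."""
    for yaw_rate in yaw_rates:
        heading += yaw_rate * dt                 # integrate yaw rate into heading
        x += speed * math.cos(heading) * dt      # x component of the motion
        y += speed * math.sin(heading) * dt      # y component of the motion
    return x, y, heading

# Driving straight for 1 second at 10 m/s with a perfect gyro.
x_true, y_true, heading_true = dead_reckon(0.0, 0.0, 0.0, 10.0, [0.0] * 10, 0.1)

# Same drive, but the gyro reports a constant 0.02 rad/s bias.
x_bias, y_bias, heading_bias = dead_reckon(0.0, 0.0, 0.0, 10.0, [0.02] * 10, 0.1)

print((x_true, y_true), (x_bias, y_bias), heading_bias)
```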