@sgoyal1012
Last active June 17, 2022 04:28
Computer Vision Nanodegree

Image Representation and Classification

  • Emotional intelligence

    • Ability to read and understand emotions -- Affectiva uses emotions on face, i.e. facial features.
  • Color Images representation

  • Depth is the number of channels
If the identification problem is easier in color for us humans, it’s likely easier for an algorithm to see color images too!
  • Color thresholds

    • Used to select an area of interest (blue screen example)
  • Blue Screen example

    • Uses cv2 library
    • OpenCV reads images as BGR instead of RGB -- use cvtColor to convert
    • Apply a mask -- use inRange method
  • Green screen notebook

  • Color Spaces

    • RGB, HSV and HLS
      • HSV for different lighting conditions
    • What is not separable in one color space may be separable in another!!
  • Day and night tasks

    • Distinguish day and night pictures
    • Labels -- why are they needed?
    • Features that can distinguish images
      • A feature is a measurable component of an image or object that is, ideally, unique and recognizable under varying conditions
    • One-hot encoding etc for labels
    • Day and night notebook
    • MUST STANDARDIZE THE INPUT
    • Gets average brightness using the HSV space -- V is the brightness value
      • Sum of v/(width * height)
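The average-brightness feature above can be sketched in a few lines. This is a minimal illustration, not the course notebook: it uses the standard-library `colorsys` module instead of OpenCV, represents an image as nested lists of RGB tuples, and the function names and the threshold of 100 are assumptions made for the example.

```python
import colorsys

def average_brightness(rgb_image):
    """Mean HSV value (V) of an image, on a 0-255 scale.

    rgb_image is a list of rows; each row is a list of (R, G, B)
    tuples with components in 0-255.
    """
    height = len(rgb_image)
    width = len(rgb_image[0])
    total_v = 0.0
    for row in rgb_image:
        for r, g, b in row:
            # colorsys works on floats in [0, 1]; V is the third component.
            _, _, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
            total_v += v * 255.0
    # Sum of V divided by the number of pixels (width * height).
    return total_v / (width * height)

def classify_day_night(rgb_image, threshold=100.0):
    """Return 1 for day, 0 for night. The threshold is an assumed value."""
    return 1 if average_brightness(rgb_image) > threshold else 0

bright = [[(200, 220, 255)] * 4 for _ in range(3)]
dark = [[(10, 10, 30)] * 4 for _ in range(3)]
print(classify_day_night(bright), classify_day_night(dark))
```

In practice the threshold would be tuned on the standardized training images rather than fixed by hand.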

Convolutional Transforms and Edge Detection

  • Edge detection

  • Frequencies in images

    • Fourier Transform!
    • Images have high and low frequency components
  • High pass filters

    • Enhances edges
    • Kernel weights MUST sum to zero (otherwise it will brighten/darken the image)
    • Image edges can be handled with three techniques --> extend, crop and pad
    • Sobel --> calculate gradient magnitude and direction of gradients
  • Low pass filters

    • High pass filters ENHANCE noise
    • Low pass filters can filter out high frequency noises
    • Averaging filter (Normalized)
  • Gaussian Blur

    • Helps a lot in getting rid of noise before edge detection
  • Canny edge detector

    • Uses non-maximum suppression and hysteresis thresholding to find the best edges
  • Hough Transform

    • Hough Space!
  • Haar cascades face detection

    • Trains on positive (face images) and negative (non face images)
    • Haar features capture facial features (similar to edge detection)
    • Cascades and keeps throwing away non-face areas
  • Types of features

    • Use color and shape features together
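As a rough illustration of the high-pass filter notes above, here is a pure-Python sketch that slides a Sobel x kernel over a small grayscale image. The helper name `filter2d` is made up for this example, the edges are handled by cropping, and the kernel is applied without flipping (correlation), which is an assumption about the convention being used.

```python
def filter2d(image, kernel):
    """Slide a square kernel over a grayscale image (list of lists).

    Only positions where the kernel fits entirely are computed, so the
    output is smaller than the input (the "crop" edge-handling option).
    """
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = 0
            for di in range(k):
                for dj in range(k):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        output.append(row)
    return output

# Sobel x kernel: responds to vertical edges; its weights sum to zero.
sobel_x = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]

# 5x6 image: dark left half, bright right half (one vertical edge).
edge_image = [[0, 0, 0, 100, 100, 100] for _ in range(5)]
flat_image = [[100] * 6 for _ in range(5)]

print(filter2d(edge_image, sobel_x))  # each row is [0, 400, 400, 0]
print(filter2d(flat_image, sobel_x))  # all zeros: no edges, no response
```

Because the weights sum to zero, a region of constant brightness gives no response, which is the reason behind the sum-to-zero rule in the notes.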

Types of features and image segmentation

Lesson 7: Feature Vectors

  • Feature Vectors

    • Combine multiple features (like corners) into a single vector
  • Real-Time Feature Detection

  • Scale and Rotation Invariance

  • ORB in video

  • ORB properties

    • Scale Invariance
    • Rotational Invariance
    • Illumination Invariance
    • Noise Invariance
  • HOG

    • Histogram of oriented gradients
    • Use sobel to get gradient directions and magnitudes by grouping into square cells
    • Works in cases such as pedestrian detection -- i.e. generalization

CNN Layers and Feature Visualization

  • Introduction to CNN layers

  • Combine various kernels together to form a convolution layer

    • Combine different filters
    • For color images, filters are also 3D
  • Find patterns with patterns within patterns..

  • Convolutional layers are locally connected

  • CNN learns filters and patterns!

  • PyTorch's the new shizz yo

  • Pooling layers

    • Pooling layers reduce the dimensionality of an input
    • Higher dimensionality means more parameters, means overfitting
    • The PANCAKE analogy
  • Fully connected layer

    • Dropout layer to prevent overfitting
  • Training, loss functions, cross entropy etc

  • IMPORTANT -- Output layer size

    • output_dim = (W-F)/S + 1 -- W is the input width/height, F is the filter kernel size, S is the stride
  • Dropout

    • The GYM analogy! -- dont do one body part every fuckin day! ;)
    • Prevent one part from dominating by randomly turning off nodes, with some probability
  • Momentum

    • Helps power past a local minimum and move toward a better (ideally the global) minimum
    • Attach weight to the steps, the most recent step weighted the most
  • Learn how to design a network from existing networks that work!

    • DO NOT REINVENT THE WHEEL
  • Feature Visualization

    • Example between wolves and dogs
    • Model learnt that snow means wolf (because of limited data, network learnt that snow is wolf)
  • Feature Maps/Activation Maps

    • Each filter activates some property in an image (Eg. edges)
    • First convolutional layer can give some idea as to types of features being learnt
    • As you progress, layers detect more and more complex features
  • Visualize closeness in feature space

  • Occlusion, Saliency and Guided Backpropagation

    • Occlusion occludes parts of image
    • Saliency Maps -- denote the importance of each pixel; change a pixel value by a little and see how/if it affects the class output
    • Guided backpropagation
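The output-size formula from the notes above, output_dim = (W-F)/S + 1, is easy to check in code. The sketch below adds an optional `padding` argument, which is an extension that is not in the notes (with padding 0 it reduces to the same formula), and it uses integer division on the assumption that a partial window at the border is dropped.

```python
def conv_output_dim(input_size, kernel_size, stride=1, padding=0):
    """Width/height after a convolution or pooling layer.

    With padding=0 this is (W - F) / S + 1 from the notes.
    """
    return (input_size + 2 * padding - kernel_size) // stride + 1

print(conv_output_dim(224, 5))            # 220: 5x5 conv, stride 1
print(conv_output_dim(220, 2, stride=2))  # 110: 2x2 max pool, stride 2
print(conv_output_dim(32, 3, padding=1))  # 32: padding keeps the size
```

Chaining these calls layer by layer gives the flattened size needed for the first fully connected layer.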
Facial Keypoint Detection Project
  • PyTorch Dataset class and inheritance
  • Creating transforms -- rescale, cropping etc
  • Define a well performing CNN structure
    • What does "well" mean?

"Well" means that the model's loss decreases during training and, when applied to test image data, the model produces keypoints that closely match the true keypoints of each face. And you'll see examples of this later in the notebook.
Whole Pipeline
  • Haar Detector Cascades for the

Advanced Computer Vision & Deep Learning

CNNs and Scene understanding
  • A real world has multiple objects (not just an image of the object to recognize)
  • Bounding box around an object
  • Example problem -- in a basketball game, who has the ball? Localization
  • Want to see the location of an object in the image (and not just identify it)

Detect bounding boxes

  • Classification vs regression
  • What loss function to use?
  • MSE loss is sensitive to outliers
  • Smooth L1 loss
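To see why Smooth L1 is preferred over MSE for box regression, the sketch below compares the two on a prediction with one badly wrong coordinate. The functions are plain-Python stand-ins written for this example (the `beta` switch point of 1.0 is an assumed default), and the box values are invented.

```python
def mse_loss(pred, target):
    """Mean squared error over the box coordinates."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def smooth_l1_loss(pred, target, beta=1.0):
    """Quadratic for small errors, linear for large ones."""
    total = 0.0
    for p, t in zip(pred, target):
        diff = abs(p - t)
        if diff < beta:
            total += 0.5 * diff ** 2 / beta
        else:
            total += diff - 0.5 * beta
    return total / len(pred)

target = [10.0, 20.0, 50.0, 80.0]     # hypothetical (x, y, w, h)
close = [11.0, 19.0, 50.0, 81.0]      # small errors everywhere
outlier = [10.0, 20.0, 50.0, 180.0]   # one coordinate off by 100

print(mse_loss(close, target), smooth_l1_loss(close, target))
print(mse_loss(outlier, target), smooth_l1_loss(outlier, target))
```

The single outlier dominates the MSE because its error is squared, while the Smooth L1 value grows only linearly with it.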

Multiple Objects in an image

  • Can't know how many objects there are beforehand!
  • One approach is to use a sliding window to crop images
  • To limit the number of cropped regions, propose regions only where objects are uniform (same texture etc)
  • Image Proposals -- ROIs

R-CNNs

  • Works on regions of cropped images
  • Time intensive as goes through all regions
  • Outputs class scores per object for every region -- background also has a class score
Faster R-CNNs

Real Time Detection -- YOLO

  • Adds bounding-box parameters to the class probabilities

  • Improves over the normal sliding window approach (which is computationally intensive)

    • Can improve by choosing the sliding windows smartly so that they do not overlap (all windows are contained in the output layer)
  • YOLO uses a grid of cells -- classes and bounding box -- and assigns the scores and probabilities to the grid box coordinates

  • Training on a grid requires special data -- GROUND TRUTH TO EVERY GRID!

  • Example is that a grid cell has depth of 8 -- so shape is nRowGrids * nColGrids * Depth

  • Get mid point of an object and assign the ground truth to ONLY THE BOX that contains the middle point; the values are relative to the grid cell coordinate system

    • Normalization
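  • Non-Maximal Suppression

    • Find the best bounding box
      • Keep the box with the highest prediction confidence, then compute its intersection over union (IoU) with the other predicted boxes
    • Discard low-confidence boxes, and boxes whose IoU with the kept box is high (they cover the same object)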
  • Overlap of two objects?

    • Define anchor boxes to store multiple objects -- two anchor boxes for a car and a person -- combine into the output vector
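The IoU and non-maximal suppression steps can be sketched as follows. This is a simplified single-class version, not YOLO's actual implementation: boxes are assumed to be `(x1, y1, x2, y2)` corner tuples, and both threshold values are illustrative.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def non_max_suppression(boxes, scores, iou_threshold=0.5, score_threshold=0.3):
    """Return indices of the boxes to keep, best score first."""
    # Drop low-confidence boxes, then sort the rest by confidence.
    order = [i for i in range(len(boxes)) if scores[i] >= score_threshold]
    order.sort(key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Remove boxes that overlap the kept box too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (0, 0, 5, 5)]
scores = [0.9, 0.8, 0.7, 0.1]
print(non_max_suppression(boxes, scores))  # [0, 2]
```

Box 1 is dropped because it overlaps box 0 (IoU about 0.68), and box 3 is dropped for low confidence.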

RNNs in Computer Vision

  • Incorporates memory, CNN is for spatiality

    • Dependencies over time -- previous and next input matters
    • Eg. distinguish a cat walking vs running
  • History of RNNs

    • Need to model Temporal data

    • Vanishing gradient problem

      • Long Short Term Memory (LSTMs) to the rescue!
    • Used in gesture recognition in a video frame

  • Feedforward recap

    • Calculate the hidden component (h) and output (y)
    • Activation function to make sure within range -- allows for non linearity!
    • Number of multiplications for a feedforward process
  • Backpropagation

    • Weight change is a function of the learning rate and the partial derivative
    • Overfitting issue
      • Early stopping
      • Regularization
    • Basically stochastic gradient descent with the chain rule
  • RNNs

    • Maintaining memory

    • Eg. predicting the next word in a sentence

      • includes feedback from memory elements (Elman network)
    • State -- Ws for weights connecting timesteps

      • Also include the term from the previous timestep
    • Folded and unfolded model

    • Example of the sequence detection for the word udacity

  • Backpropagation through time

    • Need to consider the previous timesteps!
    • Three matrices
      • Ws, Wy and Wx
      • Accumulative gradients over time! for both Ws and Wx
  • RNN Summary

    • Vanishing Gradient problem!
    • Gradient clipping for exploding gradients
  • Enter LSTMs!

    • Need to remember a longer history!
    • LSTM cell has four components
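A tiny forward pass shows how the state carries memory between timesteps in a simple (Elman-style) RNN. To keep it readable, the sketch uses scalar weights named after the notes (Wx, Ws, Wy); real networks use matrices, and the tanh activation and the weight values are assumptions for the example.

```python
import math

def rnn_forward(inputs, w_x, w_s, w_y, initial_state=0.0):
    """Run a scalar RNN over a sequence and return (outputs, final_state).

    w_x: input -> state, w_s: previous state -> state, w_y: state -> output.
    """
    state = initial_state
    outputs = []
    for x in inputs:
        # The new state mixes the current input with the previous state.
        state = math.tanh(w_x * x + w_s * state)
        outputs.append(w_y * state)
    return outputs, state

sequence = [1.0, 0.0, 0.0]

with_memory, _ = rnn_forward(sequence, w_x=1.0, w_s=0.5, w_y=1.0)
no_memory, _ = rnn_forward(sequence, w_x=1.0, w_s=0.0, w_y=1.0)

print(with_memory)  # the first input still echoes at later timesteps
print(no_memory)    # without Ws the later outputs are exactly 0
```

The echo shrinks at every step, which is a small-scale view of why plain RNNs struggle with long histories.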

Long short term memory cells!

  • Welcome, Luis!

  • Excellent example of animals!

  • RNNs only have short term memory -- LSTMs give both (remember stuff from a long time ago)

  • Three pieces of info -- long term, short term and event

    • Gates are --> Forget Gate, Learn Gate, Remember Gate and Use Gate
  • Learn Gate

    • Combines short term and event and ignores a bit of it.
      • ignore factor
  • Forget gate

    • Takes the long term memory and forgets a part of it
  • Remember gate

    • Combines (adds) the output of forget gate and learn gate
  • Use gate

    • Gets the new output/short term memory by combining (multiplying) long term and short term memory
  • RNN Batching

    • Can train in parallel by batching
  • Other architectures -- there are many that work (GRU)
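One way to read the four gates above is as the pieces of a standard LSTM cell update. The sketch below is a scalar toy version under that assumption: the mapping of gate names to equations is my interpretation of the notes, and the `params` dictionary with its `(w_event, w_short, bias)` triples is hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(event, short_term, long_term, params):
    """One scalar LSTM step; returns (new_short_term, new_long_term).

    params maps a name to (w_event, w_short, bias); all values are
    illustrative scalars rather than learned weight matrices.
    """
    def pre(name):
        w_e, w_s, b = params[name]
        return w_e * event + w_s * short_term + b

    forget = sigmoid(pre("forget"))          # fraction of long term kept
    ignore = sigmoid(pre("ignore"))          # fraction of new info kept
    candidate = math.tanh(pre("candidate"))  # short term + event combined
    use = sigmoid(pre("use"))

    learned = ignore * candidate             # learn gate
    kept = forget * long_term                # forget gate
    new_long = kept + learned                # remember gate (adds)
    new_short = use * math.tanh(new_long)    # use gate (multiplies)
    return new_short, new_long

# Forget gate fully open and learn gate closed: long term memory survives.
params = {"forget": (0.0, 0.0, 20.0), "ignore": (0.0, 0.0, -20.0),
          "candidate": (1.0, 1.0, 0.0), "use": (0.0, 0.0, 0.0)}
print(lstm_step(event=1.0, short_term=0.5, long_term=0.8, params=params))
```

With these extreme biases the long term memory passes through almost unchanged, which is the behaviour that lets LSTMs remember over many steps.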

Hyperparameters in RNN

  • Optimizer hyper parameters are for the training process

  • Model hyper parameters change the model structure (number of hidden units etc)

  • Learning Rate

    • The most important one!
    • A good starting point is 0.001
    • Learning rate decay -- decrease it over training so the optimizer can settle into the minimum of the error
  • Minibatch size

    • Goes from 1 (online training) to the whole training set (batch training)
  • Number of training iterations/early stopping

  • Number of hidden units/layers

    • Use dropout/regularization
  • LSTMs and GRUs vs RNNs

    • They win over RNNs for sure; but not clear who wins between the two
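As a small illustration of learning rate decay, here is a step-decay schedule starting from the 0.001 suggested above. The halving factor and the 10-epoch interval are assumed values for the example, not recommendations from the course.

```python
def decayed_learning_rate(initial_lr, epoch, decay_rate=0.5, decay_every=10):
    """Step decay: multiply the rate by decay_rate every decay_every epochs."""
    return initial_lr * decay_rate ** (epoch // decay_every)

for epoch in (0, 9, 10, 25):
    print(epoch, decayed_learning_rate(0.001, epoch))
```

Large steps early on cover ground quickly, and the smaller steps later avoid bouncing around the minimum.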

Attention Model

  • Add selective focus to sequences, mimicking human behavior -- attention to the most important parts
  • Decoder learns what to focus on during the attention phase
  • Attention context for each word -- how much does the other word contribute?

Attention encoder (Machine translation)

  • Is an RNN
  • Every word embedding gives a hidden state

Attention decoder

  • Looks at ALL hidden states

Bahdanau vs Luong Attention

Multiplicative Attention

  • Need to calculate a context vector, that amplifies imp parts and drowns out non-important parts
  • Dot product is a similarity measure
  • Different types of multiplication
    • By adding a weight matrix, one can have different embedding sizes for the encoder and decoder
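The context-vector calculation for the plain dot-product variant can be sketched like this. It is a pure-Python toy: hidden states are short lists, the softmax normalization of the scores is the usual choice but is not spelled out in the notes, and the example vectors are made up.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot_attention(decoder_state, encoder_states):
    """Return (weights, context) for dot-product attention."""
    # Similarity of the decoder state with every encoder hidden state.
    scores = [sum(d * h for d, h in zip(decoder_state, enc))
              for enc in encoder_states]
    weights = softmax(scores)
    # Context vector: weighted sum of the encoder hidden states.
    dim = len(encoder_states[0])
    context = [sum(w * enc[k] for w, enc in zip(weights, encoder_states))
               for k in range(dim)]
    return weights, context

encoder_states = [[1.0, 0.0], [0.0, 1.0]]
weights, context = dot_attention([3.0, 0.0], encoder_states)
print(weights, context)
```

The first encoder state points in the same direction as the decoder state, so it receives most of the weight and dominates the context vector.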

Additive attention

  • Concat attention
  • CNN Encoder and decoder for image captioning

Attention is all you need - Transformer

  • No RNNs required <--> self attention
  • Can parallelize

Image Captioning

  • Introduces image captioning (CNN + RNN)
  • COCO dataset -- Common objects in Context -- 5 captions/image
  • Image ---> CNN ---> RNN
    • Get features using CNN (Don't need classification, so don't need the later layers)
  • Need to tokenize -- convert to vocab; also add <start> and <end> tokens
  • NLTK tokenization
    • Whitespace tokenization
  • Decoder is made up of LSTMs
    • Always starts with the <start> token
    • Recurrence function used to predict the correct next word
  • Video Captioning
    • Feature extraction becomes an array of image frames
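The caption preprocessing described above (tokenize, map to a vocabulary, add `<start>` and `<end>`) might look roughly like this. The sketch uses whitespace tokenization only, so punctuation handling is left out, and the `<unk>` token for unseen words plus the helper names are assumptions for the example.

```python
from collections import Counter

def tokenize(caption):
    """Lowercase, split on whitespace, and wrap with <start>/<end>."""
    return ["<start>"] + caption.lower().split() + ["<end>"]

def build_vocab(captions, min_count=1):
    """Map each sufficiently frequent word to an integer id."""
    counts = Counter(w for caption in captions for w in caption.lower().split())
    vocab = {"<start>": 0, "<end>": 1, "<unk>": 2}
    for word in sorted(counts):
        if counts[word] >= min_count and word not in vocab:
            vocab[word] = len(vocab)
    return vocab

def encode(caption, vocab):
    """Convert a caption to a list of ids; unseen words become <unk>."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokenize(caption)]

captions = ["A dog runs", "a cat sits"]
vocab = build_vocab(captions)
print(tokenize("A dog runs"))
print(encode("a dog flies", vocab))  # [0, 3, 5, 2, 1]
```

The id sequence is what gets fed to the embedding layer of the LSTM decoder.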

Object Tracking and Intro

Introduction to Motion

  • How to represent Motion -- SLAM
  • A video stream is just a sequence of image frames
  • Optical Flow
    • Pixel intensities do not change between frames
    • Neighboring pixels have similar motion
    • Representing motion and direction (u, v)
    • Get the magnitude and direction of points
  • Assumption is that colors (intensities) of points do not change over time!

Probability Review

  • Dependent, Independent events; joint probability
  • Bayes Rule
Bayes' Rule is extremely important in robotics and it can be summarized in one sentence: 
given an initial prediction, if we gather additional data (data that our initial prediction depends on), 
we can improve that prediction!
  • Have sensors such as Camera, Lidar, Radar and internal sensors

    • Add more sensor data to improve estimation of car's location
  • Probability distribution

    • Mathematical way to represent probability/uncertainty
    • Discrete vs continuous probability functions
    • Prior (before sensor data) vs Posterior (after sensor data) probabilities
  • Localization problem

    • Can't localize a car with enough accuracy using GPS, 2-10 m is a LOT of error
    • Gives the robot example -- prior and posterior belief (before and after measurements)
    • Probability before and after sense
      • It is uniform in the beginning, it is more towards the measurement if that is MORE CERTAIN (i.e. has lower variance)
      • Covariance is smaller -- as there is MORE INFORMATION (MIND BLOWN!)
    • Parameter update:
      • Means are updated; weighted by variances
      • Variances are also given by an equation
    • Probabilities are normalized
      • Multiple measurements
  • Robot Motion

    • Shifting of probabilities
  • Inaccurate motion

    • A robot can overshoot or undershoot the target
  • Localization is nothing but sense <--> move <--> sense <--> move AND REPEAT..; along with initial belief

    • Every time it senses, it GAINS information; every time it moves, it LOSES information

    A measurement (sense) step should decrease entropy, whereas a motion (move) step should increase entropy.
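The sense/move cycle and its effect on entropy can be checked with a 1D histogram filter. This is a sketch in the spirit of the robot example: the five-cell cyclic world, the hit/miss probabilities and the overshoot/undershoot probabilities are all assumed values.

```python
import math

def sense(belief, world, measurement, p_hit=0.6, p_miss=0.2):
    """Bayes update: weight each cell by how well it matches, then normalize."""
    weighted = [b * (p_hit if cell == measurement else p_miss)
                for b, cell in zip(belief, world)]
    total = sum(weighted)
    return [w / total for w in weighted]

def move(belief, step, p_exact=0.8, p_over=0.1, p_under=0.1):
    """Shift the belief by `step` cells in a cyclic world, with inexact motion."""
    n = len(belief)
    moved = []
    for i in range(n):
        prob = p_exact * belief[(i - step) % n]      # moved exactly `step`
        prob += p_over * belief[(i - step - 1) % n]  # overshot by one cell
        prob += p_under * belief[(i - step + 1) % n] # undershot by one cell
        moved.append(prob)
    return moved

def entropy(belief):
    """Shannon entropy of the belief (higher means less certain)."""
    return -sum(p * math.log(p) for p in belief if p > 0)

world = ["green", "red", "red", "green", "green"]
prior = [0.2] * 5                       # uniform initial belief
after_sense = sense(prior, world, "red")
after_move = move(after_sense, 1)

print(entropy(prior), entropy(after_sense), entropy(after_move))
```

Entropy drops after the measurement and rises again after the inexact move, matching the gain/loss of information described above.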

Kalman Filters

  • Used for localization
    • Kalman is continuous, monte carlo is discrete
    • Distribution is gaussian
    • Measurement -- update -- prediction
      • Start with a wide gaussian, then shift the mean
    • Gaussian update motion
      • Kalman filter in code
    • Real world is multidimensional
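The 1D Kalman filter cycle reduces to two small functions. The sketch below uses the standard Gaussian product for the measurement update and Gaussian addition for the motion step; the function names and the numbers in the example are chosen for illustration.

```python
def measurement_update(mean, var, meas_mean, meas_var):
    """Combine the prior belief with a measurement (both Gaussian).

    The new mean is a variance-weighted average, and the new variance
    is smaller than either input variance.
    """
    new_mean = (meas_var * mean + var * meas_mean) / (var + meas_var)
    new_var = 1.0 / (1.0 / var + 1.0 / meas_var)
    return new_mean, new_var

def predict(mean, var, motion, motion_var):
    """Shift the belief by the motion; uncertainty adds up."""
    return mean + motion, var + motion_var

# Prior N(10, 8) combined with a more certain measurement N(13, 2).
mean, var = measurement_update(10.0, 8.0, 13.0, 2.0)
print(mean, var)   # mean moves toward 13; variance falls below 2

# Then move by 10 with motion variance 6.
print(predict(mean, var, 10.0, 6.0))
```

The updated mean lands closer to the measurement because the measurement has the lower variance.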

Representing state and motion

  • Sense and move, sense and move
The beauty of Kalman filters is that they combine somewhat inaccurate sensor measurements with somewhat inaccurate predictions of motion to get a filtered location estimate that is better than any estimates that come from only sensor readings or only knowledge about movement.
  • Representing state of the car

    • (Initial position, velocity)
  • Motion model

    • Assume constant velocity (Constant velocity model)
    • Different types -- constant velocity and constant acceleration
  • Representing states

    • OOP for programming, Linear Algebra for the math
    • Creating a car object
    • __repr__ for string representation; Operator Overloading etc.
  • Matrix multiplication, state transformation matrix

    • State vectors are always column vectors, to have dependencies between x and y
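The constant velocity model can be written as a state transformation matrix applied to the state vector. In this sketch the state is `[position, velocity]` stored as a plain list, and the function name and example values are assumptions.

```python
def predict_state(state, dt):
    """Advance a [position, velocity] state by dt under constant velocity.

    Equivalent to multiplying the column vector by [[1, dt], [0, 1]]:
    position' = position + velocity * dt, velocity' = velocity.
    """
    transition = [[1.0, dt],
                  [0.0, 1.0]]
    return [sum(transition[i][j] * state[j] for j in range(2))
            for i in range(2)]

state = [0.0, 50.0]               # start at 0 m, moving at 50 m/s
print(predict_state(state, 3.0))  # [150.0, 50.0]
```

A constant acceleration model would add a third state component and a larger matrix.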

Matrices and transformation

  • Multivariate gaussian
    • Mean is now a vector, covariance matrix
  • Kalman filter extends the 1D case to a 2D estimate (location + velocity)
  • Location is correlated with velocity -- velocity has a list of possibilities and you consider them to get the location -- gives you a 2d Gaussian!
    • Can estimate velocity as well, even though cannot measure
    • Location is observable, velocity is hidden; but these two interact, so from the observables I can infer velocity
  • Kalman filter equations
    • Simplifies equations
      • Lower case variables means vectors, upper case means matrices
  • Goes over coding matrix operations such as matrix multiplication, addition, subtraction, inverse etc

SLAM: Simultaneous Localization and Mapping

  • Robot needs to build a map
  • Graph SLAM
    • Relative motion constraints (like rubber bands)
    • Relative measurement constraints
  • Matrices are updated
    • Omega and xi -- multiply the inverse of omega by xi to get the result (mu = omega^-1 * xi)
    • Adding noisy measurements affects only affected areas, NOT EVERYTHING!
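A 1D Graph SLAM toy problem shows how constraints fill omega and xi. The sketch assumes three poses, an initial position of -3 and two motions (+5, then +3), all invented for the example; it solves the linear system with a small Gaussian elimination helper instead of forming the inverse explicitly, which gives the same mu.

```python
def add_constraint(omega, xi, i, j, delta, strength=1.0):
    """Add the relative constraint x_j - x_i = delta. Only rows/cols i, j change."""
    omega[i][i] += strength
    omega[j][j] += strength
    omega[i][j] -= strength
    omega[j][i] -= strength
    xi[i] -= strength * delta
    xi[j] += strength * delta

def solve(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - known) / aug[i][i]
    return x

omega = [[0.0] * 3 for _ in range(3)]
xi = [0.0] * 3

omega[0][0] += 1.0   # initial position constraint: x0 = -3
xi[0] += -3.0
add_constraint(omega, xi, 0, 1, 5.0)   # x1 = x0 + 5
add_constraint(omega, xi, 1, 2, 3.0)   # x2 = x1 + 3

mu = solve(omega, xi)
print(mu)   # approximately [-3, 2, 5]
```

Each call to `add_constraint` touches only the entries for the two poses involved, which is the "affects only affected areas" point in the notes.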

Vehicle motion and calculus

  • Odometer and Inertial Measurement Unit
  • Position vs time graphs
  • Rate gyros -- the yaw rate of a vehicle can be measured by a rate gyro
  • Bias and errors that integrate and accumulate
  • Trigonometry for calculating x and y components
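Putting the last few bullets together, heading can be integrated from the yaw rate and then used with trigonometry to split speed into x and y components. This dead-reckoning sketch uses simple Euler integration with a fixed time step; the sample values and the 0.01 rad/s gyro bias are made up to show how a small bias accumulates.

```python
import math

def dead_reckon(x, y, heading, samples, dt):
    """Integrate (speed, yaw_rate) samples into a pose.

    heading is in radians; each sample covers dt seconds.
    """
    for speed, yaw_rate in samples:
        heading += yaw_rate * dt              # integrate yaw rate
        x += speed * math.cos(heading) * dt   # x component of the motion
        y += speed * math.sin(heading) * dt   # y component of the motion
    return x, y, heading

straight = [(10.0, 0.0)] * 5    # 10 m/s, no turning, 5 samples
biased = [(10.0, 0.01)] * 5     # same drive, gyro reads a small bias

print(dead_reckon(0.0, 0.0, 0.0, straight, dt=1.0))
print(dead_reckon(0.0, 0.0, 0.0, biased, dt=1.0))
```

The biased run drifts sideways even though the car drove straight, because the heading error is integrated at every step.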