user:smarker #ml #object-detection

Object Detection

Class imbalance: total # of positive examples <<< total # of negative examples

Example: identifying fraudulent claims

There may not be many fraudulent claims, so the classifier will tend to classify fraudulent claims as genuine.

  • Model 1: classified 7/10 fraudulent transactions as genuine and 10/10,000 genuine transactions as fraudulent = 17 "mistakes"
  • Model 2: classified 2/10 fraudulent transactions as genuine and 100/10,000 genuine transactions as fraudulent = 102 "mistakes"

Since we want to minimize the number of fraudulent transactions classified as genuine, Model 2 actually performs better even though it made more "mistakes". Therefore, it is better not to base performance on raw mistake counts, but on the true positive (TP) rate, true negative (TN) rate, false positive (FP) rate, and false negative (FN) rate.

| Formula | Performance |
| --- | --- |
| TP Rate = TP / (TP + FN) | Close to 1 = good |
| TN Rate = TN / (TN + FP) | Close to 1 = good |
| FP Rate = FP / (FP + TN) | Close to 0 = good |
| FN Rate = FN / (FN + TP) | Close to 0 = good |
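
A minimal sketch (plain Python; the counts are hypothetical, loosely modelled on Model 2 above) of computing these rates from confusion-matrix counts:

```python
# Hypothetical counts, loosely based on "Model 2" above:
# 8/10 frauds caught (TP), 2 missed (FN), 100 genuine flagged (FP), 9,900 genuine passed (TN).
TP, FN, FP, TN = 8, 2, 100, 9_900

tp_rate = TP / (TP + FN)  # recall / sensitivity: close to 1 is good
tn_rate = TN / (TN + FP)  # specificity: close to 1 is good
fp_rate = FP / (FP + TN)  # close to 0 is good
fn_rate = FN / (FN + TP)  # close to 0 is good

print(f"TP rate={tp_rate:.3f}  TN rate={tn_rate:.3f}  "
      f"FP rate={fp_rate:.3f}  FN rate={fn_rate:.3f}")
```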

How to Mitigate Class Imbalance Problem

  • Cost Function Based Approach - treat one false negative as worse than one false positive, i.e. weight false negatives more heavily (see the weighted-loss sketch after this list)

    • e.g. classifying a claim as genuine when it was actually fraudulent is given a larger cost than classifying a claim as fraudulent when it was actually genuine
  • Sampling Based Approach

    • oversampling: adding more instances of the minority class - may cause overfitting to the minority class
    • undersampling: removing instances of the majority class - may risk removing representative instances of the majority class
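
A minimal sketch of the cost-function approach using PyTorch's class-weighted cross-entropy; the 1:50 weighting and the dummy batch are illustrative assumptions, not values from these notes:

```python
import torch
import torch.nn as nn

# Weight the rare (fraud) class more heavily, so missing a fraud (false negative)
# costs more than flagging a genuine claim. The 1:50 weighting is illustrative only.
class_weights = torch.tensor([1.0, 50.0])   # [genuine, fraud]
criterion = nn.CrossEntropyLoss(weight=class_weights)

logits = torch.randn(8, 2)                        # dummy model outputs for 8 claims
labels = torch.tensor([0, 0, 0, 0, 0, 0, 0, 1])   # mostly genuine, one fraud
loss = criterion(logits, labels)
print(loss.item())
```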

Sampling

Downsampling

  • Reduces the # of pixels in the image, i.e. shrinking the image. To make the image the same size as it was previously, the missing pixels then have to be estimated (interpolated) from the remaining ones
  • Example: reduce a 512x512 image to 256x256 = factor of 2 downsampling in horizontal and vertical directions

image

image

Upsampling

  • Increases the # of pixels in the image, i.e. enlarging the image. The added pixels are estimated (interpolated) from surrounding samples (see the resize sketch after the images below)

image

image
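
A minimal sketch of both operations using Pillow; the library choice, the dummy image, and bilinear interpolation are assumptions for illustration:

```python
from PIL import Image

# A 512x512 placeholder image (in practice you would Image.open(...) a real file).
img = Image.new("RGB", (512, 512), color="gray")

# Downsampling by a factor of 2 in each direction: 512x512 -> 256x256 (fewer pixels).
small = img.resize((256, 256), Image.BILINEAR)

# Upsampling back to 512x512: the added pixels are interpolated from surrounding samples.
restored = small.resize((512, 512), Image.BILINEAR)
print(small.size, restored.size)
```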

  • Used for recognizing objects at vastly different scales
  • Scale-Invariant because the object's scale change is offset by shifting its level in the pyramid
  • Feature maps close to the image layer are composed of low-level structures, which are not effective for accurate object detection

image

  • Feature Pyramid Network (FPN) is composed of a bottom-up and top-down pathway
  • the bottom-up pathway is useful for feature extraction: spatial resolution decreases as you go up toward the top layers of the pyramid (you see a smaller version of the object), while the semantic value of the features increases

image

  • FPN uses a top-down pathway to construct higher resolution layers from a semantic rich layer
  • The bottom-up pathway uses ResNet
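
A minimal PyTorch sketch of the top-down pathway described above (1x1 lateral convolutions plus nearest-neighbour upsampling and addition); the channel counts, layer names, and 3x3 smoothing convolutions are assumptions modelled on a ResNet-style backbone, not an exact FPN implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownFPN(nn.Module):
    """Sketch of the FPN top-down pathway: 1x1 lateral convs reduce each bottom-up
    feature map to a common channel width, then coarser (semantically richer) levels
    are upsampled and added to build high-resolution, semantically rich maps."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        self.smooth = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                    for _ in in_channels)

    def forward(self, c2, c3, c4, c5):
        # Start from the coarsest map (highest semantic value) and work downward.
        p5 = self.lateral[3](c5)
        p4 = self.lateral[2](c4) + F.interpolate(p5, scale_factor=2, mode="nearest")
        p3 = self.lateral[1](c3) + F.interpolate(p4, scale_factor=2, mode="nearest")
        p2 = self.lateral[0](c2) + F.interpolate(p3, scale_factor=2, mode="nearest")
        return [s(p) for s, p in zip(self.smooth, (p2, p3, p4, p5))]

# Dummy bottom-up features shaped like ResNet C2-C5 outputs for a 256x256 input.
feats = [torch.randn(1, c, s, s) for c, s in zip((256, 512, 1024, 2048), (64, 32, 16, 8))]
pyramid = TopDownFPN()(*feats)
print([p.shape for p in pyramid])
```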

Anchor Boxes

  • Because a CNN has shared weights, it cannot estimate the absolute position of an object in an image. Anchor boxes make this possible: the CNN only needs to predict the relative transformation for each anchor box (an anchor box is a reference bounding box; see the NumPy sketch after this list)
  • RetinaNet can match the speed of one-stage detectors and surpass the accuracy of the two-stage detectors.
  • one-stage detectors have typically had worse accuracy than two-stage detectors - why? -> class imbalance problem
  • RetinaNet addresses the class imbalance between foreground and background that one-stage detectors face during training of dense detectors - how? -> by reshaping the standard cross entropy loss so that it down-weights the loss assigned to well-classified examples (we want to minimize the loss, and well-classified examples now contribute much less to it)
  • The loss will focus training on a sparse set of hard examples and prevent the large number of easy negatives from overwhelming the detector. This loss is called Focal Loss.
  • Uses a dense sampling of object locations in an input image and an in-network feature pyramid and anchor boxes
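
A small NumPy sketch of anchor box generation over a feature map grid; the stride, scales, and aspect ratios are illustrative assumptions, not RetinaNet's actual settings:

```python
import numpy as np

def generate_anchors(feature_size=4, stride=32, scales=(32, 64), ratios=(0.5, 1.0, 2.0)):
    """Sketch: place a set of anchor boxes (x1, y1, x2, y2) at every cell of a
    feature map. The network then only predicts offsets relative to these boxes
    instead of absolute positions. Scales/ratios/stride are illustrative."""
    anchors = []
    for y in range(feature_size):
        for x in range(feature_size):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride   # anchor centre in image coords
            for s in scales:
                for r in ratios:
                    w, h = s * np.sqrt(r), s / np.sqrt(r)     # r = width / height
                    anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(anchors)

print(generate_anchors().shape)  # (4*4*2*3, 4) = (96, 4)
```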

image

  • C_i is just a type of convolution, for example, conv5 = 256 3x3 filters at stride 1, pad 1
  • In the top-down pathway, apply a 1x1 convolution filter

image

Focal Loss

FL(p_t) = -(1 - p_t)^gamma * log(p_t)

  • well-classified examples: p_t > 0.5
  • The scaling factor decays to 0 as confidence in the correct class increases, so the loss is low for well-classified examples

Suppose

  • gamma = 5, p_t = 0.1 bad classified, then -(1-0.1)^5 * log(0.1) = 1.36 loss
  • gamma = 5, p_t = 0.9 well classified, then -(1-0.9)^5 * log(0.9) = 1.05E-6 loss ~ 0 loss
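
A minimal NumPy sketch of this focal loss term (binary form with natural log and no alpha balancing), reproducing the two values above:

```python
import numpy as np

def focal_loss(p_t, gamma=2.0):
    """FL(p_t) = -(1 - p_t)**gamma * log(p_t). Well-classified examples
    (p_t near 1) are down-weighted by the (1 - p_t)**gamma factor."""
    return -((1.0 - p_t) ** gamma) * np.log(p_t)

print(focal_loss(0.1, gamma=5))  # ~1.36    (hard, badly classified example)
print(focal_loss(0.9, gamma=5))  # ~1.05e-6 (easy, well-classified example)
```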

RetinaNet Performance Against other Detectors

image

  • RetinaNet outperforms Faster R-CNN, a two-stage detector

SSD

  • SSD does not use the bottom layers of the pyramid for object detection, since their semantic value is not high enough to justify the significant reduction in speed (SSD uses only upper layers for detection, and therefore performs worse on small objects)

image

One Stage Detectors

  • Must process a much larger set of candidate object locations regularly sampled across an image (background part of image still dominates even if using a sampling heuristic)

  • RetinaNet

  • YOLO

  • SSD

Two Stage Detectors

  • Stage 1: Class imbalance is addressed through the proposal stage (Selective Search, Edge Boxes, DeepMask, RPN) to narrow down # of candidate object locations, filtering most background samples

  • Stage 2: sampling heuristics like a fixed foreground-to-background ratio are applied to maintain a balance between foreground and background (see the sampling sketch after this list)

  • Faster R-CNN

  • Mask R-CNN
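
A small NumPy sketch of the stage-2 sampling heuristic mentioned above; the batch size and 25% foreground fraction (a 1:3 ratio) are illustrative choices, not values taken from these notes:

```python
import numpy as np

def sample_minibatch(labels, batch_size=256, fg_fraction=0.25, rng=np.random.default_rng(0)):
    """Sketch of a fixed foreground-to-background sampling heuristic:
    labels > 0 mark foreground proposals, labels == 0 mark background.
    At most fg_fraction of the minibatch is foreground; the rest is background."""
    fg_idx = np.flatnonzero(labels > 0)
    bg_idx = np.flatnonzero(labels == 0)
    n_fg = min(int(batch_size * fg_fraction), fg_idx.size)
    n_bg = min(batch_size - n_fg, bg_idx.size)
    keep_fg = rng.choice(fg_idx, n_fg, replace=False)
    keep_bg = rng.choice(bg_idx, n_bg, replace=False)
    return np.concatenate([keep_fg, keep_bg])

# Example: 2000 proposals, roughly 5% foreground.
labels = (np.random.default_rng(1).random(2000) < 0.05).astype(int)
batch = sample_minibatch(labels)
print(len(batch), "sampled,", int((labels[batch] > 0).sum()), "foreground kept")
```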
