user:smarker #ml #object-detection

Object Detection

Class imbalance: total # of positive examples <<< total # of negative examples

Example: identifying fraudulent claims

There may not be many fraudulent claims, so the classifier will tend to classify fraudulent claims as genuine.

  • Model 1: classified 7/10 fraudulent transactions as genuine and 10/10,000 genuine transactions as fraudulent = 17 "mistakes"
  • Model 2: classified 2/10 fraudulent transactions as genuine and 100/10,000 genuine transactions as fraudulent = 102 "mistakes"

Since we want to minimize the number of fraudulent transactions classified as genuine, Model 2 actually performs better even though it made more "mistakes". Therefore, it is better not to base performance on raw mistake counts, but on the true positive (TP) rate, true negative (TN) rate, false positive (FP) rate, and false negative (FN) rate.

| Formula | Performance |
| --- | --- |
| TP Rate = TP / (TP + FN) | Close to 1 = good |
| TN Rate = TN / (TN + FP) | Close to 1 = good |
| FP Rate = FP / (FP + TN) | Close to 0 = good |
| FN Rate = FN / (FN + TP) | Close to 0 = good |
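
A minimal sketch (plain Python; the counts are hypothetical, loosely modelled on Model 2 above) of computing these rates from confusion-matrix counts:

```python
# Hypothetical counts, loosely based on "Model 2" above:
# 8/10 frauds caught (TP), 2 missed (FN), 100 genuine flagged (FP), 9,900 genuine passed (TN).
TP, FN, FP, TN = 8, 2, 100, 9_900

tp_rate = TP / (TP + FN)  # recall / sensitivity: close to 1 is good
tn_rate = TN / (TN + FP)  # specificity: close to 1 is good
fp_rate = FP / (FP + TN)  # close to 0 is good
fn_rate = FN / (FN + TP)  # close to 0 is good

print(f"TP rate={tp_rate:.3f}  TN rate={tn_rate:.3f}  "
      f"FP rate={fp_rate:.3f}  FN rate={fn_rate:.3f}")
```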

How to Mitigate Class Imbalance Problem

  • Cost Function Based Approach - treat one false negative as worse than one false positive, i.e. weight false negatives more heavily (see the weighted-loss sketch after this list)

    • e.g. classifying a claim as genuine when it was actually fraudulent is given a larger cost than classifying a claim as fraudulent when it was actually genuine
  • Sampling Based Approach

    • oversampling: adding more instances of the minority class - may cause overfitting to the minority class
    • undersampling: removing instances of the majority class - may risk removing representative instances of the majority class
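
A minimal sketch of the cost-function approach using PyTorch's class-weighted cross-entropy; the 1:50 weighting and the dummy batch are illustrative assumptions, not values from these notes:

```python
import torch
import torch.nn as nn

# Weight the rare (fraud) class more heavily, so missing a fraud (false negative)
# costs more than flagging a genuine claim. The 1:50 weighting is illustrative only.
class_weights = torch.tensor([1.0, 50.0])   # [genuine, fraud]
criterion = nn.CrossEntropyLoss(weight=class_weights)

logits = torch.randn(8, 2)                        # dummy model outputs for 8 claims
labels = torch.tensor([0, 0, 0, 0, 0, 0, 0, 1])   # mostly genuine, one fraud
loss = criterion(logits, labels)
print(loss.item())
```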

Sampling

Downsampling

  • Reduces the # of pixels in the image, i.e. shrinking the image. To make the image the same size as it was previously, the missing pixels then have to be estimated (interpolated) from the remaining ones
  • Example: reduce a 512x512 image to 256x256 = factor of 2 downsampling in horizontal and vertical directions

image

image

Upsampling

  • Increases the # of pixels in the image, i.e. enlarging the image. The added pixels are estimated (interpolated) from surrounding samples (see the resize sketch after the images below)

image

image
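
A minimal sketch of both operations using Pillow; the library choice, the dummy image, and bilinear interpolation are assumptions for illustration:

```python
from PIL import Image

# A 512x512 placeholder image (in practice you would Image.open(...) a real file).
img = Image.new("RGB", (512, 512), color="gray")

# Downsampling by a factor of 2 in each direction: 512x512 -> 256x256 (fewer pixels).
small = img.resize((256, 256), Image.BILINEAR)

# Upsampling back to 512x512: the added pixels are interpolated from surrounding samples.
restored = small.resize((512, 512), Image.BILINEAR)
print(small.size, restored.size)
```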

  • Used for recognizing objects at vastly different scales
  • Scale-Invariant because the object's scale change is offset by shifting its level in the pyramid
  • Feature maps close to the image layer are composed of low-level structures, which are not effective for accurate object detection

image

  • Feature Pyramid Network (FPN) is composed of a bottom-up and top-down pathway
  • the bottom-up pathway is useful for feature extraction: spatial resolution decreases as you go up toward the top layers of the pyramid (you see a smaller version of the object), while the semantic value of the features increases

image

  • FPN uses a top-down pathway to construct higher resolution layers from a semantic rich layer
  • The bottom-up pathway uses ResNet
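
A minimal PyTorch sketch of the top-down pathway described above (1x1 lateral convolutions plus nearest-neighbour upsampling and addition); the channel counts, layer names, and 3x3 smoothing convolutions are assumptions modelled on a ResNet-style backbone, not an exact FPN implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownFPN(nn.Module):
    """Sketch of the FPN top-down pathway: 1x1 lateral convs reduce each bottom-up
    feature map to a common channel width, then coarser (semantically richer) levels
    are upsampled and added to build high-resolution, semantically rich maps."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        self.smooth = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                    for _ in in_channels)

    def forward(self, c2, c3, c4, c5):
        # Start from the coarsest map (highest semantic value) and work downward.
        p5 = self.lateral[3](c5)
        p4 = self.lateral[2](c4) + F.interpolate(p5, scale_factor=2, mode="nearest")
        p3 = self.lateral[1](c3) + F.interpolate(p4, scale_factor=2, mode="nearest")
        p2 = self.lateral[0](c2) + F.interpolate(p3, scale_factor=2, mode="nearest")
        return [s(p) for s, p in zip(self.smooth, (p2, p3, p4, p5))]

# Dummy bottom-up features shaped like ResNet C2-C5 outputs for a 256x256 input.
feats = [torch.randn(1, c, s, s) for c, s in zip((256, 512, 1024, 2048), (64, 32, 16, 8))]
pyramid = TopDownFPN()(*feats)
print([p.shape for p in pyramid])
```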

Anchor Boxes

  • Because a CNN has shared weights, it cannot estimate the absolute position of an object in an image. Anchor boxes make this possible: the CNN only needs to predict the relative transformation for each anchor box (an anchor box is a reference bounding box; see the NumPy sketch after this list)
  • RetinaNet can match the speed of one-stage detectors and surpass the accuracy of the two-stage detectors.
  • one-stage detectors have typically had worse accuracy than two-stage detectors - why? -> class imbalance problem
  • RetinaNet addresses the class imbalance between foreground and background that one-stage detectors face during training of dense detectors - how? -> by reshaping the standard cross entropy loss so that it down-weights the loss assigned to well-classified examples (we want to minimize the loss, and well-classified examples now contribute much less to it)
  • The loss will focus training on a sparse set of hard examples and prevent the large number of easy negatives from overwhelming the detector. This loss is called Focal Loss.
  • Uses a dense sampling of object locations in an input image and an in-network feature pyramid and anchor boxes
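
A small NumPy sketch of anchor box generation over a feature map grid; the stride, scales, and aspect ratios are illustrative assumptions, not RetinaNet's actual settings:

```python
import numpy as np

def generate_anchors(feature_size=4, stride=32, scales=(32, 64), ratios=(0.5, 1.0, 2.0)):
    """Sketch: place a set of anchor boxes (x1, y1, x2, y2) at every cell of a
    feature map. The network then only predicts offsets relative to these boxes
    instead of absolute positions. Scales/ratios/stride are illustrative."""
    anchors = []
    for y in range(feature_size):
        for x in range(feature_size):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride   # anchor centre in image coords
            for s in scales:
                for r in ratios:
                    w, h = s * np.sqrt(r), s / np.sqrt(r)     # r = width / height
                    anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(anchors)

print(generate_anchors().shape)  # (4*4*2*3, 4) = (96, 4)
```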

image

  • C_i is just a type of convolution, for example, conv5 = 256 3x3 filters at stride 1, pad 1
  • In the top-down pathway, apply a 1x1 convolution filter

image

Focal Loss

FL(p_t) = -(1 - p_t)^gamma * log(p_t)

  • well-classified examples: p_t > 0.5
  • The scaling factor decays to 0 as confidence in the correct class increases, so the loss is low for well-classified examples

Suppose

  • gamma = 5, p_t = 0.1 bad classified, then -(1-0.1)^5 * log(0.1) = 1.36 loss
  • gamma = 5, p_t = 0.9 well classified, then -(1-0.9)^5 * log(0.9) = 1.05E-6 loss ~ 0 loss
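
A minimal NumPy sketch of this focal loss term (binary form with natural log and no alpha balancing), reproducing the two values above:

```python
import numpy as np

def focal_loss(p_t, gamma=2.0):
    """FL(p_t) = -(1 - p_t)**gamma * log(p_t). Well-classified examples
    (p_t near 1) are down-weighted by the (1 - p_t)**gamma factor."""
    return -((1.0 - p_t) ** gamma) * np.log(p_t)

print(focal_loss(0.1, gamma=5))  # ~1.36    (hard, badly classified example)
print(focal_loss(0.9, gamma=5))  # ~1.05e-6 (easy, well-classified example)
```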

RetinaNet Performance Against other Detectors

image

  • RetinaNet outperforms Faster R-CNN, a two-stage detector

SSD

  • SSD does not use the bottom layers of the pyramid for object detection, since their semantic value is not high enough to justify the significant reduction in speed (SSD uses only upper layers for detection, and therefore performs worse on small objects)

image

One Stage Detectors

  • Must process a much larger set of candidate object locations regularly sampled across an image (background part of image still dominates even if using a sampling heuristic)

  • RetinaNet

  • YOLO

  • SSD

Two Stage Detectors

  • Stage 1: Class imbalance is addressed through the proposal stage (Selective Search, Edge Boxes, DeepMask, RPN) to narrow down # of candidate object locations, filtering most background samples

  • Stage 2: sampling heuristics like a fixed foreground-to-background ratio are applied to maintain a balance between foreground and background (see the sampling sketch after this list)

  • Faster R-CNN

  • Mask R-CNN
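
A small NumPy sketch of the stage-2 sampling heuristic mentioned above; the batch size and 25% foreground fraction (a 1:3 ratio) are illustrative choices, not values taken from these notes:

```python
import numpy as np

def sample_minibatch(labels, batch_size=256, fg_fraction=0.25, rng=np.random.default_rng(0)):
    """Sketch of a fixed foreground-to-background sampling heuristic:
    labels > 0 mark foreground proposals, labels == 0 mark background.
    At most fg_fraction of the minibatch is foreground; the rest is background."""
    fg_idx = np.flatnonzero(labels > 0)
    bg_idx = np.flatnonzero(labels == 0)
    n_fg = min(int(batch_size * fg_fraction), fg_idx.size)
    n_bg = min(batch_size - n_fg, bg_idx.size)
    keep_fg = rng.choice(fg_idx, n_fg, replace=False)
    keep_bg = rng.choice(bg_idx, n_bg, replace=False)
    return np.concatenate([keep_fg, keep_bg])

# Example: 2000 proposals, roughly 5% foreground.
labels = (np.random.default_rng(1).random(2000) < 0.05).astype(int)
batch = sample_minibatch(labels)
print(len(batch), "sampled,", int((labels[batch] > 0).sum()), "foreground kept")
```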
