seopbo/YOLO9000.md

YOLO9000: Better, Faster, Stronger

This paper (YOLO9000) proposes YOLO v2, an improved version of the YOLO v1 model introduced in YOLO: You Only Look Once. It also proposes a way to overcome a limitation of object detection models: because detection data is scarce, they can detect only a small number of classes. This post is based on YOLO9000: Better, Faster, Stronger and covers only the key ideas; please refer to the paper for details. While writing it, I referred to the PR12 talk given by Jinwon Lee (이진원).

Abstract

The paper proposes YOLO9000, a model that can detect more than 9,000 object classes in real time. To get there, the Better and Faster sections first describe YOLO v2, an improved version of the YOLO v1 model from YOLO: You Only Look Once. YOLO v2 performs as follows.

At 67 FPS, YOLOv2 gets 76.8 mAP on VOC 2007
At 40 FPS, YOLOv2 gets 78.6 mAP, outperforming state-of-the-art methods like Faster R-CNN with ResNet and SSD while still running significantly faster.

The paper also proposes joint training, which uses a detection dataset and a classification dataset together to learn object detection and classification at the same time. With it, the authors train YOLO9000, a model that can detect classes that are absent from the detection dataset. YOLO9000 performs as follows.

YOLO9000 gets 19.7 mAP on the ImageNet detection validation set despite only having detection data for 44 of the 200 classes. On the 156 classes not in COCO, YOLO9000 gets 16.0 mAP.

1. Introduction

The Introduction notes that recent object detection methods (e.g. Faster R-CNN) have become fast and accurate, but can still detect only a small number of classes. To address this, the paper proposes two ways to exploit the classification data that already exists in abundance.


Hierarchical view of object classification

Combining distinct dataset (detection dataset, classification dataset)


Joint training algorithm

Leveraging labeled detection images to learn to precisely localize objects
Leveraging classification images to increase its vocabulary and robustness


Before applying these two methods, the paper proposes YOLO v2, an improved version of the YOLO v1 model from YOLO: You Only Look Once. It then proposes YOLO9000, which is YOLO v2 trained with both methods on the ImageNet classification dataset and the COCO detection dataset.

2. Better (YOLO v2)

Compared with other state-of-the-art methods, YOLO v1 had two problems.


High localization error


Low recall compared with region-proposal-based methods (e.g. Fast R-CNN, Faster R-CNN)

In other words, it misses many objects.


YOLO v2 focuses on fixing these problems, but detection still has to be fast. So instead of scaling up the architecture or using an ensemble, it keeps the architecture as simple as possible. YOLO v1 is improved through the following elements.


Batch normalization


Batch normalization is applied to the convolutional layers of the CNN that makes up YOLO


Batch normalization speeds up convergence and also acts as a regularizer to some extent, so dropout is removed once batch normalization is added


About a 2% gain in mAP


High Resolution Classifier


YOLO v1 takes a CNN pre-trained for classification on $224 \times 224$ images and increases the resolution to $448 \times 448$ when training for detection


YOLO v2 takes a CNN pre-trained for classification on $224 \times 224$ images, fine-tunes it for classification on $448 \times 448$ images, and only then trains for detection


Convolutional With Anchor Boxes


YOLO v1 predicts bounding box coordinates directly using fully-connected layers


YOLO v2, like Faster R-CNN, predicts offsets, i.e. the difference between a predefined anchor box (hand-picked prior) and the ground-truth box, and uses them to shift and reshape the anchor box

This is similar to the region proposal network of Faster R-CNN: the fully-connected layers are removed and the offsets are predicted with $1 \times 1$ convolutions


For detection, the input image is preprocessed to $416 \times 416$ rather than $448 \times 448$


Large objects tend to occupy the center of the image. YOLO v2 uses Darknet-19, an architecture similar to VGG-19, which applies $2 \times 2$ max pooling five times and so reduces the resolution by a factor of 32


With a $448 \times 448$ input, YOLO v2 produces a $14 \times 14$ feature map, so four grid cells share responsibility for an object at the center of the image


With a $416 \times 416$ input, it produces a $13 \times 13$ feature map, so a single grid cell is responsible for an object at the center, which works better


YOLO v1 predicts the object class per grid cell (grid-based), whereas YOLO v2 predicts the class for every anchor box assigned to each grid cell (anchor-box-based)

On PASCAL VOC, which has 20 object classes, using 5 anchor boxes requires 125 $1 \times 1$ convolution filters, because each anchor box needs class probabilities (20), offsets (4), and a confidence score (1): $5 \times (20 + 4 + 1) = 125$


Compared with the grid-based approach, the anchor-box-based approach lowers mAP slightly (69.5 $\rightarrow$ 69.2) but raises recall (81% $\rightarrow$ 88%)
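
The grid and filter arithmetic in this part can be checked with a few lines of Python. This is only a minimal sketch: the function names `grid_size` and `num_output_filters` are made up for illustration and do not come from the paper or from Darknet.

```python
def grid_size(input_size: int, stride: int = 32) -> int:
    """Side length of the output feature map for a square input."""
    if input_size % stride != 0:
        raise ValueError("input size must be a multiple of the stride")
    return input_size // stride


def num_output_filters(num_anchors: int, num_classes: int) -> int:
    """Filters in the final 1x1 convolution.

    Each anchor box predicts 4 offsets, 1 confidence score and one
    probability per class.
    """
    return num_anchors * (4 + 1 + num_classes)


print(grid_size(448))  # 14: even, so four cells share the image center
print(grid_size(416))  # 13: odd, so a single cell sits at the center
print(num_output_filters(num_anchors=5, num_classes=20))  # 125 for PASCAL VOC
```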


Dimension Clusters


Instead of using hand-picked anchor boxes as Faster R-CNN does, the anchor boxes are chosen from the data


The anchor boxes are determined by running k-means clustering on the ground-truth boxes of the detection dataset


Because the goal is to cluster box shapes, the ground-truth boxes are aligned at a common center and clustered with the distance measure below


$$d(box, centroid)=1-IOU(box, centroid)$$


When the anchor boxes are chosen from the detection dataset this way, $k = 5$ gives a good trade-off between recall and model complexity
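
The clustering step can be sketched in plain Python as below. Several details are assumptions made for illustration, not taken from the paper: the centroid update uses the mean width and height of each cluster, the initial centroids are sampled at random, and the box shapes are invented. Boxes are `(width, height)` pairs, which is enough because they are compared at a common center.

```python
import random


def iou_wh(box, centroid):
    """IOU of two (width, height) boxes that share the same center."""
    inter = min(box[0], centroid[0]) * min(box[1], centroid[1])
    union = box[0] * box[1] + centroid[0] * centroid[1] - inter
    return inter / union


def kmeans_iou(boxes, k, iters=100, seed=0):
    """Cluster (width, height) pairs with d = 1 - IOU(box, centroid)."""
    rng = random.Random(seed)
    centroids = rng.sample(boxes, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            # The smallest distance is the same as the largest IOU.
            nearest = max(range(k), key=lambda i: iou_wh(box, centroids[i]))
            clusters[nearest].append(box)
        new_centroids = []
        for i, members in enumerate(clusters):
            if not members:  # keep the old centroid for an empty cluster
                new_centroids.append(centroids[i])
                continue
            mean_w = sum(w for w, _ in members) / len(members)
            mean_h = sum(h for _, h in members) / len(members)
            new_centroids.append((mean_w, mean_h))
        if new_centroids == centroids:  # assignments no longer change
            break
        centroids = new_centroids
    return centroids


# Made-up box shapes: three wide boxes and three tall boxes.
boxes = [(10, 4), (11, 5), (9, 4), (3, 9), (4, 10), (3, 8)]
anchors = kmeans_iou(boxes, k=2)
print(sorted(anchors))
```

Using $1 - IOU$ instead of Euclidean distance keeps large boxes from dominating the error, since IOU does not depend on the absolute size of the box.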


Direct location prediction


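Using anchor boxes causes instability, especially early in training

The center coordinates of an anchor box assigned to a grid cell are computed with the unconstrained equations below, so the box can leave its cell and end up at an object anywhere in the image

$$x = (t_x * w_a) + x_a, \quad y = (t_y * h_a) + y_a$$

(The paper prints a minus sign here; the plus sign follows the Faster R-CNN parameterization $t_x = (x - x_a) / w_a$.)


Because of this problem, YOLO v2 follows the YOLO v1 approach for the center coordinates only, predicting them directly relative to the grid cell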

If the cell is offset from the top left corner of the image by $(c_x, c_y)$ and the bounding box prior has width and height $p_w, p_h$, then the predictions correspond to

$$b_x = \sigma(t_x) + c_x, \ b_y = \sigma(t_y) + c_y $$
$$b_w = p_we^{t_w}, \ b_h = p_he^{t_h}$$
$$Pr(object) * IOU(b, object)=\sigma(t_o)$$
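
The three equations above translate directly into code. The sketch below is a plain-Python illustration, assuming all quantities are expressed in grid-cell units; the function name `decode_box` and the argument layout are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(t, cell, prior):
    """Map raw outputs (t_x, t_y, t_w, t_h, t_o) to a box in grid units.

    cell  = (c_x, c_y): offset of the grid cell from the top-left corner
    prior = (p_w, p_h): width and height of the anchor box
    """
    t_x, t_y, t_w, t_h, t_o = t
    c_x, c_y = cell
    p_w, p_h = prior
    b_x = sigmoid(t_x) + c_x      # the center stays inside the cell
    b_y = sigmoid(t_y) + c_y
    b_w = p_w * math.exp(t_w)     # the prior is rescaled, never negative
    b_h = p_h * math.exp(t_h)
    confidence = sigmoid(t_o)     # Pr(object) * IOU(b, object)
    return b_x, b_y, b_w, b_h, confidence


# All-zero outputs place the box at the cell center with the prior's shape.
print(decode_box((0, 0, 0, 0, 0), cell=(6, 6), prior=(3.0, 2.0)))
# (6.5, 6.5, 3.0, 2.0, 0.5)
```

Because the sigmoid bounds the center offset to the range 0 to 1, no value of $t_x$ or $t_y$ can move the center out of its cell, which is what removes the early-training instability described above.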


Fine-Grained Features
YOLO v2 detects objects on a $13 \times 13$ feature map. This works well for large objects but is not enough for small ones, so the feature map of an earlier convolutional layer, which has a smaller receptive field and a higher resolution, is brought in and concatenated with it along the channel dimension. The paper calls this a passthrough layer and reports a performance gain of about 1%.


The passthrough layer turns the $26 \times 26 \times 512$ feature map into a $13 \times 13 \times 2048$ feature map, which can be concatenated with the original features.
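
One way to picture the passthrough layer is as a space-to-depth rearrangement: every $2 \times 2$ block of neighboring positions is stacked into the channel axis. The sketch below uses nested Python lists indexed as `[row][col][channel]`. The channel ordering chosen here is an assumption for illustration and may differ from the ordering used by the actual Darknet implementation.

```python
def passthrough(feature_map, stride=2):
    """Stack each stride x stride spatial block into the channel axis.

    feature_map is a nested list indexed as [row][col][channel].
    """
    height, width = len(feature_map), len(feature_map[0])
    out = []
    for row in range(0, height, stride):
        out_row = []
        for col in range(0, width, stride):
            channels = []
            for dy in range(stride):
                for dx in range(stride):
                    channels.extend(feature_map[row + dy][col + dx])
            out_row.append(channels)
        out.append(out_row)
    return out


# A 26 x 26 x 512 map of zeros becomes 13 x 13 x 2048.
fine = [[[0.0] * 512 for _ in range(26)] for _ in range(26)]
coarse = passthrough(fine)
print(len(coarse), len(coarse[0]), len(coarse[0][0]))  # 13 13 2048
```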


Multi-Scale Training
Because YOLO v2 has no fully-connected layers, it can be trained on images of any resolution. To learn robust features, the model is trained as described below. Since it has been trained on images of varying resolution, there is also a trade-off between accuracy and speed depending on the resolution used at test time.

Every 10 batches our network randomly chooses a new image dimension size.
Since our model downsamples by a factor of 32, we pull from the following multiples of 32: {320,352,...,608}
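
The resolution schedule can be sketched as a small generator. This is a hedged illustration only: `input_size_schedule` is a hypothetical helper, and a real training loop would also have to resize the images and the network input at each switch.

```python
import random

SIZES = list(range(320, 608 + 1, 32))  # 320, 352, ..., 608


def input_size_schedule(num_batches, every=10, seed=0):
    """Yield the input resolution to use for each batch."""
    rng = random.Random(seed)
    size = rng.choice(SIZES)
    for batch in range(num_batches):
        if batch > 0 and batch % every == 0:
            size = rng.choice(SIZES)  # pick a new image dimension
        yield size


schedule = list(input_size_schedule(30))
print(len(SIZES))  # 10 candidate resolutions
print(schedule[0], schedule[10], schedule[20])
```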

  
3. Faster (YOLO v2)

YOLO v1 is based on the GoogLeNet architecture. GoogLeNet needs fewer operations than VGG-16, on which most object detection models are based, but is slightly less accurate. To improve on this, the paper proposes a new architecture called Darknet-19 and uses it for object detection.


Darknet-19
Batch normalization is used throughout the architecture, and dropout is not used. The paper reports the following cost and accuracy.

Darknet-19 only requires 5.58 billion operations to process an image yet achieves 72.9% top-1 accuracy and 91.2% top-5 accuracy on ImageNet.

  
Training for classification
Classification is trained as described below.


ImageNet 1000 classes for 160 epochs


Standard data augmentation: random crops, rotations, and hue, saturation, and exposure shifts


Initial training at $224 \times 224$, then fine-tuning at $448 \times 448$ for 10 epochs


Training for detection
Object detection is trained as described below.


Adding three $3 \times 3$ conv layers with 1024 filters each, followed by a final $1 \times 1$ conv layer


For VOC, predicting 5 boxes with 5 coordinates each and 20 classes per box, so 125 filters


160 epochs with a starting learning rate of $10^{-3}$, dividing it by 10 at 60 and 90 epochs
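
The learning-rate schedule for detection is a simple step function. In the sketch below, the function name and the convention that epochs are counted from 0 (so the first drop applies from epoch 60 onward) are assumptions for illustration.

```python
def learning_rate(epoch, base_lr=1e-3, milestones=(60, 90), factor=10.0):
    """Step schedule: divide the rate by `factor` at each milestone epoch."""
    lr = base_lr
    for milestone in milestones:
        if epoch >= milestone:
            lr /= factor
    return lr


for epoch in (0, 59, 60, 89, 90, 159):
    print(epoch, learning_rate(epoch))
```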


4. Stronger (YOLO9000)

Before joint training, which uses both the detection dataset and the classification dataset to train object detection, the classes of the two datasets have to be merged. Their classes have the following characteristics.

Classes in the detection dataset carry only general labels (e.g. "dog", "boat")
Classes in the classification dataset also carry fine-grained labels (e.g. "Norfolk terrier", "Yorkshire terrier")

To merge the classes of two datasets with these characteristics, the paper proposes hierarchical classification.


Hierarchical classification, Dataset combination with WordTree
The classes of the ImageNet dataset (for classification) come from WordNet, so the WordNet graph is used to build a hierarchical tree called WordTree.


WordTree


For each visual noun (class) in ImageNet, find the path that connects it to the "physical object" node of WordNet


Most visual nouns have only one path; when there are several, choose the shortest


When WordTree is used to merge the classes of the COCO detection dataset and the ImageNet classification dataset, the more general COCO classes appear at higher nodes
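
The shortest-path rule can be sketched as a breadth-first search upward through a hypernym graph. This is a simplification for illustration: the graph below is a toy example rather than real WordNet data, and `shortest_path_to_root` is a hypothetical helper, not the procedure used by the authors' code.

```python
from collections import deque


def shortest_path_to_root(node, hypernyms, root="physical object"):
    """Breadth-first search upward through a hypernym graph.

    hypernyms maps each concept to the list of its parent concepts.
    Returns the shortest path [node, ..., root], or None if unreachable.
    """
    queue = deque([[node]])
    seen = {node}
    while queue:
        path = queue.popleft()
        if path[-1] == root:
            return path
        for parent in hypernyms.get(path[-1], []):
            if parent not in seen:
                seen.add(parent)
                queue.append(path + [parent])
    return None


# Toy graph, not real WordNet data: "dog" has two routes to the root.
hypernyms = {
    "dog": ["canine", "domestic animal"],
    "canine": ["carnivore"],
    "carnivore": ["mammal"],
    "mammal": ["animal"],
    "domestic animal": ["animal"],
    "animal": ["physical object"],
}
print(shortest_path_to_root("dog", hypernyms))
# ['dog', 'domestic animal', 'animal', 'physical object']
```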


Hierarchical classification

Once WordTree is built, the probability of predicting a class (a node in WordTree) is computed by following the tree

$$\begin{aligned}
Pr(Norfolk \ terrier) = \ & Pr(Norfolk \ terrier \mid terrier) \\
\times \ & Pr(terrier \mid hunting \ dog) \times \ldots \\
\times \ & Pr(mammal \mid animal) \times Pr(animal \mid physical \ object)
\end{aligned}$$

Building WordTree from only the 1,000 classes of the ImageNet classification dataset and WordNet increases the number of classes (nodes) to 1,369, because intermediate nodes are added. Hierarchical classification on this tree barely reduces classification accuracy


Using the same training parameters as before, our hierarchical Darknet-19 achieves 71.9% top-1 accuracy and 90.4% top-5 accuracy.
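
The product of conditional probabilities used in hierarchical classification can be sketched as a walk from a node up to the root. The tree and the probability values below are invented for illustration, and the root is taken to have probability 1, as the paper assumes for classification.

```python
def absolute_probability(node, parent, conditional):
    """Multiply conditional probabilities along the path from node to root.

    parent maps each node to its parent (the root maps to None);
    conditional[node] holds Pr(node | parent[node]).
    """
    prob = 1.0
    while parent[node] is not None:
        prob *= conditional[node]
        node = parent[node]
    return prob


# Toy tree with made-up probabilities, for illustration only.
parent = {
    "physical object": None,
    "animal": "physical object",
    "mammal": "animal",
    "dog": "mammal",
    "terrier": "dog",
    "Norfolk terrier": "terrier",
}
conditional = {
    "animal": 0.9,
    "mammal": 0.8,
    "dog": 0.5,
    "terrier": 0.5,
    "Norfolk terrier": 0.25,
}
print(absolute_probability("Norfolk terrier", parent, conditional))
```

A node higher in the tree can never have a lower probability than its descendants, so a model that is unsure about the breed can still predict "dog" with high confidence.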


Joint classification and detection
The training dataset for YOLO9000, the joint training that uses it, the validation dataset, and the validation results are as follows.


Dataset


The classes of the COCO detection dataset, the top 9,000 classes of the full ImageNet classification dataset, and the classes of the ImageNet detection dataset are merged using WordTree


The training set combines the COCO detection dataset and the full ImageNet classification dataset; COCO is oversampled so that ImageNet is larger by a factor of only $4:1$


The ImageNet detection dataset is used as the validation dataset

It shares only 44 classes with the COCO detection dataset


Training


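The number of anchor boxes per grid cell is changed from 5 to 3


In joint training, which loss terms are back-propagated depends on whether the image comes from the COCO detection dataset or the full ImageNet classification dataset


For an image from the detection dataset, the entire loss


For an image from the classification dataset, the classification loss (plus the objectness term described below)


In this case, the classification loss is computed on the bounding box that predicts the highest probability for the image's ground-truth class


That bounding box is also assumed to overlap the ground-truth box by at least 0.3 IOU, and the objectness loss is back-propagated under this assumption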
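
The routing of loss terms during joint training can be sketched as follows. The names of the loss terms, the sample-type strings, and both helper functions are hypothetical and chosen only for illustration; in line with the paper, a classification image contributes the classification loss and, under the 0.3 IOU assumption, an objectness term.

```python
def loss_terms(sample_type):
    """Which loss terms are back-propagated for a training image."""
    if sample_type == "detection":
        return {"coordinates", "objectness", "classification"}
    if sample_type == "classification":
        # The box chosen below is assumed to overlap the unseen ground
        # truth by at least 0.3 IOU, so objectness is trained as well.
        return {"objectness", "classification"}
    raise ValueError(f"unknown sample type: {sample_type}")


def box_for_classification_loss(class_probs, label):
    """Index of the predicted box with the highest probability for label.

    class_probs[i][label] is box i's predicted probability for the label.
    """
    return max(range(len(class_probs)), key=lambda i: class_probs[i][label])


# Three predicted boxes with made-up probabilities for an image labelled "dog".
class_probs = [{"dog": 0.10}, {"dog": 0.70}, {"dog": 0.25}]
print(box_for_classification_loss(class_probs, "dog"))  # 1
print(sorted(loss_terms("classification")))
```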


Validation

YOLO9000 gets 19.7 mAP overall with 16.0 mAP on the disjoint 156 object classes that it has never seen any labelled detection data for.


An analysis of the results on the ImageNet detection validation set shows that, because the COCO detection dataset contains plenty of data for animal classes, YOLO9000 detects objects of the animal classes in the ImageNet detection dataset well. It performs relatively poorly on kinds of classes that COCO lacks (e.g. sunglasses, swimming trunks).


5. Conclusion

The contributions of the paper can be summarized as follows.


It proposes YOLO v2, an improved version of YOLO v1 (state of the art)


It proposes YOLO9000, which can detect more than 9,000 object classes, trained with the following techniques


Object detection datasets have far fewer classes than classification datasets, but using classification datasets appropriately increases the number of classes that can be detected


Hierarchical classification and WordTree are proposed as a way to merge the classes of object detection and classification datasets


Joint training is proposed as a way to train an object detection model using object detection and classification datasets at the same time