In the wake of a burgeoning volume of data and the demand for IoT-based edge computing devices at the advent of 5G, the research community and industry alike are taking an increased interest in deploying AI on the edge. AI-based systems on edge computing devices offer the best of both worlds, viz., the state-of-the-art accuracy of deep learning models and the portability and scalability of embedded systems. A major bottleneck, however, is the prohibitively large size of deep learning models relative to the limited memory and computation capacities of embedded systems that operate on a low power budget. With this background, our work sits at the intersection of computer vision, deep learning, and embedded systems. By addressing the problem of face recognition with deep convolutional neural networks (CNNs), we explore optimizations at the model level and the hardware level with the aim of easing embedded implementations.
Our work extends the idea of distillation-based knowledge transfer for model compression to regression-based problems. We present the results of experiments in which the knowledge from an Inception CNN (teacher network) with ~3.7M parameters is transferred to MobileNet CNN (student network) architectures with ~0.8M and ~0.5M parameters. We demonstrate that the smaller student networks not only achieve comparable results but even exceed the face verification accuracy of the Inception teacher CNN on the Labeled Faces in the Wild (LFW) test set. For instance, we transfer the knowledge from an Inception CNN with 81.07% LFW accuracy into a MobileNet model that achieves an accuracy of 83.28% with a 76.75% reduction in parameter count. The student network is trained on a so-called transfer dataset consisting of ~1M images from VGG2, by regressing onto the teacher network's embeddings in the mean-squared-error sense. In addition, we demonstrate that pruning the local response normalization layers, along with layers that apply only a constant product or power to their input, has a negligible effect on model accuracy. Further, precluding the affine-transform-based face alignment step reduces the accuracy by only a modest amount. Additionally, numerous experiments on knowledge transfer with hyper-parameter tuning have been performed and are discussed to promote future work.
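The embedding-regression objective described above can be sketched as follows. This is a minimal NumPy illustration, not the project's training code: the function name and the toy teacher/student embeddings are our own, and in practice the embeddings would come from forward passes of the Inception and MobileNet networks over the transfer dataset.

```python
import numpy as np

def distillation_mse_loss(teacher_emb, student_emb):
    """Mean squared error between teacher and student embedding batches.

    Minimizing this drives the student's embeddings toward the teacher's,
    which is the regression-style knowledge transfer described above.
    """
    diff = teacher_emb - student_emb
    return float(np.mean(diff ** 2))

# Toy stand-in for one batch: 4 embeddings of dimension 128.
rng = np.random.default_rng(0)
teacher = rng.standard_normal((4, 128))          # hypothetical teacher outputs
student = teacher + 0.1 * rng.standard_normal((4, 128))  # imperfect student
loss = distillation_mse_loss(teacher, student)   # small but nonzero
```

In a full training loop this scalar would be backpropagated through the student network only; the teacher's weights stay frozen.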
FaceNet is a deep convolutional neural network that generates a unified embedding from cropped facial images. The merit of this model, in comparison to its competitors, lies in the way it is trained. While most competing techniques, such as DeepFace, train their architectures by backpropagating through the cross-entropy of soft predictions over known identities, FaceNet minimizes a so-called triplet loss that optimizes the embeddings themselves rather than a bottleneck layer over a limited set of identities. As a direct consequence, the embeddings become robust not only from a classification point of view but also demonstrate noteworthy results on tasks such as face clustering and verification.
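The triplet loss mentioned above can be sketched as follows; this is an illustrative NumPy version of the formulation (the helper names, the toy 2-D "embeddings", and the margin value of 0.2 are our own choices for the sketch):

```python
import numpy as np

def l2_normalize(x):
    """Project each row onto the unit hypersphere, as FaceNet does with its embeddings."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(||a - p||^2 - ||a - n||^2 + margin, 0), averaged over the batch.

    Pulls each anchor closer to its positive (same identity) than to its
    negative (different identity) by at least `margin`.
    """
    d_pos = np.sum((anchor - positive) ** 2, axis=1)  # squared distance to positive
    d_neg = np.sum((anchor - negative) ** 2, axis=1)  # squared distance to negative
    return float(np.mean(np.maximum(d_pos - d_neg + margin, 0.0)))

# Toy 2-D embeddings: an easy triplet already satisfies the margin,
# while reusing the anchor as the negative violates it.
a = l2_normalize(np.array([[1.0, 0.0]]))
p = l2_normalize(np.array([[0.9, 0.1]]))
n = l2_normalize(np.array([[0.0, 1.0]]))
easy = triplet_loss(a, p, n)  # 0: negative is already far enough away
hard = triplet_loss(a, p, a)  # positive loss: margin is violated
```

Triplets whose loss is zero contribute no gradient, which is why triplet selection (mining hard or semi-hard triplets) matters so much for training in practice.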
WORK IN PROGRESS
- Prathamesh Mandke - pkmandke AT vt DOT edu
- Hrishikesh Kale - kalehp15 DOT extc AT coep DOT ac DOT in
- Hrishikesh Mahajan - mahajanhs15 DOT extc AT coep DOT ac DOT in
- Vedant Deshpande - deshpandevv16 DOT extc AT coep DOT ac DOT in
- Prof. Dr. M. S. Sutaone - mssutaone.extc@coep.ac.in
- Dr. Florian Schroff - Google AI.
This work is owned by College of Engineering, Pune. All rights are reserved by the Center of Excellence in Signal & Image Processing (CoE-SIP) at COEP. The rights to publish this work in any form are held by the authors.