jph00/index.html

## index.html
<!DOCTYPE html>
<html>
<head><title>Brundage Bot Backfill</title></head>
<body>
<ul>
<li><a href="http://arxiv.org/abs/1710.00814">Detecting Adversarial Attacks on Neural Network Policies with Visual Foresight</a>: Deep reinforcement learning has shown promising results in learning control policies for complex sequential decision-making tasks. However, these neural network-based policies are known to be vulnerable to adversarial examples. This vulnerability poses a potentially serious threat to safety-critical systems such as autonomous vehicles. In this paper, we propose a defense mechanism to defend reinforcement learning agents from adversarial attacks by leveraging an action-conditioned frame prediction module. Our core idea is that the adversarial examples targeting a neural network-based policy are not effective for the frame prediction model. By comparing the action distribution produced by a policy from processing the current observed frame to the action distribution produced by the same policy from processing the predicted frame from the action-conditioned frame prediction module, we can detect the presence of adversarial examples. Beyond detecting the presence of adversarial examples, our method allows the agent to continue performing the task using the predicted frame when the agent is under attack. We evaluate the performance of our algorithm using five games in Atari 2600. Our results demonstrate that the proposed defense mechanism achieves favorable performance against baseline algorithms in detecting adversarial examples and in earning rewards when the agents are under attack.</li>
<li><a href="http://arxiv.org/abs/1612.02559">AGA: Attribute Guided Augmentation</a>: We consider the problem of data augmentation, i.e., generating artificial samples to extend a given corpus of training data. Specifically, we propose attribute-guided augmentation (AGA), which learns a mapping for synthesizing data such that an attribute of a synthesized sample is at a desired value or strength. This is particularly interesting in situations where little data with no attribute annotation is available for learning, but we have access to a large external corpus of heavily annotated samples. While prior works primarily augment in the space of images, we propose to perform augmentation in feature space instead. We implement our approach as a deep encoder-decoder architecture that learns the synthesis function in an end-to-end manner. We demonstrate the utility of our approach on the problems of (1) one-shot object recognition in a transfer-learning setting where we have no prior knowledge of the new classes, as well as (2) object-based one-shot scene recognition. As external data, we leverage 3D depth and pose information from the SUN RGB-D dataset. Our experiments show that attribute-guided augmentation of high-level CNN features considerably improves one-shot recognition performance on both problems.</li>
<li><a href="http://arxiv.org/abs/1708.00489">A Geometric Approach to Active Learning for Convolutional Neural Networks</a>: Convolutional neural networks (CNNs) have been successfully applied to many recognition and learning tasks using a universal recipe: training a deep model on a very large dataset of supervised examples. However, this approach is rather restrictive in practice since collecting a large set of labeled images is very expensive. One way to ease this problem is to come up with smart ways of choosing images to be labelled from a very large collection (i.e. active learning). In this paper, we first show that uncertainty-based active learning heuristics are not effective for CNNs even in an oracle setting. Our counterintuitive empirical results make us question these heuristics and inspire us to come up with a simple but effective method, choosing a set of images to label such that they cover the set of unlabeled images as closely as possible. We further present a theoretical justification for this geometric heuristic by giving a bound over the generalization error of CNNs. Our experiments show that the proposed method outperforms existing approaches in image classification experiments by a large margin.</li>
<li><a href="http://arxiv.org/abs/1703.11000">Learning Visual Servoing with Deep Features and Fitted Q-Iteration</a>: Visual servoing involves choosing actions that move a robot in response to observations from a camera, in order to reach a goal configuration in the world. Standard visual servoing approaches typically rely on manually designed features and analytical dynamics models, which limits their generalization capability and often requires extensive application-specific feature and model engineering. In this work, we study how learned visual features, learned predictive dynamics models, and reinforcement learning can be combined to learn visual servoing mechanisms. We focus on target following, with the goal of designing algorithms that can learn a visual servo using small amounts of data of the target in question, to enable quick adaptation to new targets. Our approach is based on servoing the camera in the space of learned visual features, rather than image pixels or manually-designed keypoints. We demonstrate that standard deep features, in our case taken from a model trained for object classification, can be used together with a bilinear predictive model to learn an effective visual servo that is robust to visual variation, changes in viewing angle and appearance, and occlusions. A key component of our approach is to use a sample-efficient fitted Q-iteration algorithm to learn which features are best suited for the task at hand. We show that we can learn an effective visual servo on a complex synthetic car following benchmark using just 20 training trajectory samples for reinforcement learning. We demonstrate substantial improvement over a conventional approach based on image pixels or hand-designed keypoints, and we show an improvement in sample-efficiency of more than two orders of magnitude over standard model-free deep reinforcement learning algorithms. Videos are available at http://rll.berkeley.edu/visual_servoing.</li>
<li><a href="http://arxiv.org/abs/1612.06704">Action-Driven Object Detection with Top-Down Visual Attentions</a>: A dominant paradigm for deep learning based object detection relies on a "bottom-up" approach using "passive" scoring of class agnostic proposals. These approaches are efficient but lack holistic analysis of scene-level context. In this paper, we present an "action-driven" detection mechanism using our "top-down" visual attention model. We localize an object by taking sequential actions that the attention model provides. The attention model, conditioned on an image region, provides the actions required to get closer to a target object. An action at each time step is weak by itself, but an ensemble of the sequential actions makes a bounding box accurately converge to a target object boundary. This attention model, which we call AttentionNet, is composed of a convolutional neural network. During our whole detection procedure, we only utilize the actions from a single AttentionNet without any modules for object proposals or post bounding-box regression. We evaluate our top-down detection mechanism over the PASCAL VOC series and ILSVRC CLS-LOC dataset, and achieve state-of-the-art performance compared to the major bottom-up detection methods. In particular, our detection mechanism shows a strong advantage in elaborate localization by outperforming Faster R-CNN with a margin of +7.1% over PASCAL VOC 2007 when we increase the IoU threshold for positive detection to 0.7.</li>
<li><a href="http://arxiv.org/abs/1708.00973">Attention Transfer from Web Images for Video Recognition</a>: Training deep learning based video classifiers for action recognition requires a large amount of labeled videos. The labeling process is labor-intensive and time-consuming. On the other hand, a large number of weakly-labeled images are uploaded to the Internet by users every day. To harness the rich and highly diverse set of Web images, a scalable approach is to crawl these images to train deep learning based classifiers, such as Convolutional Neural Networks (CNNs). However, due to the domain shift problem, the performance of deep classifiers trained on Web images tends to degrade when they are directly deployed to videos. One way to address this problem is to fine-tune the trained models on videos, but a sufficient amount of annotated videos is still required. In this work, we propose a novel approach to transfer knowledge from the image domain to the video domain. The proposed method can adapt to the target domain (i.e. video data) with a limited amount of training data. Our method maps the video frames into a low-dimensional feature space using the class-discriminative spatial attention map for CNNs. We design a novel Siamese EnergyNet structure to learn energy functions on the attention maps by jointly optimizing two loss functions, such that the attention map corresponding to a ground truth concept would have higher energy. We conduct extensive experiments on two challenging video recognition datasets (i.e. TVHI and UCF101), and demonstrate the efficacy of our proposed method.</li>
<li><a href="http://arxiv.org/abs/1707.07341">Prediction-Constrained Training for Semi-Supervised Mixture and Topic   Models</a>: Supervisory signals have the potential to make low-dimensional data representations, like those learned by mixture and topic models, more interpretable and useful. We propose a framework for training latent variable models that explicitly balances two goals: recovery of faithful generative explanations of high-dimensional data, and accurate prediction of associated semantic labels. Existing approaches fail to achieve these goals due to an incomplete treatment of a fundamental asymmetry: the intended application is always predicting labels from data, not data from labels. Our prediction-constrained objective for training generative models coherently integrates loss-based supervisory signals while enabling effective semi-supervised learning from partially labeled data. We derive learning algorithms for semi-supervised mixture and topic models using stochastic gradient descent with automatic differentiation. We demonstrate improved prediction quality compared to several previous supervised topic models, achieving predictions competitive with high-dimensional logistic regression on text sentiment analysis and electronic health records tasks while simultaneously learning interpretable topics.</li>
<li><a href="http://arxiv.org/abs/1707.06728">Efficient Defenses Against Adversarial Attacks</a>: Following the recent adoption of deep neural networks (DNN) across a wide range of applications, adversarial attacks against these models have proven to be an indisputable threat. Adversarial samples are crafted with a deliberate intention of undermining a system. In the case of DNNs, the lack of a better understanding of how they work has prevented the development of efficient defenses. In this paper, we propose a new defense method based on practical observations which is easy to integrate into models and performs better than state-of-the-art defenses. Our proposed solution is meant to reinforce the structure of a DNN, making its prediction more stable and less likely to be fooled by adversarial samples. We conduct an extensive experimental study proving the efficiency of our method against multiple attacks, comparing it to numerous defenses, both in white-box and black-box setups. Additionally, the implementation of our method brings almost no overhead to the training procedure, while maintaining the prediction performance of the original model on clean samples.</li>
<li><a href="http://arxiv.org/abs/1710.02318">A Semantic Relevance Based Neural Network for Text Summarization and Text Simplification</a>: Text summarization and text simplification are two major ways to simplify the text for poor readers, including children, non-native speakers, and the functionally illiterate. Text summarization produces a brief summary of the main ideas of the text, while text simplification aims to reduce the linguistic complexity of the text and retain the original meaning. Recently, most approaches for text summarization and text simplification are based on the sequence-to-sequence model, which achieves much success in many text generation tasks. However, although the generated simplified texts are similar to source texts literally, they have low semantic relevance. In this work, our goal is to improve semantic relevance between source texts and simplified texts for text summarization and text simplification. We introduce a Semantic Relevance Based neural model to encourage high semantic similarity between texts and summaries. In our model, the source text is represented by a gated attention encoder, while the summary representation is produced by a decoder. In addition, the similarity score between the representations is maximized during training. Our experiments show that the proposed model outperforms the state-of-the-art systems on two benchmark corpora.</li>
<li><a href="http://arxiv.org/abs/1612.01064">Trained Ternary Quantization</a>: Deep neural networks are widely used in machine learning applications. However, large neural network models can be difficult to deploy on mobile devices with limited power budgets. To solve this problem, we propose Trained Ternary Quantization (TTQ), a method that can reduce the precision of weights in neural networks to ternary values. This method has very little accuracy degradation and can even improve the accuracy of some models (32, 44, 56-layer ResNet) on CIFAR-10 and AlexNet on ImageNet. Our AlexNet model is trained from scratch, which means it is as easy to train as a normal full-precision model. We highlight that our trained quantization method can learn both ternary values and ternary assignment. During inference, only ternary values (2-bit weights) and scaling factors are needed, therefore our models are nearly 16x smaller than full-precision models. Our ternary models can also be viewed as sparse binary weight networks, which can potentially be accelerated with custom circuits. Experiments on CIFAR-10 show that the ternary models obtained by the trained quantization method outperform full-precision models of ResNet-32, 44, 56 by 0.04%, 0.16%, 0.36%, respectively. On ImageNet, our model outperforms the full-precision AlexNet model by 0.3% in Top-1 accuracy and outperforms previous ternary models by 3%.</li>
<li><a href="http://arxiv.org/abs/1705.02553">Experimental results : Reinforcement Learning of POMDPs using Spectral Methods</a>: We propose a new reinforcement learning algorithm for partially observable Markov decision processes (POMDP) based on spectral decomposition methods. While spectral methods have been previously employed for consistent learning of (passive) latent variable models such as hidden Markov models, POMDPs are more challenging since the learner interacts with the environment and possibly changes the future observations in the process. We devise a learning algorithm that runs through epochs; in each epoch we employ spectral techniques to learn the POMDP parameters from a trajectory generated by a fixed policy. At the end of the epoch, an optimization oracle returns the optimal memoryless planning policy which maximizes the expected reward based on the estimated POMDP model. We prove an order-optimal regret bound with respect to the optimal memoryless policy and efficient scaling with respect to the dimensionality of observation and action spaces.</li>
<li><a href="http://arxiv.org/abs/1612.06139">Neural Machine Translation from Simplified Translations</a>: Text simplification aims at reducing the lexical, grammatical and structural complexity of a text while keeping the same meaning. In the context of machine translation, we introduce the idea of simplified translations in order to boost the learning ability of deep neural translation models. We conduct preliminary experiments showing that translation complexity is actually reduced in a translation of a source bi-text compared to the target reference of the bi-text while using a neural machine translation (NMT) system learned on the exact same bi-text. Based on the idea of knowledge distillation, we then train an NMT system using the simplified bi-text, and show that it outperforms the initial system that was built over the reference data set. Performance is further boosted when both reference and automatic translations are used to learn the network. We perform an elementary analysis of the translated corpus and report accuracy results of the proposed approach on English-to-French and English-to-German translation tasks.</li>
<li><a href="http://arxiv.org/abs/1710.02410">End-to-end Driving via Conditional Imitation Learning</a>: Deep networks trained on demonstrations of human driving have learned to follow roads and avoid obstacles. However, driving policies trained via imitation learning cannot be controlled at test time. A vehicle trained end-to-end to imitate an expert cannot be guided to take a specific turn at an upcoming intersection. This limits the utility of such systems. We propose to condition imitation learning on high-level command input. At test time, the learned driving policy functions as a chauffeur that handles sensorimotor coordination but continues to respond to navigational commands. We evaluate different architectures for conditional imitation learning in vision-based driving. We conduct experiments in realistic three-dimensional simulations of urban driving and on a 1/5 scale robotic truck that is trained to drive in a residential area. Both systems drive based on visual input yet remain responsive to high-level navigational commands. Experimental results demonstrate that the presented approach significantly outperforms a number of baselines. The supplementary video can be viewed at https://youtu.be/cFtnflNe5fM</li>
<li><a href="http://arxiv.org/abs/1710.04043">Interactive Medical Image Segmentation using Deep Learning with   Image-specific Fine-tuning</a>: Convolutional neural networks (CNNs) have achieved state-of-the-art performance for automatic medical image segmentation. However, they have not demonstrated sufficiently accurate and robust results for clinical use. In addition, they are limited by the lack of image-specific adaptation and the lack of generalizability to previously unseen object classes. To address these problems, we propose a novel deep learning-based framework for interactive segmentation by incorporating CNNs into a bounding box and scribble-based segmentation pipeline. We propose image-specific fine-tuning to make a CNN model adaptive to a specific test image, which can be either unsupervised (without additional user interactions) or supervised (with additional scribbles). We also propose a weighted loss function considering network and interaction-based uncertainty for the fine-tuning. We applied this framework to two applications: 2D segmentation of multiple organs from fetal MR slices, where only two types of these organs were annotated for training; and 3D segmentation of brain tumor core (excluding edema) and whole brain tumor (including edema) from different MR sequences, where only tumor cores in one MR sequence were annotated for training. Experimental results show that 1) our model is more robust to segment previously unseen objects than state-of-the-art CNNs; 2) image-specific fine-tuning with the proposed weighted loss function significantly improves segmentation accuracy; and 3) our method leads to accurate results with fewer user interactions and less user time than traditional interactive segmentation methods.</li>
<li><a href="http://arxiv.org/abs/1708.00938">Associative Domain Adaptation</a>: We propose associative domain adaptation, a novel technique for end-to-end domain adaptation with neural networks, the task of inferring class labels for an unlabeled target domain based on the statistical properties of a labeled source domain. Our training scheme follows the paradigm that in order to effectively derive class labels for the target domain, a network should produce statistically domain invariant embeddings, while minimizing the classification error on the labeled source domain. We accomplish this by reinforcing associations between source and target data directly in embedding space. Our method can easily be added to any existing classification network with no structural and almost no computational overhead. We demonstrate the effectiveness of our approach on various benchmarks and achieve state-of-the-art results across the board with a generic convolutional neural network architecture not specifically tuned to the respective tasks. Finally, we show that the proposed association loss produces embeddings that are more effective for domain adaptation compared to methods employing maximum mean discrepancy as a similarity measure in embedding space.</li>
<li><a href="http://arxiv.org/abs/1612.08879">Deep Unsupervised Representation Learning for Remote Sensing Images</a>: Scene classification plays a key role in interpreting remotely sensed high-resolution images. With the development of deep learning, supervised learning with convolutional neural networks (CNNs) has been frequently adopted for remote sensing classification. However, researchers have paid less attention to unsupervised learning in remote sensing with CNNs. In order to fill this gap, this paper proposes a set of CNNs called Multiple lAyeR feaTure mAtching (MARTA) generative adversarial networks (GANs) to learn representations using only unlabeled data. MARTA GANs involve two models: (1) a generative model G that captures the data distribution and provides more training data; (2) a discriminative model D that estimates the probability that a sample came from the training data rather than G, and in this way a well-formed representation of the dataset can be learned. MARTA GANs obtain state-of-the-art results on the UC-Merced Land-use dataset and the Brazilian Coffee Scenes dataset.</li>
<li><a href="http://arxiv.org/abs/1707.08866">Deep Residual Learning for Weakly-Supervised Relation Extraction</a>: Deep residual learning (ResNet) is a new method for training very deep neural networks using identity mapping for shortcut connections. ResNet has won the ImageNet ILSVRC 2015 classification task, and achieved state-of-the-art performance in many computer vision tasks. However, the effect of residual learning on noisy natural language processing tasks is still not well understood. In this paper, we design a novel convolutional neural network (CNN) with residual learning, and investigate its impact on the task of distantly supervised noisy relation extraction. Contrary to the popular belief that ResNet only works well for very deep networks, we found that even with 9 layers of CNNs, using identity mapping could significantly improve the performance for distantly-supervised relation extraction.</li>
<li><a href="http://arxiv.org/abs/1707.08316">Learning Sparse Representations in Reinforcement Learning with Sparse Coding</a>: A variety of representation learning approaches have been investigated for reinforcement learning; much less attention, however, has been given to investigating the utility of sparse coding. Outside of reinforcement learning, sparse coding representations have been widely used, with non-convex objectives that result in discriminative representations. In this work, we develop a supervised sparse coding objective for policy evaluation. Despite the non-convexity of this objective, we prove that all local minima are global minima, making the approach amenable to simple optimization strategies. We empirically show that it is key to use a supervised objective, rather than the more straightforward unsupervised sparse coding approach. We compare the learned representations to a canonical fixed sparse representation, called tile-coding, demonstrating that the sparse coding representation outperforms a wide variety of tile-coding representations.</li>
<li><a href="http://arxiv.org/abs/1710.02242">Solving differential equations with unknown constitutive relations as recurrent neural networks</a>: We solve a system of ordinary differential equations with an unknown functional form of a sink (reaction rate) term. We assume that the measurements (time series) of state variables are partially available, and we use a recurrent neural network to "learn" the reaction rate from this data. This is achieved by including discretized ordinary differential equations as part of a recurrent neural network training problem. We extend TensorFlow's recurrent neural network architecture to create a simple but scalable and effective solver for the unknown functions, and apply it to a fed-batch bioreactor simulation problem. Use of techniques from recent deep learning literature enables training of functions with behavior manifesting over thousands of time steps. Our networks are structurally similar to recurrent neural networks, but differences in design and function require modifications to the conventional wisdom about training such networks.</li>
<li><a href="http://arxiv.org/abs/1612.04898">Efficient Distributed Semi-Supervised Learning using Stochastic   Regularization over Affinity Graphs</a>: We describe a computationally efficient, stochastic graph-regularization technique that can be utilized for the semi-supervised training of deep neural networks in a parallel or distributed setting. We utilize a technique, first described in [13] for the construction of mini-batches for stochastic gradient descent (SGD) based on synthesized partitions of an affinity graph that are consistent with the graph structure, but also preserve enough stochasticity for convergence of SGD to good local minima. We show how our technique allows a graph-based semi-supervised loss function to be decomposed into a sum over objectives, facilitating data parallelism for scalable training of machine learning models. Empirical results indicate that our method significantly improves classification accuracy compared to the fully-supervised case when the fraction of labeled data is low, and in the parallel case, achieves significant speed-up in terms of wall-clock time to convergence. We show the results for both sequential and distributed-memory semi-supervised DNN training on a speech corpus.</li>
<li><a href="http://arxiv.org/abs/1707.05853">Encoding Word Confusion Networks with Recurrent Neural Networks for   Dialog State Tracking</a>: This paper presents our novel method to encode word confusion networks, which can represent a rich hypothesis space of automatic speech recognition systems, via recurrent neural networks. We demonstrate the utility of our approach for the task of dialog state tracking in spoken dialog systems that relies on automatic speech recognition output. Encoding confusion networks outperforms encoding the best hypothesis of the automatic speech recognition in a neural system for dialog state tracking on the well-known second Dialog State Tracking Challenge dataset.</li>
<li><a href="http://arxiv.org/abs/1612.00745">Cognitive Deep Machine Can Train Itself</a>: Machine learning is making substantial progress in diverse applications. The success is mostly due to advances in deep learning. However, deep learning can make mistakes and its generalization abilities to new tasks are questionable. We ask when and how one can combine network outputs, when (i) details of the observations are evaluated by learned deep components and (ii) facts and confirmation rules are available in knowledge based systems. We show that in limited contexts the required number of training samples can be low and self-improvement of pre-trained networks in more general context is possible. We argue that the combination of sparse outlier detection with deep components that can support each other diminishes the fragility of deep methods, an important requirement for engineering applications. We argue that supervised learning of labels may be fully eliminated under certain conditions: a component based architecture together with a knowledge based system can train itself and provide high quality answers. We demonstrate these concepts on the State Farm Distracted Driver Detection benchmark. We argue that the view of the Study Panel (2016) may overestimate the requirements on 'years of focused research' and 'careful, unique construction' for 'AI systems'.</li>
<li><a href="http://arxiv.org/abs/1704.06885">A General Theory for Training Learning Machine</a>: Though deep learning is pushing machine learning to a new stage, basic theories of machine learning are still limited. The principle of learning, the role of a priori knowledge, the role of neuron bias, and the basis for choosing the neural transfer function and cost function, etc., are still far from clear. In this paper, we present a general theoretical framework for machine learning. We classify the prior knowledge into common and problem-dependent parts, and consider that the aim of learning is to maximally incorporate them. The principle we suggest for maximizing the former is the design risk minimization principle, while the neural transfer function, the cost function, as well as pretreatment of samples, are endowed with the role of maximizing the latter. The role of the neuron bias is explained from a different angle. We develop a Monte Carlo algorithm to establish the input-output responses, and we control the input-output sensitivity of a learning machine by controlling that of individual neurons. Applications of function approaching and smoothing, pattern recognition and classification, are provided to illustrate how to train general learning machines based on our theory and algorithm. Our method may in addition induce new applications, such as transductive inference.</li>
<li><a href="http://arxiv.org/abs/1709.03849">Spatio-temporal Learning with Arrays of Analog Nanosynapses</a>: Emerging nanodevices such as resistive memories are being considered for hardware realizations of a variety of artificial neural networks (ANNs), including highly promising online variants of the learning approaches known as reservoir computing (RC) and the extreme learning machine (ELM). We propose an RC/ELM inspired learning system built with nanosynapses that performs both on-chip projection and regression operations. To address time-dynamic tasks, the hidden neurons of our system perform spatio-temporal integration and can be further enhanced with variable sampling or multiple activation windows. We detail the system and show its use in conjunction with a highly analog nanosynapse device on a standard task with intrinsic timing dynamics: the TI-46 battery of spoken digits. The system achieves nearly perfect (99%) accuracy at sufficient hidden layer size, which compares favorably with software results. In addition, the model is extended to a larger dataset, the MNIST database of handwritten digits. By translating the database into the time domain and using variable integration windows, up to 95% classification accuracy is achieved. In addition to an intrinsically low-power programming style, the proposed architecture learns very quickly and can easily be converted into a spiking system with negligible loss in performance, all features that confer significant energy efficiency.</li>
<li><a href="http://arxiv.org/abs/1707.06203">Imagination-Augmented Agents for Deep Reinforcement Learning</a>: We introduce Imagination-Augmented Agents (I2As), a novel architecture for deep reinforcement learning combining model-free and model-based aspects. In contrast to most existing model-based reinforcement learning and planning methods, which prescribe how a model should be used to arrive at a policy, I2As learn to interpret predictions from a learned environment model to construct implicit plans in arbitrary ways, by using the predictions as additional context in deep policy networks. I2As show improved data efficiency, performance, and robustness to model misspecification compared to several baselines.</li>
<li><a href="http://arxiv.org/abs/1704.05495">Investigating Recurrence and Eligibility Traces in Deep Q-Networks</a>: Eligibility traces in reinforcement learning are used as a bias-variance trade-off and can often speed up training time by propagating knowledge back over time-steps in a single update. We investigate the use of eligibility traces in combination with recurrent networks in the Atari domain. We illustrate the benefits of both recurrent nets and eligibility traces in some Atari games, and highlight also the importance of the optimization used in the training.</li>
<li><a href="http://arxiv.org/abs/1612.01380">From One-Trick Ponies to All-Rounders: On-Demand Learning for Image   Restoration</a>: While machine learning approaches to image restoration offer great promise, current methods risk training "one-trick ponies" that perform well only for image corruption of a particular level of difficulty--such as a certain level of noise or blur. First, we examine the weakness of a one-trick pony model and demonstrate that training general models to handle arbitrary levels of corruption is indeed non-trivial. Then, we propose an on-demand learning algorithm for training image restoration models with deep convolutional neural networks. The main idea is to exploit a feedback mechanism to self-generate training instances where they are needed most, thereby learning models that can generalize across difficulty levels. On four restoration tasks---image inpainting, pixel interpolation, image deblurring, and image denoising---and three diverse datasets, our approach consistently outperforms both the status quo training procedure and curriculum learning alternatives.</li>
<li><a href="http://arxiv.org/abs/1707.04818">RED: Reinforced Encoder-Decoder Networks for Action Anticipation</a>: Action anticipation aims to detect an action before it happens. Many real world applications in robotics and surveillance are related to this predictive capability. Current methods address this problem by first anticipating visual representations of future frames and then categorizing the anticipated representations to actions. However, anticipation is based on a single past frame's representation, which ignores the history trend. Besides, it can only anticipate a fixed future time. We propose a Reinforced Encoder-Decoder (RED) network for action anticipation. RED takes multiple history representations as input and learns to anticipate a sequence of future representations. One salient aspect of RED is that a reinforcement module is adopted to provide sequence-level supervision; the reward function is designed to encourage the system to make correct predictions as early as possible. We test RED on TVSeries, THUMOS-14 and TV-Human-Interaction datasets for action anticipation and achieve state-of-the-art performance on all datasets.</li>
<li><a href="http://arxiv.org/abs/1709.02848">Improving Heterogeneous Face Recognition with Conditional Adversarial   Networks</a>: Heterogeneous face recognition between color image and depth image is a much desired capacity for real world applications where shape information is looked upon as merely involved in gallery. In this paper, we propose a cross-modal deep learning method as an effective and efficient workaround for this challenge. Specifically, we begin with learning two convolutional neural networks (CNNs) to extract 2D and 2.5D face features individually. Once trained, they can serve as pre-trained models for another two-way CNN which explores the correlated part between color and depth for heterogeneous matching. Compared with most conventional cross-modal approaches, our method additionally conducts accurate depth image reconstruction from single color image with Conditional Generative Adversarial Nets (cGAN), and further enhances the recognition performance by fusing multi-modal matching results. Through both qualitative and quantitative experiments on benchmark FRGC 2D/3D face database, we demonstrate that the proposed pipeline outperforms state-of-the-art performance on heterogeneous face recognition and ensures a drastically efficient on-line stage.</li>
<li><a href="http://arxiv.org/abs/1707.06588">Voice Synthesis for in-the-Wild Speakers via a Phonological Loop</a>: We present a new neural text to speech method that is able to transform text to speech in voices that are sampled in the wild. Unlike other text to speech systems, our solution is able to deal with unconstrained samples obtained from public speeches. The network architecture is simpler than those in the existing literature and is based on a novel shifting buffer working memory. The same buffer is used for estimating the attention, computing the output audio, and for updating the buffer itself. The input sentence is encoded using a context-free lookup table that contains one entry per character or phoneme. Lastly, the speakers are similarly represented by a short vector that can also be fitted to new speakers and variability in the generated speech is achieved by priming the buffer prior to generating the audio. Experimental results on two datasets demonstrate convincing multi-speaker and in-the-wild capabilities. In order to promote reproducibility, we release our source code and models: PyTorch code and sample audio files are available at ytaigman.github.io/loop.</li>
<li><a href="http://arxiv.org/abs/1612.05082">A Fully Convolutional Deep Auditory Model for Musical Chord Recognition</a>: Chord recognition systems depend on robust feature extraction pipelines. While these pipelines are traditionally hand-crafted, recent advances in end-to-end machine learning have begun to inspire researchers to explore data-driven methods for such tasks. In this paper, we present a chord recognition system that uses a fully convolutional deep auditory model for feature extraction. The extracted features are processed by a Conditional Random Field that decodes the final chord sequence. Both processing stages are trained automatically and do not require expert knowledge for optimising parameters. We show that the learned auditory system extracts musically interpretable features, and that the proposed chord recognition system achieves results on par or better than state-of-the-art algorithms.</li>
<li><a href="http://arxiv.org/abs/1707.09423">Visual Relationship Detection with Internal and External Linguistic   Knowledge Distillation</a>: Understanding visual relationships involves identifying the subject, the object, and a predicate relating them. We leverage the strong correlations between the predicate and the (subj,obj) pair (both semantically and spatially) to predict the predicates conditioned on the subjects and the objects. Modeling the three entities jointly more accurately reflects their relationships, but complicates learning since the semantic space of visual relationships is huge and the training data is limited, especially for the long-tail relationships that have few instances. To overcome this, we use knowledge of linguistic statistics to regularize visual model learning. We obtain linguistic knowledge by mining from both training annotations (internal knowledge) and publicly available text, e.g., Wikipedia (external knowledge), computing the conditional probability distribution of a predicate given a (subj,obj) pair. Then, we distill the knowledge into a deep model to achieve better generalization. Our experimental results on the Visual Relationship Detection (VRD) and Visual Genome datasets suggest that with this linguistic knowledge distillation, our model outperforms the state-of-the-art methods significantly, especially when predicting unseen relationships (e.g., recall improved from 8.45% to 19.17% on VRD zero-shot testing set).</li>
<li><a href="http://arxiv.org/abs/1704.07987">Stochastic Orthant-Wise Limited-Memory Quasi-Newton Method</a>: The $\ell_1$-regularized sparse model has been favourably used in the machine learning community. Due to its non-smoothness, fast optimizers like quasi-Newton methods cannot be directly applied. In this paper, we propose the first stochastic limited-memory quasi-Newton optimizer that specializes in strongly convex loss functions with $\ell_1$-regularization. The optimizer consists of three alignment steps, generalized from the batch version of the OWL-QN optimizer, that encourage the parameter update to be orthant-wise. We adopt several practical features from recent stochastic variants of L-BFGS and variance reduction of the subsampled gradient. We also employ various sketch techniques on the Hessian matrix inversion, squeezing in more curvature information and accelerating convergence. We prove a linear convergence rate for our optimizer, and experimentally demonstrate that it outperforms other linearly convergent optimizers on a large-scale sparse logistic regression task.</li>
<li><a href="http://arxiv.org/abs/1708.01022">When Kernel Methods meet Feature Learning: Log-Covariance Network for Action Recognition from Skeletal Data</a>: Human action recognition from skeletal data is a hot research topic and important in many open domain applications of computer vision, thanks to recently introduced 3D sensors. In the literature, naive methods simply transfer off-the-shelf techniques from video to the skeletal representation. However, the current state-of-the-art is contended between two different paradigms: kernel-based methods and feature learning with (recurrent) neural networks. Both approaches show strong performance, yet they exhibit heavy, but complementary, drawbacks. Motivated by this fact, our work aims at combining the best of the two paradigms, by proposing an approach where a shallow network is fed with a covariance representation. Our intuition is that, as long as the dynamics are effectively modeled, there is no need for the classification network to be deep or recurrent in order to score favorably. We validate this hypothesis in a broad experimental analysis over 6 publicly available datasets.</li>
<li><a href="http://arxiv.org/abs/1703.00564">MoleculeNet: A Benchmark for Molecular Machine Learning</a>: Molecular machine learning has been maturing rapidly over the last few years. Improved methods and the presence of larger datasets have enabled machine learning algorithms to make increasingly accurate predictions about molecular properties. However, algorithmic progress has been limited due to the lack of a standard benchmark to compare the efficacy of proposed methods; most new algorithms are benchmarked on different datasets making it challenging to gauge the quality of proposed methods. This work introduces MoleculeNet, a large scale benchmark for molecular machine learning. MoleculeNet curates multiple public datasets, establishes metrics for evaluation, and offers high quality open-source implementations of multiple previously proposed molecular featurization and learning algorithms (released as part of the DeepChem open source library). MoleculeNet benchmarks demonstrate that learnable representations, and in particular graph convolutional networks, are powerful tools for molecular machine learning and broadly offer the best performance. However, for quantum mechanical and biophysical datasets, the use of physics-aware featurizations can be significantly more important than choice of particular learning algorithm.</li>
<li><a href="http://arxiv.org/abs/1707.06170">Learning model-based planning from scratch</a>: Conventional wisdom holds that model-based planning is a powerful approach to sequential decision-making. It is often very challenging in practice, however, because while a model can be used to evaluate a plan, it does not prescribe how to construct a plan. Here we introduce the "Imagination-based Planner", the first model-based, sequential decision-making agent that can learn to construct, evaluate, and execute plans. Before any action, it can perform a variable number of imagination steps, which involve proposing an imagined action and evaluating it with its model-based imagination. All imagined actions and outcomes are aggregated, iteratively, into a "plan context" which conditions future real and imagined actions. The agent can even decide how to imagine: testing out alternative imagined actions, chaining sequences of actions together, or building a more complex "imagination tree" by navigating flexibly among the previously imagined states using a learned policy. And our agent can learn to plan economically, jointly optimizing for external rewards and computational costs associated with using its imagination. We show that our architecture can learn to solve a challenging continuous control problem, and also learn elaborate planning strategies in a discrete maze-solving task. Our work opens a new direction toward learning the components of a model-based planning system and how to use them.</li>
<li><a href="http://arxiv.org/abs/1710.03255">Multitask training with unlabeled data for end-to-end sign language fingerspelling recognition</a>: We address the problem of automatic American Sign Language fingerspelling recognition from video. Prior work has largely relied on frame-level labels, hand-crafted features, or other constraints, and has been hampered by the scarcity of data for this task. We introduce a model for fingerspelling recognition that addresses these issues. The model consists of an auto-encoder-based feature extractor and an attention-based neural encoder-decoder, which are trained jointly. The model receives a sequence of image frames and outputs the fingerspelled word, without relying on any frame-level training labels or hand-crafted features. In addition, the auto-encoder subcomponent makes it possible to leverage unlabeled data to improve the feature learning. The model achieves absolute letter accuracy improvements of 11.6% and 4.4% in signer-independent and signer-adapted fingerspelling recognition, respectively, over previous approaches that required frame-level training labels.</li>
<li><a href="http://arxiv.org/abs/1707.09364">Improved Face Detection and Alignment using Cascade Deep Convolutional Network</a>: Real-world face detection and alignment demand an advanced discriminative model to address challenges posed by pose, lighting and expression. Inspired by deep learning, several face detection and alignment methods based on convolutional neural networks have been proposed. Recent studies have utilized the relation between face detection and alignment to make models computationally efficient; however, they ignore the connection between the CNNs in the cascade. In this paper, we propose a structure that produces higher-quality training data for end-to-end cascade network training, which gives the network more room to automatically adjust its weight parameters and accelerates convergence. Experiments demonstrate considerable improvement over existing detection and alignment models.</li>
<li><a href="http://arxiv.org/abs/1702.04018">Intercomparison of Machine Learning Methods for Statistical Downscaling: The Case of Daily and Extreme Precipitation</a>: Statistical downscaling of global climate models (GCMs) allows researchers to study local climate change effects decades into the future. A wide range of statistical models have been applied to downscaling GCMs, but recent advances in machine learning have not been explored. In this paper, we compare four fundamental statistical methods, Bias Correction Spatial Disaggregation (BCSD), Ordinary Least Squares, Elastic-Net, and Support Vector Machine, with three more advanced machine learning methods, Multi-task Sparse Structure Learning (MSSL), BCSD coupled with MSSL, and Convolutional Neural Networks, to downscale daily precipitation in the Northeast United States. Metrics evaluating each method's ability to capture daily anomalies, large scale climate shifts, and extremes are analyzed. We find that linear methods, led by BCSD, consistently outperform non-linear approaches. The direct application of state-of-the-art machine learning methods to statistical downscaling does not provide improvements over simpler, longstanding approaches.</li>
<li><a href="http://arxiv.org/abs/1708.00781">Dynamic Entity Representations in Neural Language Models</a>: Understanding a long document requires tracking how entities are introduced and evolve over time. We present a new type of language model, EntityNLM, that can explicitly model entities, dynamically update their representations, and contextually generate their mentions. Our model is generative and flexible; it can model an arbitrary number of entities in context while generating each entity mention at an arbitrary length. In addition, it can be used for several different tasks such as language modeling, coreference resolution, and entity prediction. Experimental results with all these tasks demonstrate that our model consistently outperforms strong baselines and prior work.</li>
<li><a href="http://arxiv.org/abs/1709.08868">Learning Multi-grid Generative ConvNets by Minimal Contrastive   Divergence</a>: This paper proposes a minimal contrastive divergence method for learning energy-based generative ConvNet models of images at multiple grids (or scales) simultaneously. For each grid, we learn an energy-based probabilistic model where the energy function is defined by a bottom-up convolutional neural network (ConvNet or CNN). Learning such a model requires generating synthesized examples from the model. Within each iteration of our learning algorithm, for each observed training image, we generate synthesized images at multiple grids by initializing the finite-step MCMC sampling from a minimal 1 x 1 version of the training image. The synthesized image at each subsequent grid is obtained by a finite-step MCMC initialized from the synthesized image generated at the previous coarser grid. After obtaining the synthesized examples, the parameters of the models at multiple grids are updated separately and simultaneously based on the differences between synthesized and observed examples. We call this learning method the multi-grid minimal contrastive divergence. We show that this method can learn realistic energy-based generative ConvNet models, and it outperforms the original contrastive divergence (CD) and persistent CD.</li>
<li><a href="http://arxiv.org/abs/1707.09476">FCN-rLSTM: Deep Spatio-Temporal Neural Networks for Vehicle Counting in   City Cameras</a>: In this paper, we develop deep spatio-temporal neural networks to sequentially count vehicles from low quality videos captured by city cameras (citycams). Citycam videos have low resolution, low frame rate, high occlusion and large perspective, making most existing methods lose their efficacy. To overcome limitations of existing methods and incorporate the temporal information of traffic video, we design a novel FCN-rLSTM network to jointly estimate vehicle density and vehicle count by connecting fully convolutional neural networks (FCN) with long short term memory networks (LSTM) in a residual learning fashion. Such design leverages the strengths of FCN for pixel-level prediction and the strengths of LSTM for learning complex temporal dynamics. The residual learning connection reformulates the vehicle count regression as learning residual functions with reference to the sum of densities in each frame, which significantly accelerates the training of networks. To preserve feature map resolution, we propose a Hyper-Atrous combination to integrate atrous convolution in FCN and combine feature maps of different convolution layers. FCN-rLSTM enables refined feature representation and a novel end-to-end trainable mapping from pixels to vehicle count. We extensively evaluated the proposed method on different counting tasks with three datasets, with experimental results demonstrating their effectiveness and robustness. In particular, FCN-rLSTM reduces the mean absolute error (MAE) from 5.31 to 4.21 on TRANCOS, and reduces the MAE from 2.74 to 1.53 on WebCamT. Training process is accelerated by 5 times on average.</li>
<li><a href="http://arxiv.org/abs/1612.06321">Image Retrieval with Deep Local Features and Attention-based Keypoints</a>: We introduce a local feature descriptor for large-scale image retrieval applications, called DELF (DEep Local Feature). The new feature is based on convolutional neural networks, which are trained without object- and patch-level annotations on a landmark image dataset. To enhance DELF's image retrieval performance, we also propose an attention mechanism for keypoint selection, which shares most network layers with the descriptor. This new framework can be used in image retrieval as a drop-in replacement for other keypoint detectors and descriptors, enabling more accurate feature matching and geometric verification. Our technique is particularly useful for the large-scale setting, where a system must operate with high precision. In this case, our system produces reliable confidence scores to reject false positives effectively---in particular, our system is robust against queries that have no correct match in the database. We present an evaluation methodology for this challenging retrieval setting, using standard and large-scale datasets. We show that recently proposed methods do not perform well in this setup; DELF outperforms several recent global and local descriptors by substantial margins.</li>
<li><a href="http://arxiv.org/abs/1707.09798">Unsupervised Visual Attribute Transfer with Reconfigurable Generative Adversarial Networks</a>: Learning to transfer visual attributes requires a supervised dataset: corresponding images of the same identity with varying attribute values are required for learning the transfer function. This largely limits applications, because capturing such images is often a difficult task. To address this issue, we propose an unsupervised method that learns to transfer visual attributes. The proposed method can learn the transfer function without any corresponding images. Inspecting visualization results from various unsupervised attribute transfer tasks, we verify the effectiveness of the proposed method.</li>
<li><a href="http://arxiv.org/abs/1705.09275">Who Will Share My Image? Predicting the Content Diffusion Path in Online   Social Networks</a>: Content popularity prediction has been extensively studied due to its importance and interest for both users and hosts of social media sites like Facebook, Instagram, Twitter, and Pinterest. However, existing work mainly focuses on modeling popularity using a single metric such as the total number of likes or shares. In this work, we propose Diffusion-LSTM, a memory-based deep recurrent network that learns to recursively predict the entire diffusion path of an image through a social network. By combining user social features and image features, and encoding the diffusion path taken thus far with an explicit memory cell, our model predicts the diffusion path of an image more accurately compared to alternate baselines that either encode only image or social features, or lack memory. By mapping individual users to user prototypes, our model can generalize to new users not seen during training. Finally, we demonstrate our model's capability of generating diffusion trees, and show that the generated trees closely resemble ground-truth trees.</li>
<li><a href="http://arxiv.org/abs/1612.03266">A Character-Word Compositional Neural Language Model for Finnish</a>: Inspired by recent research, we explore ways to model the highly morphological Finnish language at the level of characters while maintaining the performance of word-level models. We propose a new Character-to-Word-to-Character (C2W2C) compositional language model that uses characters as input and output while still internally processing word level embeddings. Our preliminary experiments, using the Finnish Europarl V7 corpus, indicate that C2W2C can respond well to the challenges of morphologically rich languages such as high out of vocabulary rates, the prediction of novel words, and growing vocabulary size. Notably, the model is able to correctly score inflectional forms that are not present in the training data and sample grammatically and semantically correct Finnish sentences character by character.</li>
<li><a href="http://arxiv.org/abs/1709.01295">SketchParse : Towards Rich Descriptions for Poorly Drawn Sketches using Multi-Task Hierarchical Deep Networks</a>: The ability to semantically interpret hand-drawn line sketches, although very challenging, can pave the way for novel applications in multimedia. We propose SketchParse, the first deep-network architecture for fully automatic parsing of freehand object sketches. SketchParse is configured as a two-level fully convolutional network. The first level contains shared layers common to all object categories. The second level contains a number of expert sub-networks. Each expert specializes in parsing sketches from object categories which contain structurally similar parts. Effectively, the two-level configuration enables our architecture to scale up efficiently as additional categories are added. We introduce a router layer which (i) relays sketch features from shared layers to the correct expert and (ii) eliminates the need to manually specify object category during inference. To bypass laborious part-level annotation, we sketchify photos from semantic object-part image datasets and use them for training. Our architecture also incorporates object pose prediction as a novel auxiliary task which boosts overall performance while providing supplementary information regarding the sketch. We demonstrate SketchParse's abilities (i) on two challenging large-scale sketch datasets, (ii) in parsing unseen, semantically related object categories, and (iii) in improving fine-grained sketch-based image retrieval. As a novel application, we also outline how SketchParse's output can be used to generate caption-style descriptions for hand-drawn sketches.</li>
<li><a href="http://arxiv.org/abs/1710.01214">Calligraphic Stylisation Learning with a Physiologically Plausible Model   of Movement and Recurrent Neural Networks</a>: We propose a computational framework to learn stylisation patterns from example drawings or writings, and then generate new trajectories that possess similar stylistic qualities. We particularly focus on the generation and stylisation of trajectories that are similar to the ones that can be seen in calligraphy and graffiti art. Our system is able to extract and learn dynamic and visual qualities from a small number of user defined examples which can be recorded with a digitiser device, such as a tablet, mouse or motion capture sensors. Our system is then able to transform new user drawn traces to be kinematically and stylistically similar to the training examples. We implement the system using a Recurrent Mixture Density Network (RMDN) combined with a representation given by the parameters of the Sigma Lognormal model, a physiologically plausible model of movement that has been shown to closely reproduce the velocity and trace of human handwriting gestures.</li>
<li><a href="http://arxiv.org/abs/1708.04347">Training Neural Networks with Very Little Data -- A Draft</a>: Deep neural networks are complex architectures composed of many layers of nodes, resulting in a large number of parameters including weights and biases that must be estimated through training the network. Larger and more complex networks typically require more training data for adequate convergence than their more simple counterparts. The data available to train these networks is often limited or imbalanced. We propose the radial transform in polar coordinate space for image augmentation to facilitate the training of neural networks from limited source data. Pixel-wise coordinate transforms provide representations of the original image in the polar coordinate system and both augment data as well as increase the diversity of poorly represented classes. Experiments performed on MNIST and a set of multimodal medical images using the AlexNet and GoogLeNet neural network models show high classification accuracy using the proposed method.</li>
<li><a href="http://arxiv.org/abs/1612.00334v">A Theoretical Framework for Robustness of (Deep) Classifiers against Adversarial Examples</a>: Most machine learning classifiers, including deep neural networks, are vulnerable to adversarial examples. Such inputs are typically generated by adding small but purposeful modifications that lead to incorrect outputs while remaining imperceptible to human eyes. The goal of this paper is not to introduce a single method, but to make theoretical steps towards fully understanding adversarial examples. By using concepts from topology, our theoretical analysis brings forth the key reasons why an adversarial example can fool a classifier ($f_1$) and adds its oracle ($f_2$, like human eyes) in such analysis. By investigating the topological relationship between two (pseudo)metric spaces corresponding to predictor $f_1$ and oracle $f_2$, we develop necessary and sufficient conditions that can determine if $f_1$ is always robust (strong-robust) against adversarial examples according to $f_2$. Interestingly, our theorems indicate that just one unnecessary feature can make $f_1$ not strong-robust, and the right feature representation learning is the key to getting a classifier that is both accurate and strong-robust.</li>
<li><a href="http://arxiv.org/abs/1710.02534">Contrastive Learning for Image Captioning</a>: Image captioning, a popular topic in computer vision, has achieved substantial progress in recent years. However, the distinctiveness of natural descriptions is often overlooked in previous work. It is closely related to the quality of captions, as distinctive captions are more likely to describe images with their unique aspects. In this work, we propose a new learning method, Contrastive Learning (CL), for image captioning. Specifically, via two constraints formulated on top of a reference model, the proposed method can encourage distinctiveness, while maintaining the overall quality of the generated captions. We tested our method on two challenging datasets, where it improves the baseline model by significant margins. We also showed in our studies that the proposed method is generic and can be used for models with various structures.</li>
<li><a href="http://arxiv.org/abs/1702.06506">PixelNet: Representation of the pixels, by the pixels, and for the   pixels</a>: We explore design principles for general pixel-level prediction problems, from low-level edge detection to mid-level surface normal estimation to high-level semantic segmentation. Convolutional predictors, such as the fully-convolutional network (FCN), have achieved remarkable success by exploiting the spatial redundancy of neighboring pixels through convolutional processing. Though computationally efficient, we point out that such approaches are not statistically efficient during learning precisely because spatial redundancy limits the information learned from neighboring pixels. We demonstrate that stratified sampling of pixels allows one to (1) add diversity during batch updates, speeding up learning; (2) explore complex nonlinear predictors, improving accuracy; and (3) efficiently train state-of-the-art models tabula rasa (i.e., "from scratch") for diverse pixel-labeling tasks. Our single architecture produces state-of-the-art results for semantic segmentation on PASCAL-Context dataset, surface normal estimation on NYUDv2 depth dataset, and edge detection on BSDS.</li>
<li><a href="http://arxiv.org/abs/1708.00376">Using Program Induction to Interpret Transition System Dynamics</a>: Explaining and reasoning about processes which underlie observed black-box phenomena enables the discovery of causal mechanisms, derivation of suitable abstract representations and the formulation of more robust predictions. We propose to learn high level functional programs in order to represent abstract models which capture the invariant structure in the observed data. We introduce the $\pi$-machine (program-induction machine) -- an architecture able to induce interpretable LISP-like programs from observed data traces. We propose an optimisation procedure for program learning based on backpropagation, gradient descent and A* search. We apply the proposed method to two problems: system identification of dynamical systems and explaining the behaviour of a DQN agent. Our results show that the $\pi$-machine can efficiently induce interpretable programs from individual data traces.</li>
<li><a href="http://arxiv.org/abs/1612.06572">Unsupervised Dialogue Act Induction using Gaussian Mixtures</a>: This paper introduces a new unsupervised approach for dialogue act induction. Given the sequence of dialogue utterances, the task is to assign them the labels representing their function in the dialogue. Utterances are represented as real-valued vectors encoding their meaning. We model the dialogue as a Hidden Markov model with emission probabilities estimated by Gaussian mixtures. We use Gibbs sampling for posterior inference. We present the results on the standard Switchboard-DAMSL corpus. Our algorithm achieves promising results compared with strong supervised baselines and outperforms other unsupervised algorithms.</li>
<li><a href="http://arxiv.org/abs/1612.02699">Deep Supervision with Shape Concepts for Occlusion-Aware 3D Object   Parsing</a>: Monocular 3D object parsing is highly desirable in various scenarios including occlusion reasoning and holistic scene interpretation. We present a deep convolutional neural network (CNN) architecture to localize semantic parts in 2D image and 3D space while inferring their visibility states, given a single RGB image. Our key insight is to exploit domain knowledge to regularize the network by deeply supervising its hidden layers, in order to sequentially infer intermediate concepts associated with the final task. To acquire training data in desired quantities with ground truth 3D shape and relevant concepts, we render 3D object CAD models to generate large-scale synthetic data and simulate challenging occlusion configurations between objects. We train the network only on synthetic data and demonstrate state-of-the-art performances on real image benchmarks including an extended version of KITTI, PASCAL VOC, PASCAL3D+ and IKEA for 2D and 3D keypoint localization and instance segmentation. The empirical results substantiate the utility of our deep supervision scheme by demonstrating effective transfer of knowledge from synthetic data to real images, resulting in less overfitting compared to standard end-to-end training.</li>
<li><a href="http://arxiv.org/abs/1709.09578">Neural networks for topology optimization</a>: In this research, we propose a deep learning based approach for speeding up topology optimization methods. The problem we seek to solve is the layout problem. The main novelty of this work is to state the problem as an image segmentation task. We leverage the power of deep learning methods as an efficient pixel-wise image labeling technique to perform the topology optimization. We introduce a convolutional encoder-decoder architecture and the overall approach for solving the above-described problem with high performance. The conducted experiments demonstrate a significant acceleration of the optimization process. The proposed approach has excellent generalization properties. We demonstrate that the proposed model can be applied to other problems. The successful results, as well as the drawbacks of the current method, are discussed.</li>
<li><a href="http://arxiv.org/abs/1707.09700">Scene Graph Generation from Objects, Phrases and Caption Regions</a>: Object detection, scene graph generation and region captioning, which are three scene understanding tasks at different semantic levels, are tied together: scene graphs are generated on top of objects detected in an image with their pairwise relationships predicted, while region captioning gives a language description of the objects, their attributes, relations, and other context information. In this work, to leverage the mutual connections across semantic levels, we propose a novel neural network model, termed Multi-level Scene Description Network (denoted as MSDN), to solve the three vision tasks jointly in an end-to-end manner. Objects, phrases, and caption regions are first aligned with a dynamic graph based on their spatial and semantic connections. Then a feature refining structure is used to pass messages across the three levels of semantic tasks through the graph. We benchmark the learned model on three tasks, and show that joint learning across the three tasks with our proposed method can bring mutual improvements over previous models. In particular, on the scene graph generation task, our proposed method outperforms the state-of-the-art method by a margin of more than 3%.</li>
<li><a href="http://arxiv.org/abs/1709.07109">Deconvolutional Latent-Variable Model for Text Sequence Matching</a>: A latent-variable model is introduced for text matching, inferring sentence representations by jointly optimizing generative and discriminative objectives. To alleviate typical optimization challenges in latent-variable models for text, we employ deconvolutional networks as the sequence decoder (generator), providing learned latent codes with more semantic information and better generalization. Our model, trained in an unsupervised manner, yields stronger empirical predictive performance than a decoder based on Long Short-Term Memory (LSTM), with fewer parameters and considerably faster training. Further, we apply it to text sequence-matching problems. The proposed model significantly outperforms several strong sentence-encoding baselines, especially in the semi-supervised setting.</li>
<li><a href="http://arxiv.org/abs/1708.05821">Analysing Soccer Games with Clustering and Conceptors</a>: We present a new approach for identifying situations and behaviours, which we call "moves", from soccer games in the 2D simulation league. Being able to identify key situations and behaviours is a useful capability for analysing soccer matches, anticipating opponent behaviours to aid selection of appropriate tactics, and also a prerequisite for automatic learning of behaviours and policies. To support a wide set of strategies, our goal is to identify situations from data, in an unsupervised way without making use of pre-defined soccer-specific concepts such as "pass" or "dribble". The recurrent neural networks we use in our approach act as a high-dimensional projection of the recent history of a situation on the field. Similar situations, i.e., with similar histories, are found by clustering of network states. The same networks are also used to learn so-called conceptors, which are lower-dimensional manifolds that describe trajectories through a high-dimensional state space and enable situation-specific predictions from the same neural network. With the proposed approach, we can segment games into sequences of situations that are learnt in an unsupervised way, and learn conceptors that are useful for the prediction of the near future of the respective situation.</li>
<li><a href="http://arxiv.org/abs/1705.09451">Algorithmic clothing: hybrid recommendation, from street-style-to-shop</a>: In this paper we detail Cortexica's (https://www.cortexica.com) recommendation framework -- particularly, we describe how a hybrid visual recommender system can be created by combining conditional random fields for segmentation and deep neural networks for object localisation and feature representation. The recommendation system that is built after localisation, segmentation and classification has two properties -- first, it is knowledge-based in the sense that it learns a pairwise preference/occurrence matrix by utilising knowledge from experts (images from fashion blogs) and second, it is content-based as it utilises a deep learning based framework for learning feature representation. Such a construct is especially useful when there is a scarcity of user preference data, which forms the foundation of many collaborative recommendation algorithms.</li>
<li><a href="http://arxiv.org/abs/1612.06508">Deeply Aggregated Alternating Minimization for Image Restoration</a>: Regularization-based image restoration has remained an active research topic in computer vision and image processing. It often leverages a guidance signal captured in different fields as an additional cue. In this work, we present a general framework for image restoration, called deeply aggregated alternating minimization (DeepAM). We propose to train a deep neural network to advance two of the steps in the conventional AM algorithm: proximal mapping and β-continuation. Both steps are learned from a large dataset in an end-to-end manner. The proposed framework enables the convolutional neural networks (CNNs) to operate as a prior or regularizer in the AM algorithm. We show that our learned regularizer via deep aggregation outperforms the recent data-driven approaches as well as the nonlocal-based methods. The flexibility and effectiveness of our framework are demonstrated in several image restoration tasks, including single image denoising, RGB-NIR restoration, and depth super-resolution.</li>
<li><a href="http://arxiv.org/abs/1707.06005">Detecting Parts for Action Localization</a>: In this paper, we propose a new framework for action localization that tracks people in videos and extracts full-body human tubes, i.e., spatio-temporal regions localizing actions, even in the case of occlusions or truncations. This is achieved by training a novel human part detector that scores visible parts while regressing full-body bounding boxes. The core of our method is a convolutional neural network which learns part proposals specific to certain body parts. These are then combined to detect people robustly in each frame. Our tracking algorithm connects the image detections temporally to extract full-body human tubes. We apply our new tube extraction method on the problem of human action localization, on the popular JHMDB dataset, and a very recent challenging dataset DALY (Daily Action Localization in YouTube), showing state-of-the-art results.</li>
<li><a href="http://arxiv.org/abs/1707.06355">Video Question Answering via Attribute-Augmented Attention Network Learning</a>: Video Question Answering is a challenging problem in visual information retrieval, which provides the answer to the referenced video content according to the question. However, the existing visual question answering approaches mainly tackle questions about static images and may be ineffective for video question answering due to insufficient modeling of the temporal dynamics of video contents. In this paper, we study the problem of video question answering by modeling its temporal dynamics with a frame-level attention mechanism. We propose the attribute-augmented attention network learning framework that enables joint frame-level attribute detection and unified video representation learning for video question answering. We then incorporate a multi-step reasoning process into our proposed attention network to further improve the performance. We construct a large-scale video question answering dataset. We conduct experiments on both multiple-choice and open-ended video question answering tasks to show the effectiveness of the proposed method.</li>
<li><a href="http://arxiv.org/abs/1707.05840">Multiscale Residual Mixture of PCA: Dynamic Dictionaries for Optimal Basis Learning</a>: In this paper we are interested in the problem of learning an over-complete basis and a methodology such that the reconstruction or inverse problem does not need optimization. We analyze the optimality of the presented approaches and their link to popular known techniques such as Artificial Neural Networks, k-means or Oja's learning rule. Finally, we will see that one approach to reach the optimal dictionary is a factorial and hierarchical approach. The derived approach leads to a formulation of a Deep Oja Network. We present results on different tasks and present the resulting very efficient learning algorithm, which brings a new vision on the training of deep nets. Finally, the theoretical work shows that deep frameworks are one way to efficiently obtain an over-complete (combinatorially large) dictionary while still allowing easy reconstruction. We thus present the Deep Residual Oja Network (DRON). We demonstrate that a recursive deep approach working on the residuals allows an exponential decrease of the error w.r.t. the depth.</li>
<li><a href="http://arxiv.org/abs/1710.01217">Wide and deep volumetric residual networks for volumetric image classification</a>: 3D shape models that directly classify objects from 3D information have become more widely implementable. Current state-of-the-art models rely on deep convolutional and inception models that are resource intensive. Residual neural networks have been demonstrated to be easier to optimize and do not suffer from the vanishing/exploding gradients observed in deep networks. Here we implement a residual neural network for 3D object classification of the 3D Princeton ModelNet dataset. Further, we show that widening network layers dramatically improves accuracy in shallow residual nets, and that residual neural networks perform comparably to state-of-the-art 3D shape net models. We provide extensive training and architecture parameters, giving a better understanding of available network architectures for use in 3D object classification.</li>
<li><a href="http://arxiv.org/abs/1612.07796">First-Person Forecasting with Online Inverse Reinforcement Learning</a>: We address the problem of incrementally modeling and forecasting long-term goals of a first-person camera wearer: what the user will do, where they will go, and what goal they are attempting to reach. In contrast to prior work in trajectory forecasting, our algorithm, DARKO, goes further to reason about semantic states (will I pick up an object?), and future goal states that are far both in terms of space and time. DARKO learns and forecasts from first-person visual observations of the user's daily behaviors via an Online Inverse Reinforcement Learning (IRL) approach. Classical IRL discovers only the rewards in a batch setting, whereas DARKO discovers the states, transitions, rewards, and goals of a user from streaming data. Among other results, we show DARKO forecasts goals better than competing methods in both noisy and ideal settings, and our approach is theoretically and empirically no-regret.</li>
<li><a href="http://arxiv.org/abs/1705.09882">Person Depth ReID: Robust Person Re-identification with Commodity Depth   Sensors</a>: This work targets person re-identification (ReID) from depth sensors such as Kinect. Since depth is invariant to illumination and less sensitive than color to day-by-day appearance changes, a natural question is whether depth is an effective modality for Person ReID, especially in scenarios where individuals wear different colored clothes or over a period of several months. We explore the use of recurrent Deep Neural Networks for learning high-level shape information from low-resolution depth images. In order to tackle the small sample size problem, we introduce regularization and a hard temporal attention unit. The whole model can be trained end to end with a hybrid supervised loss. We carry out a thorough experimental evaluation of the proposed method on three person re-identification datasets, which include side views, views from the top and sequences with varying degree of partial occlusion, pose and viewpoint variations. To that end, we introduce a new dataset with RGB-D and skeleton data. In a scenario where subjects are recorded after three months with new clothes, we demonstrate large performance gains attained using Depth ReID compared to a state-of-the-art Color ReID. Finally, we show further improvements using the temporal attention unit in multi-shot setting.</li>
<li><a href="http://arxiv.org/abs/1708.06850">Learning Deep Neural Network Representations for Koopman Operators of   Nonlinear Dynamical Systems</a>: The Koopman operator has recently garnered much attention for its value in dynamical systems analysis and data-driven model discovery. However, its application has been hindered by the computational complexity of extended dynamic mode decomposition; this requires a combinatorially large basis set to adequately describe many nonlinear systems of interest, e.g. cyber-physical infrastructure systems, biological networks, social systems, and fluid dynamics. Often the dictionaries generated for these problems are manually curated, requiring domain-specific knowledge and painstaking tuning. In this paper we introduce a deep learning framework for learning Koopman operators of nonlinear dynamical systems. We show that this novel method automatically selects efficient deep dictionaries, outperforming state-of-the-art methods. We benchmark this method on partially observed nonlinear systems, including the glycolytic oscillator and show it is able to predict quantitatively 100 steps into the future, using only a single timepoint, and qualitative oscillatory behavior 400 steps into the future.</li>
<li><a href="http://arxiv.org/abs/1709.08828">Towards End-to-End Car License Plates Detection and Recognition with   Deep Neural Networks</a>: In this work, we tackle the problem of car license plate detection and recognition in natural scene images. We propose a unified deep neural network which can localize license plates and recognize the letters simultaneously in a single forward pass. The whole network can be trained end-to-end. In contrast to existing approaches which take license plate detection and recognition as two separate tasks and settle them step by step, our method jointly solves these two tasks by a single network. It not only avoids intermediate error accumulation, but also accelerates the processing speed. For performance evaluation, three datasets including images captured from various scenes under different conditions are tested. Extensive experiments show the effectiveness and efficiency of our proposed approach.</li>
<li><a href="http://arxiv.org/abs/1702.08503">SGD Learns the Conjugate Kernel Class of the Network</a>: We show that the standard stochastic gradient descent (SGD) algorithm is guaranteed to learn, in polynomial time, a function that is competitive with the best function in the conjugate kernel space, as defined in Daniely, Frostig and Singer (2016). The result holds for log-depth networks from a rich family of architectures. To the best of our knowledge, it is the first polynomial-time guarantee for the standard neural network learning algorithm for networks of depth $\ge 3$.</li>
<li><a href="http://arxiv.org/abs/1708.00999">Extreme Low Resolution Activity Recognition with Multi-Siamese Embedding Learning</a>: This paper presents an approach for recognition of human activities from extreme low resolution (e.g., 16x12) videos. Extreme low resolution recognition is not only necessary for analyzing actions at a distance but is also crucial for enabling privacy-preserving recognition of human activities. We propose a new approach to learn an embedding (i.e., representation) optimized for low resolution (LR) videos by taking advantage of their inherent property: two images originating from the exact same scene often have totally different pixel (i.e., RGB) values depending on their LR transformations. We designed a new two-stream multi-Siamese convolutional neural network that learns the embedding space to be shared by low resolution videos created with different LR transforms, thereby enabling learning of transform-robust activity classifiers. We experimentally confirm that our approach of jointly learning the optimal LR video representation and the classifier outperforms the previous state-of-the-art low resolution recognition approaches on two public standard datasets by a meaningful margin.</li>
<li><a href="http://arxiv.org/abs/1707.06783">3DCNN-DQN-RNN: A Deep Reinforcement Learning Framework for Semantic Parsing of Large-scale 3D Point Clouds</a>: Semantic parsing of large-scale 3D point clouds is an important research topic in computer vision and remote sensing fields. Most existing approaches utilize hand-crafted features for each modality independently and combine them in a heuristic manner. They often fail to adequately consider the consistency and complementary information among features, which makes it difficult for them to capture high-level semantic structures. The features learned by most of the current deep learning methods can obtain high-quality image classification results. However, these methods are hard to apply to 3D point cloud recognition due to the unorganized distribution and varying point density of the data. In this paper, we propose a 3DCNN-DQN-RNN method which fuses the 3D convolutional neural network (CNN), Deep Q-Network (DQN) and Residual recurrent neural network (RNN) for efficient semantic parsing of large-scale 3D point clouds. In our method, an eye window under control of the 3D CNN and DQN can localize and segment the points of the object class efficiently. The 3D CNN and Residual RNN further extract robust and discriminative features of the points in the eye window, and thus greatly enhance the parsing accuracy of large-scale point clouds. Our method provides an automatic process that maps the raw data to the classification results. It also integrates object localization, segmentation and classification into one framework. Experimental results demonstrate that the proposed method outperforms the state-of-the-art point cloud classification methods.</li>
<li><a href="http://arxiv.org/abs/1707.07413">Exploring Neural Transducers for End-to-End Speech Recognition</a>: In this work, we perform an empirical comparison among the CTC, RNN-Transducer, and attention-based Seq2Seq models for end-to-end speech recognition. We show that, without any language model, Seq2Seq and RNN-Transducer models both outperform the best reported CTC models with a language model, on the popular Hub5'00 benchmark. On our internal diverse dataset, these trends continue: RNN-Transducer models rescored with a language model after beam search outperform our best CTC models. These results simplify the speech recognition pipeline so that decoding can now be expressed purely as neural network operations. We also study how the choice of encoder architecture affects the performance of the three models, both when all encoder layers are forward only and when encoders downsample the input representation aggressively.</li>
<li><a href="http://arxiv.org/abs/1612.06138">Boosting Neural Machine Translation</a>: Training efficiency is one of the main problems for Neural Machine Translation (NMT). Deep networks, very large data and many training iterations are necessary to achieve state-of-the-art performance for NMT. This results in very high computation cost and slows down research and industrialization. In this paper, we first investigate the instability caused by randomization in NMT training, and further propose an efficient training method based on data boosting and bootstrapping with no modifications to the neural network. Experiments show that this method can converge much faster compared with a baseline system and achieve stable improvement up to 2.36 BLEU points with 80% training cost.</li>
<li><a href="http://arxiv.org/abs/1707.09533">Curriculum Learning and Minibatch Bucketing in Neural Machine Translation</a>: We examine the effects of particular orderings of sentence pairs on the on-line training of neural machine translation (NMT). We focus on two types of such orderings: (1) ensuring that each minibatch contains sentences similar in some aspect and (2) gradual inclusion of some sentence types as the training progresses (so-called "curriculum learning"). In our English-to-Czech experiments, the internal homogeneity of minibatches has no effect on the training but some of our "curricula" achieve a small improvement over the baseline.</li>
<li><a href="http://arxiv.org/abs/1708.00577">Kernalised Multi-resolution Convnet for Visual Tracking</a>: Visual tracking is intrinsically a temporal problem. Discriminative Correlation Filters (DCF) have demonstrated excellent performance for high-speed generic visual object tracking. Building on this seminal work, there has been a plethora of recent improvements relying on a convolutional neural network (CNN) pretrained on ImageNet as a feature extractor for visual tracking. However, most of these works rely on ad hoc analysis to design the weights for different layers, using either boosting or hedging techniques as an ensemble tracker. In this paper, we go beyond the conventional DCF framework and propose a Kernalised Multi-resolution Convnet (KMC) formulation that utilises hierarchical response maps to directly output the target movement. When the learnt network is directly deployed on an unseen, challenging UAV tracking dataset without any weight adjustment, the proposed model consistently achieves excellent tracking performance. Moreover, the transferred multi-resolution CNN can be integrated into the RNN temporal learning framework, therefore opening the door to end-to-end temporal deep learning (TDL) for visual tracking.</li>
<li><a href="http://arxiv.org/abs/1705.08562">Hashing as Tie-Aware Learning to Rank</a>: We formulate the problem of supervised hashing, or learning binary embeddings of data, as a learning to rank problem. Specifically, we optimize two common ranking-based evaluation metrics, Average Precision (AP) and Normalized Discounted Cumulative Gain (NDCG). Observing that ranking with the discrete Hamming distance naturally results in ties, we propose to use tie-aware versions of ranking metrics in both the evaluation and the learning of supervised hashing. For AP and NDCG, we derive continuous relaxations of their tie-aware versions, and optimize them using stochastic gradient ascent with deep neural networks. Our results establish the new state-of-the-art for tie-aware AP and NDCG on common hashing benchmarks.</li>
<li><a href="http://arxiv.org/abs/1703.02507">Unsupervised Learning of Sentence Embeddings using Compositional n-Gram   Features</a>: The recent tremendous success of unsupervised word embeddings in a multitude of applications raises the obvious question if similar methods could be derived to improve embeddings (i.e. semantic representations) of word sequences as well. We present a simple but efficient unsupervised objective to train distributed representations of sentences. Our method outperforms the state-of-the-art unsupervised models on most benchmark tasks, and on many tasks even beats supervised models, highlighting the robustness of the produced sentence embeddings.</li>
<li><a href="http://arxiv.org/abs/1612.06141">Domain specialization: a post-training domain adaptation for Neural   Machine Translation</a>: Domain adaptation is a key feature in Machine Translation. It generally encompasses terminology, domain and style adaptation, especially for human post-editing workflows in Computer Assisted Translation (CAT). With Neural Machine Translation (NMT), we introduce a new notion of domain adaptation that we call "specialization" and which is showing promising results both in the learning speed and in adaptation accuracy. In this paper, we propose to explore this approach under several perspectives.</li>
<li><a href="http://arxiv.org/abs/1702.07405">GapTV: Accurate and Interpretable Low-Dimensional Regression and   Classification</a>: We consider the problem of estimating a regression function in the common situation where the number of features is small, where interpretability of the model is a high priority, and where simple linear or additive models fail to provide adequate performance. To address this problem, we present GapTV, an approach that is conceptually related both to CART and to the more recent CRISP algorithm, a state-of-the-art alternative method for interpretable nonlinear regression. GapTV divides the feature space into blocks of constant value and fits the value of all blocks jointly via a convex optimization routine. Our method is fully data-adaptive, in that it incorporates highly robust routines for tuning all hyperparameters automatically. We compare our approach against CART and CRISP and demonstrate that GapTV finds a much better trade-off between accuracy and interpretability.</li>
<li><a href="http://arxiv.org/abs/1708.00463">Hierarchical Subtask Discovery With Non-Negative Matrix Factorization</a>: Hierarchical reinforcement learning methods offer a powerful means of planning flexible behavior in complicated domains. However, learning an appropriate hierarchical decomposition of a domain into subtasks remains a substantial challenge. We present a novel algorithm for subtask discovery, based on the recently introduced multitask linearly-solvable Markov decision process (MLMDP) framework. The MLMDP can perform never-before-seen tasks by representing them as a linear combination of a previously learned basis set of tasks. In this setting, the subtask discovery problem can naturally be posed as finding an optimal low-rank approximation of the set of tasks the agent will face in a domain. We use non-negative matrix factorization to discover this minimal basis set of tasks, and show that the technique learns intuitive decompositions in a variety of domains. Our method has several qualitatively desirable features: it is not limited to learning subtasks with single goal states, instead learning distributed patterns of preferred states; it learns qualitatively different hierarchical decompositions in the same domain depending on the ensemble of tasks the agent will face; and it may be straightforwardly iterated to obtain deeper hierarchical decompositions.</li>
<li><a href="http://arxiv.org/abs/1612.01033">Areas of Attention for Image Captioning</a>: We propose "Areas of Attention", a novel attention-based model for automatic image caption generation. Our approach models the interplay between the state of the RNN, image region descriptors and word embedding vectors by three pairwise interactions. It allows association of caption words with local visual appearances rather than with descriptors of the entire scene. This enables better generalization to complex scenes not seen during training. Our model is agnostic to the type of attention areas, and we instantiate it using regions based on CNN activation grids, object proposals, and spatial transformer networks. Our results show that all components of our model contribute to obtaining state-of-the-art performance on the MSCOCO dataset. In addition, our results indicate that attention areas are correctly associated with meaningful latent semantic structure in the generated captions.</li>
<li><a href="http://arxiv.org/abs/1706.00884">Task-specific Word Identification from Short Texts Using a Convolutional   Neural Network</a>: Task-specific word identification aims to choose the task-related words that best describe a short text. Existing approaches require well-defined seed words or lexical dictionaries (e.g., WordNet), which are often unavailable for many applications such as social discrimination detection and fake review detection. However, we often have a set of labeled short texts where each short text has a task-related class label, e.g., discriminatory or non-discriminatory, specified by users or learned by classification algorithms. In this paper, we focus on identifying task-specific words and phrases from short texts by exploiting their class labels rather than using seed words or lexical dictionaries. We consider the task-specific word and phrase identification as feature learning. We train a convolutional neural network over a set of labeled texts and use score vectors to localize the task-specific words and phrases. Experimental results on sentiment word identification show that our approach significantly outperforms existing methods. We further conduct two case studies to show the effectiveness of our approach. One case study on a crawled tweets dataset demonstrates that our approach can successfully capture the discrimination-related words/phrases. The other case study on fake review detection shows that our approach can identify the fake-review words/phrases.</li>
<li><a href="http://arxiv.org/abs/1703.10902">Fast Predictive Multimodal Image Registration</a>: We introduce a deep encoder-decoder architecture for image deformation prediction from multimodal images. Specifically, we design an image-patch-based deep network that jointly (i) learns an image similarity measure and (ii) the relationship between image patches and deformation parameters. While our method can be applied to general image registration formulations, we focus on the Large Deformation Diffeomorphic Metric Mapping (LDDMM) registration model. By predicting the initial momentum of the shooting formulation of LDDMM, we preserve its mathematical properties and drastically reduce the computation time, compared to optimization-based approaches. Furthermore, we create a Bayesian probabilistic version of the network that allows evaluation of registration uncertainty via sampling of the network at test time. We evaluate our method on a 3D brain MRI dataset using both T1- and T2-weighted images. Our experiments show that our method generates accurate predictions and that learning the similarity measure leads to more consistent registrations than relying on generic multimodal image similarity measures, such as mutual information. Our approach is an order of magnitude faster than optimization-based LDDMM.</li>
<li><a href="http://arxiv.org/abs/1707.07432">LV-ROVER: Lexicon Verified Recognizer Output Voting Error Reduction</a>: Offline handwritten text line recognition is a hard task that requires both an efficient optical character recognizer and a language model. State-of-the-art handwriting recognition methods are based on Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) coupled with the use of linguistic knowledge. Most of the approaches proposed in the literature focus on improving one of the two components and use constraints dedicated to a database lexicon. However, state-of-the-art performance is achieved by combining multiple optical models, and possibly multiple language models, with the Recognizer Output Voting Error Reduction (ROVER) framework. So far, handwritten line recognition with ROVER has been implemented by combining only a few recognizers, because training multiple complete recognizers is hard. In this paper we propose a Lexicon Verified ROVER (LV-ROVER), which has reduced complexity compared to the original and can combine hundreds of recognizers without language models. We achieve state-of-the-art results for handwritten text line recognition on the RIMES dataset.</li>
<li><a href="http://arxiv.org/abs/1707.06065">Dynamic Layer Normalization for Adaptive Neural Acoustic Modeling in   Speech Recognition</a>: Layer normalization is a recently introduced technique for normalizing the activities of neurons in deep neural networks to improve the training speed and stability. In this paper, we introduce a new layer normalization technique called Dynamic Layer Normalization (DLN) for adaptive neural acoustic modeling in speech recognition. By dynamically generating the scaling and shifting parameters in layer normalization, DLN adapts neural acoustic models to the acoustic variability arising from various factors such as speakers, channel noises, and environments. Unlike other adaptive acoustic models, our proposed approach does not require additional adaptation data or speaker information such as i-vectors. Moreover, the model size is fixed as it dynamically generates adaptation parameters. We apply our proposed DLN to deep bidirectional LSTM acoustic models and evaluate them on two benchmark datasets for large vocabulary ASR experiments: WSJ and TED-LIUM release 2. The experimental results show that our DLN improves neural acoustic models in terms of transcription accuracy by dynamically adapting to various speakers and environments.</li>
<li><a href="http://arxiv.org/abs/1707.06719">Generalized Convolutional Neural Networks for Point Cloud Data</a>: The introduction of cheap RGB-D cameras, stereo cameras, and LIDAR devices has given the computer vision community 3D information that conventional RGB cameras cannot provide. This data is often stored as a point cloud. In this paper, we present a novel method to apply the concept of convolutional neural networks to this type of data. By creating a mapping of nearest neighbors in a dataset, and individually applying weights to spatial relationships between points, we achieve an architecture that works directly with point clouds, but closely resembles a convolutional neural net in both design and behavior. Such a method bypasses the need for extensive feature engineering, while proving to be computationally efficient and requiring few parameters.</li>
<li><a href="http://arxiv.org/abs/1707.09465">Curriculum Domain Adaptation for Semantic Segmentation of Urban Scenes</a>: During the last half decade, convolutional neural networks (CNNs) have triumphed over semantic segmentation, which is a core task of various emerging industrial applications such as autonomous driving and medical imaging. However, to train CNNs requires a huge amount of data, which is difficult to collect and laborious to annotate. Recent advances in computer graphics make it possible to train CNN models on photo-realistic synthetic data with computer-generated annotations. Despite this, the domain mismatch between the real images and the synthetic data significantly decreases the models' performance. Hence we propose a curriculum-style learning approach to minimize the domain gap in semantic segmentation. The curriculum domain adaptation solves easy tasks first in order to infer some necessary properties about the target domain; in particular, the first task is to learn global label distributions over images and local distributions over landmark superpixels. These are easy to estimate because images of urban traffic scenes have strong idiosyncrasies (e.g., the size and spatial relations of buildings, streets, cars, etc.). We then train the segmentation network in such a way that the network predictions in the target domain follow those inferred properties. In experiments, our method significantly outperforms the baselines as well as the only known existing approach to the same problem.</li>
<li><a href="http://arxiv.org/abs/1710.00870">Rethinking Feature Discrimination and Polymerization for Large-scale Recognition</a>: Features matter. How to train a deep network to acquire discriminative features across categories and polymerized features within classes has always been at the core of many computer vision tasks, especially for large-scale recognition systems where test identities are unseen during training and the number of classes could be at million scale. In this paper, we address this problem based on the simple intuition that the cosine distance of features in high-dimensional space should be close enough within one class and far away across categories. To this end, we propose the congenerous cosine (COCO) algorithm to simultaneously optimize the cosine similarity among data. It inherits the softmax property to make inter-class features discriminative and shares the idea of the class centroid in metric learning. Unlike previous work where the center is a temporal, statistical variable within one mini-batch during training, the formulated centroid is responsible for clustering inner-class features to enforce them polymerized around the network truncus. COCO is bundled with discriminative training and learned end-to-end with stable convergence. Extensive experiments on five benchmarks verify the effectiveness of our approach on both small-scale classification tasks and large-scale human recognition problems.</li>
<li><a href="http://arxiv.org/abs/1709.04549">Ignoring Distractors in the Absence of Labels: Optimal Linear Projection   to Remove False Positives During Anomaly Detection</a>: In the anomaly detection setting, the native feature embedding can be a crucial source of bias. We present a technique, Feature Omission using Context in Unsupervised Settings (FOCUS) to learn a feature mapping that is invariant to changes exemplified in training sets while retaining as much descriptive power as possible. While this method could apply to many unsupervised settings, we focus on applications in anomaly detection, where little task-labeled data is available. Our algorithm requires only non-anomalous sets of data, and does not require that the contexts in the training sets match the context of the test set. By maximizing within-set variance and minimizing between-set variance, we are able to identify and remove distracting features while retaining fidelity to the descriptiveness needed at test time. In the linear case, our formulation reduces to a generalized eigenvalue problem that can be solved quickly and applied to test sets outside the context of the training sets. This technique allows us to align technical definitions of anomaly detection with human definitions through appropriate mappings of the feature space. We demonstrate that this method is able to remove uninformative parts of the feature space for the anomaly detection setting.</li>
<li><a href="http://arxiv.org/abs/1704.02956">Surface Normals in the Wild</a>: We study the problem of single-image depth estimation for images in the wild. We collect human annotated surface normals and use them to train a neural network that directly predicts pixel-wise depth. We propose two novel loss functions for training with surface normal annotations. Experiments on NYU Depth and our own dataset demonstrate that our approach can significantly improve the quality of depth estimation in the wild.</li>
<li><a href="http://arxiv.org/abs/1703.02806">Deep Reservoir Computing Using Cellular Automata</a>: Recurrent Neural Networks (RNNs) have been a prominent concept within artificial intelligence. They are inspired by Biological Neural Networks (BNNs) and provide an intuitive and abstract representation of how BNNs work. Derived from the more generic Artificial Neural Networks (ANNs), the recurrent ones are meant to be used for temporal tasks, such as speech recognition, because they are capable of memorizing historic input. However, such networks are very time-consuming to train as a result of their inherent nature. Recently, Echo State Networks and Liquid State Machines have been proposed as possible RNN alternatives, under the name of Reservoir Computing (RC). RCs are far easier to train. In this paper, Cellular Automata (CA) are used as the reservoir and are tested on the 5-bit memory task (a well-known benchmark within the RC community). The work herein provides a method of mapping binary inputs from the task onto the automata, and a recurrent architecture for handling the sequential aspects of it. Furthermore, a layered (deep) reservoir architecture is proposed. Its performance is compared against earlier work and against its single-layer version. Results show that the single CA reservoir system yields similar results to state-of-the-art work. The system composed of two layered reservoirs shows a noticeable improvement compared to a single CA reservoir. This indicates potential for further research and provides valuable insight on how to design CA reservoir systems.</li>
<li><a href="http://arxiv.org/abs/1707.07397">Synthesizing Robust Adversarial Examples</a>: Neural networks are susceptible to adversarial examples: small, carefully-crafted perturbations can cause networks to misclassify inputs in arbitrarily chosen ways. However, some studies have shown that adversarial examples crafted following the usual methods are not tolerant to small transformations: for example, zooming in on an adversarial image can cause it to be classified correctly again. This raises the question of whether adversarial examples are a concern in practice, because many real-world systems capture images from multiple scales and perspectives. This paper shows that adversarial examples can be made robust to distributions of transformations. Our approach produces single images that are simultaneously adversarial under all transformations in a chosen distribution, showing that we cannot rely on transformations such as rescaling, translation, and rotation to protect against adversarial examples.</li>
<li><a href="http://arxiv.org/abs/1703.01396">Stacking-based Deep Neural Network: Deep Analytic Network on   Convolutional Spectral Histogram Features</a>: Stacking-based deep neural network (S-DNN), in general, denotes a deep neural network (DNN) resemblance in terms of its very deep, feedforward network architecture. The typical S-DNN aggregates a variable number of individual learning modules in series to assemble a DNN-alike alternative to the targeted object recognition tasks. This work likewise conceives an S-DNN instantiation, dubbed deep analytic network (DAN), on top of the spectral histogram (SH) features. The DAN learning principle relies on ridge regression, and some key DNN constituents, specifically, rectified linear unit, fine-tuning, and normalization. The DAN aptitude is scrutinized on three repositories of varying domains, including FERET (faces), MNIST (handwritten digits), and CIFAR10 (natural objects). The empirical results unveil that DAN escalates the SH baseline performance over a sufficiently deep layer.</li>
<li><a href="http://arxiv.org/abs/1707.06786">Head Detection with Depth Images in the Wild</a>: In wild contexts, head detection is a demanding task and a key element for many disciplines of the computer vision community, like video surveillance, Human Computer Interaction and face analysis. In this paper, we introduce a novel method for head detection that combines the classification ability of deep learning approaches with depth maps, a type of infrared-based image useful for achieving reliability under light changes or bad light conditions. Moreover, depth data are also employed to deal with one of the traditional problems in object detection tasks, i.e. the scale of the target object. Two public datasets are exploited: the first one, Pandora, is used to train the deep classifier of face or non-face images; the second one, collected by Cornell University, is used to perform a cross-dataset test during daily activities in unconstrained environments. Experimental results show that the proposed method exceeds the state-of-the-art performance of methods based only on depth images.</li>
<li><a href="http://arxiv.org/abs/1701.08835">Language Independent Single Document Image Super-Resolution using CNN for improved recognition</a>: Recognition of document images has important applications in restoring old and classical texts. The problem involves quality improvement before passing the image to a properly trained OCR to get accurate recognition of the text. Image enhancement and quality improvement constitute important steps, as subsequent recognition depends upon the quality of the input image. There are scenarios when high resolution images are not available, and our experiments show that OCR accuracy reduces significantly with decreasing spatial resolution of document images. Thus the only option is to improve the resolution of such document images. The goal is to construct a high resolution image, given a single low resolution binary image, which constitutes the problem of single image super-resolution. Most previous work in super-resolution deals with natural images, which have more information content than document images. Here, we use a Convolutional Neural Network to learn the mapping between low resolution images and the corresponding high resolution images. We experiment with different numbers of layers, parameter settings and non-linear functions to build a fast end-to-end framework for document image super-resolution. Our proposed model shows a very good PSNR improvement of about 4 dB on 75 dpi Tamil images, resulting in a 3% improvement of word level accuracy by the OCR. It takes less time than the recent sparse based natural image super-resolution technique, making it useful for real-time document recognition applications.</li>
<li><a href="http://arxiv.org/abs/1701.02620">Deep Learning for Logo Recognition</a>: In this paper we propose a method for logo recognition using deep learning. Our recognition pipeline is composed of a logo region proposal followed by a Convolutional Neural Network (CNN) specifically trained for logo classification, even if the logos are not precisely localized. Experiments are carried out on the FlickrLogos-32 database, and we evaluate the effect on recognition performance of synthetic versus real data augmentation, and image pre-processing. Moreover, we systematically investigate the benefits of different training choices such as class-balancing, sample-weighting and explicit modeling of the background class (i.e. no-logo regions). Experimental results confirm the feasibility of the proposed method, which outperforms state-of-the-art methods.</li>
<li><a href="http://arxiv.org/abs/1707.05236">Artificial Error Generation with Machine Translation and Syntactic   Patterns</a>: Shortage of available training data is holding back progress in the area of automated error detection. This paper investigates two alternative methods for artificially generating writing errors, in order to create additional resources. We propose treating error generation as a machine translation task, where grammatically correct text is translated to contain errors. In addition, we explore a system for extracting textual patterns from an annotated corpus, which can then be used to insert errors into grammatically correct sentences. Our experiments show that the inclusion of artificially generated errors significantly improves error detection accuracy on both FCE and CoNLL 2014 datasets.</li>
<li><a href="http://arxiv.org/abs/1709.03933">Hash Embeddings for Efficient Word Representations</a>: We present hash embeddings, an efficient method for representing words in a continuous vector form. A hash embedding may be seen as an interpolation between a standard word embedding and a word embedding created using a random hash function (the hashing trick). In hash embeddings each token is represented by $k$ $d$-dimensional embedding vectors and one $k$-dimensional weight vector. The final $d$-dimensional representation of the token is the product of the two. Rather than fitting the embedding vectors for each token, these are selected by the hashing trick from a shared pool of $B$ embedding vectors. Our experiments show that hash embeddings can easily deal with huge vocabularies consisting of millions of tokens. When using a hash embedding there is no need to create a dictionary before training nor to perform any kind of vocabulary pruning after training. We show that models trained using hash embeddings exhibit at least the same level of performance as models trained using regular embeddings across a wide range of tasks. Furthermore, the number of parameters needed by such an embedding is only a fraction of what is required by a regular embedding. Since standard embeddings and embeddings constructed using the hashing trick are actually just special cases of a hash embedding, hash embeddings can be considered an extension and improvement over the existing regular embedding types.</li>
<li><a href="http://arxiv.org/abs/1709.09783">A Deep Neural Network Approach To Parallel Sentence Extraction</a>: Parallel sentence extraction is a task addressing the data sparsity problem found in multilingual natural language processing applications. We propose an end-to-end deep neural network approach to detect translational equivalence between sentences in two different languages. In contrast to previous approaches, which typically rely on multiple models and various word alignment features, by leveraging continuous vector representations of sentences we remove the need for any domain-specific feature engineering. Using siamese bidirectional recurrent neural networks, our results against a strong baseline based on a state-of-the-art parallel sentence extraction system show a significant improvement in both the quality of the extracted parallel sentences and the translation performance of statistical machine translation systems. We believe this study is the first one to investigate deep learning for the parallel sentence extraction task.</li>
<li><a href="http://arxiv.org/abs/1707.06878">Unsupervised, Knowledge-Free, and Interpretable Word Sense   Disambiguation</a>: Interpretability of a predictive model is a powerful feature that gains the trust of users in the correctness of the predictions. In word sense disambiguation (WSD), knowledge-based systems tend to be much more interpretable than knowledge-free counterparts as they rely on the wealth of manually-encoded elements representing word senses, such as hypernyms, usage examples, and images. We present a WSD system that bridges the gap between these two so far disconnected groups of methods. Namely, our system, providing access to several state-of-the-art WSD models, aims to be interpretable as a knowledge-based system while it remains completely unsupervised and knowledge-free. The presented tool features a Web interface for all-word disambiguation of texts that makes the sense predictions human readable by providing interpretable word sense inventories, sense representations, and disambiguation results. We provide a public API, enabling seamless integration.</li>
<li><a href="http://arxiv.org/abs/1702.06151">Developing a comprehensive framework for multimodal feature extraction</a>: Feature extraction is a critical component of many applied data science workflows. In recent years, rapid advances in artificial intelligence and machine learning have led to an explosion of feature extraction tools and services that allow data scientists to cheaply and effectively annotate their data along a vast array of dimensions---ranging from detecting faces in images to analyzing the sentiment expressed in coherent text. Unfortunately, the proliferation of powerful feature extraction services has been mirrored by a corresponding expansion in the number of distinct interfaces to feature extraction services. In a world where nearly every new service has its own API, documentation, and/or client library, data scientists who need to combine diverse features obtained from multiple sources are often forced to write and maintain ever more elaborate feature extraction pipelines. To address this challenge, we introduce a new open-source framework for comprehensive multimodal feature extraction. Pliers is an open-source Python package that supports standardized annotation of diverse data types (video, images, audio, and text), and is expressly designed with both ease-of-use and extensibility in mind. Users can apply a wide range of pre-existing feature extraction tools to their data in just a few lines of Python code, and can also easily add their own custom extractors by writing modular classes. A graph-based API enables rapid development of complex feature extraction pipelines that output results in a single, standardized format. We describe the package's architecture, detail its major advantages over previous feature extraction toolboxes, and use a sample application to a large functional MRI dataset to illustrate how pliers can significantly reduce the time and effort required to construct sophisticated feature extraction workflows while increasing code clarity and maintainability.</li>
<li><a href="http://arxiv.org/abs/1707.09564">A PAC-Bayesian Approach to Spectrally-Normalized Margin Bounds for Neural Networks</a>: We present a generalization bound for feedforward neural networks in terms of the product of the spectral norm of the layers and the Frobenius norm of the weights. The generalization bound is derived using a PAC-Bayes analysis.</li>
<li><a href="http://arxiv.org/abs/1708.00807">Adversarial-Playground: A Visualization Suite Showing How Adversarial Examples Fool Deep Learning</a>: Recent studies have shown that attackers can force deep learning models to misclassify so-called "adversarial examples": maliciously generated images formed by making imperceptible modifications to pixel values. With growing interest in deep learning for security applications, it is important for security experts and users of machine learning to recognize how learning systems may be attacked. Due to the complex nature of deep learning, it is challenging to understand how deep models can be fooled by adversarial examples. Thus, we present a web-based visualization tool, Adversarial-Playground, to demonstrate the efficacy of common adversarial methods against a convolutional neural network (CNN) system. Adversarial-Playground is educational, modular and interactive. (1) It enables non-experts to compare examples visually and to understand why an adversarial example can fool a CNN-based image classifier. (2) It can help security experts explore more vulnerabilities of deep learning as a software module. (3) Building an interactive visualization is challenging in this domain due to the large feature space of image classification (generating adversarial examples is slow in general and visualizing images is costly). Through multiple novel design choices, our tool can provide fast and accurate responses to user requests. Empirically, we find that our client-server division strategy reduced the response time by an average of 1.5 seconds per sample. Our other innovation, a faster variant of the JSMA evasion algorithm, empirically performed twice as fast as JSMA while maintaining a comparable evasion rate. Project source code and data from our experiments are available at: https://github.com/QData/AdversarialDNN-Playground</li>
<li><a href="http://arxiv.org/abs/1702.06383">Deep Geometric Retrieval</a>: Comparing images in order to recommend items from an image-inventory is a subject of continued interest. With the scalability of deep-learning architectures, the once 'manual' job of hand-crafting features has been largely alleviated, and images can be compared according to features generated from a deep convolutional neural network. In this paper, we compare distance metrics (and divergences) to rank features generated from a neural network, for content-based image retrieval. Specifically, after modelling individual images using approximations of mixture models or sparse covariance estimators, we resort to their information-theoretic and Riemann geometric comparisons. We show that using approximations of mixture models enables us to compute a distance measure based on the Wasserstein metric that requires less effort than computationally intensive optimal transport plans; finally, an affine invariant metric is used to compare the optimal transport metric to its Riemann geometric counterpart -- we conclude that, although expensive, retrieval metrics based on Wasserstein geometry are more suitable than information-theoretic comparison of images. In short, we combine GPU scalability in learning deep feature vectors with computationally efficient metrics that we foresee being utilized in a commercial setting.</li>
<li><a href="http://arxiv.org/abs/1705.10000">Robust Online Matrix Factorization for Dynamic Background Subtraction</a>: We propose an effective online background subtraction method, which can be robustly applied to practical videos that have variations in both foreground and background. Unlike previous methods, which often model the foreground as Gaussian or Laplacian distributions, we model the foreground for each frame with a specific mixture of Gaussians (MoG) distribution, which is updated online frame by frame. In particular, our MoG model in each frame is regularized by the foreground/background knowledge learned in previous frames. This makes our online MoG model highly robust, stable and adaptive to practical foreground and background variations. The proposed model can be formulated as a concise probabilistic MAP model, which can be readily solved by the EM algorithm. We further embed an affine transformation operator into the proposed model, which can be automatically adjusted to fit a wide range of video background transformations and make the method more robust to camera movements. Using a sub-sampling technique, the proposed method can be accelerated to execute more than 250 frames per second on average, meeting the requirement of real-time background subtraction for practical video processing tasks. The superiority of the proposed method is substantiated by extensive experiments on synthetic and real videos, as compared with state-of-the-art online and offline background subtraction methods.</li>
<li><a href="http://arxiv.org/abs/1706.03142">Deep Learning for Isotropic Super-Resolution from Non-Isotropic 3D Electron Microscopy</a>: The most sophisticated existing methods to generate 3D isotropic super-resolution (SR) from non-isotropic electron microscopy (EM) are based on learned dictionaries. Unfortunately, none of the existing methods generate practically satisfying results. For 2D natural images, recently developed super-resolution methods that use deep learning have been shown to significantly outperform the previous state of the art. We have adapted one of the most successful architectures (FSRCNN) for 3D super-resolution, and compared its performance to a 3D U-Net architecture that has not been used previously to generate super-resolution. We trained both architectures on artificially downscaled isotropic ground truth from focused ion beam milling scanning EM (FIB-SEM) and tested the performance for various hyperparameter settings. Our results indicate that both architectures can successfully generate 3D isotropic super-resolution from non-isotropic EM, with the U-Net performing consistently better. We propose several promising directions for practical application.</li>
<li><a href="http://arxiv.org/abs/1612.06062">Improving Tweet Representations using Temporal and User Context</a>: In this work we propose a novel representation learning model which computes semantic representations for tweets accurately. Our model systematically exploits the chronologically adjacent tweets ('context') from users' Twitter timelines for this task. Further, we make our model user-aware so that it can do well in modeling the target tweet by exploiting the rich knowledge about the user such as the way the user writes the post and also summarizing the topics on which the user writes. We empirically demonstrate that the proposed models outperform the state-of-the-art models in predicting the user profile attributes like spouse, education and job by 19.66%, 2.27% and 2.22% respectively.</li>
<li><a href="http://arxiv.org/abs/1708.00159">Image Denoising via CNNs: An Adversarial Approach</a>: Is it possible to recover an image from its noisy version using convolutional neural networks? This is an interesting problem as convolutional layers are generally used as feature detectors for tasks like classification, segmentation and object detection. We present a new CNN architecture for blind image denoising which synergically combines three architecture components, a multi-scale feature extraction layer which helps in reducing the effect of noise on feature maps, an l_p regularizer which helps in selecting only the appropriate feature maps for the task of reconstruction, and finally a three step training approach which leverages adversarial training to give the final performance boost to the model. The proposed model shows competitive denoising performance when compared to the state-of-the-art approaches.</li>
<li><a href="http://arxiv.org/abs/1705.10574">Multi-Focus Image Fusion Via Coupled Sparse Representation and   Dictionary Learning</a>: We address the multi-focus image fusion problem, where multiple images captured with different focal settings are to be fused into an all-in-focus image of higher quality. Algorithms for this problem necessarily admit the source image characteristics along with focused and blurred feature. However, most sparsity-based approaches use a single dictionary in focused feature space to describe multi-focus images, and ignore the representations in blurred feature space. Here, we propose a multi-focus image fusion approach based on coupled sparse representation. The approach exploits the facts that (i) the patches in given training set can be sparsely represented by a couple of overcomplete dictionaries related to the focused and blurred categories of images; and (ii) merging such representations leads to a more flexible and therefore better fusion strategy than the one based on just selecting the sparsest representation in the original image estimate. By jointly learning the coupled dictionary, we enforce the similarity of sparse representations in the focused and blurred feature spaces, and then introduce a fusion approach to combine these representations for generating an all-in-focus image. We also discuss the advantages of the fusion approach based on coupled sparse representation and present an efficient algorithm for learning the coupled dictionary. Extensive experimental comparisons with state-of-the-art multi-focus image fusion algorithms validate the effectiveness of the proposed approach.</li>
<li><a href="http://arxiv.org/abs/1709.03749">Deep Mean-Shift Priors for Image Restoration</a>: In this paper we introduce a natural image prior that directly represents a Gaussian-smoothed version of the natural image distribution. We include our prior in a formulation of image restoration as a Bayes estimator that also allows us to solve noise-blind image restoration problems. We show that the gradient of our prior corresponds to the mean-shift vector on the natural image distribution. In addition, we learn the mean-shift vector field using denoising autoencoders, and use it in a gradient descent approach to perform Bayes risk minimization. We demonstrate competitive results for noise-blind deblurring, super-resolution, and demosaicing.</li>
<li><a href="http://arxiv.org/abs/1701.05549">Deep Neural Networks - A Brief History</a>: Introduction to deep neural networks and their history.</li>
<li><a href="http://arxiv.org/abs/1710.01927">Data Augmentation of Spectral Data for Convolutional Neural Network (CNN) Based Deep Chemometrics</a>: Deep learning methods are used on spectroscopic data to predict drug content in tablets from near infrared (NIR) spectra. Using convolutional neural networks (CNNs), features are extracted from the spectroscopic data. Extended multiplicative scatter correction (EMSC) and a novel spectral data augmentation method are benchmarked as preprocessing steps. The learned models perform better than or on par with hypothetical optimal partial least squares (PLS) models for all combinations of preprocessing. Data augmentation with subsequent EMSC in combination gave the best results. The deep learning model CNNs also outperform the PLS models in an extrapolation challenge created using data from a second instrument and from an analyte concentration not covered by the training data. Qualitative investigations of the CNNs' kernel activations show their resemblance to well-known data processing methods such as smoothing, slope/derivative, thresholds and spectral region selection.</li>
<li><a href="http://arxiv.org/abs/1704.02681">Pyramid Vector Quantization for Deep Learning</a>: This paper explores the use of Pyramid Vector Quantization (PVQ) to reduce the computational cost for a variety of neural networks (NNs) while, at the same time, compressing the weights that describe them. This is based on the fact that the dot product between an N dimensional vector of real numbers and an N dimensional PVQ vector can be calculated with only additions and subtractions and one multiplication. This is advantageous since tensor products, commonly used in NNs, can be re-conduced to a dot product or a set of dot products. Finally, it is stressed that any NN architecture that is based on an operation that can be re-conduced to a dot product can benefit from the techniques described here.</li>
<li><a href="http://arxiv.org/abs/1709.07220">Human Pose Estimation using Global and Local Normalization</a>: In this paper, we address the problem of estimating the positions of human joints, i.e., articulated pose estimation. Recent state-of-the-art solutions model two key issues, joint detection and spatial configuration refinement, together using convolutional neural networks. Our work mainly focuses on spatial configuration refinement by reducing variations of human poses statistically, which is motivated by the observation that the scattered distribution of the relative locations of joints (e.g., the left wrist is distributed nearly uniformly in a circular area around the left shoulder) makes the learning of convolutional spatial models hard. We present a two-stage normalization scheme, human body normalization and limb normalization, to make the distribution of the relative joint locations compact, resulting in easier learning of convolutional spatial models and more accurate pose estimation. In addition, our empirical results show that incorporating multi-scale supervision and multi-scale fusion into the joint detection network is beneficial. Experimental results demonstrate that our method consistently outperforms state-of-the-art methods on the benchmarks.</li>
<li><a href="http://arxiv.org/abs/1701.07570">Strongly Adaptive Regret Implies Optimally Dynamic Regret</a>: To cope with changing environments, recent developments in online learning have introduced the concepts of adaptive regret and dynamic regret independently. In this paper, we illustrate an intrinsic connection between these two concepts by showing that the dynamic regret can be expressed in terms of the adaptive regret and the functional variation. This observation implies that strongly adaptive algorithms can be directly leveraged to minimize the dynamic regret. As a result, we present a series of strongly adaptive algorithms whose dynamic regrets are minimax optimal for convex functions, exponentially concave functions, and strongly convex functions, respectively. To the best of our knowledge, this is the first time that such kind of dynamic regret bound is established for exponentially concave functions. Moreover, all of those adaptive algorithms do not need any prior knowledge of the functional variation, which is a significant advantage over previous specialized methods for minimizing dynamic regret.</li>
<li><a href="http://arxiv.org/abs/1703.01968">Max-value Entropy Search for Efficient Bayesian Optimization</a>: Entropy Search (ES) and Predictive Entropy Search (PES) are popular and empirically successful Bayesian Optimization techniques. Both rely on a compelling information-theoretic motivation, and maximize the information gained about the $\arg\max$ of the unknown function. Yet, both are plagued by expensive computation, e.g., for estimating entropy. We propose a new criterion, Max-value Entropy Search (MES), that instead uses the information about the maximum value. We observe that MES maintains or improves the good empirical performance of ES/PES, while tremendously lightening the computational burden. In particular, MES is much more robust to the number of samples used for computing entropy, and hence more efficient. We show relations of MES to other BO methods, and establish a regret bound. Empirical evaluations on a variety of tasks demonstrate the good performance of MES.</li>
<li><a href="http://arxiv.org/abs/1709.02576">Deep learning for undersampled MRI reconstruction</a>: This paper presents a deep learning method for faster magnetic resonance imaging (MRI) by reducing k-space data with sub-Nyquist sampling strategies and provides a rationale for why the proposed approach works well. Uniform subsampling is used in the time-consuming phase-encoding direction to capture high-resolution image information, while permitting the image-folding problem dictated by the Poisson summation formula. To deal with the localization uncertainty due to image folding, very few low-frequency k-space data are added. Training the deep learning net involves input and output images that are pairs of Fourier transforms of the subsampled and fully sampled k-space data. Numerous experiments show the remarkable performance of the proposed method; only 29% of k-space data can generate images of high quality as effectively as standard MRI reconstruction with fully sampled data.</li>
<li><a href="http://arxiv.org/abs/1612.03142">Quantifying and Predicting Image Scenicness</a>: Capturing the beauty of outdoor scenes in an image motivates many amateur and professional photographers and serves as the basis for many image sharing sites. While natural beauty is often considered a subjective property of images, in this paper, we take an objective approach and provide methods for quantifying and predicting the scenicness of an image. Using a dataset containing hundreds of thousands of outdoor images captured throughout Great Britain with crowdsourced ratings of natural beauty, we propose an approach to predict scenicness which explicitly accounts for the variance of human raters. We demonstrate that quantitative measures of scenicness can benefit semantic image understanding, content-aware image processing, and a novel application of cross-view mapping, where the sparsity of labeled ground-level images can be addressed by incorporating unlabeled aerial images in the training and prediction steps. For each application, our methods for scenicness prediction result in quantitative and qualitative improvements over baseline approaches.</li>
<li><a href="http://arxiv.org/abs/1707.07833">ssEMnet: Serial-section Electron Microscopy Image Registration using a Spatial Transformer Network with Learned Features</a>: The alignment of serial-section electron microscopy (ssEM) images is critical for efforts in neuroscience that seek to reconstruct neuronal circuits. However, each ssEM plane contains densely packed structures that vary from one section to the next, which makes matching features across images a challenge. Advances in deep learning have resulted in unprecedented performance in similar computer vision problems, but to our knowledge, they have not been successfully applied to ssEM image co-registration. In this paper, we introduce a novel deep network model that combines a spatial transformer for image deformation and a convolutional autoencoder for unsupervised feature learning for robust ssEM image alignment. This results in improved accuracy and robustness while requiring substantially less user intervention than conventional methods. We evaluate our method by comparing registration quality across several datasets.</li>
<li><a href="http://arxiv.org/abs/1709.07758">Improving Language Modelling with Noise-contrastive estimation</a>: Neural language models do not scale well when the vocabulary is large. Noise-contrastive estimation (NCE) is a sampling-based method that allows for fast learning with large vocabularies. Although NCE has shown promising performance in neural machine translation, it was considered to be an unsuccessful approach for language modelling. A sufficient investigation of the hyperparameters in the NCE-based neural language models was also missing. In this paper, we showed that NCE can be a successful approach in neural language modelling when the hyperparameters of a neural network are tuned appropriately. We introduced the 'search-then-converge' learning rate schedule for NCE and designed a heuristic that specifies how to use this schedule. The impact of the other important hyperparameters, such as the dropout rate and the weight initialisation range, was also demonstrated. We showed that appropriate tuning of NCE-based neural language models outperforms the state-of-the-art single-model methods on a popular benchmark.</li>
<li><a href="http://arxiv.org/abs/1707.03321">A deep learning architecture for temporal sleep stage classification using multivariate and multimodal time series</a>: Sleep stage classification constitutes an important preliminary exam in the diagnosis of sleep disorders and is traditionally performed by a sleep expert who assigns a sleep stage to each 30s of signal, based on the visual inspection of signals such as electroencephalograms (EEG), electrooculograms (EOG), electrocardiograms (ECG) and electromyograms (EMG). In this paper, we introduce the first end-to-end deep learning approach that performs automatic temporal sleep stage classification from multivariate and multimodal Polysomnography (PSG) signals. We build a general deep architecture which can extract information from EEG, EOG and EMG channels and pools the learnt representations into a final softmax classifier. The architecture is light enough to be distributed in time in order to learn from the temporal context of each sample, namely previous and following data segments. Our model, which is unique in its ability to learn a feature representation from multiple modalities, is compared to alternative automatic approaches based on convolutional networks or decision trees. Results obtained on 61 publicly available PSG records with up to 20 EEG channels demonstrate that our network architecture yields state-of-the-art performance. Our study reveals a number of insights on the spatio-temporal distribution of the signal of interest: a good trade-off for optimal classification performance measured with balanced accuracy is to use 6 EEG with some EOG and EMG channels. Also, exploiting one minute of data before and after each data segment to be classified offers the strongest improvement when a limited number of channels is available. Our approach aims to improve a key step in the study of sleep disorders.
Like sleep experts, our system exploits the multivariate and multimodal character of PSG signals to deliver state-of-the-art classification performance at a very low complexity cost.</li>
<li><a href="http://arxiv.org/abs/1701.05228">Recommendation under Capacity Constraints</a>: In this paper, we investigate the common scenario where every candidate item for recommendation is characterized by a maximum capacity, i.e., number of seats in a Point-of-Interest (POI) or size of an item's inventory. Despite the prevalence of the task of recommending items under capacity constraints in a variety of settings, to the best of our knowledge, none of the known recommender methods is designed to respect capacity constraints. To close this gap, we extend three state-of-the-art latent factor recommendation approaches: probabilistic matrix factorization (PMF), geographical matrix factorization (GeoMF), and Bayesian personalized ranking (BPR), to optimize for both recommendation accuracy and expected item usage that respects the capacity constraints. We introduce the useful concepts of user propensity to listen and item capacity. Our experimental results on real-world datasets, both for the domain of item recommendation and POI recommendation, highlight the benefit of our method for the setting of recommendation under capacity constraints.</li>
<li><a href="http://arxiv.org/abs/1710.03144">Island Loss for Learning Discriminative Features in Facial Expression   Recognition</a>: Over the past few years, Convolutional Neural Networks (CNNs) have shown promise on facial expression recognition. However, the performance degrades dramatically under real-world settings due to variations introduced by subtle facial appearance changes, head pose variations, illumination changes, and occlusions.   In this paper, a novel island loss is proposed to enhance the discriminative power of the deeply learned features. Specifically, the IL is designed to reduce the intra-class variations while enlarging the inter-class differences simultaneously. Experimental results on four benchmark expression databases have demonstrated that the CNN with the proposed island loss (IL-CNN) outperforms the baseline CNN models with either traditional softmax loss or the center loss and achieves comparable or better performance compared with the state-of-the-art methods for facial expression recognition.</li>
<li><a href="http://arxiv.org/abs/1707.09751">Skill2vec: Machine Learning Approaches for Determining the Relevant Skill from Job Description</a>: Unsupervised learned word embeddings have seen tremendous success in numerous Natural Language Processing (NLP) tasks in recent years. The main contribution of this paper is to develop a technique called Skill2vec, which applies machine learning techniques in recruitment to enhance the search strategy to find the candidates who possess the right skills. Skill2vec is a neural network architecture, inspired by Word2vec (developed by Mikolov et al. in 2013), that transforms a skill to a new vector space. This vector space supports calculation on skills and represents the relationships between them. We conducted an experiment using A/B testing in a recruitment company to demonstrate the effectiveness of our approach.</li>
<li><a href="http://arxiv.org/abs/1709.02896">Simultaneously Learning Neighborship and Projection Matrix for Supervised Dimensionality Reduction</a>: Explicitly or implicitly, most dimensionality reduction methods need to determine which samples are neighbors and the similarity between the neighbors in the original high-dimensional space. The projection matrix is then learned on the assumption that the neighborhood information (e.g., the similarity) is known and fixed prior to learning. However, it is difficult to precisely measure the intrinsic similarity of samples in high-dimensional space because of the curse of dimensionality. Consequently, the neighbors selected according to such similarity, and the projection matrix obtained according to such similarity and neighbors, might not be optimal in the sense of classification and generalization. To overcome these drawbacks, in this paper we propose to let the similarity and neighbors be variables and model them in low-dimensional space. Both the optimal similarity and projection matrix are obtained by minimizing a unified objective function. Nonnegative and sum-to-one constraints on the similarity are adopted. Instead of empirically setting the regularization parameter, we treat it as a variable to be optimized. It is interesting that the optimal regularization parameter is adaptive to the neighbors in low-dimensional space and has intuitive meaning. Experimental results on the YALE B, COIL-100, and MNIST datasets demonstrate the effectiveness of the proposed method.</li>
<li><a href="http://arxiv.org/abs/1709.07842">Bayesian Optimization for Parameter Tuning of the XOR Neural Network</a>: When applying Machine Learning techniques to problems, one must select model parameters to ensure that the system converges but also does not become stuck at a local minimum of the objective function. Tuning these parameters becomes a non-trivial task for large models and it is not always apparent if the user has found the optimal parameters. We aim to automate the process of tuning a Neural Network (where only a limited number of parameter search attempts are available) by implementing Bayesian Optimization. In particular, by assigning Gaussian Process Priors to the parameter space, we utilize Bayesian Optimization to tune an Artificial Neural Network used to learn the XOR function, with the result of achieving higher prediction accuracy.</li>
<li><a href="http://arxiv.org/abs/1708.00909">Machine learning for neural decoding</a>: While machine learning tools have been rapidly advancing, the majority of neural decoding approaches still use last century's methods. Improving the performance of neural decoding algorithms allows us to better understand what information is contained in the brain, and can help advance engineering applications such as brain machine interfaces. Here, we apply modern machine learning techniques, including neural networks and gradient boosting, to decode from spiking activity in 1) motor cortex, 2) somatosensory cortex, and 3) hippocampus. We compare the predictive ability of these modern methods with traditional decoding methods such as Wiener and Kalman filters. Modern methods, in particular neural networks and ensembles, significantly outperformed the traditional approaches. For instance, for all of the three brain areas, an LSTM decoder explained over 40% of the unexplained variance from a Wiener filter. These results suggest that modern machine learning techniques should become the standard methodology for neural decoding. We provide code to facilitate wider implementation of these methods.</li>
<li><a href="http://arxiv.org/abs/1701.03329">A Data-Oriented Model of Literary Language</a>: We consider the task of predicting how literary a text is, with a gold standard from human ratings. Aside from a standard bigram baseline, we apply rich syntactic tree fragments, mined from the training set, and a series of hand-picked features. Our model is the first to distinguish degrees of highly and less literary novels using a variety of lexical and syntactic features, and explains 76.0% of the variation in literary ratings.</li>
<li><a href="http://arxiv.org/abs/1703.02992">A Manifold Approach to Learning Mutually Orthogonal Subspaces</a>: Although many machine learning algorithms involve learning subspaces with particular characteristics, optimizing a parameter matrix that is constrained to represent a subspace can be challenging. One solution is to use Riemannian optimization methods that enforce such constraints implicitly, leveraging the fact that the feasible parameter values form a manifold. While Riemannian methods exist for some specific problems, such as learning a single subspace, there are more general subspace constraints that offer additional flexibility when setting up an optimization problem, but have not been formulated as a manifold.   We propose the partitioned subspace (PS) manifold for optimizing matrices that are constrained to represent one or more subspaces. Each point on the manifold defines a partitioning of the input space into mutually orthogonal subspaces, where the number of partitions and their sizes are defined by the user. As a result, distinct groups of features can be learned by defining different objective functions for each partition. We illustrate the properties of the manifold through experiments on multiple dataset analysis and domain adaptation.</li>
<li><a href="http://arxiv.org/abs/1612.01213">Deep Metric Learning via Facility Location</a>: Learning the representation and the similarity metric in an end-to-end fashion with deep networks has demonstrated outstanding results for clustering and retrieval. However, these recent approaches still suffer from the performance degradation stemming from the local metric training procedure, which is unaware of the global structure of the embedding space. We propose a global metric learning scheme for optimizing the deep metric embedding with the learnable clustering function and the clustering metric (NMI) in a novel structured prediction framework. Our experiments on CUB200-2011, Cars196, and Stanford online products datasets show state-of-the-art performance on both the clustering and retrieval tasks, measured in the NMI and Recall@K evaluation metrics.</li>
<li><a href="http://arxiv.org/abs/1701.00449">Retrieving Similar X-Ray Images from Big Image Data Using Radon Barcodes with Single Projections</a>: The idea of Radon barcodes (RBC) has been introduced recently. In this paper, we propose a content-based image retrieval approach for big datasets based on Radon barcodes. Our method (Single Projection Radon Barcode, or SP-RBC) uses only a few Radon single projections for each image as global features that can serve as a basis for weak learners. This is our most important contribution in this work, which improves the results of the RBC considerably. As a matter of fact, only one projection of an image, as short as a single SURF feature vector, can already achieve acceptable results. Nevertheless, using multiple projections in a long vector will not deliver anticipated improvements. To exploit the information inherent in each projection, our method uses the outcome of each projection separately and then applies a more precise local search on the small subset of retrieved images. We have tested our method using the IRMA 2009 dataset with 14,400 x-ray images as part of the ImageCLEF initiative. Our approach leads to a substantial decrease in the error rate in comparison with other non-learning methods.</li>
<li><a href="http://arxiv.org/abs/1708.00598">Controllable Generative Adversarial Network</a>: Although only recently introduced, the generative adversarial network (GAN) has shown many promising results in generating realistic samples in the last few years. However, it is hard to control the generated samples, since the input variables for a generator are drawn from a random distribution. Some attempts have been made to control generated samples from GAN, but they have shown moderate results. Furthermore, it is hardly possible to control the generator to concentrate on reality or distinctness. For example, with existing models, a generator for face image generation cannot be set to concentrate on one of the two objectives, i.e., generating realistic faces and generating different faces according to input labels. Here, we propose the controllable GAN (CGAN). CGAN shows strong performance in controlling generated samples; in addition, it can control the generator to concentrate on reality or distinctness. In this paper, CGAN is evaluated with the CelebA dataset. We believe that CGAN can contribute to the research in generative neural network models.</li>
<li><a href="http://arxiv.org/abs/1708.02918">The Tensor Memory Hypothesis</a>: We discuss memory models which are based on tensor decompositions using latent representations of entities and events. We show how episodic memory and semantic memory can be realized and discuss how new memory traces can be generated from sensory input: Existing memories are the basis for perception and new memories are generated via perception. We relate our mathematical approach to the hippocampal memory indexing theory. We describe the first mathematical memory models that are truly declarative by generating explicit semantic triples describing both memory content and sensory inputs. Our main hypothesis is that perception includes an active semantic decoding process, which relies on latent representations of entities and predicates, and that episodic and semantic memories depend on the same decoding process.</li>
<li><a href="http://arxiv.org/abs/1709.07368">Multi-label Pixelwise Classification for Reconstruction of Large-scale   Urban Areas</a>: Object classification is one of the many holy grails in computer vision and as such has resulted in a very large number of algorithms being proposed already. Specifically in recent years there has been considerable progress in this area primarily due to the increased efficiency and accessibility of deep learning techniques. In fact, for single-label object classification [i.e. only one object present in the image] the state-of-the-art techniques employ deep neural networks and are reporting very close to human-like performance. There are specialized applications in which single-label object-level classification will not suffice; for example in cases where the image contains multiple intertwined objects of different labels.   In this paper, we address the complex problem of multi-label pixelwise classification. We present our distinct solution based on a convolutional neural network (CNN) for performing multi-label pixelwise classification and its application to large-scale urban reconstruction. A supervised learning approach is followed for training a 13-layer CNN using both LiDAR and satellite images. An empirical study has been conducted to determine the hyperparameters which result in the optimal performance of the CNN. Scale invariance is introduced by training the network on five different scales of the input and labeled data. This results in six pixelwise classifications for each different scale. An SVM is then trained to map the six pixelwise classifications into a single-label. Lastly, we refine boundary pixel labels using graph-cuts for maximum a-posteriori (MAP) estimation with Markov Random Field (MRF) priors. The resulting pixelwise classification is then used to accurately extract and reconstruct the buildings in large-scale urban areas. 
The proposed approach has been extensively tested and the results are reported.</li>
<li><a href="http://arxiv.org/abs/1707.06990">Memory-Efficient Implementation of DenseNets</a>: The DenseNet architecture is highly computationally efficient as a result of feature reuse. However, a naive DenseNet implementation can require a significant amount of GPU memory: If not properly managed, pre-activation batch normalization and contiguous convolution operations can produce feature maps that grow quadratically with network depth. In this technical report, we introduce strategies to reduce the memory consumption of DenseNets during training. By strategically using shared memory allocations, we reduce the memory cost for storing feature maps from quadratic to linear. Without the GPU memory bottleneck, it is now possible to train extremely deep DenseNets. Networks with 14M parameters can be trained on a single GPU, up from 4M. A 264-layer DenseNet (73M parameters), which previously would have been infeasible to train, can now be trained on a single workstation with 8 NVIDIA Tesla M40 GPUs. On the ImageNet ILSVRC classification dataset, this large DenseNet obtains a state-of-the-art single-crop top-1 error of 20.26%.</li>
<li><a href="http://arxiv.org/abs/1709.06750">SegFlow: Joint Learning for Video Object Segmentation and Optical Flow</a>: This paper proposes an end-to-end trainable network, SegFlow, for simultaneously predicting pixel-wise object segmentation and optical flow in videos. The proposed SegFlow has two branches where useful information of object segmentation and optical flow is propagated bidirectionally in a unified framework. The segmentation branch is based on a fully convolutional network, which has been proved effective in image segmentation task, and the optical flow branch takes advantage of the FlowNet model. The unified framework is trained iteratively offline to learn a generic notion, and fine-tuned online for specific objects. Extensive experiments on both the video object segmentation and optical flow datasets demonstrate that introducing optical flow improves the performance of segmentation and vice versa, against the state-of-the-art algorithms.</li>
<li><a href="http://arxiv.org/abs/1707.09875">SAR Target Recognition Using the Multi-aspect-aware Bidirectional LSTM Recurrent Neural Networks</a>: The outstanding pattern recognition performance of deep learning brings new vitality to synthetic aperture radar (SAR) automatic target recognition (ATR). However, current deep learning based ATR solutions are limited in that each learning process handles only one SAR image, namely learning the static scattering information, while missing the space-varying information. Multi-aspect joint recognition, which introduces space-varying scattering information, should improve the classification accuracy and robustness. In this paper, a novel multi-aspect-aware method is proposed to achieve this idea through space-varying scattering information learning based on bidirectional Long Short-Term Memory (LSTM) recurrent neural networks. Specifically, we first select different aspect images to generate the multi-aspect space-varying image sequences. Then, the Gabor filter and three-patch local binary pattern (TPLBP) are progressively implemented to extract comprehensive spatial features, followed by dimensionality reduction with the Multi-layer Perceptron (MLP) network. Finally, we design a bidirectional LSTM recurrent neural network to learn the multi-aspect features, further integrating a softmax classifier to achieve target recognition. Experimental results demonstrate that the proposed method can achieve 99.9% accuracy for 10-class recognition. Besides, its anti-noise and anti-confusion performance is also better than that of conventional deep learning based methods.</li>
<li><a href="http://arxiv.org/abs/1708.01986">Identifying 3 moss species by deep learning, using the "chopped picture" method</a>: In general, object identification tends not to work well on ambiguous, amorphous objects such as vegetation. In this study, we developed a simple but effective approach to identify ambiguous objects and applied the method to several moss species. As a result, the model correctly classified test images with more than 90% accuracy. Using this approach will help progress in computer vision studies.</li>
<li><a href="http://arxiv.org/abs/1707.09605">CNN-based Cascaded Multi-task Learning of High-level Prior and Density   Estimation for Crowd Counting</a>: Estimating crowd count in densely crowded scenes is an extremely challenging task due to non-uniform scale variations. In this paper, we propose a novel end-to-end cascaded network of CNNs to jointly learn crowd count classification and density map estimation. Classifying crowd count into various groups is tantamount to coarsely estimating the total count in the image thereby incorporating a high-level prior into the density estimation network. This enables the layers in the network to learn globally relevant discriminative features which aid in estimating highly refined density maps with lower count error. The joint training is performed in an end-to-end fashion. Extensive experiments on highly challenging publicly available datasets show that the proposed method achieves lower count error and better quality density maps as compared to the recent state-of-the-art methods.</li>
<li><a href="http://arxiv.org/abs/1707.06183">Domain-adversarial neural networks to address the appearance variability of histopathology images</a>: Preparing and scanning histopathology slides consists of several steps, each with a multitude of parameters. The parameters can vary between pathology labs and within the same lab over time, resulting in significant variability of the tissue appearance that hampers the generalization of automatic image analysis methods. Typically, this is addressed with ad-hoc approaches such as staining normalization that aim to reduce the appearance variability. In this paper, we propose a systematic solution based on domain-adversarial neural networks. We hypothesize that removing the domain information from the model representation leads to better generalization. We tested our hypothesis for the problem of mitosis detection in breast cancer histopathology images and made a comparative analysis with two other approaches. We show that combining color augmentation with domain-adversarial training is a better alternative than standard approaches to improve the generalization of deep learning methods.</li>
<li><a href="http://arxiv.org/abs/1706.00066">Descriptions of Objectives and Processes of Mechanical Learning</a>: In [1], we introduced mechanical learning and proposed 2 approaches to mechanical learning. Here, we follow one such approach to describe the objects and the processes of learning in detail. We discuss 2 kinds of patterns: objective and subjective patterns. Subjective patterns are crucial for a learning machine. We prove that for any objective pattern we can find a proper subjective pattern, based upon the least base patterns, that expresses the objective pattern well. An X-form is an algebraic expression for a subjective pattern. A collection of X-forms forms the internal representation space, which is the center of a learning machine. We discuss learning by teaching and without teaching. We define data sufficiency by X-form. We then discuss some learning strategies. We show that, in each strategy, with sufficient data and with certain capabilities, a learning machine can indeed learn any pattern (universal learning machine). In the appendix, with knowledge of the learning machine, we try to view deep learning from a different angle, i.e. its internal representation space and its learning dynamics.</li>
<li><a href="http://arxiv.org/abs/1707.03553">Aerial Vehicle Tracking by Adaptive Fusion of Hyperspectral Likelihood Maps</a>: Hyperspectral cameras can provide unique spectral signatures for consistently distinguishing materials that can be used to solve surveillance tasks. In this paper, we propose a novel real-time hyperspectral likelihood maps-aided tracking method (HLT) inspired by an adaptive hyperspectral sensor. A moving object tracking system generally consists of registration, object detection, and tracking modules. We focus on the target detection part and remove the necessity to build any offline classifiers and tune a large number of hyperparameters, instead learning a generative target model in an online manner for hyperspectral channels ranging from visible to infrared wavelengths. The key idea is that our adaptive fusion method can combine likelihood maps from multiple bands of hyperspectral imagery into a single, more distinctive representation, increasing the margin between the mean values of foreground and background pixels in the fused map. Experimental results show that the HLT not only outperforms all established fusion methods but is on par with the current state-of-the-art hyperspectral target tracking frameworks.</li>
<li><a href="http://arxiv.org/abs/1705.09476">Learning Robust Features with Incremental Auto-Encoders</a>: Automatically learning features, especially robust features, has attracted much attention in the machine learning community. In this paper, we propose a new method to learn non-linear robust features by taking advantage of the data manifold structure. We first follow the commonly used trick of the trade, that is, learning robust features with artificially corrupted data, which are training samples with manually injected noise. Following the idea of the auto-encoder, we first assume features should contain enough information to reconstruct the input well from its corrupted copies. However, merely reconstructing clean input from its noisy copies could make the data manifold in the feature space noisy. To address this problem, we propose a new method, called Incremental Auto-Encoders, to iteratively denoise the extracted features. We assume the noisy manifold structure is caused by a diffusion process. Consequently, we reverse this specific diffusion process to further contract this noisy manifold, which results in an incremental optimization of model parameters. Furthermore, we show these learned non-linear features can be stacked into a hierarchy of features. Experimental results on real-world datasets demonstrate the proposed method can achieve better classification performance.</li>
<li><a href="http://arxiv.org/abs/1709.04344">Flexible Network Binarization with Layer-wise Priority</a>: How to effectively approximate real-valued parameters with binary codes plays a central role in neural network binarization. In this work, we reveal an important fact: binarizing different layers has widely varying effects on the compression ratio of the network and the loss of performance. Based on this fact, we propose a novel and flexible neural network binarization method by introducing the concept of layer-wise priority, which binarizes parameters in inverse order of their layer depth. In each training step, our method selects a specific network layer, minimizes the discrepancy between the original real-valued weights and their binary approximations, and fine-tunes the whole network accordingly. While iterating this process, we can flexibly decide whether or not to binarize the remaining floating-point layers and explore a trade-off between the loss of performance and the compression ratio of the model. The resulting binary network is applied to efficient pedestrian detection. Extensive experimental results on several benchmarks show that, under the same compression ratio, our method achieves a much lower miss rate and faster detection speed than the state-of-the-art neural network binarization method.</li>
</ul>
</body>
</html>