NIPS 2018 Abstracts

Unsupervisedly Learned Latent Graphs as Transferable Representations

Modern deep transfer learning approaches have mainly focused on learning \emph{generic} feature vectors from one task that are transferable to other tasks, such as word embeddings in language and pretrained convolutional features in vision. However, these approaches usually transfer unary features and largely ignore more structured graphical representations. This work explores the possibility of learning generic latent graphs that capture dependencies between pairs of data units (e.g., words or pixels) from large-scale unlabeled data and transferring the graphs to downstream tasks. Our proposed transfer learning framework improves performance on various tasks including question answering, natural language inference, sentiment analysis, and image classification. We also show that the learned graphs are generic enough to be transferred to different types of input embeddings or features.

Sequential Attend, Infer, Repeat: Generative Modelling of Moving Objects

We present Sequential Attend, Infer, Repeat (SQAIR), an interpretable deep generative model for image sequences. It can reliably discover and track objects through the sequence; it can also conditionally generate future frames, thereby simulating expected motion of objects. This is achieved by explicitly encoding object numbers, locations and appearances in the latent variables of the model. SQAIR retains all strengths of its predecessor, Attend, Infer, Repeat (AIR, Eslami et al., 2016), including unsupervised learning, made possible by inductive biases present in the model structure. We use a moving multi-\textsc{mnist} dataset to show limitations of AIR in detecting overlapping or partially occluded objects, and show how \textsc{sqair} overcomes them by leveraging temporal consistency of objects. Finally, we also apply SQAIR to real-world pedestrian CCTV data, where it learns to reliably detect, track and generate walking pedestrians with no supervision.

A convex program for bilinear inversion of sparse vectors

We consider the bilinear inverse problem of recovering two vectors, $\mathbf{x} \in \mathbb{R}^L$ and $\mathbf{w} \in \mathbb{R}^L$, from their entrywise product. We consider the case where $\mathbf{x}$ and $\mathbf{w}$ have known signs and are sparse with respect to known dictionaries of size $K$ and $N$, respectively. Here, $K$ and $N$ may be larger than, smaller than, or equal to $L$. We introduce L1-BranchHull, which is a convex program posed in the natural parameter space and does not require an approximate solution or initialization in order to be stated or solved. We study the case where $\mathbf{x}$ and $\mathbf{w}$ are $S_1$- and $S_2$-sparse with respect to a random dictionary and present a recovery guarantee that only depends on the number of measurements as $L > \Omega((S_1+S_2)\log^2(K+N))$. Numerical experiments verify that the scaling constant in the theorem is not too large. One application of this problem is the sweep distortion removal task in dielectric imaging, where one of the signals is a nonnegative reflectivity, and the other signal lives in a known subspace, for example that given by dominant wavelet coefficients. We also introduce variants of L1-BranchHull for the purposes of tolerating noise and outliers, and for the purpose of recovering piecewise constant signals. We provide an ADMM implementation of these variants and show they can extract piecewise constant behavior from real images.

Learning Conditioned Graph Structures for Interpretable Visual Question Answering

Visual Question Answering is a challenging problem requiring a combination of concepts from Computer Vision and Natural Language Processing. Most existing approaches use a two-stream strategy, computing image and question features that are subsequently merged using a variety of techniques. Nonetheless, very few rely on higher-level image representations, which can capture semantic and spatial relationships. In this paper, we propose a novel graph-based approach for Visual Question Answering. Our method combines a graph learner module, which learns a question-specific graph representation of the input image, with the recent concept of graph convolutions, aiming to learn image representations that capture question-specific interactions. We test our approach on the VQA v2 dataset using a simple baseline architecture enhanced by the proposed graph learner module. We obtain state-of-the-art results with 65.77% accuracy and demonstrate the interpretability of the proposed method.

The promises and pitfalls of Stochastic Gradient Langevin Dynamics

Stochastic Gradient Langevin Dynamics (SGLD) has emerged as a key MCMC algorithm for Bayesian learning from large scale datasets. While SGLD with decreasing step sizes converges weakly to the posterior distribution, the algorithm is often used with a constant step size in practice and has demonstrated spectacular successes in machine learning tasks. The current practice is to set the step size inversely proportional to N where N is the number of training samples. As N becomes large, we show that the SGLD algorithm has an invariant probability measure which significantly departs from the target posterior and behaves like Stochastic Gradient Descent (SGD). This difference is inherently due to the high variance of the stochastic gradients. Several strategies have been suggested to reduce this effect; among them, SGLD Fixed Point (SGLDFP) uses carefully designed control variates to reduce the variance of the stochastic gradients. We show that SGLDFP gives approximate samples from the posterior distribution, with an accuracy comparable to the Langevin Monte Carlo (LMC) algorithm for a computational cost sublinear in the number of data points. We provide a detailed analysis of the Wasserstein distances between LMC, SGLD, SGLDFP and SGD and explicit expressions of the means and covariance matrices of their invariant distributions. Our findings are supported by limited numerical experiments.
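
To make the setting concrete, here is a minimal numpy sketch (not the authors' code) of one constant-step-size SGLD update, with an optional SGLDFP-style control variate anchored at an approximate posterior mode theta_star; grad_log_prior and grad_log_lik are hypothetical callables.

    import numpy as np

    def sgld_step(theta, data, grad_log_prior, grad_log_lik, h, rng,
                  theta_star=None, batch_size=32):
        # Minibatch estimate of the log-posterior gradient.
        N = len(data)
        idx = rng.choice(N, size=batch_size, replace=False)
        g = grad_log_prior(theta) + (N / batch_size) * sum(
            grad_log_lik(data[i], theta) for i in idx)
        if theta_star is not None:
            # SGLDFP-style control variate: subtract the same estimate at an
            # approximate mode theta_star, where the full log-posterior
            # gradient is ~0, so only the minibatch difference stays noisy.
            g -= grad_log_prior(theta_star) + (N / batch_size) * sum(
                grad_log_lik(data[i], theta_star) for i in idx)
        # Euler-Maruyama step of the Langevin diffusion with step size h.
        return theta + 0.5 * h * g + np.sqrt(h) * rng.normal(size=theta.shape)

With h inversely proportional to N, the injected noise term shrinks like N^{-1/2} while the minibatch gradient noise does not, which is one way to see why plain SGLD drifts toward SGD-like behaviour.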

Monte-Carlo Tree Search for Constrained POMDPs

Monte-Carlo Tree Search (MCTS) has been successfully applied to very large POMDPs, a standard model for stochastic sequential decision-making problems. However, many real-world problems inherently have multiple goals, where multi-objective formulations are more natural. The constrained POMDP (CPOMDP) is such a model that maximizes the reward while constraining the cost, extending the standard POMDP model. To date, solution methods for CPOMDPs assume an explicit model of the environment, and thus are hardly applicable to large-scale real-world problems. In this paper, we present CC-POMCP (Cost-Constrained POMCP), an online MCTS algorithm for large CPOMDPs that leverages the optimization of LP-induced parameters and only requires a black-box simulator of the environment. In the experiments, we demonstrate that CC-POMCP converges to the optimal stochastic action selection in CPOMDP and pushes the state-of-the-art by being able to scale to very large problems.

Learning Loop Invariants for Program Verification

A fundamental problem in program verification concerns inferring loop invariants. The problem is undecidable and even practical instances are challenging. Inspired by how human experts construct loop invariants, we propose a reasoning framework that constructs the solution by multi-step decision making and querying an external program graph memory block. By training with reinforcement learning, we are able to capture rich program features and avoid the need for ground truth solutions as supervision. Compared to previous learning tasks in domains with graph-structured data, we address unique challenges, such as a binary objective function and an extremely sparse reward that is given by an automated theorem prover only after the complete loop invariant is proposed. We evaluate our approach on a suite of 133 benchmark problems and compare it to three state-of-the-art systems. It solves 97 problems compared to 73 by a stochastic search-based system, 77 by a symbolic constraint solver, and 100 by a decision tree learning-based system. Moreover, the strategy learned can be generalized to new programs: compared to solving new instances from scratch, the pre-trained agent is more sample efficient in finding solutions.

Adversarial Scene Editing: Automatic Object Removal from Weak Supervision

While great progress has been made recently in automatic image manipulation, it has been limited to object-centric images like faces or structured scene datasets. In this work, we take a step towards general scene-level image editing by developing an automatic interaction-free object removal model. Our model learns to find and remove objects from general scene images using image-level labels and unpaired data in a generative adversarial network (GAN) framework. We achieve this with two key contributions: a two-stage editor architecture consisting of a mask generator and image in-painter that co-operate to remove objects, and a novel GAN-based prior for the mask generator that allows us to flexibly incorporate knowledge about object shapes. We experimentally show on two datasets that our method effectively removes a wide variety of objects using weak supervision only.

Plug-in Estimation in High-Dimensional Linear Inverse Problems: A Rigorous Analysis

Estimating a vector $\mathbf{x}$ from noisy linear measurements $\mathbf{Ax+w}$ often requires use of prior knowledge or structural constraints on $\mathbf{x}$ for accurate reconstruction. Several recent works have considered combining linear least-squares estimation with a generic or ``plug-in'' denoiser function that can be designed in a modular manner based on the prior knowledge about $\mathbf{x}$. While these methods have shown excellent performance, it has been difficult to obtain rigorous performance guarantees. This work considers plug-in denoising combined with the recently-developed Vector Approximate Message Passing (VAMP) algorithm, which is itself derived via Expectation Propagation techniques. It is shown that the mean squared error of this ``plug-in'' VAMP can be exactly predicted for a large class of high-dimensional random matrices $\mathbf{A}$ and denoisers. The method is illustrated in image reconstruction and parametric bilinear estimation.

GILBO: One Metric to Measure Them All

We propose a simple, tractable lower bound on the mutual information contained in the joint generative density of any latent variable generative model: the GILBO (Generative Information Lower BOund). It offers a data-independent measure of the complexity of the learned latent variable description, giving the log of the effective description length. It is well-defined for both VAEs and GANs. We compute the GILBO for 800 GANs and VAEs each trained on four datasets (MNIST, FashionMNIST, CIFAR-10 and CelebA) and discuss the results.
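
A hedged PyTorch sketch of the estimator, assuming a frozen generator mapping z in R^{z_dim} to (flattenable) 28x28 images; the layer sizes and the diagonal-Gaussian encoder family are illustrative choices, not the paper's exact setup:

    import torch
    import torch.nn as nn

    def gilbo(generator, z_dim, steps=2000, batch=128):
        """Fit an auxiliary Gaussian encoder e(z|x) on pairs (z, G(z)), then
        return a Monte Carlo estimate of E[log e(z|x) - log p(z)], which
        lower-bounds the mutual information I(X; Z)."""
        enc = nn.Sequential(nn.Flatten(), nn.Linear(784, 256), nn.ReLU(),
                            nn.Linear(256, 2 * z_dim))  # mean and log-std
        opt = torch.optim.Adam(enc.parameters(), lr=1e-3)
        prior = torch.distributions.Normal(0.0, 1.0)
        for _ in range(steps):
            z = torch.randn(batch, z_dim)
            with torch.no_grad():
                x = generator(z)                      # generator stays frozen
            mu, log_std = enc(x).chunk(2, dim=1)
            e = torch.distributions.Normal(mu, log_std.exp())
            loss = -e.log_prob(z).sum(dim=1).mean()   # maximize log e(z|x)
            opt.zero_grad(); loss.backward(); opt.step()
        with torch.no_grad():                         # final MC estimate
            z = torch.randn(10000, z_dim)
            mu, log_std = enc(generator(z)).chunk(2, dim=1)
            e = torch.distributions.Normal(mu, log_std.exp())
            gap = e.log_prob(z) - prior.log_prob(z)
            return gap.sum(dim=1).mean().item()       # nats

Because any encoder family gives a valid lower bound, a weak encoder only makes the GILBO conservative, never invalid.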

Bayesian Adversarial Learning

Deep neural networks have been known to be vulnerable to adversarial attacks, raising serious security concerns for their practical deployment. Popular defensive approaches can be formulated as a (distributionally) robust optimization problem, which minimizes a ``point estimate'' of the worst-case loss derived from either per-datum perturbation or an adversary data-generating distribution within certain pre-defined constraints. This point estimate ignores potential test adversaries that are beyond the pre-defined constraints, so model robustness might deteriorate sharply against stronger test adversarial data. In this work, a novel robust training framework, Bayesian Adversarial Learning (BAL), is proposed to alleviate this issue: a distribution is placed on the adversarial data-generating distribution to account for the uncertainty of the adversarial data-generating process. This uncertainty directly accounts for potential adversaries that are stronger than the point estimate used in distributionally robust optimization. The uncertainty of model parameters is also incorporated to accommodate the full Bayesian framework. We design a scalable Markov Chain Monte Carlo sampling strategy to obtain the posterior distribution over model parameters. Various experiments are conducted to verify the superiority of BAL over existing adversarial training methods.

Efficient Gradient Computation for Structured Output Learning with Rational and Tropical Losses

We present a general framework for designing efficient and scalable gradient computations for structured output prediction problems. These gradients can be used with algorithms such as backpropagation for deep learning. While many structured prediction problems admit a natural loss function for evaluation such as the edit-distance or $n$-gram loss, existing learning algorithms are typically designed to optimize alternative objectives such as the cross-entropy. This is because a naïve implementation of the natural loss functions often results in intractable gradient computations. In this paper, we design efficient gradient computations for learning algorithms for two broad families of structured prediction loss functions, rational and tropical losses, which include the edit-distance and other string similarity measures as special cases. Our methods cast the gradient computation as a shortest path problem on a weighted finite state transducer and allow training machine learning models (including neural networks) using complex structured losses which were heretofore intractable. We report experimental results confirming significant runtime improvement compared to direct methods.

Unsupervised Learning of Artistic Styles with Archetypal Style Analysis

In this paper, we introduce an unsupervised learning approach to automatically discover, summarize, and manipulate artistic styles from large collections of paintings. Our method is based on archetypal analysis, which is an unsupervised learning technique akin to sparse coding with a geometric interpretation. When applied to deep image representations from a data collection, it learns a dictionary of archetypal styles, which can be easily visualized. After training the model, the style of a new image, which is characterized by local statistics of deep visual features, is approximated by a sparse convex combination of archetypes. This allows us to interpret which archetypal styles are present in the input image, and in which proportion. Finally, our approach allows us to manipulate the coefficients of the latent archetypal decomposition, and achieve various special effects such as style enhancement, transfer, and interpolation between multiple archetypes.

Neural Ordinary Differential Equations

We introduce a new family of deep neural network models. Instead of specifying a discrete sequence of hidden layers, we parameterize the derivative of the hidden state using a neural network. The output of the network is computed using a black-box differential equation solver. These continuous-depth models have constant memory cost, adapt their evaluation strategy to each input, and can explicitly trade numerical precision for speed. We demonstrate these properties in continuous-depth residual networks and continuous-time latent variable models. We also construct continuous normalizing flows, a generative model that can train by maximum likelihood, without partitioning or ordering the data dimensions. For training, we show how to scalably backpropagate through any ODE solver, without access to its internal operations. This allows end-to-end training of ODEs within larger models.
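
The forward pass fits in a few lines; below is a minimal numpy/scipy sketch of a continuous-depth layer, with made-up weights and sizes for illustration. The paper's adjoint method for backpropagating through the solver (the source of the constant memory cost) is omitted here.

    import numpy as np
    from scipy.integrate import solve_ivp

    rng = np.random.default_rng(0)
    W1 = 0.1 * rng.normal(size=(64, 65))   # hypothetical derivative network
    W2 = 0.1 * rng.normal(size=(64, 64))

    def dh_dt(t, h):
        # The "layer": a small network parameterizing the state derivative,
        # with time t appended to the state as an extra input.
        return W2 @ np.tanh(W1 @ np.append(h, t))

    h0 = rng.normal(size=64)                         # input activations
    sol = solve_ivp(dh_dt, t_span=(0.0, 1.0), y0=h0,
                    method="RK45", rtol=1e-5)        # black-box ODE solver
    h1 = sol.y[:, -1]                                # output activations at t=1

Note how the adaptive solver chooses its own number of function evaluations per input, which is exactly the "adapt their evaluation strategy to each input" property claimed above.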

SLANG: Fast Structured Covariance Approximations for Bayesian Deep Learning with Natural Gradient

We focus on the task of uncertainty estimation in deep learning models, where it is computationally challenging to even form a Gaussian approximation to the posterior. Because of this, many existing Gaussian approximations only use a diagonal covariance matrix even though such matrices are known to give poor uncertainty estimates. We propose a new stochastic, low-rank, approximate natural-gradient (SLANG) method for variational inference, which allows us to efficiently fit a non-diagonal approximation. The method estimates a ``diagonal plus low-rank'' structure based solely on back-propagated gradients of the network log-likelihood. This requires strictly less gradient computation than methods that require the gradient of the whole variational objective. Empirical evaluations on standard benchmarks confirm that SLANG obtains reasonable posterior approximations which give comparable accuracy to the state-of-the-art.

The Convergence of Sparsified Gradient Methods

Distributed training of massive machine learning models, in particular deep neural networks, via Stochastic Gradient Descent (SGD) is becoming commonplace. Several families of communication-reduction methods, such as quantization, large-batch methods, and gradient sparsification, have been proposed. To date, gradient sparsification methods--where each node sorts gradients by magnitude, and only communicates a subset of the components, accumulating the rest locally--are known to yield some of the largest practical gains. Such methods can reduce the amount of communication per step by up to \emph{three orders of magnitude}, while preserving model accuracy. Yet, this family of methods currently has no theoretical justification. This is the question we address in this paper. We prove that, under analytic assumptions, sparsifying gradients by magnitude with local error correction provides convergence guarantees, for both convex and non-convex smooth objectives, for data-parallel SGD. The main insight is that sparsification methods implicitly maintain bounds on the maximum impact of stale updates, thanks to selection by magnitude. Our analysis and empirical validation also reveal that these methods do require analytical conditions to converge well, justifying existing heuristics.
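
For concreteness, a minimal numpy sketch of the magnitude-based top-k sparsification with local error correction that the analysis covers; the choice of k and the surrounding SGD loop are left out, and this is an illustration rather than the paper's implementation:

    import numpy as np

    def topk_with_memory(grad, residual, k):
        """Communicate only the k largest-magnitude components of the
        error-corrected gradient; keep the remainder locally."""
        corrected = grad + residual
        idx = np.argpartition(np.abs(corrected), -k)[-k:]
        sparse = np.zeros_like(corrected)
        sparse[idx] = corrected[idx]       # this sparse vector is sent
        return sparse, corrected - sparse  # the rest stays in local memory

The locally accumulated residual is precisely what keeps the staleness of dropped coordinates bounded in the convergence argument sketched above.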

Differential Properties of Sinkhorn Approximation for Learning with Wasserstein Distance

Applications of optimal transport have recently gained remarkable attention as a result of the computational advantages of entropic regularization. However, in most situations the Sinkhorn approximation to the Wasserstein distance is replaced by a regularized version that is less accurate but easy to differentiate. In this work we characterize the differential properties of the original Sinkhorn distance, proving that it enjoys the same smoothness as its regularized version, and we explicitly provide an efficient algorithm to compute its gradient. We show that this result benefits both theory and applications: on one hand, high order smoothness confers statistical guarantees to learning with Wasserstein approximations. On the other hand, the gradient formula allows us to efficiently solve learning and optimization problems in practice. Promising preliminary experiments complement our analysis.
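
As background, a minimal numpy implementation of the Sinkhorn fixed-point iterations for histograms a, b with cost matrix C and regularization eps. The dual potential f below is, up to an additive constant, the gradient of the regularized transport cost with respect to a; the paper supplies the analogous explicit gradient for the sharp (unregularized-value) Sinkhorn distance, which this plain iteration does not by itself provide.

    import numpy as np

    def sinkhorn(a, b, C, eps, iters=500):
        K = np.exp(-C / eps)                     # Gibbs kernel
        u, v = np.ones_like(a), np.ones_like(b)
        for _ in range(iters):                   # alternating marginal scaling
            u = a / (K @ v)
            v = b / (K.T @ u)
        P = u[:, None] * K * v[None, :]          # transport plan
        f, g = eps * np.log(u), eps * np.log(v)  # dual potentials
        return P, f, g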

Scalable Robust Matrix Factorization with Nonconvex Loss

Robust matrix factorization (RMF), which uses the $\ell_1$-loss, often outperforms standard matrix factorization using the $\ell_2$-loss, particularly when outliers are present. The state-of-the-art RMF solver is the RMF-MM algorithm, which, however, cannot utilize data sparsity. Moreover, sometimes even the (convex) $\ell_1$-loss is not robust enough. In this paper, we propose the use of nonconvex loss to enhance robustness. To address the resultant difficult optimization problem, we use majorization-minimization (MM) optimization and propose a new MM surrogate. To improve scalability, we exploit data sparsity and optimize the surrogate via its dual with the accelerated proximal gradient algorithm. The resultant algorithm has low time and space complexities and is guaranteed to converge to a critical point. Extensive experiments demonstrate its superiority over the state-of-the-art in terms of both accuracy and scalability.

LAG: Lazily Aggregated Gradient for Communication-Efficient Distributed Learning

This paper presents a new class of gradient methods for distributed machine learning that adaptively skip the gradient calculations to learn with reduced communication and computation. Simple rules are designed to detect slowly-varying gradients and, therefore, trigger the reuse of outdated gradients. The resultant gradient-based algorithms are termed Lazily Aggregated Gradient --- justifying our acronym \textbf{LAG} used henceforth. Theoretically, the merits of this contribution are: i) the convergence rate is the same as batch gradient descent in strongly-convex, convex, and nonconvex cases; and, ii) if the distributed datasets are heterogeneous (quantified by certain measurable constants), the communication rounds needed to achieve a targeted accuracy are reduced thanks to the adaptive reuse of \emph{lagged} gradients. Numerical experiments on both synthetic and real data corroborate a significant communication reduction compared to alternatives.
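
A toy numpy sketch of the skipping rule on one worker; the actual LAG criterion compares the gradient change against a weighted sum of recent iterate differences, so the fixed threshold below is a simplification for illustration:

    import numpy as np

    def lag_round(grad_fn, theta, last_sent, thresh):
        """One worker's round: upload a fresh gradient only if it has
        drifted enough from the one the server already holds."""
        g = grad_fn(theta)
        if np.linalg.norm(g - last_sent) ** 2 >= thresh:
            return g, True           # communicate the fresh gradient
        return last_sent, False      # server reuses the lagged gradient

The server then descends on the sum of the (possibly lagged) worker gradients, so rounds in which a worker's gradient has barely changed cost that worker no communication.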

Lifelong Inverse Reinforcement Learning

Methods for learning from demonstration (LfD) have shown success in acquiring behavior policies by imitating a user. However, even for a single task, LfD may require numerous demonstrations. For versatile agents that must learn many tasks via demonstration, this process would substantially burden the user if each task were learned in isolation. To address this challenge, we introduce the novel problem of lifelong learning from demonstration, which allows the agent to continually build upon knowledge learned from previously demonstrated tasks to accelerate the learning of new tasks, reducing the number of demonstrations required. As one solution to this problem, we propose the first lifelong learning approach to inverse reinforcement learning, which learns consecutive tasks via demonstration, continually transferring knowledge between tasks to improve performance.

Amortized Inference Regularization

The variational autoencoder (VAE) is a popular model for density estimation and representation learning. Canonically, the variational principle suggests to prefer an expressive inference model so that the variational approximation is accurate. However, it is often overlooked that an overly-expressive inference model can be detrimental to the test set performance of both the amortized posterior approximator and, more importantly, the generative density estimator. In this paper, we leverage the fact that VAEs rely on amortized inference and propose techniques for amortized inference regularization (AIR) that control the smoothness of the inference model. We demonstrate that, by applying AIR, it is possible to improve VAE generalization on both inference and generative performance. Our paper challenges the belief that amortized inference is simply a mechanism for approximating maximum likelihood training and illustrates that regularization of the amortization family provides a new direction for understanding and improving generalization in VAEs.

Flexible and accurate inference and learning for deep generative models

We introduce a new approach to learning in hierarchical latent-variable generative models called the “distributed distributional code Helmholtz machine”, which emphasises flexibility and accuracy in the inferential process. In common with the original Helmholtz machine and later variational autoencoder algorithms (but unlike adversarial methods) our approach learns an explicit inference or “recognition” model to approximate the posterior distribution over the latent variables. Unlike in these earlier methods, the posterior representation is not limited to a narrow tractable parameterised form (nor is it represented by samples). To train the generative and recognition models we develop an extended wake-sleep algorithm inspired by the original Helmholtz machine. This makes it possible to learn hierarchical latent models with both discrete and continuous variables, where an accurate posterior representation is essential. We demonstrate that the new algorithm outperforms current state-of-the-art methods on synthetic, natural image patch and the MNIST data sets.

Direct Runge-Kutta Discretization Achieves Acceleration

We study gradient-based optimization methods obtained by directly discretizing a second-order ordinary differential equation (ODE) related to the continuous limit of Nesterov's accelerated gradient method. When the function is smooth enough, we show that acceleration can be achieved by a stable discretization of this ODE using standard Runge-Kutta integrators. Specifically, we prove that under Lipschitz-gradient, convexity and order-$(s+2)$ differentiability assumptions, the sequence of iterates generated by discretizing the proposed second-order ODE converges to the optimal solution at a rate of $\mathcal{O}({N^{-2\frac{s}{s+1}}})$, where $s$ is the order of the Runge-Kutta numerical integrator. Furthermore, we introduce a new local flatness condition on the objective, under which rates even faster than $\mathcal{O}(N^{-2})$ can be achieved with low-order integrators and only gradient information. Notably, this flatness condition is satisfied by several standard loss functions used in machine learning. We provide numerical experiments that verify the theoretical rates predicted by our results.
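
For intuition, the sketch below integrates the classical continuous limit of Nesterov's method, $\ddot{X} + \frac{3}{t}\dot{X} + \nabla f(X) = 0$, rewritten as a first-order system and handed to an off-the-shelf Runge-Kutta integrator. The paper's ODE and assumptions are more general; the quadratic objective here is only a demo.

    import numpy as np
    from scipy.integrate import solve_ivp

    A = np.diag([1.0, 10.0])
    grad_f = lambda x: A @ x              # f(x) = x' A x / 2, minimum at 0

    def nesterov_ode(t, s):
        x, v = s[:2], s[2:]               # state = (position, velocity)
        return np.concatenate([v, -(3.0 / t) * v - grad_f(x)])

    s0 = np.concatenate([np.ones(2), np.zeros(2)])
    sol = solve_ivp(nesterov_ode, (1.0, 50.0), s0,  # start at t=1, since the
                    method="RK45", rtol=1e-8)       # 3/t damping blows up at 0
    print(np.linalg.norm(sol.y[:2, -1]))  # ||x(t)|| decays toward the minimizer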

Lipschitz regularity of deep neural networks: analysis and efficient estimation

Deep neural networks are notorious for being sensitive to small well-chosen perturbations, and estimating the regularity of such architectures is of utmost importance for safe and robust practical applications. In this paper, we investigate one of the key characteristics to assess the regularity of such methods: the Lipschitz constant of deep learning architectures. First, we show that, even for two-layer neural networks, the exact computation of this quantity is NP-hard and state-of-the-art methods may significantly overestimate it. Then, we both extend and improve previous estimation methods by providing AutoLip, the first generic algorithm for upper bounding the Lipschitz constant of any automatically differentiable function. We provide a power method algorithm working with automatic differentiation, allowing efficient computations even on large convolutions. Second, for sequential neural networks, we propose an improved algorithm named SeqLip that takes advantage of the linear computation graph to split the computation per pair of consecutive layers. Third, we propose heuristics on SeqLip in order to tackle very large networks. Our experiments show that SeqLip can significantly improve on the existing upper bounds. Finally, we provide an implementation of AutoLip in the PyTorch environment that may be used to better estimate the robustness of a given neural network to small perturbations or regularize it using more precise Lipschitz estimations. These results also hint at the difficulty of estimating the Lipschitz constant of deep networks.
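
As a point of reference, here is the naive product-of-spectral-norms upper bound for a feedforward net with 1-Lipschitz activations, computed with a numpy power iteration. AutoLip generalizes this idea to arbitrary autodiff graphs and SeqLip tightens it across consecutive layer pairs; the sketch below is the baseline bound, not either algorithm.

    import numpy as np

    def spectral_norm(W, iters=100, seed=0):
        """Largest singular value of W via power iteration on W^T W."""
        v = np.random.default_rng(seed).normal(size=W.shape[1])
        for _ in range(iters):
            v = W.T @ (W @ v)
            v /= np.linalg.norm(v)
        return np.linalg.norm(W @ v)

    def naive_lipschitz_bound(weights):
        # Composition bound: Lip(f_L o ... o f_1) <= product of layer constants.
        out = 1.0
        for W in weights:
            out *= spectral_norm(W)
        return out

This product can be loose by orders of magnitude on deep networks, which is the gap SeqLip is designed to shrink.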

A Bandit Approach to Sequential Experimental Design with False Discovery Control

We propose a new adaptive sampling approach to multiple testing which aims to maximize statistical power while ensuring anytime false discovery control. We consider $n$ distributions whose means are partitioned by whether they are below or equal to a baseline (nulls), versus above the baseline (true positives). In addition, each distribution can be sequentially and repeatedly sampled. Using techniques from multi-armed bandits, we provide an algorithm that takes as few samples as possible to exceed a target true positive proportion (i.e. proportion of true positives discovered) while giving anytime control of the false discovery proportion (nulls predicted as true positives). Our sample complexity results match known information theoretic lower bounds and through simulations we show a substantial performance improvement over uniform sampling and an adaptive elimination style algorithm. Given the simplicity of the approach, and its sample efficiency, the method has promise for wide adoption in the biological sciences, clinical testing for drug discovery, and maximization of click-through in A/B/n testing problems.

Recurrent Relational Networks

This paper is concerned with learning to solve tasks that require a chain of interdependent steps of relational inference, like answering complex questions about the relationships between objects, or solving puzzles where the smaller elements of a solution mutually constrain each other. We introduce the recurrent relational network, a general purpose module that operates on a graph representation of objects. As a generalization of Santoro et al. [2017]’s relational network, it can augment any neural network model with the capacity to do many-step relational reasoning. We achieve state of the art results on the bAbI textual question-answering dataset with the recurrent relational network, consistently solving 20/20 tasks. As bAbI is not particularly challenging from a relational reasoning point of view, we introduce Pretty-CLEVR, a new diagnostic dataset for relational reasoning. In the Pretty-CLEVR set-up, we can vary the question to control for the number of relational reasoning steps that are required to obtain the answer. Using Pretty-CLEVR, we probe the limitations of multi-layer perceptrons, relational and recurrent relational networks. Finally, we show how recurrent relational networks can learn to solve Sudoku puzzles from supervised training data, a challenging task requiring upwards of 64 steps of relational reasoning. We achieve state-of-the-art results amongst comparable methods by solving 96.6% of the hardest Sudoku puzzles.

Contextual Combinatorial Multi-armed Bandits with Volatile Arms and Submodular Reward

In this paper, we study the stochastic contextual combinatorial multi-armed bandit (CC-MAB) framework that is tailored for volatile arms and submodular reward functions. CC-MAB inherits properties from both contextual bandit and combinatorial bandit: it aims to select a set of arms in each round based on the side information (a.k.a. context) associated with the arms. By ``volatile arms'', we mean that the available arms to select from in each round may change; and by ``submodular rewards'', we mean that the total reward achieved by selected arms is not a simple sum of individual rewards but demonstrates a feature of diminishing returns determined by the relations between selected arms (e.g. relevance and redundancy). Volatile arms and submodular rewards are often seen in many real-world applications, e.g. recommender systems and crowdsourcing, in which multi-armed bandit (MAB) based strategies are extensively applied. Although there exist works that investigate these issues separately based on standard MAB, jointly considering all these issues in a single MAB problem requires very different algorithm design and regret analysis. Our algorithm CC-MAB provides an online decision-making policy in a contextual and combinatorial bandit setting and effectively addresses the issues raised by volatile arms and submodular reward functions. The proposed algorithm is proved to achieve $O(cT^{\frac{2\alpha+D}{3\alpha + D}}\log(T))$ regret after a span of $T$ rounds. The performance of CC-MAB is evaluated by experiments conducted on a real-world crowdsourcing dataset, and the result shows that our algorithm outperforms the prior art.
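
The diminishing-returns structure is what makes a greedy selection oracle natural here. Below is a generic hedged sketch of that oracle in Python, where marginal_gain is a hypothetical callable that would be built from per-context UCB estimates; the context partitioning and exploration phases of CC-MAB are omitted.

    def greedy_select(available_arms, marginal_gain, budget):
        """Greedily build an arm set for a monotone submodular reward:
        repeatedly add the arm with the largest estimated marginal gain."""
        chosen = []
        for _ in range(budget):
            candidates = [a for a in available_arms if a not in chosen]
            if not candidates:
                break
            chosen.append(max(candidates,
                              key=lambda a: marginal_gain(chosen, a)))
        return chosen

For monotone submodular objectives, this greedy rule is the standard constant-factor approximation, which is why it serves as the per-round decision inside bandit algorithms of this kind.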

Moonshine: Distilling with Cheap Convolutions

Many engineers wish to deploy modern neural networks in memory-limited settings; but the development of flexible methods for reducing memory use is in its infancy, and there is little knowledge of the resulting cost-benefit. We propose structural model distillation for memory reduction using a strategy that produces a student architecture that is a simple transformation of the teacher architecture: no redesign is needed, and the same hyperparameters can be used. Using attention transfer, we provide Pareto curves/tables for distillation of residual networks with four benchmark datasets, indicating the memory versus accuracy payoff. We show that substantial memory savings are possible with very little loss of accuracy, and confirm that distillation provides student network performance that is better than training that student architecture directly on data.

Query Complexity of Bayesian Private Learning

We study the query complexity of Bayesian Private Learning: a learner wishes to locate a random target within an interval by submitting queries, in the presence of an adversary who observes all of her queries but not the responses. How many queries are necessary and sufficient in order for the learner to accurately estimate the target, while simultaneously concealing the target from the adversary? Our main result is a query complexity lower bound that is tight up to the first order. We show that if the learner wants to estimate the target within an error of $\epsilon$, while ensuring that no adversary estimator can achieve a constant additive error with probability greater than $1/L$, then the query complexity is on the order of $L\log(1/\epsilon)$ as $\epsilon \to 0$. Our result demonstrates that increased privacy, as captured by $L$, comes at the expense of a \emph{multiplicative} increase in query complexity. The proof builds on Fano's inequality and properties of certain proportional-sampling estimators.

Smoothed analysis of the low-rank approach for smooth semidefinite programs

We consider semidefinite programs (SDPs) of size $n$ with equality constraints. In order to overcome the scalability issues arising for large instances, Burer and Monteiro proposed a factorized approach based on optimizing over a matrix $Y$ of size $n\times k$ such that $X=YY^*$ is the SDP variable. The advantages of such a formulation are twofold: the dimension of the optimization variable is reduced and positive semidefiniteness is naturally enforced. However, the problem in $Y$ is non-convex. In prior work, it has been shown that, when the constraints on the factorized variable regularly define a smooth manifold, almost all second-order stationary points (SOSPs) are optimal. Nevertheless, in practice, one can only compute points which approximately satisfy necessary optimality conditions, so that it is crucial to know whether such points are also approximately optimal. To this end, and under similar assumptions, we use smoothed analysis to show that approximate SOSPs (ASOSPs) for a randomly perturbed objective function are approximate global optima, as long as the number of constraints scales sub-quadratically with the desired rank of the optimal solution. In this setting, an approximate optimum $Y$ maps to the approximate optimum $X=YY^*$ of the SDP. We particularize our results to SDP relaxations of phase retrieval.

The Description Length of Deep Learning models

Deep learning models often have more parameters than observations, and still perform well. This is sometimes described as a paradox. In this work, we show experimentally that despite their huge number of parameters, deep neural networks can compress the data losslessly \emph{even when taking the cost of encoding the parameters into account}. Such a compression viewpoint originally motivated the use of \emph{variational methods} in neural networks \cite{Hinton,Schmidhuber1997}. However, we show that these variational methods provide surprisingly poor compression bounds, despite being explicitly built to minimize such bounds. This might explain the relatively poor practical performance of variational methods in deep learning. Better encoding methods, imported from the Minimum Description Length (MDL) toolbox, yield much better compression values on deep networks.
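
One of the better MDL encodings alluded to above is prequential (online) coding, where each data block is encoded with a model trained only on the data before it, so parameters are never transmitted explicitly. A hedged Python sketch, with fit and log2_prob as hypothetical callables:

    def prequential_codelength(blocks, fit, log2_prob):
        """Total lossless codelength, in bits, of labels encoded block by
        block with a model retrained on the growing prefix. fit must
        handle an empty prefix (e.g., by returning a uniform model)."""
        seen, bits = [], 0.0
        for block in blocks:
            model = fit(seen)                   # train on all data seen so far
            bits += -log2_prob(model, block)    # code the next block with it
            seen.extend(block)
        return bits

The cost of the parameters is paid implicitly through the poor predictions of the early, under-trained models, which is why this codelength is a legitimate compression bound.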

Pelee: A Real-Time Object Detection System on Mobile Devices

The increasing need to run Convolutional Neural Network (CNN) models on mobile devices with limited computing power and memory resources encourages studies on efficient model design. A number of efficient architectures have been proposed in recent years, for example, MobileNet, ShuffleNet, and NASNet-A. However, all these models are heavily dependent on depthwise separable convolution, which lacks efficient implementation in most deep learning frameworks. In this study, we propose an efficient architecture named PeleeNet, which is built with conventional convolution instead. On the ImageNet ILSVRC 2012 dataset, our proposed PeleeNet achieves 1.8% higher accuracy (72.4% vs. 70.6%) and 23% faster speed than MobileNet, the state-of-the-art efficient architecture, while being only 66% of MobileNet's model size. We then propose a real-time object detection system by combining PeleeNet with the Single Shot MultiBox Detector (SSD) method and optimizing the architecture for fast speed. Our proposed detection system, named Pelee, achieves 76.4% mAP (mean average precision) on PASCAL VOC2007 and 22.4 mAP on the MS COCO dataset at a speed of 23.6 FPS on iPhone 8 and 74 FPS on NVIDIA TX2. The result on COCO outperforms YOLOv2 with higher precision, 13.6 times lower computational cost, and 11.3 times smaller model size. The code and models are open-sourced.

Learning to Exploit Stability for 3D Scene Parsing

Human scene understanding uses a variety of visual and non-visual cues to perform inference on object types, poses, and relations. Physics is a rich and universal cue which we exploit to enhance scene understanding. We integrate the physical cue of stability into the learning process using a REINFORCE approach coupled to a physics engine, and apply this to the problem of producing the 3D bounding boxes and poses of objects in a scene. We first show that applying physics supervision to an existing scene understanding model increases performance, produces more stable predictions, and allows training to an equivalent performance level with fewer annotated training examples. We then present a novel architecture for 3D scene parsing named Prim R-CNN. With physics supervision, Prim R-CNN outperforms existing scene understanding approaches on this problem. Finally, we show that applying physics supervision on unlabeled real images improves real-domain transfer of models trained on synthetic data.

Neighbourhood Consensus Networks

We address the problem of finding reliable dense correspondences between a pair of images. This is a challenging task due to strong appearance differences between the corresponding scene elements and ambiguities generated by repetitive patterns. The contributions of this work are threefold. First, inspired by the classic idea of disambiguating feature matches using semi-local constraints, we develop an end-to-end trainable convolutional neural network architecture that identifies sets of spatially consistent matches by analyzing neighbourhood consensus patterns in the 4D space of all possible correspondences between a pair of images without the need for a global geometric model. Second, we demonstrate that the model can be trained effectively from weak supervision in the form of matching and non-matching image pairs without the need for costly manual annotation of point-to-point correspondences. Third, we show that the proposed neighbourhood consensus network can be applied to a range of matching tasks including both category- and instance-level matching, obtaining state-of-the-art results on the PF Pascal dataset and the InLoc indoor visual localization benchmark.

Where Do You Think You're Going?: Inferring Beliefs about Dynamics from Behavior

Inferring intent from observed behavior has been studied extensively within the frameworks of Bayesian inverse planning and inverse reinforcement learning. These methods infer a goal or reward function that best explains the actions of the observed agent, typically a human demonstrator. Another agent can use this inferred intent to predict, imitate, or assist the human user. However, a central assumption in inverse reinforcement learning is that the demonstrator is close to optimal. While models of suboptimal behavior exist, they typically assume that suboptimal actions are the result of some type of random noise or a known cognitive bias, like temporal inconsistency. In this paper, we take an alternative approach, and model suboptimal behavior as the result of internal model misspecification: the reason that user actions might deviate from near-optimal actions is that the user has an incorrect set of beliefs about the rules -- the dynamics -- governing how actions affect the environment. Our insight is that while demonstrated actions may be suboptimal in the real world, they may actually be near-optimal with respect to the user's internal model of the dynamics. By estimating these internal beliefs from observed behavior, we arrive at a new method for inferring intent. We demonstrate in simulation and in a user study with 12 participants that this approach enables us to more accurately model human intent, and can be used in a variety of applications, including offering assistance in a shared autonomy framework and inferring human preferences.

New Insight into Hybrid Stochastic Gradient Descent: Beyond With-Replacement Sampling and Convexity

As an incremental-gradient algorithm, the hybrid stochastic gradient descent (HSGD) enjoys the merits of both stochastic and full gradient methods for finite-sum minimization problems. However, the existing rate-of-convergence analysis for HSGD is made under with-replacement sampling (WRS) and is restricted to convex problems. It is not clear whether HSGD still carries these advantages under the common practice of without-replacement sampling (WoRS) for non-convex problems. In this paper, we affirmatively answer this open question by showing that under WoRS and for both convex and non-convex problems, it is still possible for HSGD (with constant step-size) to match full gradient descent in rate of convergence, while maintaining comparable sample-size-independent incremental first-order oracle complexity to stochastic gradient descent. For a special class of finite-sum problems with linear prediction models, our convergence results can be further improved in some cases. Extensive numerical results confirm our theoretical affirmation and demonstrate the favorable efficiency of WoRS-based HSGD.

Video-to-Video Synthesis

We study the problem of video-to-video synthesis, whose goal is to learn a mapping function from an input source video (e.g., a sequence of semantic segmentation masks) to an output photorealistic video that precisely depicts the content of the source video. While its image counterpart, the image-to-image synthesis problem, is a popular topic, the video-to-video synthesis problem is less explored in the literature. Without understanding temporal dynamics, directly applying existing image synthesis approaches to an input video often results in temporally incoherent videos of low visual quality. In this paper, we propose a novel approach for the video-to-video synthesis problem under the adversarial learning framework. Through the introduction of new generator and discriminator architectures, coupled with a spatial-temporal adversarial objective, we achieve high-resolution, photorealistic, temporally coherent video results. Experiments on multiple benchmarks show the advantage of our method compared to strong baselines. In particular, our model is capable of synthesizing 2K resolution videos of street scenes up to 30 seconds long, not possible before our work. Finally, we apply our approach to future video prediction, outperforming several state-of-the-art competing systems. (Note: using Adobe Reader is highly recommended to view the paper.)

Frequency-Domain Dynamic Pruning for Convolutional Neural Networks

Deep convolutional neural networks have demonstrated their power in a variety of applications. However, the storage and computational requirements have largely restricted their further extension to mobile devices. Recently, pruning of unimportant parameters has been used for both network compression and acceleration. Considering that there is spatial redundancy within most filters in a CNN, we propose a frequency-domain dynamic pruning scheme to exploit the spatial correlations. The frequency-domain coefficients are pruned dynamically in each iteration and different frequency bands are pruned discriminatively, given their different importance to accuracy. Experimental results demonstrate that the proposed scheme can outperform previous spatial-domain counterparts by a large margin. Specifically, it can achieve a compression ratio of 8x and an inference speed-up of 8.9x for ResNet-110, while the accuracy is even better than the reference model on CIFAR-10.
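
A static, one-shot numpy/scipy sketch of the core transform-and-threshold step; the paper additionally re-evaluates the pruning mask at every training iteration and uses band-dependent thresholds, both of which are omitted here:

    import numpy as np
    from scipy.fft import dctn, idctn

    def prune_filters_in_frequency(filters, keep_ratio=0.125):
        """filters: array of shape (out_ch, in_ch, k, k). Zero the smallest
        DCT coefficients of each spatial kernel, then transform back."""
        coeffs = dctn(filters, axes=(-2, -1), norm="ortho")
        cutoff = np.quantile(np.abs(coeffs), 1.0 - keep_ratio)
        coeffs[np.abs(coeffs) < cutoff] = 0.0
        return idctn(coeffs, axes=(-2, -1), norm="ortho")

Because natural filters concentrate their energy in a few low-frequency coefficients, far more coefficients can be dropped in the DCT domain than weights in the spatial domain at the same accuracy.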

Combinatorial Optimization with Graph Convolutional Networks and Guided Tree Search

We present a learning-based approach to computing solutions for certain NP-hard problems. Our approach combines deep learning techniques with useful algorithmic elements from classic heuristics. The central component is a graph convolutional network that is trained to estimate the likelihood, for each vertex in a graph, of whether this vertex is part of the optimal solution. The network is designed and trained to synthesize a diverse set of solutions, which enables rapid exploration of the solution space via tree search. The presented approach is evaluated on four canonical NP-hard problems and five datasets, which include benchmark satisfiability problems and real social network graphs with up to a hundred thousand nodes. Experimental results demonstrate that the presented approach substantially outperforms recent work, generalizes across datasets, and scales to graphs that are orders of magnitude larger than those used during training.

Kalman Normalization

As an indispensable component, Batch Normalization (BN) has successfully improved the training of deep neural networks (DNNs) with mini-batches, by normalizing the distribution of the internal representation for each hidden layer. However, the effectiveness of BN diminishes in the micro-batch scenario (\eg~fewer than 4 samples in a mini-batch), since the statistics estimated from a mini-batch are not reliable with insufficient samples. This limits BN's usefulness when training larger models on segmentation, detection, and video-related problems, which require small batches constrained by memory consumption. In this paper, we present a novel normalization method, called Kalman Normalization (KN), for improving and accelerating the training of DNNs, particularly under the context of micro-batches. Specifically, unlike the existing solutions treating each hidden layer as an isolated system, KN treats all the layers in a network as a whole system, and estimates the statistics of a certain layer by considering the distributions of all its preceding layers, mimicking the merits of Kalman Filtering. On ResNet50 trained on ImageNet, KN has 3.4% lower error than its BN counterpart when using a batch size of 4; even when using typical batch sizes, KN still maintains an advantage over BN, while other BN variants suffer a performance degradation. Moreover, KN can be naturally generalized to many existing normalization variants to obtain gains, \eg equipping Group Normalization \cite{wu2018group} with Group Kalman Normalization (GKN). KN can outperform BN and its variants for large-scale object detection and segmentation tasks on COCO 2017.

Efficient Algorithms for Non-convex Isotonic Regression through Submodular Optimization

We consider the minimization of submodular functions subject to ordering constraints. We show that this potentially non-convex optimization problem can be cast as a convex optimization problem on a space of uni-dimensional measures, with ordering constraints corresponding to first-order stochastic dominance. We propose new discretization schemes that lead to simple and efficient algorithms based on zeroth-, first-, or higher-order oracles; these algorithms also lead to improvements without isotonic constraints. Finally, our experiments show that non-convex loss functions can be much more robust to outliers for isotonic regression, while still being solvable in polynomial time.

Structure-Aware Convolutional Neural Networks

Convolutional neural networks (CNNs) are inherently subject to invariable filters that can only aggregate local inputs with the same topological structures. As a result, CNNs can only manage data with Euclidean or grid-like structures (e.g., images), not data with non-Euclidean or graph structures (e.g., traffic networks). To broaden the reach of CNNs, we develop structure-aware convolution to eliminate the invariance, yielding a unified mechanism for dealing with both Euclidean and non-Euclidean structured data. Technically, filters in the structure-aware convolution are generalized to univariate functions, which are capable of aggregating local inputs with diverse topological structures. Since infinitely many parameters would be required to determine a univariate function, we parameterize these filters with a finite number of learnable parameters in the context of function approximation theory. By replacing the classical convolution in CNNs with the structure-aware convolution, Structure-Aware Convolutional Neural Networks (SACNNs) are readily established. Extensive experiments on eleven datasets strongly evidence that SACNNs outperform current models on various machine learning tasks, including image classification and clustering, text categorization, skeleton-based action recognition, molecular activity detection, and taxi flow prediction. Code will be available.

HOGWILD!-Gibbs can be PanAccurate

Asynchronous Gibbs sampling has been recently shown to be fast-mixing and an accurate method for estimating probabilities of events on a small number of variables of a graphical model satisfying Dobrushin's condition~\cite{DeSaOR16}. We investigate whether it can be used to accurately estimate expectations of functions of {\em all the variables} of the model. Under the same condition, we show that the synchronous and asynchronous Gibbs samplers can be coupled so that the expected Hamming distance between their (multivariate) samples remains bounded by $O(\tau \log n),$ where $n$ is the number of variables in the graphical model, and $\tau$ is a measure of the asynchronicity. A similar bound holds for any constant power of the Hamming distance. Hence, the expectation of any function that is Lipschitz with respect to a power of the Hamming distance, can be estimated with a bias that grows logarithmically in $n$. Going beyond Lipschitz functions, we consider the bias arising from asynchronicity in estimating the expectation of polynomial functions of all variables in the model. Using recent concentration of measure results~\cite{DaskalakisDK17,GheissariLP17,GotzeSS18}, we show that the bias introduced by the asynchronicity is of smaller order than the standard deviation of the function value already present in the true model. We perform experiments on a multi-processor machine to empirically illustrate our theoretical findings.

Text-Adaptive Generative Adversarial Networks: Manipulating Images with Natural Language

This paper addresses the problem of manipulating images using natural language description. Our task aims to semantically modify visual attributes of an object in an image according to the text describing the new visual appearance. Although existing methods synthesize images having new attributes, they do not fully preserve text-irrelevant contents of the original image. In this paper, we propose the text-adaptive generative adversarial network (TAGAN) to generate semantically manipulated images while preserving text-irrelevant contents. The key to our method is the text-adaptive discriminator that creates word level local discriminators according to input text to classify fine-grained attributes independently. With this discriminator, the generator learns to generate images where only regions that correspond to the given text are modified. Experimental results show that our method outperforms existing methods on the CUB and Oxford-102 datasets, and our results were mostly preferred in a user study. Extensive analysis shows that our method is able to effectively disentangle visual attributes and produce pleasing outputs.

IntroVAE: Introspective Variational Autoencoders for Photographic Image Synthesis

We present a novel introspective variational autoencoder (IntroVAE) model for synthesizing high-resolution photographic images. IntroVAE is capable of self-evaluating the quality of its generated samples and improving itself accordingly. Its inference and generator models are jointly trained in an introspective way. On one hand, the generator is required to reconstruct the input images from the noisy outputs of the inference model, as in normal VAEs. On the other hand, the inference model is encouraged to classify between the generated and real samples while the generator tries to fool it, as in GANs. These two famous generative frameworks are integrated in a simple yet efficient single-stream architecture that can be trained in a single stage. IntroVAE preserves the advantages of VAEs, such as stable training and a nice latent manifold. Unlike most other hybrid models of VAEs and GANs, IntroVAE requires no extra discriminators, because the inference model itself serves as a discriminator to distinguish between the generated and real samples. Experiments demonstrate that our method produces high-resolution photo-realistic images (e.g., CELEBA images at $1024^{2}$ resolution), which are comparable to or better than the state-of-the-art GANs.

Doubly Robust Bayesian Inference for Non-Stationary Streaming Data with $\beta$-Divergences

We present the very first robust Bayesian Online Changepoint Detection algorithm through General Bayesian Inference (GBI) with $\beta$-divergences. The resulting inference procedure is doubly robust for both the predictive and the changepoint (CP) posterior, with linear time and constant space complexity. We provide a construction for exponential models and demonstrate it on the Bayesian Linear Regression model. In so doing, we make two additional contributions: firstly, we make GBI scalable using Structural Variational approximations that are exact as $\beta \to 0$. Secondly, we give a principled way of choosing the divergence parameter $\beta$ by minimizing expected predictive loss on-line. We advance the state of the art, improving the False Discovery Rate of CPs by more than 80% on real-world data.

Adapted Deep Embeddings: A Synthesis of Methods for k-Shot Inductive Transfer Learning

The focus in machine learning has branched beyond training classifiers on a single task to investigating how previously acquired knowledge in a source domain can be leveraged to facilitate learning in a related target domain, known as inductive transfer learning. Three active lines of research have independently explored transfer learning using neural networks. In weight transfer, a model trained on the source domain is used as an initialization point for a network to be trained on the target domain. In deep metric learning, the source domain is used to construct an embedding that captures class structure in both the source and target domains. In few-shot learning, the focus is on generalizing well in the target domain based on a limited number of labeled examples. We compare state-of-the-art methods from these three paradigms and also explore hybrid adapted-embedding methods that use limited target-domain data to fine-tune embeddings constructed from source-domain data. We conduct a systematic comparison of methods in a variety of domains, varying the number of labeled instances available in the target domain (k), as well as the number of target-domain classes. The following are the major results: (1) Deep embeddings are far superior, compared to weight transfer, as a starting point for inter-domain transfer or model re-use. (2) Our hybrid methods robustly outperform every few-shot learning and every deep metric learning method previously proposed, with a mean error reduction of 30% over state-of-the-art. (3) Among loss functions for discovering embeddings, the histogram loss (Ustinova & Lempitsky, 2016) is most robust. We hope our results will motivate a unification of research in weight transfer, deep metric learning, and few-shot learning.

Generalized Inverse Optimization through Online Learning

Inverse optimization is a powerful paradigm for learning preferences and restrictions that explain the behavior of a decision maker, based on a set of external signals and the corresponding decisions. However, most inverse optimization algorithms are designed for the batch setting, where all the data is available in advance. As a consequence, these methods have rarely been used in online settings suitable for real-time applications. In this paper, we propose a general framework for inverse optimization through online learning. Specifically, we develop an online learning algorithm that uses an implicit update rule which can handle noisy data. Moreover, under additional regularity assumptions in terms of the data and the model, we prove that our algorithm converges at a rate of $\mathcal{O}(1/\sqrt{T})$ and is statistically consistent. In our experiments, we show that the online learning approach learns the parameters with high accuracy, is robust to noise, and achieves a drastic improvement in computational efficiency over the batch learning approach.

An Off-policy Policy Gradient Theorem Using Emphatic Weightings

Policy gradient methods are widely used for control in reinforcement learning, particularly for the continuous action setting. A host of theoretically sound algorithms have been proposed for the on-policy setting, due to the existence of the policy gradient theorem which provides a simplified form for the gradient. In off-policy learning, however, where the behaviour policy is not necessarily attempting to learn and follow the optimal policy for the given task, the existence of such a theorem has been elusive. In this work, we solve this open problem by providing the first off-policy policy gradient theorem. The key to the derivation is the use of emphatic weightings. We develop a new actor-critic algorithm---called Actor Critic with Emphatic weightings (ACE)---that approximates the simplified gradients provided by the theorem. We demonstrate on a simple counterexample that previous off-policy policy gradient methods---particularly OffPAC and DPG---converge to the wrong solution whereas ACE finds the optimal solution.
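
The emphatic weighting is built from a follow-on trace; below is a schematic sketch in the spirit of the emphatic-TD literature (the interest function, the trace form, and the update shown are simplified illustrations, not the full ACE algorithm):

```python
def followon_trace(F_prev, rho_prev, gamma, interest=1.0):
    """Follow-on trace from the emphatic-TD literature: discounted,
    importance-weighted interest accumulated along the behaviour trajectory."""
    return gamma * rho_prev * F_prev + interest

# Schematic use in an emphatically weighted actor update (details simplified):
F = 0.0
for t, rho in enumerate([1.0, 1.3, 0.7, 1.1]):   # stand-in importance ratios
    F = followon_trace(F, rho if t > 0 else 0.0, gamma=0.99)
    M = F                                        # fully emphatic weighting
    # theta += alpha * M * rho * grad_log_pi(S_t, A_t) * q_hat(S_t, A_t)
    print(f"t={t}  F={F:.3f}")
```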

Supervised autoencoders: Improving generalization performance with unsupervised regularizers

Generalization performance is a central goal in machine learning, particularly when learning representations with large neural networks. A common strategy to improve generalization has been through the use of regularizers, typically as a norm constraining the parameters. Regularizing hidden layers in a neural network architecture, however, is not straightforward. There have been a few effective layer-wise suggestions, but without theoretical guarantees for improved performance. In this work, we theoretically and empirically analyze one such model, called a supervised auto-encoder: a neural network that predicts both inputs (reconstruction error) and targets jointly. We provide a novel generalization result for linear auto-encoders, proving uniform stability based on the inclusion of the reconstruction error---particularly as an improvement on simplistic regularization such as norms or even on more advanced regularizations such as the use of auxiliary tasks. Empirically, we then demonstrate that, across an array of architectures with different numbers of hidden units and activation functions, the supervised auto-encoder never harms performance relative to the corresponding standard neural network and can significantly improve generalization.
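
A minimal PyTorch sketch of the idea, with illustrative layer sizes and loss weight: the reconstruction term added to the supervised objective acts as the unsupervised regularizer.

```python
import torch
import torch.nn as nn

class SupervisedAutoencoder(nn.Module):
    """Minimal supervised auto-encoder: one shared hidden layer feeding
    both a classifier head and a reconstruction head."""
    def __init__(self, d_in, d_hidden, n_classes):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU())
        self.classifier = nn.Linear(d_hidden, n_classes)
        self.decoder = nn.Linear(d_hidden, d_in)

    def forward(self, x):
        h = self.encoder(x)
        return self.classifier(h), self.decoder(h)

def sae_loss(logits, x_hat, y, x, weight=1.0):
    # Supervised loss plus reconstruction error acting as the regularizer.
    return (nn.functional.cross_entropy(logits, y)
            + weight * nn.functional.mse_loss(x_hat, x))

model = SupervisedAutoencoder(d_in=20, d_hidden=64, n_classes=3)
x, y = torch.randn(8, 20), torch.randint(0, 3, (8,))
logits, x_hat = model(x)
sae_loss(logits, x_hat, y, x, weight=0.5).backward()
```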

Visual Object Networks: Image Generation with Disentangled 3D Representations

Recent progress in deep generative models has led to tremendous breakthroughs in natural image generation. While able to synthesize photorealistic images, existing models lack an understanding of our underlying 3D world. Unlike previous work built on 2D datasets and models, we present a new generative model, Visual Object Networks (VON), synthesizing natural images with a 3D disentangled representation. Inspired by classic graphics rendering pipelines, we disentangle the image formation process into three conditionally independent factors---viewpoint, shape, and texture---and present an end-to-end adversarial learning framework that jointly models 3D shape and 2D texture. Our model first learns to synthesize 3D shapes that are indistinguishable from real shapes. It can then render its 2.5D sketch (i.e. object silhouette and depth map) from any viewpoint. Finally, it learns to add realistic texture to its 2.5D projections to generate final, 2D realistic images. Our model not only generates images that are more realistic than state-of-the-art 2D image synthesis methods, but also enables many 3D operations such as changing the viewpoint of a generated image, shape and texture editing and interpolation, and example-based texture transfer.

Understanding Weight Normalized Deep Neural Networks with Rectified Linear Units

This paper presents a general framework for norm-based capacity control for $L_{p,q}$ weight normalized deep neural networks. We first establish the upper bound on the Rademacher complexities of this family. In particular, with an $L_{1,q}$ normalization, we discuss properties of an architecture-independent capacity control. For the regression problem, we provide both the generalization bound and the approximation bound. We argue that both generalization and approximation errors can be controlled by the $L_1$ norm of the output layer for any $L_{1,\infty}$ weight normalized neural network without relying on the network width and depth.
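
Concretely, the norm being controlled can be computed as below; this is a sketch, and the convention that rows index a unit's incoming weights is an illustrative choice:

```python
import numpy as np

def lpq_norm(W, p, q):
    """||W||_{p,q}: take the p-norm of each row (a unit's incoming weights),
    then the q-norm of the resulting vector; q = inf gives the max row norm."""
    row_norms = np.linalg.norm(W, ord=p, axis=1)
    return np.max(row_norms) if np.isinf(q) else np.linalg.norm(row_norms, ord=q)

W = np.random.default_rng(0).standard_normal((4, 6))
print(lpq_norm(W, p=1, q=np.inf))   # the L_{1,inf} norm appearing in the bounds
```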

Learning Pipelines with Limited Data and Domain Knowledge: A Study in Parsing Physics Problems

As machine learning becomes more widely used in practice, we need new methods to build complex intelligent systems that integrate learning with existing software, and with domain knowledge encoded as rules. As a case study, we present such a system that learns to parse Newtonian physics problems in textbooks. This system, Nuts&Bolts, learns a pipeline process that incorporates existing code, pre-learned machine learning models, and human-engineered rules. It jointly trains the entire pipeline to prevent propagation of errors, using a combination of labelled and unlabelled data. Our approach achieves good performance on the parsing task, outperforming the simple pipeline and its variants. We further use Nuts&Bolts to show improvements on the end task of answering these problems.

Learning long-range spatial dependencies with horizontal gated recurrent units

Progress in deep learning has spawned great successes in many engineering applications. As a prime example, convolutional neural networks, a type of feedforward neural network, are now approaching -- and sometimes even surpassing -- human accuracy on a variety of visual recognition tasks. Here, however, we show that these neural networks and their recent extensions struggle in recognition tasks where co-dependent visual features must be detected over long spatial ranges. We introduce the horizontal gated-recurrent unit (hGRU) to learn intrinsic horizontal connections -- both within and across feature columns. We demonstrate that a single hGRU layer matches or outperforms all tested feedforward hierarchical baselines including state-of-the-art architectures which have orders of magnitude more free parameters. We further discuss the biological plausibility of the hGRU in comparison to anatomical data from the visual cortex as well as human behavioral data on a classic contour detection task.

Joint Sub-bands Learning with Clique Structures for Wavelet Domain Super-Resolution

Convolutional neural networks (CNNs) have recently achieved great success in single-image super-resolution (SISR). However, these methods tend to produce over-smoothed outputs and miss some textural details. To solve these problems, we propose the Super-Resolution CliqueNet (SRCliqueNet) to reconstruct the high resolution (HR) image with better textural details in the wavelet domain. The proposed SRCliqueNet first extracts a set of feature maps from the low resolution (LR) image by the clique blocks group. Then we send the set of feature maps to the clique up-sampling module to reconstruct the HR image. The clique up-sampling module consists of four sub-nets which predict the high resolution wavelet coefficients of the four sub-bands. Since we consider the edge feature properties of the four sub-bands, the four sub-nets are connected to one another so that they can learn the coefficients of the four sub-bands jointly. Finally, we apply inverse discrete wavelet transform (IDWT) to the output of the four sub-nets at the end of the clique up-sampling module to increase the resolution and reconstruct the HR image. Extensive quantitative and qualitative experiments on benchmark datasets show that our method achieves superior performance over the state-of-the-art methods.
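
The final recombination step can be sketched with PyWavelets; the haar wavelet and the random stand-in arrays are illustrative, and in the model the four coefficient maps would come from the four sub-nets:

```python
import numpy as np
import pywt

# Stand-ins for the predicted wavelet coefficients of the HR image:
# LL (approximation) plus LH, HL, HH (horizontal/vertical/diagonal details).
lr = np.random.rand(64, 64)
ll, lh, hl, hh = lr, 0.1 * lr, 0.1 * lr, 0.05 * lr

# IDWT doubles the spatial resolution while fusing the four sub-bands.
hr = pywt.idwt2((ll, (lh, hl, hh)), wavelet="haar")
print(hr.shape)  # (128, 128)
```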

Fast Similarity Search via Optimal Sparse Lifting

Similarity search is a fundamental problem in computing science with various applications, and has attracted significant research attention, especially for large-scale search problems in high dimensions. Motivated by evidence from biology, we propose a novel approach for similarity search. Fundamentally different from existing methods, which mostly try to reduce the dimension of the data during the search, our approach projects the data into an even higher-dimensional space and ensures that the data are sparse and binary in the output space, where the search speed can be significantly improved. Specifically, our approach has two key steps and contributions. Firstly, we seek an {\em optimal sparse lifting} for the input data that increases the dimension of the data while approximately preserving the pairwise similarity, through a general matrix factorization method. Secondly, we seek a {\em lifting operator} that maps input samples to their {\em sparse lifting} by solving an optimization model. In empirical studies, our approach yields significantly improved search results over state-of-the-art solutions in information retrieval applications, and shows high potential for solving practical problems.
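
A rough sketch of the lifting idea; note that the paper optimizes the lifting via matrix factorization, whereas this sketch uses the random sparse projection baseline it improves on, with illustrative sizes:

```python
import numpy as np

def sparse_lift(X, out_dim=2048, nnz=32, density=0.1, seed=0):
    """Lift data to a higher dimension and binarize: a random sparse
    projection followed by keeping the top-nnz responses per sample."""
    rng = np.random.default_rng(seed)
    P = (rng.random((X.shape[1], out_dim)) < density).astype(float)
    Y = X @ P
    codes = np.zeros(Y.shape, dtype=bool)
    top = np.argpartition(-Y, nnz, axis=1)[:, :nnz]
    np.put_along_axis(codes, top, True, axis=1)
    return codes  # sparse binary codes; search reduces to fast overlap counting

X = np.random.default_rng(1).standard_normal((5, 128))
print(sparse_lift(X).sum(axis=1))  # exactly nnz active bits per sample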

Learning Deep Disentangled Embeddings With the F-Statistic Loss

Deep-embedding methods aim to discover representations of a domain that make explicit the domain's class structure and thereby support few-shot learning. Disentangling methods aim to make explicit compositional or factorial structure. We combine these two active but independent lines of research and propose a new paradigm suitable for both goals. We propose and evaluate a novel loss function based on the $F$ statistic, which describes the separation of two or more distributions. By ensuring that distinct classes are well separated on a subset of embedding dimensions, we obtain embeddings that are useful for few-shot learning. By not requiring separation on all dimensions, we encourage the discovery of disentangled representations. Our embedding method matches or beats state-of-the-art, as evaluated by performance on recall@$k$ and few-shot learning tasks. Our method also obtains performance superior to a variety of alternatives on disentangling, as evaluated by two key properties of a disentangled representation: modularity and explicitness. The goal of our work is to obtain more interpretable, manipulable, and generalizable deep representations of concepts and categories.
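
The loss is built around the classical one-way ANOVA $F$ statistic; a minimal sketch of computing it per embedding dimension (the sizes and toy data are illustrative):

```python
import numpy as np
from scipy import stats

def f_statistic_per_dim(Z, y):
    """One-way ANOVA F statistic of each embedding dimension: the ratio of
    between-class to within-class variation, large when classes separate."""
    groups = [Z[y == c] for c in np.unique(y)]
    return np.array([stats.f_oneway(*[g[:, d] for g in groups]).statistic
                     for d in range(Z.shape[1])])

rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], 30)
Z = rng.standard_normal((90, 4))
Z[:, 0] += y                                    # dimension 0 separates classes
print(np.round(f_statistic_per_dim(Z, y), 1))   # dim 0 has the largest F
```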

Geometrically Coupled Monte Carlo Sampling

Monte Carlo sampling in high-dimensional, low-sample settings is important in many machine learning tasks. We improve current methods for sampling in Euclidean spaces by avoiding independence, and instead consider ways to couple samples. We show fundamental connections to optimal transport theory, leading to novel sampling algorithms, and providing new theoretical grounding for existing strategies. We compare our new strategies against prior methods for improving sample efficiency, including QMC, by studying discrepancy. We explore our findings empirically, and observe benefits of our sampling schemes for reinforcement learning and generative modelling.
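
One existing strategy the paper gives new grounding for is orthogonal coupling; a minimal sketch follows (the QR construction with a chi-distributed norm correction is the standard recipe, and the details here are illustrative):

```python
import numpy as np

def orthogonal_gaussian(n, d, seed=0):
    """Draw n <= d coupled Gaussian vectors that are mutually orthogonal
    while each (approximately) keeping the N(0, I_d) marginal: orthonormalize
    random directions, then give each a norm drawn from the chi distribution."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((d, n)))  # d x n orthonormal columns
    norms = np.sqrt(rng.chisquare(df=d, size=n))
    return (q * norms).T                              # n x d coupled samples

samples = orthogonal_gaussian(4, 16)
print(np.round(samples @ samples.T, 6))               # off-diagonals are zero
```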

Cooperative Holistic 3D Scene Understanding from a Single RGB Image

Holistic 3D indoor scene understanding involves jointly recovering the room layout, camera pose, and object bounding boxes, all in 3D. Most current methods either are inefficient or only tackle part of the problem. In this paper, we propose an end-to-end model that simultaneously solves all three tasks in real-time given a single RGB image. The key idea is to improve the prediction by i) parametrizing the targets (e.g., 3D boxes) instead of directly estimating the targets, and ii) cooperative training among different modules. Specifically, we parametrize the 3D object bounding boxes by the predictions from several modules, i.e., 3D camera pose, depth, and object poses and sizes. The proposed method brings up three major advantages. i) The parametrization helps maintain the consistency between 2D images and 3D world. ii) It largely reduces the prediction variances of the 3D coordinates. iii) Constraints can be imposed on the parametrizations to train different modules simultaneously. We call these constraints ``cooperative losses". In this paper, we employ three cooperative losses for 3D bounding boxes, 2D projection, and physical constraints to estimate a geometrically consistent and physically plausible 3D scene. Experiments on the SUN RGB-D dataset shows that our method significantly outperforms prior approaches on 3D layout estimation, 3D object detection, and holistic scene understanding.

An Efficient Pruning Algorithm for Robust Isotonic Regression

We study a generalization of the classic isotonic regression problem where we allow separable nonconvex objective functions, focusing on the case of estimators used in robust regression. A simple dynamic programming approach allows us to solve this problem to within ε-accuracy (of the global minimum) in time linear in 1/ε and the dimension. We can combine techniques from the convex case with branch-and-bound ideas to form a new algorithm for this problem that naturally exploits the shape of the objective function. Our algorithm achieves the best bounds for both the general nonconvex and convex case (linear in log (1/ε)), while performing much faster in practice than a straightforward dynamic programming approach, especially as the desired accuracy increases.
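
A minimal sketch of the dynamic program, using a Huber loss and a uniform value grid as illustrative choices (the pruning and branch-and-bound refinements are omitted):

```python
import numpy as np

def robust_isotonic_value(y, grid, loss):
    """DP over a value grid for isotonic regression with a separable
    (possibly nonconvex) loss: D_i(v) = loss(y_i - v) + min_{u<=v} D_{i-1}(u).
    A running prefix-minimum keeps each stage linear in the grid size."""
    D = np.zeros(len(grid))
    for yi in y:
        D = np.minimum.accumulate(D) + loss(yi - grid)
    return D.min()

grid = np.linspace(-3, 3, 601)
huber = lambda r, d=1.0: np.where(np.abs(r) <= d, 0.5 * r**2, d * (np.abs(r) - 0.5 * d))
y = np.array([0.2, -1.5, 0.1, 2.0, 0.9])
print(robust_isotonic_value(y, grid, huber))  # optimal robust isotonic fit value
```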

PAC-learning in the presence of adversaries

The existence of evasion attacks during the test phase of machine learning algorithms represents a significant challenge to both their deployment and understanding. These attacks can be carried out by adding imperceptible perturbations to inputs to generate adversarial examples, and finding effective defenses and detectors has proven to be difficult. In this paper, we step away from the attack-defense arms race and seek to understand the limits of what can be learned in the presence of a test-time adversary. In particular, we extend the Probably Approximately Correct (PAC)-learning framework to account for the presence of an adversary. We first define corrupted hypothesis classes which arise from standard binary hypothesis classes in the presence of an evasion adversary and derive the Vapnik-Chervonenkis (VC)-dimension for these, denoted as the Adversarial VC-dimension. We then show that a corresponding Fundamental Theorem of Statistical Learning can be proved for evasion adversaries, where the sample complexity is controlled by the Adversarial VC-dimension. We then explicitly derive the Adversarial VC-dimension for halfspace classifiers in the presence of a sample-wise norm-constrained adversary of the type commonly studied for evasion attacks and show that it is the same as the standard VC-dimension, closing an open question. Finally, we prove that the Adversarial VC-dimension can be either larger or smaller than the standard VC-dimension depending on the hypothesis class and adversary, making it an interesting object of study in its own right.

Sparse DNNs with Improved Adversarial Robustness

Deep neural networks (DNNs) are computationally/memory-intensive and vulnerable to adversarial attacks, making them prohibitive in some real-world applications. By converting dense models into sparse ones, pruning appears to be a promising solution to reducing the computation/memory cost. This paper studies classification models, especially DNN-based ones, to demonstrate that there exist intrinsic relationships between their sparsity and adversarial robustness. Our analyses reveal, both theoretically and empirically, that nonlinear DNN-based classifiers behave differently under $l_2$ attacks than linear ones do. We further demonstrate that an appropriately higher model sparsity implies better robustness of nonlinear DNNs, whereas over-sparsified models have more difficulty resisting adversarial examples.

Snap ML: A Hierarchical Framework for Machine Learning

We describe a new software framework for fast training of generalized linear models. The framework, named Snap Machine Learning (Snap ML), combines recent advances in machine learning systems and algorithms in a nested manner to reflect the hierarchical architecture of modern computing systems. We prove theoretically that such a hierarchical system can accelerate training in distributed environments where intra-node communication is cheaper than inter-node communication. Additionally, we provide a review of the implementation of Snap ML in terms of GPU acceleration, pipelining, communication patterns and software architecture, highlighting aspects that were critical for achieving high performance. We evaluate the performance of Snap ML in both single-node and multi-node environments, quantifying the benefit of the hierarchical scheme and the data streaming functionality, and comparing with other widely-used machine learning software frameworks. Finally, we present a logistic regression benchmark on the Criteo Terabyte Click Logs dataset and show that Snap ML achieves the same test loss an order of magnitude faster than any of the previously reported results.

See and Think: Disentangling Semantic Scene Completion

Semantic scene completion predicts volumetric occupancy and object category of a 3D scene, which helps intelligent agents to understand and interact with the surroundings. In this work, we propose a disentangled framework, sequentially carrying out 2D semantic segmentation, 2D-3D reprojection and 3D semantic scene completion. This three-stage framework has three advantages: (1) explicit semantic segmentation significantly boosts performance; (2) flexible ways of fusing sensor data bring good extensibility; (3) progress in any subtask will promote the holistic performance. Experimental results show that, regardless of whether the input is a single depth map or RGB-D, our framework can generate high-quality semantic scene completion, and outperforms state-of-the-art approaches on both synthetic and real datasets.

Chain of Reasoning for Visual Question Answering

Reasoning plays an essential role in Visual Question Answering (VQA). Multi-step and dynamic reasoning is often necessary for answering complex questions. For example, the question "What is placed next to the bus on the right of the picture?" talks about a compound object "bus on the right," which is generated by the relation ⟨bus, on the right of, picture⟩. Furthermore, a new relation involving this compound object is then required to infer the answer. However, previous methods support either one-step or static reasoning, without updating relations or generating compound objects. This paper proposes a novel reasoning model for addressing these problems. A chain of reasoning (CoR) is constructed to support multi-step and dynamic reasoning on changed relations and objects. In detail, iteratively, relational reasoning operations form new relations between objects, and object refining operations generate new compound objects from relations. We achieve new state-of-the-art results on four publicly available datasets. The visualization of the chain of reasoning illustrates how CoR generates, step by step, the new compound objects that lead to the answer of the question.

Sigsoftmax: Reanalysis of the Softmax Bottleneck

Softmax is an output activation function for modeling categorical probability distributions in many applications of deep learning. However, a recent study revealed that softmax can be a bottleneck of representational capacity of neural networks in language modeling (the softmax bottleneck). In this paper, we propose an output activation function for breaking the softmax bottleneck without additional parameters. We re-analyze the softmax bottleneck from the perspective of the output set of log-softmax and identify the cause of the softmax bottleneck. On the basis of this analysis, we propose sigsoftmax, which is composed of a multiplication of an exponential function and sigmoid function. Sigsoftmax can break the softmax bottleneck. The experiments on language modeling demonstrate that sigsoftmax and mixture of sigsoftmax outperform softmax and mixture of softmax, respectively.
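
A minimal sketch of the proposed activation, computed in log space for stability: the only change versus softmax is the extra sigmoid factor before normalization.

```python
import numpy as np
from scipy.special import expit, logsumexp

def sigsoftmax(z):
    """sigsoftmax: normalize exp(z) * sigmoid(z) instead of exp(z).

    In log space, log g(z) = z + log sigmoid(z), so the log-output is no
    longer an affine function of z, relaxing the rank limit behind the
    softmax bottleneck."""
    log_g = z + np.log(expit(z))
    return np.exp(log_g - logsumexp(log_g))

print(sigsoftmax(np.array([1.0, 2.0, -1.0])))  # a probability vector, sums to 1
```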

Deep Non-Blind Deconvolution via Generalized Low-Rank Approximation

In this paper, we present a deep convolutional neural network to capture the inherent properties of image degradation, which can handle different kernels and saturated pixels in a unified framework. The proposed neural network is motivated by the low-rank property of pseudo-inverse kernels. We first compute a generalized low-rank approximation for a large number of blur kernels, and then use separable filters to initialize the convolutional parameters in the network. Our analysis shows that the estimated decomposed matrices contain the most essential information of the input kernel, which enables the proposed network to handle various blurs in a unified framework and generate high-quality deblurring results. Experimental results on benchmark datasets with noise and saturated pixels demonstrate that the proposed algorithm performs favorably against state-of-the-art methods.
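
The low-rank-to-separable-filters step can be illustrated on a single kernel via the SVD (the paper computes a generalized low-rank approximation jointly over many kernels; the single-kernel SVD below is a simplification):

```python
import numpy as np

# Rank-r approximation of a blur kernel: each rank-1 term is an outer product
# of two 1-D filters, i.e. a pair of separable convolutions.
rng = np.random.default_rng(0)
k = rng.random((15, 15)); k /= k.sum()          # stand-in blur kernel
U, s, Vt = np.linalg.svd(k)

r = 3
approx = sum(s[i] * np.outer(U[:, i], Vt[i]) for i in range(r))
print(np.linalg.norm(k - approx) / np.linalg.norm(k))  # small relative residual
```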

Probabilistic Pose Graph Optimization via Bingham Distributions and Tempered Geodesic MCMC

We introduce the Tempered Geodesic MCMC (TG-MCMC) algorithm for initializing pose graph optimization problems, arising in various scenarios such as SFM (structure from motion) or SLAM (simultaneous localization and mapping). TG-MCMC is the first of its kind, as it unites asymptotically global non-convex optimization on the spherical manifold of quaternions with posterior sampling, in order to provide both reliable initial poses and uncertainty estimates that are informative about the quality of individual solutions. We devise rigorous theoretical convergence guarantees for our method and extensively evaluate it on synthetic and real benchmark datasets. Besides its elegance in formulation and theory, we show that our method is robust to missing data and noise, and that the estimated uncertainties capture intuitive properties of the data.

MetaAnchor: Learning to Detect Objects with Customized Anchors

We propose a novel and flexible anchor mechanism named MetaAnchor for object detection frameworks. Unlike many previous detectors, which model anchors in a predefined manner, in MetaAnchor anchor functions can be dynamically generated from arbitrary customized prior boxes. Taking advantage of weight prediction, MetaAnchor is able to work with most anchor-based object detection systems such as RetinaNet. Compared with the predefined anchor scheme, we empirically find that MetaAnchor is more robust to anchor settings and bounding box distributions; in addition, it also shows potential on transfer tasks. Our experiment on the COCO detection task shows that MetaAnchor consistently outperforms its counterparts in various scenarios.

Image Inpainting via Generative Multi-column Convolutional Neural Networks

In this paper, we propose a generative multi-column network for image inpainting. This network can synthesize different image components in a parallel manner within one stage. To better characterize global structures, we design a confidence-driven reconstruction loss, while an implicit diversified MRF regularization is adopted to enhance local details. The multi-column network, combined with the reconstruction and MRF losses, propagates local and global information derived from context to the target inpainting regions. Extensive experiments on challenging street-view, face, natural-object and scene data show that our method can produce visually compelling results even without the commonly used post-processing.

On Misinformation Containment in Online Social Networks

Widespread online misinformation can cause public panic and serious economic damage. The misinformation containment problem aims at limiting the spread of misinformation in online social networks by launching competing campaigns. Motivated by realistic scenarios, we present the first analysis of the misinformation containment problem for the case when an arbitrary number of cascades are allowed. This paper makes four contributions. First, we provide a formal model for multi-cascade diffusion and introduce an important concept called cascade priority. Second, we show that the misinformation containment problem cannot be approximated within a factor of $\Omega(2^{\log^{1-\epsilon}n^4})$ in polynomial time unless $NP \subseteq DTIME(n^{\mathrm{polylog}(n)})$. Third, we introduce several types of cascade priority that are frequently seen in real social networks. Finally, we design novel algorithms for solving the misinformation containment problem. The effectiveness of the proposed algorithms is supported by encouraging experimental results.

A^2-Nets: Double Attention Networks

Learning to capture long-range relations is fundamental to image/video recognition. Existing CNN models generally rely on increasing depth to model such relations, which is highly inefficient. In this work, we propose the ``double attention block'', a novel component that aggregates and propagates informative global features from the entire spatio-temporal space of input images/videos, enabling subsequent convolution layers to access features from the entire space efficiently. The component is designed with a double attention mechanism in two steps, where the first step gathers features from the entire space into a compact set through second-order attention pooling and the second step adaptively selects and distributes features to each location via another attention. The proposed double attention block is easy to adopt and can be plugged into existing deep neural networks conveniently. We conduct extensive ablation studies and experiments on both image and video recognition tasks for evaluating its performance. On the image recognition task, a ResNet-50 equipped with our double attention blocks outperforms a much larger ResNet-152 architecture on the ImageNet-1k dataset with over 40% fewer parameters and fewer FLOPs. On the action recognition task, our proposed model achieves state-of-the-art results on the Kinetics and UCF-101 datasets with significantly higher efficiency than recent works.
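
A rough single-head NumPy sketch of the two steps; the projection matrices and shapes are illustrative simplifications of the actual block:

```python
import numpy as np

def softmax(a, axis):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def double_attention(X, Wk, Wv, Wq):
    """X: n locations x d features.
    Step 1 (gather): attention pooling collects m global descriptors.
    Step 2 (distribute): each location selects its own mix of descriptors."""
    A = softmax(X @ Wk, axis=0)      # n x m: one attention map per descriptor
    G = (X @ Wv).T @ A               # d_v x m: gathered global descriptors
    D = softmax(X @ Wq, axis=1)      # n x m: per-location selection weights
    return D @ G.T                   # n x d_v: distributed features

n, d, m, dv = 64, 32, 8, 32
rng = np.random.default_rng(0)
out = double_attention(rng.standard_normal((n, d)),
                       rng.standard_normal((d, m)),
                       rng.standard_normal((d, dv)),
                       rng.standard_normal((d, m)))
print(out.shape)  # (64, 32)
```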

Self-Supervised Generation of Spatial Audio for 360-degree Video

We introduce an approach to convert mono audio recorded by a 360-degree video camera into spatial audio, a representation of the distribution of sound over the full viewing sphere. Spatial audio is an important component of immersive 360-degree video viewing, but spatial audio microphones are still rare in current 360-degree video production. Our system consists of end-to-end trainable neural networks that separate individual sound sources and localize them on the viewing sphere, conditioned on multi-modal analysis from the audio and 360-degree video frames. We introduce several datasets, including one we filmed ourselves, and one collected in-the-wild from YouTube, consisting of 360-degree videos uploaded with spatial audio. During training, ground truth spatial audio serves as self-supervision and a mixed-down mono track forms the input to our network. Using our approach we show that it is possible to infer the spatial localization of sounds based only on a synchronized 360-degree video and the mono audio track.

How Many Samples are Needed to Learn a Convolutional Neural Network?

A widespread folklore explanation for the success of the convolutional neural network (CNN) is that the CNN is a more compact representation than the fully connected neural network (FNN) and thus requires fewer samples for learning. We initiate the study of rigorously characterizing the sample complexity of learning convolutional neural networks. We show that for learning an $m$-dimensional convolutional filter with linear activation acting on a $d$-dimensional input, the sample complexity of achieving population prediction error of $\epsilon$ is $\widetilde{O}(m/\epsilon^2)$ whereas its FNN counterpart needs at least $\Omega(d/\epsilon^2)$ samples. Since $m \ll d$, this result demonstrates the advantage of using CNN. We further consider the sample complexity of learning a one-hidden-layer CNN with linear activation where both the $m$-dimensional convolutional filter and the $r$-dimensional output weights are unknown. For this model, we show the sample complexity is $\widetilde{O}\left((m+r)/\epsilon^2\right)$ when the ratio between the stride size and the filter size is a constant. For both models, we also present lower bounds showing our sample complexities are tight up to logarithmic factors. Our main tools for deriving these results are localized empirical process and a new lemma characterizing the convolutional structure. We believe these tools may inspire further developments in understanding CNN.

Algorithmic Regularization in Learning Deep Homogeneous Models: Layers are Automatically Balanced

We study the implicit regularization imposed by gradient descent for learning multi-layer homogeneous functions including feed-forward fully connected and convolutional deep neural networks with linear, ReLU or Leaky ReLU activation. We rigorously prove that gradient flow (i.e. gradient descent with infinitesimal step size) effectively enforces the differences between squared norms across different layers to remain invariant without any explicit regularization. This result implies that if the weights are initially small, gradient flow automatically balances the magnitudes of all layers. Using a discretization argument, we analyze gradient descent with positive step size for the non-convex low-rank asymmetric matrix factorization problem without any regularization. Inspired by our findings for gradient flow, we prove that gradient descent with step sizes $\eta_t = O(t^{-(1/2+\delta)})$ ($0 < \delta \le 1/2$) automatically balances the two low-rank factors and converges to a bounded global optimum.
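
The invariance is easy to verify numerically; here is a minimal sketch for a two-layer linear network trained with a small constant step size (all sizes illustrative):

```python
import numpy as np

# Two-layer linear net f(x) = W2 @ W1 @ x trained by small-step gradient
# descent on least squares: the gap ||W1||_F^2 - ||W2||_F^2 stays (nearly)
# constant, exactly as the gradient-flow analysis predicts.
rng = np.random.default_rng(0)
W1 = 0.1 * rng.standard_normal((8, 4))
W2 = 0.1 * rng.standard_normal((3, 8))
X, Y = rng.standard_normal((4, 100)), rng.standard_normal((3, 100))

for step in range(2001):
    R = W2 @ W1 @ X - Y                       # residual
    g1, g2 = W2.T @ R @ X.T, R @ (W1 @ X).T   # gradients of 0.5 * ||R||_F^2
    W1 -= 1e-3 * g1
    W2 -= 1e-3 * g2
    if step % 500 == 0:
        gap = np.sum(W1 ** 2) - np.sum(W2 ** 2)
        print(f"step {step:5d}  loss {0.5 * np.mean(R ** 2):.4f}  norm gap {gap:+.6f}")
```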

Optimization for Approximate Submodularity

We consider the problem of maximizing a submodular function when given access to its approximate version. Submodular functions are heavily studied in a wide variety of disciplines, since they are used to model many real world phenomena, and are amenable to optimization. However, there are many cases in which the phenomenon we observe is only approximately submodular and the approximation guarantees cease to hold. We describe a technique which we call the sampled mean approximation that yields strong guarantees for maximization of submodular functions from approximate surrogates under cardinality and intersection of matroid constraints. In particular, we show tight guarantees for maximization under a cardinality constraint and 1/(1+P) approximation under intersection of P matroids.

(Probably) Concave Graph Matching

In this paper we address the graph matching problem. Following the recent works of \cite{zaslavskiy2009path,Vestner2017} we analyze and generalize the idea of concave relaxations. We introduce the concepts of \emph{conditionally concave} and \emph{probably conditionally concave} energies on polytopes and show that they encapsulate many instances of the graph matching problem, including matching Euclidean graphs and graphs on surfaces. We further prove that local minima of probably conditionally concave energies on general matching polytopes (e.g., doubly stochastic) are with high probability extreme points of the matching polytope (e.g., permutations).

Deep Defense: Training DNNs with Improved Adversarial Robustness

Despite their efficacy on a variety of computer vision tasks, deep neural networks (DNNs) are vulnerable to adversarial attacks, limiting their applications in security-critical systems. Recent works have shown the possibility of generating imperceptibly perturbed image inputs (a.k.a. adversarial examples) to fool well-trained DNN classifiers into making arbitrary predictions. To address this problem, we propose a training recipe named "deep defense". Our core idea is to integrate an adversarial perturbation-based regularizer into the classification objective, such that the obtained models learn to resist potential attacks, directly and precisely. The whole optimization problem is solved just like training a recursive network. Experimental results demonstrate that our method outperforms training with adversarial/Parseval regularizations by large margins on various datasets (including MNIST, CIFAR-10 and ImageNet) and different DNN architectures. Code and models for reproducing our results are available at https://github.com/ZiangYan/deepdefense.pytorch.

Rest-Katyusha: Exploiting the Solution's Structure via Scheduled Restart Schemes

We propose a structure-adaptive variant of the state-of-the-art stochastic variance-reduced gradient algorithm Katyusha for regularized empirical risk minimization. The proposed method is able to exploit the intrinsic low-dimensional structure of the solution, such as sparsity or low rank enforced by a non-smooth regularization, to achieve even faster convergence rates. This provable algorithmic improvement is done by restarting the Katyusha algorithm according to restricted strong-convexity constants. We demonstrate the effectiveness of our approach via numerical experiments.

Implicit Reparameterization Gradients

By providing a simple and efficient way of computing low-variance gradients of continuous random variables, the reparameterization trick has become the technique of choice for training a variety of latent variable models. However, it is not applicable to a number of important continuous distributions. We introduce an alternative approach to computing reparameterization gradients based on implicit differentiation and demonstrate its broader applicability by applying it to Gamma, Beta, Dirichlet, and von Mises distributions, which cannot be used with the classic reparameterization trick. Our experiments show that the proposed approach is faster and more accurate than the existing gradient estimators for these distributions.
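
The core identity: if $F(z; \theta)$ is the CDF of the sampled variable, implicitly differentiating $F(z; \theta) = u$ gives $dz/d\theta = -(\partial F/\partial\theta)\,/\,p(z; \theta)$. A minimal sketch for the Gamma shape parameter, using finite differences where the paper uses exact derivatives:

```python
import numpy as np
from scipy import stats

def implicit_grad_gamma(z, alpha, eps=1e-5):
    """Implicit reparameterization gradient dz/dalpha for z ~ Gamma(alpha, 1).

    With the standardization F(z; alpha) = CDF, implicit differentiation of
    F(z; alpha) = u yields dz/dalpha = -(dF/dalpha) / pdf(z). Here dF/dalpha
    is approximated by a central finite difference."""
    dF_dalpha = (stats.gamma.cdf(z, alpha + eps)
                 - stats.gamma.cdf(z, alpha - eps)) / (2 * eps)
    return -dF_dalpha / stats.gamma.pdf(z, alpha)

print(implicit_grad_gamma(z=1.5, alpha=2.0))
```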

Training DNNs with Hybrid Block Floating Point

The wide adoption of DNNs has given birth to unrelenting computing requirements, forcing datacenter operators to adopt domain-specific accelerators to train them. These accelerators typically employ densely packed full precision floating-point arithmetic to maximize performance per area. Ongoing research efforts seek to further increase that performance density by replacing floating-point with fixed-point arithmetic. However, a significant roadblock for these attempts has been fixed point's narrow dynamic range, which is insufficient for DNN training convergence. We identify block floating point (BFP) as a promising alternative representation since it exhibits wide dynamic range and enables the majority of DNN operations to be performed with fixed-point logic. Unfortunately, BFP alone introduces several limitations that preclude its direct applicability. In this work, we introduce HBFP, a hybrid BFP-FP approach, which performs all dot products in BFP and other operations in floating point. HBFP delivers the best of both worlds: the high accuracy of floating point at the superior hardware density of fixed point. For a wide variety of models we show that HBFP matches floating point's accuracy while enabling hardware implementations that deliver up to 8.5x higher throughput.
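
A minimal sketch of block floating point quantization and the HBFP split; the mantissa width and demo sizes are illustrative:

```python
import numpy as np

def to_bfp(x, mantissa_bits=8):
    """Quantize a tensor block to block floating point: all entries share one
    exponent and keep only `mantissa_bits` bits of fixed-point mantissa."""
    shared_exp = np.floor(np.log2(np.max(np.abs(x)) + 1e-30)) + 1
    scale = 2.0 ** (shared_exp - mantissa_bits)
    return np.round(x / scale) * scale

rng = np.random.default_rng(0)
a, b = rng.standard_normal(256), rng.standard_normal(256)
# HBFP idea: dot products run on BFP operands, other ops stay floating point.
print(f"fp32 dot: {a @ b:+.4f}   bfp dot: {to_bfp(a) @ to_bfp(b):+.4f}")
```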

A Model for Learned Bloom Filters and Optimizing by Sandwiching

Recent work has suggested enhancing Bloom filters by using a pre-filter, based on applying machine learning to determine a function that models the data set the Bloom filter is meant to represent. Here we model such learned Bloom filters, with the following outcomes: (1) we clarify what guarantees can and cannot be associated with such a structure; (2) we show how to estimate what size the learned function must achieve in order to obtain improved performance; (3) we provide a simple method, sandwiching, for optimizing learned Bloom filters; and (4) we propose a design and analysis approach for a learned Bloomier filter, based on our modeling approach.
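
A minimal sketch of the sandwich construction, with a toy Bloom filter and a stand-in learned model; the sizes, hash scheme, and threshold are illustrative:

```python
import hashlib

class Bloom:
    """Tiny Bloom filter (illustrative, not space-optimized)."""
    def __init__(self, m, k):
        self.bits, self.m, self.k = bytearray(m), m, k
    def _idx(self, x):
        h = hashlib.sha256(x.encode()).digest()
        return [int.from_bytes(h[4*i:4*i+4], "big") % self.m for i in range(self.k)]
    def add(self, x):
        for i in self._idx(x): self.bits[i] = 1
    def __contains__(self, x):
        return all(self.bits[i] for i in self._idx(x))

class SandwichedBloom:
    """Sandwiching: initial filter -> learned model -> backup filter.

    model(x) is any learned scorer; keys the model misses (score < tau)
    go into the backup filter, so there are no false negatives."""
    def __init__(self, keys, model, tau, m1, m2, k=4):
        self.model, self.tau = model, tau
        self.first, self.backup = Bloom(m1, k), Bloom(m2, k)
        for x in keys:
            self.first.add(x)
            if model(x) < tau:
                self.backup.add(x)
    def __contains__(self, x):
        if x not in self.first:
            return False
        return self.model(x) >= self.tau or x in self.backup

keys = ["a", "b", "c"]
sbf = SandwichedBloom(keys, model=lambda x: float(x < "c"), tau=0.5, m1=64, m2=64)
print(all(k in sbf for k in keys))  # True: no false negatives by construction
```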

Soft-Gated Warping-GAN for Pose-Guided Person Image Synthesis

Despite remarkable advances in image synthesis research, existing works often fail in manipulating images under the context of large geometric transformations. Synthesizing person images conditioned on arbitrary poses is one of the most representative examples where the generation quality largely relies on the capability of identifying and modeling arbitrary transformations on different body parts. Current generative models are often built on local convolutions and overlook the key challenges (e.g. heavy occlusions, different views or dramatic appearance changes) when distinct geometric changes happen for each part, caused by arbitrary pose manipulations. This paper aims to resolve these challenges induced by geometric variability and spatial displacements via a new Soft-Gated Warping Generative Adversarial Network (Warping-GAN), which is composed of two stages: 1) it first synthesizes a target part segmentation map given a target pose, which depicts the region-level spatial layouts for guiding image synthesis with higher-level structure constraints; 2) the Warping-GAN equipped with a soft-gated warping-block learns feature-level mapping to render textures from the original image into the generated segmentation map. Warping-GAN is capable of controlling different transformation degrees given distinct target poses. Moreover, the proposed warping-block is light-weight and flexible enough to be injected into any networks. Human perceptual studies and quantitative evaluations demonstrate the superiority of our Warping-GAN that significantly outperforms all existing methods on two large datasets.

Deep Functional Dictionaries: Learning Consistent Semantic Structures on 3D Models from Functions

Various 3D semantic attributes such as segmentation masks, geometric features, keypoints, and materials can be encoded as per-point probe functions on 3D geometries. Given a collection of related 3D shapes, we consider how to jointly analyze such probe functions over different shapes, and how to discover common latent structures using a neural network --- even in the absence of any correspondence information. Our network is trained on point cloud representations of shape geometry and associated semantic functions on that point cloud. These functions express a shared semantic understanding of the shapes but are not coordinated in any way. For example, in a segmentation task, the functions can be indicator functions of arbitrary sets of shape parts, with the particular combination involved not known to the network. Our network is able to produce a small dictionary of basis functions for each shape, a dictionary whose span includes the semantic functions provided for that shape. Even though our shapes have independent discretizations and no functional correspondences are provided, the network is able to generate latent bases, in a consistent order, that reflect the shared semantic structure among the shapes. We demonstrate the effectiveness of our technique in various segmentation and keypoint selection applications.

Nonlocal Neural Networks, Nonlocal Diffusion and Nonlocal Modeling

Nonlocal neural networks have been proposed and shown to be effective in several computer vision tasks, where the nonlocal operations can directly capture long-range dependencies in the feature space. In this paper, we study the nature of diffusion and damping effects in nonlocal networks by performing spectral analysis on the weight matrices of well-trained networks, and propose a new formulation of the nonlocal block. The new block not only learns the nonlocal interactions but also has stable dynamics and thus allows deeper nonlocal structures. Moreover, we interpret our formulation from the general nonlocal modeling perspective, where we make connections between the proposed nonlocal network and other nonlocal models, such as nonlocal diffusion processes and nonlocal Markov jump processes.

Are ResNets Provably Better than Linear Predictors?

A residual network (or ResNet) is a standard deep neural net architecture, with state-of-the-art performance across numerous applications. The main premise of ResNets is that they allow the training of each layer to focus on fitting just the residual of the previous layer's output and the target output. Thus, we should expect that the trained network is no worse than what we can obtain if we remove the residual layers and train a shallower network instead. However, due to the non-convexity of the optimization problem, it is not at all clear that ResNets indeed achieve this behavior, rather than getting stuck at some arbitrarily poor local minimum. In this paper, we rigorously prove that arbitrarily deep, nonlinear residual units indeed exhibit this behavior, in the sense that the optimization landscape contains no local minima with value above what can be obtained with a linear predictor (namely a 1-layer network). Notably, we show this under minimal or no assumptions on the precise network architecture, data distribution, or loss function used. We also provide a quantitative analysis of approximate stationary points for this problem. Finally, we show that with a certain tweak to the architecture, training the network with standard stochastic gradient descent achieves an objective value close or better than any linear predictor.

Learning to Decompose and Disentangle Representations for Video Prediction

Our goal is to predict future video frames given a sequence of input frames. Despite large amounts of video data, this remains a challenging task because of the high-dimensionality of video frames. We address this challenge by proposing the Decompositional Disentangled Predictive Auto-Encoder (DDPAE), a framework that combines structured probabilistic models and deep networks to automatically (i) decompose the high-dimensional video that we aim to predict into components, and (ii) disentangle each component to have low-dimensional temporal dynamics that are easier to predict. Crucially, with an appropriately specified generative model of video frames, our DDPAE is able to learn both the latent decomposition and disentanglement without explicit supervision. For the Moving MNIST dataset, we show that DDPAE is able to recover the underlying components (individual digits) and disentanglement (appearance and location) as we would intuitively do. We further demonstrate that DDPAE can be applied to the Bouncing Balls dataset involving complex interactions between multiple objects to predict the video frame directly from the pixels and recover physical states without explicit supervision.

Multi-Task Learning as Multi-Objective Optimization

In multi-task learning, multiple tasks are solved jointly, sharing inductive bias between them. Multi-task learning is inherently a multi-objective problem since different tasks may conflict, necessitating a trade-off between them. A common approach to this trade-off is to optimize a proxy objective that minimizes a weighted linear combination of per-task losses. However, this proxy is only valid when the tasks do not compete, which is rarely the case. In this paper, we explicitly cast multi-task learning as multi-objective optimization, with the overall objective of finding a Pareto optimal solution. To this end, we use algorithms developed in the gradient-based multi-objective optimization literature. Although these algorithms have desirable theoretical guarantees, they are not directly applicable to large-scale learning problems. We therefore propose efficient and accurate approximations. We apply our method to a variety of multi-task deep learning problems including digit classification, scene understanding (joint semantic segmentation, instance segmentation, and depth estimation), and multi-label classification. Our method yields higher-performing models than recent multi-task learning formulations or per-task training.
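
For two tasks, the min-norm point in the convex hull of the task gradients has a closed form; the sketch below follows the standard result from the multiple-gradient-descent literature, as one building block of a Frank-Wolfe-style solver:

```python
import numpy as np

def two_task_weights(g1, g2):
    """Weight gamma minimizing ||gamma*g1 + (1-gamma)*g2||^2 over [0, 1]:
    the min-norm point in the convex hull of two task gradients."""
    diff = g1 - g2
    denom = diff @ diff
    if denom == 0.0:
        return 0.5
    return float(np.clip(((g2 - g1) @ g2) / denom, 0.0, 1.0))

g1, g2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
gamma = two_task_weights(g1, g2)
print(gamma, gamma * g1 + (1 - gamma) * g2)  # 0.5 and the shared descent direction
```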

Self-Handicapping Network for Integral Object Attention

Recently, adversarial erasing for weakly-supervised object attention has been deeply studied due to its capability in localizing integral object regions. However, such a strategy raises one key problem: attention regions gradually expand to non-object regions as training iterations continue, which significantly decreases the quality of the produced attention maps. To tackle this issue as well as promote the quality of object attention, we introduce a simple yet effective Self-Handicapping Network (ShaNet) to prohibit attentions from spreading to unexpected background regions. In particular, ShaNet leverages two self-handicapping strategies to encourage networks to use reliable object and background cues for attention learning. In this way, integral object regions can be effectively highlighted without including many background regions. To test the quality of the generated attention maps, we employ the mined object regions as heuristic cues for learning semantic segmentation models. Experiments on Pascal VOC demonstrate the superiority of ShaNet over other state-of-the-art methods.

LinkNet: Relational Embedding for Scene Graph

Objects and their relationships are critical contents for image understanding. A scene graph provides a structured description that captures these properties of an image. However, reasoning about the relationships between objects is very challenging and only a few recent works have attempted to solve the problem of generating a scene graph from an image. In this paper, we present a novel method that improves scene graph generation by explicitly modeling inter-dependency among all object instances. We design a simple and effective relational embedding module that enables our model to jointly represent connections among all related objects, rather than focus on an object in isolation. Our novel method significantly benefits two main parts of the scene graph generation task: object classification and relationship classification. Using it on top of a basic Faster R-CNN, our model achieves state-of-the-art results on the Visual Genome benchmark. We further push the performance by introducing a global context encoding module and a geometrical layout encoding module. We validate our final model, LinkNet, through extensive ablation studies, demonstrating its efficacy in scene graph generation.

BourGAN: Generative Networks with Metric Embeddings

This paper addresses the mode collapse for generative adversarial networks (GANs). We view modes as a geometric structure of data distribution in a metric space. Under this geometric lens, we embed subsamples of the dataset from an arbitrary metric space into the $\ell_2$ space, while preserving their pairwise distance distribution. Not only does this metric embedding determine the dimensionality of the latent space automatically, it also enables us to construct a mixture of Gaussians to draw latent space random vectors. We use the Gaussian mixture model in tandem with a simple augmentation of the objective function to train GANs. Every major step of our method is supported by theoretical analysis, and our experiments on real and synthetic data confirm that the generator is able to produce samples spreading over most of the modes while avoiding unwanted samples, outperforming several recent GAN variants on a number of metrics and offering new features.

Zero-Shot Transfer with Deictic Object-Oriented Representation in Reinforcement Learning

Object-oriented representations in reinforcement learning have shown promise in transfer learning, with previous research introducing a propositional object-oriented framework that has provably efficient learning bounds with respect to sample complexity. However, this framework has limitations in terms of the classes of tasks it can efficiently learn. In this paper we introduce a novel deictic object-oriented framework that has provably efficient learning bounds and can solve a broader range of tasks. Additionally, we show that this framework is capable of zero-shot transfer of transition dynamics across tasks and demonstrate this empirically for the Taxi and Sokoban domains.

Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate

Many modern machine learning models are trained to achieve zero or near-zero training error in order to obtain near-optimal (but non-zero) test error. This phenomenon of strong generalization performance for ``overfitted'' / interpolated classifiers appears to be ubiquitous in high-dimensional data, having been observed in deep networks, kernel machines, boosting and random forests. Their performance is robust even when the data contain large amounts of label noise. Very little theory is available to explain these observations. The vast majority of theoretical analyses of generalization allows for interpolation only when there is little or no label noise. This paper takes a step toward a theoretical foundation for interpolated classifiers by analyzing local interpolating schemes, including a geometric simplicial interpolation algorithm and weighted $k$-nearest neighbor schemes. Consistency or near-consistency is proved for these schemes in classification and regression problems. These schemes have an inductive bias that benefits from higher dimension, a kind of ``blessing of dimensionality''. Finally, connections to kernel machines, random forests, and adversarial examples in the interpolated regime are discussed.

Breaking the Span Assumption Yields Fast Finite-Sum Minimization

In this paper, we show that SVRG and SARAH can be modified to be fundamentally faster than all of the other standard algorithms that minimize the sum of $n$ smooth functions, such as SAGA, SAG, SDCA, and SDCA without duality. Most finite sum algorithms follow what we call the ``span assumption'': Their updates are in the span of a sequence of component gradients chosen in a random IID fashion. In the big data regime, where the condition number $\kappa=O(n)$, the span assumption prevents algorithms from converging to an approximate solution of accuracy $\epsilon$ in less than $n\ln(1/\epsilon)$ iterations. SVRG and SARAH do not follow the span assumption since they are updated with a hybrid of full-gradient and component-gradient information. We show that because of this, they can be up to $\Omega(1+(\ln(n/\kappa))_+)$ times faster. In particular, to obtain an accuracy $\epsilon = 1/n^\alpha$ for $\kappa=n^\beta$ and $\alpha,\beta\in(0,1)$, modified SVRG requires $O(n)$ iterations, whereas algorithms that follow the span assumption require $O(n\ln(n))$ iterations. Moreover, we present lower bound results that show this speedup is optimal, and provide analysis to help explain why this speedup exists. With the understanding that the span assumption is a point of weakness of finite sum algorithms, future work may purposefully exploit this to yield even faster algorithms in the big data regime.

Structured Local Minima in Sparse Blind Deconvolution

Blind deconvolution is a ubiquitous problem of recovering two unknown signals from their convolution. Unfortunately, this is an ill-posed problem in general. This paper focuses on the {\em short and sparse} blind deconvolution problem, where one unknown signal is short and the other is sparsely and randomly supported. This variant captures the structure of the unknown signals in several important applications. We assume the short signal to have unit $\ell^2$ norm and cast the blind deconvolution problem as a nonconvex optimization problem over the sphere. We demonstrate that (i) in a certain region of the sphere, every local optimum is close to some shift truncation of the ground truth, and (ii) for a generic short signal of length $k$, when the sparsity of the activation signal $\theta \lesssim k^{-2/3}$ and the number of measurements $m \gtrsim \mathrm{poly}(k)$, a simple initialization method together with a descent algorithm which escapes strict saddle points recovers a near shift truncation of the ground truth kernel.

GIANT: Globally Improved Approximate Newton Method for Distributed Optimization

In distributed computing environments, we consider the empirical risk minimization problem and propose a distributed and communication-efficient Newton-type optimization method. At every iteration, each worker locally finds an Approximate NewTon (ANT) direction, which is sent to the main driver. The main driver, then, averages all the ANT directions received from workers to form a Globally Improved ANT (GIANT) direction. GIANT is highly communication efficient and naturally exploits the trade-offs between local computations and global communications in that more local computations result in fewer overall rounds of communications. Theoretically, we show that GIANT enjoys an improved convergence rate as compared with first-order methods and existing distributed Newton-type methods. Further, and in sharp contrast with many existing distributed Newton-type methods, as well as popular first-order methods, a highly advantageous practical feature of GIANT is that it only involves one tuning parameter. We conduct large-scale experiments on a computer cluster and, empirically, demonstrate the superior performance of GIANT.

Modelling sparsity, heterogeneity, reciprocity and community structure in temporal interaction data

We propose a novel class of network models for temporal dyadic interaction data. Our objective is to capture important features often observed in social interactions: sparsity, degree heterogeneity, community structure and reciprocity. We use mutually exciting Hawkes processes to model the interactions between each (directed) pair of individuals. The intensity of each process allows interactions to arise as responses to opposite interactions (reciprocity), or due to shared interests between individuals (community structure). For sparsity and degree heterogeneity, we build the time-independent part of the intensity function on compound random measures, following Todeschini et al. (2016). We conduct experiments on real-world temporal interaction data and show that the proposed model outperforms competing approaches for link prediction, and leads to interpretable parameters.
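
A minimal sketch of a mutually exciting intensity for a directed pair; the exponential kernel and parameter names are illustrative, and the compound-random-measure base rate is collapsed into a constant here:

```python
import numpy as np

def intensity_ij(t, times_ij, times_ji, mu, delta, eta, beta):
    """Intensity of the directed i->j process at time t: a base rate mu
    (carrying sparsity/degree heterogeneity in the full model) plus
    exponentially decaying excitation from past i->j events and, for
    reciprocity, from past j->i events."""
    dt_ij = t - np.asarray([s for s in times_ij if s < t])
    dt_ji = t - np.asarray([s for s in times_ji if s < t])
    return (mu
            + delta * np.exp(-beta * dt_ij).sum()
            + eta * np.exp(-beta * dt_ji).sum())

print(intensity_ij(5.0, [1.0, 4.5], [2.0], mu=0.1, delta=0.8, eta=0.5, beta=1.0))
```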

Non-monotone Submodular Maximization in Exponentially Fewer Iterations

In this paper we consider parallelization for applications whose objective can be expressed as maximizing a non-monotone submodular function under a cardinality constraint. Our main result is an algorithm whose approximation is arbitrarily close to 1/2e in O(log^2 n) adaptive rounds, where n is the size of the ground set. This is an exponential speedup in parallel running time over any previously studied algorithm for constrained non-monotone submodular maximization. Beyond its provable guarantees, the algorithm performs well in practice. Specifically, experiments on traffic monitoring and personalized data summarization applications show that the algorithm finds solutions whose values are competitive with state-of-the-art algorithms while running in exponentially fewer parallel iterations.

MetaGAN: An Adversarial Approach to Few-Shot Learning

In this paper, we propose a conceptually simple and general framework called MetaGAN for few-shot learning problems. Most state-of-the-art few-shot classification models can be integrated with MetaGAN in a principled and straightforward way. By introducing an adversarial generator conditioned on tasks, we augment vanilla few-shot classification models with the ability to discriminate between real and fake data. We argue that this GAN-based approach can help few-shot classifiers learn sharper decision boundaries, which could generalize better. We show that with our MetaGAN framework, we can extend supervised few-shot learning models to naturally cope with unsupervised data. Unlike previous work in semi-supervised few-shot learning, our algorithms can deal with semi-supervision at both the sample level and the task level. We give theoretical justifications of the strength of MetaGAN, and validate the effectiveness of MetaGAN on challenging few-shot image classification benchmarks.

Local Differential Privacy for Evolving Data

There are now several large scale deployments of differential privacy used to collect statistical information about users. However, these deployments periodically recollect the data and recompute the statistics using algorithms designed for a single use. As a result, these systems do not provide meaningful privacy guarantees over long time scales. Moreover, existing techniques to mitigate this effect do not apply in the ``local model'' of differential privacy that these systems use. In this paper, we introduce a new technique for local differential privacy that makes it possible to maintain up-to-date statistics over time, with privacy guarantees that degrade only in the number of changes in the underlying distribution rather than the number of collection periods. We use our technique for tracking a changing statistic in the setting where users are partitioned into an unknown collection of groups, and at every time period each user draws a single bit from a common (but changing) group-specific distribution. We also provide an application to frequency and heavy-hitter estimation.

Gaussian Process Conditional Density Estimation

Conditional Density Estimation (CDE) models deal with estimating conditional distributions. The conditions imposed on the distribution are the inputs of the model. CDE is a challenging task as there is a fundamental trade-off between model complexity, representational capacity and overfitting. In this work, we propose to extend the model's input with latent variables and use Gaussian processes (GP) to map this augmented input onto samples from the conditional distribution. Our Bayesian approach allows for the modeling of small datasets, but we also provide the machinery for it to be applied to big data using stochastic variational inference. Our approach can be used to model densities even in sparse data regions, and allows for sharing learned structure between conditions. We illustrate the effectiveness and wide-reaching applicability of our model on a variety of real-world problems, such as spatio-temporal density estimation of taxi drop-offs, non-Gaussian noise modeling, and few-shot learning on Omniglot images.

Meta-Gradient Reinforcement Learning

The goal of reinforcement learning algorithms is to estimate and/or optimise the value function. However, unlike supervised learning, no teacher or oracle is available to provide the true value function. Instead, the majority of reinforcement learning algorithms estimate and/or optimise a proxy for the value function. This proxy is typically based on a sampled and bootstrapped approximation to the true value function, known as a \emph{return}. The particular choice of return is one of the chief components determining the nature of the algorithm: the rate at which future rewards are discounted; when and how values should be bootstrapped; or even the nature of the rewards themselves. It is well-known that these decisions are crucial to the overall success of RL algorithms. We introduce a novel, gradient-based meta-learning algorithm that is able to adapt the nature of the return, online, whilst interacting and learning from the environment. When applied to 57 games on the Atari 2600 environment over 200 million frames, our algorithm achieved a new state-of-the-art.
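
The meta-gradient idea can be summarized by one chain rule; the notation below is a hedged paraphrase rather than the paper's exact formulation. With meta-parameters $\eta$ (e.g. the discount factor and a bootstrapping parameter), an inner update $\theta' = \theta + f(\tau, \theta, \eta)$ on experience $\tau$, and a meta-objective $J'$ measured on subsequent experience $\tau'$:

$$\Delta \eta \propto -\frac{\partial J'(\tau', \theta')}{\partial \theta'} \cdot \frac{\partial \theta'}{\partial \eta}$$

so the shape of the return is itself adjusted by gradient descent, online, while the agent continues to learn.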

Modular Networks: Learning to Decompose Neural Computation

Scaling model capacity has been vital in the success of deep learning. For a typical network, necessary compute resources and training time grow dramatically with model size. Conditional computation is a promising way to increase the number of parameters with a relatively small increase in resources. We propose a training algorithm that flexibly chooses neural modules based on the data to be processed. Both the decomposition and modules are learned end-to-end. In contrast to existing approaches, training does not rely on regularization to enforce diversity in module use. We apply modular networks both to image recognition and language modeling tasks, where we achieve superior performance compared to several baselines. Introspection reveals that modules specialize to interpretable contexts.

Learning to Navigate in Cities Without a Map

Navigating through unstructured environments is a basic capability of intelligent creatures, and thus is of fundamental interest in the study and development of artificial intelligence. Long-range navigation is a complex cognitive task that relies on developing an internal representation of space, grounded by recognisable landmarks and robust visual processing, that can simultaneously support continuous self-localisation ("I am here") and a representation of the goal ("I am going there"). Building upon recent research that applies deep reinforcement learning to maze navigation problems, we present an end-to-end deep reinforcement learning approach that can be applied on a city scale. Recognising that successful navigation relies on integration of general policies with locale-specific knowledge, we propose a dual pathway architecture that allows locale-specific features to be encapsulated, while still enabling transfer to multiple cities. A key contribution of this paper is an interactive navigation environment that uses Google Street View for its photographic content and worldwide coverage. Our baselines demonstrate that deep reinforcement learning agents can learn to navigate in multiple cities and to traverse to target destinations that may be kilometres away. A video summarizing our research and showing the trained agent in diverse city environments as well as on the transfer task is available at: https://sites.google.com/view/learn-navigate-cities-nips18

A theory on the absence of spurious optimality

We study the set of continuous functions that admit no spurious local optima (i.e. local minima that are not global minima), which we term global functions. They satisfy various powerful properties for analyzing nonconvex and nonsmooth optimization problems. For instance, they satisfy an analogue of the fundamental uniform limit theorem from the analysis of continuous functions. Global functions are also endowed with useful properties regarding the composition of functions and change of variables. Using these new results, we show that a class of non-differentiable nonconvex optimization problems arising in tensor decomposition applications are global functions. This is the first result concerning nonconvex methods for nonsmooth objective functions. Our result provides a theoretical guarantee for the widely used $\ell_1$ norm to avoid outliers in nonconvex optimization.

Recurrent World Models Facilitate Policy Evolution

A generative recurrent neural network is quickly trained in an unsupervised manner to model popular reinforcement learning environments through compressed spatio-temporal representations. The world model's extracted features are fed into compact and simple policies trained by evolution, achieving state-of-the-art results in various environments. We also train our agent entirely inside of an environment generated by its own internal world model, and transfer this policy back into the actual environment. Interactive version of this paper: https://nipsanon.github.io

Ridge Regression and Provable Deterministic Ridge Leverage Score Sampling

Ridge leverage scores provide a balance between low-rank approximation and regularization, and are ubiquitous in randomized linear algebra and machine learning. Deterministic algorithms are also of interest in the moderately big data regime, because deterministic algorithms provide interpretability to the practitioner by having no failure probability and always returning the same results. We provide provable guarantees for deterministic column sampling using ridge leverage scores. The matrix sketch returned by our algorithm is a column subset of the original matrix, yielding additional interpretability. Like the randomized counterparts, the deterministic algorithm provides $(1+\epsilon)$ error column subset selection, $(1+\epsilon)$ error projection-cost preservation, and an additive-multiplicative spectral bound. We also show that under the assumption of power-law decay of ridge leverage scores, this deterministic algorithm is provably as accurate as randomized algorithms. Lastly, ridge regression is frequently used to regularize ill-posed linear least-squares problems. While ridge regression provides shrinkage for the regression coefficients, many of the coefficients remain small but non-zero. Performing ridge regression with the matrix sketch returned by our algorithm and a particular regularization parameter forces coefficients to zero and has a provable $(1+\epsilon)$ bound on the statistical risk. As such, it is an interesting alternative to elastic net regularization.
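
The scores themselves are cheap to write down; a small numpy sketch for column selection, where the threshold rule and the regularization choice are illustrative rather than the paper's exact algorithm:

```python
import numpy as np

def ridge_leverage_scores(A, lam):
    """tau_j = a_j^T (A A^T + lam * I)^{-1} a_j for each column a_j of A.
    A common choice in this literature is lam = ||A - A_k||_F^2 / k."""
    n = A.shape[0]
    Ginv = np.linalg.inv(A @ A.T + lam * np.eye(n))
    return np.einsum('ij,ik,kj->j', A, Ginv, A)

def select_columns(A, lam, theta):
    """Deterministic sketch: keep every column whose score exceeds theta."""
    return np.where(ridge_leverage_scores(A, lam) > theta)[0]
```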

Wasserstein Variational Inference

This paper introduces Wasserstein variational inference, a new form of approximate Bayesian inference based on optimal transport theory. Wasserstein variational inference uses a new family of divergences that includes both f-divergences and the Wasserstein distance as special cases. The gradients of the Wasserstein variational loss are obtained by backpropagating through the Sinkhorn iterations. This technique results in a very stable likelihood-free training method that can be used with implicit distributions and probabilistic programs. Using the Wasserstein variational inference framework, we introduce several new forms of autoencoders and test their robustness and performance against existing variational autoencoding techniques.
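
The Sinkhorn iterations that the loss is backpropagated through have a compact fixed-point form; a generic numpy sketch, not the paper's implementation (every step is differentiable, which is what makes the training method above possible):

```python
import numpy as np

def sinkhorn_plan(cost, a, b, eps=0.1, n_iter=200):
    """Entropy-regularized optimal transport between histograms a and b.
    cost is the (n, m) ground-cost matrix; eps is the regularization."""
    K = np.exp(-cost / eps)              # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iter):              # alternate marginal projections
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]   # transport plan with marginals a, b
```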

How Does Batch Normalization Help Optimization? (No, It Is Not About Internal Covariate Shift)

Batch Normalization (BatchNorm) is a widely adopted technique that enables faster and more stable training of deep neural networks (DNNs). Despite its pervasiveness, the exact reasons for BatchNorm's effectiveness are still poorly understood. The popular belief is that this effectiveness stems from controlling the change of the layers' input distributions during training to reduce the so-called "internal covariate shift". In this work, we demonstrate that such distributional stability of layer inputs has little to do with the success of BatchNorm. Instead, we uncover a more fundamental impact of BatchNorm on the training process: it makes the optimization landscape significantly smoother. This smoothness induces a more predictive and stable behavior of the gradients, allowing for faster training. These findings bring us closer to a true understanding of our DNN training toolkit.
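
For reference, the operation under discussion, in its training-mode fully connected form (a textbook sketch, not code from the paper):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then rescale and shift.
    x: (batch, features); gamma, beta: learned (features,) parameters."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta
```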

Verifiable Reinforcement Learning via Policy Extraction

While deep reinforcement learning has successfully solved many challenging control tasks, its real-world applicability has been limited by the inability to ensure the safety of learned policies. We propose an approach to verifiable reinforcement learning by training decision tree policies, which can represent complex policies (since they are nonparametric), yet can be efficiently verified using existing techniques (since they are highly structured). The challenge is that decision tree policies are difficult to train. We propose VIPER, an algorithm that combines ideas from model compression and imitation learning to learn decision tree policies guided by a DNN policy (called the oracle) and its Q-function, and show that it substantially outperforms two baselines. We use VIPER to (i) learn a provably robust decision tree policy for a variant of Atari Pong with a symbolic state space, (ii) learn a decision tree policy for a toy game based on Pong that provably never loses, and (iii) learn a provably stable decision tree policy for cart-pole. In each case, the decision tree policy achieves performance equal to that of the original DNN policy.
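
A hedged sketch of the kind of Q-guided imitation loop described above. All helpers (oracle_policy, q_values, env_rollout, fit_tree) are hypothetical stand-ins, and the sample weight is one simple surrogate for the paper's Q-based criterion:

```python
import numpy as np

def distill_to_tree(oracle_policy, q_values, env_rollout, fit_tree, n_rounds=10):
    """DAgger-style distillation of a DNN oracle into a decision tree.
    States visited by the current tree policy are labeled by the oracle;
    samples are weighted by how costly the worst action would be there."""
    dataset = []
    policy = oracle_policy                      # first rollout uses the oracle
    for _ in range(n_rounds):
        for s in env_rollout(policy):
            q = q_values(s)                     # oracle's Q-values at state s
            dataset.append((s, oracle_policy(s), q.max() - q.min()))
        X, y, w = map(np.array, zip(*dataset))
        tree = fit_tree(X, y, sample_weight=w)  # e.g. an sklearn-style tree
        policy = lambda s, t=tree: t.predict(np.atleast_2d(s))[0]
    return tree
```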

Leveraged volume sampling for linear regression

Suppose an n x d design matrix in a linear regression problem is given, but the response for each point is hidden unless explicitly requested. The goal is to sample only a small number k << n of the responses, and then produce a weight vector whose sum of squares loss over all points is at most 1+epsilon times the minimum. When k is very small (e.g., k=d), jointly sampling diverse subsets of points is crucial. One such method called "volume sampling" has a unique and desirable property that the weight vector it produces is an unbiased estimate of the optimum. It is therefore natural to ask if this method offers the optimal unbiased estimate in terms of the number of responses k needed to achieve a 1+epsilon loss approximation. Surprisingly we show that volume sampling can have poor behavior when we require a very accurate approximation -- indeed worse than some i.i.d. sampling techniques whose estimates are biased, such as leverage score sampling. We then develop a new rescaled variant of volume sampling that produces an unbiased estimate which avoids this bad behavior and has at least as good a tail bound as leverage score sampling: sample size k=O(d log d + d/epsilon) suffices to guarantee total loss at most 1+epsilon times the minimum with high probability. Thus, we improve on the best previously known sample size for an unbiased estimator, k=O(d^2/epsilon). Our rescaling procedure leads to a new efficient algorithm for volume sampling which is based on a "determinantal rejection sampling" technique with potentially broader applications to determinantal point processes. Other contributions include introducing the combinatorics needed for rescaled volume sampling and developing tail bounds for sums of dependent random matrices which arise in the process.

Supervised Local Modeling for Interpretability

Model interpretability is an increasingly important component of practical machine learning systems, with example-based, local, and global explanations representing some of the most common forms of explanations. We present a novel explanation system called SLIM that leverages favorable properties of all three of these explanation types. By combining local linear modeling techniques with dual interpretations of random forests (as a supervised neighborhood approach and as a feature selection method), we present a novel local explanation system with several favorable properties. First, SLIM sidesteps the typical accuracy-interpretability trade-off, as it is highly accurate while also providing both example-based and local explanations. Second, while SLIM does not provide global explanations, it can detect global patterns and thus diagnose limitations in its local explanations. Third, SLIM can select an appropriate explanation for a new test point when restricted to an existing set of exemplar explanations. Finally, in addition to providing faithful self-explanations, SLIM can be deployed as a black-box explanation system.

A Linear Speedup Analysis of Distributed Deep Learning with Sparse and Quantized Communication

The large communication overhead has imposed a bottleneck on the performance of distributed Stochastic Gradient Descent (SGD) for training deep neural networks. Previous works have demonstrated the potential of using gradient sparsification and quantization to reduce the communication cost. However, there is still a lack of understanding about how sparse and quantized communication affects the convergence rate of the training algorithm. In this paper, we study the convergence rate of distributed SGD for non-convex optimization with two communication reducing strategies: sparse parameter averaging and gradient quantization. We show that $O(1/\sqrt{MK})$ convergence rate can be achieved if the sparsification and quantization hyperparameters are configured properly. We also propose a strategy called periodic quantized averaging (PQASGD) that further reduces the communication cost while preserving the $O(1/\sqrt{MK})$ convergence rate. Our evaluation validates our theoretical results and shows that our PQASGD can converge as fast as full-precision SGD with only 3%-5% of the communication data size.
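
Two generic examples of the communication-reducing primitives the abstract refers to, in illustrative textbook form (these are not the paper's PQASGD):

```python
import numpy as np

def sparsify_topk(grad, k):
    """Keep only the k largest-magnitude entries of a flat gradient."""
    idx = np.argsort(np.abs(grad))[-k:]
    out = np.zeros_like(grad)
    out[idx] = grad[idx]
    return out

def quantize_sign(grad):
    """1-bit quantization: transmit signs plus one scale per vector."""
    return np.sign(grad) * np.mean(np.abs(grad))
```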

Active Learning for Non-Parametric Regression Using Purely Random Trees

Active learning is the task of using labelled data to select additional points to label, with the goal of fitting the most accurate model with a fixed budget of labelled points. In binary classification, active learning is known to produce faster rates than passive learning for a broad range of settings. However, in regression, restrictive structure and tailored methods have been needed to obtain theoretically superior rates. In this paper we propose an intuitive tree-based active learning algorithm for non-parametric regression with provable improvement over random sampling. When implemented with Mondrian Trees, our algorithm is tuning-parameter-free, consistent and minimax optimal for Lipschitz functions.

Tree-to-tree Neural Networks for Program Translation

Program translation is an important tool to migrate legacy code in one language into an ecosystem built in a different language. In this work, we are the first to employ deep neural networks toward tackling this problem. We observe that program translation is a modular procedure, in which a sub-tree of the source tree is translated into the corresponding target sub-tree at each step. To capture this intuition, we design a tree-to-tree neural network to translate a source tree into a target one. Meanwhile, we develop an attention mechanism for the tree-to-tree model, so that when the decoder expands one non-terminal in the target tree, the attention mechanism locates the corresponding sub-tree in the source tree to guide the expansion of the decoder. We evaluate the program translation capability of our tree-to-tree model against several state-of-the-art approaches. Compared against other neural translation models, we observe that our approach is consistently better than the baselines with a margin of up to 15 points. Further, our approach can improve the previous state-of-the-art program translation approaches by a margin of 20 points on the translation of real-world projects.

Batch-Instance Normalization for Adaptively Style-Invariant Neural Networks

Real-world image recognition is often challenged by the variability of visual styles including object textures, lighting conditions, filter effects, etc. Although these variations have been deemed to be implicitly handled by more training data and deeper networks, recent advances in image style transfer suggest that it is also possible to explicitly manipulate the style information. Extending this idea to general visual recognition problems, we present Batch-Instance Normalization (BIN) to explicitly normalize unnecessary styles from images. Considering that certain style features play an essential role in discriminative tasks, BIN learns to selectively normalize only disturbing styles while preserving useful styles. The proposed normalization module is easily incorporated into existing network architectures such as Residual Networks, and surprisingly improves the recognition performance in various scenarios. Furthermore, experiments verify that BIN effectively adapts to completely different tasks like object classification and style transfer, by controlling the trade-off between preserving and removing style variations.
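
A minimal numpy reading of the gate described above, assuming an (N, C, H, W) activation tensor and parameters broadcastable to that shape; this is our paraphrase of the abstract, not the authors' code:

```python
import numpy as np

def batch_instance_norm(x, rho, gamma, beta, eps=1e-5):
    """rho in [0, 1] interpolates per channel between batch-normalized
    (style-preserving) and instance-normalized (style-removing) features."""
    bn = (x - x.mean(axis=(0, 2, 3), keepdims=True)) / np.sqrt(
        x.var(axis=(0, 2, 3), keepdims=True) + eps)
    inorm = (x - x.mean(axis=(2, 3), keepdims=True)) / np.sqrt(
        x.var(axis=(2, 3), keepdims=True) + eps)
    return gamma * (rho * bn + (1.0 - rho) * inorm) + beta
```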

Structural Causal Bandits: Where to Intervene?

We study the problem of identifying the best action in a sequential decision-making setting when the reward distributions of the arms exhibit non-trivial dependencies, which are governed by the underlying causal structure of the domain where the agent is deployed. In this setting, playing an arm corresponds to intervening on a set of variables and setting them to specific values. In this paper, we start by showing that whenever the causal model relating the arms is unknown, the strategy of simultaneously intervening in all variables can, in general, lead to a sub-optimal policy (regardless of the number of iterations performed in the environment). We then derive structural properties implied by the given causal model, which is assumed to be known, albeit without parametrization. We further propose an algorithm that takes as input the causal structure and finds a minimal, sound, and complete set of qualified arms that the agent can play so as to maximize its reward. We empirically demonstrate that this algorithm leads to optimal, order of magnitude faster convergence rates when compared with its causal-insensitive counterparts.

Answerer in Questioner's Mind: Information Theoretic Approach to Goal-Oriented Visual Dialog

Goal-oriented dialog has been given attention due to its numerous applications in artificial intelligence. Goal-oriented dialogue tasks occur when a questioner asks an action-oriented question and an answerer responds with the intent of letting the questioner know a correct action to take. To ask the adequate question, deep learning and reinforcement learning have been recently applied. However, these approaches struggle to find a competent recurrent neural questioner, owing to the complexity of learning a series of sentences. Motivated by theory of mind, we propose ``Answerer in Questioner's Mind'' (AQM), a novel algorithm for goal-oriented dialog. With AQM, a questioner asks and infers based on an approximated probabilistic model of the answerer. The questioner figures out the answerer's intention via selecting a plausible question by explicitly calculating the information gain of the candidate intentions and possible answers to each question. We test our framework on two goal-oriented visual dialog tasks: ``MNIST Counting Dialog'' and ``GuessWhat?!''. In our experiments, AQM outperforms comparative algorithms by a large margin.
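
The information-gain computation at the heart of AQM has a standard form; a small numpy sketch under assumed shapes (C candidate intentions, A possible answers to a fixed question):

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def information_gain(prior, likelihood):
    """I(intention; answer) = H(prior) - E_answer[ H(posterior) ].
    prior: p(c), shape (C,); likelihood: p(a | c) for this question, (A, C)."""
    p_answer = likelihood @ prior
    gain = entropy(prior)
    for a, p_a in enumerate(p_answer):
        posterior = likelihood[a] * prior
        posterior /= posterior.sum()
        gain -= p_a * entropy(posterior)
    return gain
```

The questioner would evaluate this quantity for each candidate question and ask the maximizer.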

A Unified Feature Disentangler for Multi-Domain Image Translation and Manipulation

We present a novel and unified deep learning framework which is capable of learning domain-invariant representation from data across multiple domains. Realized by adversarial training with additional ability to exploit domain-specific information, the proposed network is able to perform continuous cross-domain image translation and manipulation, and produces desirable output images accordingly. In addition, the resulting feature representation exhibits superior performance of unsupervised domain adaptation, which also verifies the effectiveness of the proposed model in learning disentangled features for describing cross-domain data.

Online Learning with an Unknown Fairness Metric

We consider the problem of online learning in the linear contextual bandits setting, but in which there are also strong individual fairness constraints governed by an unknown similarity metric. These constraints demand that we select similar actions or individuals with approximately equal probability [DHPRZ12], which may be at odds with optimizing reward, thus modeling settings where profit and social policy are in tension. We assume we learn about an unknown Mahalanobis similarity metric from only weak feedback that identifies fairness violations, but does not quantify their extent. This is intended to represent the interventions of a regulator who "knows unfairness when he sees it" but nevertheless cannot enunciate a quantitative fairness metric over individuals. Our main result is an algorithm in the adversarial context setting that has a number of fairness violations that depends only logarithmically on T, while obtaining an optimal O(sqrt(T)) regret bound to the best fair policy.

Isolating Sources of Disentanglement in Variational Autoencoders

We decompose the evidence lower bound to show the existence of a term measuring the total correlation between latent variables. We use this to motivate the beta-TCVAE (Total Correlation Variational Autoencoder) algorithm, a refinement and plug-in replacement of the beta-VAE for learning disentangled representations, requiring no additional hyperparameters during training. We further propose a principled classifier-free measure of disentanglement called the mutual information gap (MIG). We perform extensive quantitative and qualitative experiments, in both restricted and non-restricted settings, and show a strong relation between total correlation and disentanglement, when the model is trained using our framework.
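
The decomposition referred to above is usually written as follows (n indexes training examples, z_j the latent dimensions; notation assumed):

$$\mathbb{E}_{p(n)}\big[\mathrm{KL}\big(q(z|n)\,\|\,p(z)\big)\big] = I_q(z;n) + \mathrm{KL}\Big(q(z)\,\Big\|\,\prod_j q(z_j)\Big) + \sum_j \mathrm{KL}\big(q(z_j)\,\|\,p(z_j)\big)$$

The middle term is the total correlation; beta-TCVAE penalizes it alone, rather than the whole KL term as beta-VAE does.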

Contextual bandits with surrogate losses: Margin bounds and efficient algorithms

We use surrogate losses to obtain several new regret bounds and new algorithms for contextual bandit learning. Using the ramp loss, we derive a new margin-based regret bound in terms of standard sequential complexity measures of a benchmark class of real-valued regression functions. Using the hinge loss, we derive an efficient algorithm with a $\sqrt{dT}$-type mistake bound against benchmark policies induced by $d$-dimensional regressors. Under realizability assumptions, our results also yield classical regret bounds.

Representation Learning for Treatment Effect Estimation from Observational Data

Estimating individual treatment effect (ITE) is a challenging problem in causal inference, due to missing counterfactuals and selection bias. Existing ITE estimation methods mainly focus on balancing the distributions of the control and treated groups, but ignore helpful local similarity information. In this paper, we propose a local similarity preserved individual treatment effect (SITE) estimation method based on deep representation learning. SITE preserves local similarity and balances data distributions simultaneously, by focusing on several hard samples in each mini-batch. Experimental results on synthetic and three real-world datasets demonstrate the advantages of the proposed SITE method, compared with the state-of-the-art ITE estimation methods.

Representation Balancing MDPs for Off-policy Policy Evaluation

We study the problem of off-policy policy evaluation (OPPE) in RL. In contrast to prior work, we consider how to estimate both the individual policy value and average policy value accurately. We draw inspiration from recent work in causal reasoning, and propose a new finite sample generalization error bound for value estimates from MDP models. Using this upper bound as an objective, we develop a learning algorithm of an MDP model with a balanced representation, and show that our approach can yield substantially lower MSE in a common synthetic domain and on a challenging real-world sepsis management problem.

Out of the Box: Reasoning with Graph Convolution Nets for Factual Visual Question Answering

Accurately answering a question about a given image requires combining observations with general knowledge. While this is effortless for humans, reasoning with general knowledge remains an algorithmic challenge. To advance research in this direction, a novel `fact-based' visual question answering (FVQA) task has been introduced recently along with a large set of curated facts which link two entities, i.e., two possible answers, via a relation. Given a question-image pair, deep net techniques have been employed to successively reduce the large set of facts until one of the two entities of the final remaining fact is predicted as the answer. We observe that a successive process which considers one fact at a time to form a local decision is sub-optimal. Instead, we develop an entity graph and a graph convolutional net method to `reason' about the correct answer by jointly considering all entities. We show on the challenging FVQA dataset that this leads to an improvement in accuracy of around 10% compared to the state-of-the-art.

Causal Inference on Discrete Data using Hidden Compact Representation

Inferring causal relations from a set of observations is one of the fundamental problems across several disciplines. For continuous variables, a number of causal discovery methods have recently demonstrated their effectiveness in distinguishing the cause from effect by exploring certain properties of the conditional distribution, but causal discovery on categorical data remains a challenging problem, because it is generally not easy to find a compact description of the causal mechanism for the true causal direction. In this paper we attempt to solve this problem by assuming a two-stage causal process: the first stage maps the cause to a hidden variable of a lower cardinality, and the second stage generates the effect from the hidden representation. In this way, the causal mechanism admits a simple, compact representation. We show that under this model, the causal direction is identifiable under some weak technical conditions on the true causal mechanism. We also provide an effective solution to recover the above hidden compact representation model under the likelihood framework. Empirical studies verify the effectiveness of the proposed approach on both synthetic and real-world data.

Natasha 2: Faster Non-Convex Optimization Than SGD

(this is a theory paper) We design a stochastic algorithm to find $\varepsilon$-approximate local minima of any smooth nonconvex function in rate $O(\varepsilon^{-3.25})$, with only oracle access to stochastic gradients. The best previous result was essentially $O(\varepsilon^{-4})$, achieved by stochastic gradient descent (SGD).

Minimax Statistical Learning with Wasserstein distances

As opposed to standard empirical risk minimization (ERM), distributionally robust optimization aims to minimize the worst-case risk over a larger ambiguity set containing the original empirical distribution of the training data. In this work, we describe a minimax framework for statistical learning with ambiguity sets given by balls in Wasserstein space. In particular, we prove a generalization bound that involves the covering number properties of the original ERM problem. As an illustrative example, we provide generalization guarantees for transport-based domain adaptation problems where the Wasserstein distance between the source and target domain distributions can be reliably estimated from unlabeled samples.

Provable Variational Inference for Constrained Log-Submodular Models

Submodular maximization problems appear in several areas of machine learning and data science, as many useful modelling concepts such as diversity and coverage satisfy this natural diminishing returns property. Because the data defining these functions, as well as the decisions made with the computed solutions, are subject to statistical noise and randomness, it is arguably necessary to go beyond computing a single approximate optimum and quantify its inherent uncertainty. To this end, we define a rich class of probabilistic models associated with constrained submodular maximization problems. These capture log-submodular dependencies of arbitrary order between the variables, but also satisfy hard combinatorial constraints. Namely, the variables are assumed to take on one of a (possibly exponentially large) set of states, which form the bases of a matroid. To perform inference in these models we design novel variational inference algorithms, which carefully leverage the combinatorial and probabilistic properties of these objects. In addition to providing completely tractable and well-understood variational approximations, our approach results in the minimization of a convex upper bound on the log-partition function. The bound can be efficiently evaluated using greedy algorithms and optimized using any first-order method. Moreover, for the case of facility location and weighted coverage functions, we prove the first constant factor guarantee in this setting: an efficiently certifiable e/(e-1) approximation of the log-partition function. Finally, we empirically demonstrate the effectiveness of our approach on several instances.

Learning Hierarchical Semantic Image Manipulation through Structured Representations

Understanding, reasoning, and manipulating semantic concepts of images have been a fundamental research problem for decades. Previous work mainly focused on direct manipulation of the natural image manifold through color strokes, keypoints, textures, and holes-to-fill. In this work, we present a novel hierarchical framework for semantic image manipulation. Key to our hierarchical framework is that we employ a structured semantic layout as our intermediate representation for manipulation. Initialized with coarse-level bounding boxes, our layout generator first creates a pixel-wise semantic layout capturing the object shape, object-object interactions, and object-scene relations. Then our image generator fills in the pixel-level textures guided by the semantic layout. Such a framework allows a user to manipulate images at the object level by adding, removing, and moving one bounding box at a time. Experimental evaluations demonstrate the advantages of the hierarchical manipulation framework over existing image generation and context hole-filling models, both qualitatively and quantitatively. Benefits of the hierarchical framework are further demonstrated in applications such as semantic object manipulation, interactive image editing, and data-driven image manipulation.

Processing of missing data by neural networks

We propose a general, theoretically justified mechanism for processing missing data by neural networks. Our idea is to replace a typical neuron's response in the first hidden layer by its expected value. This approach can be applied to various types of networks at minimal cost in their modification. Moreover, in contrast to recent approaches, it does not require complete data for training. Experimental results performed on different types of architectures show that our method gives better results than typical imputation strategies and other methods dedicated to incomplete data.
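
A hedged sketch of the idea: each first-layer neuron responds with its expected activation under a distribution over the missing coordinates. The Monte-Carlo estimate below stands in for the paper's analytic expectation, and the Gaussian model of the missing values is our assumption:

```python
import numpy as np

def expected_relu(w, b, mu, sigma, n_samples=1000, rng=None):
    """Estimate E[ReLU(w^T x + b)] when the missing entries of x are
    modeled as independent Gaussians with means mu and stds sigma
    (observed coordinates can be encoded with sigma = 0)."""
    rng = rng or np.random.default_rng(0)
    x = rng.normal(mu, sigma, size=(n_samples, len(mu)))
    return np.maximum(x @ w + b, 0.0).mean()
```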

Safe Active Learning for Time-Series Modeling with Gaussian Processes

Learning time-series models is useful for many applications, such as simulation and forecasting. In this study, we consider the problem of actively learning time-series models, while taking given safety constraints into account. For time-series modeling, we employ a Gaussian process with nonlinear exogenous input structure. The proposed approach generates data, i.e. input and output trajectories, appropriate for time-series model learning by dynamically exploring the input space. The basic idea behind the proposed approach is to parametrize the input trajectory as consecutive trajectory sections, which are determined stepwise given safety requirements and past observations. We analyze the proposed algorithm and, empirically, evaluate it on a technical application. The results show the effectiveness of our approach in a realistic industrial setting.

Optimal Algorithms for Non-Smooth Distributed Optimization in Networks

In this work, we consider the distributed optimization of non-smooth convex functions using a network of computing units. We investigate this problem under two regularity assumptions: (1) the Lipschitz continuity of the global objective function, and (2) the Lipschitz continuity of local individual functions. Under the local regularity assumption, we provide the first optimal first-order decentralized algorithm called multi-step primal-dual (MSPD) and its corresponding optimal convergence rate. A notable aspect of this result is that, for non-smooth functions, while the dominant term of the error is in $O(1/\sqrt{t})$, the structure of the communication network only impacts a second-order term in $O(1/t)$, where $t$ is time. In other words, the error due to limits in communication resources decreases at a fast rate even in the case of non-strongly-convex objective functions. Under the global regularity assumption, we provide a simple yet efficient algorithm called distributed randomized smoothing (DRS) based on a local smoothing of the objective function, and show that DRS is within a $d^{1/4}$ multiplicative factor of the optimal convergence rate, where $d$ is the underlying dimension.

Computing Higher Order Derivatives of Matrix and Tensor Expressions

Optimization is an integral part of most machine learning systems and most numerical optimization schemes rely on the computation of derivatives. Therefore, frameworks for computing derivatives are an active area of machine learning research. Surprisingly, as of yet, no existing framework is capable of computing higher order matrix and tensor derivatives directly. Here, we close this fundamental gap and present an algorithmic framework for computing matrix and tensor derivatives that extends seamlessly to higher order derivatives. The framework can be used for symbolic as well as for forward and reverse mode automatic differentiation. Experiments show a speedup between one and four orders of magnitude over state-of-the-art frameworks when evaluating higher order derivatives.

Paraphrasing Complex Network: Network Compression via Factor Transfer

Many researchers have sought ways of model compression to reduce the size of a deep neural network (DNN) with minimal performance degradation in order to use DNNs in embedded systems. Among the model compression methods, knowledge transfer trains a student network with the help of a stronger teacher network. In this paper, we propose a novel knowledge transfer method which uses convolutional operations to paraphrase the teacher's knowledge and to translate it for the student. This is done by two convolutional modules, which are called a paraphraser and a translator. The paraphraser is trained in an unsupervised manner to extract the teacher factors, which are defined as paraphrased information of the teacher network. The translator, located at the student network, extracts the student factors and helps to translate the teacher factors by mimicking them. We observed that our student network trained with the proposed factor transfer method outperforms the ones trained with conventional knowledge transfer methods.

Analytic solution and stationary phase approximation for the Bayesian lasso and elastic net

The lasso and elastic net linear regression models impose a double-exponential prior distribution on the model parameters to achieve regression shrinkage and variable selection, allowing the inference of robust models from large data sets. However, there has been limited success in deriving estimates for the full posterior distribution of regression coefficients in these models, due to a need to evaluate analytically intractable partition function integrals. Here, the Fourier transform is used to express these integrals as complex-valued oscillatory integrals over "regression frequencies". This results in an analytic expansion and stationary phase approximation for the partition functions of the Bayesian lasso and elastic net, where the non-differentiability of the double-exponential prior has so far eluded such an approach. Use of this approximation leads to highly accurate numerical estimates for the expectation values and marginal posterior distributions of the regression coefficients, and allows for Bayesian inference of much higher dimensional models than previously possible.

Demystifying excessively volatile human learning: A Bayesian persistent prior and a neural approximation

Understanding how humans and animals learn about statistical regularities in stable and volatile environments, and utilize these regularities to make predictions and decisions, is an important problem in neuroscience and psychology. Using a Bayesian modeling framework, specifically the Dynamic Belief Model (DBM), it has previously been shown that humans tend to make the {\it default} assumption that environmental statistics undergo abrupt, unsignaled changes, even when environmental statistics are actually stable. Because exact Bayesian inference in this setting, an example of switching state space models, is computationally intense, a number of approximately Bayesian and heuristic algorithms have been proposed to account for learning/prediction in the brain. Here, we examine a neurally plausible algorithm, a special case of leaky integration dynamics we denote as EXP (for exponential filtering), that is significantly simpler than all previously suggested algorithms except for the delta-learning rule, and which far outperforms the delta rule in approximating Bayesian prediction performance. We derive the theoretical relationship between DBM and EXP, and show that EXP gains computational efficiency by foregoing the representation of inferential uncertainty (as does the delta rule), but that it nevertheless achieves near-Bayesian performance due to its ability to incorporate a "persistent prior" influence unique to DBM and absent from the other algorithms. Furthermore, we show that EXP is comparable to DBM but better than all other models in reproducing human behavior in a visual search task, suggesting that human learning and prediction also incorporates an element of persistent prior. More broadly, our work demonstrates that when observations are information-poor, detecting changes or modulating the learning rate is both {\it difficult} and (thus) {\it unnecessary} for making Bayes-optimal predictions.
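
A minimal sketch of a leaky-integration predictor with a persistent prior, in the spirit of EXP; parameter names and the exact form are illustrative, not the paper's notation:

```python
import numpy as np

def exp_filter(observations, alpha, prior_mean, w):
    """Predict the next binary outcome as a mixture of a persistent prior
    (weight w) and a leaky average of past observations (leak rate alpha)."""
    preds, s = [], prior_mean
    for x in observations:
        preds.append(w * prior_mean + (1.0 - w) * s)
        s = (1.0 - alpha) * s + alpha * x   # leaky integration of evidence
    return np.array(preds)
```

Setting w = 0 recovers a delta-rule-like running average; the persistent prior term is what the abstract credits for near-Bayesian performance.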

Empirical Risk Minimization Under Fairness Constraints

We address the problem of algorithmic fairness: ensuring that sensitive variables do not unfairly influence the outcome of a classifier. We present an approach based on empirical risk minimization, which incorporates a fairness constraint into the learning problem. It encourages the conditional risk of the learned classifier to be approximately constant with respect to the sensitive variable. We derive both risk and fairness bounds that support the statistical consistency of our approach. We specify our approach to kernel methods and observe that the fairness requirement implies an orthogonality constraint which can be easily added to these methods. We further observe that for linear models the constraint translates into a simple data preprocessing step. Experiments indicate that the method is empirically effective and performs favorably against state-of-the-art approaches.

Unsupervised Learning of Shape and Pose with Differentiable Point Clouds

We address the problem of learning accurate 3D shape and camera pose from a collection of unlabeled category-specific images. We train a convolutional network to predict both the shape and the pose from a single image by minimizing the reprojection error: given several views of an object, the projections of the predicted shapes to the predicted camera poses should match the provided views. To deal with pose ambiguity, we introduce an ensemble of pose predictors, which we then distill into a single ``student'' model. To allow for efficient learning of high-fidelity shape representations, we represent the shapes by point clouds and devise a formulation allowing for their differentiable projection. Our experiments show that the distilled ensemble of pose predictors learns to estimate the pose accurately, while the point cloud representation allows us to predict detailed shape models.

Continuous-time Value Function Approximation in Reproducing Kernel Hilbert Spaces

Motivated by the success of reinforcement learning (RL) for discrete-time tasks such as AlphaGo and Atari games, there has been a recent surge of interest in using RL for continuous-time control of physical systems (cf. many challenging tasks in OpenAI Gym and the DeepMind Control Suite). Since discretization of time is susceptible to error, it is methodologically more desirable to handle the system dynamics directly in continuous time. However, very few techniques exist for continuous-time RL and they lack flexibility in value function approximation. In this paper, we propose a novel framework for continuous-time value function approximation based on reproducing kernel Hilbert spaces. The resulting framework is so flexible that it can accommodate any kind of kernel-based approach, such as Gaussian processes and the adaptive projected subgradient method, and it allows us to handle uncertainties and nonstationarity without prior knowledge about the environment or what basis functions to employ. We demonstrate the validity of the presented framework through experiments.

Solving Non-smooth Constrained Programs with Lower Complexity than $\mathcal{O}(1/\varepsilon)$: A Primal-Dual Homotopy Smoothing Approach

We propose a new primal-dual homotopy smoothing algorithm for a linearly constrained convex program, where neither the primal nor the dual function has to be smooth or strongly convex. The best known iteration complexity solving such a non-smooth problem is $\mathcal{O}(\varepsilon^{-1})$. In this paper, we show that by leveraging a local error bound condition on the dual function, the proposed algorithm can achieve a better primal convergence time of $\mathcal{O}\left(\varepsilon^{-2/(2+\beta)}\log_2(\varepsilon^{-1})\right)$, where $\beta\in(0,1]$ is a local error bound parameter. As an example application, we show that the distributed geometric median problem, which can be formulated as a constrained convex program, has its dual function non-smooth but satisfying the aforementioned local error bound condition with $\beta=1/2$, therefore enjoying a convergence time of $\mathcal{O}\left(\varepsilon^{-4/5}\log_2(\varepsilon^{-1})\right)$. This result improves upon the $\mathcal{O}(\varepsilon^{-1})$ convergence time bound achieved by existing distributed optimization algorithms. Simulation experiments also demonstrate the performance of our proposed algorithm.

Heterogeneous Bitwidth Binarization in Convolutional Neural Networks

Recent work has shown that performing inference with fast, very-low-bitwidth (e.g., 1 to 2 bits) representations of values can yield surprisingly accurate results. When coupled with FPGAs or custom hardware, the performance of binary models has been shown to be staggering. We seek to improve upon these designs by leveraging custom hardware's ability to perform true bitwise operations. To this end, we introduce the "middle-out" algorithm, which allows us to jointly learn the value of each parameter with its individual bitwidth, effectively allowing a model to have a fractional bitwidth. We find that heterogeneous representations are fundamentally more expressive than their integer counterparts. We verify this finding by training several models on ImageNet and show that with an average of 1.4 bits we are able to outperform state-of-the-art 2-bit architectures.

Unsupervised Learning of Object Landmarks through Conditional Image Generation

In this paper, we consider the problem of learning landmarks for object categories without any manual annotations. We cast this as the problem of conditionally generating an image of an object from another one, where the images differ by acquisition time and/or viewpoint. The process is aided by providing the generator with a keypoint-like representation extracted from the target image through a tight bottleneck. This encourages the representation to distil information about the object geometry, which changes from source to target, while the appearance, which is shared by the source and target, is read off from the source alone. Conditioning simplifies the generation task significantly, to the point that adopting a simple perceptual loss instead of more sophisticated approaches such as adversarial training is sufficient to learn landmarks. We show that our method is applicable to a large variety of datasets --- faces, people, 3D objects, and digits --- without any modifications. We further demonstrate that we can learn landmarks from synthetic image deformations or videos, all without manual supervision, while outperforming state-of-the-art unsupervised landmark detectors.

Probabilistic Neural Programmed Networks for Scene Generation

In this paper we address the text-to-scene image generation problem. Building generative models that capture the variability of complicated scenes containing rich semantics is a grand goal of image generation. Complicated scene images contain rich visual elements, compositional visual concepts, and complicated relations between objects. A generative model, viewed as an analysis-by-synthesis process, should encompass three core components: 1) the generation process that composes the scene; 2) the primitive visual elements and how they are composed; 3) the rendering of abstract concepts into their pixel-level realizations. We propose PNP-Net, a variational auto-encoder framework that addresses these three challenges: it flexibly composes images with a dynamic network structure, learns a set of distribution transformers that can compose distributions based on semantics, and decodes samples from these distributions into realistic images.

The streaming rollout of deep networks - towards fully model-parallel execution

Deep neural networks, and in particular recurrent networks, are promising candidates to control autonomous agents that interact in real-time with the physical world. However, this requires a seamless integration of temporal features into the network's architecture. For training and inference, recurrent neural networks are usually rolled out over time, and different rollouts exist. Conventionally, during inference the layers of a network are computed in a sequential manner, resulting in sparse temporal integration of information and long response times. In this study, we present a theoretical framework to describe the set of all rollouts and demonstrate their differences in solving specific tasks. We prove that certain rollouts, also with only skip and no recurrent connections, enable earlier and more frequent responses, and show empirically that these early responses have better performance. The streaming rollout maximizes these properties and, in addition, enables a fully parallel execution of the network, reducing the runtime on massively parallel devices. Additionally, we provide an open-source toolbox to design, train, evaluate, and online-interact with streaming rollouts.

KONG: Kernels for ordered-neighborhood graphs

We present novel graph kernels for graphs with node and edge labels that have ordered neighborhoods, i.e. when neighbor nodes follow an order. Graphs with ordered neighborhoods are a natural data representation for evolving graphs where edges are created over time, which induces an order. Combining convolutional subgraph kernels and string kernels, we design new scalable algorithms for generation of explicit graph feature maps using sketching techniques. We obtain precise bounds for the approximation accuracy and computational complexity of the proposed approaches and demonstrate their applicability on real datasets. In particular, our experiments demonstrate that neighborhood ordering results in more informative features. For the special case of general graphs, i.e. graphs without ordered neighborhoods, the new graph kernels yield efficient and simple algorithms for the comparison of label distributions between graphs.

GumBolt: Extending Gumbel trick to Boltzmann priors

Boltzmann machines (BMs) are appealing candidates as powerful priors in variational autoencoders (VAEs), as they are capable of capturing nontrivial and multi-modal distributions over discrete variables. However, the non-differentiability of the discrete units prohibits using the reparameterization trick, which is essential for low-noise backpropagation. The Gumbel trick resolves this problem in a consistent way by relaxing the variables and distributions, but it is incompatible with BM priors. Here, we propose GumBolt, a model that extends the Gumbel trick to BM priors in VAEs. GumBolt is significantly simpler than the recently proposed methods with BM priors and outperforms them by a considerable margin. It achieves state-of-the-art performance on permutation-invariant MNIST and OMNIGLOT datasets in the scope of models with only discrete latent variables. Moreover, the performance can be further improved by allowing multi-sampled (importance-weighted) estimation of the log-likelihood in training, which was not possible with previous works.
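
For context, the plain Gumbel trick that GumBolt extends looks like this for a single categorical variable (a generic sketch; the paper's contribution is making such relaxations work with Boltzmann machine priors, which this version does not support):

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Differentiable relaxation of sampling from a categorical:
    add Gumbel noise to the logits, then apply a temperature-tau softmax."""
    rng = rng or np.random.default_rng()
    y = (logits + rng.gumbel(size=logits.shape)) / tau
    e = np.exp(y - y.max())                 # numerically stable softmax
    return e / e.sum()
```

As tau -> 0 the output approaches a one-hot sample; larger tau gives smoother, lower-variance gradients.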

Neural Networks Trained to Solve Differential Equations Learn General Representations

We introduce a technique based on the singular vector canonical correlation analysis (SVCCA) for measuring the generality of neural network layers across a continuously-parametrized set of tasks. We illustrate this method by studying generality in neural networks trained to solve parametrized boundary value problems based on the Poisson partial differential equation. We find that the first hidden layer is general, and that deeper layers are successively more specific. Next, we validate our method against an existing technique that measures layer generality using transfer learning experiments. We find excellent agreement between the two methods, and note that our method is much faster, particularly for continuously-parametrized problems. Finally, we visualize the general representations of the first layers, and interpret them as generalized coordinates over the input domain.

Beauty-in-averageness and its contextual modulations: A Bayesian statistical account

Understanding how humans perceive the likability of high-dimensional ``objects'' such as faces is an important problem in both cognitive science and AI/ML. Existing models of human preferences generally assume these preferences to be fixed. However, human assessments of facial attractiveness have been found to be highly context-dependent. Specifically, the classical Beauty-in-Averageness (BiA) effect, whereby a face blended from two original faces is judged to be more attractive than the originals, is significantly diminished or reversed when the original faces are recognizable, or when the morph is mixed-race/mixed-gender and the attractiveness judgment is preceded by a race/gender categorization. This effect, dubbed Ugliness-in-Averageness (UiA), has previously been attributed to a disfluency account, which is qualitative and explains BiA only clumsily. We hypothesize, instead, that these contextual influences on face processing result from the dependence of attractiveness perception on an element of statistical typicality, and from an attentional mechanism that restricts face representation to a task-relevant subset of features, thus redefining typicality within that subspace. Furthermore, we propose a principled explanation of why statistically atypical objects are less likable: they incur greater encoding or processing cost associated with a greater prediction error, when the brain uses predictive coding to compare the actual stimulus properties with those expected from its associated categorical prototype. We use simulations to show that our model provides a parsimonious, statistically grounded, and quantitative account of the contextual dependence of attractiveness. We also validate our model using experimental data from a gender categorization task. Finally, we make model predictions for a proposed experiment that can disambiguate the previous disfluency account and our statistical typicality theory.

Distributed Weight Consolidation: A Brain Segmentation Case Study

Collecting the large datasets needed to train deep neural networks can be very difficult, particularly for the many applications for which sharing and pooling data is complicated by practical, ethical, or legal concerns. However, it may be the case that derivative datasets or predictive models developed within individual sites can be shared and combined with fewer restrictions. Training on distributed data and combining the resulting networks is often viewed as continual learning, but these methods require networks to be trained sequentially. In this paper, we introduce distributed weight consolidation (DWC), a continual learning method to consolidate the weights of separate, distributed neural networks, each trained on an independent dataset. We perform a brain segmentation case study using DWC to consolidate several dilated convolutional neural networks trained on independent structural magnetic resonance imaging (sMRI) datasets from different sites. We found that DWC led to increased performance on held-out test sets from the different sites, as well as on a very large and completely independent multi-site dataset. This demonstrates the feasibility of DWC for combining the knowledge learned by networks trained on different datasets.

Efficient Projection onto the Perfect Phylogeny Model

Several algorithms build on the perfect phylogeny model to infer evolutionary trees. This problem is particularly hard, when evolutionary trees are inferred from the fraction of genomes that have mutations in different positions, across different samples. Existing algorithms require exhaustive searches over the space of possible trees. At the center of these algorithms is a projection problem that assigns a fitness cost to phylogenetic trees. In order to perform a wide search over the space of the trees, it is critical to solve this projection problem fast. In this paper, we use Moreau's decomposition for proximal operators, and a tree reduction scheme, to develop a new algorithm to compute this projection. Our algorithm terminates with an exact solution in a finite number of steps, and is extremely fast. In particular, it can search over all evolutionary trees with fewer than 11 nodes, a size relevant for several biological problems (more than 2 billion trees) in about 2 hours.

TETRIS: TilE-matching the TRemendous Irregular Sparsity

Compressing neural networks by pruning weights with small magnitudes can significantly reduce the computation and storage cost. Although pruning makes the model smaller, it is difficult to get practical speedup in modern computing platforms such as CPU and GPU due to the irregularity. Recent efforts on hardware-friendly pruning involve structured sparsity with different granularity and dimensionality. Simply increasing the sparsity granularity can lead to better hardware utilization, but it will compromise the sparsity for maintaining accuracy. In this work, we propose a novel method, TETRIS, to achieve both better hardware utilization and higher sparsity. Just like a tile-matching game, we cluster the irregularly distributed weights with small value into structured blocks by reordering the input/output dimension and structurally prune them. Results show that it can achieve comparable sparsity with the irregular element-wise pruning and demonstrate negligible accuracy loss. Ideal speedup, proportional to the sparsity, is experimentally demonstrated. Our proposed method provides a new solution toward algorithm and architecture co-optimization for the accuracy-efficiency trade-off.

Cooperative neural networks (CoNN): Exploiting prior independence structure for improved classification

We propose a new approach, called cooperative neural networks (CoNN), which uses a set of cooperatively trained neural networks to capture latent representations that exploit independence structure given as a prior. The model is more flexible than traditional graphical models based on exponential family distributions, but incorporates more domain-specific prior structure than traditional deep networks or variational autoencoders. The framework is very general and can be used to exploit the independence structure of any graphical model. We illustrate the technique by showing that we can transfer the independence structure of the popular Latent Dirichlet Allocation (LDA) model to a cooperative neural network, CoNN-sLDA. Empirical evaluation of CoNN-sLDA on supervised text classification tasks demonstrates that the theoretical advantages of prior independence structure can be realized in practice: we demonstrate a 23 percent reduction in error on the challenging MultiSent data set compared to the state-of-the-art.

Differentially Private Robust PCA

In this paper, we initiate the study of the following problem: given a private matrix $A \in \R^{n \times d}$, output a rank-$k$ matrix $B$, while satisfying differential privacy, such that $ \norm{ A - B }_p \leq \alpha \mathsf{OPT}_k(A) + \gamma,$ where $\norm{ M }_p$ is the entry-wise $\ell_p$ norm and $\mathsf{OPT}_k(A):=\min_{\mathsf{rank}(X) \leq k} \norm{ A - X}_p$. It is well known that low-rank approximation w.r.t. entrywise $\ell_p$-norm, for $p \in [1,2)$, yields robustness to gross outliers in the data. We propose an algorithm that guarantees $\alpha=\widetilde{O}(k^2), \gamma=\widetilde{O}(k(n+d)/\varepsilon)$, runs in $\widetilde O((n+d)\,\mathrm{poly}(k))$ time and uses $O(k(n+d)\log k)$ space. This is an {\em exponential improvement} in $\alpha$ over known differentially private algorithms for $p=2$. We also study extensions to the streaming setting where entries of the matrix arrive in an arbitrary order and output is produced at the very end or continually. We also study the related problem of differentially private robust subspace learning that requires us to output a rank-$k$ projection matrix $\Pi$ such that $\norm{ A - A \Pi }_p \leq \alpha \mathsf{OPT}_k(A) + \tau.$

Meta-Learning MCMC Proposals

Effective implementations of sampling-based probabilistic inference often require manually constructed, model-specific proposals. Inspired by recent progress in meta-learning for training learning agents that can generalize to unseen environments, we propose a meta-learning approach to building effective and generalizable MCMC proposals. We parametrize the proposal as a neural network to provide fast approximations to block Gibbs conditionals. The learned neural proposals generalize to occurrences of common structural motifs across different models, allowing for the construction of a library of learned inference primitives that can accelerate inference on unseen models with no model-specific training required. We explore several applications including open-universe Gaussian mixture models, in which our learned proposals outperform a hand-tuned sampler, and a real-world named entity recognition task, in which our sampler yields higher final F1 scores than classical single-site Gibbs sampling.

An Information-Theoretic Analysis of Thompson Sampling for Large Action Spaces

Information-theoretic Bayesian regret bounds of Russo and Van Roy capture the dependence of regret on prior uncertainty. However, this dependence is through entropy, which can become arbitrarily large as the number of actions increases. We establish new bounds that depend instead on a notion of rate-distortion. Among other things, this allows us to recover through information-theoretic arguments a near-optimal bound for the linear bandit. We also offer a bound for the logistic bandit that dramatically improves on the best previously available, though this bound depends on an information-theoretic statistic that we have only been able to quantify via computation.

The Price of Privacy for Low-rank Factorization

In this paper, we study what price one has to pay to release \emph{differentially private low-rank factorization} of a matrix. We consider various settings that are close to the real world applications of low-rank factorization: (i) the manner in which matrices are updated (row by row or in an arbitrary manner), (ii) whether matrices are distributed or not, and (iii) how the output is produced (once at the end of all updates, also known as \emph{one-shot algorithms}, or continually). Even though these settings are well studied without privacy, surprisingly, there are no private algorithms for these settings (except when a matrix is updated row by row). We present the first set of differentially private algorithms for all these settings. When the private matrix is updated in an arbitrary manner, our algorithms promise differential privacy with respect to two stronger privacy guarantees than previously studied, use space and time \emph{comparable} to the non-private algorithms, and achieve \emph{optimal accuracy}. To complement our positive results, we also prove that the space required by our algorithms is optimal up to logarithmic factors. When data matrices are distributed over multiple servers, we give a non-interactive differentially private algorithm with communication cost independent of dimension. In short, we give algorithms that incur {\em optimal cost across all parameters of interest}. We also perform experiments to verify that all our algorithms perform well in practice and outperform the previously best known algorithms for a large range of parameters.

Regret Bounds for Robust Adaptive Control of the Linear Quadratic Regulator

We consider adaptive control of the Linear Quadratic Regulator (LQR), where an unknown linear system is controlled subject to quadratic costs. Leveraging recent developments in the estimation of linear systems and in robust controller synthesis, we present the first provably polynomial time algorithm that achieves sub-linear regret on this problem. We further study the interplay between regret minimization and parameter estimation by proving a lower bound on the expected regret in terms of the exploration schedule used by any algorithm. Finally, we conduct a numerical study comparing our robust adaptive algorithm to other methods from the adaptive LQR literature, and demonstrate the flexibility of our proposed method by extending it to a demand forecasting problem subject to state constraints.

Bilevel Distance Metric Learning for Robust Image Recognition

Metric learning, which aims to learn a discriminative Mahalanobis distance matrix M that can effectively reflect the similarity between data samples, has been widely studied in various image recognition problems. Most existing metric learning methods take as input features extracted directly from the original data in a preprocessing phase. Worse, these features usually take no account of the local geometric structure of the data or the noise present in it, so they may not be optimal for the subsequent metric learning task. In this paper, we integrate both feature extraction and metric learning into one joint optimization framework and propose a new bilevel distance metric learning model. Specifically, the lower level characterizes the intrinsic data structure using graph regularized sparse coefficients, while the upper level forces the data samples from the same class to be close to each other and pushes those from different classes far away. In addition, leveraging the KKT conditions and the alternating direction method (ADM), we derive an efficient algorithm to solve the proposed new model. Extensive experiments on various occluded datasets demonstrate the effectiveness and robustness of our method.

Differentially Private Uniformly Most Powerful Tests for Binomial Data

We derive uniformly most powerful (UMP) tests for simple and one-sided hypotheses for a population proportion within the framework of Differential Privacy (DP), optimizing finite sample performance. We show that in general, DP hypothesis tests for exchangeable data can always be expressed as a function of the empirical distribution. Using this structure, we prove a ‘Neyman-Pearson lemma’ for binomial data under DP, where the DP-UMP only depends on the sample sum. Our tests can also be stated as a post-processing of a random variable, whose distribution we coin “Truncated-Uniform-Laplace” (Tulap), a generalization of the Staircase and discrete Laplace distributions. Furthermore, we obtain exact p-values, which are easily computed in terms of the Tulap random variable. We show that our results also apply to distribution-free hypothesis tests for continuous data. Our simulation results demonstrate that our tests have exact type I error, and are more powerful than current techniques.

Scalable Coordinated Exploration in Concurrent Reinforcement Learning

We consider a team of reinforcement learning agents that concurrently operate in a common environment, and we develop an approach to efficient coordinated exploration that is suitable for problems of practical scale. Our approach builds on the seed sampling concept introduced in Dimakopoulou and Van Roy (2018) and on a randomized value function learning algorithm from Osband et al. (2016). We demonstrate that, for simple tabular contexts, the approach is competitive with those previously proposed in Dimakopoulou and Van Roy (2018), and that, for a higher-dimensional problem with a neural network value function representation, the approach learns quickly with far fewer agents than alternative exploration schemes.

Integrated accounts of behavioral and neuroimaging data using flexible recurrent neural network models

Neuroscience studies of human decision-making abilities commonly involve subjects completing a decision-making task while BOLD signals are recorded using fMRI. Hypotheses are tested about which brain regions mediate the effect of past experience, such as rewards, on future actions. One standard approach to this is model-based fMRI data analysis, in which a model is fitted to the behavioral data, i.e., a subject's choices, and then the neural data are parsed to find brain regions whose BOLD signals are related to the model's internal signals. However, the internal mechanics of such purely behavioral models are not constrained by the neural data, and therefore might miss or mischaracterize aspects of the brain. To address this limitation, we introduce a new method using recurrent neural network models that are flexible enough to be fitted jointly to the behavioral and neural data. We trained a model so that its internal states were suitably related to neural activity during the task, while at the same time its output predicted the next action a subject would execute. We then used the fitted model to create a novel visualization of the relationship between the activity in brain regions at different times following a reward and the choices the subject subsequently made. Finally, we validated our method using a previously published dataset. We showed that the model was able to recover the underlying neural substrates that were discovered by explicit model engineering in the previous work, and also derived new results regarding the temporal pattern of brain activity.

BML: A High-performance, Low-cost Gradient Synchronization Algorithm for DML Training

In distributed machine learning (DML), the network performance between machines significantly impacts the speed of iterative training. In this paper we propose BML, a new gradient synchronization algorithm with higher network performance and lower network cost than the current practice. BML runs on a BCube network instead of the traditional Fat-Tree topology. The BML algorithm is designed such that, compared to the parameter server (PS) algorithm on a Fat-Tree network connecting the same number of server machines, BML theoretically achieves 1/k of the gradient synchronization time using k/5 of the switches (typical values of k are 2-4). Experiments with MNIST and VGG-19 benchmarks on a testbed of 9 dual-GPU servers show that BML reduces the job completion time of DML training by up to 56.4%.

Inexact trust-region algorithm on Riemannian manifolds

We consider an inexact variant of the popular Riemannian trust-region algorithm for structured big-data minimization problems. The proposed algorithm approximates the gradient and the Hessian in addition to the solution of a (trust-region) sub-problem. We provide a total complexity bound for achieving $\epsilon$-approximate second-order optimality under mild conditions on the inexact gradient and Hessian. Addressing large-scale finite-sum problems, we propose a sub-sampled algorithm as a practical variant, where the gradient and Hessian are generated by a random sampling technique. Numerical evaluations on principal components analysis (PCA) and matrix completion (MC) problems demonstrate that the proposed algorithm outperforms state-of-the-art Riemannian deterministic and stochastic gradient algorithms.

Can We Gain More from Orthogonality Regularizations in Training Deep Networks?

This paper seeks to answer the question: as the (near-)orthogonality of weights is found to be a favorable property for training deep convolutional neural networks, how can we enforce it in more effective and easy-to-use ways? We develop novel orthogonality regularizations for training deep CNNs, utilizing various advanced analytical tools such as mutual coherence and the restricted isometry property. These plug-and-play regularizations can be conveniently incorporated into training almost any CNN without extra hassle. We then benchmark their effects on three state-of-the-art models: ResNet, WideResNet, and ResNeXt, on the CIFAR-10 and CIFAR-100 datasets. We observe consistent performance gains after applying the proposed regularizations, in terms of both final accuracy and faster, more stable convergence.
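
For concreteness, a minimal PyTorch sketch of one such plug-and-play penalty: the simple soft-orthogonality form ||W W^T - I||_F^2 on the flattened filters, plus a mutual-coherence-style surrogate. The names and exact normalization are ours; the paper's regularizers (and its RIP-based variant) differ in detail.

```python
import torch

def orth_penalty(weight, mode="so"):
    # Flatten a conv kernel (out_ch, in_ch, kh, kw) to one row per filter.
    W = weight.reshape(weight.shape[0], -1)
    gram = W @ W.t()
    eye = torch.eye(gram.shape[0], device=W.device)
    if mode == "so":    # soft orthogonality: ||W W^T - I||_F^2
        return ((gram - eye) ** 2).sum()
    if mode == "mc":    # mutual-coherence-style: worst off-diagonal entry
        return (gram - eye).abs().max()
    raise ValueError(mode)

conv = torch.nn.Conv2d(16, 32, 3)
reg = 1e-4 * orth_penalty(conv.weight)   # add this term to the task loss
print(float(reg))
```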

Binary Rating Estimation with Graph Side Information

While rich experimental evidence shows that one can better estimate users' unknown ratings with the aid of graph side information such as social graphs, theoretical understanding of this is still lacking. In this work, we study the binary rating estimation problem to understand the fundamental value of graph side information. Considering a simple correlation model between a rating matrix and a graph, we characterize the sharp threshold on the number of observed entries required to recover the rating matrix (called the optimal sample complexity) as a function of the quality of the graph side information (to be detailed). To the best of our knowledge, we are the first to reveal when and by how much graph side information reduces sample complexity. Further, we propose a computationally efficient algorithm that achieves this limit. Our experimental results on synthetic and real-world data demonstrate the superiority of our algorithm over the state of the art.

SimplE Embedding for Link Prediction in Knowledge Graphs

Knowledge graphs contain knowledge about the world and provide a structured representation of this knowledge. Current knowledge graphs contain only a small subset of what is true in the world. Link prediction approaches aim at predicting new links for a knowledge graph given the existing links between the entities. Tensor factorization approaches have proved promising for such link prediction problems. Proposed in 1927, Canonical Polyadic (CP) decomposition is among the first tensor factorization approaches. CP generally performs poorly for link prediction because it learns two independent embedding vectors for each entity, whereas the two are in fact tied. We present a simple enhancement of CP (which we call SimplE) that allows the two embeddings of each entity to be learned dependently. The complexity of SimplE grows linearly with the size of embeddings. The embeddings learned through SimplE are interpretable, and certain types of background knowledge in terms of logical rules can be incorporated into these embeddings through weight tying. We prove SimplE is fully expressive and derive a bound on the size of its embeddings for full expressivity. We show empirically that, despite its simplicity, SimplE outperforms several state-of-the-art tensor factorization techniques.
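
The scoring function described here is simple enough to sketch. Below, each entity has a "head" and a "tail" embedding, tied through an inverse-relation embedding; the array names are ours, not the paper's.

```python
import numpy as np

def cp3(a, b, c):
    # Trilinear product <a, b, c> = sum_k a_k * b_k * c_k.
    return np.sum(a * b * c)

def simple_score(head, rel, tail, H, T, V, V_inv):
    # Average the forward CP score with the score of the inverse triple,
    # which ties the two embeddings of every entity together.
    forward = cp3(H[head], V[rel], T[tail])
    backward = cp3(H[tail], V_inv[rel], T[head])
    return 0.5 * (forward + backward)

d, n_ent, n_rel = 32, 100, 7
rng = np.random.default_rng(0)
H, T = rng.normal(size=(n_ent, d)), rng.normal(size=(n_ent, d))
V, V_inv = rng.normal(size=(n_rel, d)), rng.normal(size=(n_rel, d))
print(simple_score(3, 1, 42, H, T, V, V_inv))
```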

Differentially Private Contextual Linear Bandits

We study the contextual linear bandit problem, a version of the standard stochastic multi-armed bandit (MAB) problem, where a learner sequentially selects actions to maximize a reward which also depends on a user-provided per-round context. Though the context is chosen arbitrarily or adversarially, the reward is assumed to be a stochastic function of a feature vector that encodes the context and selected action. Our goal is to devise private learners for the contextual linear bandit problem. We first show that using the standard definition of differential privacy results in linear regret. So instead, we adopt the notion of joint differential privacy, where we assume that the action chosen on day t is only revealed to user t and thus needn't be kept private that day, only on following days. We give a general scheme converting the classic linear-UCB algorithm into a jointly differentially private algorithm using the tree-based algorithm. We then apply either Gaussian noise or Wishart noise to achieve jointly differentially private algorithms and bound the resulting algorithms' regrets. In addition, we give the first lower bound on the additional regret any private algorithm for the MAB problem must incur.

Submodular Field Grammars: Representation, Inference, and Application to Image Parsing

Natural scenes contain many layers of part-subpart structure, and distributions over them are thus naturally represented by stochastic image grammars, with one production per decomposition of a part. Unfortunately, in contrast to language grammars, where the number of possible split points for a production $A \rightarrow BC$ is linear in the length of $A$, in an image there is an exponential number of ways to split a region into subregions. This makes parsing intractable and requires image grammars to be severely restricted in practice, for example by allowing only rectangular regions. In this paper, we address this problem by associating with each production a submodular Markov random field whose labels are the subparts and whose labeling segments the current object into these subparts. We call the result a submodular field grammar (SFG). Finding the MAP split of a region into subregions is now tractable, and by exploiting this we develop an efficient approximate algorithm for MAP parsing of images with SFGs. Empirically, we present promising improvements in accuracy when using SFGs for scene understanding, and show exponential improvements in inference time compared to traditional methods, while returning comparable minima.

A Bridging Framework for Model Optimization and Deep Propagation

Optimizing designed mathematical models is a fundamental methodology in the statistics and learning communities. However, hand-designed schematic models may struggle to capture the complex data distributions of real-world scenarios. Recently, training deep propagations (i.e., networks) has achieved promising performance in some particular applications. Unfortunately, existing network structures are often built in a heuristic way, and thus lack principled interpretation and solid theoretical support. In this work, we provide a new framework, named Propagative Convergent Network (PCN), to bridge the gaps between these two different methodologies (i.e., model optimization and deep propagation) in a collaborative manner. On the one hand, we demonstrate how to utilize PCN as a deeply-trained solver for nonconvex model optimization. Unlike other network-based iterations, which often lack theoretical investigation, we can prove the strict convergence of PCN and estimate its convergence rate. On the other hand, by relaxing the constraints and performing end-to-end training, we also develop a PCN-based strategy to integrate domain knowledge (formulated as models) and data distributions (learned by networks), resulting in a generic ensemble learning framework for challenging applications. Extensive experiments verify our theoretical results and show the superiority of PCN against state-of-the-art methods.

Completing State Representations using Spectral Learning

A central problem in dynamical system modeling is state discovery---that is, finding a compact summary of the past that captures the information needed to predict the future. Predictive State Representations (PSRs) enable clever spectral methods for state discovery; however, while consistent in the limit of infinite data, these methods often suffer from poor performance in the low data regime. In this paper we develop a novel algorithm for incorporating domain knowledge, in the form of an imperfect state representation, as side information to speed spectral learning for PSRs. We prove theoretical results characterizing the relevance of a user-provided state representation, and design spectral algorithms that can take advantage of a relevant representation. Our algorithm utilizes principal angles to extract the relevant components of the representation, and is robust to mis-specification. Empirical evaluation on synthetic HMMs, an aircraft identification domain, and a gene splice dataset shows that, even with weak domain knowledge, the algorithm can significantly outperform standard PSR learning.

Optimization of Smooth Functions with Noisy Observations: Local Minimax Rates

We consider the problem of global optimization of an unknown non-convex smooth function with noisy zeroth-order feedback. We propose a local minimax framework to study the fundamental difficulty of optimizing smooth functions with adaptive function evaluations. We show that for functions with fast growth around their global minima, carefully designed optimization algorithms can identify a near global minimizer with many fewer queries than worst-case global minimax theory predicts. For the special case of strongly convex and smooth functions, our implied convergence rates match the ones developed for zeroth-order convex optimization problems. On the other hand, we show that in the worst case no algorithm can converge faster than the minimax rate of estimating an unknown function in $\ell_\infty$-norm. Finally, we show that non-adaptive algorithms, although optimal in a global minimax sense, do not attain the optimal local minimax rate.

Adding One Neuron Can Eliminate All Bad Local Minima

One of the main difficulties in analyzing neural networks is the non-convexity of the loss function which may have many bad local minima. In this paper, we study the landscape of neural networks for binary classification tasks. Under mild assumptions, we prove that after adding one special neuron with a skip connection to the output, or one special neuron per layer, every local minimum is a global minimum.

Mean-field theory of graph neural networks in graph partitioning

A theoretical performance analysis of the graph neural network (GNN) is presented. For classification tasks, the neural network approach has the advantage of flexibility in that it can be employed in a data-driven manner, whereas Bayesian inference requires the assumption of a specific model. A fundamental question is then whether GNNs achieve high accuracy in addition to this flexibility. Moreover, whether the achieved performance is predominantly a result of backpropagation or of the architecture itself is a matter of considerable interest. To gain a better insight into these questions, a mean-field theory of a minimal GNN architecture is developed for the graph partitioning problem. The theory shows good agreement with numerical experiments.

The Physical Systems Behind Optimization Algorithms

We use differential-equation-based approaches to provide some {\it \textbf{physics}} insights into analyzing the dynamics of popular optimization algorithms in machine learning. In particular, we study gradient descent, proximal gradient descent, coordinate gradient descent, proximal coordinate gradient, and Newton's method, as well as their Nesterov-accelerated variants, in a unified framework motivated by a natural connection of optimization algorithms to physical systems. Our analysis is applicable to more general algorithms and optimization problems {\it \textbf{beyond}} convexity and strong convexity, e.g. under Polyak-Łojasiewicz and error bound conditions (possibly nonconvex).

Top-k lists: Models and Algorithms

The classic Mallows model is a widely-used tool to realize distributions on permutations. Motivated by common practical situations, in this paper we generalize Mallows to model distributions on top-$k$ lists by using a suitable distance measure between top-$k$ lists. Unlike much earlier work, our model is both analytically tractable and computationally efficient. We demonstrate this by studying two basic problems in this model, namely, sampling and reconstruction, from both algorithmic and practical points of view.
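
For orientation (our illustration, not the paper's sampler): the classic Mallows model can be sampled by repeated insertion, and truncating a full draw to its first k entries gives a naive top-$k$ baseline; the paper instead works directly with a distance between top-$k$ lists.

```python
import numpy as np

def mallows_sample(n, phi, rng):
    # Repeated-insertion sampler for the classic Mallows model with
    # dispersion phi in (0, 1], centered on the identity permutation.
    perm = []
    for i in range(1, n + 1):
        # Insert item i at (0-indexed) position j with probability
        # proportional to phi ** (i - 1 - j).
        weights = phi ** np.arange(i - 1, -1, -1)
        j = rng.choice(i, p=weights / weights.sum())
        perm.insert(j, i)
    return perm

def topk_sample(n, k, phi, rng):
    # Naive top-k list: keep the first k entries of a full Mallows draw.
    # Only a baseline; the abstract's model uses a top-k distance directly.
    return mallows_sample(n, phi, rng)[:k]

rng = np.random.default_rng(1)
print(topk_sample(10, 3, 0.5, rng))
```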

Maximum Causal Tsallis Entropy Imitation Learning

In this paper, we propose a novel maximum causal Tsallis entropy (MCTE) framework for imitation learning which can efficiently learn a sparse multi-modal policy distribution from demonstrations. We provide a full mathematical analysis of the proposed framework. First, the optimal solution of an MCTE problem is shown to be a sparsemax distribution, whose supporting set can be adjusted. The proposed method has an advantage over a softmax distribution in that it can exclude unnecessary actions by assigning them zero probability. Second, we prove that an MCTE problem is equivalent to robust Bayes estimation in the sense of the Brier score. Third, we propose a maximum causal Tsallis entropy imitation learning (MCTEIL) algorithm with a sparse mixture density network (sparse MDN), modeling mixture weights with a sparsemax distribution. In particular, we show that the causal Tsallis entropy of an MDN encourages exploration and efficient mixture utilization, while Boltzmann-Gibbs entropy is less effective. We validate the proposed method in two simulation studies, where MCTEIL outperforms existing imitation learning methods in terms of average returns and learning multi-modal policies.
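
Since the optimal MCTE policy is a sparsemax distribution, a small sketch of sparsemax (Martins & Astudillo, 2016) shows how exactly zero probability can be assigned to unnecessary actions:

```python
import numpy as np

def sparsemax(z):
    # Euclidean projection of z onto the probability simplex.
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    support = k * z_sorted > cumsum - 1   # 1 + k*z_(k) > sum_{j<=k} z_(j)
    k_z = k[support][-1]                  # size of the support
    tau = (cumsum[k_z - 1] - 1) / k_z     # threshold
    return np.maximum(z - tau, 0.0)

print(sparsemax([1.2, 0.9, -1.0]))   # zero mass on the clearly bad action
```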

Limited Memory Kelley's Method Converges for Composite Convex and Submodular Objectives

The original simplicial method ({\sc OSM}), a variant of the classic Kelley's cutting plane method, has been shown to converge to the minimizer of composite convex and submodular objectives, though no rate of convergence for this method was known. Moreover, {\sc OSM} is required to solve subproblems in each iteration whose size grows linearly in the number of iterations. We propose a limited memory version of Kelley's method ({\sc L-KM}) that is a novel adaptation of {\sc OSM} and requires limited memory independent of the iteration (at most $n+1$ constraints for an $n$-dimensional problem), while maintaining convergence to the optimal solution. We further show that the dual method of {\sc L-KM} is a special case of the Fully-Corrective Frank-Wolfe ({\sc FCFW}) method with approximate correction, thereby deriving a limited memory version of {\sc FCFW} method and proving a rate of convergence for {\sc L-KM}. Though we propose {\sc L-KM} for minimizing composite convex and submodular objectives, our results on limited memory version of FCFW hold for general polytopes, which is of independent interest.

Semi-Supervised Learning with Declaratively Specified Entropy Constraints

We propose a technique for declaratively specifying strategies for semi-supervised learning (SSL). The proposed method can be used to specify ensembles of semi-supervised learners, as well as agreement constraints and entropic regularization constraints between these learners, and can be used to model both well-known heuristics such as co-training, and novel domain-specific heuristics. Our technique achieves consistent improvements over prior frameworks for specifying SSL techniques on a suite of well-studied SSL benchmarks, and obtains a new state-of-the-art result on a difficult relation extraction task.

End-to-end Symmetry Preserving Inter-atomic Potential Energy Model for Finite and Extended Systems

Machine learning models are changing the paradigm of molecular modeling, which is a fundamental tool for material science, chemistry, and computational biology. Of particular interest is the inter-atomic potential energy surface (PES). Here we develop Deep Potential - Smooth Edition (DeepPot-SE), an end-to-end machine learning-based PES model, which is able to efficiently represent the PES for a wide variety of systems with the accuracy of ab initio quantum mechanics models. By construction, DeepPot-SE is extensive and continuously differentiable, scales linearly with system size, and preserves all the natural symmetries of the system. Further, we show that DeepPot-SE describes finite and extended systems including organic molecules, metals, semiconductors, and insulators with high fidelity.

Sparsified SGD with Memory

Nowadays machine learning applications require stochastic optimization algorithms that can be implemented on distributed systems. The communication overhead of the algorithms is a key bottleneck that hinders perfect scalability. Various recent works have proposed using quantization or sparsification techniques to reduce the amount of data that needs to be communicated, for instance by only sending the most significant entries of the stochastic gradient (top-k sparsification). While this scheme shows good performance in practice, it has so far eluded theoretical analysis. In this work we analyze a variant of Stochastic Gradient Descent (SGD) with k-sparsification (for instance top-k or random-k) and show that this scheme converges at the same rate as vanilla SGD. That is, the communication can be reduced by a factor of the dimension of the problem (sometimes even more) whilst still converging at the same rate. We present numerical experiments to illustrate the theoretical findings and especially the better scalability for distributed applications.
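
A minimal sketch of the scheme as we read it: top-k sparsification with an error-feedback memory that re-injects the dropped coordinates into later steps (toy quadratic objective, our variable names):

```python
import numpy as np

def top_k(v, k):
    # Keep only the k largest-magnitude entries of v.
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out = np.zeros_like(v)
    out[idx] = v[idx]
    return out

def sgd_with_memory(grad_fn, x0, lr=0.1, k=2, steps=500):
    x, mem = x0.copy(), np.zeros_like(x0)
    for _ in range(steps):
        corrected = lr * grad_fn(x) + mem   # re-inject past residuals
        update = top_k(corrected, k)        # communicate only k entries
        mem = corrected - update            # remember what was dropped
        x -= update
    return x

# toy quadratic: minimize 0.5 * ||x||^2, so grad(x) = x
x = sgd_with_memory(lambda x: x, np.ones(10))
print(np.linalg.norm(x))   # close to 0 despite k << dimension
```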

Exponentiated Strongly Rayleigh Distributions

Strongly Rayleigh (SR) measures are discrete probability distributions over the subsets of a ground set. They enjoy strong negative dependence properties, as a result of which they assign higher probability to subsets of diverse elements. We introduce in this paper Exponentiated Strongly Rayleigh (ESR) measures, which sharpen (or smoothen) the negative dependence property of SR measures via a single parameter (the exponent) that can be intuitively understood as an inverse temperature. We develop efficient MCMC procedures for approximate sampling from ESRs, and obtain explicit mixing time bounds for two concrete instances: exponentiated versions of Determinantal Point Processes and Dual Volume Sampling. We illustrate some of the potential of ESRs by applying them to a few machine learning tasks; empirical results confirm that beyond their theoretical appeal, ESR-based models hold significant promise for these tasks.

Importance Weighting and Variational Inference

Recent work used importance sampling ideas for better variational bounds on likelihoods. We clarify the applicability of these ideas to pure probabilistic inference, by showing the resulting Importance Weighted Variational Inference (IWVI) technique is an instance of augmented variational inference, thus identifying the looseness in previous work. Experiments confirm IWVI's practicality for probabilistic inference. As a second contribution, we investigate inference with elliptical distributions, which improves accuracy in low dimensions, and convergence in high dimensions.
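
A toy illustration of the underlying importance-weighted bound (a known construction, not the paper's elliptical extension): with z_k drawn from a proposal q, the log of the averaged importance weights lower-bounds log p(x) and tightens as K grows.

```python
import numpy as np
from scipy.stats import norm

# Toy model: p(z) = N(0,1), p(x|z) = N(z,1); proposal q(z) = N(0.5, 1.2^2).
rng = np.random.default_rng(0)
x = 1.0

def iw_bound(K, n_rep=5000):
    z = rng.normal(0.5, 1.2, size=(n_rep, K))
    log_w = (norm.logpdf(z, 0, 1) + norm.logpdf(x, z, 1)   # log p(x, z)
             - norm.logpdf(z, 0.5, 1.2))                   # - log q(z)
    # log-mean-exp over the K samples, averaged over replicates
    m = log_w.max(axis=1, keepdims=True)
    lme = (m + np.log(np.exp(log_w - m).mean(axis=1, keepdims=True))).ravel()
    return lme.mean()

for K in (1, 10, 100):
    print(K, iw_bound(K))   # increases toward log p(x) = log N(x; 0, 2)
```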

Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis

We describe a neural network-based system for text-to-speech (TTS) synthesis that is able to generate speech audio in the voice of many different speakers, including those unseen during training. Our system consists of three independently trained components: (1) a speaker encoder network, trained on a speaker verification task using an independent dataset of noisy speech from thousands of speakers without transcripts, to generate a fixed-dimensional embedding vector from seconds of reference speech from a target speaker; (2) a sequence-to-sequence synthesis network based on Tacotron 2, which generates a mel spectrogram from text, conditioned on the speaker embedding; (3) an auto-regressive WaveNet-based vocoder that converts the mel spectrogram into a sequence of time domain waveform samples. We demonstrate that the proposed model is able to transfer the knowledge of speaker variability learned by the discriminatively-trained speaker encoder to the new task, and is able to synthesize natural speech from speakers that were not seen during training. We quantify the importance of training the speaker encoder on a large and diverse speaker set in order to obtain the best generalization performance. Finally, we show that randomly sampled speaker embeddings can be used to synthesize speech in the voice of novel speakers dissimilar from those used in training, indicating that the model has learned a high quality speaker representation.

Expanding Holographic Embeddings for Knowledge Completion

Neural models operating over structured spaces such as knowledge graphs require a continuous embedding of the discrete elements of this space (such as entities) as well as the relationships between them. Relational embeddings with high expressivity, however, have high model complexity, making them computationally difficult to train. We propose a new family of embeddings for knowledge graphs that interpolate between a method with high model complexity and one, namely Holographic embeddings, with low dimensionality and high training efficiency. This interpolation, termed HolEx, is achieved by concatenating several linearly perturbed copies of the original holographic embedding. We formally characterize the number of perturbed copies to provably recover the full relational interaction matrix between entities, leveraging ideas from Haar wavelets and compressed sensing. In practice, we find that using just a handful of perturbation vectors results in a much stronger knowledge completion system. On the Freebase FB15K dataset, HolEx outperforms original holographic embeddings by 13.7% on the HITS@10 metric, and the current state-of-the-art by 3.1% (absolute).
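
A sketch of the base holographic score (circular correlation via FFT) and one hedged reading of the HolEx idea, summing scores of linearly perturbed copies of the head embedding; the perturbation vectors and per-copy relation embeddings here are illustrative assumptions.

```python
import numpy as np

def hole_score(h, r, t):
    # Holographic score r . (h * t), where * is circular correlation,
    # computed in O(d log d) with the FFT identity F(h * t) = conj(F(h)) F(t).
    corr = np.fft.irfft(np.conj(np.fft.rfft(h)) * np.fft.rfft(t), n=len(h))
    return float(r @ corr)

def holex_score(h, r_list, t, perturb):
    # Combine several linearly perturbed copies (c_j * h) of the head
    # embedding; equivalent to scoring against a concatenation of copies.
    return sum(hole_score(c * h, r, t) for c, r in zip(perturb, r_list))

d, copies = 16, 4
rng = np.random.default_rng(0)
h, t = rng.normal(size=d), rng.normal(size=d)
perturb = rng.choice([-1.0, 1.0], size=(copies, d))   # e.g. sign flips
r_list = rng.normal(size=(copies, d))
print(holex_score(h, r_list, t, perturb))
```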

Explaining Deep Learning Models -- A Bayesian Non-parametric Approach

Understanding and interpreting how machine learning (ML) models make decisions has been a big challenge. While recent research has proposed various technical approaches that provide some clues as to how an ML model makes individual predictions, they cannot give users the ability to inspect a model as a complete entity. In this work, we propose a novel technical approach that augments a Bayesian non-parametric regression mixture model with multiple elastic nets. Using the enhanced mixture model, we can extract generalizable insights for a target model through a global approximation. To demonstrate the utility of our approach, we evaluate it on different ML models in the context of image recognition. The empirical results indicate that our proposed approach not only outperforms the state-of-the-art techniques in explaining individual decisions but also provides users with the ability to discover the vulnerabilities of the target ML models.

Third-order Smoothness Helps: Faster Stochastic Optimization Algorithms for Finding Local Minima

We propose stochastic optimization algorithms that can find local minima faster than existing algorithms for nonconvex optimization problems, by exploiting third-order smoothness to escape non-degenerate saddle points more efficiently. More specifically, the proposed algorithm only needs $\tilde{O}(\epsilon^{-10/3})$ stochastic gradient evaluations to converge to an approximate local minimum $\mathbf{x}$, which satisfies $\|\nabla f(\mathbf{x})\|_2\leq\epsilon$ and $\lambda_{\min}(\nabla^2 f(\mathbf{x}))\geq -\sqrt{\epsilon}$ in the general stochastic optimization setting, where $\tilde{O}(\cdot)$ hides poly-logarithmic terms and constants. This improves upon the $\tilde{O}(\epsilon^{-7/2})$ gradient complexity achieved by the state-of-the-art stochastic local minima finding algorithms by a factor of $\tilde{O}(\epsilon^{-1/6})$. For nonconvex finite-sum optimization, our algorithm also outperforms the best known algorithms in a certain regime.

COLA: Decentralized Linear Learning

Decentralized machine learning is a promising emerging technique in view of global challenges of data ownership and privacy. We consider learning of linear classification and regression models, in the setting where the training data is decentralized over many user devices, and the learning algorithm must run on-device, on an arbitrary communication network, without a central coordinator. We propose COLA, a new decentralized training algorithm with strong theoretical guarantees and superior practical performance. Our scheme overcomes many limitations of existing methods in the distributed setting, and achieves communication efficiency and scalability, as well as elasticity and resilience to changes in users' data and participating devices.

MiME: Multilevel Medical Embedding of Electronic Health Records for Predictive Healthcare

Deep learning models exhibit state-of-the-art performance for many predictive healthcare tasks using electronic health records (EHR) data, but these models typically require training data volume that exceeds the capacity of most healthcare systems. External resources such as medical ontologies are used to bridge the data volume constraint, but this approach is often not directly applicable or useful because of inconsistencies with terminology. To solve the data insufficiency challenge, we leverage the inherent multilevel structure of EHR data and, in particular, the encoded relationships among medical codes. We propose Multilevel Medical Embedding (MiME) which learns the multilevel embedding of EHR data while jointly performing auxiliary prediction tasks that rely on this inherent EHR structure without the need for external labels. We conducted two prediction tasks, heart failure prediction and sequential disease prediction, where MiME outperformed baseline methods in diverse evaluation settings. In particular, MiME consistently outperformed all baselines when predicting heart failure on datasets of different volumes, with the greatest performance improvement (15% relative gain in PR-AUC over the best baseline) on the smallest dataset, demonstrating its ability to effectively model the multilevel structure of EHR data.

Adaptive Sampling Towards Fast Graph Representation Learning

Graph Convolutional Networks (GCNs) have become a crucial tool for learning representations of graph vertices. The main challenge of applying GCNs to large-scale graphs is scalability: the uncontrollable neighborhood expansion across layers incurs heavy computation and memory costs. In this paper, we accelerate the training of GCNs by developing an adaptive layer-wise sampling method. Constructing the network layer by layer in a top-down manner, we sample the lower layer conditioned on the top one, so that the sampled neighborhoods are shared by different parent nodes and over-expansion is avoided thanks to the fixed-size sampling. More importantly, the proposed sampler is adaptive and applicable for explicit variance reduction, which in turn enhances training. Furthermore, we propose a novel and economical approach to promote message passing over distant nodes by applying skip connections. Extensive experiments on several benchmarks verify the effectiveness of our method in terms of classification accuracy, while enjoying faster convergence.

Hunting for Discriminatory Proxies in Linear Regression Models

A machine learning model may exhibit discrimination when used to make decisions involving people. One potential cause for such outcomes is that the model uses a statistical proxy for a protected demographic attribute. In this paper we formulate a definition of proxy use for the setting of linear regression and present algorithms for detecting proxies. Our definition follows recent work on proxies in classification models, and characterizes a model's constituent behavior that: 1) correlates closely with a protected random variable, and 2) is causally influential in the overall behavior of the model. We show that proxies in linear regression models can be efficiently identified by solving a second-order cone program, and further extend this result to account for situations where the use of a certain input variable is justified as a ``business necessity''. Finally, we present empirical results on two law enforcement datasets that exhibit varying degrees of racial disparity in prediction outcomes, demonstrating that proxies shed useful light on the causes of discriminatory behavior in models.

Towards Robust Detection of Adversarial Examples

Despite substantial recent progress, deep learning methods can be vulnerable to maliciously generated adversarial examples. In this paper, we present a novel training procedure and a thresholding test strategy towards robust detection of adversarial examples. In training, we propose to minimize the reverse cross-entropy (RCE), which encourages a deep network to learn latent representations that better distinguish adversarial examples from normal ones. In testing, we propose to use a thresholding strategy as the detector to filter out adversarial examples for reliable predictions. Our method is simple to implement using standard algorithms, with little extra training cost compared to common cross-entropy minimization. We apply our method to defend against various attacks on the widely used MNIST and CIFAR-10 datasets, and achieve significant improvements on robust predictions under all the threat models in the adversarial setting.
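
A hedged sketch of the reverse cross-entropy idea as the abstract describes it: train against a "reverse" label that puts zero mass on the true class and uniform mass on the rest (the paper's final objective may weight things differently).

```python
import numpy as np

def reverse_cross_entropy(probs, y, num_classes):
    # Reverse label: 0 on the true class, 1/(K-1) on every other class,
    # pushing the non-true predictions toward uniformity.
    K = num_classes
    reverse_label = np.full(K, 1.0 / (K - 1))
    reverse_label[y] = 0.0
    return -np.sum(reverse_label * np.log(probs + 1e-12))

probs = np.array([0.7, 0.1, 0.1, 0.1])   # softmax output for one example
print(reverse_cross_entropy(probs, y=0, num_classes=4))
```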

Active Matting

Image matting is an ill-posed problem. It requires a user-input trimap or some strokes to obtain an alpha matte of the foreground object. A fine user input is essential to obtain a good result, which is either time-consuming or demands experienced users who know where to place the strokes. In this paper, we explore the intrinsic relationship between the user input and the matting algorithm to address the problem of where and when the user should provide the input. Our aim is to discover the most informative sequence of regions for user input in order to produce a good alpha matte with minimal labeling effort. To this end, we propose an active matting method with recurrent reinforcement learning. The proposed framework keeps a human in the loop by sequentially detecting informative regions that require only trivial human judgement. Compared to traditional matting algorithms, the proposed framework requires much less effort, and can produce satisfactory results with just 10 regions. Through extensive experiments, we show that the proposed model reduces user effort significantly and achieves performance comparable to dense trimaps in a user-friendly manner. We further show that the learned informative knowledge can be generalized across different matting algorithms.

Learning filter widths of spectral decompositions with wavelets

Deep neural networks for time series classification, such as convolutional neural networks (CNNs), operate on a spectral decomposition of the time series computed in a preprocessing step. This step can involve a large number of hyperparameters, such as window length, filter widths, and filter shapes, each with a range of possible values that must be chosen using time- and data-intensive cross-validation procedures. We propose the wavelet deconvolution (WD) layer as an efficient alternative to this preprocessing step that eliminates a significant number of hyperparameters. The WD layer uses wavelet functions with adjustable scale parameters to learn the spectral decomposition directly from the signal. We show that the scale parameters can be optimized with gradient descent using backpropagation. Furthermore, the WD layer adds interpretability to the learned time series classifier by exploiting the properties of the wavelet transform. In our experiments, we show that the WD layer can automatically extract the frequency content used to generate a dataset. The WD layer combined with a CNN applied to the phone recognition task on the TIMIT database achieves a phone error rate of 18.1%, a relative improvement of 4% over the baseline CNN. Experiments on a dataset where engineered features are not available show that WD+CNN is the best-performing method. Our results show that the WD layer can improve neural-network-based time series classifiers in both accuracy and interpretability by learning directly from the input signal.
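
A minimal NumPy sketch of the idea: a filter bank of wavelets parametrized only by a scale per filter, convolved with the raw signal. The Ricker ("Mexican hat") wavelet and the normalization are our assumptions; in the WD layer the scales would be trained by backpropagation.

```python
import numpy as np

def ricker_filter_bank(scales, width=64):
    # One Ricker wavelet per scale s; larger s captures lower frequencies.
    t = np.arange(width) - width // 2
    filters = []
    for s in scales:
        x = t / s
        psi = (1 - x**2) * np.exp(-0.5 * x**2)
        filters.append(psi / (np.abs(psi).sum() + 1e-12))
    return np.stack(filters)

def wavelet_decompose(signal, scales):
    # Convolve the raw signal with each scaled wavelet, replacing the
    # fixed spectrogram preprocessing with scale-parametrized filters.
    bank = ricker_filter_bank(scales)
    return np.stack([np.convolve(signal, f, mode="same") for f in bank])

sig = np.sin(np.linspace(0, 40 * np.pi, 1000))           # toy input
feats = wavelet_decompose(sig, scales=[2.0, 8.0, 32.0])
print(feats.shape)   # (3, 1000): one channel per scale
```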

Optimal Byzantine-Resilient Stochastic Gradient Descent

(this is a theory paper) This paper studies the problem of distributed stochastic optimization in an adversarial setting where, out of $m$ machines which allegedly compute stochastic gradients every iteration, an $\alpha$-fraction are Byzantine, and can behave arbitrarily and adversarially. Our main result is a variant of stochastic gradient descent (SGD) which finds $\varepsilon$-approximate minimizers of convex functions in $T = \tilde{O}\big( \frac{1}{\varepsilon^2 m} + \frac{\alpha^2}{\varepsilon^2} \big)$ iterations. In contrast, traditional mini-batch SGD needs $T = O\big( \frac{1}{\varepsilon^2 m} \big)$ iterations, but cannot tolerate Byzantine failures. Further, we provide a lower bound showing that, up to logarithmic factors, our algorithm is information-theoretically optimal both in terms of sample complexity and time complexity.
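
To illustrate the setting (explicitly not the paper's estimator), a coordinate-wise median is a standard Byzantine-robust way to aggregate the m reported gradients:

```python
import numpy as np

def robust_aggregate(grads):
    # Coordinate-wise median of the reported gradients: a common robust
    # aggregation rule, shown only to illustrate Byzantine resilience.
    return np.median(np.stack(grads), axis=0)

rng = np.random.default_rng(0)
true_grad = np.array([1.0, -2.0, 0.5])
honest = [true_grad + 0.1 * rng.normal(size=3) for _ in range(8)]
byzantine = [np.array([100.0, 100.0, 100.0]) for _ in range(2)]  # adversarial
print(robust_aggregate(honest + byzantine))   # stays close to true_grad
```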

PG-TS: Improved Thompson Sampling for Logistic Contextual Bandits

We address the problem of regret minimization in logistic contextual bandits, where a learner decides among sequential actions or arms given their respective contexts to maximize binary rewards. Using a fast inference procedure with Pólya-Gamma distributed augmentation variables, we propose an improved version of Thompson Sampling, a Bayesian formulation of contextual bandits with near-optimal performance. Our approach, Pólya-Gamma augmented Thompson Sampling (PG-TS), achieves state-of-the-art performance on simulated and real data. PG-TS explores the action space efficiently and exploits high-reward arms, quickly converging to solutions of low regret. Its explicit estimation of the posterior distribution of the context feature covariance leads to substantial empirical gains over approximate approaches. PG-TS is the first approach to demonstrate the benefits of Pólya-Gamma augmentation in bandits and to propose an efficient Gibbs sampler for approximating the analytically unsolvable integral of logistic contextual bandits.

Spectral Filtering for General Linear Dynamical Systems

We give a polynomial-time algorithm for learning latent-state linear dynamical systems without system identification, and without assumptions on the spectral radius of the system's transition matrix. The algorithm extends the recently introduced technique of spectral filtering, previously applied only to systems with a symmetric transition matrix, using a novel convex relaxation to allow for the efficient identification of phases.

On Learning Intrinsic Rewards for Policy Gradient Methods

In many sequential decision making tasks, it is challenging to design reward functions that help an RL agent efficiently learn behavior that is considered good by the agent designer. A number of different formulations of the reward-design problem, or close variants thereof, have been proposed in the literature. In this paper we build on the Optimal Rewards Framework of Singh et al. that defines the optimal intrinsic reward function as one that when used by an RL agent achieves behavior that optimizes the task-specifying or extrinsic reward function. Previous work in this framework has shown how good intrinsic reward functions can be learned for lookahead search based planning agents. Whether it is possible to learn intrinsic reward functions for learning agents remains an open problem. In this paper we derive a novel algorithm for learning intrinsic rewards for policy-gradient based learning agents. We compare the performance of an augmented agent that uses our algorithm to provide additive intrinsic rewards to an A2C-based policy learner (for Atari games) and a PPO-based policy learner (for Mujoco domains) with a baseline agent that uses the same policy learners but with only extrinsic rewards. Our results show improved performance on most but not all of the domains.

Boolean Decision Rules via Column Generation

This paper considers the learning of Boolean rules in either disjunctive normal form (DNF, OR-of-ANDs, equivalent to decision rule sets) or conjunctive normal form (CNF, AND-of-ORs) as an interpretable model for classification. An integer program is formulated to optimally trade classification accuracy for rule simplicity. Column generation (CG) is used to efficiently search over an exponential number of candidate clauses (conjunctions or disjunctions) without the need for heuristic rule mining. This approach also bounds the gap between the selected rule set and the best possible rule set on the training data. To handle large datasets, we propose an approximate CG algorithm using randomization. Compared to three recently proposed alternatives, the CG algorithm dominates the accuracy-simplicity trade-off in 8 out of 16 datasets. When maximized for accuracy, CG is competitive with rule learners designed for this purpose, sometimes finding significantly simpler solutions that are no less accurate.

Adversarial Text Generation via Feature-Mover's Distance

Generative adversarial networks (GANs) have achieved significant success in generating real-valued data. However, the discrete nature of text hinders the application of GAN to text-generation tasks. Instead of using the standard GAN objective, we propose to improve text-generation GAN via a novel approach inspired by optimal transport. Specifically, we consider matching the latent feature distributions of real and synthetic sentences using a novel metric, termed the feature-mover's distance (FMD). This formulation leads to a highly discriminative critic and easy-to-optimize objective, overcoming the mode-collapsing and brittle-training problems in existing methods. Extensive experiments are conducted on a variety of tasks to evaluate the proposed model empirically, including unconditional text generation, style transfer from non-parallel text, and unsupervised cipher cracking. The proposed model yields superior performance, demonstrating wide applicability and effectiveness.

Fast Rates of ERM and Stochastic Approximation: Adaptive to Error Bound Conditions

Error bound conditions (EBC) are properties that characterize the growth of an objective function as a point moves away from the optimal set. They have recently received increasing attention in the field of optimization for developing optimization algorithms with fast convergence. However, studies of EBC in statistical learning have hitherto been limited. The main contributions of this paper are two-fold. First, we develop fast and intermediate rates of empirical risk minimization (ERM) under EBC for risk minimization with Lipschitz continuous and smooth convex random functions. Second, we establish fast and intermediate rates for an efficient stochastic approximation (SA) algorithm for risk minimization with Lipschitz continuous random functions, which requires only one pass over $n$ samples and adapts to EBC. For both approaches, the convergence rates span a full spectrum between $\widetilde O(1/\sqrt{n})$ and $\widetilde O(1/n)$ depending on the power constant in EBC, and could even be faster than $O(1/n)$ in special cases for ERM. Moreover, these convergence rates are automatically adaptive without requiring any knowledge of EBC. Overall, this work not only strengthens the understanding of ERM for statistical learning but also brings new fast stochastic algorithms for solving a broad range of statistical learning problems.

Learning Bounds for Greedy Approximation with Multiple Explicit Feature Maps

Large-scale Mercer kernels can be approximated using low-dimensional feature maps for efficient risk minimization. Due to the inherent trade-off between the feature map sparsity and the approximation accuracy, the key problem is to identify promising feature map components (bases) leading to satisfactory out-of-sample performance. In this work, we tackle this problem by efficiently choosing such bases from multiple kernels in a greedy fashion. Our method sequentially selects these bases from a set of candidate bases using a correlation metric. We prove that the out-of-sample performance depends on three types of errors, one of which (spectral error) relates to spectral properties of the best model in the Hilbert space associated to the combined kernel. The result verifies that when the underlying data model is sparse enough, i.e., the spectral error is negligible, one can control the test error with a small number of bases, scaling poly-logarithmically with data. Our empirical results show that given a fixed number of bases, the method can achieve a lower test error with a smaller time cost, compared to the state-of-the-art in data-dependent random features.

A Mathematical Model For Optimal Decisions In A Representative Democracy

Direct democracy is a special case of an ensemble of classifiers, where every person (classifier) votes. This fails when the average voter competence (classifier accuracy) falls below 50%, which happens in noisy settings where voters have limited information. Representative democracy, where voters choose representatives to vote, is a specific way to improve the ensemble of classifiers. We introduce a mathematical model for studying representative democracy, in particular for understanding the parameters of a representative democracy that give maximum decision-making capability. Our main result states that under general and natural conditions: (1) for fixed voting cost, the optimal number of representatives is \emph{linear}; (2) for polynomial cost, the optimal number of representatives is \emph{logarithmic}. This work sets the mathematical foundation for studying the quality-quantity tradeoff in representative-democracy-type ensembles (fewer, highly qualified representatives versus more, less qualified ones).
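
The ensemble view in the first sentences is easy to make concrete: the probability that a majority of n independent voters with competence p decides correctly (Condorcet's setting) rises with n when p > 1/2 and falls when p < 1/2.

```python
from math import comb

def majority_accuracy(n, p):
    # Probability that a majority of n independent voters, each correct
    # with probability p, makes the right decision (odd n avoids ties).
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

# Below 50% competence, adding voters hurts; above, it helps.
for p in (0.45, 0.55):
    print(p, [round(majority_accuracy(n, p), 3) for n in (1, 11, 101)])
```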

Negotiable Reinforcement Learning for Pareto Optimal Sequential Decision-Making

It is commonly believed that an agent making decisions on behalf of two or more principals who have different utility functions should adopt a Pareto optimal policy, i.e., a policy that cannot be improved upon for one principal without making sacrifices for another. Harsanyi's theorem shows that when the principals have a common prior on the outcome distributions of all policies, a Pareto optimal policy for the agent is one that maximizes a fixed, weighted linear combination of the principals' utilities. In this paper, we derive a more precise generalization for the sequential decision setting in the case of principals with different priors on the dynamics of the environment. We refer to this generalization as the Negotiable Reinforcement Learning (NRL) framework. In this more general case, the relative weight given to each principal's utility should evolve over time according to how well the agent's observations conform with that principal's prior. To gain insight into the dynamics of this new framework, we implement a simple NRL agent and empirically examine its behavior in a simple environment.

Non-metric Similarity Graphs for Maximum Inner Product Search

In this paper we address the problem of Maximum Inner Product Search (MIPS), which is currently the computational bottleneck in a large number of machine learning applications. While similar to nearest neighbor search (NNS), the MIPS problem is more challenging, as the inner product is not a proper metric function. We propose to solve the MIPS problem using similarity graphs, i.e., graphs where each vertex is connected to the vertices that are most similar to it in terms of some similarity function. Originally, the framework of similarity graphs was proposed for metric spaces; in this paper we naturally extend it to the non-metric MIPS scenario. We demonstrate that, unlike existing approaches, similarity graphs do not require any data transformation to reduce MIPS to the NNS problem and should be used on the original data. Moreover, we explain why such a reduction is detrimental for similarity graphs. By an extensive comparison to the existing approaches, we show that the proposed method is a game-changer in terms of the runtime/accuracy trade-off for the MIPS problem.
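
The core search routine on such a graph is simple to sketch (our toy construction; the paper's graph-building method differs): greedily walk to whichever neighbor has the larger inner product with the query.

```python
import numpy as np

def greedy_mips(query, vectors, neighbors, start=0):
    # Walk the similarity graph, moving to the neighbor with the largest
    # inner product with the query; stop at a local maximum.
    cur = start
    while True:
        best = max(neighbors[cur], key=lambda v: vectors[v] @ query)
        if vectors[best] @ query <= vectors[cur] @ query:
            return cur
        cur = best

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))
# toy graph: connect each vertex to its 8 most similar (by inner product)
sims = X @ X.T
neighbors = {i: list(np.argsort(sims[i])[-9:-1]) for i in range(len(X))}
q = rng.normal(size=16)
print(greedy_mips(q, X, neighbors), np.argmax(X @ q))   # often agree
```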

Recurrently Controlled Recurrent Networks

Recurrent neural networks (RNNs) such as long short-term memory and gated recurrent units are pivotal building blocks across a broad spectrum of sequence modeling problems. This paper proposes a recurrently controlled recurrent network (RCRN) for expressive and powerful sequence encoding. More concretely, the key idea behind our approach is to learn the recurrent gating functions using recurrent networks. Our architecture is split into two components, a controller cell and a listener cell, whereby the recurrent controller actively influences the compositionality of the listener cell. We conduct extensive experiments on a myriad of tasks in the NLP domain such as sentiment analysis (SST, IMDb, Amazon reviews, etc.), question classification (TREC), entailment classification (SNLI, SciTail), answer selection (WikiQA, TrecQA) and reading comprehension (NarrativeQA). Across all 26 datasets, our results demonstrate that RCRN consistently outperforms not only BiLSTMs but also stacked BiLSTMs, suggesting that our controller architecture might be a suitable replacement for the widely adopted stacked architecture. Additionally, RCRN achieves state-of-the-art results on several well-established datasets.

Fast greedy algorithms for dictionary selection with generalized sparsity constraints

In dictionary selection, several atoms are selected from a finite set of candidates so that given data points are well approximated by sparse representations over the selected atoms. We propose a novel efficient greedy algorithm for dictionary selection. Not only does our algorithm work much faster than known methods, but it can also handle more complex sparsity constraints, such as average sparsity. Using numerical experiments, we show that our algorithm outperforms known methods for dictionary selection, achieving performance competitive with dictionary learning algorithms at a smaller running time.

Data-Efficient Model-based Reinforcement Learning with Deep Probabilistic Dynamics Models

Model-based reinforcement learning (RL) algorithms can attain excellent sample efficiency, but often lag behind the best model-free algorithms in terms of asymptotic performance; this is especially true for methods with high-capacity parametric function approximators, such as deep networks. In this paper, we study how to bridge this gap by employing uncertainty-aware dynamics models, proposing a new algorithm called probabilistic ensembles with trajectory sampling (PETS). We justify PETS with an empirical analysis of design decisions for both uncertainty-aware dynamics models and uncertainty propagation methods for planning. Our experimental comparison to state-of-the-art model-based and model-free deep RL algorithms shows that our approach matches the asymptotic performance of model-free algorithms on several challenging benchmark tasks, while requiring significantly fewer samples.
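A minimal sketch of the trajectory-sampling idea (the toy dynamics, reward, and random-shooting planner below are our own assumptions; PETS itself uses learned neural network models and a CEM action optimizer):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical 1-D toy problem standing in for learned dynamics: an
# ensemble of probabilistic models, each predicting a Gaussian next state.
def make_model():
    bias = rng.normal(scale=0.05)          # per-model epistemic variation
    def model(s, a):
        mean = s + a + bias
        std = 0.1                          # aleatoric noise
        return mean, std
    return model

ensemble = [make_model() for _ in range(5)]

def plan(s0, horizon=5, n_seq=200, n_particles=20):
    """Random-shooting planner with trajectory sampling: each particle is
    propagated through a randomly chosen ensemble member at every step."""
    best_seq, best_ret = None, -np.inf
    for _ in range(n_seq):
        actions = rng.uniform(-1, 1, size=horizon)
        s = np.full(n_particles, s0)
        ret = 0.0
        for a in actions:
            models = rng.integers(len(ensemble), size=n_particles)
            for m in range(len(ensemble)):
                idx = models == m
                mean, std = ensemble[m](s[idx], a)
                s[idx] = mean + std * rng.normal(size=idx.sum())
            ret += -np.mean(s ** 2)        # reward: stay near the origin
        if ret > best_ret:
            best_seq, best_ret = actions, ret
    return best_seq[0]                     # MPC: execute first action only

print("first planned action from s0=1.0:", plan(1.0))
```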

A Smoother Way to Train Structured Prediction Models

We present a framework allowing one to perform smoothing on the inference used by structured prediction methods. Smoothing breaks the non-smoothness inherent to structured prediction objectives, without the need to resort to convex duality, and paves the way to the use of fast primal gradient-based optimization algorithms. We illustrate the proposed framework by developing a novel primal incremental gradient-based optimization algorithm for the structural support vector machine. The algorithm blends an extrapolation scheme for acceleration with an adaptive smoothing scheme for gradient-based optimization. We establish its worst-case complexity bounds. We present experimental results on two real-world problems, namely named entity recognition and visual object localization, showing that the proposed framework allows one to develop competitive primal optimization algorithms for structured prediction while efficiently leveraging inference routines.
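To make the smoothing idea concrete, here is a small illustration (ours, not the paper's algorithm) of replacing the non-smooth max over candidate structures with an entropy-smoothed log-sum-exp, whose gradient is a softmax over structures:

```python
import numpy as np

scores = np.array([1.0, 2.5, 2.4, 0.3])   # scores of candidate structures

def smoothed_max(scores, mu):
    """Entropy-smoothed max: mu * logsumexp(scores / mu).
    Tends to the hard max as mu -> 0 and is smooth for mu > 0."""
    m = scores.max()
    return m + mu * np.log(np.exp((scores - m) / mu).sum())

def smoothed_argmax(scores, mu):
    """Gradient of the smoothed max: a softmax over structures, which
    replaces the non-differentiable argmax used by exact inference."""
    z = (scores - scores.max()) / mu
    p = np.exp(z)
    return p / p.sum()

for mu in [1.0, 0.1, 0.01]:
    print(mu, smoothed_max(scores, mu), smoothed_argmax(scores, mu).round(3))
print("hard max:", scores.max())
```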

Context-dependent upper-confidence bounds for directed exploration

Directed exploration strategies for reinforcement learning are critical for learning an optimal policy in a minimal number of interactions with the environment. Many algorithms use optimism to direct exploration, either through visitation estimates or upper confidence bounds, as opposed to data-inefficient strategies like $\epsilon$-greedy that use random, undirected exploration. Most such data-efficient exploration methods, however, require significant computation, typically relying on a learned model to guide exploration. Least-squares methods have the potential to provide some of the data-efficiency benefits of model-based approaches---because they summarize past interactions---with the computation closer to that of model-free approaches. In this work, we provide a novel computationally efficient, incremental exploration strategy, leveraging this property of least-squares temporal difference learning (LSTD). We derive upper confidence bounds on the action-values learned by LSTD, with context-dependent (or state-dependent) noise variance. Such context-dependent noise focuses exploration on a subset of variable states, and allows for reduced exploration in other states. We empirically demonstrate that our algorithm can converge more quickly than other incremental exploration strategies using confidence estimates on action-values.

A Unified View of Piecewise Linear Neural Network Verification

The success of Deep Learning and its potential use in many safety-critical applications has motivated research on formal verification of Neural Network (NN) models. Despite the reputation of learned NN models for behaving as black boxes and the theoretical hardness of proving their properties, researchers have been successful in verifying some classes of models by exploiting their piecewise linear structure and taking insights from formal methods such as Satisfiability Modulo Theories. These methods are however still far from scaling to realistic neural networks. To facilitate progress in this crucial area, we make two key contributions. First, we present a unified framework that encompasses previous methods. This analysis results in the identification of new methods that combine the strengths of multiple existing approaches, accomplishing a speedup of two orders of magnitude compared to the previous state of the art. Second, we propose a new dataset of benchmarks which includes a collection of previously released test cases. We use the benchmark to provide the first experimental comparison of existing algorithms and identify the factors impacting the hardness of verification problems.

Hierarchical Graph Representation Learning with Differentiable Pooling

Recently, graph neural networks (GNNs) have revolutionized the field of graph representation learning through effectively learned node embeddings, and achieved state-of-the-art results in tasks such as node classification and link prediction. However, current GNN methods are inherently flat and do not learn hierarchical representations of graphs---a limitation that is especially problematic for the task of graph classification, where the goal is to predict the label associated with an entire graph. Here we propose DiffPool, a differentiable graph pooling module that can generate hierarchical representations of graphs and can be combined with various graph neural network architectures in an end-to-end fashion. DiffPool learns a differentiable soft cluster assignment for nodes at each layer of a deep GNN, mapping nodes to a set of clusters, which then form the coarsened input for the next GNN layer. Our experimental results show that combining existing GNN methods with DiffPool yields an average improvement of 5-10% in accuracy on graph classification benchmarks, compared to all existing pooling approaches, achieving a new state-of-the-art on four out of five benchmark datasets.
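A minimal numpy sketch of the pooling step (shapes and the single GCN-style layer are simplified assumptions; DiffPool additionally uses auxiliary losses and parameters trained end-to-end):

```python
import numpy as np

rng = np.random.default_rng(4)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gnn_layer(A, X, W):
    # One message-passing layer (a minimal GCN-style propagation).
    return np.tanh(A @ X @ W)

# Toy graph: 12 nodes, 8 features, pooled down to 3 clusters.
n, d, c = 12, 8, 3
A = (rng.random((n, n)) < 0.3).astype(float)
A = np.maximum(A, A.T)                   # symmetrize
X = rng.normal(size=(n, d))

W_embed = rng.normal(size=(d, d))
W_pool = rng.normal(size=(d, c))

Z = gnn_layer(A, X, W_embed)             # node embeddings
S = softmax(gnn_layer(A, X, W_pool))     # soft assignment, rows sum to 1

X_coarse = S.T @ Z                       # cluster features   (c x d)
A_coarse = S.T @ A @ S                   # coarsened adjacency (c x c)
print(X_coarse.shape, A_coarse.shape)
```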

Non-Ergodic Alternating Proximal Augmented Lagrangian Algorithms with Optimal Rates

We develop novel non-ergodic alternating proximal augmented Lagrangian algorithms (NAPALA) to solve nonsmooth constrained convex optimization problems. Our approach relies on a novel combination of the augmented Lagrangian framework, a partial alternating/linearization scheme, Nesterov's acceleration techniques, and homotopy strategies. Our algorithms have several new features compared to existing primal-dual methods. First, they apply Nesterov acceleration to the primal variables, rather than to the dual variables as in several existing methods. Second, they have an optimal $O(1/k)$ rate in the non-ergodic sense without any smoothness or strong convexity-type assumption, where $k$ is the iteration counter. Third, all the parameters are updated explicitly without heuristic tuning. Finally, they have an $O(1/k^2)$ ergodic or non-ergodic rate if one objective term is strongly convex while the problem remains nonstrongly convex. We verify our theoretical development via different numerical examples and compare our methods with some existing state-of-the-art algorithms.

Information-based Adaptive Stimulus Selection to Optimize Communication Efficiency in Brain-Computer Interfaces

In stimulus-driven brain-computer interfaces (BCIs), the main hurdle that limits the development of adaptive stimulus selection algorithms is the lack of objective functions that lead to tractable solutions and allow for algorithm implementation within the stringent time requirements of real-time BCI processing. In this work, we present a simple analytical solution of an information-based objective function by transforming the high-dimensional stimulus space into a one-dimensional space that parameterizes the objective function: the prior probability mass of the stimulus under consideration, irrespective of its contents. We demonstrate the utility of our adaptive stimulus selection algorithm in improving BCI performance with results from simulation and online human experiments.

Porcupine Neural Networks: Approximating Neural Network Landscapes

Neural networks have been used prominently in several machine learning and statistics applications. In general, the underlying optimization of neural networks is non-convex, which makes analyzing their performance challenging. In this paper, we take another approach to this problem by constraining the network such that the corresponding optimization landscape has good theoretical properties without significantly compromising performance. In particular, for two-layer neural networks we introduce Porcupine Neural Networks (PNNs), whose weight vectors are constrained to lie over a finite set of lines. We show that most local optima of the PNN optimization are global, and we characterize the regions where bad local optima may exist. Moreover, our theoretical and empirical results suggest that an unconstrained neural network can be approximated using a polynomially-large PNN.

Fairness Through Computationally-Bounded Awareness

We study the problem of fair classification within the versatile framework of Dwork et al. [ITCS '12], which assumes the existence of a metric that measures similarity between pairs of individuals. Unlike earlier work, we do not assume that the entire metric is known to the learning algorithm; instead, the learner can query this arbitrary metric a bounded number of times. We propose a new notion of fairness called metric multifairness and show how to achieve this notion in our setting. Metric multifairness is parameterized by a similarity metric d on pairs of individuals to classify and a rich collection C of (possibly overlapping) "comparison sets" over pairs of individuals. At a high level, metric multifairness guarantees that similar subpopulations are treated similarly, as long as these subpopulations are identified within the class C.

Adaptive Negative Curvature Descent with Applications in Non-convex Optimization

The negative curvature descent (NCD) method has been used to design deterministic and stochastic algorithms for non-convex optimization that aim to find second-order stationary points or local minima. In existing studies, NCD needs to approximate the smallest eigenvalue of the Hessian matrix with sufficient precision (e.g., $\epsilon_2\ll 1$) in order to achieve a sufficiently accurate second-order stationary solution (i.e., $\lambda_{\min}(\nabla^2 f(x))\geq -\epsilon_2$). One issue with this approach is that the target precision $\epsilon_2$ is usually set to be very small in order to find a high-quality solution, which increases the complexity of computing a negative curvature direction. To address this issue, we propose an adaptive NCD that allows an adaptive error, dependent on the current gradient's magnitude, in approximating the smallest eigenvalue of the Hessian, and that encourages competition between a noisy NCD step and a gradient descent step. We consider applications of the proposed adaptive NCD to both deterministic and stochastic non-convex optimization, and demonstrate that it can help reduce the overall complexity of computing negative curvature directions during the course of optimization without sacrificing the iteration complexity.

Is Q-Learning Provably Efficient?

Model-free reinforcement learning (RL) algorithms directly parameterize and update value functions or policies, bypassing the modeling of the environment. They are typically simpler, more flexible to use, and thus more prevalent in modern deep RL than model-based approaches. However, empirical work has suggested that they require large numbers of samples to learn. The theoretical question of whether or not model-free algorithms are in fact \emph{sample efficient} is one of the most fundamental questions in RL. The problem is unsolved even in the basic scenario with finitely many states and actions. We prove that, in an episodic MDP setting, Q-learning with UCB exploration achieves regret $\tilde{O}(\sqrt{H^3 SAT})$, where $S$ and $A$ are the numbers of states and actions, $H$ is the number of steps per episode, and $T$ is the total number of steps. Our regret matches the optimal regret up to a single $\sqrt{H}$ factor. Thus we establish the sample efficiency of a classical model-free approach. Moreover, to the best of our knowledge, this is the first model-free analysis to establish $\sqrt{T}$ regret \emph{without} requiring access to a ``simulator.''
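A sketch of the algorithm the abstract analyzes, tabular episodic Q-learning with a Hoeffding-style UCB bonus (the toy MDP and the constant $c$ are our own assumptions; the learning rate $\alpha_t = (H+1)/(H+t)$ and the $\sqrt{H^3 \log(\cdot)/t}$ bonus shape follow the paper's recipe):

```python
import numpy as np

rng = np.random.default_rng(5)

S, A, H, K = 5, 2, 4, 2000               # states, actions, horizon, episodes
c, delta = 1.0, 0.1
Q = np.full((H, S, A), float(H))         # optimistic initialization
N = np.zeros((H, S, A))

def step(s, a):                          # hypothetical toy MDP
    s2 = (s + a + rng.integers(0, 2)) % S
    return s2, float(s2 == S - 1)

for _ in range(K):
    s = 0
    for h in range(H):
        a = int(np.argmax(Q[h, s]))
        s2, r = step(s, a)
        N[h, s, a] += 1
        t = N[h, s, a]
        alpha = (H + 1) / (H + t)        # learning rate from the analysis
        bonus = c * np.sqrt(H ** 3 * np.log(S * A * K * H / delta) / t)
        v_next = Q[h + 1, s2].max() if h + 1 < H else 0.0
        Q[h, s, a] = (1 - alpha) * Q[h, s, a] + alpha * (r + v_next + bonus)
        Q[h, s, a] = min(Q[h, s, a], H)  # keep estimates in the valid range
        s = s2

print("greedy value estimate at the start state:", round(Q[0, 0].max(), 2))
```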

Interpreting Neural Network Judgments via Minimal, Stable, and Symbolic Corrections

We present a new algorithm to generate minimal, stable, and symbolic corrections to an input that will cause a neural network with ReLU activations to change its output. We argue that such a correction is a useful way to provide feedback to a user when the network's output is different from a desired output. Our algorithm generates such a correction by solving a series of linear constraint satisfaction problems. The technique is evaluated on three neural network models: one predicting whether an applicant will pay a mortgage, one predicting whether a first-order theorem can be proved efficiently by a solver using certain heuristics, and the final one judging whether a drawing is an accurate rendition of a canonical drawing of a cat.

Measures of distortion for machine learning

Given data from a general metric space, one of the standard machine learning pipelines is to first embed the data into a Euclidean space and subsequently apply out-of-the-box machine learning algorithms to analyze it. The quality of such an embedding is typically described in terms of a distortion measure. In this paper, we show that many of the existing distortion measures behave in an undesired way when considered from a machine learning point of view. We investigate desirable properties of distortion measures and formally prove that most of the existing measures fail to satisfy these properties. These theoretical findings are supported by simulations, which for example demonstrate that existing distortion measures are not robust to noise or outliers and cannot serve as good indicators for classification accuracy. As an alternative, we suggest a new measure of distortion, called $\sigma$-distortion. We show both in theory and in experiments that it satisfies all desirable properties and is a better candidate to evaluate distortion in the context of machine learning.

On the Local Minima of the Empirical Risk

Population risk is always of primary interest in machine learning; however, learning algorithms only have access to the empirical risk. Even for applications with nonconvex non-smooth losses (such as modern deep networks), the population risk is generally significantly better behaved from an optimization point of view than the empirical risk. In particular, sampling can create many spurious local minima. We consider a general framework which aims to optimize a smooth nonconvex function $F$ (population risk) given only access to an approximation $f$ (empirical risk) that is pointwise close to $F$ (i.e., $\|F-f\|_{\infty} \le \nu$). Our objective is to find the $\epsilon$-approximate local minima of the underlying function $F$ while avoiding the shallow local minima---arising because of the tolerance $\nu$---which exist only in $f$. We propose a simple algorithm based on stochastic gradient descent (SGD) on a smoothed version of $f$ that is guaranteed to achieve our goal as long as $\nu \le O(\epsilon^{1.5}/d)$. We also provide an almost matching lower bound showing that our algorithm achieves optimal error tolerance $\nu$ among all algorithms making a polynomial number of queries of $f$. As a concrete example, we show that our results can be directly used to give sample complexities for learning a ReLU unit.
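A one-dimensional illustration of the proposed approach (the functions, smoothing radius, and step size are hypothetical choices of ours): SGD on a Gaussian-smoothed version of $f$ steps over the spurious wiggles and finds the minimizer of $F$:

```python
import numpy as np

rng = np.random.default_rng(6)

# Empirical risk f: a smooth population risk F plus small high-frequency
# wiggles that create shallow spurious local minima (|F - f| <= nu).
F = lambda x: (x - 1.0) ** 2
nu = 0.01
f = lambda x: F(x) + nu * np.sin(x / 0.005)

def smoothed_grad(x, sigma=0.05, n=32):
    """Stochastic gradient of the Gaussian-smoothed surrogate
    E_z[f(x + z)], z ~ N(0, sigma^2), via the smoothing identity
    grad = E[(f(x+z) - f(x)) z] / sigma^2 (baseline reduces variance)."""
    z = sigma * rng.normal(size=n)
    return np.mean((f(x + z) - f(x)) * z) / sigma ** 2

x = -2.0
for _ in range(2000):
    x -= 0.01 * smoothed_grad(x)
print("reached x =", round(x, 3), " (minimizer of F is 1.0)")
```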

Densely Connected Attention Propagation for Reading Comprehension

We propose DecaProp (Densely Connected Attention Propagation), a new densely connected neural architecture for reading comprehension (RC). There are two distinct characteristics of our model. Firstly, our model densely connects all pairwise layers of the network, modeling relationships between passage and query across all hierarchical levels. Secondly, the dense connectors in our network are learned via attention instead of standard residual skip-connectors. To this end, we propose novel Bidirectional Attention Connectors (BAC) for efficiently forging connections throughout the network. We conduct extensive experiments on four challenging RC benchmarks. Our proposed approach achieves state-of-the-art results on all four, outperforming existing baselines by up to 2.6%–14.2% in absolute F1 score.

Bandit Learning with Positive Externalities

In many platforms, user arrivals exhibit a self-reinforcing behavior: future user arrivals are likely to have preferences similar to users who were satisfied in the past. In other words, arrivals exhibit {\em positive externalities}. We study multi-armed bandit (MAB) problems with positive externalities. We show that the self-reinforcing preferences may lead standard benchmark algorithms such as UCB to exhibit linear regret. We develop a new algorithm, Balanced Exploration (BE), which explores arms carefully to avoid suboptimal convergence of arrivals before sufficient evidence is gathered. We also introduce an adaptive variant of BE which successively eliminates suboptimal arms. We analyze their asymptotic regret, and establish optimality by showing that no algorithm can perform better.

Learning Confidence Sets using Support Vector Machines

The goal of confidence-set learning in the binary classification setting is to construct two sets, each with a specific probability guarantee to cover a class. An observation outside the overlap of the two sets is deemed to be from one of the two classes, while the overlap is an ambiguity region which could belong to either class. Instead of plug-in approaches, we propose a support vector classifier to construct confidence sets in a flexible manner. Theoretically, we show that the proposed learner can control the non-coverage rates and minimize the ambiguity with high probability. Efficient algorithms are developed and numerical studies illustrate the effectiveness of the proposed method.

Efficient Neural Network Robustness Certification with General Activation Functions

Finding the minimum distortion of adversarial examples, and thus certifying robustness of neural network classifiers, is known to be a challenging problem. Nevertheless, it has recently been shown to be possible to give a non-trivial certified lower bound on the minimum distortion, and some progress has been made in this direction by exploiting the piecewise linear nature of ReLU activations. However, generic robustness certification for \textit{general} activation functions remains largely unexplored. To address this issue, in this paper we introduce CROWN, a general framework to certify robustness of neural networks with general activation functions. The novelty in our algorithm consists of bounding a given activation function with linear and quadratic functions, hence allowing it to tackle general activation functions including but not limited to the four popular choices: ReLU, tanh, sigmoid and arctan. In addition, we facilitate the search for a tighter certified lower bound by \textit{adaptively} selecting appropriate surrogates for each neuron activation. Experimental results show that CROWN on ReLU networks can notably improve the certified lower bounds compared to the current state-of-the-art algorithm Fast-Lin, while having comparable computational efficiency. Furthermore, CROWN also demonstrates its effectiveness and flexibility on networks with general activation functions, including tanh, sigmoid and arctan. To the best of our knowledge, CROWN is the first framework that can efficiently certify non-trivial robustness for general activation functions in neural networks.
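To illustrate the core bounding step on one case (a sketch only: the full framework handles all sign configurations of the pre-activation interval, adaptively picks surrogates, and propagates bounds through layers), here is a linear lower/upper bound for tanh on a nonnegative interval, where tanh is concave:

```python
import numpy as np

def linear_bounds_tanh(l, u):
    """Linear lower/upper bounds for tanh on a pre-activation interval
    [l, u] with 0 <= l < u, where tanh is concave: the chord through the
    endpoints is a valid lower bound and the tangent at the midpoint a
    valid upper bound. (Other sign cases require a similar case analysis.)"""
    assert 0 <= l < u
    # Chord through the endpoints: lower bound for a concave function.
    a_low = (np.tanh(u) - np.tanh(l)) / (u - l)
    b_low = np.tanh(l) - a_low * l
    # Tangent at the midpoint: upper bound for a concave function.
    m = 0.5 * (l + u)
    a_up = 1.0 - np.tanh(m) ** 2          # derivative of tanh at m
    b_up = np.tanh(m) - a_up * m
    return (a_low, b_low), (a_up, b_up)

(aL, bL), (aU, bU) = linear_bounds_tanh(0.2, 1.5)
xs = np.linspace(0.2, 1.5, 50)
print(np.all(aL * xs + bL <= np.tanh(xs)), np.all(aU * xs + bU >= np.tanh(xs)))
```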

Hessian-based Analysis of Large Batch Training and Robustness to Adversaries

Large batch size training of neural networks has been shown to incur accuracy loss when trained with current methods. The exact underlying reasons for this are still not completely understood. Here, we study large batch size training through the lens of the Hessian operator and robust optimization. In particular, we perform a Hessian-based study to analyze exactly how the landscape of the loss function changes when training with large batch size. We compute the true Hessian spectrum, without approximation, by back-propagating the second derivative. Extensive experiments on multiple networks show that saddle points are not the cause of the generalization gap of large batch size training, and the results consistently show that large batch training converges to points with a noticeably higher Hessian spectrum. Furthermore, we show that robust training favors flat areas, as points with a large Hessian spectrum show poor robustness to adversarial perturbation. We further study this relationship, and provide empirical and theoretical proof that the inner loop of robust training is a saddle-free optimization problem \textit{almost everywhere}. We present detailed experiments with five different network architectures, including a residual network, tested on the MNIST, CIFAR-10, and CIFAR-100 datasets.

Neural Edit Operations for Biological Sequences

The evolution of biological sequences, such as proteins or DNAs, is driven by the three basic edit operations: substitution, insertion, and deletion. Motivated by the recent progress of neural network models for biological tasks, we implement two neural network architectures that can handle such edit operations. The first proposal is the "edit invariant neural network," based on differentiable Needleman-Wunsch algorithms. The second is the use of deep CNNs with concatenations. Our analysis shows that CNNs can recognize "star-free regular expressions", and that deeper CNNs can recognize more complex regular expressions including insertion/deletion of characters. The experimental results for the protein secondary structure prediction task suggest the importance of insertion/deletion. The test accuracy on the widely-used CB513 dataset is 71.5%, which is 1.2 points better than the current state-of-the-art result.

Objective and efficient inference for couplings in neuronal networks

Inferring directional couplings from the spike data of networks is desired in various scientific fields such as neuroscience. Here, we apply a recently proposed objective procedure to spike data obtained from Hodgkin--Huxley type models and \textit{in vitro} neuronal networks cultured in a circular structure. As a result, we succeed in reconstructing synaptic connections accurately from the evoked activity as well as the spontaneous one. To obtain these results, we derive an analytic formula that approximately implements a method for screening relevant couplings. This significantly reduces the computational cost of the screening method employed in the proposed objective procedure, making it possible to treat large-size systems as in this study.

Learning from Group Comparisons: Exploiting Higher Order Interactions

We study the problem of learning from group comparisons, with applications in predicting outcomes of sports and online games. Most previous works in this area focus on learning individual effects---they assume each player has an underlying score, and the ``ability'' of the team is modeled by the sum of team members' scores. Consequently, these approaches cannot model deeper interactions between team members: some players perform much better if they play together, while some players perform poorly together. In this paper, we propose a new model that takes player-interaction effects into consideration. However, under certain circumstances, the total number of individuals can be very large, and the number of player interactions grows quadratically, which makes learning intractable. In this case, we propose a latent factor model, and show that the sample complexity of our model is bounded under mild assumptions. Finally, we show that our proposed models have much better prediction power on several E-sports datasets, and furthermore can be used to reveal interesting patterns that cannot be discovered by previous methods.

Supervising Unsupervised Learning

We introduce a framework to leverage knowledge acquired from a repository of (heterogeneous) supervised datasets in new unsupervised datasets. Our perspective avoids the subjectivity inherent in unsupervised learning by reducing it to supervised learning, and provides a principled way to evaluate unsupervised algorithms. We demonstrate the versatility of our framework via rigorous agnostic bounds on a variety of unsupervised problems. In the context of clustering, our approach helps choose the number of clusters and the clustering algorithm, remove outliers, and provably circumvent Kleinberg's impossibility result. Experimental results across hundreds of problems demonstrate significant improvements in performance on unsupervised data with simple algorithms, despite the fact that our problems come from heterogeneous domains. Additionally, our framework lets us leverage deep networks to learn common features from many such small datasets, and perform zero-shot learning with massive performance gains.

Nonparametric Bayesian Lomax delegate racing for survival analysis with competing risks

We propose Lomax delegate racing (LDR) to explicitly model the mechanism of survival under competing risks and to interpret how covariates accelerate or decelerate the time to event. LDR explains non-monotonic covariate effects by racing a potentially infinite number of sub-risks, and consequently relaxes the ubiquitous proportional-hazards assumption, which may be too restrictive. Moreover, LDR is naturally able to model censoring and missing event times or types. For inference, we develop a Gibbs sampler under data augmentation for moderately sized data, along with a stochastic gradient descent maximum a posteriori inference algorithm for big data applications. Illustrative experiments are provided on both synthetic and real datasets, and comparison with various benchmark algorithms for survival analysis with competing risks demonstrates the strong performance of LDR.

Adversarially Robust Generalization Requires More Data

Machine learning models are often susceptible to adversarial perturbations of their inputs. Even small perturbations can cause state-of-the-art classifiers with high "standard" accuracy to produce an incorrect prediction with high confidence. To better understand this phenomenon, we study adversarially robust learning from the viewpoint of generalization. We show that already in a simple natural data model, the sample complexity of robust learning can be significantly larger than that of "standard" learning. This gap is information theoretic and holds irrespective of the training algorithm or the model family. We complement our theoretical results with experiments on popular image classification datasets and show that a similar gap exists here as well. We postulate that the difficulty of training robust classifiers stems, at least partially, from this inherently larger sample complexity.

Improving Exploration in Evolution Strategies for Deep Reinforcement Learning via a Population of Novelty-Seeking Agents

Evolution strategies (ES) are a family of black-box optimization algorithms able to train deep neural networks roughly as well as Q-learning and policy gradient methods on challenging deep reinforcement learning (RL) problems, but are much faster (e.g. hours vs. days) because they parallelize better. However, many RL problems require directed exploration because they have reward functions that are sparse or deceptive (i.e. contain local optima), and it is unknown how to encourage such exploration with ES. Here we show that algorithms that have been invented to promote directed exploration in small-scale evolved neural networks via populations of exploring agents, specifically novelty search (NS) and quality diversity (QD) algorithms, can be hybridized with ES to improve its performance on sparse or deceptive deep RL tasks, while retaining scalability. Our experiments confirm that the resultant new algorithms, NS-ES and two QD algorithms, NSR-ES and NSRA-ES, avoid local optima encountered by ES to achieve higher performance on Atari and simulated robots learning to walk around a deceptive trap. This paper thus introduces a family of fast, scalable algorithms for reinforcement learning that are capable of directed exploration. It also adds this new family of exploration algorithms to the RL toolbox and raises the interesting possibility that analogous algorithms with multiple simultaneous paths of exploration might also combine well with existing RL algorithms outside ES.

Practical exact algorithm for trembling-hand equilibrium refinements in games

Nash equilibrium strategies have the known weakness that they do not prescribe rational play in situations that are reached with zero probability according to the strategies themselves, for example, if players have made mistakes. Trembling-hand refinements---such as extensive-form perfect equilibria and quasi-perfect equilibria---remedy this problem in sound ways. Despite their appeal, they have not received attention in practice since no known algorithm for computing them scales beyond toy instances. In this paper, we design an exact polynomial-time algorithm for finding trembling-hand equilibria in zero-sum extensive-form games. It is several orders of magnitude faster than the best prior ones, numerically stable, and quickly solves game instances with tens of thousands of nodes in the game tree. It enables, for the first time, the use of trembling-hand refinements in practice.

Power-law efficient neural codes provide general link between perceptual bias and discriminability

Recent work in theoretical neuroscience has shown that information-theoretic "efficient" neural codes, which allocate neural resources to maximize the mutual information between stimuli and neural responses, give rise to a lawful relationship between perceptual bias and discriminability that is observed across a wide variety of psychophysical tasks in human observers (Wei & Stocker 2017). Here we generalize these results to show that the same law arises under a much larger family of optimal neural codes, introducing a unifying framework that we call power-law efficient coding. Specifically, we show that the same lawful relationship between bias and discriminability arises whenever Fisher information is allocated proportional to any power of the prior distribution. This family includes neural codes that are optimal for minimizing Lp error for any p, indicating that the lawful relationship observed in human psychophysical data does not require information-theoretically optimal neural codes. Furthermore, we derive the exact constant of proportionality governing the relationship between bias and discriminability for different power laws (which includes information-theoretically optimal codes, where the power is 2, and so-called discrimax codes, where power is 1/2), and different choices of optimal decoder. As a bonus, our framework provides new insights into "anti-Bayesian" perceptual biases, in which percepts are biased away from the center of mass of the prior. We derive an explicit formula that clarifies precisely which combinations of neural encoder and decoder can give rise to such biases.
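A compact restatement of the relationship in formulas, under our reading of the abstract (notation ours; the exact proportionality constants depend on the decoder, as the paper derives):

```latex
% Allocate Fisher information as a power q of the prior:
\[
J(\theta) \propto p(\theta)^{q}
\quad\Longrightarrow\quad
D(\theta) \propto J(\theta)^{-1/2} \propto p(\theta)^{-q/2},
\]
% and the lawful bias--discriminability relation holds for every power q:
\[
b(\theta) \;\propto\; \frac{d}{d\theta}\, D(\theta)^{2}
\;\propto\; \frac{d}{d\theta}\, p(\theta)^{-q},
\]
% with q = 2 for information-theoretically optimal codes and
% q = 1/2 for so-called discrimax codes.
```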

Active Geometry-Aware Visual Recognition in Cluttered Scenes

Cross-object occlusions remain an important source of failures for current state-of-the-art object detectors. Actively selecting camera views to undo occlusions and recover missing information has been identified as an important field of research since as early as the 1980s, under the name of active vision. Yet active vision in the 1980s was not equipped with deep neural detectors, memory modules, or view selection policies, and often attempted tasks and imagery that would appear elementary with current detectors, even from a single camera view. On the other hand, the recent resurrection of active view selection policies has focused on reconstructing or classifying isolated objects. This work presents a paradigm for active object recognition under heavy occlusions, as was the original premise. It introduces a geometry-aware 3D neural memory that accumulates information about the full scene across multiple camera views into a 3D feature tensor in a geometrically consistent manner: information regarding the same 3D physical point is placed nearby in the tensor using egomotion-aware feature warping and (learned) depth-aware unprojection operations. Object detection, segmentation, and 3D reconstruction are then carried out directly on the accumulated 3D feature memory. The proposed model does not need to commit early to object detections, as current geometry-unaware object detectors in 2D videos do, and generalizes much better than geometry-unaware LSTM/GRU memories. The lack of geometric constraints in previous architectures appears to be the bottleneck for handling the combinatorial explosion of visual data due to cross-object occlusions. The proposed model handles heavy occlusions even when trained with very little training data, by moving the head of the active observer between nearby views, seamlessly combining geometry with learning from experience.

Unsupervised Adversarial Invariance

Data representations that contain all the information about target variables but are invariant to nuisance factors benefit supervised learning algorithms by preventing them from learning associations between these factors and the targets, thus reducing overfitting. We present a novel unsupervised invariance induction framework for neural networks that learns a split representation of data through competitive training between the prediction task and a reconstruction task coupled with disentanglement, without needing any labeled information about nuisance factors or domain knowledge. We describe an adversarial instantiation of this framework and provide an analysis of its working. Our unsupervised model outperforms state-of-the-art methods, which are supervised, at inducing invariance to inherent nuisance factors, at effectively using synthetic data augmentation to learn invariance, and at domain adaptation. Our method can be applied to any prediction task, e.g., binary/multi-class classification or regression, without loss of generality.

Content preserving text generation with attribute controls

In this work, we address the problem of modifying textual attributes of sentences. Given an input sentence and a set of attribute labels, we attempt to generate sentences that are compatible with the conditioning information. To ensure that the model generates content compatible sentences, we introduce a reconstruction loss which interpolates between auto-encoding and back-translation loss components. We propose an adversarial loss to enforce generated samples to be attribute compatible and realistic. Through quantitative, qualitative and human evaluations we demonstrate that our model is capable of generating fluent sentences that better reflect the conditioning information compared to prior methods. We further demonstrate that the model is capable of simultaneously controlling multiple attributes.

Multi-armed Bandits with Compensation

We propose and study the known-compensation multi-armed bandit (KCMAB) problem, where a system controller offers a set of arms to many short-term players for $T$ steps. In each step, one short-term player arrives at the system. Upon arrival, the player greedily selects the arm with the current best average reward and receives a stochastic reward associated with the arm. In order to incentivize players to explore other arms, the controller provides proper payment compensation to players. The objective of the controller is to maximize the total reward collected by players while minimizing the compensation. We first give a compensation lower bound $\Theta\left(\sum_i \frac{\Delta_i \log T}{KL_i}\right)$, where $\Delta_i$ and $KL_i$ are the expected reward gap and the Kullback-Leibler (KL) divergence between the distributions of arm $i$ and the best arm, respectively. We then analyze three algorithms for the KCMAB problem, and obtain their regrets and compensations. We show that the algorithms all achieve $O(\log T)$ regret and $O(\log T)$ compensation, matching the theoretical lower bound. Finally, we use experiments to show the behaviors of these algorithms.
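A toy simulation of the compensation mechanism (the UCB index, the arm means, and the payment rule below are illustrative assumptions of ours; the paper analyzes several algorithms and proves matching $O(\log T)$ bounds):

```python
import numpy as np

rng = np.random.default_rng(7)

# A controller runs UCB and compensates myopic players for the gap
# between their greedy choice and the arm it wants explored.
means = np.array([0.5, 0.6, 0.7])
T = 5000
counts = np.ones(len(means))
sums = rng.random(len(means))            # stands in for one initial pull
compensation = 0.0

for t in range(len(means), T):
    avg = sums / counts
    ucb = avg + np.sqrt(2 * np.log(t) / counts)
    arm = int(np.argmax(ucb))            # arm the controller wants pulled
    greedy = int(np.argmax(avg))         # arm a short-term player would pick
    if arm != greedy:
        # Pay the player the expected shortfall so they accept the arm.
        compensation += avg[greedy] - avg[arm]
    reward = float(rng.random() < means[arm])
    sums[arm] += reward
    counts[arm] += 1

print("total compensation over", T, "steps:", round(compensation, 2))
```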

GradiVeQ: Vector Quantization for Bandwidth-Efficient Gradient Aggregation in Distributed CNN Training

Data parallelism can boost the training speed of convolutional neural networks (CNNs), but can suffer from significant communication costs caused by gradient aggregation. To alleviate this problem, several scalar quantization techniques have been developed to compress the gradients. But these techniques can perform poorly when used together with decentralized aggregation protocols like ring all-reduce (RAR), mainly due to their inability to directly aggregate compressed gradients. In this paper, we empirically demonstrate the strong linear correlations between CNN gradients, and propose a gradient vector quantization technique, named GradiVeQ, to exploit these correlations through principal component analysis (PCA) for substantial gradient dimension reduction. GradiVeQ enables direct aggregation of compressed gradients, and hence allows us to build a distributed learning system that parallelizes GradiVeQ gradient compression and RAR communications. Extensive experiments on popular CNNs demonstrate that applying GradiVeQ slashes the wall-clock gradient aggregation time of the original RAR by more than 5x without noticeable accuracy loss, and reduces the end-to-end training time by almost 50%. The results also show that GradiVeQ is compatible with scalar quantization techniques such as QSGD (Quantized SGD), and achieves a much higher speed-up gain under the same compression ratio.
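A linear-compression sketch showing why PCA-based vector quantization composes with all-reduce-style aggregation (the synthetic warm-up gradients, dimensions, and uncentered PCA are our own simplifications):

```python
import numpy as np

rng = np.random.default_rng(8)

# Synthetic "past gradients": strongly correlated (low-rank) plus noise.
d, k, n_warmup = 512, 16, 200
low_rank = rng.normal(size=(n_warmup, 8)) @ rng.normal(size=(8, d))
warmup = 0.1 * low_rank + 0.001 * rng.normal(size=(n_warmup, d))

# Learn principal directions from the warm-up window (uncentered PCA,
# which keeps the map exactly linear).
_, _, Vt = np.linalg.svd(warmup, full_matrices=False)
P = Vt[:k]                                # top-k principal directions

compress = lambda g: P @ g                # d floats -> k floats
decompress = lambda c: P.T @ c

# Linearity is the key property: compressed gradients from different
# workers can be summed directly (e.g., inside ring all-reduce), then
# decompressed once at the end.
g1, g2 = warmup[0], warmup[1]
agg = decompress(compress(g1) + compress(g2))
err = np.linalg.norm(agg - (g1 + g2)) / np.linalg.norm(g1 + g2)
print("relative aggregation error with", k, "of", d, "dims:", err)
```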

Multi-agent Online Learning with Asynchronous Feedback Loss

We consider a game-theoretical multi-agent learning problem where the feedback information can be lost and rewards are given by a broad class of games known as variationally stable games. We propose a simple variant of the online gradient descent algorithm, called reweighted online gradient descent (ROGD), and show that in variationally stable games, if each agent adopts the reweighted online gradient descent learning dynamics, then almost sure convergence to the set of Nash equilibria is guaranteed, even when the feedback loss is asynchronous and arbitrarily correlated among agents. We then extend the framework to deal with unknown feedback loss probabilities by using an estimator (constructed from past data) in their place. Finally, we further extend the framework to accommodate both asynchronous loss and stochastic rewards, and establish that multi-agent ROGD learning still converges to the set of Nash equilibria in such settings. Together, we make meaningful progress towards the broad open problem of convergence of no-regret algorithms to Nash equilibria in general continuous games, and contribute to the broad landscape of multi-agent online learning under imperfect information.

Scalable methods for 8-bit training of neural networks

Quantized Neural Networks (QNNs) are often used to improve network efficiency during the inference phase, i.e. after the network has been trained. Extensive research in the field suggests many different quantization schemes. Still, the number of bits required, as well as the best quantization scheme, remain unknown. Our theoretical analysis suggests that most of the training process is robust to substantial precision reduction, and points to only a few specific operations that require higher precision. Armed with this knowledge, we quantize the model parameters, activations and layer gradients to 8-bit, leaving at higher precision only the final step in the computation of the weight gradients. Additionally, as QNNs require batch-normalization to be trained at high precision, we introduce Range Batch-Normalization (BN), which has significantly higher tolerance to quantization noise and improved computational complexity. Our simulations show that Range BN is equivalent to the traditional batch norm if a precise scale adjustment, which can be approximated analytically, is applied. To the best of the authors' knowledge, this work is the first to quantize the weights, activations, as well as a substantial volume of the gradient stream, in all layers (including batch normalization) to 8-bit while showing state-of-the-art results over the ImageNet-1K dataset.
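A sketch of the Range BN idea, normalizing by a range-derived scale rather than the standard deviation (the scale constant below is the generic Gaussian-range heuristic, not necessarily the paper's exact analytical adjustment):

```python
import numpy as np

rng = np.random.default_rng(9)

def range_bn(x, eps=1e-5):
    """Range Batch-Norm sketch: normalize by a range-based scale instead
    of the standard deviation. For n i.i.d. Gaussian samples the expected
    range is roughly 2 * sigma * sqrt(2 * ln n), so range / (2 sqrt(2 ln n))
    approximates sigma while avoiding the quantization-sensitive sum of
    squares. (Constant here is the Gaussian heuristic; see the paper.)"""
    n = x.shape[0]
    mean = x.mean(axis=0)
    xc = x - mean
    span = xc.max(axis=0) - xc.min(axis=0)
    scale = span / (2.0 * np.sqrt(2.0 * np.log(n)))
    return xc / (scale + eps)

x = 3.0 * rng.normal(size=(256, 4)) + 1.0
y = range_bn(x)
print("per-feature std after Range BN (should be near 1):", y.std(axis=0).round(2))
```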

Dropping Symmetry for Fast Symmetric Nonnegative Matrix Factorization

Symmetric nonnegative matrix factorization (NMF)---a special but important class of the general NMF---has been demonstrated to be useful for data analysis, and in particular for various clustering tasks. Unfortunately, designing fast algorithms for symmetric NMF is not as easy as for its nonsymmetric counterpart, the latter admitting the splitting property that allows efficient alternating-type algorithms. To overcome this issue, we transform symmetric NMF into a nonsymmetric problem, so that we can adopt the ideas from state-of-the-art algorithms for nonsymmetric NMF to design fast algorithms for symmetric NMF. We rigorously establish that solving the nonsymmetric reformulation returns a solution to symmetric NMF, and then apply fast alternating-based algorithms to the reformulated problem. Furthermore, we show these fast algorithms admit strong convergence guarantees, in the sense that the generated sequence converges at least at a sublinear rate and converges globally to a critical point of the symmetric NMF objective. We conduct experiments on both synthetic data and image clustering to support our results.

Link Prediction Based on Graph Neural Networks

Link prediction is a key problem for network-structured data. Link prediction heuristics use score functions, such as common neighbors and the Katz index, to measure the likelihood of links. They have found wide practical use due to their simplicity, interpretability, and (often) scalability. However, heuristic methods make strong assumptions about when two nodes are likely to be linked, which limits their effectiveness in networks where these assumptions fail. In this regard, a more reasonable approach is to learn suitable ``heuristics'' from networks instead of using predefined ones. By extracting a local subgraph around each target link, we aim to learn a function mapping subgraph patterns to link existence, thus automatically learning a ``heuristic'' that suits the current network. In this paper, we study this heuristic learning problem for link prediction. We first propose a $\gamma$-decaying heuristic theory. By unifying a wide range of heuristics into a single framework, we prove that all these heuristics can be well approximated from local subgraphs. Our results show that local subgraphs preserve rich information related to link existence.

Why so gloomy? A Bayesian explanation of human pessimism bias in the multi-armed bandit task

How humans make repeated choices among options with imperfectly known reward outcomes is an important problem in psychology and neuroscience. This is often studied using multi-armed bandits, which are also frequently studied in machine learning. We present data from a human stationary bandit experiment, in which we vary the abundance and variability of reward availability (mean and variance of the reward rate distributions). Surprisingly, subjects hold a significantly underestimated prior mean of reward rates, as elicited at the end of a bandit game when they are asked to estimate the reward rates of arms never chosen during the game. Previously, human learning in the bandit task was found to be well captured by a Bayesian ideal learning model, the Dynamic Belief Model (DBM), albeit under an incorrect generative assumption of the temporal structure---humans assume reward rates can change over time even when they are actually fixed. We find that the "pessimism bias" in the bandit task is well captured by the prior mean of DBM when fitted to human choices, but poorly captured by the prior mean of the Fixed Belief Model (FBM), an alternative Bayesian model that (correctly) assumes reward rates to be constant. This pessimism bias is also incompletely captured by a simple reinforcement learning model (RL) commonly used in neuroscience and psychology, in terms of fitted initial option values. While it seems highly sub-optimal, and thus mysterious, that humans have an underestimated prior reward expectation, our simulations show that an underestimated prior mean helps to maximize long-term gain if the observer assumes volatility when reward rates are actually stable and uses a softmax decision policy instead of the optimal one (obtainable by dynamic programming). This raises the intriguing possibility that the brain underestimates reward rates to compensate for the incorrect non-stationarity assumption in the generative model and a suboptimal decision policy.

Near-Optimal Time and Sample Complexities for Solving Markov Decision Processes with a Generative Model

In this paper we consider the problem of computing an $\epsilon$-optimal policy of a discounted Markov Decision Process (DMDP), provided we can only access its transition function through a generative sampling model that, given any state-action pair, samples from the transition function in $O(1)$ time. Given such a DMDP with states $\mathcal{S}$, actions $\mathcal{A}$, discount factor $\gamma\in(0,1)$, and rewards in range $[0, 1]$, we provide an algorithm which computes an $\epsilon$-optimal policy with probability $1 - \delta$, where \emph{both} the runtime and the number of samples taken are upper bounded by \[ O\left(\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^3 \epsilon^2} \log \left(\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)\delta \epsilon} \right) \log\left(\frac{1}{(1-\gamma)\epsilon}\right)\right). \] For fixed values of $\epsilon$, this improves upon the previous best known bounds by a factor of $(1 - \gamma)^{-1}$ and matches the sample complexity lower bounds proved in \cite{azar2013minimax} up to logarithmic factors. We also extend our method to computing $\epsilon$-optimal policies for finite-horizon MDPs with a generative model and provide a nearly matching sample complexity lower bound.

ChannelNets: Compact and Efficient Convolutional Neural Networks via Channel-Wise Convolutions

Convolutional neural networks (CNNs) have shown great capability in solving various artificial intelligence tasks. However, their increasing model size has raised challenges in employing them in resource-limited applications. In this work, we propose to compress deep models by using channel-wise convolutions, which replace dense connections among feature maps with sparse ones in CNNs. Based on this novel operation, we build light-weight CNNs known as ChannelNets. ChannelNets use three instances of channel-wise convolutions, namely group channel-wise convolutions, depth-wise separable channel-wise convolutions, and the convolutional classification layer. Compared to prior CNNs designed for mobile devices, ChannelNets achieve a significant reduction in terms of the number of parameters and computational cost without loss in accuracy. Notably, our work represents the first attempt to compress the fully-connected classification layer, which usually accounts for about 25% of total parameters in compact CNNs. Experimental results on the ImageNet dataset demonstrate that ChannelNets achieve consistently better performance compared to prior methods.
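A minimal numpy sketch of one ingredient, a group channel-wise convolution that mixes channels with a small shared 1-D kernel instead of a dense channel-mixing matrix (the shapes and striding scheme are illustrative assumptions):

```python
import numpy as np

def channel_wise_conv(x, kernel, stride=1):
    """Slide a small 1-D kernel along the channel axis, sharing it across
    all spatial positions, instead of a dense channel-mixing matrix."""
    c, h, w = x.shape
    k = kernel.shape[0]
    c_out = (c - k) // stride + 1
    out = np.empty((c_out, h, w))
    for i in range(c_out):
        seg = x[i * stride : i * stride + k]         # k input channels
        out[i] = np.tensordot(kernel, seg, axes=(0, 0))
    return out

x = np.random.default_rng(10).normal(size=(32, 8, 8))   # (channels, H, W)
kernel = np.ones(4) / 4.0                                # 4-tap channel kernel
y = channel_wise_conv(x, kernel, stride=2)
print(x.shape, "->", y.shape)     # 32 channels mixed with only 4 weights
```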

Causal Inference and Mechanism Clustering of a Mixture of Additive Noise Models

The inference of the causal relationship between a pair of observed variables is a fundamental problem in science, and approaches exploiting a certain kind of independence within a single causal model are most commonly used. In practice, however, observations are often collected from multiple sources with heterogeneous causal models, which renders causal analysis results obtained by a single model questionable. In this paper, we generalize the Additive Noise Model (ANM) to a mixture model, which consists of a finite number of ANMs, and provide conditions for its causal identifiability. To conduct model estimation, we propose a Gaussian Process Partially Observable Model (GPPOM) to learn the latent parameter associated with each observation, which enables us not only to infer the causal direction but also to cluster observations according to their underlying generating mechanisms. Experiments on synthetic and real data demonstrate the effectiveness of our proposed approach.

Contour location via entropy reduction leveraging multiple information sources

We introduce an algorithm to locate contours of functions that are expensive to evaluate. The problem of locating contours arises in many applications, including classification, constrained optimization, and analysis of the performance of mechanical and dynamical systems (reliability, probability of failure, stability, etc.). Our algorithm locates contours using information from multiple sources, which are available in the form of relatively inexpensive, biased, and possibly noisy approximations to the original function. Considering multiple information sources can lead to significant cost savings. We also introduce the concept of contour entropy, a formal measure of uncertainty about the location of the zero contour of a function approximated by a statistical surrogate model. Our algorithm locates contours efficiently by maximizing the reduction of contour entropy per unit cost.

Assessing Generative Models via Precision and Recall

Recent advances in generative modeling have led to an increased interest in the study of statistical divergences as means of model comparison. Commonly used evaluation methods such as Frechet Inception Distance (FID) correlate well with the perceived quality of samples and they also show sensitivity to dropping modes from the target distribution. However, these metrics are unable to distinguish different failure cases since they inherently only yield one-dimensional scores. We propose a novel definition of precision and recall for distributions which disentangles the divergence into two separate dimensions. The proposed notion is intuitive, retains desirable properties, and naturally leads to an efficient algorithm that can be used to evaluate generative models. We relate this notion to total variation as well as to recent evaluation metrics such as Inception Score and FID. To demonstrate the practical utility of the proposed approach, we perform an empirical study on several variants of Generative Adversarial Networks and the Variational Autoencoder. In an extensive set of experiments, we show that the proposed metric is able to disentangle the quality of samples from the coverage of the target distribution.
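The precision-recall pairs can be computed directly for discrete distributions via a min-based characterization; in the sketch below (notation ours) p is the target and q the model, and a dropped mode shows up as reduced recall at full precision:

```python
import numpy as np

def prd_curve(p, q, n_angles=101):
    """Precision-recall pairs for two discrete distributions, p (target)
    and q (model): for lambda > 0,
      precision alpha(lambda) = sum_i min(lambda * p_i, q_i),
      recall    beta(lambda)  = sum_i min(p_i, q_i / lambda).
    Sweeping lambda over (0, inf) traces the full curve."""
    angles = np.linspace(1e-6, np.pi / 2 - 1e-6, n_angles)
    lambdas = np.tan(angles)
    alpha = np.array([np.minimum(l * p, q).sum() for l in lambdas])
    beta = np.array([np.minimum(p, q / l).sum() for l in lambdas])
    return alpha, beta

# Model that drops one mode of the target: high precision, lower recall.
p = np.array([0.25, 0.25, 0.25, 0.25])          # target
q = np.array([0.33, 0.33, 0.34, 0.00])          # model missing mode 4
alpha, beta = prd_curve(p, q)
print("max precision:", alpha.max().round(2), " max recall:", beta.max().round(2))
```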

Multiple-Step Greedy Policies in Approximate and Online Reinforcement Learning

Multiple-step lookahead policies have demonstrated high empirical competence in Reinforcement Learning, via the use of Monte Carlo Tree Search or Model Predictive Control. In a recent work (Efroni et al., 2018), multiple-step greedy policies and their use in vanilla Policy Iteration algorithms were proposed and analyzed. In this work, we study multiple-step greedy algorithms in more practical setups. We begin by highlighting a counter-intuitive difficulty, arising with soft-policy updates: even in the absence of approximations, and contrary to the 1-step-greedy case, monotonic policy improvement is not guaranteed unless the update stepsize is sufficiently large. Taking particular care about this difficulty, we formulate and analyze online and approximate algorithms that use such a multi-step greedy operator.

A Convex Duality Framework for GANs

A generative adversarial network (GAN) is a minimax game between a generator mimicking the true model and a discriminator distinguishing the samples produced by the generator from the real training samples. Given an unconstrained discriminator able to approximate any function, this game reduces to finding the generative model minimizing a divergence measure, e.g. the Jensen-Shannon divergence, to the data distribution. However, in practice the discriminator is constrained to be in a smaller class F such as neural nets. Then, a natural question is how the minimum divergence interpretation changes as we constrain F. In this work, we address this question by developing a convex duality framework for analyzing GANs. For a convex set F, this duality framework interprets f-GANs as finding the generative model with the minimum f-divergence to the distributions penalized to match the moments of the data distribution, with the moments specified by the discriminators in F. We show that a similar result also holds for Wasserstein GANs. As a byproduct, we apply this duality framework to a specific mixture of an f-divergence and a Wasserstein distance. Unlike f-divergences, we prove that this mixed distance behaves continuously in the generative model. This result suggests regularizing the discriminator in f-GANs by either constraining its Lipschitz constant or by adversarially training it using Wasserstein risk minimization. We provide numerical experiments supporting our theoretical results.

Horizon-Independent Minimax Linear Regression

We consider a linear regression game: at each round, an adversary reveals a covariate vector, the learner predicts a real value, the adversary reveals a label, and the learner suffers the squared prediction error. The aim is to minimize the difference between the cumulative loss and that of the linear predictor that is best in hindsight. Previous work demonstrated that the minimax optimal strategy is easy to compute recursively from the end of the game; this requires the entire sequence of covariate vectors in advance. We show that, once provided with a measure of the scale of the problem, we can invert the recursion and play the minimax strategy without knowing the future covariates. Further, we show that this forward recursion remains optimal even against adaptively chosen labels and covariates, provided that the adversary adheres to a set of constraints that prevent misrepresentation of the scale of the problem. This strategy is horizon-independent, i.e. it incurs no more regret than the optimal strategy that knows in advance the number of rounds of the game. We also provide an interpretation of the minimax algorithm as a follow-the-regularized-leader strategy with a data-dependent regularizer, and obtain an explicit expression for the minimax regret.

Exploiting Numerical Sparsity for Efficient Learning: Faster Eigenvector Computation and Regression

In this paper, we obtain improved running times for regression and top eigenvector computation for numerically sparse matrices. Given a data matrix $\mat{A} \in \R^{n \times d}$ where every row $a \in \R^d$ has $\|a\|_2^2 \leq L$ and numerical sparsity $\leq s$, i.e. $\|a\|_1^2 / \|a\|_2^2 \leq s$, we provide faster algorithms for these problems for many parameter settings. For top eigenvector computation, when $\gap > 0$ is the relative gap between the top two eigenvectors of $\mat{A}^\top \mat{A}$ and $r$ is the stable rank of $\mat{A}$ we obtain a running time of $\otilde(nd + r(s + \sqrt{r s}) / \gap^2)$ improving upon the previous best unaccelerated running time of $O(nd + r d / \gap^2)$. As $r \leq d$ and $s \leq d$ our algorithm everywhere improves or matches the previous bounds for all parameter settings. For regression, when $\mu > 0$ is the smallest eigenvalue of $\mat{A}^\top \mat{A}$ we obtain a running time of $\otilde(nd + (nL / \mu) \sqrt{s nL / \mu})$ improving upon the previous best unaccelerated running time of $\otilde(nd + n L d / \mu)$. This result expands when regression can be solved in nearly linear time from when $L/\mu = \otilde(1)$ to when $L / \mu = \otilde(d^{2/3} / (sn)^{1/3})$. Furthermore, we obtain similar improvements even when row norms and numerical sparsities are non-uniform and we show how to achieve even faster running times by accelerating using approximate proximal point \cite{frostig2015regularizing} / catalyst \cite{lin2015universal}. Our running times depend only on the size of the input and natural numerical measures of the matrix, i.e. eigenvalues and $\ell_p$ norms, making progress on a key open problem regarding optimal running times for efficient large-scale learning.

Experimental Design for Cost-Aware Learning of Causal Graphs

We consider the minimum cost intervention design problem: Given the essential graph of a causal graph and a cost to intervene on a variable, identify the set of interventions with minimum total cost that can learn any causal graph with the given essential graph. We first show that this problem is NP-hard. We then prove that we can achieve a constant factor approximation to this problem. We then constrain the sparsity of each intervention and give an algorithm that returns an intervention design that is nearly optimal in size for sparse graphs with sparse interventions; we also discuss how to use it when there are costs on the vertices.

Task-Driven Convolutional Recurrent Models of the Visual System

Feed-forward convolutional neural networks (CNNs) are currently state-of-the-art for object classification tasks such as ImageNet. Further, they are quantitatively accurate models of temporally-averaged responses of neurons in the primate brain's visual system. However, biological visual systems have two ubiquitous architectural features not shared with typical CNNs: local recurrence within cortical areas, and long-range feedback from downstream areas to upstream areas. Here we explored the role of recurrence in improving classification performance. We found that standard forms of recurrence (vanilla RNNs and LSTMs) do not perform well within deep CNNs on the ImageNet task. In contrast, custom cells that incorporated two structural features, bypassing and gating, were able to boost task accuracy substantially. We extended these design principles in an automated search over thousands of model architectures, which identified novel local recurrent cells and long-range feedback connections useful for object recognition. Moreover, these task-optimized ConvRNNs explained the dynamics of neural activity in the primate visual system better than feedforward networks, suggesting a role for the brain's recurrent connections in performing difficult visual behaviors.
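
To make the two structural features concrete, here is a minimal sketch (ours, not one of the cells found by the paper's architecture search) of a recurrent convolutional cell with a learned gate and a bypass path; the layer shapes and mixing rule are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GatedBypassConvCell(nn.Module):
    """Hypothetical ConvRNN cell combining gating with an ungated bypass."""
    def __init__(self, channels: int):
        super().__init__()
        self.input_conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.state_conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.gate_conv = nn.Conv2d(2 * channels, channels, 3, padding=1)

    def forward(self, x, h):
        # Candidate recurrent update mixing the input with the previous state.
        update = torch.tanh(self.input_conv(x) + self.state_conv(h))
        # The gate chooses, per location, between the update and the bypass.
        g = torch.sigmoid(self.gate_conv(torch.cat([x, h], dim=1)))
        return g * update + (1.0 - g) * x  # bypass: x flows through unchanged

cell = GatedBypassConvCell(channels=16)
x = torch.randn(1, 16, 32, 32)
h = torch.zeros_like(x)
for _ in range(4):  # unroll a few timesteps on a static input
    h = cell(x, h)
```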

Meta-Reinforcement Learning of Structured Exploration Strategies

Exploration is a fundamental challenge in reinforcement learning (RL). Many current exploration methods for deep RL use task-agnostic objectives, such as information gain or bonuses based on state visitation. However, many practical applications of RL involve learning more than a single task, and prior tasks can be used to inform how exploration should be performed in new tasks. In this work, we study how prior tasks can inform an agent about how to explore effectively in new situations. We introduce a novel gradient-based fast adaptation algorithm -- model agnostic exploration with structured noise (MAESN) -- to learn exploration strategies from prior experience. The prior experience is used both to initialize a policy and to acquire a latent exploration space that can inject structured stochasticity into a policy, producing exploration strategies that are informed by prior knowledge and are more effective than random action-space noise. We show that MAESN is more effective at learning exploration strategies when compared to prior meta-RL methods, RL without learned exploration strategies, and task-agnostic exploration methods. We evaluate our method on a variety of simulated tasks: locomotion with a wheeled robot, locomotion with a quadrupedal walker, and object manipulation.

Sample Efficient Stochastic Gradient Iterative Hard Thresholding Method for Stochastic Sparse Linear Regression with Limited Attribute Observation

We develop new stochastic gradient methods for efficiently solving sparse linear regression in a partial attribute observation setting, where learners are only allowed to observe a fixed number of actively chosen attributes per example at training and prediction times. It is shown that the methods achieve essentially a sample complexity of $O(1/\varepsilon)$ to attain an error of $\varepsilon$ under a variant of the restricted eigenvalue condition, and the rate has better dependency on the problem dimension than existing methods. Particularly, if the smallest magnitude of the non-zero components of the optimal solution is not too small, the rate of our proposed {\it Hybrid} algorithm can be boosted to near the minimax optimal sample complexity of {\it full information} algorithms. The core ideas are (i) efficient construction of an unbiased gradient estimator by the iterative usage of the hard thresholding operator for configuring an exploration algorithm; and (ii) an adaptive combination of the exploration and exploitation algorithms for quickly identifying the support of the optimum and efficiently searching for the optimal parameter within its support. Experimental results are presented to validate our theoretical findings and the superiority of our proposed methods.

Semi-supervised Deep Kernel Learning: Regression with Unlabeled Data by Minimizing Predictive Variance

Large amounts of labeled data are typically required to train deep learning models. For many real-world problems, however, acquiring additional data can be expensive or even impossible. We present semi-supervised deep kernel learning (SSDKL), a semi-supervised regression model based on minimizing predictive variance in the posterior regularization framework. SSDKL combines the hierarchical representation learning of neural networks with the probabilistic modeling capabilities of Gaussian processes. By leveraging unlabeled data, we show improvements on a diverse set of real-world regression tasks over supervised deep kernel learning and semi-supervised methods such as VAT and mean teacher adapted for regression.

Generalizing to Unseen Domains via Adversarial Data Augmentation

We are concerned with learning models that generalize well to different unseen domains. We consider a worst-case formulation over data distributions that are near the source domain in the feature space. Only using training data from the source domain, we propose an iterative procedure that augments the dataset with examples from a fictitious target domain that is "hard" under the current model. We show that our iterative scheme is an adaptive data augmentation method where we append adversarial examples at each iteration. For softmax losses, we show that our method is a data-dependent regularization scheme that behaves differently from classical regularizers (e.g., ridge or lasso) that regularize towards zero. On digit recognition and semantic segmentation tasks, we empirically observe that our method learns models that improve performance across a priori unknown data distributions.
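
A rough sketch of the inner maximization described above, under simplifying assumptions of ours: the perturbation is done in pixel space with a plain L2 penalty (the paper works with distances in feature space), and `model` is any differentiable classifier.

```python
import torch
import torch.nn.functional as F

def fictitious_examples(model, x, y, gamma=1.0, lr=0.1, steps=15):
    """Gradient ascent on loss minus a distance penalty: 'hard' but near-source."""
    x_adv = x.clone().detach().requires_grad_(True)
    for _ in range(steps):
        obj = F.cross_entropy(model(x_adv), y) - gamma * ((x_adv - x) ** 2).sum()
        grad, = torch.autograd.grad(obj, x_adv)
        with torch.no_grad():
            x_adv += lr * grad  # ascend the penalized loss
    return x_adv.detach()

# Each iteration of the outer loop appends (x_adv, y) to the training set
# and continues minimizing the usual loss on the augmented data.
```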

Hyperbolic Neural Networks

Hyperbolic spaces have recently gained momentum in the context of machine learning due to their high capacity and tree-likeliness properties. However, the representational power of hyperbolic geometry is not yet on par with Euclidean geometry, firstly because of the absence of corresponding hyperbolic neural network layers. Here, we bridge this gap in a principled manner by combining the formalism of Möbius gyrovector spaces with the Riemannian geometry of the Poincaré model of hyperbolic spaces. As a result, we derive hyperbolic versions of important deep learning tools: multinomial logistic regression, feed-forward and recurrent neural networks. This allows us to embed sequential data and perform classification in the hyperbolic space. Empirically, we show that, even if hyperbolic optimization tools are limited, hyperbolic sentence embeddings either outperform or are on par with their Euclidean variants on textual entailment and noisy-prefix recognition tasks.

Breaking the Curse of Horizon: Infinite-Horizon Off-policy Estimation

We consider the off-policy estimation problem of estimating the expected reward of a target policy using samples collected by a different behavior policy. Importance sampling (IS) has been a key technique to derive (nearly) unbiased estimators, but is known to suffer an excessively high variance in long-horizon problems. In the extreme case where one tries to estimate the average per-step reward in infinite-horizon problems, the variance of an IS-based estimator may even be unbounded. In this paper, we propose a new off-policy estimation method that applies IS directly on the stationary state-visitation distributions to avoid the exploding variance issue faced by existing estimators. Our key contribution is a novel approach to estimating the density ratio of two stationary distributions, with data sampled from only the behavior distribution. We develop a minimax loss function for the estimation problem, and derive a closed-form solution for the case of RKHS. We support our method with both theoretical and empirical analysis.
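
For intuition, here is a minimal sketch (ours) of the final estimator once the hard part, the stationary density ratio, has been estimated; `w`, `pi` and `mu` are assumed to be callables.

```python
import numpy as np

def average_reward(states, actions, rewards, w, pi, mu):
    """Self-normalized IS with stationary state-visitation ratios.

    w(s) ~ d_pi(s) / d_mu(s); pi(a, s) and mu(a, s) are action probabilities.
    """
    rho = np.array([w(s) * pi(a, s) / mu(a, s)
                    for s, a in zip(states, actions)])
    return float(np.sum(rho * np.asarray(rewards)) / np.sum(rho))
```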

Learning Task Specifications from Demonstrations

Real world applications often naturally decompose into several sub-tasks. In many settings (e.g., robotics) demonstrations provide a natural way to specify the sub-tasks. However, most methods for learning from demonstrations either do not provide guarantees that the artifacts learned for the subtasks can be safely recombined or limit the types of composition available. Motivated by this deficit, we consider the problem of inferring binary non-Markovian rewards, also known as logical trace properties or \emph{specifications}, from demonstrations provided by an agent operating in an uncertain, stochastic environment. Crucially, specifications admit well-defined composition rules that are typically easy to interpret. In this paper, we formulate the specification inference task as a maximum a posteriori (MAP) probability inference problem, apply the principle of maximum entropy to derive an analytic demonstration likelihood model and give an efficient approach to search for the most likely specification in a large candidate pool of specifications. In our experiments, we demonstrate how learning specifications can help avoid common reward hacking bugs that often occur due to ad-hoc reward composition.

Learning an olfactory topography from neural activity in piriform cortex

A major difficulty afflicting the study of olfactory perception is the lack of any obvious spatial organization or topography governing the relationship between odorants or the percepts they elicit. Here we develop a Gaussian process latent variable model to extract such a topography directly from olfactory responses measured in piriform cortex. Our approach seeks to map odorants to points in a low-dimensional embedding space, where the distance between points in this embedding space relates to the similarity of population responses they elicit. The model is specified by an explicit continuous mapping from a latent embedding space to the space of high-dimensional neural population activity patterns via a set of nonlinear neural tuning curves, each parametrized by a Gaussian process, followed by a low-rank model of correlated, odor-dependent Gaussian noise. We apply this model to large-scale calcium fluorescence imaging measurements of population activity in layers 2 and 3 of mouse piriform cortex following presentation of a diverse set of odorants. We show that we can learn a low-dimensional embedding of each odor, and a smooth tuning curve over the latent embedding space that accurately captures each neuron's response to different odorants. The model captures both signal and noise correlations across more than 500 neurons. We perform a co-smoothing analysis to show that the model can accurately predict the responses of a population of held-out neurons to test odorants.

Fully Understanding The Hashing Trick

Feature hashing, also known as {\em the hashing trick}, introduced by Weinberger et al. (2009), is one of the key techniques used in scaling-up machine learning algorithms. Loosely speaking, feature hashing uses a random sparse projection matrix $A : \mathbb{R}^n \to \mathbb{R}^m$ (where $m \ll n$) in order to reduce the dimension of the data from $n$ to $m$ while approximately preserving the Euclidean norm. Every column of $A$ contains exactly one non-zero entry, equal to either $-1$ or $1$. Weinberger et al. showed tail bounds on $\|Ax\|_2^2$. Specifically, they showed that for every $\varepsilon, \delta$, if $\|x\|_\infty / \|x\|_2$ is sufficiently small, and $m$ is sufficiently large, then \begin{equation*}\Pr\left[\, \big|\, \|Ax\|_2^2 - \|x\|_2^2 \,\big| < \varepsilon \|x\|_2^2 \,\right] \ge 1 - \delta\,.\end{equation*} These bounds were later extended by Dasgupta et al. (2010) and most recently refined by Dahlgaard et al. (2017); however, the true nature of the performance of this key technique, and specifically the correct tradeoff between the pivotal parameters $\|x\|_\infty / \|x\|_2$, $m$, $\varepsilon$, $\delta$, remained an open question. We settle this question by giving tight asymptotic bounds on the exact tradeoff between the central parameters, thus providing a complete understanding of the performance of feature hashing. We complement the asymptotic bound with empirical data, which shows that the constants "hiding" in the asymptotic notation are, in fact, very close to $1$, thus further illustrating the tightness of the presented bounds in practice.
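
A small self-contained sketch of the projection being analyzed, with explicit random arrays standing in for the hash functions used in practice:

```python
import numpy as np

def feature_hash(x: np.ndarray, m: int, seed: int = 0) -> np.ndarray:
    """y_j = sum_{i : h(i) = j} sigma(i) * x_i, one nonzero per column of A."""
    rng = np.random.default_rng(seed)
    bucket = rng.integers(0, m, size=x.shape[0])      # h : [n] -> [m]
    sign = rng.choice([-1.0, 1.0], size=x.shape[0])   # sigma : [n] -> {-1, +1}
    y = np.zeros(m)
    np.add.at(y, bucket, sign * x)
    return y

x = np.random.randn(10_000)
y = feature_hash(x, m=512)
# Squared norms agree in expectation and concentrate when ||x||_inf / ||x||_2 is small.
print(np.sum(x ** 2), np.sum(y ** 2))
```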

Evolved Policy Gradients

We propose a metalearning approach for learning gradient-based reinforcement learning (RL) algorithms. The idea is to evolve a differentiable loss function, such that an agent, which optimizes its policy to minimize this loss, will achieve high rewards. The loss is parametrized via temporal convolutions over the agent's experience. Because this loss is highly flexible in its ability to take into account the agent's history, it enables fast task learning. Empirical results show that our evolved policy gradient algorithm (EPG) achieves faster learning on several randomized environments compared to an off-the-shelf policy gradient method. We also demonstrate that EPG's learned loss can generalize to out-of-distribution test time tasks, and exhibits qualitatively different behavior from other popular metalearning algorithms.

The Spectrum of the Fisher Information Matrix of a Single-Hidden-Layer Neural Network

An important factor contributing to the success of deep learning has been the remarkable ability to optimize large neural networks using simple first-order optimization algorithms like stochastic gradient descent. While the efficiency of such methods depends crucially on the local curvature of the loss surface, very little is actually known about how this geometry depends on network architecture and hyperparameters. In this work, we extend a recently-developed framework for studying spectra of nonlinear random matrices to characterize an important measure of curvature, namely the eigenvalues of the Fisher information matrix. We focus on a single-hidden-layer neural network with Gaussian data and weights and provide an exact expression for the spectrum in the limit of infinite width. We find that linear networks suffer worse conditioning than nonlinear networks and that nonlinear networks are generically non-degenerate. We also predict and demonstrate empirically that by adjusting the nonlinearity, the spectrum can be tuned so as to improve the efficiency of first-order optimization methods.

Learning Concave Conditional Likelihood Models for Improved Analysis of Tandem Mass Spectra

The most widely used technology to identify the proteins present in a complex biological sample is tandem mass spectrometry, which quickly produces a large collection of spectra representative of the peptides (i.e., protein subsequences) present in the original sample. In this work, we greatly expand the parameter learning capabilities of a dynamic Bayesian network (DBN) peptide-scoring algorithm, Didea, by deriving emission distributions for which its conditional log-likelihood scoring function remains concave. We show that this class of emission distributions, called Convex Virtual Emissions (CVEs), naturally generalizes the log-sum-exp function while rendering both maximum likelihood estimation and conditional maximum likelihood estimation concave for a wide range of Bayesian networks. Utilizing CVEs in Didea allows efficient learning of a large number of parameters while ensuring global convergence, in stark contrast to Didea’s previous parameter learning framework (which could only learn a single parameter using a costly grid search) and other trainable models (which only ensure convergence to local optima). The newly trained scoring function substantially outperforms the state-of-the-art in both scoring function accuracy and downstream Fisher kernel analysis. Furthermore, we significantly improve Didea’s runtime performance through successive optimizations to its message passing schedule and derive explicit connections between Didea’s new concave score and related MS/MS scoring functions.

Differentially Private k-Means with Constant Multiplicative Error

We design new differentially private algorithms for the Euclidean k-means problem, both in the centralized model and in the local model of differential privacy. In both models, our algorithms achieve significantly improved error guarantees than the previous state-of-the-art. In addition, in the local model, our algorithm significantly reduces the number of interaction rounds. Although the problem has been widely studied in the context of differential privacy, all of the existing constructions achieve only super constant approximation factors. We present, for the first time, efficient private algorithms for the problem with constant multiplicative error. Furthermore, we show how to modify our algorithms so they compute private corsets for k-means clustering in both models.

Policy Optimization via Importance Sampling

Policy optimization is an effective reinforcement learning approach to solve continuous control tasks. Recent achievements have shown that alternating on- line and off-line optimization is a successful choice for efficient trajectory reuse. However, deciding when to stop optimizing and collect new trajectories is non-trivial as it requires to account for the variance of the objective function estimate. In this paper, we propose a novel model-free policy search algorithm, POIS, applicable in both control-based and parameter-based settings. We first derive a high-confidence bound for importance sampling estimation and then we define a surrogate objective function which is optimized off-line using a batch of trajectories. Finally, the algorithm is tested on a selection of continuous control tasks, with both linear and deep policies, and compared with the state-of-the-art policy optimization methods.

Adversarial Logit Pairing

We introduce adversarial logit pairing (ALP), a new regularization technique designed to increase robustness to adversarial examples. We first demonstrate that ALP is a generalization of weight decay. Then, using the ImageNet dataset, we empirically show that the performance of the state of the art adversarial training defense from Madry et al. degrades on high dimensional input spaces. Next, we show that ALP achieves the state of the art defense on ImageNet against PGD white box attacks, with an accuracy improvement from 1.5% to 27.9%. Unlike previous work on adversarial training, we achieve this improvement without an increase in model size. Finally, we show that examples generated from an ALP-trained model are the current state-of-the-art transfer attack. This transfer attack successfully damages the current state of the art defense against black box attacks on ImageNet (Tramer et al.), dropping its accuracy from 66.6% to 47.1%. With this new accuracy drop, adversarial logit pairing ties with Tramer et al. for the state of the art on black box attacks on ImageNet.
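
The pairing objective itself is simple; below is a hedged sketch of it, where `make_adversarial` (e.g. a PGD attack) and the weight `lam` are stand-ins of ours rather than the paper's exact configuration:

```python
import torch.nn.functional as F

def alp_loss(model, x, y, make_adversarial, lam=0.5):
    """Adversarial training loss plus an L2 penalty pairing clean/adversarial logits."""
    x_adv = make_adversarial(model, x, y)    # assumed attack, not shown here
    logits_clean, logits_adv = model(x), model(x_adv)
    adv_ce = F.cross_entropy(logits_adv, y)  # standard adversarial training term
    pairing = ((logits_clean - logits_adv) ** 2).mean()
    return adv_ce + lam * pairing
```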

Estimating Learnability in the Sublinear Data Regime

We consider the problem of estimating how well a model class is capable of fitting a distribution of labeled data. We show that it is often possible to accurately estimate this ``learnability'' even when given an amount of data that is too small to reliably learn any accurate model. Our first result applies to the setting where the data is drawn from a $d$-dimensional distribution with isotropic covariance, and the label of each datapoint is an arbitrary noisy function of the datapoint. In this setting, we show that with $O(\sqrt{d})$ samples, one can accurately estimate the fraction of the variance of the label that can be explained via the best linear function of the data. We extend these techniques to a binary classification, and show that the prediction error of the best linear classifier can be accurately estimated given $O(\sqrt{d})$ labeled samples. For comparison, in both the linear regression and binary classification settings, even if there is no noise in the labels, a sample size linear in the dimension, $d$, is required to \emph{learn} any function correlated with the underlying model. We further extend our estimation approach to the setting where the data distribution has an (unknown) arbitrary covariance matrix, allowing these techniques to be applied to settings where the model class consists of a linear function applied to a nonlinear embedding of the data. We demonstrate the practical viability of our approaches on synthetic and real data. This ability to estimate the explanatory value of a set of features (or dataset), even in the regime in which there is too little data to realize that explanatory value, may be relevant to the scientific and industrial settings for which data collection is expensive and there are many potentially relevant feature sets that could be collected.

Algorithmic Assurance: An Active Approach to Algorithmic Testing using Bayesian Optimisation

We introduce algorithmic assurance, the problem of testing whether machine learning algorithms are conforming to their intended design goal. We address this problem by proposing an efficient framework for algorithmic testing. To provide assurance, we need to efficiently discover scenarios where an algorithm's decision deviates maximally from its intended gold standard. We mathematically formulate this task as an optimisation problem of an expensive, black-box function. We use an active learning approach based on Bayesian optimisation to solve this optimisation problem. We extend this framework to algorithms with vector-valued outputs by making appropriate modifications to Bayesian optimisation via the Hedge algorithm. We theoretically analyse the convergence of our methods. Using two real-world applications, we demonstrate the efficiency of our methods. The significance of our problem formulation and initial solutions is that they will serve as a foundation for assuring humans about machines making complex decisions.
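
A toy version of the search loop (our sketch, not the paper's implementation): treat the deviation from the gold standard as an expensive black box, fit a GP surrogate, and query where an upper confidence bound is largest. The `deviation` function below is a synthetic placeholder.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def deviation(x):  # placeholder for |algorithm(x) - gold_standard(x)|
    return float(np.abs(np.sin(3.0 * x) - np.sin(3.1 * x)))

grid = np.linspace(0.0, 10.0, 500).reshape(-1, 1)
X, Y = [[0.0]], [deviation(0.0)]
for _ in range(20):
    gp = GaussianProcessRegressor().fit(np.array(X), np.array(Y))
    mu, sd = gp.predict(grid, return_std=True)
    x_next = grid[np.argmax(mu + 2.0 * sd), 0]  # UCB acquisition
    X.append([x_next]); Y.append(deviation(x_next))
print("worst deviation found:", max(Y))
```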

Community Exploration: From Offline Optimization to Online Learning

We introduce the community exploration problem, which has various real-world applications such as online advertising. In the problem, an explorer allocates a limited budget to explore communities so as to maximize the number of members they can meet. We provide a systematic study of the community exploration problem, from offline optimization to online learning. For the offline setting where the sizes of communities are known, we prove that the greedy method is optimal for both non-adaptive and adaptive exploration; a sketch of the non-adaptive version follows this paragraph. For the online setting where the sizes of communities are not known and need to be learned from multi-round explorations, we propose an ``upper confidence''-like algorithm that achieves logarithmic regret bounds. By combining the feedback from different rounds, we can achieve a constant regret bound.
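
The greedy method is easy to state under the natural model (an assumption of ours for this sketch) where each visit to community i meets one of its d_i members uniformly at random: the marginal gain of one more visit after k_i visits is (1 - 1/d_i)^{k_i}, so greedy repeatedly assigns the next visit to the community with the largest gain.

```python
import heapq

def greedy_allocation(sizes, budget):
    """Allocate `budget` visits to maximize expected distinct members met."""
    counts = [0] * len(sizes)
    heap = [(-1.0, i) for i in range(len(sizes))]  # max-heap via negated gains
    heapq.heapify(heap)
    for _ in range(budget):
        _, i = heapq.heappop(heap)
        counts[i] += 1
        gain = (1.0 - 1.0 / sizes[i]) ** counts[i]  # gain of the *next* visit
        heapq.heappush(heap, (-gain, i))
    return counts

print(greedy_allocation(sizes=[5, 20, 100], budget=30))
```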

A Dual Framework for Low-rank Tensor Completion

One of the popular approaches for low-rank tensor completion is to use the latent trace norm regularization. However, most existing works in this direction learn a sparse combination of tensors. In this work, we fill this gap by proposing a variant of the latent trace norm that helps in learning a non-sparse combination of tensors. We develop a dual framework for solving the proposed low-rank tensor completion problem. In this framework, we first show a novel characterization of the solution space with an interesting factorization of the optimal solution. This allows us to propose two scalable optimization formulations. The problems are shown to lie on a Cartesian product of Riemannian spectrahedron manifolds. We exploit the versatile Riemannian optimization framework to propose computationally efficient trust-region algorithms. The experiments illustrate the efficacy of the proposed algorithms on several real-world datasets across applications.

Low-rank Interaction with Sparse Additive Effects Model for Large Data Frames

Many applications of machine learning involve the analysis of large data frames -- matrices collecting heterogeneous measurements (binary, numerical, counts, etc.) across samples -- with missing values. Low-rank models, as studied by Udell et al. (2016), are popular in this framework for tasks such as visualization, clustering and missing value imputation. Yet, available methods with statistical guarantees and efficient optimization do not allow explicit modeling of main additive effects such as row and column, or covariate effects. In this paper, we introduce a low-rank interaction and sparse additive effects (LORIS) model which combines matrix regression on a dictionary and low-rank design, to estimate main effects and interactions simultaneously. We provide statistical guarantees in the form of upper bounds on the estimation error of both components. Then, we introduce a mixed coordinate gradient descent (MCGD) method which provably converges sublinearly to an optimal solution and is computationally efficient for large-scale data sets. We show on simulated and survey data that the method has a clear advantage over current practices.

Inference Aided Reinforcement Learning for Incentive Mechanism Design in Crowdsourcing

Incentive mechanisms for crowdsourcing are designed to incentivize financially self-interested workers to generate and report high-quality labels. Existing mechanisms are often developed as one-shot static solutions, assuming a certain level of knowledge about worker models (expertise levels, costs for exerting efforts, etc.). In this paper, we propose a novel inference aided reinforcement mechanism that acquires data sequentially and requires no such prior assumptions. Specifically, we first design a Gibbs sampling augmented Bayesian inference algorithm to estimate workers' labeling strategies from the collected labels at each step. Then we propose a reinforcement incentive learning (RIL) method, building on top of the above estimates, to uncover how workers respond to different payments. RIL dynamically determines the payment without accessing any ground-truth labels. We theoretically prove that RIL is able to incentivize rational workers to provide high-quality labels both at each step and in the long run. Empirical results show that our mechanism performs consistently well under both rational and non-fully rational (adaptive learning) worker models. Besides, the payments offered by RIL are more robust and have lower variances compared to existing one-shot mechanisms.

Middle-Out Decoding

Despite being virtually ubiquitous, sequence-to-sequence models are challenged by their lack of diversity and inability to be externally controlled. In this paper, we speculate that a fundamental shortcoming of sequence generation models is that the decoding is done strictly from left to right, meaning that output values generated earlier have a profound effect on those generated later. To address this issue, we propose a novel middle-out decoder architecture that begins from an initial middle word and simultaneously expands the sequence in both directions. To facilitate information flow and maintain consistent decoding, we introduce a dual self-attention mechanism that allows us to model complex dependencies between the outputs. We illustrate the performance of our model on the task of video captioning, as well as a synthetic sequence de-noising task. Our middle-out decoder achieves significant improvements on de-noising and competitive performance in the task of video captioning, while quantifiably improving the caption diversity. Furthermore, we perform a qualitative analysis that demonstrates our ability to effectively control the generation process of our decoder.

First-order Stochastic Algorithms for Escaping From Saddle Points in Almost Linear Time

(This is a theory paper.) In this paper, we consider first-order methods for solving stochastic non-convex optimization problems. The key building block of the proposed algorithms is first-order procedures to extract negative curvature from the Hessian matrix through a principled sequence starting from noise, which are referred to as {\it NEgative-curvature-Originated-from-Noise or NEON} and are of independent interest. Based on this building block, we design purely first-order stochastic algorithms for escaping from non-degenerate saddle points with a much better time complexity (almost linear time in the problem's dimensionality). In particular, we develop a general framework of {\it first-order stochastic algorithms} with a second-order convergence guarantee based on our new technique and existing algorithms that may only converge to a first-order stationary point. For finding a nearly {\it second-order stationary point} $\x$ such that $\|\nabla F(\x)\|\leq \epsilon$ and $\nabla^2 F(\x)\geq -\sqrt{\epsilon}I$ (with high probability), the best time complexity of the presented algorithms is $\widetilde O(d/\epsilon^{3.5})$, where $F(\cdot)$ denotes the objective function and $d$ is the dimensionality of the problem. To the best of our knowledge, this is the first theoretical result for first-order stochastic algorithms with almost linear time in terms of the problem's dimensionality for finding second-order stationary points, which is even competitive with existing stochastic algorithms hinging on second-order information.

To Trust Or Not To Trust A Classifier

Knowing when a classifier's prediction can be trusted is useful in many applications and critical for safely using AI. While the bulk of the effort in machine learning research has been towards improving classifier performance, understanding when a classifier's predictions should and should not be trusted has received far less attention. The standard approach is to use the classifier's discriminant or confidence score; however, we show there exists a considerably more effective alternative. We propose a new score, called the {\it trust score}, which measures the agreement between the classifier and a modified nearest-neighbor classifier on the testing example. We show empirically that high (low) trust scores produce surprisingly high precision at identifying correctly (incorrectly) classified examples, consistently outperforming the classifier's confidence score as well as many other baselines. Further, under some mild distributional assumptions, we show that if the trust score for an example is high (low), the classifier will likely agree (disagree) with the Bayes-optimal classifier. Our guarantees consist of non-asymptotic rates of statistical consistency under various nonparametric settings and build on recent developments in topological data analysis.
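
Stripped of the density-based filtering that the paper applies to the training set (omitted here for brevity), the score reduces to a simple distance ratio; a minimal sketch:

```python
import numpy as np

def trust_score(x, train_X, train_y, predicted_label):
    """Distance to the nearest other-class example over distance to the
    nearest predicted-class example; large values signal agreement."""
    dists = np.linalg.norm(train_X - x, axis=1)
    d_pred = dists[train_y == predicted_label].min()
    d_other = dists[train_y != predicted_label].min()
    return d_other / (d_pred + 1e-12)

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
y = np.array([0, 0, 1])
print(trust_score(np.array([0.2, 0.1]), X, y, predicted_label=0))  # high trust
```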

Reparameterization Gradient for Non-differentiable Models

We present a new algorithm for stochastic variational inference that targets models with non-differentiable densities. One of the key challenges in stochastic variational inference is to come up with a low-variance estimator of the gradient of a variational objective. We tackle the challenge by generalizing the reparameterization trick, one of the most effective techniques for addressing the variance issue for differentiable models, so that the trick works for non-differentiable models as well. Our algorithm splits the space of latent variables into regions where the density of the variables is differentiable, and their boundaries where the density may fail to be differentiable. For each differentiable region, the algorithm applies the standard reparameterization trick and estimates the gradient restricted to the region. For each potentially non-differentiable boundary, it uses a form of manifold sampling and computes the direction for variational parameters that, if followed, would increase the boundary’s contribution to the variational objective. The sum of all the estimates becomes the gradient estimate of our algorithm. Our estimator enjoys the reduced variance of the reparameterization gradient while remaining unbiased even for non-differentiable models. The experiments with our preliminary implementation confirm the benefit of reduced variance and unbiasedness.

A Simple Proximal Stochastic Gradient Method for Nonsmooth Nonconvex Optimization

We analyze stochastic gradient algorithms for optimizing nonconvex, nonsmooth finite-sum problems. In particular, the objective function is given by the summation of a differentiable (possibly nonconvex) component, together with a possibly non-differentiable but convex component. We propose a proximal stochastic gradient algorithm based on variance reduction, called ProxSVRG+. Our main contribution lies in the analysis of ProxSVRG+. It recovers several existing convergence results (in terms of the number of stochastic gradient oracle calls and proximal operations), and improves/generalizes them. In particular, ProxSVRG+ generalizes the best results given by the SCSG (stochastically controlled stochastic gradient) algorithm, recently proposed by [Lei et al., NIPS'17] for the smooth nonconvex case. ProxSVRG+ is more straightforward than SCSG and yields a simpler analysis. Moreover, ProxSVRG+ outperforms the deterministic proximal gradient descent (ProxGD) for a wide range of minibatch sizes, which partially solves an open problem proposed in [Reddi et al., NIPS'16]. Also, ProxSVRG+ uses far fewer proximal oracle calls than ProxSVRG [Reddi et al., NIPS'16] for small minibatch sizes $b$.

Multimodal Generative Models for Scalable Weakly-Supervised Learning

Multiple modalities often co-occur when describing natural phenomena. Learning a joint representation of these modalities should yield deeper and more useful representations. Previous generative approaches to multi-modal input either do not learn a joint distribution or require additional computation to handle missing data. Here, we introduce a multimodal variational autoencoder (MVAE) that uses a product-of-experts inference network and a sub-sampled training paradigm to solve the multi-modal inference problem. Notably, our model shares parameters to efficiently learn under any combination of missing modalities. We apply the MVAE on four datasets and show that we match state-of-the-art performance using many fewer parameters. In addition, we show that the MVAE is directly applicable to weakly-supervised learning, and is robust to incomplete supervision. We then consider a case study of learning image transformations—edge detection, colorization, facial landmark segmentation, etc.—as a set of modalities. We find appealing results across this range of tasks.
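
The product-of-experts step is what makes missing modalities painless: each present modality contributes a Gaussian expert over the latent code, and experts combine by precision weighting. A sketch of just that step (a standard-normal prior expert can be appended as one more term):

```python
import numpy as np

def product_of_gaussians(mus, logvars):
    """Precision-weighted fusion of Gaussian experts; any subset may be passed."""
    precisions = [np.exp(-lv) for lv in logvars]  # 1 / sigma_i^2
    total = sum(precisions)
    mu = sum(p * m for p, m in zip(precisions, mus)) / total
    return mu, 1.0 / total  # fused mean and variance

# e.g. image expert + text expert; drop an entry when that modality is missing
mu, var = product_of_gaussians(
    mus=[np.array([0.5, -1.0]), np.array([0.3, -0.8])],
    logvars=[np.array([0.0, 0.2]), np.array([-0.5, 0.1])],
)
```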

How Much Restricted Isometry is Needed In Nonconvex Matrix Recovery?

When the linear measurements of an instance of low-rank matrix recovery satisfy a restricted isometry property (RIP) --- i.e. they are approximately norm-preserving --- the problem is known to contain no spurious local minima, so exact recovery is guaranteed. In this paper, we show that moderate RIP is not enough to eliminate spurious local minima, so existing results can only hold for near-perfect RIP. In fact, counterexamples are ubiquitous: every $x$ is a spurious local minimum of a rank-1 instance of matrix recovery that satisfies RIP. One specific counterexample has RIP constant $\delta=1/2$, but causes randomly initialized stochastic gradient descent (SGD) to fail 12% of the time. SGD is frequently able to avoid and escape spurious local minima, but this empirical result shows that it can occasionally be defeated by their existence. Hence, while exact recovery guarantees will likely require a proof of no spurious local minima, arguments based solely on norm preservation will only be applicable to a narrow set of nearly-isotropic instances.

Impossibility of deducing preferences and rationality from human policy

Inverse reinforcement learning (IRL) attempts to infer human rewards or preferences from observed behavior. Since human planning systematically deviates from rationality, several approaches have been tried to account for specific human shortcomings. However, there has been little analysis of the general problem of inferring the reward of a human of unknown rationality. The observed behavior can, in principle, be decomposed into two components: a reward function and a planning algorithm, both of which have to be inferred from behavior. This paper presents a No Free Lunch theorem, showing that, without making `normative' assumptions beyond the data, nothing about the human reward function can be deduced from human behavior. Unlike most No Free Lunch theorems, this cannot be alleviated by regularising with simplicity assumptions. We show that the simplest hypotheses which explain the data are generally degenerate.

Manifold Structured Prediction

Structured prediction provides a general framework to deal with supervised problems where the outputs have semantically rich structure. While classical approaches consider finite, albeit potentially huge, output spaces, in this paper we discuss how structured prediction can be extended to a continuous scenario. Specifically, we study a structured prediction approach to manifold-valued regression. We characterize a class of problems for which the considered approach is statistically consistent and study how geometric optimization can be used to compute the corresponding estimator. Promising experimental results on both simulated and real data complete our study.

Fast Greedy MAP Inference for Determinantal Point Process to Improve Recommendation Diversity

The determinantal point process (DPP) is an elegant probabilistic model of repulsion with applications in various machine learning tasks including summarization and search. However, the maximum a posteriori (MAP) inference for DPP, which plays an important role in many applications, is NP-hard, and even the popular greedy algorithm can still be too computationally expensive to be used in large-scale real-time scenarios. To overcome the computational challenge, in this paper, we propose a novel algorithm to greatly accelerate the greedy MAP inference for DPP. In addition, our algorithm also adapts to scenarios where the repulsion is only required among a few nearby items in the result sequence. We apply the proposed algorithm to generate relevant and diverse recommendations. Experimental results show that our proposed algorithm is significantly faster than state-of-the-art competitors, and provides a better relevance-diversity trade-off on several public datasets, which is also confirmed in an online A/B test.
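
For reference, the objective the fast algorithm accelerates is the classic greedy loop below (a deliberately naive sketch of ours that recomputes log-determinants from scratch; the paper's contribution is an incremental update that avoids exactly this cost):

```python
import numpy as np

def greedy_dpp_map(L: np.ndarray, k: int):
    """Greedily add the item giving the largest log det(L_S)."""
    selected = []
    for _ in range(k):
        best, best_val = None, -np.inf
        for i in range(L.shape[0]):
            if i in selected:
                continue
            S = selected + [i]
            val = np.linalg.slogdet(L[np.ix_(S, S)])[1]
            if val > best_val:
                best, best_val = i, val
        selected.append(best)
    return selected

rng = np.random.default_rng(0)
B = rng.normal(size=(8, 8))
L = B @ B.T + 1e-6 * np.eye(8)  # PSD kernel matrix
print(greedy_dpp_map(L, k=3))
```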

Learning Others' Intentional Models in Multi-Agent Settings Using Interactive POMDPs

Interactive partially observable Markov decision processes (I-POMDPs) provide a principled framework for planning and acting in a partially observable, stochastic and multi-agent environment. They extend POMDPs to multi-agent settings by including models of other agents in the state space and forming a hierarchical belief structure. In order to predict other agents' actions using I-POMDPs, we propose an approach that effectively uses Bayesian inference and sequential Monte Carlo sampling to learn others' intentional models, which ascribe to them beliefs, preferences and rationality in action selection. Empirical results show that our algorithm accurately learns models of the other agent and outperforms other methods. Our approach serves as a generalized Bayesian learning algorithm that learns other agents' beliefs, and transition, observation and reward functions. It also effectively mitigates the belief space complexity due to the nested belief hierarchy.

Contextual Pricing for Lipschitz Buyers

We investigate the problem of learning a Lipschitz function from binary feedback. In this problem, a learner is trying to learn a Lipschitz function $f:[0,1]^d \rightarrow [0,1]$ over the course of $T$ rounds. On round $t$, an adversary provides the learner with an input $x_t$, the learner submits a guess $y_t$ for $f(x_t)$, and learns whether $y_t > f(x_t)$ or $y_t \leq f(x_t)$. The learner's goal is to minimize their total loss $\sum_t\ell(f(x_t), y_t)$ (for some loss function $\ell$). The problem is motivated by \textit{contextual dynamic pricing}, where a firm must sell a stream of differentiated products to a collection of buyers with non-linear valuations for the items and observes only whether the item was sold or not at the posted price. For the symmetric loss $\ell(f(x_t), y_t) = \vert f(x_t) - y_t \vert$, we provide an algorithm for this problem achieving total loss $O(\log T)$ when $d=1$ and $O(T^{(d-1)/d})$ when $d>1$, and show that both bounds are tight (up to a factor of $\sqrt{\log T}$). For the pricing loss function $\ell(f(x_t), y_t) = f(x_t) - y_t {\bf 1}\{y_t \leq f(x_t)\}$ we show a regret bound of $O(T^{d/(d+1)})$ and show that this bound is tight. We present improved bounds in the special case of a population of linear buyers.

Online Improper Learning with an Approximation Oracle

We study the following question: given an efficient approximation algorithm for an optimization problem, can we learn efficiently in the same setting? We give a formal affirmative answer to this question in the form of a reduction from online learning to offline approximate optimization using an efficient algorithm that guarantees near optimal regret. The algorithm is efficient in terms of the number of oracle calls to a given approximation oracle – it makes only logarithmically many such calls per iteration. This resolves an open question by Kalai and Vempala, and by Garber. Furthermore, our result applies to the more general improper learning problems.

Bandit Learning in Concave N-Person Games

This paper examines the long-run behavior of learning with bandit feedback in non-cooperative concave games. The bandit framework accounts for extremely low-information environments where the agents may not even know they are playing a game; as such, the agents’ most sensible choice in this setting would be to employ a no-regret learning algorithm. In general, this does not mean that the players' behavior stabilizes in the long run: no-regret learning may lead to cycles, even with perfect gradient information. However, if a standard monotonicity condition is satisfied, our analysis shows that no-regret learning based on mirror descent with bandit feedback converges to Nash equilibrium with probability 1. We also derive an upper bound for the convergence rate of the process that nearly matches the best attainable rate for single-agent bandit stochastic optimization.

On Fast Leverage Score Sampling and Optimal Learning

Leverage score sampling provides an appealing way to perform approximate computations for large matrices. Indeed, it allows one to derive faithful approximations with a complexity adapted to the problem at hand. Yet, performing leverage score sampling is a challenge in its own right, and further approximations are typically needed. In this paper, we study the problem of leverage score sampling for positive definite matrices defined by a kernel. Our contribution is twofold. First, we provide a novel algorithm for leverage score sampling. We provide theoretical guarantees as well as empirical results showing that the proposed algorithm is currently the fastest and most accurate solution to this problem. Second, we analyze the properties of the proposed method in a downstream supervised learning task. Combining several algorithmic ideas, we derive the fastest solver for kernel ridge regression and Gaussian process regression currently available. Also in this case, theoretical findings are corroborated by experimental results.

Unsupervised Video Object Segmentation for Deep Reinforcement Learning

We present a new technique for deep reinforcement learning that automatically detects moving objects and uses the relevant information for action selection. The detection of moving objects is done in an unsupervised way by exploiting structure from motion. Instead of directly learning a policy from raw images, the agent first learns to detect and segment moving objects by exploiting flow information in video sequences. The learned representation is then used to focus the policy of the agent on the moving objects. Over time, the agent identifies which objects are critical for decision making and gradually builds a policy based on relevant moving objects. This approach, which we call Motion-Oriented REinforcement Learning (MOREL), is demonstrated on a suite of Atari games where the ability to detect moving objects reduces the amount of interaction needed with the environment to obtain a good policy. Furthermore, the resulting policy is more interpretable than policies that directly map images to actions or values with a black box neural network. We can gain insight into the policy by inspecting the segmentation and motion of each object detected by the agent. This allows practitioners to confirm whether a policy is making decisions based on sensible information.

Efficient inference for time-varying behavior during learning

The process of learning new behaviors is of great interest to various domains of neuroscience and artificial intelligence. However, most standard analyses of training data either treat behavior as fixed or track only coarse performance statistics (e.g., accuracy, choice bias), providing limited insight into evolving behavioral strategies. To overcome these limitations, we propose a dynamic psychophysical model that efficiently tracks trial-to-trial changes in behavior over the course of training. Based on a dynamic logistic regression model, our model infers a high-dimensional time-varying weight vector that expresses the dynamic dependencies of behavior on task stimuli and common task-irrelevant variables including choice history, sensory history, reward history, and choice bias. Our implementation scales to the largest behavioral datasets, allowing us to infer 500K parameters (e.g. 10 weights over 50K trials) in a few hours on a desktop computer. We optimize hyperparameters with the decoupled Laplace approximation, an efficient method for maximizing marginal likelihood that allows us to estimate directly from the data how quickly each weight evolves. To illustrate performance and utility, we apply our method to psychophysical data from both human subjects and rats learning the same delayed sensory discrimination task. We successfully track the dynamics of psychophysical weights during training, capturing day-to-day and trial-to-trial fluctuations that underlie changes in performance, choice bias, and dependencies on task history. Finally, we leverage the flexibility of our model to investigate why rats frequently make mistakes on easy trials, demonstrating that this lapse phenomenon occurs due to sub-optimal weighting of task covariates.
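
The generative side of the model is compact; below is our simulation sketch of it (random-walk weights, Bernoulli choices), with the inference machinery that is the paper's actual contribution omitted and all sizes and volatilities chosen arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(1)
T, D = 5000, 4  # trials; covariates (e.g. stimulus, choice history, bias, ...)
sigma = np.array([0.05, 0.02, 0.01, 0.03])  # assumed per-weight volatilities
X = rng.normal(size=(T, D))
W = np.cumsum(rng.normal(scale=sigma, size=(T, D)), axis=0)  # w_t = w_{t-1} + noise
p = 1.0 / (1.0 + np.exp(-(X * W).sum(axis=1)))  # per-trial choice probability
choices = rng.binomial(1, p)
```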

Learning convex polytopes with margin

We present a near-optimal algorithm for properly learning convex polytopes in the realizable PAC setting from data with a margin. Our first contribution is to identify distinct generalizations of the notion of {\em margin} from hyperplanes to polytopes and to understand how they relate geometrically; this result may be of interest beyond the learning setting. Our novel learning algorithm constructs a consistent polytope as an intersection of about $t \log t$ halfspaces in time polynomial in $t$ (where $t$ is the number of halfspaces forming an optimal polytope). This is an exponential improvement over the state of the art [Arriaga and Vempala, 2006]. We also improve over the super-polynomial-in-$t$ algorithm of Klivans and Servedio [2008], while achieving a better sample complexity. Finally, we provide the first nearly matching hardness-of-approximation lower bound, whence our claim of near optimality.

Critical initialisation for deep signal propagation in noisy rectifier neural networks

Stochastic regularisation is an important weapon in the arsenal of a deep learning practitioner. However, despite recent theoretical advances, our understanding of how noise influences signal propagation in deep neural networks remains limited. By extending recent work based on mean field theory, we develop a new framework for signal propagation in stochastic regularised neural networks. Our noisy signal propagation theory can incorporate several common noise distributions, including additive and multiplicative Gaussian noise as well as dropout. We use this framework to investigate initialisation strategies for noisy ReLU networks. We show that no critical initialisation strategy exists using additive noise, with signal propagation exploding regardless of the selected noise distribution. For multiplicative noise (e.g. dropout), we identify alternative critical initialisation strategies that depend on the second moment of the noise distribution. Simulations confirm that our proposed initialisation is able to stably propagate signals in deep networks, while using an initialisation disregarding noise fails to do so. In experiments on real-world data, the impact of this is that noisy ReLU models using dropout with our initialisation are able to train on MNIST and CIFAR-10 at large depths where the standard initialisation strategy fails.

Insights on representational similarity in neural networks with canonical correlation

Comparing different neural network representations and determining how representations evolve over time remain challenging open questions in our understanding of the function of neural networks. Comparing representations in neural networks is fundamentally difficult as the structure of representations varies greatly, even across groups of networks trained on identical tasks, and over the course of training. Here, we develop projection weighted CCA (Canonical Correlation Analysis) as a tool for understanding neural networks, building off of SVCCA, a recently proposed method. We first improve the core method, showing how to differentiate between signal and noise, and then apply this technique to compare across a group of CNNs, demonstrating that networks which generalize converge to more similar representations than networks which memorize, that wider networks converge to more similar solutions than narrow networks, and that trained networks converge to distinct clusters with diverse representations. We also investigate the representational dynamics of RNNs, across both training and sequential timesteps, finding that RNNs converge in a bottom-up pattern over the course of training and that the hidden state is highly variable over the course of a sequence, even when accounting for linear transforms. Together, these results provide new insights into the function of CNNs and RNNs, and demonstrate the utility of using CCA to understand representations.

Variational Inference with Tail Adapted f-Divergence

Variational inference with alpha-divergences has been widely used in modern probabilistic machine learning. Compared to Kullback-Leibler (KL) divergence, a major advantage of using alpha-divergences is their mass-covering property. However, alpha-divergences require importance sampling to estimate and optimize, which can be extremely ineffective when the importance weights have a heavy tail. In this paper, we propose new variants of f-divergences that adaptively change with the tail of the importance ratios. Compared to alpha-divergences, our approach theoretically guarantees finite mean of the importance weights and simultaneously produces overdispersed approximations. We test our methods on Bayesian neural networks and reinforcement learning, in which our method is applied to improve a recent soft actor-critic (SAC) algorithm. Our results show that our approach yields significant advantages compared with classic KL and alpha-divergence based VI.

Mental Sampling in Multimodal Representations

Both resources in the natural environment and concepts in a semantic space are distributed "patchily", with large gaps in between the patches. To describe people's internal and external foraging behavior, various random walk models have been proposed. In particular, internal foraging has been modeled as sampling: in order to gather relevant information for making a decision, people draw samples from a mental representation using random-walk algorithms such as Markov chain Monte Carlo (MCMC). However, two common empirical observations argue against people using simple sampling algorithms such as MCMC for internal foraging. First, the distance between samples is often best described by a Lévy flight distribution: the probability of the distance between two successive locations follows a power law in that distance. Second, humans and other animals produce long-range, slowly decaying autocorrelations characterized as 1/f-like fluctuations, instead of the 1/f^2 fluctuations produced by random walks. We propose that mental sampling is not done by simple MCMC, but is instead adapted to multimodal representations and is implemented by Metropolis-coupled Markov chain Monte Carlo (MC3), one of the first algorithms developed for sampling from multimodal distributions. MC3 involves running multiple Markov chains in parallel but with target distributions of different temperatures, and it swaps the states of the chains whenever a better location is found. Heated chains more readily traverse valleys in the probability landscape to propose moves to far-away peaks, while the colder chains make the local steps that explore the current peak or patch. We show that MC3 generates distances between successive samples that follow a Lévy flight distribution and produces 1/f-like autocorrelations, providing a single mechanistic account of these two puzzling empirical phenomena of internal foraging.
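
A compact sketch of MC3 on a two-mode toy density (our illustration; the temperatures, proposal scale and swap schedule are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
log_p = lambda x: np.logaddexp(-0.5 * (x - 5.0) ** 2, -0.5 * (x + 5.0) ** 2)
temps = [1.0, 2.0, 4.0, 8.0]  # chain i targets p(x)^(1/temps[i])
xs = np.zeros(len(temps))
samples = []
for _ in range(20_000):
    for i, T in enumerate(temps):            # within-chain Metropolis moves
        prop = xs[i] + rng.normal()
        if np.log(rng.random()) < (log_p(prop) - log_p(xs[i])) / T:
            xs[i] = prop
    i = rng.integers(0, len(temps) - 1)      # propose swapping chains i, i+1
    log_a = (1.0 / temps[i] - 1.0 / temps[i + 1]) * (log_p(xs[i + 1]) - log_p(xs[i]))
    if np.log(rng.random()) < log_a:
        xs[i], xs[i + 1] = xs[i + 1], xs[i]
    samples.append(xs[0])                    # keep only the cold chain
```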

Adversarially Robust Optimization with Gaussian Processes

In this paper, we consider the problem of Gaussian process (GP) optimization with an added robustness requirement: the returned point may be perturbed by an adversary, and we require the function value to degrade as little as possible as a result. This problem is motivated by settings where the underlying functions during optimization and implementation stages are different (e.g., due to time variations), or when one is interested in finding an entire region of good inputs rather than only a single point. We show that standard GP optimization algorithms do not exhibit the desired robustness properties, and give a novel confidence-bound-based algorithm StableOPT for this purpose. We rigorously establish the required number of samples for StableOPT to find a near-optimal point, and we complement this guarantee with an algorithm-independent lower bound. We experimentally demonstrate a variety of potential applications of interest on real-world data sets, and we show that StableOPT consistently succeeds in finding a stable maximizer where several baseline methods fail.

Learning to Multitask

Multitask learning has shown promising performance in many applications, and many multitask models have been proposed. In order to identify an effective multitask model for a given multitask problem, we propose a learning framework called learning to multitask (L2MT). To achieve this goal, L2MT exploits historical multitask experience, organized as a training set consisting of tuples, each of which contains a multitask problem with multiple tasks, a multitask model, and the relative test error. Based on this training set, L2MT first uses a proposed layerwise graph neural network to learn task embeddings for all the tasks in a multitask problem, and then learns an estimation function that predicts the relative test error from the task embeddings and a representation of the multitask model under a unified formulation. Given a new multitask problem, the estimation function is used to identify a suitable multitask model. Experiments on benchmark datasets show the effectiveness of the proposed L2MT method.

Loss Functions for Multiset Prediction

We study the problem of multiset prediction. The goal of multiset prediction is to train a predictor that maps an input to a multiset consisting of multiple items. Unlike in existing supervised learning problems such as classification, ranking and sequence generation, there is no known order among the items in a target multiset, and each item may appear more than once in the multiset, making this problem extremely challenging. In this paper, we propose a novel multiset loss function by viewing this problem from the perspective of sequential decision making. The proposed multiset loss function is empirically evaluated on two families of datasets, one synthetic and the other real, with varying levels of difficulty, against various baseline loss functions including reinforcement learning, sequence, and aggregated distribution matching loss functions. The experiments reveal the effectiveness of the proposed loss function over the others.

Computing Kantorovich-Wasserstein Distances on $d$-dimensional histograms using $(d+1)$-partite graphs

This paper presents a novel method to compute the exact Kantorovich-Wasserstein distance between a pair of $d$-dimensional histograms having $n$ bins each. We prove that this problem is equivalent to an uncapacitated minimum cost flow problem on a $(d+1)$-partite graph with $(d+1)n$ nodes and $dn^{\frac{d+1}{d}}$ arcs, whenever the cost is separable along the principal $d$-dimensional directions. We numerically demonstrate the benefits of our approach by computing the Kantorovich-Wasserstein distance of order 2 on two sets of instances: grayscale images and $d$-dimensional biomedical histograms. On these types of instances, our approach is competitive with state-of-the-art optimal transport algorithms.
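
To make the reduction concrete, here is a hedged toy example of exact optimal transport posed as a minimum cost flow, using the naive bipartite formulation on two tiny 1-D integer histograms; the paper's contribution is the much smaller (d+1)-partite construction, which is not reproduced here. The bin positions and masses are invented for the example.

```python
# Optimal transport between two small integer histograms as a min-cost flow.
import networkx as nx

h1 = {0: 3, 1: 1}          # source histogram: bin position -> mass
h2 = {2: 2, 3: 2}          # sink histogram

G = nx.DiGraph()
for p, m in h1.items():
    G.add_node(("a", p), demand=-m)          # negative demand = supply
for q, m in h2.items():
    G.add_node(("b", q), demand=m)
for p in h1:
    for q in h2:
        G.add_edge(("a", p), ("b", q), weight=abs(p - q))  # ground cost |p - q|

flow = nx.min_cost_flow(G)
print(nx.cost_of_flow(G, flow))  # order-1 Kantorovich-Wasserstein distance (unnormalized)
```

The naive bipartite graph has a quadratic number of arcs; the paper's (d+1)-partite graph exploits separable costs to cut that count drastically.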

Neural Interaction Transparency (NIT): Disentangling Learned Interactions for Improved Interpretability

Neural networks are known to model statistical interactions, but they entangle the interactions at intermediate hidden layers for shared representation learning. We propose a framework, Neural Interaction Transparency (NIT), that disentangles interactions by counteracting the shared learning across different interactions to obtain their intrinsic lower-order and interpretable structure. This is done through a novel regularizer that directly penalizes interaction order. We show that disentangling interactions reduces a feedforward neural network to a generalized additive model with interactions, which can lead to transparent models that perform comparably to state-of-the-art models. NIT is also flexible and efficient; it can learn generalized additive models with interactions of maximum order K by training only $O(1)$ models.

CapProNet: Deep Feature Learning via Orthogonal Projections onto Capsule Subspaces

In this paper, we formalize the idea behind capsule nets of using a capsule vector rather than a neuron activation to predict the label of samples. To this end, we propose to learn a group of capsule subspaces onto which an input feature vector is projected. The lengths of the resultant capsules are then used to score the probability of belonging to different classes. We train such a Capsule Projection Network (CapProNet) by learning an orthogonal projection matrix for each capsule subspace, and show that each capsule subspace is updated until it contains the input feature vectors corresponding to the associated class. With the low dimensionality of capsule subspaces and an iterative method to estimate the matrix inverse, only a negligible computing overhead is incurred to train the network. Experimental results on image datasets show that the presented network can greatly improve the performance of state-of-the-art ResNet backbones by 10-20% with almost the same computing cost.
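
A minimal numpy sketch of the scoring rule the abstract describes: project a feature vector onto one subspace per class and use the projection length as the class score. The dimensions and the random stand-ins for the learned bases are illustrative assumptions.

```python
# Score classes by the length of the orthogonal projection onto each capsule subspace.
import numpy as np

rng = np.random.default_rng(0)
d, c, num_classes = 128, 16, 10     # feature dim, subspace dim, number of classes
x = rng.normal(size=d)              # stand-in for a backbone feature vector
bases = [rng.normal(size=(d, c)) for _ in range(num_classes)]  # stand-ins for learned W_c

scores = []
for W in bases:
    # Orthogonal projection onto span(W): P = W (W^T W)^{-1} W^T.
    P = W @ np.linalg.solve(W.T @ W, W.T)
    scores.append(np.linalg.norm(P @ x))  # capsule length scores class membership

pred = int(np.argmax(scores))
```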

Gamma-Poisson Dynamic Matrix Factorization Embedded with Metadata Influence

A conjugate Gamma-Poisson model for Dynamic Matrix Factorization incorporating metadata influence (mGDMF for short) is proposed to effectively and efficiently model massive, sparse and dynamic data in recommendations. Modeling recommendation problems with a massive number of ratings and very sparse or even no ratings on some users/items in a dynamic setting is very demanding and poses critical challenges for well-studied matrix factorization models due to the large-scale, sparse and dynamic nature of the data. Our proposed mGDMF tackles these challenges with three strategies: (1) constructing a stable Gamma-Markov chain model that smoothly drifts over time by combining both static and dynamic latent features of the data; (2) incorporating user/item metadata into the model to tackle sparse ratings; and (3) undertaking stochastic variational inference to efficiently handle massive data. mGDMF is conjugate, dynamic and scalable. Experiments show that mGDMF significantly (both effectively and efficiently) outperforms state-of-the-art static and dynamic models on large, sparse and dynamic data.

Masking: A New Perspective of Noisy Supervision

It is important to learn classifiers under noisy labels because such labels are ubiquitous. As noisy labels are corrupted from ground-truth labels by an unknown noise transition matrix, the accuracy of classifiers can be improved by estimating this matrix, without introducing either sample-selection or regularization biases. However, such estimation is often inexact, which inevitably degrades the accuracy of classifiers. The inexact estimation is due to either a heuristic trick, or brute-force learning by deep networks on a finite dataset. In this paper, we present a human-assisted approach called ``\textit{masking}''. Masking conveys human cognition of invalid class transitions and naturally speculates the structure of the noise transition matrix. Given this structure information, we only learn the sparse noise transition probabilities, which reduces the estimation burden. To instantiate this approach, we derive a structure-aware probabilistic model, which incorporates a structure prior. During the model realization, we address the challenges of structure extraction and structure alignment in a principled way. Empirical results on benchmark datasets with three noise structures show that our approach can significantly improve the robustness of classifiers.

On GANs and GMMs

A longstanding problem in machine learning is to find unsupervised methods that can learn the statistical structure of high-dimensional signals. In recent years, GANs have gained much attention as a possible solution to the problem, and in particular have shown the ability to generate remarkably realistic high-resolution images. At the same time, many authors have pointed out that GANs may fail to model the full distribution ("mode collapse") and that using the learned models for anything other than generating samples may be very difficult. In this paper, we examine the utility of GANs in learning statistical models of images by comparing them to perhaps the simplest statistical model, the Gaussian Mixture Model. First, we present a simple method to evaluate generative models based on the relative proportions of samples that fall into predetermined bins. Unlike previous automatic methods for evaluating models, our method does not rely on an additional neural network, nor does it require approximating intractable computations. Second, we compare the performance of GANs to GMMs trained on the same datasets. While GMMs have previously been shown to be successful in modeling small patches of images, we show how to train them on full-sized images despite the high dimensionality. Our results show that GMMs can generate realistic samples (although less sharp than those of GANs) but also capture the full distribution, which GANs fail to do. Furthermore, GMMs allow efficient inference and explicit representation of the underlying statistical structure. Finally, we discuss how a pix2pix network can be used to add high-resolution details to GMM samples while maintaining the basic diversity.
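
One plausible instantiation of the proposed bin-based evaluation, sketched under assumptions: bins are taken to be Voronoi cells from k-means on the training data, and bins where real and generated proportions differ are counted with a two-proportion z-test. The bin count and test threshold are illustrative choices, not necessarily the paper's exact protocol.

```python
# Count bins where real and generated sample proportions differ significantly.
import numpy as np
from sklearn.cluster import KMeans

def differing_bins(train, generated, n_bins=50, z_thresh=2.0):
    km = KMeans(n_clusters=n_bins, n_init=10).fit(train)
    n, m = len(train), len(generated)
    p = np.bincount(km.predict(train), minlength=n_bins) / n
    q = np.bincount(km.predict(generated), minlength=n_bins) / m
    # Two-proportion z-test per bin, using the pooled proportion.
    pooled = (p * n + q * m) / (n + m)
    se = np.sqrt(pooled * (1 - pooled) * (1 / n + 1 / m))
    z = np.abs(p - q) / np.maximum(se, 1e-12)
    return int(np.sum(z > z_thresh))   # number of statistically different bins
```

A model that mode-collapses leaves many training bins nearly empty of generated samples, so this count directly exposes missing modes without training an extra network.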

Practical Deep Stereo (PDS): Toward applications-friendly deep stereo matching

End-to-end deep-learning networks have recently demonstrated extremely good performance for stereo matching. However, existing networks are difficult to use in practical applications since (1) they are memory-hungry and unable to process even modest-size images, and (2) they have to be fully re-trained to handle a different disparity range. The Practical Deep Stereo (PDS) network that we propose addresses both issues: First, its architecture relies on novel bottleneck modules that drastically reduce the memory footprint at inference time, and additional design choices allow handling larger image sizes during training. This results in a model that leverages large image context to resolve matching ambiguities. Second, a novel sub-pixel cross-entropy loss combined with a MAP estimator makes this network less sensitive to ambiguous matches, and applicable to any disparity range without re-training. We compare PDS to state-of-the-art methods published over recent months, and demonstrate its superior performance on the FlyingThings3D and KITTI datasets.

A Bayes-Sard Cubature Method

This paper focuses on the formulation of numerical integration as an inferential task. To date, research effort has largely focused on the development of Bayesian cubature, whose distributional output provides uncertainty quantification for the integral. However, the point estimators associated with Bayesian cubature can be inaccurate and acutely sensitive to the prior when the domain is high-dimensional. To address these drawbacks we introduce Bayes–Sard cubature, a probabilistic framework that combines the flexibility of Bayesian cubature with the robustness of well-established classical cubatures. This is achieved by considering a Gaussian process model for the integrand whose mean is a parametric regression model, with an improper flat prior on each regression coefficient. The features in the regression model consist of test functions which are guaranteed to be exactly integrated, with the remaining degrees of freedom afforded to the non-parametric part. The asymptotic convergence of the Bayes–Sard cubature method is established and the theoretical results are numerically verified. In particular, we report a two-orders-of-magnitude reduction in error compared to Bayesian cubature in the context of a high-dimensional financial integral.

Dual Swap Disentangling

Learning interpretable disentangled representations is a crucial yet challenging task. In this paper, we propose a weakly semi-supervised method, termed \emph{Dual Swap Disentangling~(DSD)}, for disentangling using both labeled and unlabeled data. Unlike conventional weakly supervised methods that rely on full annotations on groups of samples, we require only limited annotations on paired samples that indicate a shared attribute such as color. Our model takes the form of a dual autoencoder structure. To achieve disentangling using the labeled pairs, we follow an ``encoding-swap-decoding'' process, where we first swap the parts of their encodings corresponding to the shared attribute, and then decode the obtained hybrid codes to reconstruct the original input pairs. For unlabeled pairs, we follow the ``encoding-swap-decoding'' process twice on designated encoding parts and enforce the final outputs to approximate the input pairs. By isolating parts of the encoding and swapping them back and forth, we impose dimension-wise modularity and portability on the encodings of the unlabeled samples, which implicitly encourages disentangling under the guidance of labeled pairs. This dual swap mechanism, tailored for the semi-supervised setting, turns out to be very effective. Experiments on image datasets from a wide range of domains show that our model yields state-of-the-art disentangling performance.

Diverse Ensemble Evolution: Curriculum based Data-Model Marriage

We study how to train an ensemble of models based on their changing diversity requirements and expertise. Previous ensemble methods usually determine diversity before training starts, either by resampling training data or via random initialization. Instead, we propose ``Diverse Ensemble Evolution (DivE$^2$),'' a method that assigns data to models at each training epoch based on the models' capabilities and a scheduled diversity requirement. DivE$^2$ starts by selecting easy-to-learn samples for every model, and slowly moves towards selecting, for each data sample, the models with accurate predictions on it. To expand the realm of expertise of each model while enforcing diversity over all models, we propose an intra-model diversity term on the data assigned to each model, and an inter-model diversity term to penalize redundancy over the data assigned to different models. We formulate this data assignment problem as a generalized bipartite matching problem with two partition matroid constraints. DivE$^2$ solves a sequence of continuous-combinatorial optimizations with slowly varying objectives and constraints. The combinatorial part handles the data assignment via submodular maximization, and the continuous part updates the models based on the assigned data. In experiments, DivE$^2$ outperforms other ensemble training methods under several inference techniques while maintaining competitive efficiency.

Binary Classification from Positive-Confidence Data

Reducing labeling costs in supervised learning is a critical issue in many practical machine learning applications. In this paper, we consider positive-confidence (Pconf) classification, the problem of training a binary classifier only from positive data equipped with confidence. Pconf classification can be regarded as a discriminative extension of one-class classification (which is aimed at ``describing'' the positive class by clustering-related methods), with the ability to tune hyper-parameters for ``classifying'' positive and negative samples. Pconf classification is also related to positive-unlabeled (PU) classification (which uses hard-labeled positive data and unlabeled data), but the difference is that it enables us to avoid estimating the class priors, which is a critical bottleneck in typical PU classification methods. For the Pconf classification problem, we provide a simple empirical risk minimization framework and give a formulation for linear-in-parameter models that can be implemented easily and computationally efficiently. We also theoretically establish the consistency and an estimation error bound for Pconf classification, and demonstrate the practical usefulness of the proposed method for deep neural networks through experiments.

Deep Generative Models for Distribution-Preserving Lossy Compression

We propose and study the problem of distribution-preserving lossy compression. Motivated by recent advances in extreme image compression which make it possible to maintain artifact-free reconstructions even at very low bitrates, we propose to optimize the rate-distortion tradeoff under the constraint that the reconstructed samples follow the distribution of the training data. Such a compression system recovers both ends of the spectrum: at zero bitrate it learns a generative model of the data, and at high enough bitrates it achieves perfect reconstruction. Furthermore, for intermediate bitrates it smoothly interpolates between matching the distribution of the training data and perfectly reconstructing the training samples. We study several methods to approximately solve the proposed optimization problem, including a novel combination of Wasserstein GAN and Wasserstein Autoencoder, and present strong theoretical and empirical results for the proposed compression system.

Exact natural gradient in deep linear networks and its application to the nonlinear case

Stochastic gradient descent (SGD) remains the method of choice for deep learning, despite the limitations arising for ill-behaved objective functions. In cases where it could be estimated, the natural gradient has proven very effective at mitigating the catastrophic effects of pathological curvature in the objective function, but little is known theoretically about its convergence properties, and it has yet to find a practical implementation that would scale to very deep and large networks. Here, we derive an exact expression for the natural gradient in deep linear networks, which exhibit pathological curvature similar to the nonlinear case, and provide for the first time an analytical solution for its convergence rate, showing that natural gradient descent converges exponentially fast to the global minimum in parameter space. Our expression for the natural gradient is surprisingly simple, computationally tractable, and explains why some approximations proposed previously work well in practice. This opens new avenues for approximating the natural gradient in the nonlinear case, and we show that, promisingly, our online natural gradient descent outperforms SGD for MNIST autoencoders while sharing its computational simplicity.

Constructing Fast Network through Deconstruction of Convolution

Convolutional neural networks have achieved great success in various vision tasks; however, they incur heavy resource costs. Network accuracy can be improved rapidly by using deeper and wider networks, but in environments with limited resources (e.g., mobile applications), such heavy networks may not be usable. This study shows that naive convolution can be deconstructed into a shift operation and a pointwise convolution. To cope with various convolutions, we propose a new shift operation called the active shift layer (ASL) that formulates the amount of shift as a learnable function with shift parameters. This new layer can be optimized end-to-end through backpropagation and can provide optimal shift values. Finally, we apply this layer to a light and fast network that surpasses existing state-of-the-art networks.
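
A minimal numpy sketch of the decomposition the abstract describes: shift each input channel spatially, then mix channels with a 1x1 (pointwise) convolution. The integer shifts here are fixed for illustration; ASL itself learns real-valued shift parameters end-to-end, which this sketch does not attempt.

```python
# Shift-then-pointwise decomposition of convolution on a (C, H, W) feature map.
import numpy as np

def shift_pointwise(x, shifts, w):
    # x: (C_in, H, W); shifts: one (dy, dx) per input channel; w: (C_out, C_in).
    c_in, h, wd = x.shape
    shifted = np.zeros_like(x)
    for c, (dy, dx) in enumerate(shifts):
        ys = slice(max(dy, 0), h + min(dy, 0))    # source rows
        yd = slice(max(-dy, 0), h + min(-dy, 0))  # destination rows
        xs = slice(max(dx, 0), wd + min(dx, 0))
        xd = slice(max(-dx, 0), wd + min(-dx, 0))
        shifted[c, yd, xd] = x[c, ys, xs]         # zero padding at the border
    # A 1x1 convolution is just a channel-mixing matrix multiply at each pixel.
    return np.einsum("oc,chw->ohw", w, shifted)

x = np.random.randn(4, 8, 8)
out = shift_pointwise(x, [(0, 0), (1, 0), (0, 1), (-1, -1)], np.random.randn(6, 4))
```

The shift itself has no parameters or multiplications, so all the learnable computation lives in the cheap pointwise convolution.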

Memory Replay GANs: Learning to Generate New Categories without Forgetting

Previous works on sequential learning address the problem of forgetting in discriminative models. In this paper we consider the case of generative models. In particular, we investigate generative adversarial networks (GANs) in the task of learning new categories in a sequential fashion. We first show that sequential fine-tuning renders the network unable to properly generate images from previous categories (i.e., forgetting). Addressing this problem, we propose Memory Replay GANs (MeRGANs), a conditional GAN framework that integrates a memory replay generator. We study two methods to prevent forgetting by leveraging these replays, namely joint training with replay and replay alignment. Qualitative and quantitative experimental results on the MNIST, SVHN and LSUN datasets show that our memory replay approach can generate competitive images while significantly mitigating the forgetting of previous categories.

Automating Bayesian optimization with Bayesian optimization

Bayesian optimization is a powerful tool for the global optimization of expensive functions. One of its key components is the underlying probabilistic model used for the objective function f. In practice, however, it is often unclear how one should appropriately choose a model, especially when gathering data is expensive. In this work, we introduce a novel automated Bayesian optimization approach that dynamically selects promising models for explaining the observed data, using Bayesian optimization in the model space. Crucially, we account for the uncertainty in the model choice; our method is capable of using multiple models to represent its current belief about f and subsequently using this information for decision making. We argue, and demonstrate empirically, that our approach automatically finds suitable models for the objective function, which ultimately results in more efficient optimization.

Sequential Test for the Lowest Mean: From Thompson to Murphy Sampling

Learning the minimum/maximum mean among a finite set of distributions is a fundamental sub-problem in planning, game tree search and reinforcement learning. We formalize this learning task as the problem of sequentially testing how the minimum mean among a finite set of distributions compares to a given threshold. We develop refined non-asymptotic lower bounds, which show that optimality mandates very different sampling behavior for a low vs. a high true minimum. We show that Thompson Sampling and the intuitive Lower Confidence Bounds policy each nail only one of these cases. We develop a novel approach that we call Murphy Sampling (MS). Even though it entertains exclusively low true minima, we prove that MS is optimal for both possibilities. We then design advanced self-normalized deviation inequalities, fueling more aggressive stopping rules. We complement our theoretical guarantees with experiments showing that MS works best in practice.

Distributed Learning without Distress: Privacy-Preserving Empirical Risk Minimization

We consider the scenario where several independent data owners wish to collaboratively learn a model over their sensitive datasets, without exposing their private data. Our approach combines differential privacy with secure multi-party computation to achieve private distributed machine learning. In particular, we explore two popular methods of differential privacy, output perturbation and gradient perturbation, and advance the state-of-the-art for both methods in the distributed learning setting. In our output perturbation method, the parties combine their local models within a secure computation and then add the required differential privacy noise within the secure computation before revealing the model. In our gradient perturbation method, the data owners collaboratively train a global model via an iterative learning algorithm. At each iteration, the parties aggregate their local gradients within a secure computation, adding sufficient noise to ensure privacy before the gradient updates are revealed. In both cases, we show that the noise can be reduced in the multi-party setting by adding the noise inside the secure computation after aggregation, asymptotically improving upon the best previous results. Thorough experiments on real-world datasets corroborate our theory.
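
A toy numpy illustration of why adding noise after aggregation helps, under simplifying assumptions: each party holds a local model vector, the per-party noise scale sigma is fixed, and the secure-computation machinery itself is elided.

```python
# Compare noise added per party (local perturbation) vs. once after aggregation.
import numpy as np

rng = np.random.default_rng(0)
m, d, sigma = 10, 5, 1.0
local_models = rng.normal(size=(m, d))

# Naive: each party perturbs its own model, then the models are averaged.
noisy_each = local_models + rng.normal(scale=sigma, size=(m, d))
avg_naive = noisy_each.mean(axis=0)     # effective noise std: sigma / sqrt(m)

# Aggregate-then-perturb: average inside the secure computation, add one noise
# draw calibrated to a single party's contribution to the average.
avg_secure = local_models.mean(axis=0) + rng.normal(scale=sigma / m, size=d)
# Effective noise std: sigma / m, i.e., a sqrt(m) improvement over the naive
# scheme (under the simplified calibration assumed in this sketch).
```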

A no-regret generalization of hierarchical softmax to extreme multi-label classification

Extreme multi-label classification (XMLC) is the problem of tagging an instance with a small subset of relevant labels chosen from an extremely large pool of possible labels. Large label spaces can be efficiently handled by organizing labels as a tree, as in the hierarchical softmax (HSM) approach. In this paper, we investigate Probabilistic Label Trees (PLTs), which have recently been devised for tackling XMLC problems. We show that PLT is a no-regret multi-label generalization of HSM when precision@k is used as the model evaluation metric. Critically, we prove that the pick-one-label heuristic---a reduction technique from multi-label to multi-class that is routinely used along with HSM---is not consistent in general. We also show that our implementation of PLTs, referred to as XT, obtains significantly better results than HSM with the pick-one-label heuristic and than XML-CNN, a deep network specifically designed for XMLC problems. Moreover, XT is competitive with many state-of-the-art approaches in terms of statistical performance, model size and prediction time, which makes it amenable to deployment in an online system.

Efficient Formal Safety Analysis of Neural Networks

Neural networks are increasingly deployed in real-world safety-critical domains such as autonomous driving, aircraft collision avoidance, and malware detection. However, these networks have been shown to often mispredict on inputs with minor adversarial or even accidental perturbations. The consequences of such errors can be disastrous and even potentially fatal, as shown by the recent Tesla autopilot crash. Thus, there is an urgent need for formal analysis systems that can rigorously check neural networks for violations of different safety properties such as robustness against adversarial perturbations within a certain L-norm of a given image. An effective safety analysis system for a neural network must be able to either ensure that a safety property is satisfied by the network or find a counterexample, i.e., an input for which the network will violate the property. Unfortunately, most existing techniques for performing such analysis struggle to scale beyond very small networks, and the ones that can scale to larger networks suffer from high false positive rates and cannot produce concrete counterexamples in case of a property violation. In this paper, we present a new efficient approach for rigorously checking different safety properties of neural networks that significantly outperforms existing approaches by multiple orders of magnitude. Our approach can check different safety properties and find concrete counterexamples for networks that are 10x larger than the ones supported by existing analysis techniques. We believe that our approach to estimating tight output bounds of a network for a given input range can also help improve the explainability of neural networks and guide the training of more robust neural networks.

Bayesian Distributed Stochastic Gradient Descent

We introduce Bayesian distributed stochastic gradient descent (BDSGD), a high-throughput algorithm for training deep neural networks on parallel clusters. This algorithm uses amortized inference in a deep generative model to perform joint posterior predictive inference of mini-batch gradient computation times in a compute-cluster-specific manner. Specifically, our algorithm mitigates the straggler effect in synchronous, gradient-based optimization by choosing an optimal cutoff beyond which mini-batch gradient messages from slow workers are ignored. In our experiments, we show that eagerly discarding the mini-batch gradient computations of stragglers not only increases throughput but actually increases the overall rate of convergence as a function of wall-clock time by virtue of eliminating idleness. The principal novel contribution and finding of this work goes beyond this by demonstrating that using the predicted run-times from a generative model of cluster worker performance improves substantially over the static-cutoff prior art, leading to reduced deep neural net training times on large computer clusters.

Visualizing the Loss Landscape of Neural Nets

Neural network training relies on our ability to find "good" minimizers of highly non-convex loss functions. It is well known that certain network architecture designs (e.g., skip connections) produce loss functions that train more easily, and that well-chosen training parameters (batch size, learning rate, optimizer) produce minimizers that generalize better. However, the reasons for these differences, and their effect on the underlying loss landscape, are not well understood. In this paper, we explore the structure of neural loss functions, and the effect of loss landscapes on generalization, using a range of visualization methods. First, we introduce a simple "filter normalization" method that helps us visualize loss function curvature and make meaningful side-by-side comparisons between loss functions. Then, using a variety of visualizations, we explore how network architecture affects the loss landscape, and how training parameters affect the shape of minimizers.
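
A hedged sketch of the filter normalization step as the abstract describes it: draw a random direction with the same shape as the weights, then rescale each filter of the direction to the norm of the corresponding weight filter. The treatment of the filter axis and the epsilon guard are assumptions of this sketch.

```python
# Build a filter-normalized random direction for 1-D loss-landscape plots.
import numpy as np

def filter_normalized_direction(weights, rng):
    # weights: list of numpy arrays, one per layer.
    direction = []
    for w in weights:
        d = rng.normal(size=w.shape)
        if w.ndim > 1:
            # Treat the leading axis as the filter axis; match norms filter-wise.
            for i in range(w.shape[0]):
                d[i] *= np.linalg.norm(w[i]) / (np.linalg.norm(d[i]) + 1e-10)
        else:
            d *= np.linalg.norm(w) / (np.linalg.norm(d) + 1e-10)
        direction.append(d)
    return direction

# The landscape plot then traces loss(theta + alpha * d) over a grid of alpha;
# filter-wise rescaling removes the scale invariance that otherwise makes
# side-by-side comparisons between networks meaningless.
```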

The Limits of Post-Selection Generalization

While statistics and machine learning offer numerous methods for ensuring generalization, these methods often fail in the presence of post selection---the common practice in which the choice of analysis depends on previous interactions with the same dataset. A recent line of work has introduced powerful, general-purpose algorithms that ensure a property called post hoc generalization (Cummings et al., COLT'16), which says that no person, when given the output of the algorithm, should be able to find any statistic for which the data differs significantly from the population it came from. In this work we show several limitations on the power of algorithms satisfying post hoc generalization. First, we show a tight lower bound on the error of any algorithm that satisfies post hoc generalization and answers adaptively chosen statistical queries, establishing a strong barrier to progress in post-selection data analysis. Second, we show that post hoc generalization is not closed under composition, despite many examples of such algorithms exhibiting strong composition properties.

Graph Convolutional Policy Network for Goal-Directed Molecular Graph Generation

Generating novel graph structures that optimize given objectives while obeying some given underlying rules is fundamental for chemistry, biology and social science research. This is especially important in the task of molecular graph generation, whose goal is to discover novel molecules with desired properties such as drug-likeness and synthetic accessibility, while obeying physical laws such as chemical valency. However, designing models that find molecules optimizing desired properties while incorporating highly complex and non-differentiable rules remains a challenging task. Here we propose the Graph Convolutional Policy Network (GCPN), a general model based on graph convolutional networks for goal-directed graph generation through reinforcement learning. The model is trained to optimize domain-specific rewards and adversarial loss through policy gradient, and acts in an environment that incorporates domain-specific rules. Experimental results show that GCPN achieves a 61% improvement on chemical property optimization over state-of-the-art baselines while resembling known molecules, and a 184% improvement on the constrained property optimization task.

On Controllable Sparse Alternatives to Softmax

Converting an n-dimensional vector into a probability distribution over n objects is a commonly used component in many machine learning tasks such as multiclass classification, multilabel classification, and attention mechanisms. For this, several probability mapping functions have been proposed and employed in the literature, such as softmax, sum-normalization, spherical softmax, and sparsemax, but there is very little understanding of how they relate to each other. Further, none of the above formulations offers explicit control over the degree of sparsity. To address this, we develop a unified framework that encompasses all these formulations as special cases. This framework ensures simple closed-form solutions and the existence of sub-gradients suitable for learning via backpropagation. Within this framework, we propose two novel sparse formulations, sparseflex and sparsehourglass, that seek to provide control over the degree of desired sparsity. We further develop novel convex loss functions that help induce the behavior of the aforementioned formulations in the multilabel classification setting, showing improved performance. We also demonstrate empirically that the proposed formulations, when used to compute attention weights, achieve better or comparable performance on standard seq2seq tasks like neural machine translation and abstractive summarization.
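
For concreteness, here is a minimal numpy implementation of sparsemax (Martins & Astudillo, 2016), one of the existing formulations the unified framework covers; the paper's new sparseflex and sparsehourglass variants are not reproduced here.

```python
# Sparsemax: Euclidean projection of a score vector onto the probability simplex.
import numpy as np

def sparsemax(z):
    z_sorted = np.sort(z)[::-1]
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    # Support size: largest k with 1 + k * z_(k) > sum_{j <= k} z_(j).
    k_max = k[1 + k * z_sorted > cumsum][-1]
    tau = (cumsum[k_max - 1] - 1.0) / k_max
    return np.maximum(z - tau, 0.0)

print(sparsemax(np.array([2.0, 1.0, -1.0])))  # -> [1. 0. 0.]: exact zeros in the output
```

Unlike softmax, which always assigns strictly positive probability everywhere, sparsemax can zero out low-scoring entries exactly, which is the kind of sparsity the proposed formulations put under explicit control.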

L4: Practical loss-based stepsize adaptation for deep learning

We propose a stepsize adaptation scheme for stochastic gradient descent. It operates directly with the loss function and rescales the gradient in order to make fixed predicted progress on the loss. We demonstrate its capabilities by conclusively improving the performance of the Adam and Momentum optimizers. The enhanced optimizers with default hyperparameters consistently outperform their constant-stepsize counterparts, even the best-tuned ones, without a measurable increase in computational cost. The performance is validated on multiple architectures including dense nets, CNNs, ResNets, and the recurrent Differentiable Neural Computer, on classical datasets including MNIST, Fashion-MNIST, CIFAR10, and others.
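
A rough sketch of the loss-based rule as it can be derived from the abstract alone: under a linear model of the loss, choose the stepsize so that the predicted progress equals a fixed fraction alpha of the gap to the lowest loss seen. The alpha value, the gap factor, and the L_min bookkeeping below are assumptions; the paper's exact rule may differ.

```python
# Stepsize from requiring fixed predicted progress on the loss.
import numpy as np

def loss_based_stepsize(loss, l_min, grad, update_dir, alpha=0.15, eps=1e-12):
    # Linearization: L(theta - eta * v) ~= L - eta * <grad, v>.
    # Setting the predicted decrease to alpha * (L - L_min) gives:
    return alpha * (loss - l_min) / (np.dot(grad, update_dir) + eps)

# Toy quadratic example with the plain gradient as the update direction.
theta = np.array([3.0, -2.0])
l_min = np.inf
for _ in range(50):
    loss = 0.5 * np.dot(theta, theta)
    grad = theta
    l_min = min(l_min, loss)
    eta = loss_based_stepsize(loss, 0.9 * l_min, grad, grad)  # 0.9 keeps the gap positive
    theta = theta - eta * grad
```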

Learning Latent Subspaces in Variational Autoencoders

We present a method for training variational autoencoders on labelled datasets such that the information corresponding to the labels is encoded in explicitly predetermined subspaces of the latent space. We motivate our model from both an information-theoretic perspective and an adversarial game perspective. By separating labelled information into a less complicated space, we allow the model to more easily disentangle representations. This provides a form of semi-supervised learning of attributes. Since these subspaces can be chosen a priori, setting them to be low-dimensional provides a form of dimensionality reduction. We demonstrate the utility of our model on attribute manipulation tasks with several image datasets.

Turbo Learning for Captionbot and Drawingbot

We study in this paper the problems of both image captioning and text-to-image generation, and present a novel turbo learning approach to jointly training an image-to-text generator (a.k.a. captionbot) and a text-to-image generator (a.k.a. drawingbot). The key idea behind the joint training is that image-to-text generation and text-to-image generation, as dual problems, can form a closed loop that provides informative feedback to each other. Based on such feedback, we introduce a new loss metric that compares the original input with the output produced by the closed loop. Combined with the existing loss metrics used in the captionbot and drawingbot, this extra loss metric makes the jointly trained captionbot and drawingbot better than their separately trained counterparts. Furthermore, the turbo-learning approach enables semi-supervised learning, since the closed loop can provide pseudo-labels for unlabeled samples. Experimental results on the COCO dataset demonstrate that the proposed turbo learning significantly improves the performance of both the captionbot and the drawingbot.

Learning to Teach with Dynamic Loss Functions

Teaching is critical to human society: it is through teaching that prospective students are educated and human civilization can be inherited and advanced. A good teacher not only provides his/her students with qualified teaching materials (e.g., textbooks), but also sets up appropriate learning objectives (e.g., course projects and exams) that take into account the different situations of each student. When it comes to artificial intelligence, treating machine learning models as students, the loss functions that are optimized act as the perfect counterpart of the learning objectives set by the teacher. In this work, we explore the possibility of imitating human teaching behaviors by dynamically and automatically outputting appropriate loss functions to train machine learning models. Different from typical learning settings in which the loss function of a machine learning model is predefined and fixed, in our framework the loss function of a machine learning model (we call it the student) is defined by another machine learning model (we call it the teacher). The ultimate goal of the teacher model is to cultivate the student to achieve better performance, measured on a development dataset. Towards that end, similar to human teaching, the teacher, a parametric model, dynamically outputs different loss functions that will be used and optimized by its student model at different training stages. We develop an efficient learning method for the teacher model that makes gradient-based optimization possible, without resorting to ineffective solutions such as policy optimization. We name our method ``learning to teach with dynamic loss functions'' (L2T-DLF for short). Extensive experiments on real-world tasks including image classification and neural machine translation demonstrate that our method significantly improves the quality of various student models.

Multi-View Silhouette and Depth Decomposition for High Resolution 3D Object Representation

We consider the problem of scaling deep generative shape models to high resolution. Drawing motivation from the canonical view representation of objects, we introduce a novel method for the fast up-sampling of 3D objects in voxel space through networks that perform super-resolution on the six orthographic depth projections. This allows us to efficiently generate high-resolution objects without the cubic computational costs associated with voxel data. We further decompose the learning problem into silhouette and depth prediction to capture both structure and fine detail, easing the burden of an individual network generating sharp edges. We evaluate our work on multiple experiments concerning high-resolution 3D objects, and show our system is capable of accurately producing objects at resolutions as large as 512x512x512 -- the highest resolution reported for this task, to our knowledge. We achieve state-of-the-art performance on 3D object reconstruction from RGB images on the ShapeNet dataset, and further demonstrate the first effective 3D super-resolution method.

Size-Noise Tradeoffs in Generative Networks

Generative networks can modify their noise distributions to look like other noise distributions. First, we demonstrate a construction that allows ReLU networks to increase the dimensionality of their noise distribution by implementing a ``space-filling'' function based on iterated tent maps. We show this construction is optimal by analyzing the number of affine pieces in functions computed by multivariate ReLU networks. We also develop a toolkit of techniques for function approximation with neural networks, including a Taylor series approximation and a binary search gadget for computing function inverses. This toolkit provides efficient ways (using $\mathrm{polylog}(1/\epsilon)$ nodes) for networks to pass between univariate normal and uniform distributions.
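
A small numpy check of the building block behind the space-filling construction: the tent map is exactly representable with two ReLUs, and composing it k times yields 2^k affine pieces. This illustrates the mechanism only, not the paper's full dimension-increasing network.

```python
# The tent map as a two-ReLU network, iterated to multiply affine pieces.
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def tent(x):
    # On [0, 1]: tent(x) = 2x for x <= 1/2 and 2 - 2x for x > 1/2,
    # written using only ReLU units and affine combinations.
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

xs = np.linspace(0.0, 1.0, 9)
y = xs
for _ in range(3):
    y = tent(y)          # iterated tent map: 2^3 = 8 affine pieces on [0, 1]
print(np.round(y, 3))
```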

Online Adaptive Methods, Universality and Acceleration

We present a novel method for convex unconstrained optimization that, without any modifications, ensures: (1) an accelerated convergence rate for smooth objectives, (2) the standard convergence rate in the general (non-smooth) setting, and (3) the standard convergence rate in the stochastic optimization setting. To the best of our knowledge, this is the first method that simultaneously applies to all of the above settings. At the heart of our method is an adaptive learning rate rule that employs importance weights, in the spirit of adaptive online learning algorithms [duchi2011adaptive, levy2017online], combined with an update that linearly couples two sequences, in the spirit of [AllenOrecchia2017]. An empirical examination of our method demonstrates its applicability to the above-mentioned scenarios and corroborates our theoretical findings.

Compact Generalized Non-local Network

The non-local module is designed for capturing long-range spatio-temporal dependencies in images and videos. Although it has shown excellent performance, it lacks a mechanism to model the interactions between positions across channels, which are of vital importance in recognizing fine-grained objects and actions. To address this limitation, we generalize the non-local module and take the correlations between the positions of any two channels into account. This extension unifies second-order feature pooling and achieves state-of-the-art performance on a variety of fine-grained classification tasks. However, it also leads to an explosion in computational complexity. To alleviate this issue, we further propose a compact representation that reduces the high-dimensional feature space and the large computation burden involved. Moreover, we group the channels and apply our generalized non-local method within each group. Experimental results illustrate the significant improvements and practical applicability of the generalized non-local module on both fine-grained object and video classification. Code will be made publicly available to facilitate future research.

On the Local Hessian in Back-propagation

Back-propagation (BP) is the foundation for successfully training deep neural networks. However, BP sometimes has difficulty propagating a learning signal effectively through deep networks, e.g., the vanishing gradient phenomenon. Meanwhile, BP often works well when combined with design tricks such as orthogonal initialization, batch normalization and skip connections. There is no clear understanding of what is essential to the efficiency of BP. In this paper, we take one step towards clarifying this problem. We view BP as a solution of back-matching propagation, which minimizes a sequence of back-matching losses, each corresponding to one block of the network. We study the Hessian of the local back-matching loss (local Hessian) and connect it to the efficiency of BP. It turns out that those design tricks facilitate BP by improving the spectrum of the local Hessian. In addition, we can utilize the local Hessian to balance the training pace of each block and design new training algorithms. Based on a scalar approximation of the local Hessian, we propose a scale-amended SGD algorithm. We apply it to train neural networks with batch normalization, and achieve favorable results over vanilla SGD. This further corroborates the importance of the local Hessian.

The Everlasting Database: Statistical Validity at a Fair Price

The problem of handling adaptivity in data analysis, intentional or not, permeates a variety of fields, including test-set overfitting in ML challenges and the accumulation of invalid scientific discoveries. We propose a mechanism for answering an arbitrarily long sequence of potentially adaptive statistical queries, by charging a price for each query and using the proceeds to collect additional samples. Crucially, we guarantee statistical validity without any assumptions on how the queries are generated. We also ensure with high probability that the cost for $M$ non-adaptive queries is $O(\log M)$, while the cost to a potentially adaptive user who makes $M$ queries that do not depend on any others is $O(\sqrt{M})$.

Lipschitz-Margin Training: Scalable Certification of Perturbation Invariance for Deep Neural Networks

The high sensitivity of neural networks to malicious perturbations of their inputs causes security concerns. To take a steady step towards robust classifiers, we aim to create neural network models that are provably defended against perturbations. Prior certification work requires strong assumptions on network structures and massive computational costs, and thus its applications are limited. Based on the relationship between the Lipschitz constants and prediction margins, we present a computationally efficient calculation technique that lower-bounds the size of adversarial perturbations that can deceive networks, and that is widely applicable to various complicated networks. Moreover, we propose an efficient training procedure which robustifies networks and significantly improves the provably guarded areas around data points. In experimental evaluations, our method showed its ability to provide a non-trivial guarantee and improve robustness even for large networks.
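
A hedged sketch of the certification idea on a tiny ReLU network: the product of layer spectral norms gives a (loose) global Lipschitz bound, and combining it with the prediction margin lower-bounds the size of any label-flipping L2 perturbation. The sqrt(2) constant follows the standard margin argument; the paper's tighter per-network computation is not reproduced here.

```python
# Certified L2 radius from a Lipschitz bound and the prediction margin.
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(64, 32)) * 0.1
W2 = rng.normal(size=(10, 64)) * 0.1

def net(x):                          # tiny two-layer ReLU network
    return W2 @ np.maximum(W1 @ x, 0.0)

# ReLU is 1-Lipschitz, so the product of spectral norms bounds the whole net.
lipschitz = np.linalg.norm(W1, 2) * np.linalg.norm(W2, 2)

x = rng.normal(size=32)
logits = net(x)
top2 = np.sort(logits)[-2:]
margin = top2[1] - top2[0]
# Any perturbation with L2 norm below this radius provably cannot change the
# prediction under this bound; sqrt(2) converts the logit gap into a bound on
# the difference of two logits.
certified_radius = margin / (np.sqrt(2.0) * lipschitz)
```

The abstract's training procedure then enlarges this guarded area by encouraging large margins relative to the network's Lipschitz constant.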

Proximal SCOPE for Distributed Sparse Learning

Distributed sparse learning with a cluster of multiple machines has attracted much attention in machine learning, especially for large-scale applications with high-dimensional data. One popular way to implement sparse learning is to use L1 regularization. In this paper, we propose a novel method, called proximal SCOPE (pSCOPE), for distributed sparse learning with L1 regularization. pSCOPE is based on a cooperative autonomous local learning (CALL) framework. In the CALL framework of pSCOPE, we find that the data partition affects the convergence of the learning procedure, and subsequently we define a metric to measure the goodness of a data partition. Based on the defined metric, we theoretically prove that pSCOPE is convergent with a linear convergence rate if the data partition is good enough. We also prove that a better data partition implies a faster convergence rate. Furthermore, pSCOPE is communication-efficient. Experimental results on real datasets show that pSCOPE can outperform other state-of-the-art distributed methods for sparse learning.

On Coresets for Logistic Regression

Coresets are one of the central methods to facilitate the analysis of large data. We continue a recent line of research applying the theory of coresets to logistic regression. First, we show the negative result that no strongly sublinear-sized coresets exist for logistic regression. To deal with intractable worst-case instances we introduce a complexity measure $\mu(X)$, which quantifies the hardness of compressing a data set for logistic regression. $\mu(X)$ has an intuitive statistical interpretation that may be of independent interest. For data sets with bounded $\mu(X)$-complexity, we show that a novel sensitivity sampling scheme produces the first provably sublinear $(1\pm\epsilon)$-coreset. We illustrate the performance of our method by comparing to uniform sampling as well as to state-of-the-art methods in the area. The experiments are conducted on real-world benchmark data for logistic regression.

Approximating Real-Time Recurrent Learning with Random Kronecker Factors

Despite all the impressive advances of recurrent neural networks, sequential data is still in need of better modelling. Truncated backpropagation through time (TBPTT), the learning algorithm most widely used in practice, suffers from truncation bias, which drastically limits its ability to learn long-term dependencies. The Real-Time Recurrent Learning algorithm (RTRL) addresses this issue, but its high computational requirements make it infeasible in practice. The Unbiased Online Recurrent Optimization algorithm (UORO) approximates RTRL with a smaller runtime and memory cost, but with the disadvantage of obtaining noisy gradients that also limit its practical applicability. In this paper we propose the Kronecker Factored RTRL (KF-RTRL) algorithm, which uses a Kronecker product decomposition to approximate the gradients for a large class of RNNs. We show that KF-RTRL is an unbiased and memory-efficient online learning algorithm. Our theoretical analysis shows that, under reasonable assumptions, the noise introduced by our algorithm is not only stable over time but also asymptotically much smaller than that of the UORO algorithm. We also confirm these theoretical results experimentally. Further, we show empirically that KF-RTRL captures long-term dependencies and almost matches the performance of TBPTT on real-world tasks, by training Recurrent Highway Networks on a synthetic string memorization task and on the Penn TreeBank task, respectively. These results indicate that RTRL-based approaches might be a promising future alternative to TBPTT.

Contamination Attacks in Multi-Party Machine Learning

Machine learning is data hungry; the more data a model has access to during training, the more likely it is to perform well at inference time. Distinct parties may want to combine their local data to gain the benefits of a model trained on a large corpus of data. We consider such a case, where the local data remains private and only the model trained on the joint data is revealed. We show that there exist attacks that are stealthy and can compromise the integrity of the model. We then show how adversarial training can defend against such attacks.

An Improved Analysis of Alternating Minimization for Structured Multi-Response Regression

Multi-response linear models aggregate a set of vanilla linear models by assuming correlated noise across them, with an unknown covariance structure. To find the coefficient vector, estimators that jointly approximate the noise covariance are often preferred over simple linear regression in view of their superior empirical performance, and they can generally be solved by alternating-minimization-type procedures. Due to the non-convex nature of such joint estimators, the theoretical justification of their efficiency is typically challenging. The existing analyses fail to fully explain the empirical observations due to the assumption of resampling in the alternating procedures, which requires access to fresh samples in each iteration. In this work, we present a resampling-free analysis of the alternating minimization algorithm applied to multi-response regression. In particular, we focus on the high-dimensional setting of multi-response linear models with structured coefficient parameters, where the statistical error of a parameter can be expressed via a complexity measure, the Gaussian width, which is related to the assumed structure. More importantly, to the best of our knowledge, our result reveals for the first time that alternating minimization with random initialization can achieve the same performance as the well-initialized variant when solving this multi-response regression problem. Experimental results support our theoretical developments.

Incorporating Context into Language Encoding Models for fMRI

Language encoding models help explain language processing in the human brain by learning functions that predict brain responses from the language stimuli that elicited them. Current word embedding-based approaches treat each stimulus word independently and thus ignore the influence of context on language understanding. In this work we instead build encoding models using rich contextual representations derived from an LSTM language model. Our models show a significant improvement in encoding performance relative to state-of-the-art embeddings in nearly every brain area. By varying the amount of context used in the models and providing the models with distorted context, we show that this improvement is due to a combination of better word embeddings learned by the LSTM language model and contextual information. We are also able to use our models to map context sensitivity across the cortex. These results suggest that LSTM language models learn high-level representations that are related to representations in the human brain.

CatBoost: unbiased boosting with categorical features

This paper presents the key algorithmic techniques behind CatBoost, a new gradient boosting toolkit. Their combination leads to CatBoost outperforming other publicly available boosting implementations in terms of quality on a variety of datasets. Two critical algorithmic advances introduced in CatBoost are the implementation of ordered boosting, a permutation-driven alternative to the classic algorithm, and an innovative algorithm for processing categorical features. Both techniques were created to fight a prediction shift caused by a special kind of target leakage present in all currently existing implementations of gradient boosting algorithms. In this paper, we provide a detailed analysis of this problem and demonstrate that proposed algorithms solve it effectively, leading to excellent empirical results.
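
A minimal usage sketch of the public CatBoost API, with toy data and illustrative hyperparameter values:

```python
# Fit a small CatBoost model on data with a native categorical column.
from catboost import CatBoostClassifier

X = [["red", 1.0], ["blue", 2.0], ["red", 0.5], ["green", 3.0]]
y = [1, 0, 1, 0]

model = CatBoostClassifier(iterations=50, depth=4, learning_rate=0.1, verbose=False)
# Column 0 is categorical; CatBoost handles it natively via ordered target
# statistics, one of the two anti-leakage techniques the abstract describes.
model.fit(X, y, cat_features=[0])
print(model.predict([["red", 1.2]]))
```

Declaring categorical columns through cat_features (rather than one-hot encoding them by hand) is what lets the ordered-statistics machinery avoid the target leakage discussed above.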

Query K-means Clustering and the Double Dixie Cup Problem

We consider the problem of approximate $K$-means clustering with outliers and side information provided by same-cluster queries and possibly noisy answers. Our solution shows that, under some mild assumptions on the smallest cluster size, one can obtain a $(1+\epsilon)$-approximation for the optimal potential with probability at least $1-\delta$, where $\epsilon>0$ and $\delta\in(0,1)$, using an expected number of $O(\frac{K^3}{\epsilon \delta})$ noiseless same-cluster queries and comparison-based clustering of complexity $O(ndK + \frac{K^3}{\epsilon \delta})$; here, $n$ denotes the number of points and $d$ the dimension of the space. Compared to a handful of other known approaches that perform importance sampling to account for small cluster sizes, the proposed query technique reduces the number of queries by a factor of roughly $O(\frac{K^6}{\epsilon^3})$, at the cost of possibly missing very small clusters. We extend this setting to the case where some queries to the oracle produce erroneous information, and where certain points, termed outliers, do not belong to any cluster. Our proof techniques differ from previous methods used for $K$-means clustering analysis, as they rely on estimating the sizes of the clusters and the number of points needed for accurate centroid estimation, and on subsequent nontrivial generalizations of the double Dixie cup problem. We illustrate the performance of the proposed algorithm on both synthetic and real datasets, including MNIST and CIFAR-10.

Training Neural Networks Using Features Replay

Training a neural network with the backpropagation algorithm requires passing error gradients sequentially through the network. This backward locking prevents us from updating network layers in parallel and fully leveraging the available computing resources. Recently, several works have tried to decouple and parallelize the backpropagation algorithm. However, all of them suffer from severe accuracy loss or memory explosion when the neural network is deep. To address these challenging issues, we propose a novel parallel-objective formulation for the objective function of the neural network. After that, we introduce the features replay algorithm and prove that it is guaranteed to converge to critical points of the non-convex problem under certain conditions. Finally, we apply our method to training deep convolutional neural networks, and the experimental results show that the proposed method achieves faster convergence, lower memory consumption, and better generalization error than the compared methods.

Modeling Dynamic Missingness of Implicit Feedback for Recommendation

Implicit feedback is widely used in collaborative filtering methods for recommendation. It is well known that implicit feedback contains a large number of values that are \emph{missing not at random} (MNAR); the missing data is a mixture of negative and unknown feedback, making it difficult to learn users' negative preferences. Recent studies modeled \emph{exposure}, a latent missingness variable which indicates whether an item is missing for a user, to give each missing entry a confidence of being negative feedback. However, these studies use static models and ignore the information in temporal dependencies among items, which seems to be an essential underlying factor of subsequent missingness. To model and exploit the dynamics of missingness, we propose a latent variable named ``\emph{user intent}'' to govern the temporal changes of item missingness, and a hidden Markov model to represent such a process. The resulting framework captures the dynamic item missingness and incorporates it into matrix factorization (MF) for recommendation. We also explore two types of constraints to achieve a more compact and interpretable representation of \emph{user intents}. Experiments on real-world datasets demonstrate the superiority of our method over state-of-the-art recommender systems.

Representation Learning of Compositional Data

We consider the problem of learning a low-dimensional representation for compositional data. Compositional data consists of a collection of nonnegative data that sum to a constant value. Since the parts of the collection are statistically dependent, many standard tools cannot be directly applied. Instead, compositional data must first be transformed before analysis. Focusing on principal component analysis (PCA), we propose an approach that allows low-dimensional representation learning directly from the original data. We show that our proposed loss function upper-bounds the exponential family PCA loss applied to transformed compositional data. A key tool in showing this relationship is a generalization of the scaled Bregman theorem that equates the perspective transform of the generator of a Bregman divergence to the Bregman divergence of the perspective transform and a conformal divergence. Our proposed surrogate loss has an easy-to-optimize form, and we also derive the corresponding form for nonlinear autoencoders. Experiments on simulated data and microbiome data show the promise of our method.

Model-based targeted dimensionality reduction for neuronal population data

Summarizing high-dimensional data using a small number of parameters is a ubiquitous first step in the analysis of neuronal population activity. Recently developed methods use "targeted" approaches that work by identifying multiple, distinct low-dimensional subspaces of activity that capture the population response to individual experimental task variables, such as the value of a presented stimulus or the behavior of the animal. These methods have gained attention because they decompose total neural activity into what are ostensibly different parts of a neuronal computation. However, existing targeted methods have been developed outside of the confines of probabilistic modeling, making some aspects of the procedures ad hoc, or limited in flexibility or interpretability. Here we propose a new model-based method for targeted dimensionality reduction based on a probabilistic generative model of the population response data. The low-dimensional structure of our model is expressed as a low-rank factorization of a linear regression model. We perform efficient inference using a combination of expectation maximization and direct maximization of the marginal likelihood. We also develop an efficient method for estimating the dimensionality of each subspace. We show that our approach outperforms alternative methods in both mean squared error of the parameter estimates, and in identifying the correct dimensionality of encoding using simulated data. We also show that our method provides more accurate inference of low-dimensional subspaces of activity than a competing algorithm, demixed PCA.
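
The low-rank regression structure above can be made concrete with classic reduced-rank regression, which the probabilistic model builds on; a hedged numpy sketch with made-up dimensions follows (OLS plus an SVD projection, not the paper's EM procedure).

```python
# Reduced-rank regression: constrain the regression map B to rank r, so each
# task variable is encoded in a low-dimensional subspace of population activity.
import numpy as np

rng = np.random.default_rng(1)
n, p, q, r = 200, 4, 50, 2                    # trials, task vars, neurons, rank
X = rng.normal(size=(n, p))
B_true = rng.normal(size=(p, r)) @ rng.normal(size=(r, q))
Y = X @ B_true + 0.1 * rng.normal(size=(n, q))

B_ols = np.linalg.lstsq(X, Y, rcond=None)[0]  # unconstrained fit
_, _, Vt = np.linalg.svd(X @ B_ols, full_matrices=False)
B_rr = B_ols @ (Vt[:r].T @ Vt[:r])            # project onto the top-r subspace
print(np.linalg.matrix_rank(B_rr))            # 2
```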

On gradient regularizers for MMD GANs

We propose a principled method for gradient-based regularization of the critic of GAN-like models trained by adversarially optimizing the kernel of a Maximum Mean Discrepancy (MMD). Our method is based on studying the behavior of the optimized MMD, and constrains the gradient based on analytical results rather than an optimization penalty. Experimental results show that the proposed regularization leads to stable training and outperforms state-of-the-art methods on image generation, including on 160 × 160 CelebA and 64 × 64 ImageNet.
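
For reference, the critic at the core of an MMD GAN is the kernel MMD between samples; a generic numpy sketch of the unbiased squared-MMD estimator with an RBF kernel follows. The bandwidth and data are placeholders, and neither the adversarial kernel optimization nor the proposed gradient constraint is shown.

```python
# Unbiased estimate of MMD^2 between two samples under an RBF kernel.
import numpy as np

def mmd2_unbiased(X, Y, bandwidth=1.0):
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * bandwidth ** 2))
    Kxx, Kyy, Kxy = k(X, X), k(Y, Y), k(X, Y)
    m, n = len(X), len(Y)
    np.fill_diagonal(Kxx, 0.0)                 # drop diagonals for unbiasedness
    np.fill_diagonal(Kyy, 0.0)
    return Kxx.sum() / (m * (m - 1)) + Kyy.sum() / (n * (n - 1)) - 2 * Kxy.mean()

rng = np.random.default_rng(0)
print(mmd2_unbiased(rng.normal(0, 1, (100, 2)), rng.normal(0.5, 1, (100, 2))))
```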

Heterogeneous Multi-output Gaussian Process Prediction

We present a novel extension of multi-output Gaussian processes for handling heterogeneous outputs. We assume that each output has its own likelihood function and use a vector-valued Gaussian process prior to jointly model the parameters in all likelihoods as latent functions. Our multi-output Gaussian process uses a covariance function with a linear model of coregionalisation form. Assuming conditional independence across the underlying latent functions together with an inducing variable framework, we are able to obtain tractable variational bounds amenable to stochastic variational inference. We illustrate the performance of the model on synthetic data and two real datasets: a human behavioral study and a demographic high-dimensional dataset.

Large-Scale Stochastic Sampling from the Probability Simplex

Stochastic gradient Markov chain Monte Carlo (SGMCMC) has become a popular method for scalable Bayesian inference. These methods are based on sampling a discrete-time approximation to a continuous-time process, such as the Langevin diffusion. When applied to distributions defined on a constrained space, such as the simplex, the time-discretisation error can dominate when we are near the boundary of the space. We demonstrate that while current SGMCMC methods for the simplex perform well in certain cases, they struggle with sparse simplex spaces, i.e., when many of the components are close to zero. However, most popular large-scale applications of Bayesian inference on simplex spaces, such as network or topic models, are sparse. We argue that this poor performance is due to the biases of SGMCMC caused by the discretisation error. To get around this, we propose the stochastic CIR process, which removes all discretisation error, and we prove that samples from the stochastic CIR process are asymptotically unbiased. Use of the stochastic CIR process within an SGMCMC algorithm is shown to give substantially better performance for a topic model and a Dirichlet process mixture model than existing SGMCMC approaches.
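
The property the method exploits is that the CIR process has a known exact transition law (a scaled noncentral chi-squared), so it can be simulated with no discretisation error. Below is a minimal sketch of exact transition sampling for one component; the parameter values are illustrative, and the paper embeds this step inside an SGMCMC scheme rather than running it in isolation.

```python
# Exact transition sampling for the CIR SDE
#   d theta = a (b - theta) dt + sigma * sqrt(theta) dW.
import numpy as np

def cir_exact_step(theta, a, b, sigma, h, rng):
    """Draw theta_{t+h} | theta_t exactly via a scaled noncentral chi-squared."""
    c = sigma**2 * (1 - np.exp(-a * h)) / (4 * a)
    df = 4 * a * b / sigma**2                  # degrees of freedom
    nc = theta * np.exp(-a * h) / c            # noncentrality
    return c * rng.noncentral_chisquare(df, nc)

rng = np.random.default_rng(0)
theta = 0.01                                   # start near the simplex boundary
for _ in range(1000):
    theta = cir_exact_step(theta, a=2.0, b=0.5, sigma=0.3, h=0.01, rng=rng)
print(theta)                                   # nonnegative by construction
```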

Policy Regret in Repeated Games

The notion of ``policy regret'' in online learning is supposed to capture the reactions of the adversary to the actions taken by the learner, which more traditional notions such as external regret do not take into account. We revisit this notion of policy regret, and first show that there are online learning settings in which policy regret and external regret are incompatible: any sequence of play which does well with respect to one must do poorly with respect to the other. We then focus on the game-theoretic setting, where the adversary is a self-interested agent. In this setting we show that external regret and policy regret are not in conflict, and in fact that a wide class of algorithms can ensure both as long as the adversary is also using such an algorithm. We also define a new notion of equilibrium which we call a ``policy equilibrium'', and show that no-policy-regret algorithms will have play which converges to such an equilibrium. Relating this back to external regret, we show that coarse correlated equilibria (which no-external-regret players will converge to) are a strict subset of policy equilibria. So in game-theoretic settings every sequence of play with no external regret also has no policy regret, but the converse is not true.

A Theory-Based Evaluation of Nearest Neighbor Models Put Into Practice

In the $k$-nearest neighborhood model ($k$-NN), we are given a set of points $P$, and we shall answer queries $q$ by returning the $k$ nearest neighbors of $q$ in $P$ according to some metric. This concept is crucial in many areas of data analysis and data processing, e.g., computer vision, document retrieval and machine learning. Many $k$-NN algorithms have been published and implemented, but often the relation between parameters and accuracy of the computed $k$-NN is not explicit. We study property testing of $k$-NN graphs in theory and evaluate it empirically: given a point set $P \subset \mathbb{R}^\delta$ and a directed graph $G=(P,E)$, is $G$ a $k$-NN graph, i.e., every point $p \in P$ has outgoing edges to its $k$ nearest neighbors, or is it $\epsilon$-far from being a $k$-NN graph? Here, $\epsilon$-far means that one has to change more than an $\epsilon$-fraction of the edges in order to make $G$ a $k$-NN graph. We develop a randomized algorithm with one-sided error that decides this question, i.e., a property tester for the $k$-NN property, with complexity $O(\sqrt{n} k^2 / \epsilon^2)$ measured in terms of the number of vertices and edges it inspects, and we prove a lower bound of $\Omega(\sqrt{n})$. We evaluate our tester empirically on the $k$-NN models computed by various algorithms and show that it can be used to detect $k$-NN models with bad accuracy in significantly less time than the building time of the $k$-NN model.
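
To make the tested property concrete, here is a naive spot-checking sketch: sample a few vertices and compare their out-edges against brute-force k-NN. This only illustrates the property; the actual tester is sublinear with a one-sided error guarantee, which this toy loop does not provide.

```python
# Spot-check whether a directed graph looks like a k-NN graph.
import numpy as np

def violates_knn(P, edges, v, k):
    """True if vertex v's out-edges are not its k nearest neighbors in P."""
    d = np.linalg.norm(P - P[v], axis=1)
    d[v] = np.inf
    return set(edges[v]) != set(np.argsort(d)[:k])

def spot_check(P, edges, k, n_samples, rng):
    sample = rng.choice(len(P), size=n_samples, replace=False)
    return any(violates_knn(P, edges, v, k) for v in sample)  # True => reject

rng = np.random.default_rng(0)
P, k = rng.normal(size=(500, 3)), 5
edges = [np.argsort(np.where(np.arange(len(P)) == v, np.inf,
                             np.linalg.norm(P - P[v], axis=1)))[:k]
         for v in range(len(P))]               # an exact k-NN graph
print(spot_check(P, edges, k, n_samples=20, rng=rng))   # False: no violation
```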

Banach Wasserstein GAN

Wasserstein Generative Adversarial Networks (WGANs) can be used to generate realistic samples from complicated image distributions. The Wasserstein metric used in WGANs is based on a notion of distance between individual images, which induces a notion of distance between probability distributions of images. So far the community has considered l^2 as the underlying distance. We generalize the theory of WGAN with gradient penalty to Banach spaces, allowing practitioners to select the features to emphasize in the generator. We further discuss the effect of some particular choices of underlying norms, focusing on Sobolev norms. Finally, we demonstrate the impact of the choice of norm on model performance and show state-of-the-art inception scores for non-progressive growing GANs on CIFAR-10.

Provable Gaussian Embedding with One Observation

Success of machine learning methods heavily relies on having an appropriate representation for the data at hand. Traditionally, machine learning approaches relied on user-defined heuristics to extract features encoding structural information about the data. However, recently there has been a surge in approaches that learn how to encode data automatically in a low dimensional space. Exponential family embedding provides a probabilistic framework for learning low-dimensional representations for various types of high-dimensional data. Though successful in practice, theoretical underpinnings for exponential family embeddings have not been established. In this paper, we study the Gaussian embedding model and develop the first theoretical results for exponential family embedding models. First, we show that, under mild conditions, the embedding structure can be learned from one observation by leveraging the parameter sharing between different contexts, even though the data are dependent on each other. Second, we study the properties of two algorithms used for learning the embedding structure and establish convergence results for each of them. The first algorithm is based on a convex relaxation, while the other solves the non-convex formulation of the problem directly. Experiments demonstrate the effectiveness of our approach.

BRITS: Bidirectional Recurrent Imputation for Time Series

Time series are widely used as signals in many classification/regression tasks, and it is ubiquitous for time series to contain missing values. Given multiple correlated time series, how can we fill in the missing values and predict their class labels? Existing imputation methods often impose strong assumptions on the underlying data-generating process, such as linear dynamics in the state space. In this paper, we propose BRITS, a novel method based on recurrent neural networks for missing value imputation in time series data. Our proposed method directly learns the missing values in a bidirectional recurrent dynamical system, without any specific assumption. The imputed values are treated as variables of the RNN graph and can be effectively updated during backpropagation. BRITS has three advantages: (a) it can handle multiple correlated missing values in time series; (b) it generalizes to time series with underlying nonlinear dynamics; (c) it provides a data-driven imputation procedure and applies to general settings with missing data. We evaluate our model on three real-world datasets, including an air quality dataset, a health-care dataset, and a human-activity localization dataset. Experiments show that our model outperforms the state-of-the-art methods in both imputation and classification/regression accuracy.

M-Walk: Learning to Walk in Graph with Monte Carlo Tree Search

Learning to walk over a graph towards a target node for a given input query and a source node is an important problem in applications such as knowledge base completion (KBC). It can be formulated as a reinforcement learning (RL) problem that has a known state transition model but a sparse reward. To overcome this challenge, we develop a graph-walking agent called M-Walk, which consists of a deep recurrent neural network (RNN) and Monte Carlo Tree Search (MCTS). The RNN encodes the state (i.e., the history of the walked path) and maps it into the Q-value, the policy and the state value. In order to effectively train the agent from sparse rewards, we combine MCTS with the neural policy to generate trajectories with more positive rewards. From these trajectories, we improve the network in an off-policy manner using Q-learning, which indirectly improves the RNN policy via parameter sharing. Our proposed RL algorithm repeatedly applies this policy improvement step to learn the entire model. At test time, MCTS is again combined with the neural policy to predict the target node. Experimental results on several graph-walking benchmarks show that we are able to learn better policies than other RL-based baseline methods, which are mainly based on policy gradients. It also outperforms traditional KBC baselines.

Extracting Relationships by Multi-Domain Matching

In many biological and medical contexts, we construct a large labeled corpus by aggregating many sources to use in target prediction tasks. Unfortunately, many of the sources may be irrelevant to our target task, so ignoring the structure of the dataset is detrimental. This work proposes a novel approach, the Multiple Domain Matching Network (MDMN), to exploit this structure. MDMN embeds all data into a shared feature space while learning which domains share strong statistical relationships. These relationships are often insightful in their own right, and they allow domains to share strength without interference from irrelevant data. This methodology builds on existing distribution-matching approaches by assuming that source domains are varied and outcomes multi-factorial. Therefore, each domain should only match a relevant subset. Theoretical analysis shows that the proposed approach can have a tighter generalization bound than existing multiple-domain adaptation approaches. Empirically, we show that the proposed methodology handles higher numbers of source domains (up to 21 empirically), and provides state-of-the-art performance on image, text, and multi-channel time series classification, including clinically relevant data of a novel treatment of Autism Spectrum Disorder.

Generative Probabilistic Novelty Detection with Adversarial Autoencoders

Novelty detection is the problem of identifying whether a new data point is considered to be an inlier or an outlier. We assume that training data is available to describe only the inlier distribution. Recent approaches primarily leverage deep encoder-decoder network architectures to compute a reconstruction error that is used to either compute a novelty score or to train a one-class classifier. While we too leverage a novel network of that kind, we take a probabilistic approach and effectively compute how likely it is that a sample was generated by the inlier distribution. We achieve this with two main contributions. First, we make the computation of the novelty probability feasible by linearizing the parameterized manifold capturing the underlying structure of the inlier distribution, and show how the probability factorizes and can be computed with respect to local coordinates of the manifold tangent space. Second, we improve the training of the autoencoder network. An extensive set of results shows that the approach achieves state-of-the-art results on several benchmark datasets.

Diminishing Returns Shape Constraints for Interpretability and Regularization

We investigate machine learning models that can provide diminishing returns and accelerating returns guarantees to capture prior knowledge or policies about how outputs should depend on inputs. Shape constraints are well-explored for one-dimensional models and generalized additive models (GAMs). We show that one can build flexible, nonlinear, multi-dimensional models using lattice functions with any combination of (ceteris paribus) concavity/convexity and monotonicity constraints on any subsets of features. We show better accuracy than shape-constrained GAMs, and more flexibility in shape constraint choice and training stability than for shape-constrained neural networks, which we also extend to handle the diminishing returns case. We demonstrate on real-world examples that additional shape constraints aid interpretability and can improve accuracy, especially when tuning-free regularization is useful.

Scalable Hyperparameter Transfer Learning

Bayesian optimization (BO) is a model-based approach for gradient-free black-box function optimization, such as hyperparameter optimization. Typically, BO relies on conventional Gaussian process (GP) regression, whose algorithmic complexity is cubic in the number of evaluations. As a result, GP-based BO cannot leverage large numbers of past function evaluations, for example, to warm-start related BO runs. We propose a multi-task adaptive Bayesian linear regression model for transfer learning in BO, whose complexity is linear in the number of function evaluations: one Bayesian linear regression model is associated with each black-box function optimization problem (or task), while transfer learning is achieved by coupling the models through a shared deep neural net. Experiments show that the neural net learns a representation suitable for warm-starting the black-box optimization problems and that BO runs can be accelerated when the target black-box function (e.g., validation loss) is learned together with other related signals (e.g., training loss). The proposed method was found to be at least one order of magnitude faster than methods recently published in the literature.

Stochastic Nonparametric Event-Tensor Decomposition

Tensor decompositions are fundamental tools for multiway data analysis. Existing approaches, however, either ignore the valuable temporal information accompanying the data or simply discretize it into time steps, so that important temporal patterns are easily missed. Moreover, most methods are limited to multilinear decomposition forms, and hence are unable to capture intricate, nonlinear relationships in data. To address these issues, we formulate event-tensors, to preserve the complete temporal information for multiway data, and propose a novel Bayesian nonparametric decomposition model. Our model can (1) fully exploit the time stamps to capture the critical, causal/triggering effects between the interaction events, (2) flexibly estimate the complex relationships between the entities in tensor modes, and (3) uncover hidden structures from their temporal interactions. For scalable inference, we develop a doubly stochastic variational Expectation-Maximization algorithm to conduct an online decomposition. Evaluations on both synthetic and real-world datasets show that our model not only improves upon the predictive performance of existing methods, but also discovers interesting clusters underlying the data.

Scaling Gaussian Process Regression with Derivatives

Gaussian processes (GPs) with derivatives are useful in many applications, including Bayesian optimization, implicit surface reconstruction, and terrain reconstruction. Fitting a GP to function values and derivatives at $n$ points in $d$ dimensions requires linear solves and log determinants with an ${n(d+1) \times n(d+1)}$ positive definite matrix -- leading to prohibitive $\mathcal{O}(n^3d^3)$ computations for standard direct methods. We propose $\mathcal{O}(nd)$ iterative solvers using fast matrix-vector multiplications (MVMs), together with pivoted Cholesky preconditioning that cuts the iterations to convergence by several orders of magnitude, allowing for fast kernel learning and prediction. Our approaches, together with dimensionality reduction, allow us to scale Bayesian optimization with derivatives to high-dimensional problems and large evaluation budgets.
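
To make the $n(d+1) \times n(d+1)$ object concrete, the sketch below assembles the joint covariance of function values and derivatives for $d = 1$ under an RBF kernel. This is a standard construction written with assumed variable names; the paper's contribution is solving with this matrix quickly, not forming it.

```python
# Joint covariance of (f(x), f'(x)) under an RBF kernel, shape (2n, 2n).
import numpy as np

def rbf_joint_kernel(x, ell=1.0):
    r = x[:, None] - x[None, :]
    K = np.exp(-r**2 / (2 * ell**2))
    K_fd = K * r / ell**2                        # cov(f(x_i), f'(x_j))
    K_dd = K * (1 / ell**2 - r**2 / ell**4)      # cov(f'(x_i), f'(x_j))
    return np.block([[K, K_fd], [-K_fd, K_dd]])  # cov(f'(x_i), f(x_j)) = -K_fd

x = np.linspace(0, 1, 5)
K = rbf_joint_kernel(x)
print(K.shape, np.linalg.eigvalsh(K).min() > -1e-9)   # (10, 10) True
```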

Differentially Private Testing of Identity and Closeness of Discrete Distributions

We study the fundamental problems of identity testing (goodness of fit) and closeness testing (two-sample testing) of distributions over $k$ elements, under differential privacy. While the problems have a long history in statistics, finite sample bounds for these problems have only been established recently. In this work, we derive upper and lower bounds on the sample complexity of both problems under $(\varepsilon, \delta)$-differential privacy. We provide algorithms with optimal sample complexity for the identity testing problem in all parameter ranges, and the first results for closeness testing. Our closeness testing bounds are optimal in the sparse regime where the number of samples is at most $k$. Our upper bounds are obtained by privatizing non-private estimators for these problems. The non-private estimators are chosen to have small sensitivity. We propose a general framework to establish lower bounds on the sample complexity of statistical tasks under differential privacy. We show a bound on differentially private algorithms in terms of a coupling between the two hypothesis classes we aim to test. By constructing carefully chosen priors over the hypothesis classes, and using Le Cam's two-point theorem, we provide a general mechanism for proving lower bounds. We believe that the framework can be used to obtain strong lower bounds for other statistical tasks under privacy.
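
The "privatize a low-sensitivity statistic" recipe can be sketched generically: add Laplace noise calibrated to the statistic's sensitivity before thresholding. The statistic and threshold below are toy placeholders, not the paper's optimal estimators.

```python
# eps-DP accept/reject given a statistic with known sensitivity.
import numpy as np

def private_test(stat, sensitivity, threshold, eps, rng):
    noisy = stat + rng.laplace(scale=sensitivity / eps)
    return noisy > threshold                  # True => reject H0

rng = np.random.default_rng(0)
n, q = 1000, np.ones(10) / 10                 # samples and null distribution
counts = rng.multinomial(n, q)
stat = 0.5 * np.abs(counts / n - q).sum()     # toy TV statistic
# swapping one sample moves two counts by 1, so sensitivity is 1/n
print(private_test(stat, sensitivity=1 / n, threshold=0.05, eps=1.0, rng=rng))
```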

Efficient Convex Completion of Coupled Tensors using Coupled Nuclear Norms

Coupled norms have emerged as a convex method to solve coupled tensor completion. A limitation of coupled norms is that they only induce low-rankness using the multilinear rank of coupled tensors. In this paper, we introduce a new set of coupled norms known as coupled nuclear norms, which constrain the CP rank of coupled tensors. We propose new coupled completion models using the coupled nuclear norms as regularizers, which can be optimized using computationally efficient optimization methods. We derive excess risk bounds for the proposed coupled completion models and show that the proposed norms lead to better performance. Through simulation and real-data experiments, we demonstrate that the proposed norms achieve better performance for coupled completion compared to existing coupled norms.

Maximizing Induced Cardinality Under a Determinantal Point Process

Determinantal point processes (DPPs) are well-suited to recommender systems where the goal is to generate collections of diverse, high-quality items. In the existing literature this is usually formulated as finding the mode of the DPP (the so-called MAP set). However, the MAP objective inherently assumes that the DPP models "optimal" recommendation sets, and yet obtaining such a DPP is nontrivial when there is no ready source of example optimal sets. In this paper we advocate an alternative framework for applying DPPs to recommender systems. Our approach assumes that the DPP simply models user engagements with recommended items, which is more consistent with how DPPs for recommender systems are typically trained. With this assumption, we are able to formulate a metric that measures the expected number of items that a user will engage with. We formalize the optimization of this metric as the Maximum Induced Cardinality (MIC) problem. Although the MIC objective is not submodular, we show that it can be approximated by a submodular function, and that empirically it is well-optimized by a greedy algorithm.
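
Assuming the induced-cardinality objective takes the form f(S) = tr(L_S (L_S + I)^{-1}) for a DPP kernel L (a reading of the setup above; consult the paper for the exact formulation), the greedy heuristic is only a few lines:

```python
# Greedy maximization of an induced-cardinality-style DPP objective.
import numpy as np

def induced_cardinality(L, S):
    Ls = L[np.ix_(S, S)]
    return np.trace(Ls @ np.linalg.inv(Ls + np.eye(len(S))))

def greedy_mic(L, budget):
    S = []
    for _ in range(budget):
        _, best = max((induced_cardinality(L, S + [i]), i)
                      for i in range(len(L)) if i not in S)
        S.append(best)
    return S

rng = np.random.default_rng(0)
B = rng.normal(size=(20, 5))
L = B @ B.T                                   # a PSD kernel over 20 items
print(greedy_mic(L, budget=5))
```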

Causal Inference with Noisy and Missing Covariates via Matrix Factorization

Valid causal inference in observational studies often requires controlling for confounders. However, in practice measurements of confounders may be noisy, and can lead to biased estimates of causal effects. We show that we can reduce the bias caused by measurement noise using a large number of noisy measurements of the underlying confounders. We propose the use of matrix factorization to infer the confounders from noisy covariates, a flexible and principled framework that adapts to missing values, accommodates a wide variety of data types, and can augment a wide variety of causal inference methods. We bound the error for the induced average treatment effect estimator and show it is consistent in a linear regression setting, using Exponential Family Matrix Completion preprocessing. We demonstrate the effectiveness of the proposed procedure in numerical experiments with both synthetic data and real clinical data.

rho-POMDPs have Lipschitz-Continuous epsilon-Optimal Value Functions

Many state-of-the-art algorithms for solving Partially Observable Markov Decision Processes (POMDPs) rely on turning the problem into a “fully observable” problem—a belief MDP—and exploiting the piece-wise linearity and convexity (PWLC) of the optimal value function in this new state space (the belief simplex ∆). This approach has been extended to solving ρ-POMDPs—i.e., for information-oriented criteria—when the reward ρ is convex in ∆. General ρ-POMDPs can also be turned into “fully observable” problems, but with no means to exploit the PWLC property. In this paper, we focus on POMDPs and ρ-POMDPs with λ_ρ-Lipschitz reward functions, and demonstrate that, for finite horizons, the optimal value function is Lipschitz-continuous. Then, value function approximators are proposed for both upper- and lower-bounding the optimal value function, which are shown to provide uniformly improvable bounds. This allows us to propose two algorithms derived from HSVI, which are empirically evaluated on various benchmark problems.

Online Structure Learning for Feed-Forward and Recurrent Sum-Product Networks

Sum-product networks have recently emerged as an attractive representation due to their dual view as a special type of deep neural network with clear semantics and a special type of probabilistic graphical model for which inference is always tractable. Those properties follow from some conditions (i.e., completeness and decomposability) that must be respected by the structure of the network. As a result, it is not easy to specify a valid sum-product network by hand and therefore structure learning techniques are typically used in practice. This paper describes a new online structure learning technique for feed-forward and recurrent SPNs. The algorithm is demonstrated on real-world datasets with continuous features for which it is not clear what network architecture might be best, including sequence datasets of varying length.

Uncertainty Sampling is Preconditioned Stochastic Gradient Descent on Zero-One Loss

Uncertainty sampling, a popular active learning algorithm, is used to reduce the amount of data required to learn a classifier, but it is observed in practice to converge to a variety of values depending on the initialization, and sometimes even to better values than random sampling converges to. In this work, we give a theoretical explanation of this phenomenon, showing that uncertainty sampling on a convex (e.g., logistic) loss can be interpreted as performing a preconditioned stochastic gradient step on the zero-one loss. Experiments on synthetic and real datasets support this connection.
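
A toy rendition of the algorithm being analyzed: repeatedly label the point the current logistic model is least certain about, then take a gradient step on the logistic loss at that point. The data and step size are illustrative; the preconditioning interpretation is the paper's result and is not visible in the code itself.

```python
# Uncertainty sampling with a logistic model on synthetic 2-D data.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = np.sign(X @ np.array([1.5, -1.0]) + 0.3 * rng.normal(size=500))

w, lr, labeled = np.zeros(2), 0.5, set()
for t in range(50):
    margin = np.abs(X @ w)
    margin[list(labeled)] = np.inf
    i = int(np.argmin(margin))                 # most uncertain unlabeled point
    labeled.add(i)
    p = 1 / (1 + np.exp(-y[i] * X[i] @ w))     # prob. of the correct label
    w += lr * (1 - p) * y[i] * X[i]            # logistic-loss gradient step
print(np.mean(np.sign(X @ w) == y))            # training accuracy
```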

A Probabilistic U-Net for Segmentation of Ambiguous Images

Many real-world vision problems suffer from inherent ambiguities. In clinical applications for example, it might not be clear from a CT scan alone which particular region is cancer tissue. Therefore a group of graders typically produces a set of diverse but plausible segmentations. We consider the task of learning a distribution over segmentations given an input. To this end we propose a generative segmentation model based on a combination of a U-Net with a conditional variational autoencoder that is capable of efficiently producing an unlimited number of plausible hypotheses. We show on a lung abnormalities segmentation task and on a Cityscapes segmentation task that our model reproduces the possible segmentation variants as well as the frequencies with which they occur, doing so significantly better than published approaches. These models could have a high impact in real-world applications, such as being used as clinical decision-making algorithms accounting for multiple plausible semantic segmentation hypotheses to provide possible diagnoses and recommend further actions to resolve the present ambiguities.

Unorganized Malicious Attacks Detection

Recommender systems have attracted much attention during the past decade. Many attack detection algorithms have been developed for better recommendations, mostly focusing on shilling attacks, where an attack organizer produces a large number of user profiles by the same strategy to promote or demote an item. This work considers a different attack style: unorganized malicious attacks, where attackers individually utilize a small number of user profiles to attack different items without any organizer. This attack style occurs in many real applications, yet it remains largely unstudied. We first formulate unorganized malicious attacks detection as a matrix completion problem, and propose the Unorganized Malicious Attacks detection (UMA) approach, based on the alternating splitting augmented Lagrangian method. We verify, both theoretically and empirically, the effectiveness of our proposed approach.

Causal Inference via Kernel Deviance Measures

Discovering the causal structure among a set of variables is a fundamental problem in many areas of science. In this paper, we propose Kernel Conditional Deviance for Causal Inference (KCDC), a fully nonparametric causal discovery method based on purely observational data. From a novel interpretation of the notion of asymmetry between cause and effect, we derive a corresponding asymmetry measure using the framework of reproducing kernel Hilbert spaces. Based on this, we propose three decision rules for causal discovery. We demonstrate the wide applicability of our method across a range of diverse synthetic datasets. Furthermore, we test our method on real-world time series data and the real-world benchmark dataset Tübingen Cause-Effect Pairs, where we outperform existing state-of-the-art methods.

Bayesian Alignments of Warped Multi-Output Gaussian Processes

We propose a novel Bayesian approach to modelling nonlinear alignments of time series based on latent shared information. We apply the method to the real-world problem of finding common structure in the sensor data of wind turbines introduced by the underlying latent and turbulent wind field. The proposed model allows for both arbitrary alignments of the inputs and non-parametric output warpings to transform the observations. This gives rise to multiple deep Gaussian process models connected via latent generating processes. We present an efficient variational approximation based on nested variational compression and show how the model can be used to extract shared information between dependent time series, recovering an interpretable functional decomposition of the learning problem. We show results for an artificial data set and real-world data of two wind turbines.

Hybrid Macro/Micro Level Backpropagation for Training Deep Spiking Neural Networks

Spiking neural networks (SNNs) are positioned to enable spatio-temporal information processing and ultra-low power event-driven neuromorphic hardware. However, SNNs have yet to reach the performance of conventional deep artificial neural networks (ANNs), a long-standing challenge due to the complex dynamics and non-differentiable spike events encountered in training. The existing SNN error backpropagation (BP) methods are limited in terms of scalability, lack of proper handling of spiking discontinuities, and/or mismatch between the rate-coded loss function and the computed gradient. We present a hybrid macro/micro level backpropagation algorithm (HM2-BP) for training multi-layer SNNs. The temporal effects are precisely captured by the proposed spike-train level post-synaptic potential (S-PSP) at the microscopic level. The rate-coded errors are defined at the macroscopic level, and computed and back-propagated across both macroscopic and microscopic levels. Different from existing BP methods, HM2-BP directly computes the gradient of the rate-coded loss function w.r.t. tunable parameters. We evaluate the proposed HM2-BP algorithm by training deep fully connected and convolutional SNNs based on the static MNIST [13] and dynamic neuromorphic N-MNIST [22] datasets. HM2-BP achieves an accuracy level of 99.49% and 98.88% for MNIST and N-MNIST, respectively, outperforming the best reported performances obtained from the existing SNN BP algorithms. It also achieves competitive performance, surpassing conventional deep learning models when dealing with asynchronous spiking streams.

Gen-Oja: Simple & Efficient Algorithm for Streaming Generalized Eigenvector Computation

In this paper, we study the problems of principal generalized eigenvector computation and canonical correlation analysis in the stochastic setting. We propose a simple and efficient algorithm, Gen-Oja, for these problems. We prove the global convergence of our algorithm, borrowing ideas from the theory of fast-mixing Markov chains and two-time-scale stochastic approximation, showing that it achieves the optimal rate of convergence. In the process, we develop tools for understanding stochastic processes with Markovian noise which might be of independent interest.

Efficient online algorithms for fast-rate regret bounds under sparsity

We consider the online convex optimization problem. In the setting of arbitrary sequences and a finite set of parameters, we establish a new fast-rate quantile regret bound. Then we investigate optimization over the $\ell_1$-ball by discretizing the parameter space. Our algorithm is projection-free, and we propose an efficient solution by restarting the algorithm on adaptive discretization grids. In the adversarial setting, we develop an algorithm that achieves several rates of convergence with different dependences on the sparsity of the objective. In the i.i.d. setting, we establish new risk bounds that are adaptive to the sparsity of the problem and to the regularity of the risk (ranging from a rate $1/\sqrt{T}$ for general convex risk to $1/T$ for strongly convex risk). These results generalize previous works on sparse online learning. They are obtained under a weak assumption on the risk (Łojasiewicz's assumption) that allows multiple optima, which is crucial when dealing with degenerate situations.

Predictive Uncertainty Estimation via Prior Networks

Estimating how uncertain an AI system is in its predictions is important for improving the safety of such systems. Uncertainty in predictions can result from uncertainty in model parameters, irreducible data uncertainty, and uncertainty due to distributional mismatch between the test and training data distributions. Different actions might be taken depending on the source of the uncertainty, so it is important to be able to distinguish between them. Recently, baseline tasks and metrics have been defined and several practical methods for estimating uncertainty have been developed. These methods, however, attempt to model uncertainty due to distributional mismatch either implicitly through model uncertainty or as data uncertainty. This work proposes a new framework for modeling predictive uncertainty called Prior Networks (PNs) which explicitly models distributional uncertainty. PNs do this by parameterizing a prior distribution over predictive distributions. This work focuses on uncertainty for classification and evaluates PNs on the tasks of identifying out-of-distribution (OOD) samples and detecting misclassification on the MNIST dataset, where they are found to outperform previous methods. Experiments on synthetic and MNIST data show that, unlike previous non-Bayesian methods, PNs are able to distinguish between data and distributional uncertainty.
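
The decomposition PNs expose can be sketched for a Dirichlet prior over the predictive simplex: total uncertainty splits into expected data uncertainty plus distributional uncertainty (their difference, a mutual information). The concentration values below are illustrative.

```python
# Uncertainty decomposition for a Dirichlet over categorical predictions.
import numpy as np
from scipy.special import digamma

def dirichlet_uncertainties(alpha):
    a0 = alpha.sum()
    mean = alpha / a0
    total = -(mean * np.log(mean)).sum()                           # H[E[pi]]
    data = -(mean * (digamma(alpha + 1) - digamma(a0 + 1))).sum()  # E[H[pi]]
    return total, data, total - data                 # MI = distributional part

print(dirichlet_uncertainties(np.array([100.0, 1.0, 1.0])))  # confident in-dist.
print(dirichlet_uncertainties(np.array([1.0, 1.0, 1.0])))    # flat: high MI (OOD)
```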

Dual Policy Iteration

Recently, a novel class of Approximate Policy Iteration (API) algorithms has demonstrated impressive practical performance (e.g., ExIt from [1], AlphaGo-Zero from [2]). This new family of algorithms maintains, and alternately optimizes, two policies: a fast, reactive policy (e.g., a deep neural network) deployed at test time, and a slow, non-reactive policy (e.g., Tree Search) that can plan multiple steps ahead. The reactive policy is updated under supervision from the non-reactive policy, while the non-reactive policy is improved with guidance from the reactive policy. In this work we study this Dual Policy Iteration (DPI) strategy in an alternating optimization framework and provide a convergence analysis that extends existing API theory. We also develop a special instance of this framework which reduces the update of non-reactive policies to model-based optimal control using learned local models, and provides a theoretically sound way of unifying model-free and model-based RL approaches with unknown dynamics. We demonstrate the efficacy of our approach on various continuous control Markov Decision Processes.

A probabilistic population code based on neural samples

Sensory processing is often characterized as implementing probabilistic inference: networks of neurons compute posterior beliefs over unobserved causes given the sensory inputs. How these beliefs are computed and represented by neural responses is much-debated (Fiser et al. 2010, Pouget et al. 2013). A central debate concerns the question of whether neural responses represent samples of latent variables (Hoyer & Hyvärinen 2003) or parameters of their distributions (Ma et al. 2006) with efforts being made to distinguish between them (Grabska-Barwinska et al. 2013). A separate debate addresses the question of whether neural responses are proportionally related to the encoded probabilities (Barlow 1969), or proportional to the logarithm of those probabilities (Jazayeri & Movshon 2006, Beck et al. 2006, Beck et al. 2012). Here, we show that these alternatives -- contrary to common assumptions -- are not mutually exclusive and that the very same system can be compatible with all of them. As a central analytical result, we show that modeling neural responses in area V1 as samples from a posterior distribution over latents in a linear Gaussian model of the image implies that those neural responses form a probabilistic population code (PPC, Beck et al. 2006). In particular, the posterior distribution over some experimenter-defined variable like ``orientation'' is part of the exponential family with sufficient statistics that are linear in the neural sampling-based firing rates.

Manifold-tiling Localized Receptive Fields are Optimal in Similarity-preserving Neural Networks

Many neurons in the brain, such as place cells in the rodent hippocampus, have localized receptive fields, i.e., they respond to a small neighborhood of stimulus space. What is the functional significance of such representations and how can they arise? Here, we propose that localized receptive fields emerge in similarity-preserving networks of rectifying neurons that learn low-dimensional manifolds populated by sensory inputs. Numerical simulations of such networks on standard datasets yield manifold-tiling localized receptive fields. More generally, we show analytically that, for data lying on symmetric manifolds, optimal solutions of objectives, from which similarity-preserving networks are derived, have localized receptive fields. Therefore, nonnegative similarity-preserving mapping (NSM) implemented by neural networks can model representations of continuous manifolds in the brain.

On the Convergence and Robustness of Training GANs with Regularized Optimal Transport

Generative Adversarial Networks (GANs) are one of the most practical methods for learning data distributions. A popular GAN formulation is based on the use of the Wasserstein distance as a metric between probability distributions. Unfortunately, minimizing the Wasserstein distance between the data distribution and the generative model distribution is a computationally challenging problem, as its objective is non-convex, non-smooth, and even hard to compute. In this work, we show that obtaining gradient information for the smoothed Wasserstein GAN formulation, which is based on regularized Optimal Transport (OT), is computationally effortless, and hence one can apply first-order optimization methods to minimize this objective. Consequently, we establish a theoretical convergence guarantee to stationarity for a proposed class of GAN optimization algorithms. Unlike the original non-smooth formulation, our algorithm only requires solving the discriminator to approximate optimality. We apply our method to learning MNIST digits as well as CIFAR-10 images. Our experiments show that our method is computationally efficient and generates images comparable to those of state-of-the-art algorithms given the same architecture and computational power.
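
The standard computational route to the regularized OT objective is Sinkhorn iteration; a minimal numpy sketch on empirical measures follows. It is generic (with an assumed regularization strength) and omits the GAN training loop built on top of it.

```python
# Entropic-regularized optimal transport via Sinkhorn iterations.
import numpy as np

def sinkhorn(C, a, b, reg=0.1, n_iter=200):
    """Entropic OT plan between histograms a, b with cost matrix C."""
    K = np.exp(-C / reg)
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]
    return P, (P * C).sum()                    # plan and transport cost

rng = np.random.default_rng(0)
x, y = rng.normal(0, 1, (50, 2)), rng.normal(1, 1, (60, 2))
C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
P, cost = sinkhorn(C, np.ones(50) / 50, np.ones(60) / 60)
print(round(cost, 3), round(P.sum(), 3))       # cost estimate; total mass ~ 1
```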

Model-Agnostic Private Learning

We design differentially private learning algorithms that are agnostic to the learning model, assuming access to a limited amount of unlabeled public data. First, we give a new differentially private algorithm for answering a sequence of $m$ online classification queries (given by a sequence of $m$ unlabeled public feature vectors) based on a private training set. Our private algorithm follows the paradigm of subsample-and-aggregate, in which any generic non-private learner is trained on disjoint subsets of the private training set, and then, for each classification query, the votes of the resulting ensemble of classifiers are aggregated in a differentially private fashion. Our private aggregation is based on a novel combination of the distance-to-instability framework \cite{ST13} and the sparse-vector technique \cite{DNRRV09,HR10}. We show that our algorithm makes a conservative use of the privacy budget. In particular, if the underlying non-private learner yields classification error at most $\alpha\in (0, 1)$, then our construction answers more queries than what is implied by a straightforward application of the advanced composition theorems from differential privacy, by at least a factor of $1/\alpha$ in some cases. Next, we apply the knowledge transfer technique to construct a private learner that outputs a classifier, which can be used to answer an unlimited number of queries. In the (agnostic) PAC model, we analyze our construction and prove upper bounds on the sample complexity for both the realizable and the non-realizable cases. As in the non-private setting, our sample complexity bounds are completely characterized by the VC dimension of the concept class.
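
A bare-bones sketch of subsample-and-aggregate with a noisy vote: train teachers on disjoint shards and answer a query with a Laplace-noised majority. The paper's aggregation (distance-to-instability plus the sparse-vector technique) spends the privacy budget far more carefully than this plain mechanism; the shard data and threshold "teachers" here are toys.

```python
# Subsample-and-aggregate with a noisy argmax over teacher votes.
import numpy as np

def noisy_vote(teachers, x, n_classes, eps, rng):
    votes = np.zeros(n_classes)
    for t in teachers:
        votes[t(x)] += 1
    return int(np.argmax(votes + rng.laplace(scale=2 / eps, size=n_classes)))

rng = np.random.default_rng(0)
shards = [rng.normal(loc=c, size=100) for c in (0.0, 0.1, -0.1, 0.05)]
teachers = [lambda x, m=s.mean(): int(x > m) for s in shards]  # toy rules
print(noisy_vote(teachers, x=0.2, n_classes=2, eps=1.0, rng=rng))
```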

Constrained Generation of Semantically Valid Graphs via Regularizing Variational Autoencoders

Deep generative models have achieved remarkable success in various data domains, including images, time series, and natural languages. There remain, however, substantial challenges for combinatorial structures, including graphs. One of the key challenges lies in the difficulty of ensuring semantic validity in context. For example, in molecular graphs, the number of bonding-electron pairs must not exceed the valence of an atom, whereas in protein interaction networks, two proteins may be connected only when they belong to the same or correlated gene ontology terms. These constraints are not easy to incorporate into a generative model. In this work, we propose a regularization framework for variational autoencoders as a step toward semantic validity. We focus on the matrix representation of graphs and formulate penalty terms that regularize the output distribution of the decoder to encourage the satisfaction of validity constraints. Experimental results confirm a much higher likelihood of sampling valid graphs in our approach, compared with others reported in the literature.

Provably Correct Automatic Sub-Differentiation for Qualified Programs

The \emph{Cheap Gradient Principle}~\citep{Griewank:2008:EDP:1455489} --- the computational cost of computing a $d$-dimensional vector of partial derivatives of a scalar function is nearly the same (often within a factor of $5$) as that of simply computing the scalar function itself --- is of central importance in optimization; it allows us to quickly obtain (high-dimensional) gradients of scalar loss functions which are subsequently used in black box gradient-based optimization procedures. The current state of affairs is markedly different with regards to computing sub-derivatives: widely used ML libraries, including TensorFlow and PyTorch, do \emph{not} correctly compute (generalized) sub-derivatives even on simple differentiable examples. This work considers the question: is there a \emph{Cheap Sub-gradient Principle}? Our main result shows that, under certain restrictions on our library of non-smooth functions (standard in non-linear programming), provably correct generalized sub-derivatives can be computed at a computational cost that is within a (dimension-free) factor of $6$ of the cost of computing the scalar function itself.
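
A small PyTorch example in the spirit of the failures discussed: f(x) = relu(x) - relu(-x) equals x everywhere, so its derivative at 0 is 1, yet reverse-mode AD with the usual relu'(0) = 0 convention reports 0, which is not a valid (generalized) subderivative.

```python
# AD returns an incorrect subderivative for a function that is plain x.
import torch

x = torch.tensor(0.0, requires_grad=True)
f = torch.relu(x) - torch.relu(-x)   # identically equal to x, so f'(0) = 1
f.backward()
print(f.item(), x.grad.item())       # 0.0 0.0  <- AD reports 0, not 1
```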

Deep Homogeneous Mixture Models: Representation, Separation, and Approximation

At their core, many unsupervised learning models provide a compact representation of homogeneous density mixtures, but their similarities and differences are not always clearly understood. In this work, we formally establish the relationships among latent tree graphical models (including special cases such as hidden Markov models and tensorial mixture models), hierarchical tensor formats and sum-product networks. Based on this connection, we then give a unified treatment of exponential separation in \emph{exact} representation size between deep mixture architectures and shallow ones. In contrast, for \emph{approximate} representation, we show that the conditional gradient algorithm can approximate any homogeneous mixture within $\epsilon$ accuracy by combining $O(1/\epsilon^2)$ ``shallow'' architectures, where the hidden constant may decrease (exponentially) with respect to the depth. Our experiments on both synthetic and real datasets confirm the benefits of depth in density estimation.

Parameters as interacting particles: asymptotic scaling, convexity, and error of neural networks

The performance of neural networks on high-dimensional data distributions suggests that it may be possible to parameterize a representation of a \emph{given} high-dimensional function with controllably small errors, potentially outperforming standard interpolation methods. We demonstrate, both theoretically and numerically, that this is indeed the case. We map the parameters of a neural network to a system of particles relaxing with an interaction potential determined by the loss function. We show that in the limit that the number of parameters $n$ is large, the landscape of the mean-squared error becomes convex and the representation error in the function scales as $O(n^{-1})$. As a consequence, we rederive the universal approximation theorem for neural networks, but we additionally prove that the optimal representation can be achieved through stochastic gradient descent, the algorithm ubiquitously used for parameter optimization in machine learning. In the asymptotic regime, we study the fluctuations around the optimal representation and show that they arise at a scale $O(n^{-1})$, for suitable choices of the batch size. These fluctuations in the landscape demonstrate the necessity of the noise inherent in stochastic gradient descent and our analysis provides a precise scale for tuning this noise. Our results apply to both single and multi-layer neural networks, as well as standard kernel methods like radial basis functions. From our insights, we extract several practical guidelines for large scale applications of neural networks, emphasizing the importance of both noise and quenching, in particular.

Multitask Reinforcement Learning for Zero-shot Generalization with Subtask Dependencies

We introduce a new RL problem where the agent is required to execute a given subtask graph which describes a set of subtasks and their dependencies. Unlike existing approaches that explicitly describe what the agent should do, our problem only describes properties of subtasks and relationships among them, which requires the agent to perform complex reasoning to find the optimal subtask to execute. To solve this problem, we propose a neural subtask graph solver (NSS) which encodes the subtask graph using a recursive neural network. To overcome the difficulty of training, we propose a novel non-parametric gradient-based policy to pre-train our NSS agent and further finetune it with an actor-critic method. The experimental results on two 2D visual domains show that our agent can perform complex reasoning to find the optimal way of executing the subtask graph and generalizes well to unseen subtask graphs. In addition, we compare our agent with a Monte-Carlo tree search (MCTS) method, showing that our method is much more efficient than MCTS, and that the performance of NSS can be further improved by combining it with MCTS.

A Simple Unified Framework for Detecting Out-of-Distribution Samples and Adversarial Attacks

Detecting test samples drawn sufficiently far away from the training distribution, statistically or adversarially, is a fundamental requirement for deploying a good classifier in many real-world machine learning applications. However, deep neural networks with the softmax classifier are known to produce highly overconfident posterior distributions even for such abnormal samples. In this paper, we propose a simple yet effective method for detecting any abnormal samples, which is applicable to any pre-trained softmax neural classifier. We obtain the class-conditional Gaussian distributions with respect to (low- and upper-level) features of the deep models under Gaussian discriminant analysis, which result in a confidence score based on the Mahalanobis distance. While most prior methods have been evaluated for detecting either out-of-distribution or adversarial samples, but not both, the proposed method achieves state-of-the-art performance for both cases in our experiments. Moreover, we found that our proposed method is more robust in extreme cases, e.g., when the training dataset has noisy labels or a small number of samples. Finally, we show that the proposed method enjoys broader usage by applying it to class-incremental learning: whenever out-of-distribution samples are detected, our classification rule can incorporate new classes well without further training of the deep models.
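
The score itself is easy to sketch on raw features: fit one Gaussian per class with a shared covariance (Gaussian discriminant analysis) and score a test point by its smallest Mahalanobis distance to a class mean. In the paper this is applied to intermediate features of a pretrained network; everything below is a toy stand-in.

```python
# Mahalanobis confidence score from class-conditional Gaussians (tied covariance).
import numpy as np

def fit_gda(X, y, n_classes):
    means = np.stack([X[y == c].mean(0) for c in range(n_classes)])
    Xc = np.concatenate([X[y == c] - means[c] for c in range(n_classes)])
    prec = np.linalg.inv(Xc.T @ Xc / len(X) + 1e-6 * np.eye(X.shape[1]))
    return means, prec

def confidence(x, means, prec):
    return -min((x - m) @ prec @ (x - m) for m in means)  # high => in-dist.

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0, 1, (200, 4)), rng.normal(3, 1, (200, 4))])
y = np.array([0] * 200 + [1] * 200)
means, prec = fit_gda(X, y, 2)
print(confidence(np.zeros(4), means, prec))        # near class 0: high score
print(confidence(10 * np.ones(4), means, prec))    # far from both: low score
```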

End-to-End Differentiable Physics for Learning and Control

We present a differentiable physics engine that can be integrated as a module in deep neural networks for end-to-end learning. As a result, structured physics knowledge can be embedded into larger systems, allowing them, for example, to match observations by performing precise simulations while achieving high sample efficiency. Specifically, in this paper we demonstrate how to perform backpropagation analytically through a physical simulator defined via a linear complementarity problem. Unlike with traditional finite-difference methods, such gradients are computed analytically, which allows for greater flexibility of the engine. Through experiments in diverse domains, we highlight the system's ability to learn physical parameters from data, efficiently match and simulate observed visual behavior, and readily enable control via gradient-based planning methods.

BRUNO: A Deep Recurrent Model for Exchangeable Data

We present a novel model architecture which leverages deep learning tools to perform exact Bayesian inference on sets of high dimensional, complex observations. Our model is provably exchangeable, meaning that the joint distribution over observations is invariant under permutation: this property lies at the heart of Bayesian inference. The model does not require variational approximations to train, and new samples can be generated conditional on previous samples, with cost linear in the size of the conditioning set. The advantages of our architecture are demonstrated on learning tasks that require generalisation from short observed sequences while modelling sequence variability, such as conditional image generation, few-shot learning, and anomaly detection.

Stimulus domain transfer in recurrent models for large scale cortical population prediction on video

To better understand the representations in visual cortex, we need to generate better predictions of neural activity in awake animals presented with their ecological input: natural video. Despite recent advances in models for static images, models for predicting responses to natural video are scarce, and standard linear-nonlinear models perform poorly. We developed a new deep recurrent network architecture that predicts inferred spiking activity of thousands of mouse V1 neurons simultaneously recorded with two-photon microscopy, while accounting for confounding factors such as the animal's gaze position and brain state changes related to running state and pupil dilation. Powerful system identification models provide an opportunity to gain insight into cortical functions through {\em in silico} experiments that can subsequently be tested in the brain. However, in many cases this approach requires that the model is able to generalize to stimulus statistics that it was not trained on, such as band-limited noise and other parameterized stimuli. We investigated these domain transfer properties in our model and found that our model trained on natural images is able to correctly predict the orientation tuning of neurons in response to artificial noise stimuli. Finally, we show that we can fully generalize from movies to noise and maintain high predictive performance on both by fine-tuning only the final layer's weights of a network otherwise trained on natural movies. The converse, however, is not true.

Mapping Images to Scene Graphs with Permutation-Invariant Structured Prediction

Machine understanding of complex images is a key goal of artificial intelligence. One challenge underlying this task is that visual scenes contain multiple inter-related objects, and that global context plays an important role in interpreting the scene. A natural modeling framework for capturing such effects is structured prediction, which optimizes over complex labels while modeling within-label interactions. However, it is unclear what principles should guide the design of a structured prediction model that utilizes the power of deep learning components. Here we propose a design principle for such architectures that follows from a natural requirement of permutation invariance. We prove a necessary and sufficient characterization for architectures that follow this invariance, and discuss its implication on model design. Finally, we show that the resulting model achieves new state-of-the-art results on the Visual Genome scene graph labeling benchmark, outperforming all recent approaches.

Distributed Multi-Player Bandits - a Game of Thrones Approach

We consider a multi-armed bandit game where N players compete for K arms for T turns. Each player has different expected rewards for the arms, and the instantaneous rewards are independent and identically distributed. Performance is measured using the expected sum of regrets, compared to the optimal assignment of arms to players. We assume that each player only knows her actions and the reward she received each turn. Players cannot observe the actions of other players, and no communication between players is possible. We present a distributed algorithm and prove that it achieves an expected sum of regrets of near-$O(\log^2 T)$. This is the first algorithm to achieve a poly-logarithmic regret in this fully distributed scenario. All other works have assumed that either all players have the same vector of expected rewards or that communication between players is possible.

Efficient Loss-Based Decoding On Graphs For Extreme Classification

In extreme classification problems, learning algorithms are required to map instances to labels from an extremely large label set. We build on a recent extreme classification framework with logarithmic time and space (LTLS), and on a general approach for error-correcting output coding (ECOC) with loss-based decoding, and introduce a flexible and efficient approach accompanied by theoretical bounds. Our framework employs output codes induced by graphs, for which we show how to perform efficient loss-based decoding to potentially improve accuracy. In addition, our framework offers a tradeoff between accuracy, model size and prediction time. We show how to find the sweet spot of this tradeoff using only the training data. Our experimental study demonstrates the validity of our assumptions and claims, and shows that our method is competitive with state-of-the-art algorithms.

Chaining Mutual Information and Tightening Generalization Bounds

Bounding the generalization error of learning algorithms has a long history, yet it falls short of explaining various generalization successes, including those of deep learning. Two important difficulties are (i) exploiting the dependencies between hypotheses, and (ii) exploiting the dependencies between the algorithm's input and output. Progress on the first point was made with the chaining method, used in the VC dimension bound. More recently, progress on the second point was made with the mutual information method by Russo and Zou '15. Yet, these two methods are currently disjoint. In this paper, we introduce a technique to combine the chaining and mutual information methods, to obtain a generalization bound that is both algorithm-dependent and that exploits the dependencies between hypotheses. We provide an example in which our bound significantly outperforms both the chaining and the mutual information methods. As a corollary, we tighten Dudley's inequality under the knowledge that a learning algorithm chooses its output from a small subset of hypotheses with high probability, an assumption motivated by the performance of SGD discussed in Zhang et al.\ '17.

Implicit Probabilistic Integrators for ODEs

We introduce a family of implicit probabilistic integrators for initial value problems (IVPs) taking as a starting point the multistep Adams--Moulton method. The implicit construction allows for dynamic feedback from the forthcoming time-step, by contrast with previous probabilistic integrators, all of which are based on explicit methods. We begin with a concise survey of the rapidly-expanding field of probabilistic ODE solvers. We then introduce our method, which builds on and adapts the work of Conrad et al. (2016) and Teymur et al. (2016), and provide a rigorous proof of its well-definedness and convergence. We discuss the problem of the calibration of such integrators and suggest one approach. We give an illustrative example highlighting the effect of the use of probabilistic integrators -- including our new method -- in the setting of parameter inference within an inverse problem.

Learning Attentional Communication for Multi-Agent Cooperation

Communication could potentially be an effective way for multi-agent cooperation. However, information sharing among all agents or in predefined communication architectures that existing methods adopt can be problematic. When there is a large number of agents, agents can hardly differentiate valuable information that helps cooperative decision making from globally shared information. Therefore, communication barely helps, and could even impair the learning of multi-agent cooperation. Predefined communication architectures, on the other hand, restrict communication among agents and thus restrain potential cooperation. To tackle these difficulties, in this paper, we propose an attentional communication model that learns when communication is needed and how to integrate shared information for cooperative decision making. Our model leads to efficient and effective communication for large-scale multi-agent cooperation. Empirically, we show the strength of our model in various cooperative scenarios, where agents are able to develop more coordinated and sophisticated strategies than existing methods.

Training Deep Models Faster with Robust, Approximate Importance Sampling

In theory, importance sampling speeds up stochastic gradient algorithms for supervised learning by prioritizing training instances. In practice, the cost of computing importances greatly limits the impact of importance sampling. We propose a robust, approximate importance sampling procedure (RAIS) for stochastic gradient descent. By approximating the ideal sampling distribution using robust optimization, RAIS provides much of the benefit of exact importance sampling with drastically reduced overhead. Empirically, we find RAIS-SGD and standard SGD follow similar learning curves, but RAIS moves faster through these paths.
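
The unbiased-reweighting idea behind importance-sampled SGD can be sketched on least squares as below; note the sampling scores are recomputed exactly at every step, which is precisely the overhead a practical scheme like RAIS must approximate away. Names and step sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 10
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

w, lr, batch = np.zeros(d), 0.05, 32
for step in range(300):
    # per-example sampling scores: a gradient-norm proxy
    scores = np.abs(X @ w - y) * np.linalg.norm(X, axis=1) + 1e-8
    p = scores / scores.sum()
    idx = rng.choice(n, size=batch, p=p)
    # reweight each sampled gradient by 1 / (n * p_i) to keep the estimate unbiased
    coef = (X[idx] @ w - y[idx]) / (n * p[idx])
    w -= lr * (coef[:, None] * X[idx]).mean(axis=0)

print(np.linalg.norm(w - w_true))   # should be small
```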

Bandit Learning with Implicit Feedback

Implicit feedback, such as user clicks, although abundant in online information service systems, does not provide substantial evidence of users' evaluation of the system's output. Such incomplete supervision inevitably misleads model estimation, especially in a bandit learning setting where the feedback is acquired on the fly. In this work, we study a contextual bandit problem with implicit feedback by modeling the feedback as a composition of user result examination and relevance judgment. Since users' examination behavior is unobserved, we introduce latent variables to model it. We perform Thompson sampling on top of variational Bayesian inference for arm selection and model update. A rigorous upper regret bound analysis of the proposed algorithm proves the feasibility of learning from implicit feedback, and extensive empirical evaluations on click logs collected from a major MOOC platform further demonstrate its learning effectiveness in practice.

Unsupervised Text Style Transfer using Language Models as Discriminators

Binary classifiers are employed as discriminators in GAN-based unsupervised style transfer models to ensure that transferred sentences are similar to sentences in the target domain. One difficulty with the binary discriminator is that the error signal is sometimes insufficient to train the model to produce rich-structured language. In this paper, we propose a technique of using a target domain language model as the discriminator to provide richer, token-level feedback during the learning process. Because our language model scores sentences directly using a product of locally normalized probabilities, it offers a more stable and more useful training signal to the generator. We train the generator to minimize the negative log likelihood (NLL) of generated sentences evaluated by a language model. By using a continuous approximation of the discrete samples, our model can be trained using back-propagation in an end-to-end way. Moreover, we find empirically that, with a language model as a structured discriminator, it is possible to eliminate the adversarial training steps using negative samples, thus making training more stable. We compare our model with previous work using convolutional neural networks (CNNs) as discriminators and show that our model outperforms them significantly in three tasks including word substitution decipherment, sentiment modification and related language translation.

Relational recurrent neural networks

Memory-based neural networks model temporal data by leveraging an ability to remember information for long periods. It is unclear, however, whether they also have an ability to perform complex relational reasoning with the information they remember. Here, we first confirm our intuitions that standard memory architectures may struggle at tasks that heavily involve an understanding of the ways in which entities are connected -- i.e., tasks involving relational reasoning. We then improve upon these deficits by using a new memory module -- a Relational Memory Core (RMC) -- which employs multi-head dot product attention to allow memories to interact. Finally, we test the RMC on a suite of tasks that may profit from more capable relational reasoning across sequential information, and show large gains in RL domains (BoxWorld & Mini PacMan), program evaluation, and language modeling, achieving state-of-the-art results on the WikiText-103, Project Gutenberg, and GigaWord datasets.
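
The memory-interaction primitive the RMC builds on, multi-head dot-product attention applied across memory slots, can be sketched in a few lines of numpy (gating and the recurrent wrapper are omitted; shapes and head counts are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
slots, dim, heads = 8, 64, 4
hd = dim // heads
M = rng.normal(size=(slots, dim))                     # memory matrix

Wq, Wk, Wv = (rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(3))

def split_heads(A):
    return A.reshape(slots, heads, hd).transpose(1, 0, 2)   # (heads, slots, hd)

Q, K, V = split_heads(M @ Wq), split_heads(M @ Wk), split_heads(M @ Wv)
attn = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(hd))      # (heads, slots, slots)
out = (attn @ V).transpose(1, 0, 2).reshape(slots, dim)     # slots updated jointly
```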

Streaming Kernel PCA with $\tilde{O}(\sqrt{n})$ Random Features

We study the statistical and computational aspects of kernel principal component analysis using random Fourier features and show that under mild assumptions, $O(\sqrt{n} \log n)$ features suffice to achieve $O(1/\epsilon^2)$ sample complexity. Furthermore, we give a memory-efficient streaming algorithm based on Oja's classical algorithm that achieves this rate.
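
A minimal sketch of the two ingredients, assuming a Gaussian kernel: random Fourier features approximate the kernel map, and Oja's rule tracks the top principal component of the featurized stream (sizes and step sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 5, 200                          # input dim, number of random features
W = rng.normal(size=(m, d))            # frequencies for a unit-bandwidth Gaussian kernel
b = rng.uniform(0, 2 * np.pi, size=m)

def rff(x):
    # feature map z(x) with z(x) . z(y) ~ exp(-||x - y||^2 / 2)
    return np.sqrt(2.0 / m) * np.cos(W @ x + b)

v = rng.normal(size=m)
v /= np.linalg.norm(v)
for t in range(1, 10001):
    z = rff(rng.normal(size=d))        # one sample from the stream
    v += (1.0 / t) * z * (z @ v)       # Oja's update toward the top eigenvector
    v /= np.linalg.norm(v)             # keep the iterate on the unit sphere
# v now approximates the top principal direction of the featurized data
```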

Exploring Sparse Features in Deep Reinforcement Learning towards Fast Disease Diagnosis

This paper proposes a policy gradient method with two techniques, {\em reward shaping} and {\em reconstruction}, to improve the performance of online symptom checking. Reward shaping can guide the search towards better directions. Reconstruction can guide the agent to learn correlations between features. Together, they can find symptom queries that can yield positive responses from a patient with high probability. Consequently, using these techniques a symptom checker can obtain much improved diagnoses.

Bayesian Model-Agnostic Meta-Learning

Learning to infer a Bayesian posterior from a few-shot dataset is an important step towards robust meta-learning due to the model uncertainty inherent in the problem. In this paper, we propose a novel Bayesian model-agnostic meta-learning method. The proposed method combines gradient-based meta-learning with nonparametric variational inference in a principled probabilistic framework. During fast adaptation, the method is capable of learning complex uncertainty structure beyond a point estimate or a simple Gaussian approximation. In addition, a robust Bayesian meta-update mechanism with a new meta-loss prevents overfitting during meta-update. Remaining an efficient gradient-based meta-learner, the method is also model-agnostic and simple to implement. Experimental results show the accuracy and robustness of the proposed method in various tasks: sinusoidal regression, image classification, active learning, and reinforcement learning.

Disconnected Manifold Learning for Generative Adversarial Networks

Real images often lie on a union of disjoint manifolds rather than one globally connected manifold, and this can cause several difficulties for the training of common Generative Adversarial Networks (GANs). In this work, we first show that single generator GANs are unable to correctly model a distribution supported on a disconnected manifold, and investigate how sample quality, mode collapse and local convergence are affected by this. Next, we show how using a collection of generators can address this problem, providing new insights into the success of such multi-generator GANs. Finally, we explain the serious issues caused by considering a fixed prior over the collection of generators and propose a novel approach for learning the prior and inferring the necessary number of generators without any supervision. Our proposed modifications can be applied on top of any other GAN model to enable learning of distributions supported on disconnected manifolds. We conduct several experiments to illustrate the aforementioned shortcoming of GANs, its consequences in practice, and the effectiveness of our proposed modifications in alleviating these issues.

Unsupervised Cross-Modal Alignment of Speech and Text Embedding Spaces

Recent research has shown that word embedding spaces learned from text corpora of different languages can be aligned without any parallel data supervision. Inspired by the success in unsupervised cross-lingual word embeddings, in this paper we target learning a cross-modal alignment between the embedding spaces of speech and text learned from corpora of their respective modalities in an unsupervised fashion. The proposed framework learns the individual speech and text embedding spaces, and attempts to align the two spaces via adversarial training, followed by a refinement procedure. We show how our framework could be used to perform the tasks of spoken word classification and translation, and the experimental results on these two tasks demonstrate that the performance of our unsupervised alignment approach is comparable to its supervised counterpart. Our framework is especially useful for developing automatic speech recognition (ASR) and speech-to-text translation systems for low- or zero-resource languages, which have little parallel audio-text data for training modern supervised ASR and speech-to-text translation models, but account for the majority of the languages spoken across the world.

Learning Signed Determinantal Point Processes through the Principal Minor Assignment Problem

Symmetric determinantal point processes (DPPs) are a class of probabilistic models that encode the random selection of items that have a repulsive behavior. They have attracted a lot of attention in machine learning, where returning diverse sets of items is sought for. Sampling and learning these symmetric DPPs are fairly well understood. In this work, we consider a new class of DPPs, which we call signed DPPs, where we relax the repulsive behavior and allow some pairs of items to attract each other. We set the ground for learning signed DPPs through a method of moments, by solving the so-called principal minor assignment problem for a class of matrices $K$ that satisfy $K_{i,j}=\pm K_{j,i}$, $i\neq j$, in polynomial time.

Out-of-Distribution Detection using Multiple Semantic Label Representations

Deep Neural Networks are powerful models that have attained remarkable results on a variety of tasks. These models are shown to be extremely efficient when training and test data are drawn from the same distribution. However, it is not clear how a network will act when it is fed with an out-of-distribution example. In this work, we consider the problem of out-of-distribution detection in neural networks. We propose to use multiple semantic dense representations instead of a sparse representation as the target label. Specifically, we propose to use several word representations obtained from different corpora or architectures as target labels. We evaluated the proposed model on computer vision and speech command detection tasks and compared it to previous methods. Results suggest that our method compares favorably with previous work. In addition, we demonstrate the efficiency of our approach in detecting wrongly classified and adversarial examples.

Stochastic Chebyshev Gradient Descent for Spectral Optimization

A large class of machine learning techniques requires the solution of optimization problems involving spectral functions of parametric matrices, e.g., the log-determinant and the nuclear norm. Unfortunately, computing the gradient of a spectral function is generally of cubic complexity, so gradient descent methods are rather expensive for optimizing objectives involving the spectral function. Thus, one naturally turns to stochastic gradient methods in hope that they will provide a way to reduce or altogether avoid the computation of full gradients. However, here a new challenge appears: there is no straightforward way to compute unbiased stochastic gradients for spectral functions. In this paper, we develop unbiased stochastic gradients for spectral-sums, an important subclass of spectral functions. Our unbiased stochastic gradients are based on combining randomized trace estimators with stochastic truncation of the Chebyshev expansions. A careful design of the truncation distribution allows us to offer distributions that are variance-optimal, which is crucial for fast and stable convergence of stochastic gradient methods. We further leverage our proposed stochastic gradients to devise stochastic methods for objective functions involving spectral-sums, and rigorously analyze their convergence rate. The utility of our methods is demonstrated in numerical experiments.
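
The two building blocks, a Hutchinson trace estimator combined with a Chebyshev expansion, can be sketched for the spectral-sum tr(log(A)) as below. For brevity the expansion degree is fixed, so a small truncation bias remains, whereas the paper's randomly truncated expansion removes it; the spectrum bounds are assumed known:

```python
import numpy as np

rng = np.random.default_rng(0)
n, deg, probes = 100, 30, 20
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))
eigs = rng.uniform(0.5, 2.0, size=n)
A = (Q * eigs) @ Q.T                          # SPD matrix with known spectrum

a, b = 0.4, 2.1                               # assumed bounds on the spectrum
B = (2 * A - (a + b) * np.eye(n)) / (b - a)   # spectrum mapped into [-1, 1]

# Chebyshev coefficients of t -> log(((b - a) t + a + b) / 2) on [-1, 1]
N = deg + 1
theta = np.pi * (np.arange(N) + 0.5) / N
fvals = np.log(((b - a) * np.cos(theta) + a + b) / 2)
c = np.array([2.0 / N * np.sum(fvals * np.cos(k * theta)) for k in range(N)])
c[0] /= 2

est = 0.0
for _ in range(probes):
    v = rng.choice([-1.0, 1.0], size=n)       # Rademacher probe (Hutchinson)
    t_prev, t_cur = v, B @ v                  # T_0(B) v and T_1(B) v
    acc = c[0] * (v @ t_prev) + c[1] * (v @ t_cur)
    for k in range(2, N):
        t_prev, t_cur = t_cur, 2 * (B @ t_cur) - t_prev   # Chebyshev recurrence
        acc += c[k] * (v @ t_cur)
    est += acc / probes

print(est, np.sum(np.log(eigs)))              # estimate vs exact tr(log(A))
```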

Revisiting $(\epsilon, \gamma, \tau)$-similarity learning for domain adaptation

Similarity learning is an active research area in machine learning that tackles the problem of finding a similarity function tailored to an observable data sample in order to achieve efficient classification. This learning scenario has been generally formalized by means of an $(\epsilon, \gamma, \tau)$-good similarity learning framework in the context of supervised classification and has been shown to have strong theoretical guarantees. In this paper, we propose to extend the theoretical analysis of similarity learning to the domain adaptation setting, a particular situation occurring when the similarity is learned and then deployed on samples following different probability distributions. We give a new definition of an $(\epsilon, \gamma)$-good similarity for domain adaptation and prove several results quantifying the performance of a similarity function on a target domain after it has been trained on a source domain. We particularly show that if the source domain support contains that of the target, then principally new domain adaptation learning bounds can be proved.

How to tell when a clustering is (approximately) correct using convex relaxations

We introduce a generic method to obtain guarantees of near-optimality and uniqueness (up to small perturbations) for a clustering. This method can be instantiated for a variety of clustering loss functions for which convex relaxations exist. Obtaining the guarantees amounts to solving a convex optimization problem. We demonstrate the practical relevance of this method by obtaining distribution-free guarantees for the K-means clustering problem on realistic data sets. The guarantees do not depend on any distributional assumptions, but they depend on the data set at hand. They exist only when the data is clusterable.

Constant Regret, Generalized Mixability, and Mirror Descent

We consider the setting of prediction with expert advice; a learner makes predictions by aggregating those of a group of experts. Under this setting, and for the right choice of loss function and ``mixing'' algorithm, it is possible for the learner to achieve a constant regret regardless of the number of prediction rounds. For example, a constant regret can be achieved for \emph{mixable} losses using the \emph{aggregating algorithm}. The \emph{Generalized Aggregating Algorithm} (GAA) is a name for a family of algorithms parameterized by convex functions on simplices (entropies), which reduce to the aggregating algorithm when using the \emph{Shannon entropy}. For a given entropy $\Phi$, losses for which constant regret is possible using the GAA are called $\Phi$-mixable. Which losses are $\Phi$-mixable was previously left as an open question. We fully characterize $\Phi$-mixability and answer other open questions posed by \cite{Reid2015}. Additionally, by leveraging the connection between the \emph{mirror descent algorithm} and the update step of the GAA, we suggest a new \emph{adaptive} generalized aggregating algorithm and analyze its performance in terms of the regret bound.
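
For the special case of the Shannon entropy and the log loss, the aggregating algorithm reduces to exponentially weighted averaging, which a few lines of numpy can demonstrate on synthetic experts; the printed regret stays below $\ln N$ regardless of the number of rounds:

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, T = 5, 1000
w = np.ones(n_experts) / n_experts            # posterior-style weights
loss_learner, loss_experts = 0.0, np.zeros(n_experts)

for t in range(T):
    p = rng.uniform(0.05, 0.95, size=n_experts)   # experts' forecasts of P(y=1)
    y = rng.integers(0, 2)                        # binary outcome
    p_hat = w @ p                                 # learner's mixture forecast
    losses = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    loss_learner += -(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))
    loss_experts += losses
    w *= np.exp(-losses)                          # aggregating update, eta = 1
    w /= w.sum()

print(loss_learner - loss_experts.min())          # regret, always < ln(n_experts)
```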

A Bayesian Approach to Generative Adversarial Imitation Learning

Generative adversarial training for imitation learning has shown promising results on high-dimensional and continuous control tasks. This paradigm is based on reducing the imitation learning problem to the density matching problem, where the agent iteratively refines the policy to match the empirical state-action visitation frequency of the expert demonstration. Although this approach has been shown to robustly learn to imitate even from scarce demonstrations, one must still address the inherent challenge that collecting trajectory samples in each iteration is a costly operation. To address this issue, we first propose a Bayesian formulation of generative adversarial imitation learning (GAIL), where the imitation policy and the cost function are represented as stochastic neural networks. Then, we show that we can significantly enhance the sample efficiency of GAIL by leveraging the predictive density of the cost, on an extensive set of imitation learning tasks with high-dimensional states and actions.

Constrained Cross-Entropy Method for Safe Reinforcement Learning

We study a safe reinforcement learning problem in which the constraints are defined as the expected cost over finite-length trajectories. We propose a constrained cross-entropy-based method to solve this problem. The method explicitly tracks its performance with respect to constraint satisfaction and thus is well-suited for safety-critical applications. We show that the asymptotic behavior of the proposed algorithm can be almost-surely described by that of an ordinary differential equation. Then we give sufficient conditions on the properties of this differential equation to guarantee the convergence of the proposed algorithm. Finally, we show with simulation experiments that the proposed algorithm can effectively learn feasible policies without assumptions on the feasibility of initial policies, even with non-Markovian objective functions and constraint functions.
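
A minimal sketch of a constrained cross-entropy loop on a toy problem: candidates are ranked on constraint violation first and objective second, a simplification of the paper's performance-tracking rule, with all functions and sizes illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
def reward(x): return -np.sum(x ** 2, axis=1)     # maximize: peak at the origin
def cost(x):   return 1.0 - x[:, 0]               # feasible iff x_0 >= 1

mu, sigma = np.zeros(2), 2.0 * np.ones(2)
n, elite = 200, 20
for _ in range(50):
    x = mu + sigma * rng.normal(size=(n, 2))
    c, r = cost(x), reward(x)
    feasible = c <= 0.0
    if feasible.sum() >= elite:
        # enough feasible candidates: keep the best-reward feasible ones
        idx = np.where(feasible)[0][np.argsort(-r[feasible])][:elite]
    else:
        # otherwise prioritize driving the constraint down
        idx = np.argsort(c)[:elite]
    mu, sigma = x[idx].mean(axis=0), x[idx].std(axis=0) + 1e-6

print(mu)   # approaches the constrained optimum near (1, 0)
```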

Multi-Agent Generative Adversarial Imitation Learning

Imitation learning algorithms can be used to learn a policy from expert demonstrations without access to a reward signal. However, most existing approaches are not applicable in multi-agent settings due to the existence of multiple (Nash) equilibria and non-stationary environments. We propose a new framework for multi-agent imitation learning for general Markov games, where we build upon a generalized notion of inverse reinforcement learning. We further introduce a practical multi-agent actor-critic algorithm with good empirical performance. Our method can be used to imitate complex behaviors in high-dimensional environments with multiple cooperative or competing agents.

Adaptive Learning with Unknown Information Flows

An agent facing sequential decisions that are characterized by partial feedback needs to strike a balance between maximizing immediate payoffs based on available information, and acquiring new information that may be essential for maximizing future payoffs. This trade-off is captured by the multi-armed bandit (MAB) framework that has been studied and applied when at each time epoch payoff observations are collected on the actions that are selected at that epoch. In this paper we introduce a new, generalized MAB formulation in which additional information on each arm may appear arbitrarily throughout the decision horizon, and study the impact of such information flows on the achievable performance and the design of efficient decision-making policies. By obtaining matching lower and upper bounds, we characterize the (regret) complexity of this family of MAB problems as a function of the information flows. We introduce an adaptive exploration policy that, without any prior knowledge of the information arrival process, attains the best performance (in terms of regret rate) that is achievable when the information arrival process is a priori known. Our policy uses dynamically customized virtual time indexes to endogenously control the exploration rate based on the realized information arrival process.
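
The effect of auxiliary information flows can be illustrated with a UCB-style policy that simply folds free observations into the per-arm statistics; the paper's policy instead uses virtual time indexes, so this sketch only conveys why extra information reduces the exploration needed:

```python
import numpy as np

rng = np.random.default_rng(0)
means, T = np.array([0.3, 0.5, 0.7]), 5000
counts, sums = np.zeros(3), np.zeros(3)

for t in range(1, T + 1):
    # auxiliary information flow: occasionally a free observation of a random arm
    if rng.random() < 0.1:
        k = rng.integers(3)
        counts[k] += 1
        sums[k] += rng.binomial(1, means[k])
    # standard UCB index, computed from *all* observations gathered so far
    ucb = np.where(counts > 0,
                   sums / np.maximum(counts, 1)
                   + np.sqrt(2 * np.log(t) / np.maximum(counts, 1)),
                   np.inf)
    a = int(np.argmax(ucb))
    counts[a] += 1
    sums[a] += rng.binomial(1, means[a])

print(counts)   # pulls concentrate on the best arm
```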

Forecasting Treatment Responses Over Time Using Recurrent Marginal Structural Networks

Electronic health records provide a rich source of data for machine learning methods to learn dynamic treatment responses over time. However, any direct estimation is hampered by the presence of time-dependent confounding, where actions taken are dependent on time-varying variables related to the outcome of interest. Drawing inspiration from marginal structural models, a class of methods in epidemiology which use propensity weighting to adjust for time-dependent confounders, we introduce the Recurrent Marginal Structural Network -- a sequence-to-sequence architecture for forecasting a patient's expected response to a series of planned treatments. Using simulations of a state-of-the-art pharmacokinetic-pharmacodynamic (PK-PD) model of tumor growth, we demonstrate the ability of our network to accurately learn unbiased treatment responses from observational data, exhibiting robustness to changes in the policy of treatment assignments, and performance gains over benchmarks.

Generative modeling for protein structures

Analyzing the structure and function of proteins is a key part of understanding biology at the molecular and cellular level. In addition, a major engineering challenge is to design new proteins in a principled and methodical way. Current computational modeling methods for protein design are slow and often require human oversight and intervention. Here, we apply Generative Adversarial Networks (GANs) to the task of generating protein structures, toward application in fast de novo protein design. We encode protein structures in terms of pairwise distances between alpha-carbons on the protein backbone, which eliminates the need for the generative model to learn translational and rotational symmetries. We then introduce a convex formulation of corruption-robust 3D structure recovery to fold the protein structures from generated pairwise distance maps, and solve these problems using the Alternating Direction Method of Multipliers. We test the effectiveness of our models by predicting completions of corrupted protein structures and show that the method is capable of quickly producing biochemically viable solutions.

Inference in Deep Gaussian Processes using Stochastic Gradient Hamiltonian Monte Carlo

Deep Gaussian Processes (DGPs) are hierarchical generalizations of Gaussian Processes that combine well calibrated uncertainty estimates with the high flexibility of multilayer models. One of the biggest challenges with these models is that exact inference is intractable. The current state-of-the-art inference method, Variational Inference (VI), employs a Gaussian approximation to the posterior distribution. This can be a potentially poor unimodal approximation of the generally multimodal posterior. In this work, we provide evidence for the non-Gaussian nature of the posterior and we apply the Stochastic Gradient Hamiltonian Monte Carlo method to directly sample from it. To efficiently optimize the hyperparameters, we introduce the Moving Window MCEM algorithm. This results in significantly better predictions at a lower computational cost than its VI counterpart. Thus our method establishes a new state-of-the-art for inference in DGPs.
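
A minimal sketch of the SGHMC update itself on a toy bimodal 1-D target (the DGP model and the Moving Window MCEM step are far beyond a snippet); the friction, step size, and injected gradient noise are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_log_post(theta):
    # toy bimodal target: equal mixture of N(-2, 1) and N(2, 1), unnormalized
    p1 = np.exp(-0.5 * (theta + 2) ** 2)
    p2 = np.exp(-0.5 * (theta - 2) ** 2)
    return (-(theta + 2) * p1 - (theta - 2) * p2) / (p1 + p2)

eps, friction = 0.05, 0.5                 # step size and momentum decay
theta, momentum, samples = 0.0, 0.0, []
for _ in range(20000):
    g = grad_log_post(theta) + 0.1 * rng.normal()   # stand-in for minibatch noise
    momentum = ((1 - friction) * momentum + eps * g
                + np.sqrt(2 * friction * eps) * rng.normal())
    theta += momentum
    samples.append(theta)

print(np.mean(np.array(samples) > 0))     # both modes visited
```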

Knowledge Distillation by On-the-Fly Native Ensemble

Knowledge distillation is effective for training small, generalisable network models that meet low-memory and fast-execution requirements. Existing offline distillation methods rely on a strong pre-trained teacher, which enables favourable knowledge discovery and transfer but requires a complex two-phase training stage. Online alternatives address this limitation at the price of lacking a high-capacity teacher. In this work, we present an On-the-fly Native Ensemble (ONE) strategy for one-stage online distillation. Specifically, ONE trains only a single multi-branch network while simultaneously establishing a strong teacher on-the-fly to supervise the target network. Extensive evaluations show that ONE improves the generalisation performance of a variety of deep neural networks more significantly than alternative methods on four image classification datasets: CIFAR10, CIFAR100, SVHN, and ImageNet, whilst having computational efficiency advantages.

Non-Adversarial Mapping with VAEs

The study of cross-domain mapping without supervision has recently attracted much attention. Much of the recent progress was enabled by the use of adversarial training as well as cycle constraints. The practical difficulty of adversarial training motivates research into non-adversarial methods. In a recent paper, Hoshen and Wolf showed that cross-domain mapping is possible without the use of cycles or GANs. Although promising, their approach suffers from several drawbacks, including costly inference and an optimization variable for every training example, preventing their method from using large training sets. We present an alternative approach which is able to achieve non-adversarial mapping using a novel form of Variational Auto-Encoder. Our method is much faster at inference time, is able to leverage large datasets and has a simple interpretation.

Generalisation in humans and deep neural networks

We compare the robustness of humans and current convolutional deep neural networks (DNNs) on object recognition under twelve different types of image degradations. First, using three well known DNNs (ResNet-152, VGG-19, GoogLeNet) we find the human visual system to be more robust to nearly all of the tested image manipulations, and we observe progressively diverging classification error-patterns between humans and DNNs when the signal gets weaker. Secondly, we show that DNNs trained directly on distorted images consistently surpass human performance on the exact distortion types they were trained on, yet they display extremely poor generalisation abilities when tested on other distortion types. For example, training on salt-and-pepper noise does not imply robustness on uniform white noise and vice versa. Thus, changes in the noise distribution between training and testing constitute a crucial challenge to deep learning vision systems that can be systematically addressed in a lifelong machine learning approach. Our new dataset, consisting of 83K carefully measured human psychophysical trials, provides a useful reference for lifelong robustness against image degradations set by the human visual system.

Towards Text Generation with Adversarially Learned Neural Outlines

Recent progress in deep generative models has been fueled by two paradigms -- autoregressive and adversarial models. We propose a combination of both approaches with the goal of learning generative models of text. Our method first produces a high-level outline and then generates words sequentially, conditioning on both the outline and the previous output. We generate outlines with an adversarial model trained to approximate the distribution of sentences in a latent space induced by general-purpose sentence encoders. This provides strong, informative conditioning for the autoregressive stage. Our qualitative results show that this generative procedure yields natural-looking sentences and interpolations. Quantitative results suggest that conditioning information from generated outlines effectively guides the autoregressive model to produce realistic samples even at high temperatures with multinomial sampling.

cpSGD: Communication-efficient and differentially-private distributed SGD

Distributed stochastic gradient descent is an important subroutine in distributed learning. A setting of particular interest is when the clients are mobile devices, where two important concerns are communication efficiency and the privacy of the clients. Several recent works have focused on reducing the communication cost or introducing privacy guarantees, but none of the proposed communication efficient methods are known to be privacy preserving and none of the known privacy mechanisms are known to be communication efficient. To this end, we study algorithms that achieve both communication efficiency and differential privacy. For $d$ variables and $n \approx d$ clients, the proposed method uses $O(\log \log(nd))$ bits of communication per client per coordinate and ensures constant privacy. We also improve previous analysis of the \emph{Binomial mechanism} showing that it achieves nearly the same utility as the Gaussian mechanism, while requiring fewer representation bits, which can be of independent interest.
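
The two mechanisms being combined, unbiased stochastic quantization and additive Binomial noise, can be sketched as below; the noise and grid parameters needed for a given $(\epsilon, \delta)$ guarantee come from the paper's analysis and are not derived here, so all constants are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize_and_privatize(x, levels=16, lo=-1.0, hi=1.0, binom_n=32):
    # unbiased stochastic quantization onto a grid of `levels` integers
    scaled = (np.clip(x, lo, hi) - lo) / (hi - lo) * (levels - 1)
    low = np.floor(scaled)
    q = low + (rng.random(x.shape) < (scaled - low))        # round up w.p. frac
    # centred Binomial(n, 1/2) noise plays the role of discrete Gaussian noise
    noise = rng.binomial(binom_n, 0.5, size=x.shape) - binom_n / 2
    return q + noise                                        # small integers to send

def dequantize(msg, levels=16, lo=-1.0, hi=1.0):
    return msg / (levels - 1) * (hi - lo) + lo

g = 0.3 * rng.normal(size=10)                    # a client's gradient vector
print(dequantize(quantize_and_privatize(g)))     # noisy but cheap-to-send estimate
```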

Blackbox Matrix×Matrix Gaussian Process Inference

Despite numerous advances in scalable models, the inference tools used for Gaussian processes (GPs) have yet to fully capitalize on recent trends in machine learning hardware. In this paper, we present an efficient and general approach to GP inference based on Blackbox Matrix-Matrix multiplication (BBMM). BBMM uses a modified batched version of the conjugate gradients algorithm to derive all terms required for training and inference in a single call. Adapting this algorithm to complex models simply requires a routine for efficient matrix-matrix multiplication with the kernel and its derivative. In addition, BBMM utilizes a specialized preconditioner that substantially speeds up convergence. In experiments, we show that BBMM efficiently utilizes GPU hardware, speeding up GP inference by an order of magnitude on a variety of popular GP models compared to existing approaches.
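
At the core of BBMM is a conjugate gradients iteration run on several right-hand sides at once, touching the kernel matrix only through matrix-matrix products. A minimal unpreconditioned sketch (the preconditioner and the extra terms BBMM extracts from the same iteration are omitted):

```python
import numpy as np

def batched_cg(matmul, B, iters=100, tol=1e-10):
    # solve K X = B for all columns of B at once, using only K @ (matrix)
    X = np.zeros_like(B)
    R = B - matmul(X)                     # residuals, one column per system
    P = R.copy()
    rs = np.sum(R * R, axis=0)
    for _ in range(iters):
        KP = matmul(P)
        alpha = rs / np.sum(P * KP, axis=0)
        X += alpha * P
        R -= alpha * KP
        rs_new = np.sum(R * R, axis=0)
        if np.all(rs_new < tol ** 2):
            break
        P = R + (rs_new / rs) * P
        rs = rs_new
    return X

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 50))
K = A @ A.T + 50 * np.eye(50)             # a well-conditioned SPD "kernel"
B = rng.normal(size=(50, 4))              # several right-hand sides
X = batched_cg(lambda M: K @ M, B)
print(np.linalg.norm(K @ X - B))          # residual should be ~0
```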

Diffusion Maps for Textual Network Embedding

Textual network embedding leverages rich text information associated with the network to learn low-dimensional vectorial representations of vertices. Rather than using typical natural language processing (NLP) approaches, recent research exploits the relationship of texts on the same edge to graphically embed text. However, these models neglect to measure the complete level of connectivity between any two texts in the graph. We present diffusion maps for textual network embedding (DMTE), integrating global structural information of the graph to capture the semantic relatedness between texts, with a diffusion-convolution operation applied on the text inputs. In addition, a new objective function is designed to efficiently preserve the high-order proximity using the graph diffusion. Experimental results show that the proposed approach outperforms state-of-the-art methods on the vertex-classification and link-prediction tasks.

Edward2: Simple, Dynamic, Accelerated

We describe Edward2, a probabilistic programming language (PPL) extending Edward. Edward2 distills the core of Edward down to a single abstraction—the random variable—while expanding its feature set. By blurring the line between modeling and computation, Edward2 enables numerous applications not possible in existing PPLs: the grammar VAE; learning to learn by variational inference by gradient descent; and GPU-accelerated NUTS. In a benchmark on VAEs, Edward2 sees a 5x speedup running on TPUs compared to GPU. In a benchmark on NUTS, Edward2 sees a 20x speedup over Stan and 7x over PyMC3.

VideoCapsuleNet: A Simplified Network for Action Detection

The recent advances in Deep Convolutional Neural Networks (DCNNs) have shown extremely good results for video human action classification; however, action detection is still a challenging problem. The current action detection approaches follow a complex pipeline which involves multiple tasks such as tube proposals, optical flow, and tube classification. In this work, we present a more elegant solution for action detection based on the recently developed capsule network. We propose a 3D capsule network for videos, called VideoCapsuleNet: a unified network for action detection which can jointly perform pixel-wise action segmentation along with action classification. The proposed network is a generalization of capsule network from 2D to 3D, which takes a sequence of video frames as input. The 3D generalization drastically increases the number of capsules in the network, making capsule routing computationally expensive. We introduce capsule-pooling in the convolutional capsule layer to address this issue, which makes the voting algorithm tractable. The routing-by-agreement in the network inherently models the action representations, and various action characteristics are captured by the predicted capsules. This inspired us to utilize the capsules for action localization, and the class-specific capsules predicted by the network are used to determine a pixel-wise localization of actions. The localization is further improved by parameterized skip connections with the convolutional capsule layers, and the network is trained end-to-end with a classification as well as a localization loss.

Rectangular Bounding Process

Stochastic partition models divide a multi-dimensional space into a number of rectangular regions, such that the data within each region exhibit certain types of homogeneity. Due to the nature of their partition strategy, existing partition models may create many unnecessary divisions in sparse regions when trying to describe data in dense regions. To avoid this problem we introduce a new parsimonious partition model -- the Rectangular Bounding Process (RBP) -- to efficiently partition multi-dimensional spaces, by employing a bounding strategy to enclose data points within rectangular bounding boxes. Unlike existing approaches, the RBP possesses several attractive theoretical properties that make it a powerful nonparametric partition prior on a hypercube. In particular, the RBP is self-consistent and as such can be directly extended from a finite hypercube to infinite (unbounded) space. We apply the RBP to regression trees and relational models as a flexible partition prior. The experimental results validate the merit of the RBP in rich yet parsimonious expressiveness compared to the state-of-the-art methods.

Improved Algorithms for Collaborative PAC Learning

We study a recent model of collaborative PAC learning where $k$ players with $k$ different tasks collaborate to learn a single classifier that works for all tasks. Previous work showed that when there is a classifier that has very small errors on all tasks, there is a collaborative algorithm that finds a single classifier for all tasks and it uses $O((\ln (k))^2)$ times the sample complexity to learn a single task. In this work, we design new algorithms for both the realizable and the non-realizable settings using only $O(\ln (k))$ times the sample complexity to learn a single task. The sample complexity upper bounds of our algorithms match previous lower bounds and in some range of parameters, are even better than previous algorithms that are allowed to output different classifiers for different tasks.

Sparse Attentive Backtracking: Temporal Credit Assignment Through Reminding

Learning long-term dependencies in extended temporal sequences requires credit assignment to events far back in the past. The most common method for training recurrent neural networks, back-propagation through time (BPTT), requires credit information to be propagated backwards through every single step of the forward computation, potentially over thousands or millions of time steps. This becomes computationally expensive or even infeasible when used with long sequences. Importantly, biological brains are unlikely to perform such detailed reverse replay over very long sequences of internal states (consider days, months, or years). However, humans are often reminded of past memories or mental states which are associated with the current mental state. We consider the hypothesis that such memory associations between past and present could be used for credit assignment through arbitrarily long sequences, propagating the credit assigned to the current state to the associated past state. Based on this principle, we study a novel algorithm which only back-propagates through a few of these temporal skip connections, realized by a learned attention mechanism that associates current states with relevant past states. We demonstrate in experiments that our method matches or outperforms regular BPTT and truncated BPTT in tasks involving particularly long-term dependencies, but without requiring the biologically implausible backward replay through the whole history of states. Additionally, we demonstrate that the proposed method transfers to longer sequences significantly better than LSTMs trained with BPTT and LSTMs trained with full self-attention.

Communication Compression for Decentralized Training

Optimizing distributed learning systems is an art of balancing between computation and communication. There have been two lines of research that try to deal with slower networks: {\em communication compression} for low bandwidth networks, and {\em decentralization} for high latency networks. In this paper, we explore a natural question: {\em can the combination of both techniques lead to a system that is robust to both bandwidth and latency?} Although the system implication of such a combination is trivial, the underlying theoretical principle and algorithm design are challenging: unlike centralized algorithms, simply compressing exchanged information, even in an unbiased stochastic way, within the decentralized network would accumulate error and cause divergence. In this paper, we develop a framework of quantized, decentralized training and propose two different strategies, which we call {\em extrapolation compression} and {\em difference compression}. We analyze both algorithms and prove both converge at the rate of $O(1/\sqrt{nT})$, where $n$ is the number of workers and $T$ is the number of iterations, matching the convergence rate for full precision, centralized training. We validate our algorithms and find that our proposed algorithm outperforms the best of merely decentralized and merely quantized algorithms significantly for networks with {\em both} high latency and low bandwidth.

Depth-Limited Solving for Imperfect-Information Games

A fundamental challenge in imperfect-information games is that states do not have well-defined values. As a result, depth-limited search algorithms used in single-agent settings and perfect-information games do not apply. This paper introduces a principled way to conduct depth-limited solving in imperfect- information games by allowing the opponent to choose among a number of strategies for the remainder of the game at the depth limit. Each one of these strategies results in a different set of values for leaf nodes. This forces an agent to be robust to the different strategies an opponent may employ. We demonstrate the effectiveness of this approach by building a master-level heads-up no-limit Texas hold'em poker AI that defeats two prior top agents using only a 4-core CPU and 16 GB of memory. Developing such a powerful agent would have previously required a supercomputer.

Training Deep Neural Networks with 8-bit Floating Point Numbers

The state-of-the-art hardware platforms for training deep neural networks are moving from traditional single precision (32-bit) computations towards 16 bits of precision - in large part due to the high energy efficiency and smaller bit storage associated with using reduced-precision representations. However, unlike inference, training with numbers represented with less than 16 bits has been challenging due to the need to maintain fidelity of the gradient computations during back-propagation. Here we demonstrate, for the first time, the successful training of deep neural networks using 8-bit floating point numbers while fully maintaining the accuracy on a spectrum of deep learning models and datasets. In addition to reducing the data and computation precision to 8 bits, we also successfully reduce the arithmetic precision for additions (used in partial product accumulation and weight updates) from 32 bits to 16 bits through the introduction of a number of key ideas including chunk-based accumulation and floating point stochastic rounding. The use of these novel techniques lays the foundation for a new generation of hardware training platforms with the potential for 2-4 times improved throughput over today's systems.
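
Both numerical ideas can be emulated in software: rounding an accumulator to a short significand after every addition exhibits the swamping that chunk-based accumulation repairs, and stochastic rounding gives an unbiased alternative. A minimal sketch, with a crude fp16-like emulation standing in for real 8/16-bit hardware formats:

```python
import numpy as np

rng = np.random.default_rng(0)

def round_fp(x, bits=10, stochastic=False):
    # emulate a float with `bits` significand bits (roughly fp16-like)
    if x == 0.0:
        return 0.0
    m, e = np.frexp(x)                        # x = m * 2**e with 0.5 <= |m| < 1
    scaled = m * 2 ** bits
    if stochastic:
        low = np.floor(scaled)                # round up with prob = fractional part
        scaled = low + (rng.random() < (scaled - low))
    else:
        scaled = np.round(scaled)             # round-to-nearest: causes swamping
    return float(np.ldexp(scaled / 2 ** bits, e))

def lowprec_sum(values, chunk=None, **kw):
    if chunk:
        # chunk-based accumulation: short partial sums, then combine the totals
        return lowprec_sum([lowprec_sum(values[i:i + chunk], **kw)
                            for i in range(0, len(values), chunk)], **kw)
    acc = 0.0
    for v in values:                          # round after every single addition
        acc = round_fp(acc + v, **kw)
    return acc

x = rng.uniform(0.0005, 0.0015, size=4096)
print(np.sum(x))                              # exact
print(lowprec_sum(x))                         # nearest rounding: stalls (swamping)
print(lowprec_sum(x, chunk=64))               # chunked: recovers accuracy
print(lowprec_sum(x, stochastic=True))        # stochastic rounding: unbiased
```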

Scalar Posterior Sampling with Applications

We propose a practical non-episodic PSRL algorithm that, unlike recent state-of-the-art PSRL algorithms, uses a deterministic, model-independent episode switching schedule. Our algorithm, termed deterministic schedule PSRL (DS-PSRL), is efficient in terms of time, sample, and space complexity. We prove a Bayesian regret bound under mild assumptions. Our result is more generally applicable to multiple parameters and continuous state action problems. We compare our algorithm with state-of-the-art PSRL algorithms on standard discrete and continuous problems from the literature. Finally, we show how the assumptions of our algorithm satisfy a sensible parameterization for a large class of problems in sequential recommendations.

Understanding Batch Normalization

Batch normalization is a ubiquitous deep learning technique that normalizes activations in intermediate layers. It is associated with improved accuracy and faster learning, but despite its enormous success there is little consensus regarding why it works. We aim to rectify this and take an empirical approach to understanding batch normalization. Our primary observation is that the higher learning rates that batch normalization enables have a regularizing effect that dramatically improves generalization of normalized networks, which is both demonstrated empirically and motivated theoretically. We show how both activations and gradient information become less input-dependent across spatial dimensions and examples within a mini-batch for deep unnormalized networks, and show how this limits the possible learning rates. Motivated by recent results in random matrix theory, we argue that this ill-conditioning is due to fluctuations in random initialization, shedding new light on classical initialization schemes and their consequences.
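
For reference, the operation under study is the standard batch-normalization forward pass, which standardizes each feature over the mini-batch before a learnable rescaling; a minimal sketch:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=0)                   # per-feature mean over the batch
    var = x.var(axis=0)                   # per-feature variance over the batch
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=5.0, size=(128, 16))    # badly scaled activations
y = batch_norm(x, gamma=np.ones(16), beta=np.zeros(16))
print(y.mean(axis=0)[:3], y.std(axis=0)[:3])          # ~0 and ~1 per feature
```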

Attacks Meet Interpretability: Attribute-steered Detection of Adversarial Samples

Adversarial sample attacks perturb benign inputs to induce DNN misbehaviors. Recent research has demonstrated the widespread presence and the devastating consequences of such attacks. Existing defense techniques either assume prior knowledge of specific attacks or may not work well on complex models due to their underlying assumptions. We argue that adversarial sample attacks are deeply entangled with interpretability of DNN models: while classification results on benign inputs can be reasoned based on the human perceptible features/attributes, results on adversarial samples can hardly be explained. Therefore, we propose a novel adversarial sample detection technique for face recognition models, based on interpretability. It features a novel bi-directional causality inference between attributes and internal neurons to identify neurons critical for individual attributes. The activation values of critical neurons are enhanced to amplify the reasoning part of the computation and the values of other neurons are weakened to suppress the gut-feeling part. The classification results after such transformation are compared with those of the original model to detect adversaries. Results show that our technique can achieve 94% detection accuracy for 7 different kinds of attacks with 9.91% false positives on benign inputs. In contrast, a state-of-the-art feature squeezing technique can only achieve 55% accuracy with 23.3% false positives.

On Neuronal Capacity

We define the capacity of a learning machine to be the logarithm of the number (or volume) of the functions it can implement. We review known results, and derive new results, estimating the capacity of several neuronal models: linear and polynomial threshold gates, linear and polynomial threshold gates with constrained weights (binary weights, positive weights), and ReLU neurons. We also derive capacity estimates and bounds for fully recurrent networks and layered feedforward networks.

Breaking the Activation Function Bottleneck through Adaptive Parameterization

Standard neural network architectures are non-linear only by virtue of a simple element-wise activation function, making them both brittle and excessively large. In this paper, we consider methods for making the feed-forward layer more flexible while preserving its basic structure. We develop simple drop-in replacements that learn to adapt their parameterization conditional on the input, thereby increasing statistical efficiency significantly. We present an adaptive LSTM that advances the state of the art for the Penn Treebank and Wikitext-2 word-modeling tasks while using fewer parameters and converging in half as many iterations.

Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization

There is a natural correlation between the visual and auditive elements of a video. In this work we leverage this connection to learn audio and video features from self-supervised temporal synchronization. We demonstrate that a calibrated curriculum learning scheme, a careful choice of negative examples, and the use of a contrastive loss are critical ingredients to obtain powerful multi-sensory representations from models optimized to discern temporal synchronization of audio-video pairs. Without further finetuning, the resulting audio features achieve state-of-the-art performance on established audio classification benchmarks (DCASE2014 and ESC-50), while our visual stream provides a very effective initialization to significantly improve the performance of video-based action recognition models (our self-supervised pretraining yields a remarkable gain in accuracy of 16.7% on UCF101 and 13.0% on HMDB51, compared to learning from scratch).

Towards Robust Interpretability with Self-Explaining Neural Networks

Most recent work on interpretability of complex machine learning models has focused on estimating a-posteriori explanations for previously trained models around specific predictions. Self-explaining models where interpretability plays a key role already during learning have received much less attention. We propose three desiderata for explanations in general -- explicitness, faithfulness, and stability -- and show that existing methods do not satisfy them. In response, we design self-explaining models in stages, progressively generalizing linear classifiers to complex yet architecturally explicit models. Faithfulness and stability are enforced via regularization specifically tailored to such models. Experimental results across various benchmark datasets show that our framework offers a promising direction for reconciling model complexity and interpretability.

Deep State Space Models for Time Series Forecasting

We present a novel approach to probabilistic time series forecasting that combines state space models with deep learning. By parametrizing a per-time-series linear state space model with a jointly-learned recurrent neural network, our method retains desired properties of state space models such as data efficiency and interpretability, while making use of the ability to learn complex patterns from raw data offered by deep learning approaches. Our method scales gracefully from regimes where little training data is available to regimes where data from millions of time series can be leveraged to learn accurate models. We provide qualitative as well as quantitative results with the proposed method, showing that it compares favorably to the state-of-the-art.
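
The state space building block here is the classical linear-Gaussian filter; a minimal sketch of a scalar local-level model, with the parameters fixed by hand where the paper's recurrent network would instead emit them per series:

```python
import numpy as np

rng = np.random.default_rng(0)
T, q, r = 200, 0.1, 0.5                  # steps, state noise var, obs noise var
true_level = np.cumsum(np.sqrt(q) * rng.normal(size=T))
obs = true_level + np.sqrt(r) * rng.normal(size=T)

m, P = 0.0, 1.0                          # filtered mean and variance
filtered = []
for y in obs:
    P_pred = P + q                       # predict (transition is the identity)
    K = P_pred / (P_pred + r)            # Kalman gain
    m = m + K * (y - m)                  # update with the new observation
    P = (1 - K) * P_pred
    filtered.append(m)

print(np.mean((np.array(filtered) - true_level) ** 2))   # filtering error
```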

Constrained Graph Variational Autoencoders for Molecule Design

Graphs are ubiquitous data structures for representing interactions between entities. With an emphasis on the use of graphs to represent chemical molecules, we explore the task of learning to generate graphs that conform to a distribution observed in training data. We propose a variational autoencoder model in which both encoder and decoder are graph-structured. Our decoder assumes a sequential ordering of graph extension steps and we discuss and analyze design choices that mitigate the potential downsides of this linearization. Experiments compare our approach with a wide range of baselines on the molecule generation task and show that our method is more successful at matching the statistics of the original dataset on semantically important metrics. Furthermore, we show that by using appropriate shaping of the latent space, our model allows us to design molecules that are (locally) optimal in desired properties.

Learning Libraries of Subroutines for Neurally-Guided Bayesian Program Learning

Successful approaches to program induction require a hand-engineered domain-specific language (DSL), constraining the space of allowed programs and imparting prior knowledge of the domain. We contribute a program induction algorithm that learns a DSL while jointly training a neural network to efficiently search for programs in the learned DSL. We use our model to synthesize functions on lists, edit text, and solve symbolic regression problems, showing how the model learns a domain-specific library of program components for expressing solutions to problems in the domain.

Neural Architecture Optimization

Automatic neural architecture design has shown its potential in discovering powerful neural network architectures. Existing methods, whether based on reinforcement learning or evolutionary algorithms (EA), conduct architecture search in a discrete space, which is highly inefficient. In this paper, we propose a simple and efficient method for automatic neural architecture design based on continuous optimization. We call this new approach neural architecture optimization (NAO). There are three key components in our proposed approach: (1) An encoder embeds/maps neural network architectures into a continuous space. (2) A predictor takes the continuous representation of a network as input and predicts its accuracy. (3) A decoder maps a continuous representation of a network back to its architecture. The performance predictor and the encoder enable us to perform optimization in the continuous space to find the embedding of a new architecture with potentially better accuracy. Such a better embedding is then decoded to a network by the decoder. Experiments show that the architecture discovered by our method is very competitive for the image classification task on CIFAR-10 and the language modeling task on PTB, outperforming or on par with the best results of previous architecture search methods. Furthermore, it requires 10 times fewer computational resources than typical methods based on RL and EA.

Preference Based Adaptation for Learning Objectives

In many real-world learning tasks, it is hard to directly optimize the true performance measures, while choosing the right surrogate objectives is also difficult. In this situation, it is desirable to incorporate an objective-optimization process into the learning loop, based on weak modeling of the relationship between the true measure and the objective. In this work, we discuss the task of objective adaptation, in which the learner iteratively adapts the learning objective to the underlying true objective based on preference feedback from an oracle. We show that when the objective can be linearly parameterized, this preference-based learning problem can be solved by utilizing the dueling bandit model. A novel sampling-based algorithm DL^2M is proposed to learn the optimal parameter, which enjoys strong theoretical guarantees and efficient empirical performance. To avoid learning a hypothesis from scratch after each objective function update, a boosting-based hypothesis adaptation approach is proposed to efficiently adapt any pre-learned element hypothesis to the current objective. We apply the overall approach to multi-label learning, and show that the proposed approach achieves significantly improved performance under various multi-label performance measures.

Distributed $k$-Clustering for Data with Heavy Noise

In this paper, we consider the $k$-center/median/means clustering with outliers problems (or the $(k, z)$-center/median/means problems) in the distributed setting. Most previous distributed algorithms have their communication costs linearly depending on $z$, the number of outliers. Recently Guha et al. overcame this dependence issue by considering bi-criteria approximation algorithms that output solutions with $2z$ outliers. For the case where $z$ is large, the extra $z$ outliers discarded by the algorithms might be too many, considering that the data gathering process might be costly. In this paper, we improve the number of outliers to the best possible $(1+\epsilon)z$, while maintaining the $O(1)$-approximation ratio and independence of communication cost on $z$. The problems we consider include the $(k, z)$-center problem, and $(k, z)$-median/means problems in Euclidean metrics. An implementation of our algorithm for $(k, z)$-center shows that it outperforms many previous algorithms, both in terms of the communication cost and quality of the output solution.

Beyond Log-concavity: Provable Guarantees for Sampling Multi-modal Distributions using Simulated Tempering Langevin Monte Carlo

A key task in Bayesian machine learning is sampling from distributions that are only specified up to a partition function (i.e., constant of proportionality). One prevalent example of this is sampling posteriors in parametric distributions, such as latent-variable generative models. However sampling (even very approximately) can be #P-hard. Classical results (going back to Bakry and Emery) on sampling focus on log-concave distributions, and show that a natural Markov chain called Langevin diffusion mixes in polynomial time. However, all log-concave distributions are uni-modal, while in practice it is very common for the distribution of interest to have multiple modes. In this case, Langevin diffusion suffers from torpid mixing. We address this problem by combining Langevin diffusion with simulated tempering. The result is a Markov chain that mixes more rapidly by transitioning between different temperatures of the distribution. We analyze this Markov chain for a mixture of (strongly) log-concave distributions of the same shape. In particular, our technique applies to the canonical multi-modal distribution: a mixture of Gaussians (of equal variance). Our algorithm efficiently samples from these distributions given only access to the gradient of the log-pdf. To the best of our knowledge, this is the first result that proves fast mixing for multimodal distributions.
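
A minimal sketch of the combination on a 1-D mixture of two equal-variance Gaussians: Langevin steps at the current temperature, with occasional Metropolis swaps between temperature levels. The proper partition-function weighting from the paper's analysis is simplified to equal weights here, and all constants are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_p(x):
    # equal mixture of N(-4, 1) and N(4, 1), up to a constant
    return np.logaddexp(-0.5 * (x + 4) ** 2, -0.5 * (x - 4) ** 2)

def grad_log_p(x):
    p1, p2 = np.exp(-0.5 * (x + 4) ** 2), np.exp(-0.5 * (x - 4) ** 2)
    return (-(x + 4) * p1 - (x - 4) * p2) / (p1 + p2)

temps = [1.0, 4.0, 16.0]                  # higher temperature flattens the modes
x, level, eta, samples = -4.0, 0, 0.01, []
for t in range(200000):
    beta = 1.0 / temps[level]
    x += eta * beta * grad_log_p(x) + np.sqrt(2 * eta) * rng.normal()
    if t % 100 == 0:                      # propose a temperature swap
        new = int(np.clip(level + rng.choice([-1, 1]), 0, len(temps) - 1))
        if np.log(rng.random()) < (1 / temps[new] - 1 / temps[level]) * log_p(x):
            level = new
    if level == 0:
        samples.append(x)

print(np.mean(np.array(samples) > 0))     # both modes visited: fraction near 0.5
```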

A General Method for Amortizing Variational Filtering

We introduce a general-purpose, theoretically-grounded, and simple method for performing filtering variational inference in dynamical latent variable models, which we refer to as the variational filtering EM algorithm. The algorithm is derived from the variational objective in the filtering setting and naturally consists of a Bayesian prediction-update loop, with updates performed using any desired optimization method. A computationally efficient implementation of the algorithm is provided, using iterative amortized inference models to perform inference optimization. Through empirical evaluations with several deep dynamical latent variable models on a variety of sequence data sets, we demonstrate that this simple filtering scheme compares favorably against previously proposed filtering methods in terms of inference performance, thereby improving model quality.

A Reduction for Efficient LDA Topic Reconstruction

We present a novel approach for LDA (Latent Dirichlet Allocation) topic reconstruction. The main technical idea is to show that the distribution over the documents generated by LDA can be transformed into a distribution for a much simpler generative model in which documents are generated from {\em the same set of topics} but have a much simpler structure: documents are single-topic and topics are chosen uniformly at random. Furthermore, this reduction is approximation preserving, in the sense that approximate distributions--the only ones we can hope to compute in practice--are mapped into approximate distributions in the simplified world. This opens up the possibility of efficiently reconstructing LDA topics in a roundabout way: compute an approximate document distribution from the given corpus, transform it into an approximate distribution for the single-topic world, and run a reconstruction algorithm in the uniform, single-topic world--a much simpler task than direct LDA reconstruction. Indeed, we show the viability of the approach by giving very simple algorithms for a generalization of two notable cases that have been studied in the literature: $p$-separability and Gibbs sampling for matrix-like topics.

Cluster Variational Approximations for Structure Learning of Continuous-Time Bayesian Networks from Incomplete Data

Continuous-time Bayesian networks (CTBNs) constitute a general and powerful framework for modeling continuous-time stochastic processes on networks. This makes them particularly attractive for learning the directed structures among interacting entities. However, if the available data is incomplete, one needs to simulate the prohibitively complex CTBN dynamics. Existing approximation techniques, such as sampling and low-order variational methods, either scale unfavorably in system size, or are unsatisfactory in terms of accuracy. Inspired by recent advances in statistical physics, we present a new approximation scheme based on cluster-variational methods significantly improving upon existing variational approximations. We can then analytically marginalize CTBN parameters, as these are of secondary importance for structure learning. This recovers a scalable scheme for direct structure learning from incomplete and noisy time-series data. Our approach outperforms existing methods in terms of scalability.

RenderNet: A deep convolutional network for differentiable rendering from 3D shapes

The traditional computer graphics rendering pipeline is designed to procedurally generate high-quality 2D images from 3D shapes with high performance. The non-differentiability due to discrete operations such as visibility computation makes it hard to explicitly correlate rendering parameters and the resulting image, posing a significant challenge for inverse rendering tasks. Recent work on differentiable rendering achieves differentiability either by designing surrogate gradients for non-differentiable operations or via an approximate but differentiable renderer. These methods, however, are still limited when it comes to handling occlusion, and are restricted to particular rendering effects. We present RenderNet, a differentiable rendering convolutional network with a novel projection unit that can render 2D images from 3D shapes. Spatial occlusion and shading calculation are automatically encoded in the network. Our experiments show that RenderNet can successfully learn to implement different shaders, and can be used in inverse rendering tasks to estimate shape, pose, lighting and texture from a single image.

Robust Hypothesis Testing Using Wasserstein Uncertainty Sets

We develop a novel, computationally efficient and general framework for robust hypothesis testing. The new framework features a new way to construct uncertainty sets under the null and the alternative distributions: sets centered around the empirical distribution defined via the Wasserstein metric, making our approach data-driven and free of distributional assumptions. We develop a convex safe approximation of the minimax formulation and show that such approximation renders a nearly-optimal detector among the family of all possible tests. By exploiting the structure of the least favorable distribution, we also develop a tractable reformulation of this approximation, with complexity independent of the dimension of the observation space and nearly independent of sample size in general. A real-data example using human activity data demonstrates the excellent performance of the new robust detector.

Robust Detection of Adversarial Attacks by Modeling the Intrinsic Properties of Deep Neural Networks

It has been shown that deep neural network (DNN) based classifiers are vulnerable to human-imperceptible adversarial perturbations, which can cause DNN classifiers to output wrong predictions with high confidence. We propose an unsupervised learning approach to detect adversarial inputs without any knowledge of attackers. Our approach tries to capture the intrinsic properties of a DNN classifier and uses them to detect adversarial inputs. The intrinsic properties used in this study are the output distributions of the hidden neurons of a DNN classifier presented with natural images. Our approach can be easily applied to any DNN classifier or combined with other defense strategies to improve robustness. Experimental results show that our approach demonstrates state-of-the-art robustness in defending against black-box and gray-box attacks.

Learning to Repair Software Vulnerabilities with Generative Adversarial Networks

Motivated by the problem of automated repair of software vulnerabilities, we propose an adversarial learning approach that maps from one discrete source domain to another target domain without requiring paired labeled examples or a bijection between the source and target domains. We demonstrate that the proposed adversarial learning approach is an effective technique for repairing software vulnerabilities, performing close to seq2seq approaches that require labeled pairs. The proposed Generative Adversarial Network approach is application-agnostic in that it can be applied to other problems similar to code repair, such as grammar correction or sentiment translation.

Layer-Wise Coordination between Encoder and Decoder for Neural Machine Translation

Neural Machine Translation (NMT) has achieved remarkable progress with the rapid evolution of model structures. In this paper, we propose the concept of layer-wise coordination for NMT, which explicitly coordinates the learning of hidden representations of the encoder and decoder together, layer by layer, gradually from low level to high level. Furthermore, we share the parameters of each layer between the encoder and decoder to regularize and coordinate the learning. Experiments show that, combined with the state-of-the-art Transformer model, layer-wise coordination achieves improvements on three IWSLT and two WMT translation tasks. More specifically, our method achieves BLEU scores of 34.43 and 29.01 on the WMT16 English-Romanian and WMT14 English-German tasks, outperforming the Transformer baseline.

Dirichlet belief networks as structured topic prior

Recently, motivated by the success of deep learning, considerable research effort has been devoted to developing deep architectures for topic models. Although several deep models have been proposed to learn better topic proportions of documents, how to leverage the benefits of deep structures for learning word distributions of topics has not yet been rigorously studied. Here we propose a new multi-layer generative process on word distributions of topics, where each layer consists of a set of topics and each topic is drawn from a mixture of the topics in the layer above. As the topics in all layers can be directly interpreted by words, the proposed model is able to discover interpretable topic hierarchies. As a self-contained module, our model can be flexibly adapted to different kinds of topic models to improve their modelling accuracy and interpretability. Extensive experiments on real corpora demonstrate the advantages of the proposed model.

Stochastic Expectation Maximization with Variance Reduction

Expectation-Maximization (EM) is a popular tool for learning latent variable models, but vanilla batch EM does not scale to large data sets because the whole data set is needed at every E-step. Stochastic Expectation Maximization (sEM) reduces the cost of the E-step by stochastic approximation. However, sEM has a slower asymptotic convergence rate than batch EM, and requires a decreasing sequence of step sizes, which is difficult to tune. In this paper, we propose a variance-reduced stochastic EM (sEM-vr) algorithm inspired by variance-reduced stochastic gradient descent algorithms. We show that sEM-vr has the same exponential asymptotic convergence rate as batch EM. Moreover, sEM-vr only requires a constant step size to achieve this rate, which alleviates the burden of parameter tuning. We compare sEM-vr with batch EM, sEM and other algorithms on Gaussian mixture models and probabilistic latent semantic analysis; sEM-vr converges significantly faster than these baselines.
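
The variance-reduction idea can be sketched as an SVRG-style control variate applied to the E-step sufficient statistics; the `suff_stats`/`m_step` interface below is hypothetical, and details such as the exact update may differ from the paper.

```python
import numpy as np

def sem_vr(data, theta0, suff_stats, m_step, step=0.1, n_epochs=20, batch=32, rng=None):
    """Variance-reduced stochastic EM sketch (SVRG-style control variate).

    suff_stats(x, theta): expected sufficient statistics of samples x under
    the posterior at theta (the E-step); m_step(s): parameters maximizing
    the expected complete-data log-likelihood at statistics s. `data` is
    assumed indexable, e.g. a numpy array.
    """
    rng = rng or np.random.default_rng()
    n, theta = len(data), theta0
    s = suff_stats(data, theta)                               # running statistics
    for _ in range(n_epochs):
        snapshot, s_full = theta, suff_stats(data, theta)     # epoch anchor
        for _ in range(n // batch):
            idx = rng.integers(0, n, size=batch)
            # Control variate: batch statistics at theta, corrected by the same
            # batch at the snapshot plus the snapshot's full-data statistics.
            s_hat = suff_stats(data[idx], theta) - suff_stats(data[idx], snapshot) + s_full
            s = (1 - step) * s + step * s_hat                 # a constant step size suffices
            theta = m_step(s)
    return theta
```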

Submodular Maximization via Gradient Ascent: The Case of Deep Submodular Functions

We study the problem of maximizing deep submodular functions (DSFs) subject to a matroid constraint. DSFs are an expressive class of submodular functions that include, as strict subfamilies, facility location functions, weighted coverage functions, and sums of concave functions composed with modular functions. We use a strategy similar to the continuous greedy approach~\cite{calinescu2007maximizing}, but we show that the multilinear extension of any DSF has a natural and computationally attainable concave relaxation that we can optimize using gradient ascent. Our results show a guarantee of \max_{0

The challenge of realistic music generation: modelling raw audio at scale

Realistic music generation is a challenging task. When building generative models of music that are learnt from data, typically high-level representations such as scores or MIDI are used that abstract away the idiosyncrasies of a particular performance. But these nuances are very important for our perception of musicality and realism, so in this work we embark on modelling music in the raw audio domain. It has been shown that autoregressive models excel at generating raw audio waveforms of speech, but when applied to music, we find them biased towards capturing local signal structure at the expense of modelling long-range correlations. This is problematic because music exhibits structure at many different timescales. In this work, we explore autoregressive discrete autoencoders (ADAs) as a means to enable autoregressive models to capture long-range correlations in waveforms. We find that they allow us to unconditionally generate piano music directly in the raw audio domain, which shows stylistic consistency across tens of seconds.

Spectral Signatures in Backdoor Attacks on Deep Nets

A recent line of work has uncovered a new form of data poisoning: so-called backdoor attacks. These attacks are particularly dangerous because they do not affect a network's behavior on typical, benign data. Rather, the network only deviates from its expected output when triggered by an adversary's planted perturbation. In this paper, we identify a new property of all known backdoor attacks, which we call spectral signatures. This property allows us to utilize tools from robust statistics to thwart the attacks. We demonstrate the efficacy of these signatures in detecting and removing poisoned examples on real image sets and state of the art neural network architectures. We believe that understanding spectral signatures is a crucial first step towards a principled understanding of backdoor attacks.

Reward learning from human preferences and demonstrations in Atari

To solve complex real-world problems with reinforcement learning, we cannot rely on manually specified reward functions. Instead, we need humans to communicate an objective to the agent directly. In this work, we combine two approaches to this problem: learning from expert demonstrations and learning from trajectory preferences. We use both to train a deep neural network to model the reward function, and use its predicted reward to train a DQN-based deep reinforcement learning agent on 9 Atari games. Our approach beats the imitation learning baseline in 7 games and achieves strictly superhuman performance on 2 games. Additionally, we investigate the fit of the reward model, present some reward hacking problems, and study the effects of noise in the human labels.

Approximate Knowledge Compilation by Online Collapsed Importance Sampling

We introduce collapsed compilation, a novel approximate inference algorithm for discrete probabilistic graphical models. It is a collapsed sampling algorithm that incrementally selects which variable to sample next based on the partial sample obtained so far. This online collapsing, together with knowledge compilation inference on the remaining variables, naturally exploits local structure and context-specific independence in the distribution. These properties are naturally exploited in exact inference, but are difficult to harness for approximate inference. Moreover, by having a partially compiled circuit available during sampling, collapsed compilation has access to a highly-effective proposal distribution for importance sampling. Our experimental evaluation shows that collapsed compilation performs well on standard benchmarks. In particular, when the amount of exact inference is equally limited, collapsed compilation is competitive with the state of the art.

Neural Arithmetic Logic Units

Neural networks can learn to represent and manipulate numerical information, but they seldom generalize well outside of the range of numerical values encountered during training. To encourage more systematic numerical extrapolation, we propose an architecture that represents numerical quantities as linear activations which are manipulated using primitive arithmetic operators, controlled by learned gates. We call this module a neural arithmetic logic unit (NALU), by analogy to the arithmetic logic unit in traditional processors. Experiments show that NALU-enhanced neural networks can learn to track time, perform arithmetic over images of numbers, translate numerical language into real-valued scalars, execute computer code, and count objects in images. In contrast to conventional architectures, we obtain substantially better generalization both inside and outside of the range of numerical values encountered during training, often extrapolating orders of magnitude beyond trained numerical ranges.
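
Following the equations in the paper, a NALU cell is short enough to sketch directly (PyTorch-style; initialization details below are illustrative):

```python
import torch
import torch.nn as nn

class NALU(nn.Module):
    """Neural Arithmetic Logic Unit (sketch following the paper's equations).

    W = tanh(W_hat) * sigmoid(M_hat) biases weights towards {-1, 0, 1},
    so the additive path computes signed sums, while the multiplicative
    path (addition in log space) computes products, ratios and powers.
    """
    def __init__(self, d_in, d_out, eps=1e-8):
        super().__init__()
        self.W_hat = nn.Parameter(torch.randn(d_out, d_in) * 0.1)
        self.M_hat = nn.Parameter(torch.randn(d_out, d_in) * 0.1)
        self.G = nn.Parameter(torch.randn(d_out, d_in) * 0.1)
        self.eps = eps

    def forward(self, x):
        W = torch.tanh(self.W_hat) * torch.sigmoid(self.M_hat)
        a = x @ W.t()                                         # additive path (NAC)
        m = torch.exp(torch.log(x.abs() + self.eps) @ W.t())  # multiplicative path
        g = torch.sigmoid(x @ self.G.t())                     # learned gate
        return g * a + (1 - g) * m
```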

Pipe-SGD: A Decentralized Pipelined SGD Framework for Distributed Deep Net Training

Distributed training of deep nets is an important technique to address some of the present-day computing challenges like memory consumption and computational demands. Classical distributed approaches, synchronous or asynchronous, are based on the parameter server architecture, i.e., worker nodes compute gradients which are communicated to the parameter server while updated parameters are returned. Recently, distributed training with AllReduce operations has gained popularity as well. While many of those operations seem appealing, little is reported about wall-clock training time improvements. In this paper, we carefully analyze the AllReduce-based setup, propose timing models which include communication, latency and compute time, and demonstrate that pipelined training with a width of two combines the best of both synchronous and asynchronous training. Specifically, for a four-node GPU cluster we show wall-clock training time improvements of up to 5.4x compared to conventional approaches.

Improved Expressivity Through Dendritic Neural Networks

A typical biological neuron, such as a pyramidal neuron of the neocortex, receives thousands of afferent synaptic inputs on its dendritic tree and sends the efferent axonal output downstream. In typical artificial neural networks, dendritic trees are modeled as linear structures that funnel weighted synaptic inputs to the cell bodies. However, numerous experimental and theoretical studies have shown that dendritic arbors are far more than such simple linear accumulators: synaptic inputs can actively modulate neighboring synaptic activities, making dendritic structures highly nonlinear. In this study, we model such local nonlinearity of dendritic trees with our Dendritic Neural Network (DENN) structure and apply it to typical machine learning tasks. Equipped with localized nonlinearities, DENNs can attain greater model expressivity than regular neural networks while maintaining efficient network inference. This strength is evidenced by higher fitting power when we train DENNs on supervised machine learning tasks. We also empirically show that the locality structure can boost the generalization ability of DENNs, exemplified by outranking naive deep neural network architectures on 121 classification tasks from the UCI machine learning repository.

How to Start Training: The Effect of Initialization and Architecture

We identify and study two common failure modes for early training in deep ReLU nets. For each, we give a rigorous proof of when it occurs and how to avoid it, for fully connected, convolutional, and residual architectures. We show that the first failure mode, exploding or vanishing mean activation length, can be avoided by initializing weights from a symmetric distribution with variance 2/fan-in and, for ResNets, by correctly scaling the residual modules. We prove that the second failure mode, exponentially large variance of activation length, never occurs in residual nets once the first failure mode is avoided. In contrast, for fully connected nets, we prove that this failure mode can happen and is avoided by keeping constant the sum of the reciprocals of layer widths. We demonstrate empirically the effectiveness of our theoretical results in predicting when networks are able to start training. In particular, we note that many popular initializations fail our criteria, whereas correct initialization and architecture allows much deeper networks to be trained.
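
A minimal sketch of the prescribed initialization, weights drawn from a symmetric distribution with variance 2/fan-in, together with a quick check that the mean squared activation length stays stable across ReLU layers:

```python
import numpy as np

def init_layer(fan_in, fan_out, rng=None):
    """Weights from a symmetric distribution with variance 2/fan_in,
    which keeps mean activation length stable in deep ReLU nets."""
    rng = rng or np.random.default_rng()
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_out, fan_in))

# Sanity check: squared activation length is roughly preserved layer to layer.
x = np.random.default_rng(0).standard_normal(512)
for fan_out in [512, 512, 512]:
    W = init_layer(x.size, fan_out)
    x = np.maximum(W @ x, 0.0)        # ReLU
    print(round(np.mean(x ** 2), 3))  # stays near the input scale (~1.0)
```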

Which Neural Net Architectures Give Rise to Exploding and Vanishing Gradients?

We give a rigorous analysis of the statistical behavior of gradients in a randomly initialized fully connected network N with ReLU activations. Our results show that the empirical variance of the squares of the entries in the input-output Jacobian of N is exponential in the sum of the reciprocals of the hidden layer widths. Our approach complements the mean field theory analysis of random neural nets. From this point of view, we rigorously compute the finite width corrections to gradients at the edge of chaos.

Explanations based on the Missing: Towards Contrastive Explanations with Pertinent Negatives

In this paper we propose a novel method that provides contrastive explanations justifying the classification of an input by a black-box classifier such as a deep neural network. Given an input, we find what should be minimally and sufficiently present (viz. important object pixels in an image) to justify its classification, and analogously what should be minimally and necessarily \emph{absent} (viz. certain background pixels). We argue that such explanations are natural for humans and are used commonly in domains such as health care and criminology. What is minimally but critically \emph{absent} is an important part of an explanation which, to the best of our knowledge, has not been explicitly identified by current explanation methods that explain predictions of neural networks. We validate our approach on three real datasets obtained from diverse domains: the handwritten digits dataset MNIST, a large procurement fraud dataset, and a brain activity strength dataset. In all three cases, we witness the power of our approach in generating precise explanations that are also easy for human experts to understand and evaluate.

HitNet: Hybrid Ternary Recurrent Neural Network

Quantization is a promising technique to reduce the model size, memory footprint, and massive computation operations of recurrent neural networks (RNNs) for embedded devices with limited resources. Although extreme low-bit quantization has achieved impressive success on convolutional neural networks, it still suffers from huge accuracy degradation on RNNs at the same low-bit precision. In this paper, we first investigate the accuracy degradation of RNN models under different quantization schemes, and the distribution of tensor values in the full-precision model. Our observation reveals that, due to the difference between the distributions of weights and activations, different quantization methods are suitable for different parts of the model. Based on this observation, we propose HitNet, a hybrid ternary recurrent neural network, which bridges the accuracy gap between the full-precision model and the quantized model. In HitNet, we develop a hybrid quantization method to quantize weights and activations. Moreover, we introduce a sloping factor, motivated by prior work on Boltzmann machines, into the activation functions, further closing the accuracy gap. Overall, HitNet quantizes RNN models into ternary values, {-1, 0, 1}, significantly outperforming the state-of-the-art quantization methods on RNN models. We test it on typical RNN models such as Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU). For example, we improve the perplexity per word (PPW) of a ternary LSTM on the Penn Tree Bank (PTB) corpus from 126 (the best previously reported result, to our knowledge) to 110.3, against 97.2 for the full-precision model, and that of a ternary GRU from 142 to 113.5, against 102.7 for the full-precision model.
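
As a generic illustration of ternarization, not HitNet's exact hybrid scheme, weights can be mapped to {-1, 0, +1} with a threshold while a full-precision copy receives the gradients (a straight-through estimator):

```python
import torch

def ternarize(w, t=0.05):
    """Map a weight tensor to {-1, 0, +1} with threshold t * max|w|.

    A generic ternary quantizer, not HitNet's exact scheme: HitNet quantizes
    weights and activations differently and adds a sloping factor to the
    activation functions.
    """
    delta = t * w.abs().max()
    q = torch.zeros_like(w)
    q[w > delta] = 1.0
    q[w < -delta] = -1.0
    return q

class TernaryWeight(torch.autograd.Function):
    """Quantize in the forward pass; let gradients flow to the
    full-precision weights (straight-through estimator)."""
    @staticmethod
    def forward(ctx, w):
        return ternarize(w)

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out  # straight-through: identity gradient
```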

A Unified Framework for Extensive-Form Game Abstraction with Bounds

Abstraction has long been a key component in the practical solving of large-scale extensive-form games. Despite this, abstraction remains poorly understood. There have been some recent theoretical results, but they have been confined to specific assumptions on abstraction structure, to disjoint types of abstraction, and to specific solution concepts, for example exact Nash equilibria or strategies with bounded immediate regret. In this paper we present a unified framework for analyzing abstractions that can express all types of abstractions and solution concepts used in prior papers with performance guarantees, while maintaining comparable bounds on abstraction quality. Moreover, our framework extends well beyond prior work. We present the first exact decomposition of abstraction error for a broad class of abstractions that encompasses abstractions used in practice. Because it is significantly more general, this decomposition has a stronger dependence on the specific strategy computed in the abstraction. We show that this dependence can be removed by making similar, though slightly weaker, assumptions than in prior work. We also show, via counterexample, that such assumptions are necessary for some games. Finally, we prove the first bounds for how $\epsilon$-Nash equilibria computed in abstractions perform in the original game. This is important because often one cannot afford to compute an exact Nash equilibrium in the abstraction. All our results apply to general-sum n-player games.

Removing the Feature Correlation Effect of Multiplicative Noise

Multiplicative noise, including dropout, is widely used to regularize deep neural networks (DNNs), and is shown to be effective in a wide range of architectures and tasks. From an information perspective, we consider injecting multiplicative noise into a DNN as training the network to solve the task with noisy information pathways, which leads to the observation that multiplicative noise tends to increase the correlation between features, so as to increase the signal-to-noise ratio of information pathways. However, high feature correlation is undesirable, as it increases redundancy in representations. In this work, we propose feature-decorrelating multiplicative noise (FDMN), which exploits batch normalization to remove the correlation effect in a simple yet effective way. We show that FDMN significantly improves the performance of standard multiplicative noise on image classification tasks, providing a better alternative to dropout for batch-normalized networks. Additionally, we present a unified view of FDMN and shake-shake regularization, which explains the performance gain of the latter.

Maximum-Entropy Fine Grained Classification

Fine-Grained Visual Classification (FGVC) is an important computer vision problem that involves small visual diversity across classes, and often requires expert annotators to collect data. Exploiting this notion of small visual diversity, we revisit Maximum-Entropy learning in the context of fine-grained classification, and provide a training routine that maximizes the entropy of the output probability distribution when training convolutional neural networks on FGVC tasks. We provide theoretical as well as empirical justification of our approach, and achieve state-of-the-art performance across a variety of classification tasks in FGVC; the approach can potentially be extended to any fine-tuning task. Our method is robust to different hyperparameter values, the amount of training data and the amount of training label noise, and can hence be a valuable tool in many similar problems.
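
The training objective is simple to state: cross-entropy minus a weight on the entropy of the predictive distribution. A sketch, where the weight `gamma` is a hypothetical hyperparameter name:

```python
import torch
import torch.nn.functional as F

def max_entropy_loss(logits, targets, gamma=0.1):
    """Cross-entropy regularized by maximizing predictive entropy.

    Penalizing over-confident distributions suits fine-grained tasks,
    where classes are visually similar by construction.
    """
    log_p = F.log_softmax(logits, dim=-1)
    ce = F.nll_loss(log_p, targets)
    entropy = -(log_p.exp() * log_p).sum(dim=-1).mean()
    return ce - gamma * entropy  # maximize entropy => subtract it
```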

On Learning Markov Chains

Estimating an unknown discrete distribution from its samples is a fundamental problem in statistical learning. Over the past decade, this problem has attracted a significant amount of research effort and has been solved for different divergence measures. Surprisingly, an equally important problem, estimating an unknown Markov chain from its samples, is still far from understood. We consider the problem of determining the minimax risk (expected loss) of estimating an unknown $k$-state Markov chain from its $n$ sequential sample points. We present two related but different formulations: (1) predicting the conditional distribution of the next sample point with respect to the KL-divergence. Under this formulation, we show that the minimax prediction risk is both $\Omega(k\log\log n/n)$ and $\mathcal{O}(k^2\log\log n/n)$. (2) estimating the transition matrix with respect to a natural loss induced by some $f$-divergence measure. Under this formulation, if we allow the transition probabilities to be arbitrarily small, then certain states may not even be observable. Therefore, we consider the case when the transition probabilities are bounded away from zero. We completely resolve the latter problem for essentially all sufficiently smooth $f$-divergences, including but not limited to $L_2$-, Chi-squared, KL-, Hellinger, and Alpha-divergences. Additionally, we show that for the KL-divergence, if one allows the transition probabilities to be as small as $1/n$, then the minimax risk would be roughly $\log\log n$ times larger. The agreement between our theory and experimental results is excellent.

A Neural Compositional Paradigm for Image Captioning

Mainstream captioning models often follow a sequential structure to generate captions, leading to issues such as the introduction of irrelevant semantics, a lack of diversity in the generated captions, and inadequate generalization performance. In this paper, we present an alternative paradigm for image captioning, which factorizes the captioning procedure into two stages: (1) extracting an explicit semantic representation from the given image; and (2) constructing the caption based on a recursive compositional procedure in a bottom-up manner. Compared to conventional models, our paradigm better preserves semantic content through an explicit factorization of semantics and syntax. By using the compositional generation procedure, caption construction follows a recursive structure, which naturally fits the properties of human language. Moreover, the proposed compositional procedure requires less data to train, generalizes better, and yields more diverse captions.

Quantifying Learning Guarantees for Convex but Inconsistent Surrogates

We study the consistency properties of machine learning methods based on minimizing convex surrogates. We extend the recent framework of Osokin et al. (2017) for quantitative analysis of consistency properties to the case of inconsistent surrogates. Our key technical contribution is a new lower bound on the calibration function for the quadratic surrogate, which is non-trivial (not always zero) in inconsistent cases. The new bound allows us to quantify the level of inconsistency of the setting and shows how learning with inconsistent surrogates can still come with guarantees on sample complexity and optimization difficulty. We apply our theory to two concrete cases: hierarchical classification with the tree-structured loss and ranking with the mean average precision loss. The results show the approximation-computation trade-offs caused by inconsistent surrogates and their potential benefits.

Dialog-based Interactive Image Retrieval

Inspired by the enormous growth of online media collections of many types (e.g. images, audio, video, e-books, etc.), and the paucity of intelligent retrieval systems, this paper introduces a novel approach to interactive visual content retrieval. The proposed retrieval framework is guided by free-form natural language feedback from users, allowing for more natural and effective communication. Such a system constitutes a multi-modal dialog protocol in which, at each dialog turn, a user submits a natural language request to a retrieval agent, which then attempts to retrieve the optimal object. We formulate the retrieval task as a reinforcement learning problem, and reward the dialog system for improving the rank of the target object during each dialog turn. This framework can be applied to a variety of visual media types (images, videos, graphics, etc.), and in this paper we study in depth its application to interactive image retrieval. To avoid the cumbersome and costly process of collecting human-machine conversations as the dialog system learns, we train the dialog system with a user simulator, which is itself trained to describe the differences between target and candidate images. The efficacy of our approach is demonstrated in a footwear image retrieval application. Extensive experiments on both simulated and real-world data show that: 1) our proposed learning framework achieves better accuracy than other supervised and reinforcement learning baselines; and 2) user feedback based on natural language rather than pre-specified attributes leads to more effective retrieval results, and a more natural and expressive communication interface.

Near-Optimal Non-Convex Optimization via Stochastic Path-Integrated Differential Estimator

In this paper, we propose a new technique named \textit{Stochastic Path-Integrated Differential EstimatoR} (SPIDER), which can be used to track many deterministic quantities of interest with significantly reduced gradient computation. Combining SPIDER with the method of normalized gradient descent, we propose two new algorithms, namely SPIDER-SFO and SPIDER-SSO, that solve non-convex stochastic optimization problems using stochastic gradients only. We prove that both the SPIDER-SFO and SPIDER-SSO algorithms achieve a record-breaking $\tilde{\mathcal{O}}(\epsilon^{-3})$ gradient computation cost to find an $\epsilon$-approximate first-order and second-order stationary point, respectively. In addition, we prove that SPIDER-SFO nearly matches the algorithmic lower bound for finding stationary points under the gradient Lipschitz assumption in the finite-sum setting.
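
The path-integrated estimator is compact enough to sketch: refresh the gradient estimate with a large batch every $q$ steps, otherwise update it with gradient differences along the path (evaluated on the same mini-batch), then take a normalized step. The stochastic-gradient oracle `grad(x, idx)` below is a hypothetical interface:

```python
import numpy as np

def spider_sfo(grad, x0, n_data, n_iters, q=100, big=1024, small=32, eta=0.01, rng=None):
    """Sketch of SPIDER-SFO: path-integrated estimator + normalized steps.

    grad(x, idx): mini-batch stochastic gradient at x over sample indices idx.
    """
    rng = rng or np.random.default_rng()
    x, x_prev, v = np.array(x0, float), None, None
    for k in range(n_iters):
        if k % q == 0:
            v = grad(x, rng.integers(0, n_data, size=big))  # periodic large-batch refresh
        else:
            idx = rng.integers(0, n_data, size=small)
            # Path-integrated update: correct the running estimate by the
            # gradient difference along the last step, on the SAME mini-batch.
            v = grad(x, idx) - grad(x_prev, idx) + v
        x_prev = x
        x = x - eta * v / max(np.linalg.norm(v), 1e-12)     # normalized gradient step
    return x
```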

Are GANs Created Equal? A Large-Scale Study

Generative adversarial networks (GANs) are a powerful subclass of generative models. Despite a very rich research activity leading to numerous interesting GAN algorithms, it is still very hard to assess which algorithm(s) perform better than others. We conduct a neutral, multi-faceted, large-scale empirical study of state-of-the-art models and evaluation measures. We find that most models can reach similar scores with enough hyperparameter optimization and random restarts. This suggests that improvements can arise from higher computational budgets and more tuning rather than from fundamental algorithmic changes. To overcome some limitations of the current metrics, we also propose several data sets on which precision and recall can be computed. Our experimental results suggest that future GAN research should be based on more systematic and objective evaluation procedures. Finally, we did not find evidence that any of the tested algorithms consistently outperforms the non-saturating GAN introduced in \cite{goodfellow2014generative}.

Learning Disentangled Joint Continuous and Discrete Representations

We present a framework for learning disentangled and interpretable jointly continuous and discrete representations in an unsupervised manner. By augmenting the continuous latent distribution of variational autoencoders with a relaxed discrete distribution and controlling the amount of information encoded in each latent unit, we show how continuous and categorical factors of variation can be discovered automatically from data. Experiments show that the framework disentangles continuous and discrete generative factors on various datasets and outperforms current disentangling methods when a discrete generative factor is prominent.
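
The augmented latent can be sampled with the usual reparameterizations, a Gaussian for the continuous part and a Gumbel-Softmax relaxation for the discrete part; the shapes and temperature below are illustrative:

```python
import torch
import torch.nn.functional as F

def sample_joint_latent(mu, logvar, logits, temperature=0.67):
    """Reparameterized sample from a joint continuous + discrete latent.

    mu, logvar: parameters of the Gaussian (continuous) factors.
    logits: unnormalized log-probabilities of the categorical factor,
    relaxed with Gumbel-Softmax so gradients flow through the sample.
    """
    z_cont = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
    z_disc = F.gumbel_softmax(logits, tau=temperature, hard=False)
    return torch.cat([z_cont, z_disc], dim=-1)
```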

TADAM: Task dependent adaptive metric for improved few-shot learning

Few-shot learning has become essential for producing models that generalize from few examples. In this work, we identify that metric scaling and metric task conditioning are important to improve the performance of few-shot algorithms. Our analysis reveals that simple metric scaling completely changes the nature of few-shot algorithm parameter updates. Metric scaling provides improvements up to 14% in accuracy for certain metrics on the mini-Imagenet 5-way 5-shot classification task. We further propose a simple and effective way of conditioning a learner on the task sample set, resulting in learning a task-dependent metric space. Moreover, we propose and empirically test a practical end-to-end optimization procedure based on auxiliary task co-training to learn a task-dependent metric space. The resulting few-shot learning model based on the task-dependent scaled metric achieves state of the art on mini-Imagenet. We confirm these results on another few-shot dataset that we introduce in this paper based on CIFAR100.
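
Metric scaling amounts to a single learned temperature on the distances entering the softmax. A sketch of a scaled prototypical-style classifier, where `alpha` stands in for the learned scale:

```python
import torch
import torch.nn.functional as F

def scaled_metric_logits(queries, prototypes, alpha):
    """Few-shot class logits from a scaled metric (sketch).

    queries: (n_query, d) embeddings; prototypes: (n_way, d) class centroids;
    alpha: learned positive scale. Scaling the metric changes the softmax
    temperature and, per the paper, the nature of the parameter updates.
    """
    d2 = torch.cdist(queries, prototypes) ** 2  # squared Euclidean distances
    return -alpha * d2                          # softmax(-alpha * d^2) over classes

# Usage: loss = F.cross_entropy(scaled_metric_logits(q, protos, alpha), labels)
```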

Do Less, Get More: Streaming Submodular Maximization with Subsampling

In this paper, we develop the first one-pass streaming algorithm for submodular maximization that does not evaluate the entire stream even once. By carefully subsampling each element of the data stream, our algorithm enjoys the tightest approximation guarantees in various settings while having the smallest memory footprint and requiring the lowest number of function evaluations. More specifically, for a monotone submodular function and a $p$-matchoid constraint, our randomized algorithm achieves a $4p$ approximation ratio (in expectation) with $O(k)$ memory and $O(km/p)$ queries per element ($k$ is the size of the largest feasible solution and $m$ is the number of matroids used to define the constraint). For the non-monotone case, our approximation ratio increases only slightly to $4p+2-o(1)$. To the best of our knowledge, our algorithm is the first that combines the benefits of streaming and subsampling in a novel way in order to truly scale submodular maximization to massive machine learning problems. To showcase its practicality, we empirically evaluated the performance of our algorithm on a video summarization application and observed that it outperforms the state-of-the-art algorithm by up to fiftyfold, while maintaining practically the same utility. We also evaluated the scalability of our algorithm on a large dataset of Uber pick-up locations.
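
For intuition, a stripped-down version of the subsampling idea for a single cardinality constraint; the fixed marginal-gain threshold is illustrative, and the paper's algorithm handles general $p$-matchoids with exchange steps:

```python
import random

def subsampled_streaming_max(stream, f, k, q=0.5, tau=0.0, seed=0):
    """One-pass streaming submodular maximization with subsampling (sketch).

    f(S): monotone submodular value oracle on lists; k: cardinality limit;
    q: probability of even evaluating an arriving element -- most of the
    stream is skipped without a single oracle call. The fixed threshold
    tau is illustrative, not the paper's exact rule.
    """
    rng = random.Random(seed)
    S, fS = [], 0.0
    for e in stream:
        if rng.random() > q:
            continue                  # skipped: never evaluated at all
        if len(S) < k and f(S + [e]) - fS >= tau:
            S.append(e)
            fS = f(S)
    return S
```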

Sparse Covariance Modeling in High Dimensions with Gaussian Processes

This paper studies statistical relationships among elements of high-dimensional observations varying across non-random covariates. We propose to model the observation elements' changing covariances as sparse multivariate stochastic processes. In particular, our novel covariance modeling method reduces dimensionality by relating the observation vectors to a lower-dimensional subspace. To characterize the changing correlations, we jointly model the latent factors and the factor loadings as collections of basis functions that vary with the covariates as Gaussian processes. Automatic relevance determination (ARD) encodes basis sparsity through the basis coefficients to account for inherent redundancy. Experiments across domains show performance superior to state-of-the-art methods.

Image-to-image translation for cross-domain disentanglement

Deep image translation methods have recently shown excellent results, outputting high-quality images covering multiple modes of the data distribution. There has also been increased interest in disentangling the internal representations learned by deep methods to further improve their performance and achieve a finer control. In this paper, we bridge these two objectives and introduce the concept of cross-domain disentanglement. We aim to separate the internal representation into three parts. The shared part contains information for both domains. The exclusive parts, on the other hand, contain only factors of variation that are particular to each domain. We achieve this through bidirectional image translation based on Generative Adversarial Networks and cross-domain autoencoders, a novel network component. The obtained model offers multiple advantages. We can output diverse samples covering multiple modes of the distributions of both domains. We can perform cross-domain retrieval without the need of labeled data. Finally, we can perform domain-specific image transfer and interpolation. We compare our model to the state-of-the-art in multi-modal image translation and achieve better results.

Gradient Sparsification for Communication-Efficient Distributed Optimization

Modern large-scale machine learning applications require stochastic optimization algorithms to be implemented on distributed computational architectures. A key bottleneck is the communication overhead for exchanging information, such as stochastic gradients, among different workers. In this paper, to reduce the communication cost, we propose a convex optimization formulation that minimizes the coding length of stochastic gradients. The key idea is to randomly drop coordinates of the stochastic gradient vectors and amplify the remaining coordinates appropriately to ensure that the sparsified gradient is unbiased. To compute the optimal sparsification efficiently, we propose several simple and fast algorithms for approximate solutions, with theoretical guarantees on sparseness. Experiments on $\ell_2$-regularized logistic regression, support vector machines, and convolutional neural networks validate our sparsification approaches.
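
The unbiasedness trick is worth seeing concretely: keep coordinate $i$ with probability $p_i$ and rescale survivors by $1/p_i$, so the expectation is unchanged. The magnitude-proportional choice of $p_i$ below is a simple stand-in; the paper instead obtains the probabilities from a convex program:

```python
import numpy as np

def sparsify_unbiased(g, budget, rng=None):
    """Randomly drop gradient coordinates, rescaling survivors by 1/p_i.

    Since E[out_i] = p_i * (g_i / p_i) = g_i, the sparsified gradient is
    unbiased. Here p_i is proportional to |g_i|, keeping roughly `budget`
    coordinates in expectation; the paper chooses the p_i via a convex
    program trading off sparsity against variance.
    """
    rng = rng or np.random.default_rng()
    p = np.minimum(1.0, budget * np.abs(g) / (np.abs(g).sum() + 1e-12))
    mask = rng.uniform(size=g.shape) < p
    out = np.zeros_like(g)
    out[mask] = g[mask] / p[mask]
    return out
```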

Revisiting Multi-Task Learning with ROCK: a Deep Residual Auxiliary Block for Visual Detection

Multi-Task Learning (MTL) is appealing for deep learning regularization. In this paper, we tackle a specific MTL context that we denote primary MTL, where the ultimate goal is to improve the performance of a given primary task by leveraging several other auxiliary tasks. Our main methodological contribution is ROCK, a new generic multi-modal fusion block for deep learning, tailored to the primary MTL context. The ROCK architecture is based on a residual connection, which makes forward prediction explicitly impacted by the intermediate auxiliary representations. The auxiliary predictor's architecture is also specifically designed for our primary MTL context, incorporating intensive pooling operators to maximize the complementarity of intermediate representations. Extensive experiments on the NYUv2 dataset (object detection with scene classification, depth prediction, and surface normal estimation as auxiliary tasks) validate the relevance of the approach and its superiority to flat MTL approaches. Our method outperforms state-of-the-art object detection models on NYUv2 by a large margin, and is also able to handle large-scale heterogeneous inputs (real and synthetic images) and missing annotation modalities.

Adaptive Online Learning in Dynamic Environments

In this paper, we study online convex optimization in dynamic environments, and aim to bound the dynamic regret with respect to any sequence of comparators. Existing work has shown that online gradient descent enjoys an $O(\sqrt{T}(1+P_T))$ dynamic regret, where $T$ is the number of iterations and $P_T$ is the path-length of the comparator sequence. However, this result is unsatisfactory, as there exists a large gap from the $\Omega(\sqrt{T(1+P_T)})$ lower bound established in our paper. To address this limitation, we develop a novel online method, namely adaptive learning for dynamic environment (Ader), which achieves an $O(\sqrt{T(\log \log T+P_T)})$ dynamic regret, matching the lower bound up to a double logarithmic factor. The basic idea is to maintain a set of experts, each attaining an optimal dynamic regret for a specific path-length, and to combine them with an expert-tracking algorithm. Furthermore, we propose an improved Ader based on a surrogate loss, which reduces the number of gradient evaluations per round from $O(\log T)$ to $1$. Finally, we extend Ader to the setting in which a sequence of dynamical models is available to characterize the comparators.
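
Structurally, Ader is an experts construction: online gradient descent with a geometric grid of step sizes, one expert per candidate path-length, combined by exponential weighting on surrogate linear losses. A sketch with hypothetical `grad` and `project` oracles:

```python
import numpy as np

def ader_sketch(grad, project, x0, T, etas, meta_lr=1.0):
    """Sketch of Ader: OGD experts on a step-size grid + exponential weights.

    grad(t, x): gradient of the round-t loss at x; project(x): projection
    onto the feasible set; etas: geometric grid of expert step sizes, each
    suited to a different comparator path-length.
    """
    experts = [np.array(x0, float) for _ in etas]
    w = np.ones(len(etas)) / len(etas)                  # meta weights
    x = np.array(x0, float)
    for t in range(T):
        x = sum(wi * xi for wi, xi in zip(w, experts))  # combined decision
        g = grad(t, x)                                  # one gradient per round
        # Surrogate linear losses let every expert reuse this single gradient.
        w = w * np.exp(-meta_lr * np.array([g @ xi for xi in experts]))
        w /= w.sum()
        experts = [project(xi - eta * g) for eta, xi in zip(etas, experts)]
    return x
```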

Frequency-Agnostic Word Representation

Continuous word representation (aka word embedding) is a basic building block in many neural network-based models used in natural language processing tasks. Although it is widely accepted that words with similar semantics should be close to each other in the embedding space, we find that word embeddings learned in several tasks are biased towards word frequency: the embeddings of high-frequency and low-frequency words lie in different subregions of the embedding space, and the embedding of a rare word and a popular word can be far from each other even if they are semantically similar. This makes learned word embeddings ineffective, especially for rare words, and consequently limits the performance of these neural network models. To mitigate this issue, we propose a neat, simple yet effective adversarial training method to blur the boundary between the embeddings of high-frequency and low-frequency words. We conducted comprehensive studies on ten datasets across four natural language processing tasks, including word similarity, language modeling, machine translation and text classification. Results show that we achieve higher performance than the baselines in all tasks.

Generative Neural Machine Translation

We introduce Generative Neural Machine Translation (GNMT), a latent variable architecture which is designed to model the semantics of the source and target sentences. We modify an encoder-decoder translation model by adding a latent variable as a language agnostic representation which is encouraged to learn the meaning of the sentence. GNMT achieves competitive BLEU scores on pure translation tasks, and is superior when there are missing words in the source sentence. We augment the model to facilitate multilingual translation and semi-supervised learning without adding parameters. This framework significantly reduces overfitting when there is limited paired data available, and is effective for translating between pairs of languages not seen during training.

Found Graph Data and Planted Vertex Covers

A typical way in which network data is recorded is to measure all interactions involving a specified set of core nodes, which produces a graph containing this core together with a potentially larger set of fringe nodes that link to the core. Interactions between nodes in the fringe, however, are not present in the resulting graph data. For example, a phone service provider may only record calls in which at least one of the participants is a customer; this can include calls between a customer and a non-customer, but not between pairs of non-customers. Knowledge of which nodes belong to the core is crucial for interpreting the dataset, but this metadata is unavailable in many cases, either because it has been lost due to difficulties in data provenance, or because the network consists of ``found data'' obtained in settings such as counter-surveillance. This leads to an algorithmic problem of recovering the core set. Since the core is a vertex cover, we essentially have a planted vertex cover problem, but with an arbitrary underlying graph. We develop a framework for analyzing this planted vertex cover problem, based on the theory of fixed-parameter tractability, together with algorithms for recovering the core. Our algorithms are fast, simple to implement, and outperform several baselines based on core-periphery structure on various real-world datasets.

Joint Active Feature Acquisition and Classification with Variable-Size Set Encoding

We consider the problem of active feature acquisition, where the goal is to sequentially select the subset of features that achieves the maximum prediction performance in the most cost-effective way at test time. In this work, we formulate active feature acquisition as a joint learning problem, training both the classifier (environment) and an RL agent that decides either to `stop and predict' or `collect a new feature' at test time, in a cost-sensitive manner. We also introduce a novel encoding scheme to represent acquired subsets of features by proposing an order-invariant set encoding at the feature level, which also significantly reduces the search space for our agent. We evaluate our model on a carefully designed synthetic dataset for active feature acquisition as well as several medical datasets. Our framework exhibits a meaningful feature acquisition process for diagnosis that complies with human knowledge, and outperforms all baselines in terms of prediction performance as well as feature acquisition cost.

Regularization Learning Networks

Despite their impressive performance, Deep Neural Networks (DNNs) typically underperform Gradient Boosting Trees (GBTs) on many tabular-dataset learning tasks. We propose that applying a different regularization coefficient to each weight might boost the performance of DNNs by allowing them to make more use of the more relevant inputs. However, this would lead to an intractable number of hyperparameters. Here, we introduce Regularization Learning Networks (RLNs), which overcome this challenge by introducing an efficient hyperparameter tuning scheme that minimizes a new Counterfactual Loss. Our results show that RLNs significantly improve DNNs on tabular datasets and achieve comparable results to GBTs, with the best performance achieved by an ensemble that combines GBTs and RLNs. RLNs produce extremely sparse networks, eliminating up to 99.8% of the network edges and 82% of the input features, thus providing more interpretable models and revealing the importance that the network assigns to different inputs. RLNs could efficiently learn a single network in datasets that comprise both tabular and unstructured data, such as in the setting of medical imaging accompanied by electronic health records.

Multitask Boosting for Survival Analysis with Competing Risks

The co-occurrence of multiple diseases among the general population is an important problem, as those patients have more risk of complications and represent a large share of health care expenditure. Learning to predict time-to-event probabilities for these patients is a challenging problem because the risks of events are correlated (there are competing risks), often with only a few patients experiencing individual events of interest, and of those only a fraction actually observed in the data. We introduce in this paper a survival model with the flexibility to leverage a common representation of related events that is designed to correct for the strong imbalance in observed outcomes. The procedure is sequential: outcome-specific survival distributions form the components of nonparametric multivariate estimators which we combine into an ensemble in such a way as to ensure accurate predictions on all outcome types simultaneously. Our algorithm is general and represents the first boosting-like method for time-to-event data with multiple outcomes. We demonstrate the performance of our algorithm on synthetic and real data.

Geometry Based Data Generation

We propose a new type of generative model for high-dimensional data that learns a manifold geometry of the data, rather than a density, and can generate points evenly along this manifold. This is in contrast to existing generative models that represent data density and are strongly affected by noise and other artifacts of data collection. We demonstrate how this approach corrects sampling biases and artifacts and improves several downstream data analysis tasks, such as clustering and classification. Finally, we demonstrate that this approach is especially useful in biology where, despite the advent of single-cell technologies, rare subpopulations and gene-interaction relationships are affected by biased sampling. We show that our method, SUGAR, can generate hypothetical populations and reveal intrinsic patterns and mutual-information relationships between genes on a single-cell dataset of hematopoiesis.

SLAYER: Spike Layer Error Reassignment in Time

Configuring deep Spiking Neural Networks (SNNs) is an exciting research avenue for low-power, spike-event-based computation. However, the spike generation function is non-differentiable and therefore not directly compatible with the standard error backpropagation algorithm. In this paper, we introduce a new general backpropagation mechanism for learning synaptic weights and axonal delays which overcomes the problem of non-differentiability of the spike function and uses a temporal credit assignment policy for backpropagating error to preceding layers. We describe and release a GPU-accelerated software implementation of our method which allows training both fully connected and convolutional neural network (CNN) architectures. Using our software, we compare our method against existing SNN-based learning approaches and standard ANN-to-SNN conversion techniques, and show that our method achieves state-of-the-art performance for an SNN on the MNIST, NMNIST, DVS Gesture, and TIDIGITS datasets.

On Oracle-Efficient PAC RL with Rich Observations

We study the computational tractability of provably sample-efficient (PAC) reinforcement learning in episodic environments with rich observations. We present new sample-efficient algorithms for environments with deterministic hidden state dynamics and stochastic rich observations. These methods operate in an oracle model of computation—accessing policy and value function classes exclusively through standard optimization primitives—and therefore represent computationally efficient alternatives to prior algorithms that require enumeration. In the more general stochastic transition setting, we prove that the only known sample-efficient algorithm, OLIVE [1], cannot be implemented in our oracle model. We also present several examples that illustrate fundamental challenges of tractable PAC reinforcement learning in such general settings.

Gradient Descent for Spiking Neural Networks

Many studies on neural computation are based on network models of static neurons that produce analog output, despite the fact that information processing in the brain is predominantly carried out by dynamic neurons that produce discrete pulses called spikes. Research in spike-based computation has been impeded by the lack of efficient supervised learning algorithms for spiking networks. Here, we present a gradient descent method for optimizing spiking network models by introducing a differentiable formulation of spiking networks and deriving the exact gradient calculation. For demonstration, we trained recurrent spiking networks on two dynamic tasks: one that requires optimizing fast (≈ millisecond) spike-based interactions for efficient encoding of information, and a delayed-memory XOR task over an extended duration (≈ second). The results show that our method indeed optimizes the spiking network dynamics on the time scale of individual spikes as well as on behavioral time scales. In conclusion, our result offers a general-purpose supervised learning algorithm for spiking neural networks, thus advancing further investigations on spike-based computation.

Generalizing Tree Probability Estimation via Bayesian Networks

Probability estimation is one of the fundamental tasks in statistics and machine learning. However, standard methods for probability estimation on discrete objects do not handle object structure in a satisfactory manner. In this paper, we derive a general Bayesian network formulation for probability estimation on leaf-labeled trees that enables flexible approximations which can generalize beyond observations. We show that efficient algorithms for learning Bayesian networks can be easily extended to probability estimation on this challenging structured space. Experiments on both synthetic and real data show that our methods greatly outperform the current practice of using the empirical distribution, as well as a previous effort for probability estimation on trees.

Designing by Training: Acceleration Neural Network for Fast High-Dimensional Convolution

High-dimensional convolution is widely used in various disciplines but suffers from a serious performance problem due to its high computational complexity. Over the decades, researchers took a handcrafted approach to designing fast algorithms for the Gaussian convolution. Recently, requirements for various non-Gaussian convolutions have emerged and are continuously growing. However, the handcrafted acceleration approach is no longer feasible for so many different convolutions, since it is a time-consuming and painstaking job. Instead, we propose an Acceleration Network (AccNet), which turns the work of designing new fast algorithms into training the AccNet. This is done by (1) interpreting the splatting, blurring and slicing operations as convolutions; and (2) turning these convolutions into $g$CP layers to build AccNet. After training, the activation function $g$ together with the AccNet weights automatically defines the new splatting, blurring and slicing operations. Experiments demonstrate that AccNet is able to design acceleration algorithms for a wide range of convolutions, including Gaussian and non-Gaussian convolutions, and produces state-of-the-art results.

Understanding the Role of Adaptivity in Machine Teaching: The Case of Version Space Learners

In real-world educational applications, an effective teacher adaptively chooses the next example to teach based on the learner's current state. However, most existing work in algorithmic machine teaching focuses on the batch setting, where adaptivity plays no role. In this paper, we study the case of teaching consistent, version space learners in an interactive setting. At any time step, the teacher provides an example, the learner performs an update, and the teacher observes the learner's new state. We highlight that adaptivity does not speed up the teaching process under existing models of version space learners, such as "worst-case" (the learner picks the next hypothesis randomly from the version space) and "preference-based" (the learner picks a hypothesis according to some global preference). Inspired by human teaching, we propose a new model where the learner picks hypotheses according to some local preference defined by the current hypothesis. We show that our model exhibits several desirable properties, e.g., adaptivity plays a key role, and the learner's transitions over hypotheses are smooth and interpretable. We develop efficient teaching algorithms and demonstrate our results via simulation and user studies.

A loss framework for calibrated anomaly detection

Given samples from a probability distribution, anomaly detection is the problem of determining if a given point lies in a low-density region. This paper concerns calibrated anomaly detection, which is the practically relevant extension where we additionally wish to produce a confidence score for a point being anomalous. Building on a classification framework for anomaly detection, we show how minimisation of a suitably modified proper loss produces density estimates only for anomalous instances. We then show how to incorporate quantile control by relating our objective to a generalised version of the pinball loss. Finally, we show how to efficiently optimise the objective with a kernelised scorer, by leveraging a recent result from the point process literature. The resulting objective captures a close relative of the one-class SVM as a special case.
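
For reference, the standard pinball (quantile) loss that the paper generalises fits in two lines; the generalised version and the kernelised scorer are the paper's contributions and are not reproduced here.

```python
import numpy as np

def pinball_loss(y, score, alpha):
    # Standard pinball loss at quantile level alpha in (0, 1):
    # minimising it in expectation drives score toward the alpha-quantile of y.
    r = y - score
    return np.where(r >= 0, alpha * r, (alpha - 1.0) * r)
```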

PacGAN: The power of two samples in generative adversarial networks

Generative adversarial networks (GANs) are a technique for learning generative models of complex data distributions from samples. Despite remarkable advances in generating realistic images, a major shortcoming of GANs is the fact that they tend to produce samples with little diversity, even when trained on diverse datasets. This phenomenon, known as mode collapse, has been the focus of much recent work. We study a principled approach to handling mode collapse, which we call packing. The main idea is to modify the discriminator to make decisions based on multiple samples from the same class, either real or artificially generated. We draw analysis tools from binary hypothesis testing ---in particular the seminal result of Blackwell---to prove a fundamental connection between packing and mode collapse. We show that packing naturally penalizes generators with mode collapse, thereby favoring generator distributions with less mode collapse during the training process. Numerical experiments on benchmark datasets suggest that packing provides significant improvements.
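
The packing operation itself is simple to state: the discriminator scores m samples jointly instead of one at a time. A minimal sketch, where packing is done by reshaping a batch (leftover samples are dropped; the discriminator is unchanged apart from its wider input):

```python
import numpy as np

def pack(samples, m):
    # samples: (n, d) array of either all-real or all-generated samples.
    # Returns (n // m, m * d): each packed row is m samples concatenated,
    # so a discriminator scoring rows sees m samples at once.
    n, d = samples.shape
    usable = (n // m) * m
    return samples[:usable].reshape(usable // m, m * d)
```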

Variational Memory Encoder-Decoder

Introducing variability while maintaining coherence is a core task in learning to generate utterances in conversation. Standard neural encoder-decoder models and their extensions using conditional variational autoencoders often result in either trivial or digressive responses. To overcome this, we explore a novel approach that injects variability into the neural encoder-decoder via the use of external memory as a mixture model, namely the Variational Memory Encoder-Decoder (VMED). By associating each memory read with a mode in the latent mixture distribution at each timestep, our model can capture the variability observed in sequential data such as natural conversations. We empirically compare the proposed model against other recent approaches on various conversational datasets. The results show that VMED consistently achieves significant improvement over others in both metric-based and qualitative evaluations.

Stochastic Composite Mirror Descent: Optimal Bounds with High Probabilities

We study stochastic composite mirror descent, a class of scalable algorithms able to exploit the geometry and composite structure of a problem. We consider both convex and strongly convex objectives, for each of which we establish high-probability convergence rates optimal up to a logarithmic factor. We apply the derived computational error bounds to investigate generalization bounds of stochastic gradient descent (SGD) in a non-parametric setting, which refine the existing bounds in expectation by either removing smoothness assumptions on loss functions or improving the specific learning rates. We show that the involved estimation errors scale logarithmically with respect to the number of iterations provided that the step size sequence is square-summable, which justifies the ability of SGD to overcome overfitting. Our analysis removes boundedness assumptions on subgradients often imposed in the literature. Numerical results are reported to support our findings.
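
As a concrete special case for orientation: with the Euclidean mirror map and an l1 composite term, one step of composite mirror descent reduces to proximal SGD with soft-thresholding. This is a textbook instance in our own notation; the paper's general setting replaces the squared Euclidean distance with a Bregman divergence.

```python
import numpy as np

def prox_sgd_step(x, grad, eta, lam):
    # Composite mirror descent with mirror map 0.5 * ||x||^2 and
    # composite term r(x) = lam * ||x||_1: a gradient step followed by
    # the l1 proximal operator (soft-thresholding).
    z = x - eta * grad
    return np.sign(z) * np.maximum(np.abs(z) - eta * lam, 0.0)
```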

Hybrid Retrieval-Generation Reinforced Agent for Medical Image Report Generation

Generating long and topic-coherent reports to describe medical images poses challenges to bridging visual patterns with informative human linguistic descriptions. We propose a novel Hybrid Retrieval-Generation Reinforced Agent (HRGR-Agent) which reconciles traditional retrieval-based approaches populated with human prior knowledge with modern learning-based approaches to achieve structured, robust and diverse report generation. HRGR-Agent employs a retrieval policy module to generate a sequence of actions that decide between retrieving appropriate template sentences from an off-the-shelf template database and automatically generating sentences with a generation module. Thus multiple sentences are sequentially generated via hierarchical decision-making. Our HRGR-Agent is updated via reinforcement learning, guided by sentence-level and word-level rewards. Experiments show that HRGR-Agent achieves state-of-the-art results on two medical report datasets, generating well-balanced, structured and complex sentences with robust coverage of heterogeneous medical report contents. In addition to automatic evaluations, we demonstrate that our model achieves the highest detection accuracy of medical terminologies, and the best human evaluation performance.

Overcoming Language Priors in Visual Question Answering with Adversarial Regularization

Modern Visual Question Answering (VQA) models have been shown to rely heavily on superficial correlations between question and answer words learned during training -- \eg overwhelmingly reporting the type of room as kitchen or the sport being played as tennis, irrespective of the image. Most alarmingly, this shortcoming is often not well reflected during evaluation because the same strong priors exist in test distributions; however, a VQA system that fails to ground questions in image content would likely perform poorly in real-world settings. In this work, we present a novel regularization scheme for VQA that reduces this effect. We introduce a question-only model that takes as input the question encoding from the VQA model and must leverage language biases in order to succeed. We then pose training as an adversarial game between the VQA model and this question-only adversary -- discouraging the VQA model from capturing language biases in its question encoding. Further, we leverage this question-only model to estimate the mutual information between the image and answer given the question, which we maximize explicitly to encourage visual grounding. Our approach is a model agnostic training procedure and simple to implement. We show empirically that it can improve performance significantly on a bias-sensitive split of the VQA dataset for multiple base models -- achieving state-of-the-art on this task. Further, on standard VQA tasks, our approach shows significantly less drop in accuracy compared to existing bias-reducing VQA models.

Hybrid Knowledge Routed Modules for Large-scale Object Detection

The dominant object detection approaches treat the recognition of each region separately and overlook crucial semantic correlations between objects in one scene. This paradigm leads to a substantial performance drop when facing heavy long-tail problems, where very few samples are available for rare classes and plenty of confusing categories exist. We exploit diverse human commonsense knowledge for reasoning over large-scale object categories and reaching semantic coherency within one image. In particular, we present Hybrid Knowledge Routed Modules (HKRM) that incorporate reasoning routed by two kinds of knowledge forms: an explicit knowledge module for structured constraints that are summarized with linguistic knowledge (e.g. shared attributes, relationships) about concepts; and an implicit knowledge module that depicts implicit constraints (e.g. common spatial layouts). By functioning over a region-to-region graph, both modules can be individualized and adapted to coordinate with the visual patterns in each image, guided by specific knowledge forms. HKRM are lightweight, general-purpose and extensible, easily incorporating multiple kinds of knowledge to endow any detection network with the ability of global semantic reasoning. Experiments on large-scale object detection benchmarks show HKRM obtains around 34.5% improvement on VisualGenome (1000 categories) and 30.4% on ADE in terms of mAP.

Bilinear Attention Networks

Attention networks in multimodal learning provide an efficient way to utilize given visual information selectively. However, the computational cost of learning attention distributions for every pair of multimodal input channels is prohibitively expensive. To solve this problem, co-attention builds two separate attention distributions, one for each modality, neglecting the interaction between multimodal inputs. In this paper, we propose bilinear attention networks (BAN) that find bilinear attention distributions to utilize given vision-language information seamlessly. BAN considers bilinear interactions between two groups of input channels, while low-rank bilinear pooling extracts the joint representations for each pair of channels. Furthermore, we propose a variant of multimodal residual networks to exploit the eight attention maps of the BAN efficiently. We quantitatively and qualitatively evaluate our model on the visual question answering (VQA 2.0) and Flickr30k Entities datasets, showing that BAN significantly outperforms previous methods and achieves new state-of-the-art results on both datasets.
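
A minimal sketch of the low-rank bilinear interaction at the heart of such attention maps, in our own notation (the paper's exact parameterization, gating and normalization differ):

```python
import numpy as np

def bilinear_attention_map(X, Y, U, V):
    # X: (n_x, d_x) channels of one modality; Y: (n_y, d_y) of the other.
    # U: (d_x, r) and V: (d_y, r) project both sides to a shared rank-r
    # space, where every pair of channels is scored bilinearly.
    logits = (X @ U) @ (Y @ V).T          # (n_x, n_y) pairwise scores
    e = np.exp(logits - logits.max())
    return e / e.sum()                    # attention over channel pairs
```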

Parsimonious Quantile Regression of Asymmetrically Heavy-tailed Financial Return Series

We propose a parsimonious quantile regression framework to learn the dynamic tail behavior of financial asset returns. Our method captures well both the time-varying characteristics and the asymmetric heavy-tail property of financial time series. It combines the merits of a popular sequential neural network model, the LSTM, with a novel parametric quantile function that we construct to represent conditional returns of asset prices. Our method also captures the serial dependence of higher moments individually, rather than just the volatility. Across a wide range of asset classes, the out-of-sample forecasts of our model outperform the GARCH family. Further, the approach does not suffer from the issue of quantile crossing, nor is it exposed to the ill-posedness of the parametric probability density function approach.

Multi-Class Learning: From Theory to Algorithm

In this paper, we study the generalization performance of multi-class classification and obtain a sharper data-dependent generalization error bound with a fast convergence rate, substantially improving the state-of-the-art bounds in the existing data-dependent generalization analysis. The theoretical analysis motivates us to devise two effective multi-class kernel learning algorithms with statistical guarantees and fast convergence rates. Experimental results show that our proposed methods can significantly outperform existing multi-class classification methods.

Multivariate Time Series Imputation with Generative Adversarial Networks

Multivariate time series are often riddled with missing values, which hampers advanced analysis. Conventional methods for imputing missing values, such as mean and zero imputation, deletion and matrix factorization, are not capable of modeling the temporal dependencies and the complex distribution of multivariate time series. In this paper, we regard missing value imputation as a data generation problem. Inspired by the success of Generative Adversarial Networks (GAN) in image generation, we adopt GAN to learn the overall distribution of a multivariate time series dataset and to generate the missing values for each sample. Different from image completion, where GAN is trained on complete images, we cannot obtain complete time series due to the nature of the data recording process. Therefore, a modified Gated Recurrent Unit (GRU) model is employed in the GAN to model the temporal irregularity of the incomplete time series. Experiments on two multivariate time series datasets show that our method outperforms the baselines in terms of accuracy of missing value imputation. Additionally, a simple model on the imputed data achieves state-of-the-art results on the two prediction tasks, which demonstrates the benefit of our imputation method for downstream applications.

Learning Versatile Filters for Efficient Convolutional Neural Networks

This paper introduces versatile filters to construct efficient convolutional neural networks. Considering the demands of efficient deep learning techniques running on cost-effective hardware, a number of methods have been developed to learn compact neural networks. Most of these works aim to slim down filters in different ways, e.g., investigating small, sparse or binarized filters. In contrast, we treat filters from an additive perspective. A series of secondary filters can be derived from a primary filter. These secondary filters are all embedded in the primary filter and occupy no additional storage, but once unfolded during computation they can significantly enhance the capability of the filter by integrating information extracted from different receptive fields. Besides spatially versatile filters, we additionally investigate versatile filters from the channel perspective. The new techniques are general and can upgrade filters in existing CNNs. Experimental results on benchmark datasets and neural networks demonstrate that CNNs constructed with our versatile filters are able to achieve accuracy comparable to that of the original filters, while requiring less memory and fewer FLOPs.

Accelerated Stochastic Matrix Inversion: General Theory and Speeding up BFGS Rules for Faster Second-Order Optimization

We present the first accelerated randomized algorithm for solving linear systems in Euclidean spaces. One essential problem of this type is the matrix inversion problem. In particular, our algorithm can be specialized to invert positive definite matrices in such a way that all iterates (approximate solutions) generated by the algorithm are positive definite matrices themselves. This opens the way for many applications in the field of optimization and machine learning. As an application of our general theory, we develop the first accelerated (deterministic and stochastic) quasi-Newton updates. Our updates lead to provably more aggressive approximations of the inverse Hessian, and lead to speed-ups over classical non-accelerated rules in numerical experiments. Experiments with empirical risk minimization show that our rules can accelerate training of machine learning models.
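
For readers wanting the baseline the paper accelerates: the classical (non-accelerated) BFGS update of the inverse Hessian estimate fits in a few lines. The accelerated randomized variants are the paper's contribution and are not sketched here.

```python
import numpy as np

def bfgs_inverse_update(H, s, y):
    # Classical BFGS update of the inverse Hessian estimate H, given the
    # step s = x_{k+1} - x_k and gradient change y = g_{k+1} - g_k.
    rho = 1.0 / (y @ s)
    I = np.eye(H.shape[0])
    V = I - rho * np.outer(s, y)
    return V @ H @ V.T + rho * np.outer(s, s)
```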

DifNet: Semantic Segmentation by Diffusion Networks

Deep Neural Networks (DNNs) have recently shown state-of-the-art performance on semantic segmentation tasks; however, they still suffer from poor boundary localization and spatially fragmented predictions. The difficulty lies in the requirement of making dense predictions from a long-path model all at once, since details are hard to keep when data goes through deeper layers. Instead, in this work, we decompose this difficult task into two relatively simple sub-tasks: seed detection, which produces initial predictions without requiring completeness or precision, and similarity estimation, which estimates the likelihood that any two nodes belong to the same class without needing to know which class that is. We use one branch network for each sub-task, and apply a cascade of random walk operations based on hierarchical semantics to approximate a complex diffusion process which propagates seed information to the whole image according to the estimated similarities. The proposed DifNet consistently improves over baseline models with the same depth and an equivalent number of parameters, and also achieves promising performance on the Pascal VOC 2012 and Pascal Context datasets. Our DifNet is trained end-to-end without complex loss functions.
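
A toy version of the seed-propagation idea, assuming a precomputed pairwise similarity matrix (the paper's cascade operates on hierarchical deep features rather than raw similarities):

```python
import numpy as np

def diffuse_seeds(seed, similarity, steps=10):
    # seed: (n, c) initial class predictions per node;
    # similarity: (n, n) non-negative pairwise affinities.
    # Row-normalize affinities into a random-walk transition matrix and
    # repeatedly propagate the seed predictions along it.
    P = similarity / similarity.sum(axis=1, keepdims=True)
    y = seed.copy()
    for _ in range(steps):
        y = P @ y
    return y
```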

Conditional Adversarial Domain Adaptation

Adversarial learning has been embedded into deep networks to learn transferable representations for domain adaptation. Existing adversarial domain adaptation methods may struggle to align different domains of the multimode distributions that arise naturally in classification problems. In this paper, we present conditional adversarial domain adaptation, a new framework that conditions the adversarial adaptation models on discriminative information conveyed in the classifier predictions. Conditional domain adversarial networks are proposed to enable discriminative adversarial adaptation of multimode domains. Experiments testify that the proposed approaches exceed the state-of-the-art results on three domain adaptation datasets.

Relating Leverage Scores and Density using Regularized Christoffel Functions

Statistical leverage scores emerged as a fundamental tool for matrix sketching and column sampling, with applications to low rank approximation, regression, random feature learning and quadrature. Yet, the very nature of this quantity is barely understood. Borrowing ideas from the orthogonal polynomial literature, we introduce the regularized Christoffel function associated to a positive definite kernel. This uncovers a variational formulation for leverage scores for kernel methods and allows us to elucidate their relationships with the chosen kernel as well as the population density. Our main result quantitatively describes a decreasing relation between leverage score and population density for a broad class of kernels on Euclidean spaces. Numerical simulations support our findings.

Non-Local Recurrent Network for Image Restoration

Many classic methods have shown non-local self-similarity in natural images to be an effective prior for image restoration. However, it remains unclear and challenging to make use of this intrinsic property via deep networks. In this paper, we propose a non-local recurrent network (NLRN) as the first attempt to incorporate non-local operations into a recurrent neural network (RNN) for image restoration. The main contributions of this work are: (1) Unlike existing methods that measure self-similarity in an isolated manner, the proposed non-local module can be flexibly integrated into existing deep networks for end-to-end training to capture deep feature correlation between each location and its neighborhood. (2) We fully employ the RNN structure for its parameter efficiency and allow deep feature correlation to be propagated along adjacent recurrent states. This new design boosts robustness against inaccurate correlation estimation due to severely degraded images. (3) We show that it is essential to maintain a confined neighborhood for computing deep feature correlation given degraded images. This is in contrast to existing practice that deploys the whole image. Extensive experiments on both image denoising and super-resolution tasks are conducted. Thanks to the recurrent non-local operations and correlation propagation, the proposed NLRN achieves superior results to state-of-the-art methods with many fewer parameters.
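
A bare-bones version of a non-local operation restricted to a confined neighborhood, written for a 1D feature sequence with the learned embedding maps dropped for brevity; the paper's module is its learned, recurrent counterpart:

```python
import numpy as np

def confined_nonlocal(x, radius):
    # x: (n, d) features. Each position attends only to positions within
    # `radius`, echoing the paper's finding that a confined neighborhood
    # beats whole-image correlation on degraded inputs.
    n, _ = x.shape
    out = np.zeros_like(x)
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        w = x[lo:hi] @ x[i]                 # dot-product affinities
        w = np.exp(w - w.max()); w /= w.sum()
        out[i] = w @ x[lo:hi]               # affinity-weighted aggregation
    return out
```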

Bayesian Semi-supervised Learning with Graph Gaussian Processes

We propose a data-efficient Gaussian process-based Bayesian approach to the semi-supervised learning problem on graphs. The proposed model shows extremely competitive performance when compared to the state-of-the-art graph neural networks on semi-supervised learning benchmark experiments, and outperforms the neural networks in active learning experiments where labels are scarce. Furthermore, the model does not require a validation data set for early stopping to control over-fitting. Our model can be viewed as an instance of empirical distribution regression weighted locally by network connectivity. We further motivate the intuitive construction of the model with a Bayesian linear model interpretation where the node features are filtered by an operator related to the graph Laplacian. The method can be easily implemented by adapting off-the-shelf scalable variational inference algorithms for Gaussian processes.

Foreground Clustering for Joint Segmentation and Localization in Videos and Images

This paper presents a novel framework in which video/image segmentation and localization are cast into a single optimization problem that integrates information from low level appearance cues with that of high level localization cues in a very weakly supervised manner. The proposed framework leverages two representations at different levels, exploits the spatial relationship between bounding boxes and superpixels as linear constraints and simultaneously discriminates between foreground and background at box and superpixel level. Different from previous approaches that mainly rely on discriminative clustering, we incorporate a foreground model that minimizes the histogram difference of an object across all image frames. Exploiting the geometric relation between the superpixels and bounding boxes enables the transfer of segmentation cues to improve localization output and vice-versa. Inclusion of the foreground model generalizes our discriminative framework to video data where the background tends to be similar and thus, not discriminative. We demonstrate the effectiveness of our unified framework on the YouTube Objects video dataset, Internet Object Discovery dataset and Pascal VOC 2007.

Video Prediction via Selective Sampling

Most adversarial-learning-based video prediction methods suffer from image blur, since the commonly used adversarial and regression losses work in a competitive rather than collaborative way, yielding a compromised, blurry result. Moreover, as they often rely on a single-pass architecture, the predictor cannot explicitly capture the forthcoming uncertainty. Our work involves two key insights: (1) Video prediction can be approached as a stochastic process: we sample a collection of proposals conforming to the possible frame distribution at the following time stamp, and select the final prediction from it. (2) Decoupling combined loss functions into dedicatedly designed sub-networks encourages them to work collaboratively. Combining the above two insights, we propose a two-stage network called VPSS (\textbf{V}ideo \textbf{P}rediction via \textbf{S}elective \textbf{S}ampling). Specifically, a \emph{Sampling} module produces a collection of high quality proposals, facilitated by a multiple-choice adversarial learning scheme that yields a diverse frame proposal set. Subsequently, a \emph{Selection} module selects high-probability candidates from the proposals and combines them to produce the final prediction. Extensive experiments on diverse challenging datasets demonstrate the effectiveness of the proposed video prediction approach, i.e., yielding more diverse proposals and more accurate prediction results.

Distilled Wasserstein Learning for Word Embedding and Topic Modeling

We propose a novel Wasserstein method with a distillation mechanism, yielding joint learning of word embeddings and topics. The proposed method is based on the fact that the Euclidean distance between word embeddings may be employed as the underlying distance in the Wasserstein topic model. The word distributions of topics, their optimal transport to the word distributions of documents, and the embeddings of words are learned in a unified framework. When learning the topic model, we leverage a distilled ground-distance matrix to update the topic distributions and smoothly calculate the corresponding optimal transports. Such a strategy provides the updating of word embeddings with robust guidance, improving algorithm convergence. As an application, we focus on patient admission records, in which the proposed method embeds the codes of diseases and procedures and learns the topics of admissions, obtaining superior performance on clinically-meaningful disease network construction, mortality prediction as a function of admission codes, and procedure recommendation.

Neural Guided Constraint Logic Programming for Program Synthesis

Synthesizing programs using example input/outputs is a classic problem in artificial intelligence. We present a method for solving Programming by Example (PBE) problems by using a neural model to guide the search of a constraint logic programming system called miniKanren. Internally, miniKanren represents a PBE problem as recursive constraints imposed by the provided examples. We present a Recurrent Neural Network model and a Gated Graph Neural Network model, both of which use these constraints as input to score candidate programs. We further present a transparent version of miniKanren that can be driven by an external agent, suitable for use by other researchers. We show that our neural-guided approach using constraints can synthesize programs faster in many cases, and has the potential to generalize to larger problems.

Genetic-Gated Networks for Deep Reinforcement Learning

We introduce Genetic-Gated Networks (G2Ns), simple neural networks that incorporate a gate vector composed of binary genetic genes into the hidden layer(s) of the network. Our method can take advantage of both gradient-free and gradient-based optimization: the former is effective for problems with multiple local minima, while the latter can quickly find local minima. In addition, multiple chromosomes can define different models, making it easy to construct multiple models, so the approach can be effectively applied to problems that require multiple models. We show that G2Ns can be applied to typical reinforcement learning algorithms to achieve large improvements in sample efficiency and performance.

Fighting Boredom in Recommender Systems with Linear Reinforcement Learning

A common assumption in recommender systems (RS) is the existence of a best fixed recommendation strategy. Such a strategy may be simple and work at the item level (e.g., in multi-armed bandits it is assumed that one best fixed arm/item exists) or implement a more sophisticated RS (e.g., the objective of A/B testing is to find the best fixed RS and execute it thereafter). We argue that this assumption is rarely verified in practice, as the recommendation process itself may impact the user’s preferences. For instance, a user may get bored by a strategy, while she may gain interest again if enough time has passed since the last time that strategy was used. In this case, a better approach consists in alternating different solutions at the right frequency to fully exploit their potential. In this paper, we first cast the problem as a Markov decision process, where the rewards are a linear function of the recent history of actions, and we show that a policy considering the long-term influence of the recommendations may outperform both fixed-action and contextual greedy policies. We then introduce an extension of the UCRL algorithm (LinUCRL) to effectively balance exploration and exploitation in an unknown environment, and we derive a regret bound that is independent of the number of states. Finally, we empirically validate the model assumptions and the algorithm in a number of realistic scenarios.

Enhancing the Accuracy and Fairness of Human Decision Making

Societies often rely on human experts to take a wide variety of decisions affecting their members, from jail-or-release decisions taken by judges and stop-and-frisk decisions taken by police officers to accept-or-reject decisions taken by academics. In this context, each decision is taken by an expert who is typically chosen uniformly at random from a pool of experts. However, these decisions may be imperfect due to limited experience, implicit biases, or faulty probabilistic reasoning. Can we improve the accuracy and fairness of the overall decision making process by optimizing the assignment between experts and decisions? In this paper, we address the above problem from the perspective of sequential decision making and show that, for different fairness notions from the literature, it reduces to a sequence of (constrained) weighted bipartite matchings, which can be solved efficiently using algorithms with approximation guarantees. Moreover, these algorithms also benefit from posterior sampling to actively trade off exploitation---selecting expert assignments which lead to accurate and fair decisions---and exploration---selecting expert assignments to learn about the experts' preferences and biases. We demonstrate the effectiveness of our algorithms on both synthetic and real-world data and show that they can significantly improve both the accuracy and fairness of the decisions taken by pools of experts.
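
The per-round reduction can be prototyped directly with an off-the-shelf assignment solver; the scores below are hypothetical placeholders for the estimated accuracy of each expert on each decision:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

score = np.random.rand(5, 5)  # hypothetical expert-by-decision accuracy estimates
# Maximum-weight bipartite matching (the solver minimizes, so negate scores).
experts, decisions = linear_sum_assignment(-score)
```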

Temporal Regularization for Markov Decision Process

Several applications of Reinforcement Learning suffer from instability due to high variance. This is especially prevalent in high dimensional domains. Regularization is a commonly used technique in machine learning to reduce variance, at the cost of introducing some bias. Most existing regularization techniques focus on spatial (perceptual) regularization. Yet in reinforcement learning, due to the nature of the Bellman equation, there is an opportunity to also exploit temporal regularization based on smoothness in value estimates over trajectories. This paper explores a class of methods for temporal regularization. We formally characterize the bias induced by this technique using Markov chain concepts. We illustrate the various characteristics of temporal regularization via a sequence of simple discrete and continuous MDPs, and show that the technique provides improvement even in high-dimensional Atari games.
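
A minimal sketch of how a temporal-regularization term can enter a TD(0) update, in our own notation: the bootstrapped target is mixed with the value of the previous state along the trajectory, and beta = 0 recovers plain TD(0). This illustrates the flavor of the idea, not the paper's exact estimator.

```python
def td0_temporal_reg(V, trajectory, alpha=0.1, gamma=0.99, beta=0.2):
    # trajectory: list of (state, reward, next_state) transitions.
    prev_v = None
    for s, r, s_next in trajectory:
        target = r + gamma * V[s_next]
        if prev_v is not None:
            # Smooth the target toward the previous state's value.
            target = (1 - beta) * target + beta * prev_v
        V[s] += alpha * (target - V[s])
        prev_v = V[s]
    return V
```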

The Pessimistic Limits and Possibilities of Margin-based Losses in Semi-supervised Learning

Consider a classification problem where we have both labeled and unlabeled data available. We show that for linear classifiers defined by convex margin- based surrogate losses that are decreasing, it is impossible to construct any semi-supervised approach that is able to guarantee an improvement over the supervised classifier measured by this surrogate loss on the labeled and unlabeled data. For convex margin-based loss functions that also increase, we demonstrate safe improvements are possible.

Simple random search of static linear policies is competitive for reinforcement learning

We introduce a random search method for training static, linear policies for continuous control problems, matching state-of-the-art sample efficiency on the benchmark MuJoCo locomotion tasks. We evaluate the performance of our method over hundreds of random seeds and many different hyperparameter configurations for each benchmark task. Our simulations highlight a high variability in performance in these benchmark tasks, indicating that commonly used estimations of sample efficiency do not adequately evaluate the performance of RL algorithms.
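
The family of method in question fits in a dozen lines. Below is one basic random search step over linear policy parameters, assuming a user-supplied rollout(theta) function that returns an episode's total reward; the hyperparameter values are illustrative only.

```python
import numpy as np

def random_search_step(theta, rollout, n_dirs=8, nu=0.03, lr=0.02):
    # Probe symmetric random perturbations of the policy parameters and
    # move along the reward-weighted average direction.
    update = np.zeros_like(theta)
    for _ in range(n_dirs):
        delta = np.random.randn(*theta.shape)
        update += (rollout(theta + nu * delta) -
                   rollout(theta - nu * delta)) * delta
    return theta + lr / (n_dirs * nu) * update
```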

Generating Informative and Diverse Conversational Responses via Adversarial Information Maximization

Responses generated by neural conversational models tend to lack informativeness and diversity. We present Adversarial Information Maximization (AIM), an adversarial learning framework that addresses these two related but distinct problems. To foster response diversity, we leverage adversarial training that allows distributional matching of synthetic and real responses. To improve informativeness, our framework explicitly optimizes a variational lower bound on pairwise mutual information between query and response. Empirical results from automatic and human evaluations demonstrate that our methods significantly boost informativeness and diversity.

Entropy and mutual information in models of deep neural networks

We examine a class of deep learning models with a tractable method to compute information-theoretic quantities. Our contributions are three-fold: (i) We show how entropies and mutual informations can be derived from heuristic statistical physics methods, under the assumption that the weight matrices are independent and orthogonally-invariant. (ii) We extend particular cases in which this result is known to be rigorously exact by providing a proof for two-layer networks with Gaussian random weights, using the recently introduced adaptive interpolation method. (iii) We propose an experimental framework with generative models of synthetic datasets, on which we train deep neural networks with a weight constraint designed so that the assumption in (i) is verified during learning. We study the behavior of entropies and mutual informations throughout learning and conclude that, in the proposed setting, the relationship between compression and generalization remains elusive.

Collaborative Learning for Deep Neural Networks

We introduce collaborative learning, in which multiple classifier heads of the same network are simultaneously trained on the same training data to improve generalization and robustness to label noise with no extra inference cost. It combines the strengths of auxiliary training, multi-task learning and knowledge distillation. There are two important mechanisms involved in collaborative learning. First, the consensus of multiple views from different classifier heads on the same example provides supplementary information as well as regularization to each classifier, thereby improving generalization. Second, intermediate-level representation (ILR) sharing with backpropagation rescaling aggregates the gradient flows from all heads, which not only reduces training computational complexity, but also facilitates supervision to the shared layers. The empirical results on CIFAR and ImageNet datasets demonstrate that deep neural networks learned as a group in a collaborative way significantly reduce the generalization error and increase robustness to label noise.

High Dimensional Linear Regression using Lattice Basis Reduction

We consider a high dimensional linear regression problem where the goal is to efficiently recover an unknown vector \beta^* from n noisy linear observations Y = X\beta^* + W \in \mathbb{R}^n, for known X \in \mathbb{R}^{n \times p} and unknown W \in \mathbb{R}^n. Unlike most of the literature on this model, we make no sparsity assumption on \beta^*. Instead we adopt a regularization based on assuming that the underlying vectors \beta^* have rational entries with the same denominator Q \in \mathbb{Z}_{>0}. We call this the Q-rationality assumption. We propose a new polynomial-time algorithm for this task which is based on the seminal Lenstra-Lenstra-Lovasz (LLL) lattice basis reduction algorithm. We establish that under the Q-rationality assumption, our algorithm recovers exactly the vector \beta^* for a large class of distributions for the iid entries of X and non-zero noise W. We prove that it is successful under small noise, even when the learner has access to only one observation (n=1). Furthermore, we prove that in the case of Gaussian white noise for W, n=o(p/\log p) and sufficiently large Q, our algorithm tolerates a nearly optimal information-theoretic level of noise.

Symbolic Graph Reasoning Meets Convolutions

Beyond local convolution networks, we explore how to harness various external human knowledge to endow networks with the capability of semantic global reasoning. Rather than using separate graphical models (e.g. CRF) or constraints for modeling broader dependencies, we propose a new Symbolic Graph Reasoning (SGR) layer, which performs reasoning over a group of symbolic nodes whose outputs explicitly represent different properties of each semantic entity in a prior knowledge graph. To cooperate with local convolutions, each SGR layer is constituted by three modules: a) a primal local-to-semantic voting module, where the features of all symbolic nodes are generated by voting from local representations; b) a graph reasoning module, which propagates information over the knowledge graph to achieve global semantic coherency; c) a dual semantic-to-local mapping module, which learns new associations of the evolved symbolic nodes with local representations and accordingly enhances local features. The SGR layer can be injected between any convolution layers and instantiated with distinct prior graphs. Extensive experiments show that incorporating SGR significantly improves plain ConvNets on three semantic segmentation tasks and one image classification task. Further analyses show that the SGR layer learns shared symbolic representations for domains/datasets with different label sets given a universal knowledge graph, demonstrating its superior generalization capability.

DVAE#: Discrete Variational Autoencoders with Relaxed Boltzmann Priors

Boltzmann machines are powerful distributions that have been shown to be an effective prior over binary latent variables in variational autoencoders (VAEs). However, previous methods for training discrete VAEs have used the evidence lower bound and not the tighter importance-weighted bound. We propose two approaches for relaxing Boltzmann machines to continuous distributions that permit training with importance-weighted bounds. These relaxations are based on generalized overlapping transformations and the Gaussian integral trick. Experiments on the MNIST and OMNIGLOT datasets show that these relaxations outperform previous discrete VAEs with Boltzmann priors.

Partially-Supervised Image Captioning

Image captioning models are becoming increasingly successful at describing the content of images in restricted domains. However, if these models are to function in the wild --- for example, as aids for the visually impaired --- a much larger number and variety of visual concepts must be understood. In this work, we teach image captioning models new visual concepts with partial supervision, such as available from object detection and image label datasets. As these datasets contain text fragments rather than complete captions, we formulate this problem as learning from incomplete data. To flexibly characterize our uncertainty about the unobserved complete sequence, we represent each incomplete training sequence with its own finite state automaton encoding acceptable completions. We then propose a novel algorithm for training sequence models, such as recurrent neural networks, on incomplete sequences specified in this manner. In the context of image captioning, our method lifts the restriction that previously required image captioning models to be trained on paired image-sentence corpora only, or otherwise required specialized model architectures to take advantage of alternative data modalities. Applying our approach to an existing neural captioning model, we achieve state of the art results on the novel object captioning task using the COCO dataset. We further show that we can train a captioning model to describe new visual concepts from the Open Images dataset while maintaining competitive COCO evaluation scores.

3D-Aware Scene Manipulation via Inverse Graphics

We aim to obtain an interpretable, expressive, and disentangled scene representation that contains comprehensive structural and textural information for each object. Previous representations learned by neural networks are often uninterpretable, limited to a single object, or lacking 3D knowledge. In this work, we propose 3D scene de-rendering networks (3D-SDN) to address the above issues by integrating disentangled representations for object semantics, appearance, and geometry into a deep generative model. Our scene encoder performs inverse graphics, translating a scene into a structured object representation. Our decoder has two components: a differentiable shape renderer and a neural texture generator. The disentanglement of semantics, appearance, and geometry supports various 3D-aware scene manipulation applications, \eg, rotating and moving objects freely while maintaining consistent shape and texture, and changing the object appearance without affecting its shape. We systematically evaluate our model and demonstrate that our editing scheme based on 3D-SDN is superior to its 2D counterpart.

Random Feature Stein Discrepancies

Computable Stein discrepancies (SDs) have been deployed for a variety of applications, ranging from sampler selection in posterior inference to goodness-of-fit testing. Existing convergence-determining SDs admit strong theoretical guarantees but suffer from a computational cost that grows quadratically in the sample size. While linear-time SDs have been proposed for goodness-of-fit testing, they exhibit significant degradations in testing power---even when power is explicitly optimized. To address these shortcomings, we introduce feature Stein discrepancies (FSDs), a new family of quality measures that can be cheaply approximated using importance sampling. We show how to construct FSDs that provably determine the convergence of a sample to its target and develop high-accuracy approximations---random FSDs (RFSDs)---which are computable in near-linear time. In our experiments with sampler selection for approximate posterior inference and goodness-of-fit testing, RFSDs typically perform as well or better than quadratic-time KSDs while being orders of magnitude faster to compute.

Distributed Stochastic Optimization via Adaptive SGD

Stochastic convex optimization algorithms are the most popular way to train machine learning models on large-scale data. Scaling up the training process of these models is crucial, but the most popular algorithm, Stochastic Gradient Descent (SGD), is a serial method that is surprisingly hard to parallelize. In this paper, we propose an efficient distributed stochastic optimization method by combining adaptive step sizes with variance reduction techniques. We achieve a linear speedup in the number of machines, constant memory footprint, and only a logarithmic number of communication rounds. Critically, our approach is a black-box reduction that parallelizes any serial SGD algorithm, allowing us to leverage the significant progress that has been made in designing adaptive SGD algorithms. In particular, we achieve optimal convergence rates without any prior knowledge of smoothness parameters, yielding a more robust algorithm that reduces the need for hyperparameter tuning. We implement our algorithm in the Spark distributed framework and exhibit dramatic performance gains on large-scale logistic regression problems.

Precision and Recall for Time Series

Classical anomaly detection is principally concerned with point-based anomalies, those anomalies that occur at a single point in time. Yet, many real-world anomalies are range-based, meaning they occur over a period of time. In this paper, we present a new model that more accurately measures the correctness of anomaly detection systems for range-based anomalies, while subsuming the classical model's ability to classify point-based anomaly detection systems.
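
A toy instance of a range-based recall, keeping only the overlap term (the paper's full model additionally weights position within a range and the number of fragments detecting it):

```python
def range_recall(real_ranges, pred_ranges):
    # Ranges are (start, end) pairs; recall for each real anomaly is the
    # fraction of its extent covered by predictions, averaged over anomalies.
    def overlap(a, b):
        return max(0, min(a[1], b[1]) - max(a[0], b[0]))
    scores = [min(1.0, sum(overlap(r, p) for p in pred_ranges) / (r[1] - r[0]))
              for r in real_ranges]
    return sum(scores) / len(scores)
```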

Deep Attentive Tracking via Reciprocative Learning

Visual attention, derived from cognitive neuroscience, facilitates human perception on the most pertinent subset of the sensory data. Recently, significant efforts have been made to exploit attention schemes to advance computer vision systems. For visual tracking, it is often challenging to track target objects undergoing large appearance changes. Attention maps facilitate visual tracking by selectively paying attention to temporally invariant motion patterns. Existing tracking-by-detection approaches mainly use additional attention modules to generate feature weights as the classifiers are not equipped with such mechanisms. In this paper, we propose a reciprocative learning algorithm to exploit visual attention for training deep classifiers. The proposed algorithm consists of feed-forward and backward operations to generate attention maps, which serve as regularization terms coupled with the original classification loss function for training. The deep classifier learns to attend to the regions of target objects robust to appearance changes. Extensive experiments on large-scale benchmark datasets show that the proposed attentive tracking method performs favorably against the state-of-the-art approaches.

Virtual Class Enhanced Discriminative Embedding Learning

Recently, learning discriminative features to improve recognition performance has gradually become the primary goal of deep learning, and numerous remarkable works have emerged. In this paper, we propose a novel yet extremely simple method, Virtual Softmax, to enhance the discriminative property of learned features by injecting a dynamic virtual negative class into the original softmax. Injecting the virtual class aims to enlarge the inter-class margin and compress the intra-class distribution by strengthening the decision boundary constraint. Although it may seem odd to optimize with this additional virtual class, we show that our method derives from an intuitive and clear motivation, and that it indeed encourages the features to be more compact and separable. This paper empirically and experimentally demonstrates the superiority of Virtual Softmax, improving performance on a variety of object classification and face verification tasks.
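
As we read the construction (treat the details here as our assumption, not a quotation of the paper), the injected virtual logit for an example with label y equals ||W_y|| * ||x||, the largest logit any class weight of that norm could assign to x, which tightens the margin demanded of the true class:

```python
import numpy as np

def virtual_softmax_logits(x, W, y):
    # x: (n, d) features; W: (d, C) class weights; y: (n,) integer labels.
    logits = x @ W                                         # (n, C)
    virt = np.linalg.norm(x, axis=1) * np.linalg.norm(W[:, y], axis=0)
    return np.hstack([logits, virt[:, None]])              # (n, C + 1)
```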

Attention in Convolutional LSTM for Gesture Recognition

Convolutional long short-term memory (LSTM) networks have been widely used for action/gesture recognition, and different attention mechanisms have also been embedded into the LSTM or the convolutional LSTM (ConvLSTM) networks. Based on the previous gesture recognition architectures which combine the three-dimensional convolutional neural network (3DCNN) and ConvLSTM, this paper explores the effects of attention mechanisms in ConvLSTM. Several variants of ConvLSTM are evaluated: (a) Removing the convolutional structures of the three gates in ConvLSTM, (b) Applying the attention mechanism on the input of ConvLSTM, (c) Reconstructing the input and (d) output gates respectively with the modified channel-wise attention mechanism. The evaluation results demonstrate that the spatial convolutions in the three gates scarcely contribute to the spatiotemporal feature fusion, and the attention mechanisms embedded into the input and output gates cannot improve the feature fusion. In other words, ConvLSTM mainly contributes to the temporal fusion along with the recurrent steps to learn the long-term spatiotemporal features, when taking as input the spatial or spatiotemporal features. On this basis, a new variant of LSTM is derived, in which the convolutional structures are only embedded into the input-to-state transition of LSTM. The code of the LSTM variants is publicly available.

Universal Growth in Production Economies

We study a simple variant of the von Neumann model of an expanding economy, in which multiple producers produce goods according to their production function. The players trade their goods at the market and then use the bundles received as inputs for the production in the next round. We show that a simple decentralized dynamic, where players update their bids on the goods in the market proportionally to how useful the investments were, leads to growth of the economy in the long term (whenever growth is possible) but also creates unbounded inequality, i.e. very rich and very poor players emerge. We analyze several other phenomena, such as how the relation of a player with others influences its development and the Gini index of the system.

Bayesian Model Selection Approach to Boundary Detection with Non-Local Priors

We propose a Bayesian model selection (BMS) boundary detection procedure using non-local prior distributions for a sequence of data with multiple systematic mean changes. By using non-local priors in the Bayesian model selection framework, the BMS method can effectively suppress non-boundary spike points with large instantaneous changes. Further, we speed up the algorithm by reducing the multiple change point problem to a series of single change point detection problems. We establish the consistency of the estimated number and locations of the change points under various prior distributions. Extensive simulation studies are conducted to compare the BMS with existing methods, and our method is illustrated with an application to magnetic resonance imaging-guided radiation therapy data.

Efficient Stochastic Gradient Hard Thresholding

Stochastic gradient hard thresholding methods have recently been shown to work favorably in solving large-scale empirical risk minimization problems under sparsity or rank constraints. Despite the improved iteration complexity over full gradient methods, the gradient evaluation and hard thresholding complexity of the existing stochastic algorithms usually scales linearly with the data size, which can still be expensive when data is huge, and the hard thresholding step can be as expensive as a singular value decomposition in rank-constrained problems. To address these deficiencies, we propose an efficient hybrid stochastic gradient hard thresholding (HSG-HT) method that can be provably shown to have sample-size-independent gradient evaluation and hard thresholding complexity bounds. Specifically, we prove that the stochastic gradient evaluation complexity of HSG-HT scales linearly with the inverse of the sub-optimality, and that its hard thresholding complexity scales logarithmically. By applying the heavy ball acceleration technique, we further propose an accelerated variant of HSG-HT which can be shown to have improved dependence on the restricted condition number. Numerical results confirm our theoretical analysis and demonstrate the computational efficiency of the proposed methods.
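
The hard thresholding operator that gives these methods their name keeps only the k largest-magnitude coordinates; a minimal dense-vector version:

```python
import numpy as np

def hard_threshold(x, k):
    # Project x onto the set of k-sparse vectors by zeroing all but the
    # k entries of largest magnitude.
    out = np.zeros_like(x)
    keep = np.argsort(np.abs(x))[-k:]
    out[keep] = x[keep]
    return out
```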

SplineNets: Continuous Neural Decision Graphs

We present SplineNets, a practical and novel approach for using conditioning in convolutional neural networks (CNNs). Our method dramatically reduces runtime complexity and computation costs of CNNs, while maintaining or even increasing accuracy. SplineNets employ a unified loss function with a desired level of smoothness over both the network and decision parameters, while allowing for sparse activation of a subset of nodes for individual samples. Thus, functions of SplineNets are both dynamic (i.e., conditioned on the input) and hierarchical (i.e., conditioned on the computational path). In particular, we embed weights of functions on smooth, low dimensional manifolds parameterized by compact B-splines, and define decisions as choosing a position on these hyper-surfaces. We further show that by maximizing the mutual information between these latent coordinates and data or its labels, the network can be optimally utilized and specialized. Experiments on various image classification datasets show the power of this new paradigm over regular CNNs.

Generalized Zero-Shot Learning with Deep Calibration Network

A technical challenge of deep learning is recognizing target classes without seen data. Zero-shot learning leverages semantic representations such as attributes or class prototypes to bridge source and target classes. Existing standard zero-shot learning methods may be prone to overfitting the seen data of source classes, as they are blind to the semantic representations of target classes. In this paper, we study generalized zero-shot learning, which assumes that the semantic representations of target classes are accessible during training, and in which prediction on unseen data is made by searching over both source and target classes. We propose a novel Deep Calibration Network (DCN) approach to this generalized zero-shot learning paradigm, which enables simultaneous calibration of deep networks on the confidence of source classes and the uncertainty of target classes. Our approach maps visual features of images and semantic representations of class prototypes to a common embedding space such that the compatibility of seen data with both source and target classes is maximized. We show superior accuracy of our approach over the state of the art on benchmark datasets for generalized zero-shot learning, including AwA, CUB, SUN, and aPY.

Neural Architecture Search with Bayesian Optimisation and Optimal Transport

Bayesian Optimisation (BO) refers to a class of methods for global optimisation of a function f which is only accessible via point evaluations. It is typically used in settings where f is expensive to evaluate. A common use case for BO in machine learning is model selection, where it is not possible to analytically model the generalisation performance of a statistical model, and we resort to noisy and expensive training and validation procedures to choose the best model. Conventional BO methods have focused on Euclidean and categorical domains, which, in the context of model selection, only permits tuning scalar hyper-parameters of machine learning algorithms. However, with the surge of interest in deep learning, there is an increasing demand to tune neural network architectures. In this work, we develop NASBOT, a Gaussian process based BO framework for neural architecture search. To accomplish this, we develop a distance metric in the space of neural network architectures which can be computed efficiently via an optimal transport program. This distance might be of independent interest to the deep learning community as it may find applications outside of BO. We demonstrate that NASBOT outperforms other alternatives for architecture search in several cross validation based model selection tasks on multi-layer perceptrons and convolutional neural networks.

Embedding Logical Queries on Knowledge Graphs

Learning low-dimensional embeddings of knowledge graphs is a powerful approach used to predict unobserved or missing edges between entities. However, an open challenge in this area is developing techniques that can go beyond simple edge prediction and handle more complex logical queries, which might involve multiple unobserved edges, entities, and variables. For instance, given an incomplete biological knowledge graph, we might want to predict "what drugs are likely to target proteins involved with both diseases X and Y?" -- a query that requires reasoning about all possible proteins that might interact with diseases X and Y. Here we introduce a framework to efficiently make predictions about conjunctive logical queries -- a flexible but tractable subset of first-order logic -- on incomplete knowledge graphs. In our approach, we embed graph nodes in a low-dimensional space and represent logical operators as learned geometric operations (e.g., translation, rotation) in this embedding space. By performing logical operations within a low-dimensional embedding space, our approach achieves a time complexity that is linear in the number of query variables, compared to the exponential complexity required by a naive enumeration-based approach. We demonstrate the utility of this framework in two application studies on real-world datasets with millions of relations: predicting logical relationships in a network of drug-gene-disease interactions and in a graph-based representation of social interactions derived from a popular web forum.

Learning Optimal Reserve Price against Non-myopic Bidders

We consider the problem of learning the optimal reserve price in repeated auctions against non-myopic bidders, who may bid strategically in order to gain in future rounds even if the single-round auctions are truthful. Previous algorithms, e.g., empirical pricing, do not provide non-trivial regret bounds in this setting in general. We introduce algorithms that obtain small regret against non-myopic bidders either when the market is large, i.e., no bidder appears in a constant fraction of the rounds, or when the bidders are impatient, i.e., they discount future utility by some factor mildly bounded away from one. Our approach carefully controls what information is revealed to each bidder, and builds on techniques from differentially private online learning as well as the recent line of work on jointly differentially private algorithms.

Sequential Context Encoding for Duplicate Removal

Duplicate removal is a critical step to accomplish a reasonable number of predictions in prevalent proposal-based object detection frameworks. Albeit simple and effective, most previous algorithms utilize a greedy process without making sufficient use of the properties of the input data. In this work, we design a new two-stage framework to effectively select the appropriate proposal candidate for each object. The first stage suppresses most of the easy negative object proposals, while the second stage selects true positives from the reduced proposal set. These two stages share the same network structure: an encoder and a decoder formed as recurrent neural networks (RNN) with global attention and a context gate. The encoder scans proposal candidates in a sequential manner to capture the global context information, which is then fed to the decoder to extract optimal proposals. In our extensive experiments, the proposed method outperforms other alternatives by a large margin.

Discovery of Latent 3D Keypoints via End-to-end Geometric Reasoning

This paper presents KeypointNet, an end-to-end geometric reasoning framework to learn an optimal set of category-specific keypoints, along with their detectors, to predict 3D keypoints from a single 2D input image. We demonstrate this framework on the 3D pose estimation task by proposing a differentiable pose objective that seeks the optimal set of keypoints for recovering the relative pose between two views of an object. Our network automatically discovers a consistent set of keypoints across viewpoints of a single object as well as across all object instances of a given object class. Importantly, we find that our end-to-end approach using no ground-truth keypoint annotations outperforms a fully supervised baseline using the same neural network architecture for the pose estimation task. The discovered 3D keypoints across the car, chair, and plane categories of ShapeNet are visualized at https://keypoints.github.io/

Nonparametric learning for Bayesian models via randomized objective functions

We present a Bayesian nonparametric (NP) approach to learning from data that is centered around a conventional probabilistic model, but does not assume that this model is true. This affords a trivially parallelizable, scalable Monte Carlo sampling scheme based on the notion of randomized objective functions, which map posterior samples from the baseline model into posterior samples from the NP update. This is particularly attractive for regularizing NP methods or correcting approximate models, such as variational Bayes (VB). We demonstrate the approach on a number of examples including VB classifiers and Bayesian random forests.

SEGA: Variance Reduction via Gradient Sketching

We propose a novel randomized first order optimization method---SEGA (SkEtched GrAdient method)---which progressively throughout its iterations builds a variance-reduced estimate of the gradient from random linear measurements (sketches) of the gradient provided at each iteration by an oracle. In each iteration, SEGA updates the current estimate of the gradient through a sketch-and-project operation using the information provided by the latest sketch, and this is subsequently used to compute an unbiased estimate of the true gradient through a random relaxation procedure. This unbiased estimate is then used to perform a gradient step. Unlike standard subspace descent methods, such as coordinate descent, SEGA can be used for optimization problems with a non-separable proximal term. We provide a general convergence analysis and prove linear convergence for strongly convex objectives. In the special case of coordinate sketches, SEGA can be enhanced with various techniques such as importance sampling, minibatching and acceleration, and its rate is up to a small constant factor identical to the best-known rate of coordinate descent.
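
A minimal sketch of SEGA in the coordinate-sketch special case, assuming a hypothetical oracle `grad_i` that reveals one random partial derivative per iteration; the scaling by the dimension makes the gradient estimate unbiased.

```python
import numpy as np

def sega(x0, grad_i, n_iters=1000, step=0.1, seed=0):
    rng = np.random.default_rng(seed)
    d = len(x0)
    x, h = x0.astype(float), np.zeros(d)   # h: running gradient estimate
    for _ in range(n_iters):
        i = rng.integers(d)
        gi = grad_i(x, i)                  # oracle: d f / d x_i at x
        g = h.copy()
        g[i] += d * (gi - h[i])            # unbiased: E[g] = grad f(x)
        h[i] = gi                          # sketch-and-project update of h
        x -= step * g                      # gradient step with the estimate
    return x

# Toy quadratic f(x) = 0.5 ||x||^2, whose partial derivatives are x_i.
x_star = sega(np.ones(10), lambda x, i: x[i])
print(np.linalg.norm(x_star))
```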

Automatic Program Synthesis of Long Programs with a Learned Garbage Collector

We consider the problem of automatically generating code given sample input-output pairs. We train a neural network to map from the current state and the outputs to the program's next statement. The neural network optimizes multiple tasks concurrently: the next operation out of a set of high-level commands, the operands of the next statement, and which variables can be dropped from memory. Using our method we are able to create programs that are more than twice as long as existing state-of-the-art solutions, while improving the success rate for comparable lengths, and cutting the run-time by two orders of magnitude.

One-Shot Unsupervised Cross Domain Translation

Given a single image $x$ from domain $A$ and a set of images from domain $B$, our task is to generate the analog of $x$ in $B$. We argue that this task could be a key AI capability that underlies the ability of cognitive agents to act in the world, and present empirical evidence that the existing unsupervised domain translation methods fail on this task. Our method follows a two-step process. First, a variational autoencoder for domain $B$ is trained. Then, given the new sample $x$, we create a variational autoencoder for domain $A$ by adapting the layers that are close to the image in order to directly fit $x$, and only indirectly adapt the other layers. Our experiments indicate that the new method does as well, when trained on one sample $x$, as the existing domain transfer methods, when these enjoy a multitude of training samples from domain $A$. Our code will be made publicly available.

Regularizing by the Variance of the Activations' Sample-Variances

Normalization techniques play an important role in supporting efficient and often more effective training of deep neural networks. While conventional methods explicitly normalize the activations, we suggest adding a loss term instead. This new loss term encourages the variance of the activations to be stable and not vary from one random mini-batch to the next. As we prove, this encourages the activations to be distributed around a few distinct modes. We also show that if the inputs are from a mixture of two Gaussians, the new loss would either merge the two together or separate them optimally in the LDA sense, depending on the prior probabilities. Finally, we are able to link the new regularization term to the batchnorm method, which provides it with a regularization perspective. Our experiments demonstrate an improvement in accuracy over the batchnorm technique for both CNNs and fully connected networks.
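
A rough sketch of one way to implement such a regularizer, assuming we penalize the squared deviation of each activation's mini-batch sample variance from a running estimate; the paper's exact loss may differ.

```python
import numpy as np

class VarianceStabilityLoss:
    """Penalize drift of per-feature sample variance across mini-batches."""
    def __init__(self, n_features, momentum=0.9, weight=1e-2):
        self.running_var = np.ones(n_features)
        self.momentum, self.weight = momentum, weight

    def __call__(self, activations):            # activations: (batch, features)
        batch_var = activations.var(axis=0)
        penalty = self.weight * np.mean((batch_var - self.running_var) ** 2)
        # Track a smoothed estimate of the sample variance.
        self.running_var = (self.momentum * self.running_var
                            + (1 - self.momentum) * batch_var)
        return penalty                           # add this to the task loss

reg = VarianceStabilityLoss(n_features=64)
acts = np.random.default_rng(0).normal(size=(32, 64))
print(reg(acts))
```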

Overlapping Clustering, and One (class) SVM to Bind Them All

People belong to multiple communities, words belong to multiple topics, and books cover multiple genres; overlapping clusters are commonplace. Many existing overlapping clustering methods model each person (or word, or book) as a non-negative weighted combination of "exemplars" who belong solely to one community, with some small noise. Geometrically, each person is a point on a cone whose corners are these exemplars. This basic form encompasses the widely used Mixed Membership Stochastic Blockmodel of networks and its degree-corrected variants, as well as topic models such as LDA. We show that a simple one-class SVM yields provably consistent parameter inference for all such models, and scales to large datasets. Experimental results on several simulated and real datasets show our algorithm (called \svmcone) is both accurate and scalable.

Algorithmic Linearly Constrained Gaussian Processes

We algorithmically construct multi-output Gaussian process priors which satisfy linear differential equations. Our approach attempts to parametrize all solutions of the equations using Gröbner bases. If successful, a pushforward Gaussian process along the parametrization is the desired prior. We consider several examples from physics, geomathematics and control, among them the full inhomogeneous system of Maxwell's equations. By bringing together stochastic learning and computer algebra in a novel way, we combine noisy observations with precise algebraic computations.

DeepExposure: Learning to Expose Photos with Asynchronously Reinforced Adversarial Learning

Accurate exposure is key to capturing high-quality photos in computational photography, especially for mobile phones whose camera modules are limited in size. Inspired by the exposure blending with luminosity masks often applied by professional photographers, in this paper we develop a novel algorithm for learning local exposures with deep reinforcement learning and adversarial learning. Specifically, we segment an image into sub-images that reflect variations of dynamic range exposures according to raw low-level features. Based on these sub-images, a local exposure for each sub-image is learned sequentially by a policy network, while the reward of learning is designed globally to strike a balance among overall exposures. The aesthetic evaluation function is approximated by the discriminator of a generative adversarial network. The reinforcement learning and the adversarial learning are trained collaboratively by asynchronous deterministic policy gradient and generative loss approximation. To further simplify the algorithmic architecture, we also prove the feasibility of leveraging the discriminator as the value function. Furthermore, we apply each learned local exposure to retouch the raw input image, delivering multiple retouched images under different exposures that are then fused with exposure blending. Extensive experiments verify that our algorithm is superior to state-of-the-art methods in terms of quantitative accuracy and visual quality.

Norm matters: efficient and accurate normalization schemes in deep networks

Over the past few years, Batch-Normalization has been commonly used in deep networks, allowing faster training and high performance for a wide variety of applications. However, the reasons behind its merits remain largely unexplained, and several shortcomings have hindered its use for certain tasks. In this work, we present a novel view on the purpose and function of normalization methods and weight-decay, as tools to decouple weights' norm from the underlying optimized objective. This property highlights the connection between practices such as normalization, weight decay and learning-rate adjustments. We suggest several alternatives to the widely used $L^2$ batch-norm, using normalization in $L^1$ and $L^\infty$ spaces that can substantially improve numerical stability in low-precision implementations as well as provide computational and memory benefits. We demonstrate that such methods enable the first batch-norm alternative to work for half-precision implementations. Finally, we suggest a modification to weight-normalization, which improves its performance on large-scale tasks.
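
A minimal sketch of an $L^1$ batch-norm variant in the spirit described above; the $\sqrt{\pi/2}$ factor makes the mean absolute deviation match the standard deviation for Gaussian activations (learned scale and shift parameters are omitted here).

```python
import numpy as np

def l1_batch_norm(x, eps=1e-5):
    """Normalize each feature by its mean absolute deviation instead of
    its standard deviation (avoids squaring, which helps low precision)."""
    mu = x.mean(axis=0)                          # x: (batch, features)
    centered = x - mu
    mad = np.abs(centered).mean(axis=0)          # mean absolute deviation
    return centered / (np.sqrt(np.pi / 2) * mad + eps)

x = np.random.default_rng(0).normal(loc=3.0, scale=2.0, size=(128, 4))
print(l1_batch_norm(x).std(axis=0))              # close to 1 for Gaussian x
```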

Dual Principal Component Pursuit: Improved Analysis and Efficient Algorithms

Recent methods for learning a linear subspace from data corrupted by outliers are based on convex L1 and nuclear norm optimization and require the dimension of the subspace and the number of outliers to be sufficiently small [1]. In sharp contrast, the recently proposed Dual Principal Component Pursuit (DPCP) method [2] can provably handle subspaces of high dimension by solving a non-convex L1 optimization problem on the sphere. However, its geometric analysis is based on quantities that are difficult to interpret and are not amenable to statistical analysis. In this paper we provide a refined geometric analysis and a new statistical analysis that show that DPCP can tolerate as many outliers as the square of the number of inliers, thus improving upon other provably correct robust PCA methods. We also propose a scalable Projected Sub-Gradient Descent method (DPCP-PSGD) for solving the DPCP problem and show it admits linear convergence even though the underlying optimization problem is non-convex and non-smooth. Experiments on road plane detection from 3D point cloud data demonstrate that DPCP-PSGD can be more efficient than the traditional RANSAC algorithm, which is one of the most popular methods for such computer vision applications.
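
A rough sketch of projected subgradient descent for the DPCP objective $\min_{\|b\|=1} \|X^T b\|_1$, whose minimizer is a normal vector to the inlier subspace; the step-size schedule and initialization here are common defaults, not necessarily the paper's.

```python
import numpy as np

def dpcp_psgd(X, n_iters=200, step0=1e-2, seed=0):
    """X: (d, n) data matrix (columns are points). Returns a unit normal b."""
    rng = np.random.default_rng(seed)
    b = rng.normal(size=X.shape[0])
    b /= np.linalg.norm(b)
    for k in range(n_iters):
        g = X @ np.sign(X.T @ b)               # subgradient of ||X^T b||_1
        b = b - (step0 / (k + 1)) * g          # diminishing step size
        b /= np.linalg.norm(b)                 # project back to the sphere
    return b

# Toy data: 2-D inlier subspace in R^3 plus a few outliers.
rng = np.random.default_rng(1)
inliers = np.vstack([rng.normal(size=(2, 200)), np.zeros((1, 200))])
outliers = rng.normal(size=(3, 20))
print(dpcp_psgd(np.hstack([inliers, outliers])))   # roughly +/- e_3
```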

MULAN: A Blind and Off-Grid Method for Multichannel Echo Retrieval

This paper addresses the general problem of blind echo retrieval, i.e., given M sensors measuring in the discrete-time domain M mixtures of K delayed and attenuated copies of an unknown source signal, can the echo locations and weights be recovered? This problem has broad applications in fields such as sonar, seismology, ultrasound or room acoustics. It belongs to the broader class of blind channel identification problems, which have been intensively studied in signal processing. All existing methods proceed in two steps: (i) blind estimation of sparse discrete-time filters and (ii) echo information retrieval by peak picking. The precision of these methods is fundamentally limited by the rate at which the signals are sampled: estimated echo locations are necessarily on-grid, and since true locations never match the sampling grid, the weight estimation precision is also strongly limited. This is the so-called basis-mismatch problem in compressed sensing. We propose a radically different approach to the problem, building on top of the framework of finite-rate-of-innovation sampling. The approach operates directly in the parameter space of echo locations and weights, and enables near-exact blind and off-grid echo retrieval from discrete-time measurements. It is shown to outperform conventional methods by several orders of magnitude in precision.

Mixture Matrix Completion

Completing a data matrix X has become a ubiquitous problem in modern data science, with motivations in recommender systems, computer vision, and network inference, to name a few. One typical assumption is that X is low-rank. A more general model assumes that each column of X corresponds to one of several low-rank matrices. This paper generalizes these models to what we call mixture matrix completion (MMC): the case where each entry of X corresponds to one of several low-rank matrices. MMC is a more accurate model for recommender systems, and brings more flexibility to other completion and clustering problems. We make four fundamental contributions about this new model. First, we show that MMC is theoretically possible (well-posed). Second, we give its precise information-theoretic identifiability conditions. Third, we derive the sample complexity of MMC. Finally, we give a practical algorithm for MMC with performance comparable to the state-of-the-art for simpler related problems, both on synthetic and real data.

Trajectory Convolution for Action Recognition

How to leverage the temporal dimension is a key question in video analysis. Recent work suggests an efficient approach to video feature learning, namely, factorizing 3D convolutions into separate components respectively for spatial and temporal convolutions. The temporal convolution, however, comes with an implicit assumption – the feature maps across time steps are well aligned so that the features at the same locations can be aggregated. This assumption may be overly strong in practical applications, especially in action recognition where the motion of people or objects is a crucial aspect. In this work, we propose a new CNN architecture TrajectoryNet, which incorporates trajectory convolution, a new operation for integrating features along the temporal dimension, to replace the standard temporal convolution. This operation explicitly takes into account the location changes caused by deformation or motion, allowing the visual features to be aggregated along the motion paths. On two very large-scale action recognition datasets, namely, Something-Something and Kinetics, the proposed network architecture achieves notable improvement over strong baselines.

A Smoothed Analysis of the Greedy Algorithm for the Linear Contextual Bandit Problem

Bandit learning is characterized by the tension between long-term exploration and short-term exploitation. However, as has recently been noted, in settings in which the choices of the learning algorithm correspond to important decisions about individual people (such as criminal recidivism prediction, lending, and sequential drug trials), exploration corresponds to explicitly sacrificing the well-being of one individual for the potential future benefit of others. In such settings, one might like to run a ``greedy'' algorithm, which always makes the optimal decision for the individuals at hand --- but doing this can result in a catastrophic failure to learn. In this paper, we consider the linear contextual bandit problem and revisit the performance of the greedy algorithm. We give a smoothed analysis, showing that even when contexts may be chosen by an adversary, small perturbations of the adversary's choices suffice for the algorithm to achieve ``no regret'', perhaps (depending on the specifics of the setting) with a constant amount of initial training data. This suggests that in slightly perturbed environments, exploration and exploitation need not be in conflict in the linear setting.

Revisiting Decomposable Submodular Function Minimization with Incidence Relations

We introduce a new approach to decomposable submodular function minimization (DSFM) that exploits incidence relations. Incidence relations describe which variables effectively influence the component functions, and when properly utilized, they allow for improving the convergence rates of DSFM solvers. Our main results include the precise parametrization of the DSFM problem based on incidence relations, the development of new scalable alternating projection and parallel coordinate descent methods, and an accompanying rigorous analysis of their convergence rates.

A Practical Algorithm for Distributed Clustering and Outlier Detection

We study classic k-means/k-median clustering, two fundamental problems in unsupervised learning, in the setting where data are partitioned across multiple sites, and where we are allowed to discard a small portion of the data by labeling them as outliers. We propose a simple approach based on constructing a small summary of the original dataset. The proposed method is time and communication efficient, has good approximation guarantees, and can identify the global outliers effectively. To the best of our knowledge, this is the first practical algorithm with theoretical guarantees for distributed clustering with outliers. Our experiments on both real and synthetic data demonstrate the clear superiority of our algorithm over all the baseline algorithms in almost all metrics.

Learning to Reconstruct Shapes from Unseen Categories

From a single view, humans are able to hallucinate the full 3D shape of the object in the image, even if it is from a novel, unseen category. Contemporary AI systems for single-image 3D reconstruction often lack this ability, because the shape priors they learn are often tied to the training object classes. In this paper, we study the task of single-image 3D reconstruction, aiming to recover the full 3D shape of an object outside the training categories. Our model combines 2.5D sketches (depths and silhouettes), spherical shape representations, and 3D voxels in a principled manner. Experiments demonstrate that it achieves state-of-the-art results on generalizing to diverse novel object categories.

Gradient Descent Meets Shift-and-Invert Preconditioning for Eigenvector Computation

There has been a recent surge of interest in developing theoretically faster algorithms for leading eigenvector computation. The key to achieving faster convergence rates therein is to use the classic shift-and-invert preconditioning technique on top of power methods. The underlying problem then can be reduced to a series of linear system subproblems that can leverage fast approximate least squares solvers. Despite its simplicity, the power iteration as a base method may make limited progress towards solutions. In this work, we pair shift-and-invert preconditioning with a new base method, namely gradient descent search. By virtue of the flexibility of setting step-sizes in gradient search processes, we expect the shift-and-inverted gradient descent solver to outperform the shift-and-inverted power methods. In particular, we present a novel convergence analysis for this new pairing that achieves a rate of $\tilde{O}(\sqrt{\frac{\lambda_{1}}{\lambda_{1}-\lambda_{p+1}}})$, where $\lambda_{i}$ represents the $i$-th largest eigenvalue of the given real symmetric matrix and $p$ is the multiplicity of $\lambda_{1}$. Our experimental studies show that the proposed algorithm can be significantly faster than the shift-and-inverted power method in practice.
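
A schematic sketch of the pairing described above: each outer shift-and-invert step applies $(\sigma I - A)^{-1}$ approximately by running plain gradient descent on the associated quadratic. The shift $\sigma$ must upper-bound $\lambda_1$ so the system is positive definite; step sizes and iteration counts here are illustrative assumptions.

```python
import numpy as np

def si_gd_eigenvector(A, sigma, outer=50, inner=100):
    d = A.shape[0]
    M = sigma * np.eye(d) - A                  # SPD when sigma > lambda_1
    step = 1.0 / np.linalg.norm(M, 2)          # safe step for the quadratic
    x = np.random.default_rng(0).normal(size=d)
    x /= np.linalg.norm(x)
    for _ in range(outer):
        y, z = x, x.copy()
        for _ in range(inner):                 # inner GD solves M z = y,
            z -= step * (M @ z - y)            # i.e. min 0.5 z'Mz - z'y
        x = z / np.linalg.norm(z)              # normalized inverse iteration
    return x

A = np.diag([3.0, 2.0, 1.0])
print(si_gd_eigenvector(A, sigma=3.5))         # roughly +/- e_1
```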

Factored Bandits

We introduce the factored bandits model, which is a framework for learning with limited (bandit) feedback, where actions can be decomposed into a Cartesian product of atomic actions. Factored bandits incorporate rank-1 bandits as a special case, but significantly relax the assumptions on the form of the reward function. We provide an anytime algorithm for stochastic factored bandits and matching (up to constants) upper and lower regret bounds for the problem. Furthermore, we show that with a slight modification the proposed algorithm can be applied to utility-based dueling bandits. We obtain an improvement in the additive terms of the regret bound compared to state-of-the-art algorithms (the additive terms are dominating up to time horizons which are exponential in the number of arms).

Delta-encoder: an effective sample synthesis method for few-shot object recognition

Learning to classify new categories based on just one or a few examples is a long-standing challenge in modern computer vision. In this work, we propose a simple yet effective method for few-shot (and one-shot) object recognition. Our approach is based on a modified auto-encoder, denoted Delta-encoder, that learns to synthesize new samples for an unseen category just by seeing a few examples from it. The synthesized samples are then used to train a classifier. The proposed approach learns to both extract transferable intra-class deformations, or "deltas", between same-class pairs of training examples, and to apply those deltas to the few provided examples of a novel class (unseen during training) in order to efficiently synthesize samples from that new class. The proposed method improves over the state-of-the-art in one-shot object recognition and compares favorably in the few-shot case. Upon acceptance code will be made available.
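
A hypothetical sketch of the Delta-encoder interface at synthesis time; random linear maps stand in for the trained encoder and decoder, so this only illustrates the data flow of extracting a delta from a seen-class pair and applying it to a single novel-class example.

```python
import numpy as np

rng = np.random.default_rng(0)
feat, code = 64, 8
W_enc = rng.normal(size=(code, 2 * feat)) / np.sqrt(2 * feat)
W_dec = rng.normal(size=(feat, code + feat)) / np.sqrt(code + feat)

def encode_delta(x1, x2):
    # Compress a same-class pair into a low-dimensional deformation code.
    return W_enc @ np.concatenate([x1, x2])

def apply_delta(delta, anchor):
    # Re-apply the deformation to a single example of the novel class.
    return W_dec @ np.concatenate([delta, anchor])

# Synthesize a novel-class sample: a delta from a seen-class pair is
# transplanted onto the one provided example of the unseen class.
seen_a, seen_b = rng.normal(size=feat), rng.normal(size=feat)
novel_anchor = rng.normal(size=feat)
synthetic = apply_delta(encode_delta(seen_a, seen_b), novel_anchor)
print(synthetic.shape)
```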

Metric on Nonlinear Dynamical Systems with Koopman Operators

The development of a metric for structural data is a long-term problem in pattern recognition and machine learning. In this paper, we develop a general metric for comparing nonlinear dynamical systems that is defined with the Koopman operator in reproducing kernel Hilbert spaces. Our metric includes the existing fundamental metrics for dynamical systems, which are basically defined with principal angles between some appropriately-chosen subspaces, as its special cases. We also describe the estimation of our metric from finite data. We empirically illustrate our metric with an example of rotation dynamics in the unit disk of the complex plane, and evaluate the performance on real-world time-series data.

Learning a High Fidelity Pose Invariant Model for High-resolution Face Frontalization

Face frontalization refers to the process of synthesizing the frontal view of a face from a given profile. Due to self-occlusion and appearance distortion in the wild, it is extremely challenging to recover faithful results and preserve texture details at high resolution. This paper proposes a High Fidelity Pose Invariant Model (HF-PIM) to produce photographic and identity-preserving results. HF-PIM frontalizes the profiles through a novel texture warping procedure and leverages a dense correspondence field to bridge the 2D and 3D surface space. We decompose the prerequisite of warping into correspondence field estimation and facial texture recovery, which are both well addressed by deep networks. Different from those reconstruction methods relying on 3D data, we also propose Adversarial Residual Dictionary Learning (ARDL) to supervise facial texture map recovery with only monocular images. Exhaustive experiments in both controlled and uncontrolled environments demonstrate that the proposed method not only boosts the performance of pose-invariant face recognition but also dramatically improves high-resolution frontalization appearances.

Mirrored Langevin Dynamics

We consider the problem of sampling from constrained distributions, which has posed significant challenges to both non-asymptotic analysis and algorithmic design. We propose a unified framework, which is inspired by the classical mirror descent, to derive novel first-order sampling schemes. We prove that, for a general target distribution with strongly convex potential, our framework implies the existence of a first-order algorithm achieving $\tilde{O}(\epsilon^{-2}d)$ convergence, suggesting that the state-of-the-art $\tilde{O}(\epsilon^{-6}d^5)$ rate can be vastly improved. With the important Latent Dirichlet Allocation (LDA) application in mind, we specialize our algorithm to sample from Dirichlet posteriors, and derive the first non-asymptotic $\tilde{O}(\epsilon^{-2}d^2)$ rate for first-order sampling. We further extend our framework to the mini-batch setting and prove convergence rates when only stochastic gradients are available. Finally, we report promising experimental results for LDA on real datasets.
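
A schematic sketch of one mirrored Langevin step with the entropic mirror map on the simplex: take a Langevin step in the dual (mirror) space and map back through the softmax. The paper's scheme also transforms the potential correctly under the mirror map; this only illustrates the mechanics, with a Dirichlet target as a toy example.

```python
import numpy as np

def softmax(y):
    z = np.exp(y - y.max())
    return z / z.sum()

def mirrored_langevin_step(x, grad_V, gamma, rng):
    y = np.log(x)                                 # dual point (up to a constant)
    noise = rng.normal(size=x.shape)
    y = y - gamma * grad_V(x) + np.sqrt(2 * gamma) * noise
    return softmax(y)                             # back to the simplex

# Toy target: Dirichlet(alpha), with potential V(x) = -sum((alpha-1)*log x).
alpha = np.array([2.0, 3.0, 4.0])
grad_V = lambda x: -(alpha - 1) / x
rng = np.random.default_rng(0)
x = np.ones(3) / 3
for _ in range(1000):
    x = mirrored_langevin_step(x, grad_V, gamma=1e-3, rng=rng)
print(x)   # iterates stay strictly inside the simplex by construction
```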

Stochastic Cubic Regularization for Fast Nonconvex Optimization

This paper proposes a stochastic variant of a classic algorithm---the cubic-regularized Newton method [Nesterov and Polyak]. The proposed algorithm efficiently escapes saddle points and finds approximate local minima for general smooth, nonconvex functions in only $\mathcal{\tilde{O}}(\epsilon^{-3.5})$ stochastic gradient and stochastic Hessian-vector product evaluations. The latter can be computed as efficiently as stochastic gradients. This improves upon the $\mathcal{\tilde{O}}(\epsilon^{-4})$ rate of stochastic gradient descent. Our rate matches the best-known result for finding local minima without requiring any delicate acceleration or variance-reduction techniques.

Adaptation to Easy Data in Prediction with Limited Advice

We derive an online learning algorithm with improved regret guarantees for ``easy'' loss sequences. We consider two types of ``easiness'': (a) stochastic loss sequences and (b) adversarial loss sequences with small effective range of the losses. While a number of algorithms have been proposed for exploiting small effective range in the full information setting, Gerchinovitz and Lattimore [2016] have shown the impossibility of regret scaling with the effective range of the losses in the bandit setting. We show that just one additional observation per round is sufficient to bypass the impossibility result. The proposed Second Order Difference Adjustments (SODA) algorithm requires no prior knowledge of the effective range of the losses, $\varepsilon$, and achieves an $O(\varepsilon \sqrt{KT \ln K}) + \tilde{O}(\varepsilon K \sqrt[4]{T})$ expected regret guarantee, where $T$ is the time horizon and $K$ is the number of actions. The scaling with the effective loss range is achieved under significantly weaker assumptions than those made by Cesa-Bianchi and Shamir [2018] in an earlier attempt to bypass the impossibility result. We also provide a regret lower bound of $\Omega(\varepsilon\sqrt{T K})$, which almost matches the upper bound. In addition, we show that in the stochastic setting SODA achieves an $O\left(\sum_{a:\Delta_a>0} \frac{K\varepsilon^2}{\Delta_a}\right)$ pseudo-regret bound that holds simultaneously with the adversarial regret guarantee. In other words, SODA is safe against an unrestricted oblivious adversary and provides improved regret guarantees for at least two different types of ``easiness'' simultaneously.

Differentially Private Bayesian Inference for Exponential Families

The study of private inference has been sparked by growing concern regarding the analysis of data when it stems from sensitive sources. We present the first method for private Bayesian inference in exponential families that properly accounts for noise introduced by the privacy mechanism. It is efficient because it works only with sufficient statistics and not individual data. Unlike other methods, it gives properly calibrated posterior beliefs in the non-asymptotic data regime.

Playing hard exploration games by watching YouTube

Deep reinforcement learning methods traditionally struggle with tasks where environment rewards are particularly sparse. One successful method of guiding exploration in these domains is to imitate trajectories provided by a human demonstrator. However, these demonstrations are typically collected under artificial conditions, i.e. with access to the agent’s exact environment setup and the demonstrator’s action and reward trajectories. Here we propose a method that overcomes these limitations in two stages. First, we learn to map unaligned videos from multiple sources to a common representation using self-supervised objectives constructed over both time and modality (i.e. vision and sound). Second, we embed a single YouTube video in this representation to learn a reward function that encourages an agent to imitate human gameplay. This method of one-shot imitation allows our agent to convincingly exceed human-level performance on the infamously hard exploration games Montezuma’s Revenge, Pitfall! and Private Eye for the first time, even if the agent is not presented with any environment rewards.

Dialog-to-Action: Conversational Question Answering Over a Large-Scale Knowledge Base

We present an approach to map utterances in conversation to logical forms, which are then executed on a large-scale knowledge base. To handle the ellipsis phenomena that pervade conversation, we introduce dialog memory management to manipulate historical entities, predicates, and logical forms when inferring the logical form of current utterances. Dialog memory management is embodied in a generative model, in which a logical form is interpreted in a top-down manner following a small and flexible grammar. We learn the model from denotations without explicit annotation of logical forms, and evaluate it on a large-scale dataset consisting of 200K dialogs over 12.8M entities. Results verify the benefits of modeling dialog memory, and show that our semantic parsing-based approach outperforms a memory network based encoder-decoder model by a large margin.

Norm-Ranging LSH for Maximum Inner Product Search

Neyshabur and Srebro proposed Simple-LSH, the state-of-the-art hashing method for maximum inner product search (MIPS) with a performance guarantee. We found that the performance of Simple-LSH, in both theory and practice, suffers from long tails in the 2-norm distribution of real datasets. We propose Norm-ranging LSH, which addresses the excessive normalization problem caused by long tails in Simple-LSH by partitioning a dataset into multiple sub-datasets and building a hash index for each sub-dataset independently. We prove that Norm-ranging LSH has lower query time complexity than Simple-LSH. We also show that the idea of partitioning the dataset can improve other hashing based methods for MIPS. To support efficient query processing on the hash indexes of the sub-datasets, a novel similarity metric is formulated. Experiments show that Norm-ranging LSH achieves an order of magnitude speedup over Simple-LSH for the same recall, thus significantly benefiting applications that involve MIPS.
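
A sketch of the norm-ranging idea: partition the dataset by 2-norm so that Simple-LSH's normalization (dividing by the maximum norm) is not dominated by a long tail, then index each sub-dataset separately. The brute-force `mips` scan stands in for probing per-bucket hash indexes, which this sketch does not implement.

```python
import numpy as np

def norm_ranging_partition(X, n_ranges=4):
    norms = np.linalg.norm(X, axis=1)
    edges = np.quantile(norms, np.linspace(0.0, 1.0, n_ranges + 1))
    buckets = np.clip(np.searchsorted(edges, norms, side="right") - 1,
                      0, n_ranges - 1)
    # Each sub-dataset can now be indexed (e.g., by Simple-LSH) using its
    # own, much tighter, maximum norm.
    return [X[buckets == b] for b in range(n_ranges)]

def mips(query, subsets):
    # Probe every sub-index and keep the overall best inner product.
    return max((x for S in subsets for x in S),
               key=lambda x: float(x @ query))

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 8)) * rng.lognormal(size=(1000, 1))  # long tail
subsets = norm_ranging_partition(data)
print(mips(rng.normal(size=8), subsets))
```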

Optimization over Continuous and Multi-dimensional Decisions with Observational Data

We consider the optimization of an uncertain objective over continuous and multi-dimensional decision spaces in problems in which we are only provided with observational data. We propose a novel algorithmic framework that is tractable, asymptotically consistent, and superior to comparable methods on example problems. Our approach leverages predictive machine learning methods and incorporates information on the uncertainty of the predicted outcomes for the purpose of prescribing decisions. We demonstrate the efficacy of our method on examples involving both synthetic and real data sets.

Fast Estimation of Causal Interactions using Wold Processes

We focus on the task of learning Granger causality matrices for multivariate point processes. To tackle this task, our work is the first to explore Wold processes. By doing so, we are able to develop asymptotically fast MCMC learning algorithms. With $N$ being the total number of events and $K$ the number of processes, our learning algorithm has a $O(N(\log N + \log K))$ cost per iteration. This is much faster than the $O(N^3 K^2)$ or $O(K^3)$ required by the state of the art. Our approach, called Granger-Busca, is validated on real-world data, where it is three times more accurate (in Precision@10) than recent baselines. Granger-Busca is also the only approach able to train models for large sets of data.

When do random forests fail?

Random forests are a class of ensemble algorithms that build large collections of random trees and make predictions by averaging the tree predictions. In this paper, we consider various tree constructions and examine how the choice of parameters affects the generalization error of the resulting random forests as the sample size goes to infinity. We show that subsampling of data points during the tree construction phase is critical: Forests can become inconsistent with either no subsampling or too severe subsampling. As a consequence, even highly randomized trees can lead to inconsistent forests if no subsampling is used, which implies that some of the commonly used setups for random forests can be inconsistent. As a second consequence we can show that, surprisingly, trees that have good performance in nearest-neighbor search can be a poor choice for random forests.

Near Optimal Exploration-Exploitation in Non-Communicating Markov Decision Processes

While designing the state space of an MDP, it is common to include states that are transient or not reachable by any policy (e.g., in mountain car, the product space of speed and position contains configurations that are not physically reachable). This leads to defining weakly-communicating or multi-chain MDPs. In this paper, we introduce TUCRL, the first algorithm able to perform efficient exploration-exploitation in any finite Markov Decision Process (MDP) without requiring any form of prior knowledge. In particular, for any MDP with Sc communicating states, A actions and Gc < Sc possible communicating next states, we derive a O(Dc \sqrt{Gc Sc A T}) regret bound, where Dc is the diameter (i.e., the longest shortest path) of the communicating part of the MDP. This is in contrast with optimistic algorithms (e.g., UCRL, Optimistic PSRL) that suffer linear regret in weakly-communicating MDPs, as well as posterior sampling or regularised algorithms (e.g., REGAL), which require prior knowledge on the bias span of the optimal policy to bias the exploration to achieve sub-linear regret. We also prove that in weakly-communicating MDPs, no algorithm can ever achieve a logarithmic growth of the regret without first suffering a linear regret for a number of steps that is exponential in the parameters of the MDP. Finally, we report numerical simulations supporting our theoretical findings and showing how TUCRL overcomes the limitations of the state-of-the-art.

Optimistic Optimization of a Brownian

In this paper, we address the problem of optimizing a Brownian motion. More precisely, we consider a (random) realization $W$ of a Brownian motion on $[0,1]$. Now, given this function, our goal is to return an $\epsilon$-approximation of its maximum using the smallest possible number of function evaluations. This number is called the sample complexity of the algorithm. We provide an algorithm with sample complexity of order $\log^2(1/\epsilon)$. This improves over previous results of Al-Mharmah and Calvin [1996] and Calvin et al. [2017] which provided polynomial rates only. Our algorithm is adaptive --- each query depends on previous values --- and can be seen as an instance of the optimism-in-the-face-of-uncertainty principle.

Practical Methods for Graph Two-Sample Testing

Hypothesis testing for graphs has been an important tool in several applied research fields for more than two decades, and still remains a challenging problem as one often needs to draw inference from few replicates of large graphs. Recent studies in statistics and learning theory have provided some theoretical insights about such high-dimensional graph testing problems, but the practicality of the developed theoretical methods remains an open question. In this paper, we consider the problem of two-sample testing of large graphs. We demonstrate the practical merits and limitations of existing theoretical tests, or more precisely, their bootstrapped variants. We also propose two new tests based on asymptotic distributions, and show that the proposed tests are computationally less expensive and, in some cases, more reliable than the existing methods.

NAIS-Net: Stable Deep Networks from Non-Autonomous Differential Equations

This paper introduces Non-Autonomous Input-Output Stable Network (NAIS-Net), a very deep architecture where each stacked processing block is derived from a time-invariant non-autonomous dynamical system. Non-autonomy is implemented by skip connections from the block input to each of the unrolled processing stages and allows stability to be enforced so that blocks can be unrolled adaptively to a pattern-dependent processing depth. NAIS-Net induces non-trivial, Lipschitz input-output maps, even for an infinite unroll length. We prove that the network is globally asymptotically stable so that for every initial condition there is exactly one input-dependent equilibrium assuming tanh units, and multiple stable equilibria for ReLU units. An efficient implementation that enforces the stability under derived conditions for both fully-connected and convolutional layers is also presented. Experimental results show how NAIS-Net exhibits stability in practice, yielding a significant reduction in generalization gap compared to ResNets.

On the Global Convergence of Gradient Descent for Over-parameterized Models using Optimal Transport

Many tasks in machine learning and signal processing can be solved by minimizing a convex function of a measure. This includes sparse spike deconvolution and training a neural network with a single hidden layer. For these problems, we study a simple minimization method: the unknown measure is discretized into a mixture of particles and a continuous-time gradient descent is performed on their weights and positions. This is an idealization of the usual way to train neural networks with a large hidden layer. We show that, when initialized correctly and in the many-particle limit, this gradient flow, although non-convex, converges to global minimizers. The proof involves Wasserstein gradient flows, a by-product of optimal transport theory. Numerical experiments show that this asymptotic behavior is already at play for a reasonable number of particles, even in high dimension.
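
A toy instance of the particle scheme described above, under assumed settings: a convex quadratic loss over measures, $F(\mu) = \|f_\mu - f^\star\|^2$ evaluated on a grid, minimized by (discretized) gradient descent on both particle weights and positions.

```python
import numpy as np

rng = np.random.default_rng(0)
grid = np.linspace(-4.0, 4.0, 200)

def features(p):
    # Gaussian bump centered at each particle position; shape (m, grid).
    return np.exp(-0.5 * (grid[None, :] - np.asarray(p)[:, None]) ** 2)

f_star = np.array([0.5, 0.5]) @ features([-2.0, 1.0])   # teacher signal

m, lr = 50, 0.05
pos = rng.normal(size=m)               # particle positions
w = np.full(m, 1.0 / m)                # particle weights

for _ in range(2000):
    Phi = features(pos)
    residual = w @ Phi - f_star                          # fit error on grid
    grad_w = 2 * Phi @ residual / len(grid)
    grad_pos = 2 * w * (((grid[None, :] - pos[:, None]) * Phi)
                        @ residual) / len(grid)
    w, pos = w - lr * grad_w, pos - lr * grad_pos

print(np.round(np.linalg.norm(w @ features(pos) - f_star), 4))
```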

Constructing Deep Neural Networks by Bayesian Network Structure Learning

We introduce a principled approach for unsupervised structure learning of deep neural networks. We propose a new interpretation for depth and inter-layer connectivity where conditional independencies in the input distribution are encoded hierarchically in the network structure. Thus, the depth of the network is determined inherently. The proposed method casts the problem of neural network structure learning as a problem of Bayesian network structure learning. Then, instead of directly learning the discriminative structure, it learns a generative graph, constructs its stochastic inverse, and then constructs a discriminative graph. We prove that conditional-dependency relations among the latent variables in the generative graph are preserved in the class-conditional discriminative graph. We demonstrate on image classification benchmarks that the deepest layers (convolutional and dense) of common networks can be replaced by significantly smaller learned structures, while maintaining classification accuracy---state-of-the-art on tested benchmarks. Our structure learning algorithm requires a small computational cost and runs efficiently on a standard desktop CPU.

Weakly Supervised Dense Event Captioning in Videos

Dense event captioning aims to detect and describe all events of interest contained in a video. Despite the advanced development in this area, existing methods tackle this task by making use of dense temporal annotations, which are dramatically resource-consuming to obtain. This paper formulates a new problem: weakly supervised dense event captioning, which does not require temporal segment annotations for model training. Our solution is based on the one-to-one correspondence assumption: each caption describes one temporal segment, and each temporal segment has one caption, which holds in current benchmark datasets and most real-world cases. We decompose the problem into a pair of dual problems, event captioning and sentence localization, and present a cycle system to train our model. Extensive experimental results are provided to demonstrate the ability of our model on both dense event captioning and sentence localization in videos.

Faithful Inversion of Generative Models for Effective Amortized Inference

Inference amortization methods share information across multiple posterior-inference problems, allowing each to be carried out more efficiently. Generally, they require the inversion of the dependency structure in the generative model, as the modeller must learn a mapping from observations to distributions approximating the posterior. Previous approaches have involved inverting the dependency structure in a heuristic way that fails to capture these dependencies correctly, thereby limiting the achievable accuracy of the resulting approximations. We introduce an algorithm for faithfully, and minimally, inverting the graphical model structure of any generative model. Such inverses have two crucial properties: (a) they do not encode any independence assertions that are absent from the model; and (b) they are local maxima for the number of true independencies encoded. We prove the correctness of our approach and empirically show that the resulting minimally faithful inverses lead to better inference amortization than existing heuristic approaches.

From Stochastic Planning to Marginal MAP

It is well known that the problems of stochastic planning and probabilistic inference are closely related. This paper makes two contributions in this context. The first is an analysis of the recently developed SOGBOFA heuristic planning algorithm, which was shown to be effective for problems with large factored state and action spaces. We show that SOGBOFA can be seen as a specialized inference algorithm that computes its solutions through a combination of a symbolic variant of belief propagation and gradient ascent. The second contribution is a new solver for Marginal MAP (MMAP) inference. We introduce a new reduction from MMAP to maximum expected utility problems, which are suitable for the symbolic computation in SOGBOFA. This yields a novel algebraic gradient-based solver (AGS) for MMAP. An experimental evaluation illustrates the potential of AGS in solving difficult MMAP problems.

On Binary Classification in Extreme Regions

In pattern recognition, a random label $Y$ is to be predicted based upon observing a random vector $X$ valued in $\mathbb{R}^d$ with $d > 1$ by means of a classification rule with minimum probability of error. In a wide variety of applications, ranging from finance/insurance to environmental sciences through teletraffic data analysis for instance, extreme (i.e. very large) observations $X$ are of crucial importance, while contributing in a negligible manner to the (empirical) error, simply because of their rarity. As a consequence, empirical risk minimizers generally perform very poorly in extreme regions. It is the purpose of this paper to develop a general framework for classification in the extremes. Precisely, under non-parametric heavy-tail assumptions for the class distributions, we prove that a natural and asymptotic notion of risk, accounting for predictive performance in extreme regions of the input space, can be defined. We then show, by means of maximal deviation inequalities in low-probability regions, that minimizers of an empirical version of a non-asymptotic approximant of this dedicated risk, based on a fraction of the largest observations, lead to classification rules with good generalization capacity. Beyond theoretical results, numerical experiments are presented in order to illustrate the relevance of the approach developed.

Near-Optimal Policies for Dynamic Multinomial Logit Assortment Selection Models

In this paper we consider the dynamic assortment selection problem under an uncapacitated multinomial-logit (MNL) model. By carefully analyzing a revenue potential function, we show that a trisection based algorithm achieves an item-independent regret bound of $O(\sqrt{T \log\log T})$, which matches information-theoretic lower bounds up to iterated-logarithmic terms. Our proof technique draws tools from the unimodal/convex bandit literature as well as adaptive confidence parameters in minimax multi-armed bandit problems.

Q-learning with Nearest Neighbors

We consider model-free reinforcement learning for infinite-horizon discounted Markov Decision Processes (MDPs) with a continuous state space and unknown transition kernel, when only a single sample path under an arbitrary policy of the system is available. We consider the Nearest Neighbor Q-Learning (NNQL) algorithm to learn the optimal Q function using a nearest-neighbor regression method. As the main contribution, we provide a tight finite sample analysis of the convergence rate. In particular, for MDPs with a $d$-dimensional state space and discount factor $\gamma \in (0,1)$, given an arbitrary sample path with ``covering time'' $L$, we establish that the algorithm is guaranteed to output an $\varepsilon$-accurate estimate of the optimal Q-function using $\tilde{O}(L/(\varepsilon^3(1-\gamma)^7))$ samples. For instance, for a well-behaved MDP, the covering time of the sample path under the purely random policy scales as $\tilde{O}(1/\varepsilon^d)$, so the sample complexity scales as $\tilde{O}(1/\varepsilon^{d+3})$. Indeed, we establish a lower bound showing that a dependence of $\tilde{\Omega}(1/\varepsilon^{d+2})$ is necessary.
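
A minimal sketch of the nearest-neighbor idea: Q-values live on a finite set of anchor states, and each update is applied at the anchor nearest the observed state. This simplified 1-nearest-neighbor variant only illustrates the mechanism the paper's averaging analysis refines.

```python
import numpy as np

class NNQ:
    def __init__(self, anchors, n_actions, gamma=0.95, lr=0.1):
        self.anchors = anchors                    # (n_anchors, d) state grid
        self.Q = np.zeros((len(anchors), n_actions))
        self.gamma, self.lr = gamma, lr

    def _nearest(self, s):
        return int(np.argmin(np.linalg.norm(self.anchors - s, axis=1)))

    def value(self, s):
        return self.Q[self._nearest(s)]           # Q-values of nearest anchor

    def update(self, s, a, r, s_next):
        i = self._nearest(s)
        target = r + self.gamma * self.value(s_next).max()
        self.Q[i, a] += self.lr * (target - self.Q[i, a])

# Toy usage on a 1-D state space with anchors on a grid.
agent = NNQ(anchors=np.linspace(0, 1, 21)[:, None], n_actions=2)
agent.update(np.array([0.31]), a=1, r=1.0, s_next=np.array([0.52]))
print(agent.value(np.array([0.30])))
```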

Global Convergence of Langevin Dynamics Based Algorithms for Nonconvex Optimization

We present a unified framework to analyze the global convergence of Langevin dynamics based algorithms for nonconvex finite-sum optimization with $n$ component functions. At the core of our analysis is a direct analysis of the ergodicity of the numerical approximations to Langevin dynamics, which leads to faster convergence rates. Specifically, we show that gradient Langevin dynamics (GLD) and stochastic gradient Langevin dynamics (SGLD) converge to the \textit{almost minimizer}\footnote{Following \citet{raginsky2017non}, an almost minimizer is defined to be a point which is within the ball of the global minimizer with radius $O(d\log(\beta+1)/\beta)$, where $d$ is the problem dimension and $\beta$ is the inverse temperature parameter.} within $\tilde O\big(nd/(\lambda\epsilon) \big)$\footnote{$\tilde{O}(\cdot)$ notation hides polynomials of logarithmic terms and constants.} and $\tilde O\big(d^7/(\lambda^5\epsilon^5) \big)$ stochastic gradient evaluations respectively, where $d$ is the problem dimension, and $\lambda$ is the spectral gap of the Markov chain generated by GLD. Both results improve upon the best known gradient complexity\footnote{Gradient complexity is defined as the total number of stochastic gradient evaluations of an algorithm, which is the number of stochastic gradients calculated per iteration times the total number of iterations.} results \citep{raginsky2017non}. Furthermore, for the first time we prove the global convergence guarantee for variance reduced stochastic gradient Langevin dynamics (VR-SGLD) to the almost minimizer within $\tilde O\big(\sqrt{n}d^5/(\lambda^4\epsilon^{5/2})\big)$ stochastic gradient evaluations, which outperforms the gradient complexities of GLD and SGLD in a wide regime. Our theoretical analyses shed some light on using Langevin dynamics based algorithms for nonconvex optimization with provable guarantees.

Efficiency of adaptive importance sampling

The \textit{sampling policy} of stage $t$, formally expressed as a probability density function $q_t$, stands for the distribution of the sample $(x_{t,1},\ldots, x_{t,n_t})$ generated at $t$. From the past samples, some information depending on some \textit{objective} is derived, eventually leading to an update of the sampling policy to $q_{t+1}$. This generic approach characterizes \textit{adaptive importance sampling} (AIS) schemes. Each stage $t$ consists of two steps: (i) explore the space with $n_t$ points according to $q_t$ and (ii) exploit the current amount of information to update the sampling policy. The fundamental question raised in the paper concerns the behavior of empirical sums based on AIS. Without making any assumption on $n_t$, the theory developed involves no restriction on the split of computational resources between the explore (i) and the exploit (ii) steps. It is shown that the asymptotic behavior of AIS is the same as that of some ``oracle'' strategy that knows the optimal sampling policy from the beginning. From a practical perspective, we introduce weighted AIS, a new method that forgets poor samples from early stages.
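
A sketch of a generic AIS loop in this explore/exploit form, assuming a Gaussian sampling policy updated by weighted moment matching; the target density and all settings below are toy assumptions, not the paper's scheme.

```python
import numpy as np
rng = np.random.default_rng(0)

def gaussian_logpdf(x, mu, sigma):
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

target_logpdf = lambda x: gaussian_logpdf(x, 3.0, 1.0)  # stands in for the target
phi = lambda x: x                                        # quantity of interest

mu, sigma = 0.0, 2.0                     # initial sampling policy q_0
xs, ws = [], []
for t in range(20):
    x = rng.normal(mu, sigma, size=100)                  # (i) explore with q_t
    w = np.exp(target_logpdf(x) - gaussian_logpdf(x, mu, sigma))
    xs.append(x); ws.append(w)
    X, W = np.concatenate(xs), np.concatenate(ws)
    mu = np.sum(W * X) / W.sum()                         # (ii) exploit: update
    sigma = np.sqrt(np.sum(W * (X - mu) ** 2) / W.sum()) + 1e-6

print(np.sum(W * phi(X)) / W.sum())      # self-normalized AIS estimate (~3.0)
```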

Learning latent variable structured prediction models with Gaussian perturbations

The standard margin-based approach to structured prediction commonly uses a maximum loss over all possible structured outputs [23, 1, 5, 22]. The large-margin formulation including latent variables [27, 18] not only results in a non-convex formulation but also increases the search space by a factor of the size of the latent space. Recent work [11] has proposed the use of the maximum loss over random structured outputs sampled independently from some proposal distribution, with theoretical guarantees. We extend this work by including latent variables. We study a new family of loss functions under Gaussian perturbations and analyze the effect of the latent space on the generalization bounds. We show that the non-convexity of learning with latent variables originates naturally, as it relates to a tight upper bound of the Gibbs decoder distortion with respect to the latent space. Finally, we provide a formulation using random samples that produces a tighter upper bound of the Gibbs decoder distortion up to a statistical accuracy, which enables a faster evaluation of the objective function. We illustrate the method with synthetic experiments and a computer vision application.

The Nearest Neighbor Information Estimator is Adaptively Near Minimax Rate-Optimal

We analyze the Kozachenko–Leonenko (KL) fixed $k$-nearest neighbor estimator for the differential entropy. We obtain the first uniform upper bound on its performance for any fixed $k$ over Hölder balls on a torus without assuming any conditions on how close the density could be from zero. Accompanying a recent minimax lower bound over the Hölder ball, we show that the KL estimator for any fixed $k$ is achieving the minimax rates up to logarithmic factors without cognizance of the smoothness parameter $s$ of the Hölder ball for $s \in (0,2]$ and arbitrary dimension $d$, rendering it the first estimator that provably satisfies this property.

Deep Reinforcement Learning of Marked Temporal Point Processes

In a wide variety of applications, humans interact with a complex environment by means of asynchronous stochastic discrete events in continuous time. Can we design online interventions that will help humans achieve certain goals in such an asynchronous setting? In this paper, we address the above problem from the perspective of deep reinforcement learning of marked temporal point processes, where both the actions taken by an agent and the feedback it receives from the environment are asynchronous stochastic discrete events characterized using marked temporal point processes. In doing so, we define the agent's policy using the intensity and mark distribution of the corresponding process and then derive a flexible policy gradient method, which embeds the agent's actions and the feedback it receives into real-valued vectors using deep recurrent neural networks. Our method does not make any assumptions on the functional form of the intensity and mark distribution of the feedback and it allows for arbitrarily complex reward functions. We apply our methodology to two different applications in viral marketing and personalized teaching and, using data gathered from Twitter and Duolingo, we show that it may be able to find interventions to help marketers and learners achieve their goals more effectively than alternatives.

Evidential Deep Learning to Quantify Classification Uncertainty

Deterministic neural nets have been shown to learn effective predictors on a wide range of machine learning problems. However, as the standard approach is to train the network to minimize a prediction loss, the resultant model remains ignorant of its prediction confidence. In contrast to Bayesian neural nets, which indirectly infer prediction uncertainty through weight uncertainties, we propose explicit modeling of the same using the theory of subjective logic. By placing a Dirichlet prior on the softmax output, we treat predictions of a neural net as subjective opinions and learn the function that collects the evidence leading to these opinions by a deterministic neural net from data. The resultant predictor for a multi-class classification problem is another Dirichlet distribution whose parameters are set by the continuous output of a neural net. We provide a preliminary analysis on how the peculiarities of our new loss function drive improved uncertainty estimation. We observe that our method achieves unprecedented success on detection of out-of-sample queries and endurance against adversarial perturbations.
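
A small sketch of the Dirichlet output head this describes: non-negative evidence sets the Dirichlet parameters, from which expected class probabilities and a total-uncertainty score $K/\sum_k \alpha_k$ follow. The ReLU evidence function is one common choice, assumed here rather than prescribed.

```python
import numpy as np

def evidential_head(logits):
    evidence = np.maximum(logits, 0.0)         # non-negative evidence (ReLU)
    alpha = evidence + 1.0                     # Dirichlet parameters
    strength = alpha.sum()
    prob = alpha / strength                    # expected class probabilities
    uncertainty = len(alpha) / strength        # K / sum(alpha): total doubt
    return prob, uncertainty

# A query yielding little evidence gets uncertainty near 1; a confidently
# classified one gets uncertainty near 0.
print(evidential_head(np.array([0.1, 0.0, 0.2])))
print(evidential_head(np.array([9.0, 0.0, 0.5])))
```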

Parsimonious Bayesian deep networks

Combining Bayesian nonparametrics and a forward model selection strategy, we construct parsimonious Bayesian deep networks (PBDNs) that infer capacity-regularized network architectures from the data and require neither cross-validation nor fine-tuning when training the model. One of the two essential components of a PBDN is the development of a special infinite-wide single-hidden-layer neural network, whose number of active hidden units can be inferred from the data. The other one is the construction of a greedy layer-wise learning algorithm that uses a forward model selection criterion to determine when to stop adding another hidden layer. We develop both Gibbs sampling and stochastic gradient descent based maximum a posteriori inference for PBDNs, providing state-of-the-art classification accuracy and interpretable data subtypes near the decision boundaries, while maintaining low computational complexity for out-of-sample prediction.

Single-Agent Policy Tree Search With Guarantees

We introduce two novel tree search algorithms that use a policy to guide search. The first algorithm is a best-first enumeration that uses a cost function that allows us to provide an upper bound on the number of nodes to be expanded before reaching a goal state. We show that this best-first algorithm is particularly well suited for ``needle-in-a-haystack'' problems. The second algorithm, which is based on sampling, provides an upper bound on the expected number of nodes to be expanded before reaching a set of goal states. We show that this algorithm is better suited for problems where many paths lead to a goal. We validate these tree search algorithms on 1,000 computer-generated levels of Sokoban, where the policy used to guide search comes from a neural network trained using A3C. Our results show that the policy tree search algorithms we introduce are competitive with a state-of-the-art domain-independent planner that uses heuristic search.

Semi-crowdsourced Clustering with Deep Generative Models

We consider the semi-supervised clustering problem where crowdsourcing provides noisy information about the pairwise comparisons on a subset of data, i.e., some sample pairs are (or are not) in the same clusters. We propose a new approach for clustering, which effectively combines the low-level features and a subset of noisy pairwise annotations. We build a deep generative model to characterize the generative process of low-level features and a relational model for the noisy pairwise annotations, which share the latent variables. Fast amortized and natural gradient stochastic variational inference algorithms are developed for the model and its fully Bayesian variant. Our empirical results on synthetic and real-world datasets show that the proposed method outperforms previous methods.

The committee machine: Computational to statistical gaps in learning a two-layers neural network

Heuristic tools from statistical physics have been used in the past to compute the optimal learning and generalization errors in the teacher-student scenario in multi-layer neural networks. In this contribution, we provide a rigorous justification of these approaches for a two-layer neural network model called the committee machine. We also introduce a version of the approximate message passing (AMP) algorithm for the committee machine that allows optimal learning to be performed in polynomial time for a large set of parameters. We find that there are regimes in which a low generalization error is information-theoretically achievable while the AMP algorithm fails to deliver it; this strongly suggests that no efficient algorithm exists for those cases, and unveils a large computational gap.

Realistic Evaluation of Deep Semi-Supervised Learning Algorithms

Semi-supervised learning (SSL) provides a powerful framework for leveraging unlabeled data when labels are limited or expensive to obtain. SSL algorithms based on deep neural networks have recently proven successful on standard benchmark tasks. However, we argue that these benchmarks fail to address many issues that these algorithms would face in real-world applications. After creating a unified reimplementation of various widely-used SSL techniques, we test them in a suite of experiments designed to address these issues. We find that the performance of simple baselines which do not use unlabeled data is often underreported, that SSL methods differ in sensitivity to the amount of labeled and unlabeled data, and that performance can degrade substantially when the unlabeled dataset contains out-of-class examples. To help guide SSL research towards real-world applicability, we make our unified reimplementation and evaluation platform publicly available.

Training deep learning based denoisers without ground truth data

Recent deep learning based denoisers often outperform state-of-the-art conventional denoisers such as BM3D. They are typically trained to minimize the mean squared error (MSE) between the output of a deep neural network and the ground truth image. In deep learning based denoisers, it is important to use high quality noiseless ground truth for high performance, but it is often challenging or even infeasible to obtain such a clean image in application areas such as hyperspectral remote sensing and medical imaging. We propose a Stein's Unbiased Risk Estimator (SURE) based method for training deep neural network denoisers only with noisy images. We demonstrate that our SURE-based method without ground truth is able to train deep neural network denoisers to yield performance close to that of deep learning denoisers trained with ground truth, and to outperform the state-of-the-art BM3D. Further improvements were achieved by including noisy test images when training denoiser networks using our proposed SURE-based method.
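
A rough sketch of a Monte-Carlo SURE training loss of the kind described, for i.i.d. Gaussian noise with known sigma; the `denoiser` module and the perturbation scale `eps` are illustrative, not the paper's exact construction:

```python
import torch

def mc_sure_loss(denoiser, noisy, sigma, eps=1e-3):
    """Monte-Carlo SURE: ||f(y) - y||^2 - N*sigma^2 + 2*sigma^2 * div f(y),
    with the divergence estimated via one random probe (Hutchinson-style),
    so no clean ground-truth image is needed."""
    out = denoiser(noisy)
    n = noisy.numel()
    b = torch.randn_like(noisy)                       # random probe direction
    out_perturbed = denoiser(noisy + eps * b)
    div = (b * (out_perturbed - out)).sum() / eps     # divergence estimate
    return ((out - noisy) ** 2).sum() - n * sigma ** 2 + 2 * sigma ** 2 * div
```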

Re-evaluating evaluation

Progress in machine learning is measured by careful evaluation on problems of outstanding common interest. However, the proliferation of benchmark suites and environments, adversarial attacks, and other complications has diluted the basic evaluation model by overwhelming researchers with choices. Deliberate or accidental cherry picking is increasingly likely, and designing well-balanced evaluation suites requires increasing effort. In this paper we take a step back and propose Nash averaging. The approach builds on a detailed analysis of the algebraic structure of evaluation in two basic scenarios: agent-vs-agent and agent-vs-task. The key strength of Nash averaging is that it automatically adapts to redundancies in evaluation data, so that results are not biased by the incorporation of easy tasks or weak agents. Nash averaging thus encourages maximally inclusive evaluation -- since there is no harm (computational cost aside) from including all available tasks and agents.
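
As a sketch of the underlying computation: the paper uses the maximum-entropy Nash equilibrium, while for brevity this toy version settles for any equilibrium of the antisymmetric agent-vs-agent payoff matrix `A`, found by a standard linear program:

```python
import numpy as np
from scipy.optimize import linprog

def nash_average(A):
    """Solve the zero-sum game with antisymmetric payoff A via an LP and
    return an equilibrium mixture p and the Nash-averaged skills A @ p."""
    n = A.shape[0]
    # Variables: p (n entries) and the game value v; maximize v.
    c = np.concatenate([np.zeros(n), [-1.0]])
    # Row player wants (A^T p)_j >= v for every column strategy j.
    A_ub = np.hstack([-A.T, np.ones((n, 1))])
    b_ub = np.zeros(n)
    A_eq = np.concatenate([np.ones(n), [0.0]])[None, :]   # p sums to 1
    b_eq = np.array([1.0])
    bounds = [(0, None)] * n + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    p = res.x[:n]
    return p, A @ p
```

A redundant copy of an agent adds no mass to the equilibrium mixture, which is the adaptivity-to-redundancy property the abstract highlights.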

Deep, complex, invertible networks for inversion of transmission effects in multimode optical fibres

We use complex-weighted, deep convolutional networks to invert the effects of multimode optical fibre distortion of a coherent input image. We generated experimental data based on collections of optical fibre responses to greyscale, input images generated with coherent light, and measuring only image amplitude (not amplitude and phase as is typical) at the output of the 10 metre long 105 micrometre diameter multimode fibre. This data is made available as the {\it Optical fibre inverse problem} Benchmark collection. The experimental data is used to train complex-weighted models with a range of regularisation approaches and subsequent denoising autoencoders. A new {\it unitary regularisation} approach for complex-weighted networks is proposed which performs best in robustly inverting the fibre transmission matrix, and which fits well with the physical theory. The use of unitary layers allows analytic inversion of the network via its complex conjugate transpose, and we demonstrate simultaneous optimisation of both the forward and inverse models.
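
One plausible form of such a unitary penalty, sketched for a single complex weight matrix in PyTorch (the weighting `lam` and the exact penalty form are illustrative assumptions, not the paper's definition):

```python
import torch

def unitary_penalty(W, lam=1e-2):
    """Encourage a complex weight matrix to be approximately unitary by
    penalising the deviation of W W^H from the identity. A unitary layer
    can then be inverted analytically via its conjugate transpose."""
    eye = torch.eye(W.shape[0], dtype=W.dtype, device=W.device)
    return lam * ((W @ W.conj().T - eye).abs() ** 2).sum()

W = torch.randn(8, 8, dtype=torch.complex64, requires_grad=True)
loss = unitary_penalty(W)
loss.backward()   # gradients flow through the complex matrix product
```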

Multivariate Convolutional Sparse Coding for Electromagnetic Brain Signals

Frequency-specific patterns of neural activity are traditionally interpreted as sustained rhythmic oscillations, and related to cognitive mechanisms such as attention, high level visual processing or motor control. While alpha waves (8--12 Hz) are known to closely resemble short sinusoids, and thus are revealed by Fourier analysis or wavelet transforms, there is an evolving debate that electromagnetic neural signals are composed of more complex waveforms that cannot be analyzed by linear filters and traditional signal representations. In this paper, we propose to learn dedicated representations of such recordings using a multivariate convolutional sparse coding (CSC) algorithm. Applied to electroencephalography (EEG) or magnetoencephalography (MEG) data, this method is able to learn not only prototypical temporal waveforms, but also associated spatial patterns so their origin can be localized in the brain. Our algorithm is based on alternating minimization and a greedy coordinate descent solver that leads to state-of-the-art running time on long time series. To demonstrate the implications of this method, we apply it to MEG data and show that it is able to recover biological artifacts. More remarkably, our approach also reveals the presence of non-sinusoidal mu-shaped patterns, along with their topographic maps related to the somatosensory cortex.

Data-Efficient Hierarchical Reinforcement Learning

Hierarchical reinforcement learning (HRL) is a promising approach to extend traditional reinforcement learning (RL) methods to solve more complex tasks. Yet, the majority of current HRL methods require careful task-specific design and on-policy training, making them difficult to apply in real-world scenarios. In this paper, we study how we can develop HRL algorithms that are general, in that they do not make onerous additional assumptions beyond standard RL algorithms, and efficient, in the sense that they can be used with modest numbers of interaction samples, making them suitable for real-world problems such as robotic control. For generality, we develop a scheme where lower-level controllers are supervised with goals that are learned and proposed automatically by the higher-level controllers. To address efficiency, we propose to use off-policy experience for both higher- and lower-level training. This poses a considerable challenge, since changes to the lower-level behaviors change the action space for the higher-level policy, and we introduce an off-policy correction to remedy this challenge. This allows us to take advantage of recent advances in off-policy model-free RL to learn both higher and lower-level policies using substantially fewer environment interactions than on-policy algorithms. We find that our resulting HRL agent is generally applicable and highly sample-efficient. Our experiments show that our method can be used to learn highly complex behaviors for simulated robots, such as pushing objects and utilizing them to reach target locations, learning from only a few million samples, equivalent to a few days of real-time interaction. In comparisons with a number of prior HRL methods, we find that our approach substantially outperforms previous state-of-the-art techniques.

Speaker-Follower Models for Vision-and-Language Navigation

Navigation guided by natural language instructions presents a challenging reasoning problem for instruction followers. Natural language instructions typically identify only a few high-level decisions and landmarks rather than complete low-level motor behaviors; much of the missing information must be inferred based on perceptual context. In machine learning settings, this presents a double challenge: it is difficult to collect enough annotated data to enable learning of this reasoning process from scratch, and empirically difficult to implement the reasoning process using generic sequence models. Here we describe an approach to vision-and-language navigation that addresses both these issues with an embedded speaker model. We use this speaker model to synthesize new instructions for data augmentation and to implement pragmatic reasoning for evaluating candidate action sequences. Both steps are supported by a panoramic action space that reflects the granularity of human-generated instructions. Experiments show that all three pieces of this approach---speaker-driven data augmentation, pragmatic reasoning, and panoramic action space---dramatically improve the performance of a baseline instruction follower, more than doubling the success rate over the best existing approach on a standard benchmark.

Inequity aversion improves cooperation in intertemporal social dilemmas

Groups of humans are often able to find ways to cooperate with one another in complex, temporally extended social dilemmas. Models based on behavioral economics are only able to explain this phenomenon for unrealistic stateless matrix games. Recently, multi-agent reinforcement learning has been applied to generalize social dilemma problems to temporally and spatially extended Markov games. However, this has not yet generated an agent that learns to cooperate in social dilemmas as humans do. A key insight is that many, but not all, human individuals have inequity averse social preferences. This promotes a particular resolution of the matrix game social dilemma wherein inequity-averse individuals are personally pro-social and punish defectors. Here we extend this idea to Markov games and show that it promotes cooperation in several types of sequential social dilemma, via a profitable interaction with policy learnability. In particular, we find that inequity aversion improves temporal credit assignment for the important class of intertemporal social dilemmas. These results help explain how large-scale cooperation may emerge and persist.
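
The inequity-averse preference at the core of this idea is the classic Fehr-Schmidt utility, applied here to per-step rewards; a minimal sketch with illustrative coefficients:

```python
import numpy as np

def fehr_schmidt_utility(rewards, i, alpha=5.0, beta=0.05):
    """Subjective reward of agent i: raw reward minus a penalty for
    disadvantageous inequity (envy, weight alpha) and one for
    advantageous inequity (guilt, weight beta)."""
    r = np.asarray(rewards, dtype=float)
    n = len(r)
    envy = np.maximum(r - r[i], 0.0).sum() / (n - 1)
    guilt = np.maximum(r[i] - r, 0.0).sum() / (n - 1)
    return r[i] - alpha * envy - beta * guilt

print(fehr_schmidt_utility([1.0, 3.0, 0.0], i=0))  # envy lowers agent 0's utility
```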

Learning Gaussian Processes by Minimizing PAC-Bayesian Generalization Bounds

Gaussian Processes (GPs) are a generic modelling tool for supervised learning. While they have been successfully applied to large datasets, their use in safety-critical applications is hindered by the lack of good performance guarantees. To this end, we propose a method to learn GPs and their sparse approximations by directly optimizing a PAC-Bayesian bound on their generalization performance, instead of maximizing the marginal likelihood. Besides its theoretical appeal, we find in our evaluation that our learning method is robust and yields significantly better generalization guarantees than other common GP approaches on several regression benchmark datasets.

High-dimensional Bayesian optimization via collaborative filtering

In order to achieve state-of-the-art performance, modern machine learning techniques require careful data pre-processing and hyperparameter tuning. Moreover, given the ever increasing number of machine learning models being developed, model selection is becoming increasingly important. Automating the selection and tuning of machine learning pipelines, which can include different data pre-processing methods and machine learning models, has long been one of the goals of the machine learning community. In this paper, we propose to solve this meta-learning task by combining ideas from collaborative filtering and Bayesian optimization. Specifically, we use a probabilistic matrix factorization model to transfer knowledge across experiments performed on hundreds of different datasets and use an acquisition function to guide the exploration of the space of possible ML pipelines. In our experiments, we show that our approach quickly identifies high-performing pipelines across a wide range of datasets, significantly outperforming the current state-of-the-art.

Stochastic Spectral and Conjugate Descent Methods

The state-of-the-art methods for solving optimization problems in big dimensions are variants of randomized coordinate descent (RCD). In this paper we introduce a fundamentally new type of acceleration strategy for RCD based on the augmentation of the set of coordinate directions by a few spectral or conjugate directions. As we increase the number of extra directions to be sampled from, the rate of the method improves, and interpolates between the linear rate of RCD and a linear rate independent of the condition number. We also develop and analyze inexact variants of these methods, where the spectral and conjugate directions are allowed to be only approximate. We motivate the above development by proving several negative results which highlight the limitations of RCD with importance sampling.

But How Does It Work in Theory? Linear SVM with Random Features

We prove that, under low noise assumptions, the support vector machine with $N\ll m$ random features (RFSVM) can achieve a learning rate faster than $O(1/\sqrt{m})$ on a training set with $m$ samples when an optimized feature map is used. Our work extends the previous fast-rate analysis of the random features method from least squares loss to the 0-1 loss. We also show that the reweighted feature selection method, which approximates the optimized feature map, helps improve the performance of RFSVM in experiments on a synthetic data set.
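
A minimal random-features SVM of the kind analysed here, using RBF random Fourier features; the feature count, bandwidth, and synthetic data are illustrative:

```python
import numpy as np
from sklearn.svm import LinearSVC

def rff(X, n_features=200, gamma=1.0, seed=0):
    """Map X to random Fourier features approximating an RBF kernel
    k(x, y) = exp(-gamma * ||x - y||^2)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(d, n_features))
    b = rng.uniform(0, 2 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

# Linear SVM (hinge loss, a 0-1 loss surrogate) on the random feature map
X = np.random.randn(500, 5)
y = (X[:, 0] + 0.1 * np.random.randn(500) > 0).astype(int)
clf = LinearSVC().fit(rff(X), y)
```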

Learning to Optimize Tensor Programs

We introduce a learning-based framework to optimize tensor programs for deep learning workloads. Efficient implementations of tensor operators, such as matrix multiplication and high-dimensional convolution, are key enablers of effective deep learning systems. However, existing systems rely on manually optimized libraries such as cuDNN, where only a narrow range of server-class GPUs are well-supported. The reliance on hardware-specific operator libraries limits the applicability of high-level graph optimizations and incurs significant engineering costs when deploying to new hardware targets. We use learning to remove this engineering burden. We learn domain-specific statistical cost models to guide the search of tensor operator implementations over billions of possible program variants. We further accelerate the search by effective model transfer across workloads. Experimental results show that our framework delivers performance competitive with state-of-the-art hand-tuned libraries for low-power CPU, mobile GPU, and server-class GPU.

Boosting Black Box Variational Inference

Approximating a probability density in a tractable manner is a central task in Bayesian statistics. Variational Inference (VI) is a popular technique that achieves tractability by choosing a relatively simple representation set. Borrowing ideas from the classic boosting framework, recent approaches attempt to \emph{boost} VI by replacing the selection of a single density with a greedily constructed mixture of densities. In order to guarantee convergence, previous works impose stringent assumptions that require significant effort for practitioners. Specifically, they require a custom implementation of the greedy step (called the LMO) for every probabilistic model with respect to an unnatural variational family of truncated distributions. Our work fixes these issues with novel theoretical and algorithmic insights. On the theoretical side, we show that boosting VI satisfies a relaxed smoothness assumption which is sufficient for the convergence of the functional Frank-Wolfe (FW) algorithm. Furthermore, we rephrase the LMO problem and propose to maximize the Residual ELBO (RELBO) which replaces the standard ELBO optimization in VI. These theoretical enhancements allow for black box implementation of the boosting subroutine. Finally, we present a stopping criterion drawn from the duality gap in the classic FW analyses. We also present exhaustive experiments to illustrate the usefulness of our theoretical and algorithmic contributions.

Nearly tight sample complexity bounds for learning mixtures of Gaussians via sample compression schemes

We prove that ~Θ(k d^2 / ε^2) samples are necessary and sufficient for learning a mixture of k Gaussians in R^d, up to error ε in total variation distance. This improves both the known upper bound and lower bound for this problem. For mixtures of axis-aligned Gaussians, we show that ~O(k d / ε^2) samples suffice, matching a known lower bound. The upper bound is based on a novel technique for distribution learning based on a notion of sample compression. Any class of distributions that allows such a sample compression scheme can also be learned with few samples. Moreover, if a class of distributions has such a compression scheme, then so do the classes of products and mixtures of those distributions. The core of our main result is showing that the class of Gaussians in R^d admits an efficient sample compression scheme.

Actor-Critic Policy Optimization in Partially Observable Multiagent Environments

Optimization of parameterized policies for reinforcement learning (RL) is an important and challenging problem in artificial intelligence. Among the most common approaches are algorithms based on gradient ascent of a score function representing discounted return. In this paper, we examine the role of these policy gradient and actor-critic algorithms in partially-observable multiagent environments. We show several candidate policy update rules and relate them to a foundation of regret minimization and multiagent learning techniques for the one-shot and tabular cases, leading to previously unknown convergence guarantees. We apply our method to model-free multiagent reinforcement learning in adversarial sequential decision problems (zero-sum imperfect information games), using RL-style function approximation. We evaluate on commonly-used benchmark Poker domains, comparing to several state-of-the-art baselines, showing empirical convergence to approximate Nash equilibria in self-play, without any domain-specific state space reductions.

Step Size Matters in Deep Learning

Training a neural network with the gradient descent algorithm gives rise to a discrete-time nonlinear dynamical system. Consequently, behaviors that are typically observed in these systems emerge during training, such as convergence to an orbit but not to a fixed point or dependence of convergence on the initialization. The step size of the algorithm plays a critical role in these behaviors: it determines the subset of the local optima that the algorithm can converge to, and it specifies the magnitude of the oscillations if the algorithm converges to an orbit. To elucidate the effects of the step size on training of neural networks, we study the gradient descent algorithm as a discrete-time dynamical system, and by analyzing the Lyapunov stability of different solutions, we show the relationship between the step size of the algorithm and the solutions that can be obtained with this algorithm. The results provide an explanation for several phenomena observed in practice, including the deterioration in the training error with increased depth, the hardness of estimating linear mappings with large singular values, and the distinct performance of deep residual networks.
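
A tiny illustration of the dynamical-system view (a toy one-dimensional objective, not from the paper): gradient descent on f(x) = -cos(x), a single smooth well at x = 0, converges to the fixed point for a small step but settles into an attracting period-2 orbit for a large one:

```python
import numpy as np

grad = np.sin                        # f(x) = -cos(x), so f'(x) = sin(x)
x_small, x_large = 1.0, 1.0
for _ in range(200):
    x_small -= 0.5 * grad(x_small)   # small step: converges to the minimum
    x_large -= 2.5 * grad(x_large)   # large step: settles into a 2-cycle
print(x_small)   # ~0.0 (fixed point)
print(x_large)   # ~1.13 in magnitude, flipping sign each iteration
```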

Derivative Estimation in Random Design

We propose a nonparametric derivative estimation method for random design without having to estimate the regression function. The method is based on a variance-reducing linear combination of symmetric difference quotients. First, we discuss the special case of uniform random design and establish the estimator's asymptotic properties. Second, we generalize these results to any distribution of the independent variable and compare the proposed estimator with popular estimators for derivative estimation such as local polynomial regression and smoothing splines.
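
A sketch of a variance-reducing combination of symmetric difference quotients; the quadratic weights below are the classical variance-minimizing choice for roughly equispaced data and are illustrative, not the paper's random-design weights:

```python
import numpy as np

def derivative_estimate(x, y, i, k=5):
    """Estimate f'(x_i) as a weighted sum of symmetric difference
    quotients over the k nearest neighbours on each side."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    j = np.arange(1, k + 1)
    w = j ** 2 / np.sum(j ** 2)      # wider gaps get more weight
    quotients = (y[i + j] - y[i - j]) / (x[i + j] - x[i - j])
    return np.sum(w * quotients)

x = np.sort(np.random.uniform(0, 2 * np.pi, 500))
y = np.sin(x) + 0.1 * np.random.randn(500)
print(derivative_estimate(x, y, i=250), np.cos(x[250]))  # estimate vs truth
```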

Zeroth-order (Non)-Convex Stochastic Optimization via Conditional Gradient and Gradient Updates

In this paper, we propose and analyze zeroth-order stochastic approximation algorithms for nonconvex and convex optimization. Specifically, we propose generalizations of the conditional gradient algorithm achieving rates similar to the standard stochastic gradient algorithm using only zeroth-order information. Furthermore, under a structural sparsity assumption, we first illustrate an implicit regularization phenomenon where the standard stochastic gradient algorithm with zeroth-order information adapts to the sparsity of the problem at hand by just varying the step-size. Next, we propose a truncated stochastic gradient algorithm with zeroth-order information, whose rate of convergence depends only poly-logarithmically on the dimensionality.

Latent Gaussian Activity Propagation: Using Smoothness and Structure to Separate and Localize Sounds in Large Noisy Environments

We present an approach for simultaneously separating and localizing multiple sound sources using recorded microphone data. Inspired by topic models, our approach is based on a probabilistic model of inter-microphone phase differences, and poses separation and localization as a Bayesian inference problem. We assume sound activity is locally smooth across time, frequency, and location, and use the known position of the microphones to obtain a consistent separation. We compare the performance of our method against existing algorithms on simulated anechoic voice data and find that it obtains high performance across a variety of input conditions.

Hybrid-MST: A Hybrid Active Sampling Strategy for Pairwise Preference Aggregation

In this paper we present a hybrid active sampling strategy for pairwise preference aggregation, which aims at recovering the underlying rating of the test candidates from sparse and noisy pairwise labeling. Our method employs a Bayesian optimization framework and the Bradley-Terry model to construct the utility function and to obtain the Expected Information Gain (EIG) of each pair. For computational efficiency, Gauss-Hermite quadrature is used to estimate the EIG. The proposed hybrid strategy uses either Global Maximum (GM) EIG sampling or Minimum Spanning Tree (MST) sampling in each trial, as determined by the test budget. The proposed method has been validated on both simulated and real-world datasets, where it shows higher preference aggregation ability than the state-of-the-art methods.

Infinite-Horizon Gaussian Processes

Gaussian processes provide a flexible framework for forecasting, removing noise, and interpreting long temporal datasets. State space modelling (Kalman filtering) enables these non-parametric models to be deployed on long datasets by reducing the complexity to linear in the number of data points. The complexity is still cubic in the state dimension m, which is an impediment to practical application. In certain special cases (Gaussian likelihood, regular spacing) the GP posterior will reach a steady posterior state when the data are very long. We leverage this and formulate an inference scheme for GPs with general likelihoods, where inference is based on single-sweep EP (assumed density filtering). The infinite-horizon model tackles the cubic cost in the state dimensionality, reducing the cost per data point to O(m^2). The model is extended to online learning of hyperparameters. We show examples for large finite-length modelling problems, and present how the method runs in real-time on a smartphone on a continuous data stream updated at 100 Hz.

Dimensionality Reduction for Stationary Time Series via Stochastic Nonconvex Optimization

Stochastic optimization naturally arises in machine learning. Efficient algorithms with provable guarantees, however, are still largely missing, when the objective function is nonconvex and the data points are dependent. This paper studies this fundamental challenge through a streaming PCA problem for stationary time series data. Specifically, our goal is to estimate the principal component of time series data with respect to the covariance matrix of the stationary distribution. Computationally, we propose a variant of Oja's algorithm combined with downsampling to control the bias of the stochastic gradient caused by the data dependency. Theoretically, we quantify the uncertainty of our proposed stochastic algorithm based on diffusion approximations. This allows us to prove the asymptotic rate of convergence and further implies near optimal asymptotic sample complexity. Numerical experiments are provided to support our analysis.
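
A minimal sketch of Oja's rule with downsampling, as a simple stand-in for the idea described above (step size and gap are illustrative constants, not the paper's schedule):

```python
import numpy as np

def oja_downsampled(stream, dim, eta=0.01, gap=10):
    """Streaming PCA via Oja's rule, using only every `gap`-th sample to
    weaken temporal dependence in the data stream."""
    w = np.random.randn(dim)
    w /= np.linalg.norm(w)
    for t, x in enumerate(stream):
        if t % gap:                    # skip dependent, nearby samples
            continue
        w += eta * x * (x @ w)         # Oja update: w += eta * x x^T w
        w /= np.linalg.norm(w)         # project back to the unit sphere
    return w
```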

Sequence-to-Segment Networks for Segment Detection

Detecting segments of interest from an input sequence is a challenging problem which often requires not only good knowledge of individual target segments, but also contextual understanding of the entire input sequence and the relationships between the target segments. To address this problem, we propose the Sequence-to-Segment Network (S$^2$N), a novel end-to-end sequential encoder-decoder architecture. S$^2$N first encodes the input into a sequence of hidden states that progressively capture both local and holistic information. It then employs a novel decoding architecture, called Segment Detection Unit (SDU), that integrates the decoder state and encoder hidden states to detect segments sequentially. During training, we formulate the assignment of predicted segments to ground truth as bipartite matching and use the Earth Mover's Distance to calculate the localization errors. We experiment with S$^2$N on temporal action proposal generation and video summarization and show that S$^2$N achieves state-of-the-art performance on both tasks.

Scaling the Poisson GLM to massive neural datasets through polynomial approximations

Recent advances in recording technologies have allowed neuroscientists to record simultaneous spiking activity from hundreds to thousands of neurons in multiple brain regions. Such large-scale recordings pose a major challenge to existing statistical methods for neural data analysis. Here we develop highly scalable approximate inference methods for Poisson generalized linear models (GLMs) that allow for efficient and regularized estimation of high-dimensional GLM parameters using a single pass over the data. Our approach relies on a recently proposed method for obtaining global polynomial approximate sufficient statistics \cite{huggins2017pass}, which we adapt to the Poisson GLM setting. First, we consider a quadratic approximation to the Poisson GLM log-likelihood and derive closed-form solutions for the approximate maximum likelihood and MAP estimates, posterior distribution, and marginal likelihood. We show that the approximation allows for efficient regularization via Gaussian evidence optimization for hyperparameters governing shrinkage, smoothness, and sparsity of GLM weights. Second, we consider an estimator based on a fourth order approximation to the log-likelihood, which improves accuracy of the estimator albeit at increased computational cost and a loss of closed-form expressions for approximate Bayesian inference. We validate the quadratic and fourth order estimators using simulations and medium-scale spike train recordings from primate retina. Finally, we use the highly scalable quadratic estimator to fit a fully-coupled Poisson GLM to spike train data recorded from 831 neurons across five regions of the mouse brain for a duration of 44 minutes, binned at 1 ms resolution, using a single pass over the data. Across all neurons, this model is fit to over $2$ billion spike count bins and has $831^2 \approx 691$K coupling filters revealing fine-timescale statistical dependencies between neurons within and across cortical and subcortical areas.
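
A sketch of the single-pass quadratic-approximation idea: replace exp(z) in the Poisson log-likelihood by a quadratic a + b z + c z^2 fitted over the plausible range of filter outputs, so the ridge-regularized MAP estimate reduces to one linear solve over accumulable sufficient statistics. The fitting range and ridge strength are illustrative assumptions:

```python
import numpy as np

def poisson_glm_quadratic(X, y, z_range=(-5.0, 5.0), lam=1.0):
    """Approximate MAP for a Poisson GLM with exp link. With
    exp(z) ~ a + b*z + c*z**2, the log-likelihood becomes
    w'(sum (y_i - b) x_i) - c * w'(sum x_i x_i') w + const,
    so the regularized maximizer is a single linear solve."""
    z = np.linspace(*z_range, 200)
    c, b, a = np.polyfit(z, np.exp(z), deg=2)   # highest degree first
    XtX = X.T @ X                               # sufficient statistics,
    Xty = X.T @ (y - b)                         # accumulable in one pass
    return np.linalg.solve(2 * c * XtX + lam * np.eye(X.shape[1]), Xty)
```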

Multiplicative Weights Updates with Constant Step-Size in Graphical Constant-Sum Games

Since Multiplicative Weights (MW) updates are the discrete analogue of the continuous Replicator Dynamics (RD), some researchers had expected their qualitative behaviours to be similar. We show that this is false in the context of graphical constant-sum games, which include two-person zero-sum games as special cases. In such games which have a fully-mixed Nash Equilibrium (NE), RD have the permanence and Poincare recurrence properties, but we show that MW updates with constant step-size eps do not. We show that the regret of RD is O(1/T); for MW updates, we prove a regret lower bound of Omega( 1 / (eps T) ). For the regret results, we adopt a dynamical-system perspective instead of the now popular optimization perspective. Interestingly, the regret perspective can be useful for better understanding the behaviours of MW updates. In a two-person zero-sum game with a unique NE which is fully mixed, we show, via regret, that for any sufficiently small eps, there exists h>0 such that at least two probabilities rise above h infinitely often but also get arbitrarily close to zero infinitely often.
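
A toy illustration of the contrast drawn here (not the paper's proof): constant-step MW in matching pennies, a zero-sum game with a fully mixed NE, spirals away from the equilibrium rather than staying near it. The step size and initial strategies are illustrative:

```python
import numpy as np

A = np.array([[1.0, -1.0], [-1.0, 1.0]])   # matching pennies, NE = (1/2, 1/2)
eps = 0.1                                  # constant step size
p = np.array([0.6, 0.4])                   # row player's mixed strategy
q = np.array([0.5, 0.5])                   # column player's mixed strategy
for _ in range(2000):
    p_new = p * np.exp(eps * (A @ q))      # MW update for the row player
    q_new = q * np.exp(-eps * (A.T @ p))   # column player minimizes
    p, q = p_new / p_new.sum(), q_new / q_new.sum()
print(p, q)  # drifts toward the simplex boundary, not toward (1/2, 1/2)
```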

Why Is My Classifier Discriminatory?

Recent attempts to achieve fairness in predictive models focus on the balance between fairness and accuracy. In sensitive applications such as healthcare or criminal justice, this trade-off is often undesirable, as any increase in prediction error could have devastating consequences. In this work, we argue that the fairness of predictions should be evaluated in the context of the data, and that unfairness induced by inadequate sample sizes or unmeasured predictive variables should be addressed through data collection, rather than by constraining the model. We decompose cost-based metrics of discrimination into bias, variance, and noise, and propose actions aimed at estimating and reducing each term. Finally, we perform case studies on prediction of income, mortality, and review ratings, confirming the value of this analysis. We find that data collection is often a means to reduce discrimination without sacrificing accuracy.

Multi-Layered Gradient Boosting Decision Trees

Multi-layered representation is believed to be the key ingredient of deep neural networks, especially in cognitive tasks like computer vision. While non-differentiable models such as gradient boosting decision trees (GBDTs) are the dominant methods for modeling discrete or tabular data, it is hard to endow them with such representation-learning ability. In this work, we propose multi-layered GBDT forests (mGBDTs), with an explicit emphasis on exploring the ability to learn hierarchical representations by stacking several layers of regression GBDTs as building blocks. The model can be jointly trained by a variant of target propagation across layers, without the need for back-propagation or differentiability. Experiments and visualizations confirm the effectiveness of the model in terms of performance and representation learning ability.

Learn What Not to Learn: Action Elimination with Deep Reinforcement Learning

Learning how to act when there are many available actions in each state is a challenging task for Reinforcement Learning (RL) agents, especially when many of the actions are redundant or irrelevant. In such cases, it is easier to learn which actions not to take. In this work, we propose the Action-Elimination Deep Q-Network (AE-DQN) architecture that combines a Deep RL algorithm with an Action Elimination Network (AEN) that eliminates sub-optimal actions. The AEN is trained to predict invalid actions, supervised by an external elimination signal provided by the environment. Simulations demonstrate a considerable speedup and added robustness over vanilla DQN in text-based games with over a thousand discrete actions.

Communication Efficient Parallel Algorithms for Optimization on Manifolds

The last decade has witnessed an explosion in the development of models, theory and computational algorithms for ``big data'' analysis. In particular, distributed inference has served as a natural and dominating paradigm for statistical inference. However, the existing literature on parallel inference almost exclusively focuses on Euclidean data and parameters. While this assumption is valid for many applications, it is increasingly common to encounter problems where the data or the parameters lie on a non-Euclidean space, such as a manifold. Our work aims to fill a critical gap in the literature by generalizing parallel inference algorithms to optimization on manifolds. We show that our proposed algorithm is both communication efficient and carries theoretical convergence guarantees. In addition, we demonstrate the performance of our algorithm on the estimation of Fréchet means on simulated spherical data and on the low-rank matrix completion problem over Grassmann manifolds applied to the Netflix prize data set.
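
As a small concrete instance of manifold optimization, a Riemannian gradient-descent estimate of the Fréchet mean on the unit sphere, roughly the local step each worker would run before results are combined (step size, retraction choice, and iteration count are illustrative):

```python
import numpy as np

def sphere_log(x, y):
    """Log map on the unit sphere: tangent vector at x pointing to y."""
    c = np.clip(x @ y, -1.0, 1.0)
    theta = np.arccos(c)
    if theta < 1e-12:
        return np.zeros_like(x)
    return theta / np.sin(theta) * (y - c * x)

def frechet_mean(points, steps=100, eta=0.5):
    """Minimize the mean squared geodesic distance by Riemannian GD,
    using projection back to the sphere as a cheap retraction."""
    x = points[0] / np.linalg.norm(points[0])
    for _ in range(steps):
        g = np.mean([sphere_log(x, p) for p in points], axis=0)
        x = x + eta * g              # step in the tangent space,
        x /= np.linalg.norm(x)       # then renormalize onto the sphere
    return x
```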

Neural Code Comprehension: A Learnable Representation of Code Semantics

With the recent success of embeddings in natural language processing, research has been conducted into applying similar methods to code analysis. Most works attempt to process the code directly or use a syntactic tree representation, treating it like sentences written in a natural language. However, none of the existing methods are sufficient to comprehend program semantics robustly, due to structural features such as function calls, branching, and interchangeable order of statements. In this paper, we propose a novel processing technique to learn code semantics, and apply it to a variety of program analysis tasks. In particular, we stipulate that a robust distributional hypothesis of code applies to both human- and machine-generated programs. Following this hypothesis, we define an embedding space, inst2vec, based on an Intermediate Representation (IR) of the code that is independent of the source programming language. We provide a novel definition of contextual flow for this IR, leveraging both the underlying data- and control-flow of the program. We then analyze the embeddings qualitatively using analogies and clustering, and evaluate the learned representation on three different high-level tasks. We show that with a single RNN architecture and pre-trained fixed embeddings, inst2vec outperforms specialized approaches for performance prediction (compute device mapping, optimal thread coarsening); and algorithm classification from raw code (104 classes), where we set a new state-of-the-art.

Tight Bounds for Collaborative PAC Learning via Multiplicative Weights

We study the collaborative PAC learning problem recently proposed in Blum et al.\cite{BHPQ17}, in which we have $k$ players and they want to learn a target function collaboratively, such that the learned function approximates the target function well on all players' distributions simultaneously. The quality of the collaborative learning algorithm is measured by the ratio between the sample complexity of the algorithm and that of the learning algorithm for a single distribution (called the overhead). We obtain a collaborative learning algorithm with overhead $O(\ln k)$, improving the one with overhead $O(\ln^2 k)$ in \cite{BHPQ17}. We also show that an $\Omega(\ln k)$ overhead is inevitable when $k$ is polynomially bounded in the VC dimension of the hypothesis class. Finally, our experimental study has demonstrated the superiority of our algorithm compared with the one in Blum et al.\cite{BHPQ17} on real-world datasets.

BinGAN: Learning Compact Binary Descriptors with a Regularized GAN

In this paper, we propose a novel regularization method for Generative Adversarial Networks that allows the model to learn discriminative yet compact binary representations of image patches (image descriptors). We exploit the dimensionality reduction that takes place in the intermediate layers of the discriminator network and train the binarized penultimate layer's low-dimensional representation to mimic the distribution of the higher-dimensional preceding layers. To achieve this, we introduce two loss terms that aim at: (i) reducing the correlation between the dimensions of the binarized penultimate layer's low-dimensional representation (i.e. maximizing joint entropy) and (ii) propagating the relations between the dimensions in the high-dimensional space to the low-dimensional space. We evaluate the resulting binary image descriptors on two challenging applications, image matching and retrieval, where they achieve state-of-the-art results.

Modern Neural Networks Generalize on Small Data Sets

In this paper, we use a linear program to empirically decompose fitted neural networks into ensembles of low-bias sub-networks. We show that these sub-networks are relatively uncorrelated, which induces an internal regularization process, much like in a random forest, and helps explain why a neural network is surprisingly resistant to overfitting. We then demonstrate this in practice by applying large neural networks, with hundreds of parameters per training observation, to a collection of 116 real-world data sets from the UCI Machine Learning Repository. This collection of data sets contains a much smaller number of training examples than the types of image classification tasks generally studied in the deep learning literature, as well as non-trivial label noise. We show that even in this setting deep neural nets are capable of achieving superior classification accuracy without overfitting.

Escaping Saddle Points in Constrained Optimization

In this paper, we focus on escaping from saddle points in smooth nonconvex optimization problems subject to a convex set $\mathcal{C}$. We propose a generic framework that yields convergence to a second-order stationary point of the problem, if the convex set $\mathcal{C}$ is simple for a quadratic objective function. To be more precise, our results hold if one can find a $\rho$-approximate solution of a quadratic program subject to $\mathcal{C}$ in polynomial time, where $\rho<1$ is a positive constant that depends on the structure of the set $\mathcal{C}$. Under this condition, we show that the sequence of iterates generated by the proposed framework reaches an $(\epsilon,\gamma)$-second order stationary point (SOSP) in at most $\mathcal{O}(\max\{\epsilon^{-2},\rho^{-3}\gamma^{-3}\})$ iterations. We further characterize the overall arithmetic operations to reach an SOSP when the convex set $\mathcal{C}$ can be written as a set of quadratic constraints. Finally, we extend our results to the stochastic setting and characterize the number of stochastic gradient and Hessian evaluations to reach an $(\epsilon,\gamma)$-SOSP.

Adversarial Attacks on Stochastic Bandits

We study adversarial attacks that manipulate the reward signals to control the actions chosen by a stochastic multi-armed bandit algorithm. We propose the first attack against two popular bandit algorithms: $\epsilon$-greedy and UCB, \emph{without} knowledge of the mean rewards. The attacker is able to spend only logarithmic effort, multiplied by a problem-specific parameter that becomes smaller as the bandit problem gets easier to attack. The result means the attacker can easily hijack the behavior of the bandit algorithm to promote or obstruct certain actions, say, a particular medical treatment. As bandits are seeing increasingly wide use in practice, our study exposes a significant security threat.
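
A toy sketch of the threat model: the attacker perturbs the reward observed by an epsilon-greedy learner so that every non-target arm looks bad. The attack sizes and environment here are illustrative, not the paper's carefully budgeted construction:

```python
import numpy as np

rng = np.random.default_rng(0)
means = np.array([0.9, 0.5, 0.2])   # true arm means; arm 2 is the worst
target = 2                          # arm the attacker wants pulled
counts, sums = np.zeros(3), np.zeros(3)
for t in range(1, 5001):
    if rng.random() < 0.1:                      # epsilon-greedy exploration
        arm = rng.integers(3)
    else:                                       # unpulled arms get +inf
        arm = np.argmax(np.where(counts > 0,
                                 sums / np.maximum(counts, 1), np.inf))
    reward = rng.normal(means[arm], 0.1)
    if arm != target:
        reward -= 1.0                           # attacker corrupts the signal
    counts[arm] += 1
    sums[arm] += reward
print(counts / counts.sum())  # nearly all pulls go to the attacker's target
```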

Optimal Subsampling with Influence Functions

Subsampling is a common and often effective method to deal with the computational challenges of large datasets. However, for most statistical models, there is no well-motivated approach for drawing a non-uniform subsample. We show that the concept of an asymptotically linear estimator and the associated influence function leads to asymptotically optimal sampling probabilities for a wide class of popular models. This is the only tight optimality result for subsampling we are aware of, as other methods only provide probabilistic error bounds or optimal rates. Furthermore, for linear regression models, which have well-studied procedures for non-uniform subsampling, we empirically show that our optimal influence-function-based method outperforms previous approaches even when using approximations to the optimal probabilities.
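
A sketch of the idea for ordinary least squares, where the influence of observation i on the coefficient estimate is (X'X)^{-1} x_i e_i, so sampling with probability proportional to its norm targets the influential points; the pilot fit on the full data is a simplification of what a practical scheme would do:

```python
import numpy as np

def influence_subsample(X, y, m, rng=np.random.default_rng(0)):
    """Draw a size-m subsample with probabilities proportional to the
    norm of each point's influence on the OLS coefficient estimate."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]     # pilot fit
    resid = y - X @ beta
    XtX_inv = np.linalg.inv(X.T @ X)
    infl = np.linalg.norm((X @ XtX_inv) * resid[:, None], axis=1)
    probs = infl / infl.sum()
    idx = rng.choice(len(y), size=m, replace=True, p=probs)
    return idx, 1.0 / (len(y) * probs[idx])   # indices + inverse-prob weights
```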

Equality of Opportunity in Classification: A Causal Approach

Equalized Odds (EO) provides a sensible framework to reason about discrimination against a specified protected group (e.g., gender, race) in supervised learning. However, statistical tests based on the EO, as acknowledged in (Hardt et al. 2016), are oblivious to the true data-generating mechanisms, and thus unable to capture fundamental notions of unfairness such as direct discrimination. This paper introduces a set of novel counterfactual measures that allows one to explain the disparities measured by EO over the underlying mechanisms in an arbitrary causal model. We operationalize these estimands through a practical procedure to obtain an efficient classifier compatible with basic human intuition about fairness.

Towards Understanding Acceleration Tradeoff between Momentum and Asynchrony in Nonconvex Stochastic Optimization

Asynchronous momentum stochastic gradient descent algorithms (Async-MSGD) have been widely used in distributed machine learning, e.g., training large collaborative filtering systems and deep neural networks. Due to current technical limitations, however, establishing convergence properties of Async-MSGD for these highly complicated nonconvex problems is generally infeasible. Therefore, we propose to analyze the algorithm through a simpler but nontrivial nonconvex problem --- streaming PCA. This allows us to make progress toward understanding Async-MSGD and gaining new insights for more general problems. Specifically, by exploiting the diffusion approximation of stochastic optimization, we establish the asymptotic rate of convergence of Async-MSGD for streaming PCA. Our results indicate a fundamental tradeoff between asynchrony and momentum: to ensure convergence and acceleration through asynchrony, we have to reduce the momentum (compared with Sync-MSGD). To the best of our knowledge, this is the first theoretical attempt at understanding Async-MSGD for distributed nonconvex stochastic optimization. Numerical experiments on both streaming PCA and training deep neural networks are provided to support our findings for Async-MSGD.

Unsupervised Attention-guided Image-to-Image Translation

Current unsupervised image-to-image translation techniques struggle to focus their attention on individual objects without altering the background or the way multiple objects interact within a scene. Motivated by the important role of attention in human perception, we tackle this limitation by introducing unsupervised attention mechanisms which are jointly adversarially trained with the generators and discriminators. We empirically demonstrate that our approach is able to attend to relevant regions in the image without requiring any additional supervision, and that by doing so it achieves more realistic mappings compared to recent approaches.

Inferring Networks From Random Walk-Based Node Similarities

Digital presence in the world of online social media entails significant privacy risks \cite{korolova2008link,zheleva2012privacy}. In this work we consider a privacy threat to a social network in which an attacker has access to a subset of random walk-based node similarities, such as effective resistances (i.e., commute times) or personalized PageRank scores. Using these similarities, the attacker seeks to infer as much information as possible about the network, including unknown pairwise node similarities and edges. For the effective resistance metric, we show that with just a small subset of measurements, one can learn a large fraction of edges in a social network. We also show that it is possible to learn a graph which accurately matches the underlying network on all other effective resistances. This second observation is interesting from a data mining perspective, since it can be expensive to compute all effective resistances or other random walk-based similarities. As an alternative, our graphs learned from just a subset of effective resistances can be used as surrogates in a range of applications that use effective resistances to probe graph structure, including for graph clustering, node centrality evaluation, and anomaly detection. We obtain our results by formalizing the graph learning objective mathematically, using two optimization problems. One formulation is convex and can be solved provably in polynomial time. The other is not, but we solve it efficiently with projected gradient and coordinate descent. We demonstrate the effectiveness of these methods on a number of social networks obtained from Facebook. We also discuss how our methods can be generalized to other random walk-based similarities, such as personalized PageRank scores.

NEON 2: Finding Local Minima via First-Order Oracles

(this is a theory paper) We propose a reduction for non-convex optimization that can (1) turn a stationary-point-finding algorithm into a local-minimum-finding one, and (2) replace the Hessian-vector product computations with only gradient computations. It works in both the stochastic and the deterministic settings, without hurting the algorithm's performance. As applications, our reduction turns Natasha2 into a first-order method without hurting its theoretical performance. It also converts SGD, GD, and SCSG into algorithms finding approximate local minima, outperforming some of the best known results.

Zeroth-Order Stochastic Variance Reduction for Nonconvex Optimization

As application demands for zeroth-order (gradient-free) optimization accelerate, the need for variance reduced and faster converging approaches is also intensifying. This paper addresses these challenges by presenting: a) a comprehensive theoretical analysis of variance reduced zeroth-order (ZO) optimization, b) a novel variance reduced ZO algorithm, called ZO-SVRG, and c) an experimental evaluation of our approach in the context of two compelling applications, black-box chemical material classification and generation of adversarial examples from black-box deep neural network models. Our theoretical analysis uncovers an essential difficulty in the analysis of ZO-SVRG: the unbiased assumption on gradient estimates no longer holds. We prove that compared to its first-order counterpart, ZO-SVRG with a two-point random gradient estimator suffers an additional error of order $O(1/b)$, where $b$ is the mini-batch size. To mitigate this error, we propose two accelerated versions of ZO-SVRG utilizing variance reduced gradient estimators, which achieve the best rate known for ZO stochastic optimization (in terms of iterations). Our extensive experimental results show that our approaches outperform other state-of-the-art ZO algorithms, and strike a balance between the convergence rate and the function query complexity.
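
The two-point random gradient estimator the analysis centres on, in a minimal form (the smoothing radius `mu` is an illustrative constant):

```python
import numpy as np

def zo_two_point_grad(f, x, mu=1e-4, rng=np.random.default_rng(0)):
    """Two-point zeroth-order gradient estimate:
    g = d/(2*mu) * (f(x + mu*u) - f(x - mu*u)) * u, u uniform on the sphere.
    It is unbiased only for a smoothed surrogate of f, not for f itself,
    which is why the unbiasedness assumption in SVRG-style analyses fails."""
    d = x.size
    u = rng.normal(size=d)
    u /= np.linalg.norm(u)
    return d * (f(x + mu * u) - f(x - mu * u)) / (2 * mu) * u

f = lambda x: 0.5 * np.sum(x ** 2)      # toy objective with gradient x
print(zo_two_point_grad(f, np.ones(10)))  # noisy estimate of grad f = x
```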

Online Structured Laplace Approximations For Overcoming Catastrophic Forgetting

We introduce the Kronecker factored online Laplace approximation for overcoming catastrophic forgetting in neural networks. The method is grounded in a Bayesian online learning framework, where we recursively approximate the posterior after every task with a Gaussian, leading to a quadratic penalty on changes to the weights. The Laplace approximation requires calculating the Hessian around a mode, which is typically intractable for modern architectures. In order to make our method scalable, we leverage recent block-diagonal Kronecker factored approximations to the curvature. Our algorithm achieves over 90% test accuracy across a sequence of 50 instantiations of the permuted MNIST dataset, substantially outperforming related methods for overcoming catastrophic forgetting.
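
A sketch of the recursive quadratic penalty, shown with a diagonal curvature approximation for brevity (the paper's contribution is the scalable Kronecker-factored version; names and the weighting `lam` are illustrative):

```python
import torch

def laplace_penalty(model, prev_params, curvature, lam=1.0):
    """Quadratic penalty from a Gaussian (Laplace) posterior over past
    tasks: lam/2 * sum_i H_ii * (theta_i - theta_i*)^2, where `curvature`
    holds the accumulated per-parameter Hessian approximation."""
    penalty = 0.0
    for name, p in model.named_parameters():
        penalty = penalty + (curvature[name] * (p - prev_params[name]) ** 2).sum()
    return 0.5 * lam * penalty

# Per-task training objective: task_loss + laplace_penalty(model, ...)
```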

DeepProbLog: Neural Probabilistic Logic Programming

We introduce DeepProbLog, a probabilistic logic programming language that incorporates deep learning by means of neural predicates. We show how existing inference and learning techniques can be adapted for the new language. Our experiments demonstrate that DeepProbLog supports (i) both symbolic and subsymbolic representations and inference, (ii) program induction, (iii) probabilistic (logic) programming, and (iv) (deep) learning from examples. To the best of our knowledge, this work is the first to propose a framework where general-purpose neural networks and expressive probabilistic-logical modeling and reasoning are integrated in a way that exploits the full expressiveness and strengths of both worlds and can be trained end-to-end based on examples.

Convergence of Cubic Regularization for Nonconvex Optimization under KL Property

Cubic-regularized Newton's method (CR) is a popular algorithm that is guaranteed to produce a second-order stationary solution for nonconvex optimization problems. However, existing understandings of the convergence rate of CR are conditioned on special types of geometrical properties of the objective function. In this paper, we explore the asymptotic convergence rate of CR by exploiting the ubiquitous Kurdyka-Lojasiewicz (KL) property of nonconvex objective functions. Specifically, we characterize the asymptotic convergence rate of various types of optimality measures for CR, including the function value gap, the variable distance gap, the gradient norm, and the least eigenvalue of the Hessian matrix. Our results fully characterize the diverse convergence behaviors of these optimality measures in the full parameter regime of the KL property. Moreover, we show that the obtained asymptotic convergence rates of CR are order-wise faster than those of first-order gradient descent algorithms under the KL property.

Direct Estimation of Differences in Causal Graphs

We consider the problem of estimating the differences between two causal directed acyclic graph (DAG) models given i.i.d. samples from each model. This is of interest for example in genomics, where changes in the structure or edge weights of the underlying causal graphs reflect alterations in the gene regulatory networks. We here provide the first provably consistent method for directly estimating the differences in a pair of causal DAGs without separately learning two possibly large and dense DAG models and computing their difference. Our two-step algorithm first uses invariance tests between regression coefficients of the two data sets to estimate the skeleton of the difference graph and then orients some of the edges using invariance tests between regression residual variances. We demonstrate the properties of our method through a simulation study and apply it to the analysis of gene expression data from ovarian cancer and during T-cell activation.

Sublinear Time Low-Rank Approximation of Distance Matrices

Let $P = \{ p_1, p_2, \ldots, p_n \}$ and $Q = \{ q_1, q_2, \ldots, q_m \}$ be two point sets in an arbitrary metric space. Let $A$ represent the $m\times n$ pairwise distance matrix with $A_{i,j} = d(p_i, q_j)$. Such distance matrices are commonly computed in software packages and have applications to learning image manifolds, handwriting recognition, and multi-dimensional unfolding, among other things. In an attempt to reduce their description size, we study low rank approximation of such matrices. Our main result is to show that for any underlying distance metric $d$, it is possible to achieve an additive error low rank approximation in sublinear time. We note that it is provably impossible to achieve such a guarantee in sublinear time for arbitrary matrices $A$, and our proof exploits special properties of distance matrices. We develop a recursive algorithm based on additive projection-cost preserving sampling. We then show that in general, relative error approximation in sublinear time is impossible for distance matrices, even if one allows for bicriteria solutions. Additionally, we show that if $P = Q$ and $d$ is the squared Euclidean distance, which is not a metric but rather the square of a metric, then a relative error bicriteria solution can be found in sublinear time. Finally, we empirically compare our algorithm with the SVD and input sparsity time algorithms. Our algorithm is several hundred times faster than the SVD, and about $8$-$20$ times faster than input sparsity methods on real-world and synthetic datasets of size $10^8$. Accuracy-wise, our algorithm is only slightly worse than that of the SVD (optimal) and input-sparsity time algorithms.

Variational PDEs for Acceleration on Manifolds and Application to Diffeomorphisms

We consider the optimization of cost functionals on manifolds and derive a variational approach to accelerated methods on manifolds. We demonstrate the methodology on the infinite-dimensional manifold of diffeomorphisms, motivated by registration problems in computer vision. We build on the variational approach to accelerated optimization by Wibisono, Wilson and Jordan, which applies in finite dimensions, and generalize that approach to infinite dimensional manifolds. We derive the continuum evolution equations, which are partial differential equations (PDE), and relate them to simple mechanical principles. Our approach can also be viewed as a generalization of the $L^2$ optimal mass transport problem. Our approach evolves an infinite number of particles endowed with mass, represented as a mass density. The density evolves with the optimization variable, and endows the particles with dynamics. This differs from current accelerated methods, where only a single particle moves and hence the dynamics do not depend on the mass. We derive the theory, compute the PDEs for acceleration, and illustrate the behavior of this new accelerated optimization scheme.

Bayesian Inference of Temporal Task Specifications from Demonstrations

When observing task demonstrations, human apprentices are able to identify whether a given task is executed correctly long before they gain expertise in actually performing that task. Prior research into learning from demonstrations (LfD) has failed to capture this notion of the acceptability of an execution; meanwhile, temporal logics provide a flexible language for expressing task specifications. Inspired by this, we present Bayesian specification inference, a probabilistic model for inferring task specification as a temporal logic formula. We incorporate methods from probabilistic programming to define our priors, along with a domain-independent likelihood function to enable sampling-based inference. We demonstrate the efficacy of our model for inferring true specifications with over 90% similarity between the inferred specification and the ground truth, both within a synthetic domain and a real-world table setting task.

Data center cooling using model-predictive control

Despite impressive recent advances in reinforcement learning (RL), its deployment in real-world physical systems is often complicated by unexpected events, limited data, and the potential for expensive failures. In this paper, we describe an application of RL “in the wild” to the task of regulating temperatures and airflow inside a large-scale data center (DC). Adopting a data-driven, model-based approach, we demonstrate that an RL agent with little prior knowledge is able to effectively and safely regulate conditions on a server floor after just a few hours of exploration, while improving operational efficiency relative to existing PID controllers.

Acceleration through Optimistic No-Regret Dynamics

We consider the problem of minimizing a smooth convex function by reducing the optimization to computing the Nash equilibrium of a particular zero-sum convex-concave game. Zero-sum games can be solved using no-regret learning dynamics, and the standard approach leads to a rate of $O(1/T)$. But we are able to show that the game can be solved at a rate of $O(1/T^2)$, extending recent works of \cite{RS13,SALS15} by using \textit{optimistic learning} to speed up equilibrium computation. The optimization algorithm that we can extract from this equilibrium reduction coincides \textit{exactly} with the well-known Nesterov accelerated gradient method \cite{N83a}, and indeed the same story allows us to recover several variants of Nesterov's algorithm via small tweaks. This methodology unifies a number of different iterative optimization methods: we show that the Heavy Ball algorithm is precisely the non-optimistic variant of Nesterov's method, and recent prior work already established a similar perspective on Frank-Wolfe \cite{AW17,ALLW18}.

Minimax Estimation of Neural Net Distance

An important class of distance metrics proposed for training generative adversarial networks (GANs) is the integral probability metric (IPM), in which the neural net distance captures practical GAN training via two neural networks. This paper investigates the minimax estimation problem of the neural net distance based on samples drawn from the distributions. We develop the first known minimax lower bound on the estimation error of the neural net distance, and an upper bound tighter than an existing bound on the estimation error of the empirical neural net distance. Our lower and upper bounds match not only in the order of the sample size but also in terms of the norm of the parameter matrices of neural networks, which justifies the empirical neural net distance as a good approximation of the true neural net distance for training GANs in practice.

Leveraging the Exact Likelihood of Deep Latent Variable Models

Deep latent variable models (DLVMs) combine the approximation abilities of deep neural networks and the statistical foundations of generative models. Variational methods are commonly used for inference; however, the exact likelihood of these models has been largely overlooked. The purpose of this work is to study the general properties of this quantity and to show how they can be leveraged in practice. We focus on important inferential problems that rely on the likelihood: estimation and missing data imputation. First, we investigate maximum likelihood estimation for DLVMs: in particular, we show that most unconstrained models used for continuous data have an unbounded likelihood function. This problematic behaviour is demonstrated to be a source of mode collapse. We also show how to ensure the existence of maximum likelihood estimates, and draw useful connections with nonparametric mixture models. Finally, we describe an algorithm for missing data imputation using the exact conditional likelihood of a deep latent variable model. On several data sets, our algorithm consistently and significantly outperforms the usual imputation scheme used for DLVMs.

Bipartite Stochastic Block Models with Tiny Clusters

We study the problem of finding planted clusters in bipartite graphs. We present a simple two-step algorithm which provably finds even tiny clusters of size O(n^ε), where n is the number of vertices in the graph and ε > 0. Previous algorithms were only able to identify clusters of size Ω( sqrt(n) ). We evaluated the algorithm on synthetic and on real-world data; the experiments show that the algorithm can find extremely small clusters even in the presence of high destructive noise.

Learning sparse neural networks via sensitivity-driven regularization

The ever-increasing number of parameters in deep neural networks poses challenges for memory-limited applications. Regularize-and-prune methods aim at meeting these challenges by sparsifying the network weights. In this context, we quantify the output sensitivity to the parameters (i.e., their relevance to the network output) and introduce a regularization term that gradually lowers the absolute value of parameters with low sensitivity. Thus, a very large fraction of the parameters approach zero and are eventually set to zero by simple thresholding. Our method surpasses recent techniques in terms of both sparsity and error rates. In some cases, the method reaches twice the sparsity obtained by other techniques at equal error rates.
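
A rough sketch of the idea, assuming we are given the loss gradient and per-parameter output sensitivities $|\partial y / \partial w_i|$ (the `sensitivity_penalty_update` helper and its constants are illustrative, not the paper's exact regularizer):

```python
import numpy as np

def sensitivity_penalty_update(w, grad_loss, output_grad_w,
                               lam=0.5, lr=0.1, thresh=1e-3):
    """One descent step with a sensitivity-weighted L1-style penalty:
    parameters with low output sensitivity are pushed harder toward zero."""
    s = np.abs(output_grad_w)
    s = s / (s.max() + 1e-12)                     # normalized sensitivity in [0, 1]
    w = w - lr * (grad_loss + lam * np.sign(w) * (1.0 - s))
    w[np.abs(w) < thresh] = 0.0                   # final pruning by thresholding
    return w

w = np.array([1.0, -0.5, 0.2])
w = sensitivity_penalty_update(w, grad_loss=np.zeros(3),
                               output_grad_w=np.array([5.0, 0.01, 0.01]))
print(w)  # the two low-sensitivity weights move toward zero; the first barely moves
```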

Faster Online Learning of Optimal Threshold for Consistent F-measure Optimization

In this paper, we consider online F-measure optimization (OFO). Unlike traditional performance metrics (e.g., classification error rate), the F-measure is non-decomposable over training examples and is a non-convex function of model parameters, making it much more difficult to optimize in an online fashion. Most existing results on OFO suffer from high memory/computational costs and/or lack a statistical consistency guarantee for optimizing the F-measure at the population level. To advance OFO, we propose an efficient online algorithm based on simultaneously learning a posterior class probability and learning an optimal threshold by minimizing a stochastic strongly convex function with unknown strong convexity parameter. A key component of the proposed method is a novel stochastic algorithm with low memory and computational costs, which enjoys a convergence rate of $\widetilde O(1/\sqrt{n})$ for learning the optimal threshold under a mild condition on the convergence of the posterior probability, where $n$ is the number of processed examples. It is provably faster than its predecessor based on a heuristic for updating the threshold. Experiments verify the efficiency of the proposed algorithm in comparison with state-of-the-art OFO algorithms.

Adversarial Examples that Fool both Computer Vision and Time-Limited Humans

Machine learning models are vulnerable to adversarial examples: small changes to images can cause computer vision models to make mistakes such as identifying a school bus as an ostrich. However, it is still an open question whether humans are prone to similar mistakes. Here, we address this question by leveraging recent techniques that transfer adversarial examples from computer vision models with known parameters and architecture to other models with unknown parameters and architecture, and by matching the initial processing of the human visual system. We find that adversarial examples that strongly transfer across computer vision models influence the classifications made by time-limited human observers.

Stochastic Nested Variance Reduced Gradient Descent for Nonconvex Optimization

We study finite-sum nonconvex optimization problems, where the objective function is an average of $n$ nonconvex functions. We propose a new stochastic gradient descent algorithm based on nested variance reduction. Compared with the conventional stochastic variance reduced gradient (SVRG) algorithm, which uses two reference points to construct a semi-stochastic gradient with diminishing variance in each epoch, our algorithm uses $K+1$ nested reference points to build a semi-stochastic gradient to further reduce its variance in each epoch. For smooth functions, the proposed algorithm converges to an approximate first-order stationary point (i.e., $\|\nabla F(x)\|_2\leq \epsilon$) within $\widetilde O(n\land \epsilon^{-2}+\epsilon^{-3}\land n^{1/2}\epsilon^{-2})$ stochastic gradient evaluations (where $\widetilde O(\cdot)$ hides logarithmic factors), $n$ is the number of component functions, and $\epsilon$ is the optimization error. This improves the best known gradient complexity of SVRG, $O(n+n^{2/3}\epsilon^{-2})$, and the best gradient complexity of SCSG, $O(\epsilon^{-5/3}\land n^{2/3}\epsilon^{-2})$. For gradient-dominated functions, our algorithm achieves $\widetilde O(n\land \tau\epsilon^{-1}+\tau\cdot (n^{1/2}\land (\tau\epsilon^{-1})^{1/2}))$ gradient complexity, which again beats the existing best gradient complexity $\widetilde O(n\land \tau\epsilon^{-1}+\tau\cdot (n^{1/2}\land (\tau\epsilon^{-1})^{2/3}))$ achieved by SCSG. Thorough experimental results on different nonconvex optimization problems back up our theory.
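
For contrast, the standard SVRG construction that the nested scheme generalizes builds its semi-stochastic gradient from the current iterate and one fixed reference point per epoch (standard SVRG, not the $K+1$-point variant proposed here):

```python
import numpy as np

def svrg_epoch(component_grads, x, lr=0.05, inner_steps=50, rng=None):
    """One SVRG epoch; component_grads is a list of per-example gradient functions."""
    rng = rng or np.random.default_rng(0)
    n = len(component_grads)
    x_ref = x.copy()                                  # reference point x~
    mu = sum(g(x_ref) for g in component_grads) / n   # full gradient at x~
    for _ in range(inner_steps):
        g = component_grads[rng.integers(n)]
        v = g(x) - g(x_ref) + mu                      # variance-reduced gradient
        x = x - lr * v
    return x

# Least squares: f_i(x) = 0.5 * (a_i.x - y_i)^2, so grad f_i(x) = a_i (a_i.x - y_i).
rng = np.random.default_rng(1)
A, b = rng.normal(size=(100, 5)), rng.normal(size=100)
grads = [lambda x, a=a, y=y: a * (a @ x - y) for a, y in zip(A, b)]
x = np.zeros(5)
for _ in range(20):
    x = svrg_epoch(grads, x)
```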

Faster Neural Networks Straight from JPEG

The simple, elegant approach of training convolutional neural networks (CNNs) directly from RGB pixels has enjoyed overwhelming empirical success. But can more performance be squeezed out of networks by using different input representations? In this paper we propose and explore a simple idea: train CNNs directly on the blockwise discrete cosine transform (DCT) coefficients computed and available in the middle of the JPEG codec. Intuitively, when processing JPEG images using CNNs, it seems unnecessary to decompress a blockwise frequency representation to an expanded pixel representation, shuffle it from CPU to GPU, and then process it with a CNN that will learn something similar to a transform back to the frequency representation in its first layers. Why not skip both steps and feed the frequency domain into the network directly? To that end, we modify libjpeg to produce DCT coefficients directly, modify a ResNet-50 network to accommodate the differently sized and strided input, and evaluate performance on ImageNet. We find networks that are both faster and more accurate, as well as networks with about the same accuracy but 1.77x faster than ResNet-50.
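
The input representation in question can be sketched without touching libjpeg, e.g. with SciPy's DCT on 8x8 blocks (no quantization, chroma subsampling, or entropy coding, so only a rough stand-in for the codec's internals):

```python
import numpy as np
from scipy.fft import dctn

def blockwise_dct(image, block=8):
    """JPEG-style blockwise 2D DCT of a grayscale image with values in [0, 255]."""
    h, w = image.shape
    h, w = h - h % block, w - w % block                   # crop to full blocks
    out = np.empty((h // block, w // block, block, block))
    for i in range(0, h, block):
        for j in range(0, w, block):
            out[i // block, j // block] = dctn(
                image[i:i + block, j:j + block] - 128.0,  # JPEG level shift
                norm="ortho")
    return out  # frequency-domain tensor a CNN could consume directly

coeffs = blockwise_dct(np.random.default_rng(0).uniform(0, 255, size=(64, 64)))
print(coeffs.shape)  # (8, 8, 8, 8)
```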

TopRank: A practical algorithm for online stochastic ranking

Online learning to rank is a sequential decision-making problem where in each round the learning agent chooses a list of items and receives feedback in the form of clicks from the user. Many sample-efficient algorithms have been proposed for this problem that assume a specific click model connecting rankings and user behavior. We propose a generalized click model that encompasses many existing models, including the position-based and cascade models. Our generalization motivates a novel online learning algorithm based on topological sort, which we call TopRank. TopRank (a) is more natural than existing algorithms, (b) has stronger regret guarantees than existing algorithms with comparable generality, (c) has a more insightful proof that leaves the door open to many generalizations, and (d) outperforms existing algorithms empirically.

Learning from discriminative feature feedback

We consider the problem of learning a multi-class classifier from labels as well as simple explanations that we call "discriminative features". We show that such explanations can be provided whenever the target concept is a decision tree, or more generally belongs to a particular subclass of DNF formulas. We present an efficient online algorithm for learning from such feedback and we give tight bounds on the number of mistakes made during the learning process. These bounds depend only on the size of the target concept and not on the overall number of available features, which could be infinite. We also demonstrate the learning procedure experimentally.

RetGK: Graph Kernels based on Return Probabilities of Random Walks

Graph-structured data arise in wide applications, such as computer vision, bioinformatics, and social networks. Quantifying similarities among graphs is a fundamental problem. In this paper, we develop a framework for computing graph kernels, based on return probabilities of random walks. The advantages of our proposed kernels are that they can effectively exploit various node attributes, while being scalable to large datasets. We conduct extensive graph classification experiments to evaluate our graph kernels. The experimental results show that our graph kernels significantly outperform other state-of-the-art approaches in both accuracy and computational efficiency.
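
A minimal sketch of the return-probability node features underlying such kernels (assuming a connected graph; the full method additionally handles node attributes and scalable approximations):

```python
import numpy as np

def return_prob_features(adj, T=10):
    """Node features: probability that a length-t random walk returns to its
    start node, for t = 1..T (a structural, attribute-free signature)."""
    P = adj / adj.sum(axis=1, keepdims=True)   # random-walk transition matrix
    feats = np.empty((adj.shape[0], T))
    Pt = np.eye(adj.shape[0])
    for t in range(T):
        Pt = Pt @ P
        feats[:, t] = np.diag(Pt)
    return feats

adj = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
print(return_prob_features(adj, T=4))
# Graphs can then be compared via a kernel between their node feature sets.
```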

Deep Generative Markov State Models

We propose a deep generative Markov State Model (DeepGenMSM) learning framework for inference of metastable dynamical systems and prediction of trajectories. After unsupervised training on time series data, the model contains (i) a probabilistic encoder that maps from high-dimensional configuration space to a small-sized vector indicating the membership to metastable (long-lived) states, (ii) a Markov chain that governs the transitions between metastable states and facilitates analysis of the long-time dynamics, and (iii) a generative part that samples the conditional distribution of configurations in the next time step. The model can be operated in a recursive fashion to generate trajectories to predict the system evolution from a defined starting state and propose new configurations. The DeepGenMSM is demonstrated to provide accurate estimates of the long-time kinetics and generate valid distributions for molecular dynamics (MD) benchmark systems. Remarkably, we show that DeepGenMSMs are able to make long time-steps in molecular configuration space and generate physically realistic structures in regions that were not seen in training data.

Early Stopping for Nonparametric Testing

Early stopping of iterative algorithms is an algorithmic regularization method to avoid over-fitting in estimation and classification. In this paper, we show that early stopping can also be applied to obtain the minimax optimal testing in a general non-parametric setup. Specifically, a Wald-type test statistic is obtained based on an iterated estimate produced by functional gradient descent algorithms in a reproducing kernel Hilbert space. A notable contribution is to establish a ``sharp'' stopping rule: when the number of iterations achieves an optimal order, testing optimality is achievable; otherwise, testing optimality becomes impossible. As a by-product, a similar sharpness result is also derived for minimax optimal estimation under early stopping. All obtained results hold for various kernel classes, including Sobolev smoothness classes and Gaussian kernel classes.

Efficient Anomaly Detection via Matrix Sketching

We consider the problem of finding anomalies in high-dimensional data using popular PCA based anomaly scores. The naive algorithms for computing these scores explicitly compute the PCA of the covariance matrix which uses space quadratic in the dimensionality of the data. We give the first streaming algorithms that use space that is linear or sublinear in the dimension. We prove general results showing that \emph{any} sketch of a matrix that satisfies a certain operator norm guarantee can be used to approximate these scores. We instantiate these results with powerful matrix sketching techniques such as Frequent Directions and random projections to derive efficient and practical algorithms for these problems, which we validate over real-world data sets. Our main technical contribution is to prove matrix perturbation inequalities for operators arising in the computation of these measures.
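
One of the sketching techniques mentioned, Frequent Directions, is itself easy to sketch (the halving variant below trades a slightly looser bound for fewer SVDs; anomaly scores would then be approximated from B instead of the full covariance):

```python
import numpy as np

def frequent_directions(X, ell):
    """Frequent Directions: stream the rows of X into an ell x d sketch B
    with B.T @ B ≈ X.T @ X in operator norm (error scales as ||X||_F^2 / ell)."""
    B = np.zeros((ell, X.shape[1]))
    for row in X:
        zero = np.flatnonzero(~B.any(axis=1))
        if zero.size == 0:                      # sketch full: shrink and free rows
            _, s, Vt = np.linalg.svd(B, full_matrices=False)
            s2 = np.maximum(s**2 - s[ell // 2]**2, 0.0)
            B = np.sqrt(s2)[:, None] * Vt
            zero = np.flatnonzero(~B.any(axis=1))
        B[zero[0]] = row
    return B

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5)) @ rng.normal(size=(5, 50))  # (numerically) rank-5 rows
B = frequent_directions(X, ell=20)
print(np.linalg.norm(X.T @ X - B.T @ B, 2))  # ~0 here, since rank 5 < ell // 2
```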

Learning to Specialize with Knowledge Distillation for Visual Question Answering

Visual Question Answering (VQA) is a notoriously challenging problem because it involves various heterogeneous tasks defined by questions within a unified framework. Learning specialized models for individual types of tasks is intuitively attractive but surprisingly difficult; it is not straightforward to outperform naive independent ensemble approaches. We present a principled algorithm to learn specialized models with knowledge distillation under a multiple choice learning framework. The training examples are dynamically assigned to a subset of models to specialize their functionality. The assigned models are learned to predict ground-truth answers, while the non-assigned models imitate their own base models before specialization. Our approach alleviates the problem of data deficiency, a critical limitation of existing multiple choice learning frameworks, and allows each model to learn its own specialized expertise without forgetting general knowledge through knowledge distillation. Our experiments show that the proposed algorithm achieves superior performance compared to naive ensemble methods and other baselines in VQA. Our framework is also effective for more general tasks, e.g., image classification with a large number of labels, which is known to be difficult under existing multiple choice learning schemes.

A Lyapunov-based Approach to Safe Reinforcement Learning

In many real-world reinforcement learning (RL) problems, besides optimizing the main objective function, an agent must concurrently avoid violating a number of constraints. In particular, besides optimizing performance it is crucial to guarantee the \emph{safety} of an agent during training as well as deployment (e.g. a robot should avoid taking actions - exploratory or not - which irrevocably harm its hardware). To incorporate safety in RL, we derive algorithms under the framework of Constrained Markov decision problems (CMDPs), an extension of the standard Markov decision problems (MDPs) augmented with constraints on expected cumulative costs. Our approach hinges on a novel \emph{Lyapunov} method. We define and present a method for constructing Lyapunov functions, which provide an effective way to guarantee the global safety of a behavior policy during training via a set of local, linear constraints. Leveraging these theoretical underpinnings, we show how to use the Lyapunov approach to systematically transform dynamic programming (DP) and RL algorithms into their safe counterparts. To illustrate their effectiveness, we evaluate these algorithms in several CMDP planning and decision-making tasks on a safety benchmark domain. Our results show that our proposed method significantly outperforms existing baselines in balancing constraint satisfaction and performance.

Credit Assignment For Collective Multiagent RL With Global Rewards

Scaling decision-theoretic planning to large multiagent systems is challenging due to uncertainty and partial observability in the environment. We focus on a subclass of multiagent planning models, relevant to urban settings, where agent interactions depend on their ``collective influence'' on each other, rather than on their identities. Unlike previous work, we address a general setting where the system reward is not decomposable among agents. We develop collective actor-critic RL approaches for this setting, addressing the problems of multiagent credit assignment and of computing low-variance policy gradient estimates that result in faster convergence to high-quality solutions. We also develop difference-rewards-based credit assignment methods for the collective setting. Empirically, our new approaches provide significantly better solutions than previous methods in the presence of global rewards on two real-world problems modeling taxi fleet optimization and multiagent patrolling, as well as on a synthetic grid navigation domain.

Statistical Optimality of Stochastic Gradient Descent on Hard Learning Problems through Multiple Passes

We consider stochastic gradient descent (SGD) for least-squares regression with potentially several passes over the data. While several passes have been widely reported to perform practically better in terms of predictive performance on unseen data, the existing theoretical analysis of SGD suggests that a single pass is statistically optimal. While this is true for low-dimensional easy problems, we show that for hard problems, multiple passes lead to statistically optimal predictions while a single pass does not; we also show that in these hard models, the optimal number of passes over the data increases with sample size. In order to define the notion of hardness and show that our predictive performances are optimal, we consider potentially infinite-dimensional models and notions typically associated with kernel methods, namely, the decay of eigenvalues of the covariance matrix of the features and the complexity of the optimal predictor as measured through the covariance matrix. We illustrate our results on synthetic experiments with non-linear kernel methods and on a classical benchmark with a linear model.

Does mitigating ML's impact disparity require treatment disparity?

Following precedent in employment discrimination law, two notions of disparity are widely discussed in papers on fairness and ML. Algorithms exhibit treatment disparity if they formally treat members of protected subgroups differently; algorithms exhibit impact disparity when outcomes differ across subgroups (even unintentionally). Naturally, we can achieve impact parity through purposeful treatment disparity. One line of papers aims to reconcile the two parities by proposing disparate learning processes (DLPs), in which the sensitive feature is used during training but a group-blind classifier is produced. In this paper, we show that: (i) when sensitive and (nominally) nonsensitive features are correlated, DLPs will indirectly implement treatment disparity, undermining the policy desiderata they are designed to address; (ii) when group membership is partly revealed by other features, DLPs induce within-class discrimination; and (iii) in general, DLPs provide suboptimal trade-offs between accuracy and impact parity. Experimental results on several real-world datasets highlight the practical consequences of applying DLPs.

Proximal Graphical Event Models

Event datasets record events that occur irregularly over the timeline and are prevalent in numerous domains. We introduce proximal graphical event models (PGEMs) as a representation of such datasets. PGEMs belong to a broader family of models that characterize relationships between various types of events, where the rate of occurrence of an event type depends only on whether or not its parents have occurred in the most recent history. Their main advantage over state-of-the-art models is that they are entirely data driven and do not require additional user inputs, which can demand domain knowledge such as the choice of basis functions or hyperparameters in graphical event models. We theoretically justify our learning of optimal windows for parental history and of parental sets, and show that our algorithm is sound and complete for parent structure learning. We present additional efficient heuristics for learning PGEMs from data, demonstrating their effectiveness on synthetic and real datasets.

Bayesian Control of Large MDPs with Unknown Dynamics in Data-Poor Environments

We propose a Bayesian decision making framework for control of Markov Decision Processes (MDPs) with unknown dynamics and large, possibly continuous, state, action, and parameter spaces in data-poor environments. Most of the existing adaptive controllers for MDPs with unknown dynamics are based on the reinforcement learning framework and rely on large data sets acquired by sustained direct interaction with the system or via a simulator. This is not feasible in many applications, due to ethical, economic, and physical constraints. The proposed framework addresses the data poverty issue by decomposing the problem into an offline planning stage that does not rely on sustained direct interaction with the system or simulator and an online execution stage. In the offline process, parallel Gaussian process temporal difference (GPTD) learning techniques are employed for near-optimal Bayesian approximation of the expected discounted reward over a sample drawn from the prior distribution of unknown parameters. In the online stage, the action with the maximum expected return with respect to the posterior distribution of the parameters is selected. This is achieved by an approximation of the posterior distribution using a Markov Chain Monte Carlo (MCMC) algorithm, followed by constructing multiple Gaussian processes over the parameter space for efficient prediction of the means of the expected return at the MCMC samples. The effectiveness of the proposed framework is demonstrated using a simple dynamical system model with continuous state and action spaces, as well as a more complex model for a metastatic melanoma gene regulatory network observed through noisy synthetic gene expression data.

Learning Overparameterized Neural Networks via Stochastic Gradient Descent on Structured Data

Neural networks have many successful applications, but far less theoretical understanding has been gained. Towards bridging this gap, we study the problem of learning a two-layer overparameterized ReLU neural network for multi-class classification via stochastic gradient descent (SGD) from random initialization. In the overparameterized setting, when the data come from mixtures of well-separated distributions, we prove that SGD learns a network with a small generalization error, even though the network has enough capacity to fit arbitrary labels. Furthermore, the analysis provides interesting insights into several aspects of learning neural networks and can be verified based on empirical studies on synthetic data and on the MNIST dataset.

Hamiltonian Variational Auto-Encoder

Variational Auto-Encoders (VAE) have become very popular techniques to perform inference and learning in latent variable models as they allow us to leverage the rich representational power of neural networks to obtain flexible approximations of the posterior of latent variables as well as tight evidence lower bounds (ELBO). Combined with stochastic variational inference, this provides a methodology scaling to large datasets. However, for this methodology to be practically efficient, it is necessary to obtain low-variance unbiased estimators of the ELBO and its gradients with respect to the parameters of interest. While the use of Markov chain Monte Carlo (MCMC) techniques such as Hamiltonian Monte Carlo (HMC) has been previously suggested to achieve this [23, 26], the proposed methods require specifying reverse kernels which have a large impact on performance. Additionally, the resulting unbiased estimator of the ELBO for most MCMC kernels is typically not amenable to the reparameterization trick. We show here how to optimally select reverse kernels in this setting and, by building upon Hamiltonian Importance Sampling (HIS) [17], we obtain a scheme that provides low-variance unbiased estimators of the ELBO and its gradients using the reparameterization trick. This allows us to develop a Hamiltonian Variational Auto-Encoder (HVAE). This method can be re-interpreted as a target-informed normalizing flow [20] which, within our context, only requires a few evaluations of the gradient of the sampled likelihood and trivial Jacobian calculations at each iteration.
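
The deterministic core of such Hamiltonian schemes is the leapfrog integrator, sketched below (generic leapfrog, not the full HVAE estimator):

```python
import numpy as np

def leapfrog(q, p, grad_U, eps=0.05, n_steps=20):
    """Leapfrog integration of Hamiltonian dynamics with potential energy U."""
    p = p - 0.5 * eps * grad_U(q)        # initial half step for momentum
    for _ in range(n_steps - 1):
        q = q + eps * p                  # full step for position
        p = p - eps * grad_U(q)          # full step for momentum
    q = q + eps * p
    p = p - 0.5 * eps * grad_U(q)        # final half step for momentum
    return q, p

# Standard Gaussian target: U(q) = ||q||^2 / 2, so grad U(q) = q.
q, p = leapfrog(np.ones(2), np.zeros(2), lambda q: q)
print(q, p)
```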

Modelling and unsupervised learning of symmetric deformable object categories

We look at the problem of learning the structure of categories of symmetric visual objects from raw images, without manual supervision. We show that we can capture the intuitive notion of symmetry in natural objects, which often clashes with its classical mathematical definition, by looking at the symmetry not of geometric shapes, but of object deformations. We do so by building on the recently-introduced object frame representation and show how the latter can be extended to capture symmetries, mapping them to simple transformation groups in representation space. An advantage of the original object frame is that it is amenable to unsupervised learning. We show that our formulation leads to a direct generalization of this learning strategy that allows learning the symmetries of natural objects also in an unsupervised manner. Finally, we show that our formulation provides an explanation of the ambiguities in pose recovery that arise from certain symmetries and we provide a way of discounting such ambiguities in learning.

Sequential Monte Carlo for probabilistic graphical models via twisted targets

Approximate inference in probabilistic graphical models (PGMs) can be grouped into deterministic methods and Monte-Carlo-based methods. The former can often provide accurate and rapid inferences, but are typically associated with biases that are hard to quantify. The latter enjoy asymptotic consistency, but can suffer from high computational costs. In this paper we present a way of bridging the gap between deterministic and stochastic inference. Specifically, we suggest an efficient sequential Monte Carlo (SMC) algorithm for PGMs which can leverage the output from deterministic inference methods. While generally applicable, we show explicitly how this can be done with loopy belief propagation, expectation propagation, and Laplace approximations. The resulting algorithm can be viewed as a post-correction of the biases associated with these methods and, indeed, numerical results show clear improvements over the baseline deterministic methods as well as over "plain" SMC.

Statistical mechanics of low-rank tensor decomposition

Often, large, high-dimensional datasets collected across multiple modalities can be organized as a higher-order tensor. Low-rank tensor decomposition then arises as a powerful and widely used tool to discover simple low-dimensional structures underlying such data. However, we currently lack a theoretical understanding of the algorithmic behavior of low-rank tensor decomposition. We derive Bayesian approximate message passing (AMP) algorithms for recovering arbitrarily shaped low-rank tensors buried within noise, and we employ dynamic mean field theory to precisely characterize their performance. Our theory reveals the existence of phase transitions between easy, hard and impossible inference regimes, and displays an excellent match with simulations. Moreover, it reveals several qualitative surprises compared to the behavior of symmetric, cubic tensor decomposition. Finally, we compare our AMP algorithm to the most commonly used algorithm, alternating least squares (ALS), and demonstrate that AMP significantly outperforms ALS in the presence of noise.

Variational Bayesian Monte Carlo

Many probabilistic models of interest in scientific computing and machine learning have expensive, black-box likelihoods that prevent the application of standard techniques for Bayesian inference, such as MCMC, which would require access to the gradient or a large number of likelihood evaluations. We introduce here a novel sample-efficient inference framework, Variational Bayesian Monte Carlo (VBMC). VBMC combines variational inference with Gaussian-process based, active-sampling Bayesian quadrature, using the latter to efficiently approximate the intractable integral in the variational objective. Our method produces both a nonparametric approximation of the posterior distribution and an approximate lower bound of the model evidence, useful for model selection. We demonstrate VBMC both on several synthetic likelihoods and on a neuronal model with data from real neurons. Across all tested problems and dimensions (up to D = 10), VBMC performs consistently well in reconstructing the posterior and the model evidence with a limited budget of likelihood evaluations, unlike other methods that work only in very low dimensions. Our framework shows great promise as a novel tool for posterior and model inference with expensive, black-box likelihoods.

Sample-Efficient Reinforcement Learning with Stochastic Ensemble Value Expansion

There is growing interest in combining model-free and model-based approaches in reinforcement learning with the goal of achieving the high performance of model-free algorithms with low sample complexity. This is difficult because an imperfect dynamics model can degrade the performance of the learning algorithm, and in sufficiently complex environments, the dynamics model will always be imperfect. As a result, a key challenge is to combine model-based approaches with model-free learning in such a way that errors in the model do not degrade performance. We propose stochastic ensemble value expansion (STEVE), a novel model-based technique that addresses this issue. By dynamically interpolating between model rollouts of various horizon lengths, STEVE ensures that the model is only utilized when doing so does not introduce significant errors. Our approach outperforms model-free baselines on challenging continuous control benchmarks with an order-of-magnitude increase in sample efficiency.
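
The interpolation idea can be sketched as inverse-variance weighting of the candidate targets produced at different rollout horizons (a simplified stand-in for STEVE's actual target computation):

```python
import numpy as np

def steve_target(candidate_targets):
    """Combine value targets from different model-rollout horizons, weighting
    each horizon inversely to the variance of its ensemble of estimates."""
    means = np.array([t.mean() for t in candidate_targets])
    precisions = np.array([1.0 / (t.var() + 1e-8) for t in candidate_targets])
    weights = precisions / precisions.sum()
    return float(weights @ means)

rng = np.random.default_rng(0)
# Three horizons, five ensemble estimates each; longer horizons are noisier here.
targets = [rng.normal(1.0, s, size=5) for s in (0.1, 0.5, 2.0)]
print(steve_target(targets))  # dominated by the low-variance (short) horizon
```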

Efficient Online Portfolio with Logarithmic Regret

We study the decades-old problem of online portfolio management and propose the first algorithm with logarithmic regret that is not based on Cover's Universal Portfolio algorithm and admits much faster implementation. Specifically, Universal Portfolio enjoys optimal regret $\mathcal{O}(N\ln T)$ for $N$ financial instruments over $T$ rounds, but requires log-concave sampling and has a large polynomial running time. Our algorithm, on the other hand, ensures a slightly larger but still logarithmic regret of $\mathcal{O}(N^2(\ln T)^4)$, and is based on the well-studied Online Mirror Descent framework with a novel regularizer that can be implemented via standard optimization methods in time $\mathcal{O}(TN^{2.5})$ per round. The regret of all other existing works is either polynomial in $T$ or has a potentially unbounded factor such as the inverse of the smallest price relative.

Algorithms and Theory for Multiple-Source Adaptation

This work includes a number of novel contributions for the multiple-source adaptation problem. We present new normalized solutions with strong theoretical guarantees for the cross-entropy loss and other similar losses. We also provide new guarantees that hold in the case where the conditional probabilities for the source domains are distinct. Moreover, we give new algorithms for determining the distribution-weighted combination solution for the cross-entropy loss and other losses. We report the results of a series of experiments with real-world datasets. We find that our algorithm outperforms competing approaches by producing a single robust model that performs well on any target mixture distribution. Altogether, our theory, algorithms, and empirical results provide a full solution for the multiple-source adaptation problem with very practical benefits.

Online Reciprocal Recommendation with Theoretical Performance Guarantees

A reciprocal recommendation problem is one where the goal of learning is not just to predict a user's preference towards a passive item (e.g., a book), but to recommend to a targeted user on one side another user from the other side, such that a mutual interest exists between the two. The problem is thus sharply different from the more traditional items-to-users recommendation, since a good match requires meeting the preferences of both users. We initiate a rigorous theoretical investigation of the reciprocal recommendation task in a specific framework of sequential learning. We point out general limitations, formulate reasonable assumptions enabling effective learning and, under these assumptions, we design and analyze a computationally efficient algorithm that uncovers mutual likes at a pace comparable to that achieved by a clairvoyant algorithm knowing all user preferences in advance. Finally, we validate our algorithm against synthetic and real-world datasets, showing improved empirical performance over simple baselines.

How SGD selects the global minima in over-parameterized learning: A stability perspective

The question of which global minima are accessible by a stochastic gradient descent (SGD) algorithm with a specific learning rate and batch size is studied from the perspective of numerical stability. The concept of non-uniformity is introduced, which, together with sharpness, characterizes the stability property of a global minimum and hence the accessibility of a particular SGD algorithm to that global minimum. In particular, this analysis shows that learning rate and batch size play different roles in minima selection. Extensive empirical results seem to correlate well with the theoretical findings and provide further support for these claims.

Differentiable MPC for End-to-end Planning and Control

In this paper we present foundations for using model predictive control (MPC) as a differentiable policy class in reinforcement learning. Specifically, we differentiate through MPC by using the KKT conditions of the convex approximation at a fixed point of the solver. Using this strategy, we are able to learn the cost and dynamics of a controller via end-to-end learning in a larger system. We empirically show results in an imitation learning setting, demonstrating that we can recover the underlying dynamics and cost more efficiently and reliably than with a generic neural network policy class.

Bilevel learning of the Group Lasso structure

Regression with a group-sparsity penalty plays a central role in high-dimensional prediction problems. However, most existing methods require the group structure to be known a priori. In practice, this strong assumption often results in a degradation of the prediction performance. To circumvent this issue, we present a method to estimate the group structure by means of a continuous bilevel optimization problem where the data is split into training and validation sets. Our approach relies on an approximation where the lower-level problem is replaced by a smooth dual forward-backward scheme with Bregman distances. We provide guarantees regarding its convergence to the exact problem and demonstrate the good behaviour of the method on synthetic experiments. Finally, a preliminary application to gene expression data is tackled in order to unveil functional groups.
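
The group-sparsity structure at the heart of the lower-level problem has a simple proximal map, sketched below (the standard group lasso prox, not the paper's bilevel scheme):

```python
import numpy as np

def group_soft_threshold(w, groups, tau):
    """Prox of tau * sum_g ||w_g||_2: shrink each group toward zero and drop
    a group entirely when its norm is at most tau."""
    w = w.copy()
    for g in groups:                       # g indexes one group of coefficients
        norm = np.linalg.norm(w[g])
        w[g] = 0.0 if norm <= tau else (1.0 - tau / norm) * w[g]
    return w

w = np.array([3.0, 4.0, 0.1, -0.1])
print(group_soft_threshold(w, [np.array([0, 1]), np.array([2, 3])], tau=1.0))
# [2.4, 3.2, 0.0, 0.0]: the first group shrinks, the second is zeroed out.
```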

Generative Adversarial Examples

Adversarial examples are typically constructed by perturbing an existing data point, and current defense methods are focused on guarding against this type of attack. In this paper, we propose a new class of adversarial examples that are synthesized entirely from scratch using a conditional generative model. We first train an Auxiliary Classifier Generative Adversarial Network (AC-GAN) to model the class-conditional distribution over inputs. Then, conditioned on a desired class, we search over the AC-GAN latent space to find images that are likely under the generative model and are misclassified by a target classifier. We demonstrate through human evaluation that these new adversarial inputs, which we call generative adversarial examples, are legitimate and belong to the desired class. Our empirical results on the MNIST, SVHN, and CelebA datasets show that generative adversarial examples can easily bypass strong adversarial training and certified defense methods which can foil existing adversarial attacks.

Information-theoretic Limits for Community Detection in Network Models

We analyze the information-theoretic limits for the recovery of node labels in several network models. This includes the Stochastic Block Model, the Exponential Random Graph Model, the Latent Space Model, the Directed Preferential Attachment Model, and the Directed Small-world Model. For the Stochastic Block Model, the non-recoverability condition depends on the probabilities of having edges inside a community and between different communities. For the Latent Space Model, the non-recoverability condition depends on the dimension of the latent space and on how far apart and how spread out the communities are in the latent space. For the Directed Preferential Attachment Model and the Directed Small-world Model, the non-recoverability condition depends on the ratio between homophily and neighborhood size. We also consider dynamic versions of the Stochastic Block Model and the Latent Space Model.

Distributionally Robust Graphical Models

In many structured prediction problems, complex relationships between variables are compactly defined using graphical structures. The most prevalent graphical prediction methods ---probabilistic graphical models and large margin methods--- have their own distinct strengths but also come with significant drawbacks. Conditional random fields (CRFs) are Fisher consistent, but they do not permit integration of customized loss functions into their learning process. Large-margin models, such as structured support vector machines (SSVM), have the flexibility to incorporate customized loss metrics, but lack Fisher consistency guarantees. We present adversarial graphical models (AGM), a distributionally robust approach for constructing a predictor that performs robustly for a class of data distributions defined using a graphical structure. Our approach enjoys both the flexibility of incorporating customized loss functions into its design as well as the statistical guarantee of Fisher consistency. We present exact learning and prediction algorithms for AGM requiring similar time complexity as existing graphical models and show its practical benefits in our experiments.

Transfer Learning with Neural AutoML

We reduce the computational cost of Neural AutoML with transfer learning. AutoML relieves human effort by automating the design of ML algorithms. Neural AutoML has become popular for the design of deep learning architectures; however, this method has a high computational cost. To address this, we propose Transfer Neural AutoML, which uses knowledge from prior tasks to speed up network design. We extend RL-based architecture search methods to support parallel training on multiple tasks and then transfer the search strategy to new tasks. On language and image classification data, Transfer Neural AutoML reduces convergence time over single-task training by over an order of magnitude on many tasks.

Stochastic Primal-Dual Method for Empirical Risk Minimization with O(1) Per-Iteration Complexity

The regularized empirical risk minimization problem with a linear predictor appears frequently in machine learning. In this paper, we propose a new stochastic primal-dual method to solve this class of problems. Different from existing methods, our proposed method only requires O(1) operations in each iteration. We also develop a variance-reduction variant of the algorithm that converges linearly. Numerical experiments suggest that our methods are faster than existing ones such as proximal SGD, SVRG and SAGA on high-dimensional problems.

On preserving non-discrimination when combining expert advice

We study the design of online learning algorithms that, when run on members of different groups, do not discriminate against some group. We consider the most basic question in such a setting: how can we design an online learning algorithm that, given access to individually non-discriminatory predictors, guarantees the classical no-regret property and overall non-discrimination at the same time? We show a strong impossibility result for this goal with respect to "equal opportunity" that requires equal false negative rates across groups. On the positive side, we show that for another notion of non-discrimination, "equalized error rates", such a guarantee is achievable.

Learning safe policies with expert guidance

We propose a framework for ensuring safe behavior of a reinforcement learning agent when the reward function may be difficult to specify. In order to do this, we rely on the existence of demonstrations from expert policies, and we provide a theoretical framework for the agent to optimize in the space of rewards consistent with its existing knowledge. We propose two methods to solve the resulting optimization: an exact ellipsoid-based method and a method in the spirit of the "follow-the-perturbed-leader" algorithm. Our experiments demonstrate the behavior of our algorithm in both discrete and continuous problems. The trained agent safely avoids states with potential negative effects while imitating the behavior of the expert in the other states.

Bayesian multi-domain learning for cancer subtype discovery from next-generation sequencing count data

Precision medicine aims for personalized prognosis and therapeutics by utilizing recent genome-scale high-throughput profiling techniques, including next-generation sequencing (NGS). However, translating NGS data faces several challenges. First, NGS count data are often overdispersed, requiring appropriate modeling. Second, compared to the number of involved molecules and system complexity, the number of available samples for studying complex disease, such as cancer, is often limited, especially considering disease heterogeneity. The key question is whether we may integrate available data from all different sources or domains to achieve reproducible disease prognosis based on NGS count data. In this paper, we develop a Bayesian Multi-Domain Learning (BMDL) model that derives domain-dependent latent representations of overdispersed count data based on hierarchical negative binomial factorization for accurate cancer subtyping even if the number of samples for a specific cancer type is small. Experimental results from both our simulated and NGS datasets from The Cancer Genome Atlas (TCGA) demonstrate the promising potential of BMDL for effective multi-domain learning without ``negative transfer'' effects often seen in existing multi-task learning and transfer learning methods.

Learning SMaLL Predictors

We introduce a new framework for learning in severely resource-constrained settings. Our technique delicately amalgamates the representational richness of multiple linear predictors with the sparsity of Boolean relaxations, and thereby yields classifiers that are compact, interpretable, and accurate. We provide a rigorous formalism of the learning problem, and establish fast convergence of the ensuing algorithm via relaxation to a minimax saddle point objective. We corroborate the theoretical foundations of our work with an extensive empirical evaluation. Our method, Sparse Multiprototype Linear Learner (SMaLL), achieves state-of-the-art performance on several OpenML datasets.

Phase Retrieval Under a Generative Prior

We introduce a novel deep-learning inspired formulation of the \textit{phase retrieval problem}, which asks to recover a signal $y_0 \in \mathbb{R}^n$ from $m$ quadratic observations, under structural assumptions on the underlying signal. As is common in many imaging problems, previous methodologies have considered natural signals as being sparse with respect to a known basis, resulting in the decision to enforce a generic sparsity prior. However, these methods for phase retrieval have encountered possibly fundamental limitations, as no computationally efficient algorithm for sparse phase retrieval has been proven to succeed with fewer than $O(k^2\log n)$ generic measurements, which is larger than the theoretical optimum of $O(k \log n)$. In this paper, we sidestep this issue by considering a prior under which a natural signal lies in the range of a generative neural network $G : \mathbb{R}^k \rightarrow \mathbb{R}^n$. We introduce an empirical risk formulation that has favorable global geometry for gradient methods, as soon as $m = O(k)$, under the model of a multilayer fully-connected neural network with random weights. Specifically, we show that there exists a descent direction outside of a small neighborhood around the true $k$-dimensional latent code and a negative multiple thereof. This formulation for structured phase retrieval thus benefits from two effects: generative priors can more tightly represent natural signals than sparsity priors, and this empirical risk formulation can exploit those generative priors at an information-theoretically optimal sample complexity, unlike for a sparsity prior. We corroborate these results with experiments showing that exploiting generative models in phase retrieval tasks outperforms both sparse and general phase retrieval methods.
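
A toy version of the empirical risk formulation, with a random-weight ReLU generator and finite-difference gradients for brevity (all sizes and step sizes are illustrative; real implementations backpropagate through $G$ and rely on the descent-direction analysis above):

```python
import numpy as np

rng = np.random.default_rng(0)
k, n, m = 4, 50, 30
W1 = rng.normal(size=(n, 20)) / np.sqrt(20)   # random generator weights
W2 = rng.normal(size=(20, k)) / np.sqrt(k)
A = rng.normal(size=(m, n)) / np.sqrt(m)      # measurement matrix

def G(z):                                     # random-weight ReLU generator
    return W1 @ np.maximum(W2 @ z, 0.0)

z_true = rng.normal(size=k)
b = np.abs(A @ G(z_true))                     # phaseless (magnitude) measurements

def risk(z):                                  # empirical risk over the latent code
    return 0.5 * np.sum((np.abs(A @ G(z)) - b) ** 2)

def num_grad(f, z, h=1e-5):                   # finite differences, for brevity
    g = np.zeros_like(z)
    for i in range(z.size):
        e = np.zeros_like(z); e[i] = h
        g[i] = (f(z + e) - f(z - e)) / (2 * h)
    return g

z = rng.normal(size=k)
print("initial risk:", risk(z))
for _ in range(300):
    z = z - 0.05 * num_grad(risk, z)
print("final risk:", risk(z))                 # typically decreases substantially
```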

Quadrature-based features for kernel approximation

We consider the problem of improving kernel approximation via randomized feature maps. These maps arise as Monte Carlo approximations to integral representations of kernel functions and scale up kernel methods for larger datasets. Based on an efficient numerical integration technique, we propose a unifying approach that reinterprets previous random feature methods and extends them to better estimates of the kernel approximation. We derive the convergence behavior and conduct an extensive empirical study that supports our hypothesis.
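
The Monte Carlo baseline being reinterpreted here is the classic random Fourier feature map, e.g. for the Gaussian kernel (the standard construction, not the quadrature-based estimates proposed in the paper):

```python
import numpy as np

def random_fourier_features(X, n_features=512, gamma=1.0, rng=None):
    """Monte Carlo feature map z with z(x).z(y) ≈ exp(-gamma * ||x - y||^2)."""
    rng = rng or np.random.default_rng(0)
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(X.shape[1], n_features))
    b = rng.uniform(0, 2 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

X = np.random.default_rng(1).normal(size=(5, 3))
Z = random_fourier_features(X)
exact = np.exp(-((X[:, None] - X[None]) ** 2).sum(-1))
print(np.abs(Z @ Z.T - exact).max())  # Monte Carlo error, shrinking in n_features
```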

Reducing Network Agnostophobia

Agnostophobia, the fear of the unknown, can be experienced by deep learning engineers while applying their networks to real-world applications. Unfortunately, network behavior is not well defined for inputs far from the training data. In an uncontrolled environment, networks face many instances that are not of interest to them and have to be rejected in order to avoid false positives. This problem has previously been tackled by researchers either by (a) thresholding softmax, which by construction must return one of the known classes, or (b) using an additional background or garbage class. In this paper, we show that both of these approaches help, but are generally insufficient when previously unseen classes are encountered. We introduce a new evaluation metric that focuses on comparing the performance of multiple approaches in scenarios where unknowns are encountered. Our major contributions are the simple yet effective Entropic Open-Set and Objectosphere losses, which, like current approaches, train with negative samples. However, these novel losses are designed to maximize entropy for unknown inputs while also increasing the separation in deep-feature magnitude between known and unknown classes. Experiments on MNIST and CIFAR-10 show that our novel losses are significantly better at dealing with unknown inputs from datasets such as Letters, NotMNIST, Devanagari, and SVHN.
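
The Entropic Open-Set idea can be sketched in a few lines: known inputs get the usual cross-entropy, while unknown inputs are pushed toward the uniform (maximum-entropy) prediction (a sketch of the stated idea; the Objectosphere loss additionally penalizes deep-feature magnitude for unknowns):

```python
import numpy as np

def log_softmax(logits):
    z = logits - logits.max()
    return z - np.log(np.exp(z).sum())

def entropic_openset_loss(logits, label):
    """Known inputs (label >= 0): usual cross-entropy. Unknown inputs
    (label = -1): cross-entropy against the uniform distribution, which
    maximizes prediction entropy."""
    logp = log_softmax(logits)
    if label >= 0:
        return -logp[label]
    return -logp.mean()

logits = np.array([2.0, 0.5, -1.0])
print(entropic_openset_loss(logits, 0))    # known-class loss
print(entropic_openset_loss(logits, -1))   # unknown-input (entropy) loss
```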

A Stein variational Newton method

Stein variational gradient descent (SVGD) was recently proposed as a general purpose nonparametric variational inference algorithm: it minimizes the Kullback–Leibler divergence between the target distribution and its approximation by implementing a form of functional gradient descent on a reproducing kernel Hilbert space [Liu & Wang, NIPS 2016]. In this paper, we accelerate and generalize the SVGD algorithm by including second-order information, thereby approximating a Newton-like iteration in function space. We also show how second-order information can lead to more effective choices of kernel. We observe significant computational gains over the original SVGD algorithm in multiple test cases.
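
For reference, one first-order SVGD update with an RBF kernel looks as follows (the Newton variant discussed above replaces this functional gradient with a second-order step):

```python
import numpy as np

def svgd_step(X, grad_logp, eps=0.1, h=1.0):
    """One SVGD update on n particles X (n x d) with an RBF kernel."""
    diffs = X[:, None, :] - X[None, :, :]             # x_j - x_i, shape (n, n, d)
    K = np.exp(-(diffs ** 2).sum(-1) / (2 * h))       # k(x_j, x_i)
    gradK = -(diffs / h) * K[:, :, None]              # grad_{x_j} k(x_j, x_i)
    phi = (K @ grad_logp(X) + gradK.sum(axis=0)) / X.shape[0]
    return X + eps * phi

# Target: standard Gaussian, so grad log p(x) = -x.
X = np.random.default_rng(0).normal(loc=3.0, size=(100, 2))
for _ in range(300):
    X = svgd_step(X, lambda X: -X)
print(X.mean(axis=0))  # the particle cloud drifts toward the target mean (0, 0)
```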

Watch Your Step: Learning Node Embeddings via Graph Attention

Graph embedding methods represent nodes in a continuous vector space, preserving different types of relational information from the graph. There are many hyper-parameters to these methods (e.g. the length of a random walk) which have to be manually tuned for every graph. In this paper, we replace previously fixed hyper-parameters with trainable ones that we automatically learn via backpropagation. In particular, we learn a novel attention model on the power series of the transition matrix, which guides the random walk to optimize an upstream objective. Unlike previous approaches to attention models, our attention parameters act exclusively on the data itself (e.g., on the random walk) and are not used by the model for inference. We experiment on link prediction tasks, as we aim to produce embeddings that best preserve the graph structure, generalizing to unseen information. We improve the state of the art on a comprehensive suite of real-world datasets including social, collaboration, and biological networks. Adding attention to random walks can reduce the error by 20% to 45% on the datasets we evaluated. We show that our learned attention parameters can vary significantly for different graphs, and correspond to the optimal choice of hyper-parameters when existing methods are manually tuned.
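
The attention model over the power series of the transition matrix can be sketched as follows, with `logits` standing in for the trainable attention parameters (learned by backpropagation in the actual method):

```python
import numpy as np

def expected_walk_context(adj, logits):
    """Context matrix C = sum_k q_k P^k, where q = softmax(logits) is the
    learned attention over walk lengths (replacing a fixed window size)."""
    P = adj / adj.sum(axis=1, keepdims=True)   # random-walk transition matrix
    q = np.exp(logits - logits.max())
    q = q / q.sum()
    C, Pk = np.zeros_like(P), np.eye(P.shape[0])
    for qk in q:
        Pk = Pk @ P
        C += qk * Pk
    return C

adj = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
print(expected_walk_context(adj, logits=np.zeros(5)))  # uniform attention, 5 steps
```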

Visual Goal-Conditioned Reinforcement Learning by Representation Learning

For an autonomous agent to fulfill a wide range of user-specified goals at test time, it must be able to learn broadly applicable and general-purpose skill repertoires. Furthermore, to provide the requisite level of generality, these skills must handle raw sensory input such as images. In this paper, we propose an algorithm that acquires such general-purpose skills by combining unsupervised representation learning and reinforcement learning of goal- conditioned policies. Since the particular goals that might be required at test-time are not known in advance, the agent performs a self-supervised "practice" phase where it imagines goals and attempts to achieve them. We learn a visual representation with three distinct purposes: sampling goals for self-supervised practice, providing a structured transformation of raw sensory inputs, and computing a reward signal for goal reaching. We also propose a retroactive goal relabeling scheme to further improve the sample-efficiency of our method. Our off-policy algorithm is efficient enough to learn policies that operate on raw image observations and goals in a real-world physical system, and substantially outperforms prior techniques.

Deep Predictive Coding Network with Local Recurrent Processing for Object Recognition

Inspired by "predictive coding" - a theory in neuroscience, we develop a bi- directional and dynamical neural network with local recurrent processing, namely predictive coding network (PCN). Unlike any feedforward-only convolutional neural network, PCN includes both feedback connections, which carry top-down predictions, and feedforward connections, which carry bottom-up errors of prediction. Feedback and feedforward connections enable adjacent layers to interact locally and recurrently to refine representations towards minimization of layer-wise prediction errors. When unfolded over time, the recurrent processing gives rise to an increasingly deeper hierarchy of non- linear transformation, allowing a shallow network to dynamically extend itself into an arbitrarily deep network. We train and test PCN for image classification with SVHN, CIFAR and ImageNet datasets. Despite notably fewer layers and parameters, PCN achieves competitive performance compared to classical and state-of-the-art models. Further analysis shows that the internal representations in PCN converge over time and yield increasingly better accuracy in object recognition. Errors of top-down prediction also map visual saliency or bottom-up attention. This work takes us one step closer to bridging human and machine intelligence in vision.

PAC-Bayes bounds for stable algorithms with instance-dependent priors

PAC-Bayes bounds have been proposed to get risk estimates based on a training sample. In this paper the PAC-Bayes approach is combined with stability of the hypothesis learned by a Hilbert space valued algorithm. The PAC-Bayes setting is used with a Gaussian prior centered at the expected output. Thus a novelty of our paper is using priors defined in terms of the data-generating distribution. Our main result estimates the risk of the randomized algorithm in terms of the hypothesis stability coefficients. We also provide a new bound for the SVM classifier, which is compared to other known bounds experimentally. Ours appears to be the first stability-based bound that evaluates to non-trivial values.

Beyond Grids: Learning Graph Representations for Visual Recognition

We propose learning graph representations from 2D feature maps for visual recognition. Our method draws inspiration from region based recognition, and learns to transform a 2D image into a graph structure. The vertices of the graph define clusters of pixels ("regions"), and the edges measure the similarity between these clusters in a feature space. Our method further learns to propagate information across all vertices on the graph, and is able to project the learned graph representation back into 2D grids. Our graph representation facilitates reasoning beyond regular grids and can capture long range dependencies among regions. We demonstrate that our model can be trained from end-to-end, and is easily integrated into existing networks. Finally, we evaluate our method on three challenging recognition tasks: semantic segmentation, object detection and object instance segmentation. For all tasks, our method outperforms state-of-the-art methods.

The Limit Points of (Optimistic) Gradient Descent in Min-Max Optimization

Motivated by applications in Optimization, Game Theory, and the training of Generative Adversarial Networks, the convergence properties of first order methods in min-max problems have received extensive study. It has been recognized that they may cycle, and there is no good understanding of their limit points when they do not. When they converge, do they converge to local min-max solutions? We characterize the limit points of two basic first order methods, namely Gradient Descent/Ascent (GDA) and Optimistic Gradient Descent Ascent (OGDA). We show that both dynamics avoid unstable critical points for almost all initializations. Moreover, for small step sizes and under mild assumptions, the set of {OGDA}-stable critical points is a superset of {GDA}-stable critical points, which is a superset of local min-max solutions (strict in some cases). The connecting thread is that the behavior of these dynamics can be studied from a dynamical systems perspective.
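
A short experiment on the bilinear game $\min_x \max_y xy$ illustrates the difference between the two dynamics (GDA spirals away from the equilibrium while OGDA contracts toward it):

```python
# The bilinear game min_x max_y f(x, y) = x * y has its only equilibrium at (0, 0).
eta, T = 0.1, 200

x, y = 1.0, 1.0
for _ in range(T):                            # simultaneous GDA
    x, y = x - eta * y, y + eta * x
print("GDA: ", x, y)                          # spirals away from (0, 0)

x, y = 1.0, 1.0
gx_prev, gy_prev = y, x                       # previous gradients (df/dx, df/dy)
for _ in range(T):                            # OGDA: step with 2*g_t - g_{t-1}
    gx, gy = y, x
    x = x - eta * (2 * gx - gx_prev)
    y = y + eta * (2 * gy - gy_prev)
    gx_prev, gy_prev = gx, gy
print("OGDA:", x, y)                          # contracts toward (0, 0)
```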

Coordinate Descent with Bandit Sampling

Coordinate descent methods minimize a cost function by updating a single decision variable (corresponding to one coordinate) at a time. Ideally, we would update the decision variable that yields the largest marginal decrease in the cost function. However, finding this coordinate would require checking all of them, which is not computationally practical. Therefore, we propose a new adaptive method for coordinate descent. First, we define a lower bound on the decrease of the cost function when a coordinate is updated and, instead of calculating this lower bound for all coordinates, we use a multi-armed bandit algorithm to learn which coordinates result in the largest marginal decrease and simultaneously perform coordinate descent. We show that our approach improves the convergence of the coordinate methods both theoretically and experimentally.

Deep Dynamical Modeling and Control of Unsteady Fluid Flows

The design of flow control systems remains a challenge due to the nonlinear nature of the equations that govern fluid flow. However, recent advances in computational fluid dynamics (CFD) have enabled the simulation of complex fluid flows with high accuracy, opening the possibility of using learning-based approaches to facilitate controller design. We present a method for learning the forced and unforced dynamics of airflow over a cylinder directly from CFD data. The proposed approach, grounded in Koopman theory, is shown to produce stable dynamical models that can predict the time evolution of the cylinder system over extended time horizons. Finally, by performing model predictive control with the learned dynamical models, we are able to find a straightforward, interpretable control law for suppressing vortex shedding in the wake of the cylinder.

Confounding-Robust Policy Improvement

We study the problem of learning personalized decision policies from observational data while accounting for possible unobserved confounding in the data-generating process. Unlike previous approaches which assume unconfoundedness, i.e., that no unobserved confounders affected both treatment assignment and outcome, we calibrate policy learning for realistic violations of this unverifiable assumption with uncertainty sets motivated by sensitivity analysis in causal inference. Our framework for confounding-robust policy improvement optimizes the minimax regret of a candidate policy against a baseline or reference "status quo" policy, over an uncertainty set around nominal propensity weights. We prove that if the uncertainty set is well-specified, robust policy learning can do no worse than the baseline, and only improve if the data supports it. We characterize the adversarial subproblem and use efficient algorithmic solutions to optimize over parametrized spaces of decision policies such as logistic treatment assignment. We assess our methods on synthetic data and a large clinical trial, demonstrating that confounded selection can hinder policy learning and lead to unwarranted harm, while our robust approach guarantees safety and focuses on well-evidenced improvement.

The Importance of Sampling in Meta-Reinforcement Learning

We interpret meta-reinforcement learning as the problem of learning how to quickly find a good sampling distribution in a new environment. This interpretation leads to the development of two new meta-reinforcement learning algorithms: E-MAML and E-$\text{RL}^2$. Results are presented on a new environment we call `Krazy World': a difficult high-dimensional gridworld which is designed to highlight the importance of correctly differentiating through sampling distributions in meta-reinforcement learning. Further results are presented on a set of maze environments. We show E-MAML and E-$\text{RL}^2$ deliver better performance than baseline algorithms on both tasks.

Representer Point Selection for Explaining Deep Neural Networks

We propose to explain the predictions of a deep neural network by pointing to the set of what we call representer points in the training set, for a given black-box test point prediction. Specifically, we show that we can decompose the pre-activation prediction of a neural network into a linear combination of activations of training points, with the weights corresponding to what we call representer values, which thus capture the importance of each training point for the learned parameters of the network. This goes beyond simple training-point influence: positive representer values correspond to excitatory training points and negative values to inhibitory points, which, as we show, provides considerably more insight into the network. Our method is also much more scalable, allowing for real-time feedback in a manner not feasible with influence functions.
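
For intuition, the decomposition can be verified directly on an L2-regularized logistic-regression "last layer": at a stationary point, the weight vector is exactly a weighted sum of training inputs, with per-point weights given by scaled, negated loss gradients. This is our simplified rendering of representer values; the paper applies the analogous identity to the pre-activations of a deep network's final layer.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 200, 5, 0.1
X = rng.normal(size=(n, d))
y = (rng.random(n) < 1 / (1 + np.exp(-X[:, 0]))).astype(float)

# Train L2-regularized logistic regression to (near) stationarity.
w = np.zeros(d)
for _ in range(20000):
    p = 1 / (1 + np.exp(-(X @ w)))
    grad = X.T @ (p - y) / n + 2 * lam * w
    w -= 0.5 * grad

# Representer decomposition: at a stationary point, w = sum_i alpha_i x_i
# with alpha_i = -loss'(z_i) / (2 * lam * n).
alpha = -(p - y) / (2 * lam * n)
w_rec = X.T @ alpha
print(np.max(np.abs(w - w_rec)))  # ~0: weights decompose over training points
```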

The Effect of Network Width on the Performance of Large-batch Training

Distributed implementations of mini-batch stochastic gradient descent (SGD) suffer from communication overheads, attributed to the high frequency of gradient updates inherent in small-batch training. Training with large batches can reduce these overheads; however, it can impede the convergence of the algorithm and degrade generalization performance. In this work, we take a first step towards analyzing how the structure (width and depth) of a neural network affects the performance of large-batch training. We present new theoretical results which suggest that--for a fixed number of parameters--wider networks are more amenable to fast large-batch training compared to deeper ones. We provide extensive experiments on residual and fully-connected neural networks which suggest that wider networks can be trained using larger batches without incurring a convergence slow-down, unlike their deeper variants.

SNIPER: Efficient Multi-Scale Training

We present SNIPER, an algorithm for performing scale invariant training in instance level visual recognition tasks. Instead of processing every pixel in an image pyramid, SNIPER only processes context regions around ground-truth instances (referred to as chips) at the appropriate scale. For background sampling, these context-regions are generated using proposals extracted from a region proposal network trained with a short learning schedule. Hence, the number of chips generated per image during training adaptively changes based on the scene complexity. SNIPER only processes 30% more pixels compared to the commonly used single scale training at 800x1333 pixels on the COCO dataset. But, it also observes samples from extreme resolutions of the image pyramid, like 1400x2000 pixels. As SNIPER operates on low resolution chips (512x512 pixels), it can have a batch size as large as 20 on a single GPU even with a ResNet-101 backbone. Therefore it can benefit from batch-normalization during training without the need for synchronizing batch-normalization statistics across GPUs. SNIPER brings training of instance level recognition tasks like object detection closer to the protocol for image classification and suggests that the commonly accepted guideline that it is important to train on high resolution images for instance level visual recognition tasks might not be correct. Our implementation based on Faster-RCNN with a ResNet-101 backbone obtains an mAP of 47.6% on the COCO dataset for bounding box detection and can process 5 images per second with a single GPU.

Sample Complexity of Nonparametric Semi-Supervised Learning

We study the sample complexity of semi-supervised learning (SSL) and introduce new assumptions based on the mismatch between a mixture model learned from unlabeled data and the true mixture model induced by the (unknown) class conditional distributions. Under these assumptions, we establish an $\Omega(K\log K)$ labeled sample complexity bound without imposing parametric assumptions, where $K$ is the number of classes. Our results suggest that even in nonparametric settings it is possible to learn a near-optimal classifier using only a few labeled samples. Unlike previous theoretical work which focuses on binary classification, we consider general multiclass classification ($K>2$), which requires solving a difficult permutation learning problem. This permutation defines a classifier whose classification error is controlled by the Wasserstein distance between mixing measures, and we provide finite-sample results characterizing the behaviour of the excess risk of this classifier. Finally, we describe three algorithms for computing these estimators based on a connection to bipartite graph matching, and perform experiments to illustrate the superiority of the MLE over the majority vote estimator.
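
The permutation-learning step can be illustrated with standard tools: fit a mixture to unlabeled data, then resolve the component-to-class permutation from a handful of labels via bipartite matching. This is a hedged sketch of the general recipe, not the paper's estimators.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
K = 3
# Unlabeled data from K well-separated Gaussians.
means = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
X = np.vstack([rng.normal(m, 1.0, size=(500, 2)) for m in means])

gmm = GaussianMixture(n_components=K, random_state=0).fit(X)

# A few labeled samples per class suffice to resolve the permutation.
Xl = np.vstack([rng.normal(m, 1.0, size=(3, 2)) for m in means])
yl = np.repeat(np.arange(K), 3)

comp = gmm.predict(Xl)
# cost[c, k]: negated co-occurrence of component c with class label k
cost = np.zeros((K, K))
for c, k in zip(comp, yl):
    cost[c, k] -= 1.0
row, col = linear_sum_assignment(cost)   # Hungarian algorithm
print(dict(zip(row, col)))               # mixture component -> class label
```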

Hardware Conditioned Policies for Multi-Robot Transfer Learning

Deep reinforcement learning can be used to learn dexterous robotic policies, but it is extremely challenging to transfer them to new robots with vastly different hardware properties. It is also prohibitively expensive to learn a new policy from scratch for each robot's hardware due to the high sample complexity of modern state-of-the-art algorithms. We propose a novel approach called \textit{Hardware Conditioned Policies}, where we train a universal policy conditioned on a vector representation of robot hardware. We consider robots in simulation with varied dynamics, kinematic structure, kinematic lengths and degrees-of-freedom. First, we use the kinematic structure directly as the hardware encoding and show strong zero-shot transfer to completely novel robots not seen during training. For robots with a lower zero-shot success rate, we also demonstrate that fine-tuning the policy network is significantly more sample-efficient than training a model from scratch. In tasks where knowing the agent dynamics is crucial for success, we learn an embedding for robot hardware and show that policies conditioned on the encoding of hardware tend to generalize and transfer well.

Co-regularized Alignment for Unsupervised Domain Adaptation

Deep neural networks, trained with large amounts of labeled data, can fail to generalize well when tested on examples from a target domain whose distribution differs from the training data distribution, referred to as the source domain. It can be expensive or even infeasible to obtain the required amount of labeled data in all possible domains. Unsupervised domain adaptation sets out to address this problem, aiming to learn a good predictive model for the target domain using labeled examples from the source domain but only unlabeled examples from the target domain. Domain alignment approaches this problem by matching the source and target feature distributions, and has been used as a key component in many state-of-the-art domain adaptation methods. However, matching the marginal feature distributions does not guarantee that the corresponding class conditional distributions will be aligned across the two domains. We propose co-regularized domain alignment for unsupervised domain adaptation, which constructs multiple diverse feature spaces and aligns source and target distributions in each of them individually, while encouraging the alignments to agree with each other with regard to the class predictions on the unlabeled target examples. The proposed method is generic and can be used to improve any domain adaptation method which uses domain alignment. We instantiate it in the context of a recent state-of-the-art method and observe that it provides significant performance improvements on several domain adaptation benchmarks.

Statistical and Computational Trade-Offs in Kernel K-Means

We investigate the efficiency of k-means in terms of both statistical and computational requirements. More precisely, we study a Nyström approach to kernel k-means. We analyze the statistical properties of the proposed method and show that it achieves the same accuracy as exact kernel k-means with only a fraction of the computation. Indeed, we prove under basic assumptions that sampling $\sqrt{n}$ Nyström landmarks greatly reduces computation without incurring any loss of accuracy. To the best of our knowledge, this is the first result of this kind for unsupervised learning.
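
A minimal sketch of the recipe, assuming an RBF kernel: sample roughly $\sqrt{n}$ Nyström landmarks, build the corresponding feature map, and run ordinary k-means on the features.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n = 1000
X = np.vstack([rng.normal(c, 0.3, size=(n // 2, 2))
               for c in ([0, 0], [2, 2])])

def rbf(A, B, gamma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

m = int(np.sqrt(n))                       # ~sqrt(n) landmarks, as in the theory
idx = rng.choice(n, size=m, replace=False)
L = X[idx]

K_mm = rbf(L, L)
K_nm = rbf(X, L)
# Nystrom feature map: Phi = K_nm @ K_mm^{-1/2}, via eigendecomposition.
w, V = np.linalg.eigh(K_mm)
w = np.maximum(w, 1e-10)
Phi = K_nm @ V @ np.diag(w ** -0.5) @ V.T

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Phi)
print(np.bincount(labels))                # two balanced clusters recovered
```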

Assessing the Scalability of Biologically-Motivated Deep Learning Algorithms and Architectures

The backpropagation of error algorithm (BP) is often said to be impossible to implement in a real brain. The recent success of deep networks in machine learning and AI, however, has inspired proposals for understanding how the brain might learn across multiple layers, and hence how it might implement or approximate BP. As of yet, none of these proposals have been rigorously evaluated on tasks where BP-guided deep learning has proved critical, or in architectures more structured than simple fully-connected networks. Here we present the first results on scaling up biologically motivated models of deep learning on datasets which need deep networks with appropriate architectures to achieve good performance. We present results on MNIST, CIFAR-10, and ImageNet and explore variants of the difference target-propagation (DTP) algorithm. We focus on DTP and introduce weight-transport-free variants modified to remove backpropagation from the penultimate layer, in both fully- and locally-connected architectures. These algorithms perform well for MNIST, but for CIFAR and ImageNet we find that DTP and variants perform significantly worse than BP, especially for networks composed of locally connected units, opening questions about whether new architectures and algorithms are required to scale these approaches. Our results and implementation details help establish baselines for biologically motivated deep learning schemes going forward.

Learning Attractor Dynamics for Generative Memory

A central challenge faced by memory systems is the robust retrieval of a stored pattern in the presence of interference due to other stored patterns and noise. A theoretically well-founded solution to robust retrieval is given by attractor dynamics, which iteratively cleans up patterns during recall. However, incorporating attractor dynamics into modern deep learning systems poses difficulties: attractor basins are characterised by vanishing gradients, which are known to make training neural networks difficult. In this work, we exploit recent advances in variational inference and avoid the vanishing gradient problem by training a generative distributed memory with a variational lower-bound-based Lyapunov function. The model is minimalistic with surprisingly few parameters. Experiments show that it converges to correct patterns upon iterative retrieval and achieves competitive performance as both a memory model and a generative model.

The emergence of multiple retinal cell types through efficient coding of natural movies

One of the most striking aspects of early visual processing in the retina is the immediate parcellation of visual information into multiple parallel pathways, formed by different retinal ganglion cell types each tiling the entire visual field. Existing theories of efficient coding have been unable to account for the functional advantages of such cell-type diversity in encoding natural scenes. Here we go beyond previous theories to analyze how a simple linear retinal encoding model with different convolutional cell types efficiently encodes naturalistic spatiotemporal movies with a fixed firing rate budget. We find that optimizing the receptive fields and cell densities of two cell types makes them match the properties of two main cell types in the primate retina, midget and parasol cells, in terms of spatial and temporal sensitivity, cell spacing, and their relative ratio. Moreover, our theory gives a precise account of how the ratio of midget to parasol cells decreases with retinal eccentricity. Also, we train a nonlinear encoding model with a rectifying nonlinearity to efficiently encode naturalistic movies, and again find emergent receptive fields resembling those of midget and parasol cells that are now further subdivided into ON and OFF types. Thus our work provides a theoretical justification, based on the efficient coding of naturalistic movies, for the existence of the four most dominant cell types in the primate retina that together comprise 90% of all ganglion cells.

Gather-Excite: Exploiting feature context in ConvNets

The powerful image representations learned by deep convolutional neural networks (ConvNets) have propelled this family of models to a state of dominance in image classification. But by constructing features in a strictly bottom-up manner with local operators, ConvNets may be unable to efficiently exploit contextual information that resides in the relationships between features. The focus of this work is to propose a simple, lightweight solution to the issue of limited context propagation in ConvNets. Our approach, which we formulate as a gather-excite operator pair, propagates context across a group of neurons by aggregating responses over their extent and redistributing the aggregates back through the group. The simplicity of our approach brings several benefits: the operators add few parameters and minimal computational overhead and, importantly, can be directly integrated into existing architectures to improve performance without careful hyperparameter tuning. We present evidence that integrating gather-excite operators into a ConvNet produces qualitatively different intermediate feature representations. Moreover, we show with experiments on the CIFAR-10, CIFAR-100 and ImageNet datasets that improving context diffusion can be just as important as increasing the depth of a network, at a fraction of the cost. In fact, we find that by supplementing a ResNet-50 model with gather-excite operators, it is able to outperform its 101-layer counterpart on ImageNet with no additional learnable parameters.
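
A parameter-free rendering of the idea in a few lines (our simplification; the paper also considers learned, parameterized variants): gather by averaging responses over the spatial extent, excite by gating the original responses with the aggregate.

```python
import numpy as np

def gather_excite(x):
    """Parameter-free sketch: gather = global average over the spatial
    extent; excite = redistribute via a sigmoid gate. This mirrors the
    gather/excite pairing in spirit, not the paper's exact operators."""
    # x: (channels, H, W) feature map
    gathered = x.mean(axis=(1, 2), keepdims=True)   # (C, 1, 1) aggregate
    gate = 1.0 / (1.0 + np.exp(-gathered))          # sigmoid gating
    return x * gate                                 # broadcast back over H, W

x = np.random.default_rng(0).normal(size=(8, 14, 14))
print(gather_excite(x).shape)  # (8, 14, 14): same shape, context-modulated
```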

Quantifying Linguistic Shifts: The Global Anchor Method and Its Applications

Language is dynamic, constantly evolving and adapting with respect to time, domain or topic. The adaptability of language is an active research area, where researchers discover social, cultural and domain-specific changes in language using distributional tools such as word embeddings. In this paper, we introduce the global anchor method for detecting corpus-level language shifts. We show both theoretically and empirically that, for detecting corpus-level language shifts, the global anchor method is equivalent to the alignment method, a widely used approach for comparing word embeddings. Despite this equivalence in detection ability, we demonstrate that the global anchor method is superior in applicability, as it can compare embeddings of different dimensionalities. Furthermore, the global anchor method has implementation and parallelization advantages. We show that the global anchor method reveals fine structures in the evolution of language and domain adaptation. When combined with the graph Laplacian technique, the global anchor method recovers the evolution trajectory and domain clustering of disparate text corpora.
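
On our reading, the method compares two embedding matrices over a shared vocabulary through their Gram matrices, which explains both claimed advantages: embeddings of different dimensionalities become comparable, and orthogonal transformations (the degree of freedom the alignment method optimizes over) are immaterial. A sketch:

```python
import numpy as np

def global_anchor_distance(E1, E2):
    """Discrepancy between two embedding matrices whose rows correspond to
    a shared vocabulary; only Gram matrices are compared, so the embedding
    dimensions may differ (our reading of the method, not verbatim)."""
    return np.linalg.norm(E1 @ E1.T - E2 @ E2.T)

rng = np.random.default_rng(0)
E1 = rng.normal(size=(100, 50))
Q, _ = np.linalg.qr(rng.normal(size=(50, 50)))           # random rotation
print(global_anchor_distance(E1, E1 @ Q))                # ~0: rotations are immaterial
print(global_anchor_distance(E1, rng.normal(size=(100, 25))))  # large: real shift
```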

Identification and Estimation of Causal Effects from Dependent Data

The assumption that data samples are independent and identically distributed (iid) is standard in many areas of statistics and machine learning. Nevertheless, in some settings, such as social networks, infectious disease modeling, and reasoning with spatial and temporal data, this assumption is false. An extensive literature exists on making causal inferences under the iid assumption [12, 8, 21, 16], but, as pointed out in [14], causal inference in non-iid contexts is challenging due to the combination of unobserved confounding bias and data dependence. In this paper we develop a general theory describing when causal inferences are possible in such scenarios. We use segregated graphs [15], a generalization of latent projection mixed graphs [23], to represent causal models of this type and provide a complete algorithm for non-parametric identification in these models. We then demonstrate how statistical inferences may be performed on causal parameters identified by this algorithm, even in cases where parts of the model exhibit full interference, meaning only a single sample is available for parts of the model [19]. We apply these techniques to a synthetic data set which considers the adoption of fake news articles given the social network structure, articles read by each person, and baseline demographics and socioeconomic covariates.

Deepcode: Feedback Codes via Deep Learning

The design of codes for communicating reliably over a statistically well-defined channel is an important endeavor involving deep mathematical research and wide-ranging practical applications. In this work, we present the first family of codes obtained via deep learning, which significantly beats state-of-the-art codes designed over several decades of research. The communication channel under consideration is the Gaussian noise channel with feedback, whose study was initiated by Shannon; feedback is known theoretically to improve reliability of communication, but no practical codes that do so have ever been successfully constructed. We break this logjam by integrating information-theoretic insights harmoniously with recurrent-neural-network based encoders and decoders to create novel codes that outperform known codes by 3 orders of magnitude in reliability. We also demonstrate several desirable properties of the codes: (a) generalization to larger block lengths; (b) composability with known codes; (c) adaptation to practical constraints. This result also presents broader ramifications for coding theory: even when the channel has a clear mathematical model, deep learning methodologies, when combined with channel-specific information-theoretic insights, can potentially beat state-of-the-art codes constructed over decades of mathematical research.

Learning and Testing Causal Models with Interventions

We consider testing and learning problems on causal Bayesian networks as defined by Pearl (Pearl, 2009). Given a causal Bayesian network $M$ on a graph with $n$ discrete variables and bounded in-degree and bounded ``confounded components'', we show that $O(\log n)$ interventions on an unknown causal Bayesian network $X$ on the same graph, and $\tilde{O}(n/\epsilon^2)$ samples per intervention, suffice to efficiently distinguish whether $X=M$ or whether there exists some intervention under which $X$ and $M$ are farther than $\epsilon$ apart in total variation distance. We also obtain sample/time/intervention efficient algorithms for: (i) testing the identity of two unknown causal Bayesian networks on the same graph; and (ii) learning a causal Bayesian network on a given graph. Although our algorithms are non-adaptive, we show that adaptivity does not help in general: $\Omega(\log n)$ interventions are necessary for testing the identity of two unknown causal Bayesian networks on the same graph, even adaptively. Our algorithms are enabled by a new subadditivity inequality for the squared Hellinger distance between two causal Bayesian networks.

Implicit Bias of Gradient Descent on Linear Convolutional Networks

We show that gradient descent on full-width linear convolutional networks of depth $L$ converges to a linear predictor related to the $\ell_{2/L}$ bridge penalty in the frequency domain. This is in contrast to linear fully-connected networks, where gradient descent converges to the hard-margin linear SVM solution, regardless of depth.

DAGs with NO TEARS: Continuous Optimization for Structure Learning

Estimating the structure of directed acyclic graphs (DAGs, also known as {Bayesian networks}) is a challenging problem since the search space of DAGs is combinatorial and scales superexponentially with the number of nodes. Existing approaches rely on various local heuristics for enforcing the acyclicity constraint. In this paper, we introduce a fundamentally different strategy: we formulate the structure learning problem as a purely \emph{continuous} optimization problem over real matrices that avoids this combinatorial constraint entirely. This is achieved by a novel characterization of acyclicity that is not only smooth but also exact. The resulting problem can be efficiently solved by standard numerical algorithms, which also makes implementation effortless. The proposed method outperforms existing ones, without imposing any structural assumptions on the graph such as bounded treewidth or in-degree.
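
The characterization, as we recall it from the paper, is $h(W) = \operatorname{tr}(e^{W \circ W}) - d$, which vanishes exactly when the weighted adjacency matrix $W$ corresponds to a DAG:

```python
import numpy as np
from scipy.linalg import expm

def h(W):
    """Smooth acyclicity measure: h(W) = tr(exp(W * W)) - d, which is
    zero iff the weighted adjacency matrix W encodes a DAG (our
    recollection of the paper's characterization)."""
    d = W.shape[0]
    return np.trace(expm(W * W)) - d

dag = np.array([[0., 1., 0.],
                [0., 0., 1.],
                [0., 0., 0.]])
cyc = np.array([[0., 1., 0.],
                [0., 0., 1.],
                [1., 0., 0.]])
print(h(dag))  # 0: chain graph is acyclic
print(h(cyc))  # > 0: the 3-cycle is penalized smoothly
```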

PAC-Bayes Tree: weighted subtrees with guarantees

We present a weighted-majority classification approach over subtrees of a fixed tree, which provably achieves excess-risk of the same order as the best tree-pruning. Furthermore, the computational efficiency of pruning is maintained at both training and testing time despite having to aggregate over an exponential number of subtrees. We believe this is the first subtree aggregation approach with such guarantees.

Multi-objective Maximization of Monotone Submodular Functions with Cardinality Constraint

We consider the problem of multi-objective maximization of monotone submodular functions subject to a cardinality constraint, often formulated as $\max_{|A|=k}\min_{i\in\{1,\dots,m\}}f_i(A)$. While it is well known that greedy methods work well for a single objective, the problem becomes much harder with multiple objectives. In fact, Krause et al.\ (2008) showed that when the number of objectives $m$ grows as the cardinality $k$, i.e., $m=\Omega(k)$, the problem is inapproximable (unless $P=NP$). On the other hand, when $m$ is constant, there is a randomized $(1-1/e)-\epsilon$ approximation with runtime (number of queries to the function oracle) $n^{m/\epsilon^3}$ due to Chekuri et al.\ (2010). In this paper, we focus on finding a fast and practical algorithm that has (asymptotic) approximation guarantees even when $m$ is super constant. We first modify the algorithm of Chekuri et al.\ (2010) to achieve a $(1-1/e)$ approximation for $m=o(\frac{k}{\log^3 k})$, demonstrating a steep transition from constant-factor approximability to inapproximability around $\Omega(k)$. More importantly, using Multiplicative-Weight-Updates (MWU) we find a much faster $\tilde{O}(n/\delta^3)$ time, asymptotic $(1-1/e)^2-\delta$ approximation. While the above results are all randomized, we also give a simple deterministic $(1-1/e)-\epsilon$ approximation with runtime $kn^{m/\epsilon^4}$. Finally, we run synthetic experiments on Kronecker graphs and find that our MWU-inspired heuristic outperforms existing heuristics.

Sanity Checks for Saliency Maps

Saliency methods have emerged as a popular tool to highlight features in an input deemed relevant for the prediction of a learned model. Several saliency methods have been proposed, often guided by visual appeal on image data. In this work, we propose an actionable methodology to evaluate what kinds of explanations a given method can and cannot provide. We find that sole reliance on visual assessment can be misleading. Through extensive experiments we show that some existing saliency methods are independent both of the model and of the data generating process. Consequently, methods that fail the proposed tests are inadequate for tasks that are sensitive to either data or model, such as finding outliers in the data, explaining the relationship between inputs and outputs that the model learned, and debugging the model. We interpret our findings through an analogy with edge detection in images, a technique that requires neither training data nor a model. Theory in the case of a linear model and a single-layer convolutional neural network supports our experimental findings.

Probabilistic Model-Agnostic Meta-Learning

Meta-learning for few-shot learning entails acquiring a prior over previous tasks and experiences, such that new tasks can be learned from small amounts of data. However, a critical challenge in few-shot learning is task ambiguity: even when a powerful prior can be meta-learned from a large number of prior tasks, a small dataset for a new task can simply be too ambiguous to identify a single accurate model (e.g., a classifier) for that task. In this paper, we propose a probabilistic meta-learning algorithm that can sample models for a new task from a model distribution. Our approach extends model-agnostic meta-learning, which adapts to new tasks via gradient descent, to incorporate a parameter distribution that is trained via a variational lower bound. At meta-test time, our algorithm adapts via a simple procedure that injects noise into gradient descent, and at meta-training time, the model is trained such that this stochastic adaptation procedure produces samples from the approximate model posterior. Our experimental results show that our method can sample plausible classifiers and regressors in ambiguous few-shot learning problems.

Reinforcement Learning with Multiple Experts: A Bayesian Model Combination Approach

Potential-based reward shaping is a powerful technique for accelerating the convergence of reinforcement learning algorithms. Typically, the shaping information encodes an estimate of the optimal value function and is provided by a human expert or another source of domain knowledge. However, this information is often biased or inaccurate and can mislead many reinforcement learning algorithms. In this paper, we apply Bayesian Model Combination with multiple experts in a way which learns to trust the best combination of experts as training progresses. This approach is both computationally efficient and general, and is shown numerically to improve convergence of various reinforcement learning algorithms across many domains.

e-SNLI: Natural Language Inference with Natural Language Explanations

In order for machine learning to garner widespread public adoption, models must be able to provide interpretable and robust explanations for their decisions, as well as learn from natural language explanations. In this work, we extend the Stanford Natural Language Inference dataset with an additional layer of human-annotated free-form explanations of the entailment relations. We further implement models that incorporate these explanations into their training process and output them at test time. We show that our corpus of explanations can be used for various goals, such as obtaining full-sentence justifications of a model's decisions and providing consistent improvements on a range of tasks compared to universal sentence representations learned without explanations. Our dataset opens up a range of research directions for using natural language explanations, both for improving models and for establishing trust in their predictions.

Fast Approximate Natural Gradient Descent in a Kronecker Factored Eigenbasis

Optimization algorithms that leverage gradient covariance information, such as variants of natural gradient descent (Amari, 1998), offer the prospect of yielding more effective descent directions. For models with many parameters, however, the covariance matrix they are based on becomes gigantic, rendering them inapplicable in their original form. This has motivated research into both simple diagonal approximations and more sophisticated factored approximations such as KFAC (Heskes, 2000; Martens & Grosse, 2015; Grosse & Martens, 2016). In the present work we draw inspiration from both to propose a novel approximation that is provably better than KFAC and amenable to cheap partial updates. It consists of tracking a diagonal variance, not in parameter coordinates, but in a Kronecker-factored eigenbasis, in which the diagonal approximation is likely to be more effective. Experiments show improvements over KFAC in optimization speed for several deep network architectures, both in number of iterations and in wall-clock time.

Learning convex bounds for linear quadratic control policy synthesis

Learning to make decisions from observed data in dynamic environments remains a problem of fundamental importance in a number of fields, from artificial intelligence and robotics to medicine and finance. This paper concerns the problem of learning control policies for unknown linear dynamical systems so as to maximize a quadratic reward function. We present a method to optimize the expected value of the reward over the posterior distribution of the unknown system parameters, given data. The algorithm involves sequential convex programming, and enjoys reliable local convergence and robust stability guarantees. Numerical simulations and stabilization of a real-world inverted pendulum are used to demonstrate the approach, with strong performance and robustness properties observed in both.

Neural Proximal Gradient Descent for Compressive Imaging

Recovering high-resolution images from limited sensory data typically leads to a seriously ill-posed inverse problem, demanding inversion algorithms that effectively capture the prior information. Learning a good inverse mapping from training data faces severe challenges, including: (i) scarcity of training data; (ii) the need for plausible reconstructions that are physically feasible; (iii) the need for fast reconstruction, especially in real-time applications. We develop a system that addresses all of these challenges, using as its basic architecture the repeated application of alternating proximal and data-fidelity constraints. We learn a proximal map that works well with real images, based on residual networks with recurrent blocks. Extensive experiments are carried out under different settings: (a) reconstructing abdominal MRI of pediatric patients from highly undersampled k-space data and (b) super-resolving natural face images. Our key findings include: 1. a recurrent ResNet with a single residual block (10-fold repetition) yields an effective proximal map which accurately reveals MR image details; 2. our architecture significantly outperforms conventional non-recurrent deep ResNets by 2dB SNR and is also trained much more rapidly; 3. it outperforms state-of-the-art compressed-sensing wavelet-based methods by 4dB SNR, with 100x speedups in reconstruction time.

To What Extent Do Different Neural Networks Learn the Same Representation: A Neuron Activation Subspace Match Approach

Studying the learned representations is important for understanding deep neural networks. In this paper, we investigate the similarity of representations learned by two networks with identical architecture but trained from different initializations. Instead of resorting to heuristic methods, we develop a rigorous theory based on the neuron activation subspace match model. The theory gives a complete characterization of the structure of neuron activation subspace matches, where the core concepts are maximum match and simple match which describe the overall and the finest similarity between sets of neurons in two networks respectively. We also propose efficient algorithms to find the maximum match and simple matches. Finally, experimental study using our algorithms suggests that, somewhat surprisingly, representations learned by the same convolutional layers of two networks are not as similar as prevalently expected.

Optimal Algorithms for Continuous Non-monotone Submodular and DR-Submodular Maximization

In this paper we study the fundamental problems of maximizing a continuous non-monotone submodular function over a hypercube, with and without coordinate-wise concavity. This family of optimization problems has several applications in machine learning, economics, and communication systems. Our main result is the first 1/2-approximation algorithm for continuous submodular function maximization; this approximation factor is the best possible for algorithms that use only polynomially many queries. For the special case of DR-submodular maximization, we provide a different 1/2-approximation algorithm that runs in quasi-linear time. Both of these results improve upon prior work [Bian et al., 2017]. Our first algorithm uses novel ideas, such as reducing the approximation guarantee to the analysis of a zero-sum game for each coordinate, and incorporates the geometry of this game to fix the value at that coordinate. Our second algorithm exploits coordinate-wise concavity to identify a monotone equilibrium condition sufficient for obtaining the required approximation guarantee, and hunts for the equilibrium point using binary search. We further run experiments to verify the performance of our proposed algorithms in related machine learning applications.

An intriguing failing of convolutional neural networks and the CoordConv solution

Few ideas have enjoyed as large an impact on deep learning as convolution. For any problem involving pixels or spatial representations, common intuition holds that convolutional networks may be appropriate. In this paper we show a striking counterexample to this intuition via the seemingly trivial coordinate transform problem, which simply requires learning a mapping between coordinates in (x,y) Cartesian space and coordinates in pixel space. Although convolutional networks would seem appropriate for this task, we show that they fail spectacularly. We demonstrate and carefully analyze the failure first on a toy problem, at which point a simple fix becomes obvious. We call this solution CoordConv, which works by giving convolution access to its own input coordinates through the use of extra coordinate channels. Without sacrificing the computational and parametric efficiency of ordinary convolution, CoordConv allows networks to learn either perfect translation invariance or varying degrees of translation dependence, as required by the end task. CoordConv solves the coordinate transform problem 150 times faster, with 10-100 times fewer parameters, and with perfect generalization. This stark contrast leads to a final question: to what extent has this inability of convolution persisted insidiously inside other tasks, subtly hampering performance from within? A complete answer to this question will likely require much follow up work, but we show preliminary evidence that swapping convolution for CoordConv can improve models on a diverse set of tasks. We show that using CoordConv in GANs results in less mode collapse as the transform between high-level spatial latents and pixels becomes easier to learn. We show small but statistically significant improvements by simply adding a CoordConv layer to ResNet-50, and we show significant improvements in the RL domain by giving agents playing Atari games access to CoordConv layers.
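
The mechanism itself is simple: append normalized $i$- and $j$-coordinate channels to the input of a convolution. A minimal sketch of the augmentation step:

```python
import numpy as np

def add_coord_channels(x):
    """CoordConv input augmentation: append row- and column-coordinate
    channels (scaled to [-1, 1]) so a following convolution can see
    absolute position."""
    n, c, h, w = x.shape
    i = np.linspace(-1.0, 1.0, h).reshape(1, 1, h, 1)
    j = np.linspace(-1.0, 1.0, w).reshape(1, 1, 1, w)
    ii = np.broadcast_to(i, (n, 1, h, w))
    jj = np.broadcast_to(j, (n, 1, h, w))
    return np.concatenate([x, ii, jj], axis=1)

x = np.zeros((2, 3, 8, 8), dtype=np.float32)
print(add_coord_channels(x).shape)  # (2, 5, 8, 8): two extra channels
```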

Trading robust representations for sample complexity through self-supervised visual experience

Learning in small sample regimes is among the most remarkable features of the human perceptual system. This ability is related to robustness to transformations, which is acquired through visual experience in the form of weak- or self-supervision during development. We explore the idea of allowing artificial systems to learn representations of visual stimuli through weak supervision prior to downstream supervised tasks. We introduce a novel loss function for representation learning using unlabeled image sets and video sequences, and experimentally demonstrate that these representations support one-shot learning and reduce the sample complexity of multiple recognition tasks. We establish the existence of a trade-off between the size of weakly supervised data sets, automatically obtained from video sequences, and the size of fully supervised data sets. Our results suggest that equivalence sets other than class labels, which are abundant in unlabeled visual experience, can be used for self-supervised learning of semantically relevant image embeddings.

Invertibility of Convolutional Generative Networks from Partial Measurements

In this work, we present new theoretical results on convolutional generative neural networks, in particular their invertibility (i.e., the recovery of the input latent code given the network output). This inversion problem is highly non-convex, which in general is computationally challenging and has no performance guarantee. However, we rigorously prove that, even when the network output is only partially observed (e.g., with missing pixels), the input of a two-layer convolutional generative network can always be computed from the network output, using simple gradient descent. This new theoretical finding implies that the mapping from the low-dimensional latent space to the high-dimensional image space is bijective (i.e., one-to-one). Our theorem holds for two-layer convolutional generative networks with ReLU as the activation function, but we demonstrate empirically that the same conclusion extends to multi-layer networks and networks with other activation functions (including leaky ReLU, sigmoid and tanh). Our proof is built on our newly proposed permutation technique, which can potentially be generalized to networks with multiple layers and to other theoretical studies of convolutional neural networks, and is thus of independent merit.
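
A toy, fully-connected stand-in for the setting (the theorem concerns convolutional generators, so this is only illustrative): generate observations from a random expansive two-layer ReLU network, hide half of the outputs, and run plain gradient descent on the latent code.

```python
import numpy as np

rng = np.random.default_rng(0)
k, m, p = 5, 30, 60                    # latent, hidden, output sizes
A = rng.normal(size=(m, k)) / np.sqrt(k)
B = rng.normal(size=(p, m)) / np.sqrt(m)

def G(z):
    return B @ np.maximum(A @ z, 0.0)  # two-layer ReLU generator

z_true = rng.normal(size=k)
mask = rng.random(p) < 0.5             # only half the outputs are observed
y = G(z_true)[mask]

z = 0.1 * rng.normal(size=k)
for _ in range(20000):
    h = A @ z
    r = np.zeros(p)
    r[mask] = (B @ np.maximum(h, 0.0))[mask] - y   # masked residual
    z -= 5e-3 * (A.T @ ((h > 0) * (B.T @ r)))      # gradient step on z

print(np.linalg.norm(G(z)[mask] - y))  # observed residual driven near zero
print(np.linalg.norm(z - z_true))      # often recovers the true latent code
```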

Ex ante coordination and collusion in zero-sum multi-player extensive-form games

Recent milestones in equilibrium computation, such as the success of Libratus, show that it is possible to compute strong solutions to two-player zero-sum games in theory and practice. This is not the case for games with multiple players, which remains one of the main open challenges in computational game theory. This paper focuses on zero-sum games where a team of players faces an opponent, as is the case, for example, in Bridge, collusion in poker, and many non-recreational applications: war, where the colluders do not have the time or means to communicate during battle; collusion in bidding, where communication during the auction is illegal; and coordinated swindling in public. The possibility for the team members to communicate before game play—that is, coordinate their strategies ex ante—makes the use of behavioral strategies unsatisfactory. The reasons for this are closely related to the fact that the team can be represented as a single player with imperfect recall. We propose a new game representation, the realization form, that generalizes the sequence form but can also be applied to imperfect-recall games. Then, we use it to derive an auxiliary game that is equivalent to the original one. It provides a sound way to map the problem of finding an optimal ex-ante-correlated strategy for the team to the well-understood Nash equilibrium-finding problem in a (larger) two-player zero-sum perfect-recall game. By reasoning over the auxiliary game, we devise an anytime algorithm, Fictitious Team-Play, that is guaranteed to converge to an optimal coordinated strategy for the team against an optimal opponent, and that is dramatically faster than the prior state-of-the-art algorithm for this problem.

Multi-Agent Reinforcement Learning via Double Averaging Primal-Dual Optimization

Despite the success of single-agent reinforcement learning, multi-agent reinforcement learning (MARL) remains challenging due to complex interactions between agents. Motivated by decentralized applications such as sensor networks, swarm robotics, and power grids, we study policy evaluation in MARL, where agents with jointly observed state-action pairs and private local rewards collaborate to learn the value of a given policy. In this paper, we propose a double averaging scheme, where each agent iteratively performs averaging over both space and time to incorporate neighboring gradient information and local reward information, respectively. We prove that the proposed algorithm converges to the optimal solution at a global geometric rate. In particular, such an algorithm is built upon a primal-dual reformulation of the mean squared Bellman error minimization problem, which gives rise to a decentralized convex-concave saddle-point problem. To the best of our knowledge, the proposed double averaging primal-dual optimization algorithm is the first to achieve fast finite-time convergence on decentralized convex-concave saddle-point problems.

Improving Online Algorithms via ML Predictions

In this work we study the problem of using machine learned predictions to improve performance of online algorithms. We consider two classical problems, ski rental and non-clairvoyant job scheduling, and obtain new online algorithms that use predictions to make their decisions. These algorithms are oblivious to the performance of the predictor, improve with better predictions, but do not degrade much if the predictions are poor.
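
For ski rental, one deterministic way to fold in a prediction is to move the buy day earlier or later depending on whether the predicted number of ski days exceeds the break-even point; the thresholds below follow our recollection of the paper's deterministic variant and should be treated as an assumption, not a verbatim restatement.

```python
import math

def buy_day(b, pred, lam=0.5):
    """Deterministic ski rental with a prediction (sketch; thresholds are
    our recollection, treat as an assumption). b: purchase cost in
    rent-days; pred: predicted number of ski days; lam in (0, 1] trades
    consistency (good predictions) for robustness (bad ones)."""
    return math.ceil(lam * b) if pred >= b else math.ceil(b / lam)

def total_cost(days, b, day_bought):
    # Rent every day before buying; pay b once if skiing reaches that day.
    return days if days < day_bought else (day_bought - 1) + b

b = 10
for true_days, pred in [(3, 3), (3, 30), (30, 30), (30, 3)]:
    alg = total_cost(true_days, b, buy_day(b, pred))
    opt = min(true_days, b)   # offline optimum: rent all days or buy at once
    print(f"true={true_days:2d} pred={pred:2d} ALG={alg:3d} OPT={opt:2d} "
          f"ratio={alg / opt:.2f}")
```

With an accurate prediction the ratio stays near 1 + lam, and even with a badly wrong prediction it never exceeds 1 + 1/lam, illustrating the "improve with better predictions, degrade gracefully" behavior described above.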

Non-convex Optimization with Discretized Diffusions

An Euler discretization of the Langevin diffusion is known to converge to the global minimizers of certain convex and non-convex optimization problems. We show that this property holds for any suitably smooth diffusion and that different diffusions are suitable for optimizing different classes of convex and non-convex functions. This allows us to design diffusions suitable for globally optimizing non-convex functions not covered by the existing Langevin theory. Our non-asymptotic analysis establishes explicit, finite-time convergence rates to global minima, and is based on a multidimensional version of Stein's method with new explicit bounds on the solutions of Poisson equations.
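
The discretization in question is the Euler(-Maruyama) scheme $x_{k+1} = x_k - \gamma \nabla f(x_k) + \sqrt{2\gamma/\beta}\,\xi_k$ with Gaussian noise $\xi_k$; a minimal sketch on a non-convex double-well objective:

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_f(x):
    # Non-convex double well: f(x) = (x^2 - 1)^2, global minima at x = +/-1.
    return 4.0 * x * (x * x - 1.0)

gamma, beta = 1e-2, 8.0      # step size, inverse temperature
x = 2.5                      # start far from both minima
for _ in range(20000):
    x += -gamma * grad_f(x) + np.sqrt(2.0 * gamma / beta) * rng.normal()
print(x)  # hovers near one of the global minima +/-1
```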

Theoretical guarantees for EM under misspecified Gaussian mixture models

Recent years have witnessed substantial progress in understanding the behavior of EM for mixture models that are correctly specified. Given that model misspecification is common in practice, it is important to understand EM in this more general setting. We provide non-asymptotic guarantees for population and sample-based EM for parameter estimation under a few specific univariate settings of misspecified Gaussian mixture models. Due to misspecification, the EM iterates no longer converge to the true model and instead converge to the projection of the true model onto the set of models being searched over. We provide two classes of theoretical guarantees: first, we characterize the bias introduced due to the misspecification; and second, we prove that population EM converges at a geometric rate to the model projection under a suitable initialization condition. This geometric convergence rate for population EM implies a statistical complexity of order $1/\sqrt{n}$ when running EM with $n$ samples. We validate our theoretical findings in different cases via several numerical examples.

Coupled Variational Bayes via Optimization Embedding

Variational inference plays a vital role in learning graphical models, especially on large-scale datasets. Much of its success depends on a proper choice of auxiliary distribution class for posterior approximation. However, how to pursue an auxiliary distribution class that achieves both good approximation ability and computational efficiency remains a core challenge. In this paper, we construct such a distribution class, termed optimization embedding, since it takes root in an optimization procedure. This flexible function class couples the variational distribution with the original parameters in the graphical models, allowing end-to-end learning of the graphical models by back-propagation through the variational distribution. Theoretically, we establish an interesting connection to gradient flow and demonstrate the extreme flexibility of this implicit distribution family in the limit. Practically, the proposed technique significantly accelerates the learning procedure, i.e., the proposed coupled variational Bayes, by reducing the search space to a large extent. We further demonstrate the significant superiority of the proposed method on multiple graphical models with either continuous or discrete latent variables, compared to state-of-the-art methods.

Improving Explorability in Variational Inference with Annealed Variational Objectives

Despite the advances in the representational capacity of approximate distributions for variational inference, the optimization process can still limit the density that is ultimately learned. We demonstrate the drawbacks of biasing the true posterior to be unimodal, and introduce Annealed Variational Objectives (AVO) into the training of hierarchical variational methods. Inspired by Annealed Importance Sampling, the proposed method facilitates learning by incorporating energy tempering into the optimization objective. In our experiments, we demonstrate our method's robustness to deterministic warm up, and the benefits of encouraging exploration in the latent space.

Latent Alignment and Variational Attention

Attention, a method for learning a soft alignment function embedded in a neural network, is central to many state-of-the-art models in natural language processing and related domains. Attention networks are easy to train and interpret; however, the standard (soft) attention approach is fully feed-forward and does not marginalize over the latent alignment in a traditional sense. This property makes it difficult to compare attention to other alignment approaches, to compose it with probabilistic models, and to perform posterior inference conditioned on observed data. A popular latent variable approach, hard attention, fixes these issues, but is unfortunately generally less accurate and harder to train. In this work, we explore the space of modern variational inference approaches as alternatives for learning latent variable alignment models. Variational attention generalizes hard attention and can provide a tighter approximation bound. We further propose methods for reducing the variance of gradients to make these approaches computationally feasible. Experiments show that for machine translation and visual question answering, while hard attention performs worse than soft attention, exact latent variable models can outperform both. Furthermore, variational attention retains almost all of this performance gain with training speed comparable to soft attention.

Towards Deep Conversational Recommendations

There has been growing interest in using neural networks and deep learning techniques to create dialogue systems. Conversational recommendation is an interesting setting for the scientific exploration of dialogue with natural language as the associated discourse involves goal-driven dialogue that often transforms naturally into more free-form chat. This paper provides two contributions. First, until now there has been no publicly available large-scale data set consisting of real-world dialogues centered around recommendations. To address this issue and to facilitate our exploration here, we have collected a data set consisting of over 10,000 conversations centered around the theme of providing movie recommendations. We intend to make this data available to the community for further research. Second, we use this dataset to explore multiple facets of conversational recommendations. In particular we explore new neural architectures, mechanisms and methods suitable for composing conversational recommendation systems. Our dataset allows us to systematically probe model sub-components addressing different parts of the overall problem domain, ranging from sentiment analysis and cold-start recommendation generation to detailed aspects of how natural language is used in this setting in the real world. We combine such sub-components into a full-blown dialogue system and examine its behavior.

Unsupervised Depth Estimation, 3D Face Rotation and Replacement

We present an unsupervised approach for learning to estimate three-dimensional (3D) facial structure from a single image while also predicting 3D viewpoint transformations that match a desired pose and facial geometry. We achieve this by inferring the depth of facial keypoints of an input image in an unsupervised way. We show how it is possible to use these depths as intermediate computations within a new backproppable loss to predict the parameters of a 3D affine transformation matrix that maps inferred 3D keypoints of an input face to the corresponding 2D keypoints on a desired target facial geometry or pose. Our resulting approach can therefore be used to infer plausible 3D transformations from one face pose to another, allowing faces to be frontalized, transformed into 3D models or even warped to another pose and facial geometry. Lastly, we identify certain shortcomings with our formulation, and explore adversarial image translation techniques as a post-processing step to re-synthesize complete head shots for faces re-targeted to different poses or identities.

Generalization Bounds for Uniformly Stable Algorithms

Uniform stability of a learning algorithm is a classical notion of algorithmic stability introduced to derive high-probability bounds on the generalization error (Bousquet and Elisseeff, 2002). Specifically, for a loss function with range bounded in $[0,1]$, the generalization error of $\gamma$-uniformly stable learning algorithm on $n$ samples is known to be at most $O((\gamma +1/n) \sqrt{n \log(1/\delta)})$ with probability at least $1-\delta$. Unfortunately, this bound does not lead to meaningful generalization bounds in many common settings where $\gamma \geq 1/\sqrt{n}$. At the same time the bound is known to be tight only when $\gamma = O(1/n)$. Here we prove substantially stronger generalization bounds for uniformly stable algorithms without any additional assumptions. First, we show that the generalization error in this setting is at most $O(\sqrt{(\gamma + 1/n) \log(1/\delta)})$ with probability at least $1-\delta$. In addition, we prove a tight bound of $O(\gamma^2 + 1/n)$ on the second moment of the generalization error. The best previous bound on the second moment of the generalization error is $O(\gamma + 1/n)$. Our proofs are based on new analysis techniques and our results imply substantially stronger generalization guarantees for several well-studied algorithms.

Deep Anomaly Detection Using Geometric Transformations

We consider the problem of anomaly detection in images, and present a new detection technique. Given a sample of images, all known to belong to a normal class (e.g., dogs), we show how to train a deep neural model that can detect out-of-distribution images (i.e., non-dog objects). The main idea behind our scheme is to train a multi-class model to discriminate between dozens of geometric transformations applied on all the given images. The auxiliary expertise learned by the model generates feature detectors that effectively identify, at test time, anomalous images based on the softmax activation statistics of the model when applied on transformed images. We present extensive experiments using the proposed detector, which indicate that our algorithm improves state-of-the-art methods by a wide margin.

Large Scale computation of Means and Clusters for Persistence Diagrams using Optimal Transport

Persistence diagrams (PDs) are now routinely used to summarize the underlying topology of sophisticated data encountered in challenging learning problems. Despite several appealing properties, integrating PDs in learning pipelines can be challenging because their natural geometry is not Hilbertian. In particular, algorithms to average a family of PDs have only been considered recently and are known to be computationally prohibitive. We propose in this article a tractable framework to carry out fundamental tasks on PDs, namely evaluating distances, computing barycenters and carrying out clustering. This framework builds upon a formulation of PD metrics as optimal transport (OT) problems, for which recent computational advances, in particular entropic regularization and its convolutional formulation on regular grids, can all be leveraged to provide efficient and (GPU) scalable computations. We demonstrate the efficiency of our approach by carrying out clustering on PDs at scales never seen before in the literature.

Entropy Rate Estimation for Markov Chains with Large State Space

Entropy estimation is one of the prototypical problems in distribution property testing. To consistently estimate the Shannon entropy of a distribution on $S$ elements with independent samples, the optimal sample complexity scales sublinearly with $S$ as $\Theta(\frac{S}{\log S})$, as shown by Valiant and Valiant \cite{Valiant--Valiant2011}. Extending the theory and algorithms for entropy estimation to dependent data, this paper considers the problem of estimating the entropy rate of a stationary reversible Markov chain with $S$ states from a sample path of $n$ observations. We show that (i) provided the Markov chain mixes not too slowly, \textit{i.e.}, the relaxation time is at most $O(\frac{S}{\ln^3 S})$, consistent estimation is achievable when $n \gg \frac{S^2}{\log S}$; and (ii) provided the Markov chain has some slight dependency, \textit{i.e.}, the relaxation time is at least $1+\Omega(\frac{\ln^2 S}{\sqrt{S}})$, consistent estimation is impossible when $n \lesssim \frac{S^2}{\log S}$. Under both assumptions, the optimal estimation accuracy is shown to be $\Theta(\frac{S^2}{n \log S})$. In comparison, the empirical entropy rate requires at least $\Omega(S^2)$ samples to be consistent, even when the Markov chain is memoryless. In addition to synthetic experiments, we also apply the estimators that achieve the optimal sample complexity to estimate the entropy rate of the English language in the Penn Treebank and the Google One Billion Words corpora, which provides a natural benchmark for language modeling and relates it directly to the widely used perplexity measure.
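
For contrast, the sample-hungry empirical (plug-in) entropy rate mentioned above is easy to write down: estimate the transition matrix from the sample path and average the row entropies under the empirical state frequencies.

```python
import numpy as np

def empirical_entropy_rate(path, S):
    """Plug-in estimate (in nats): row entropies of the estimated
    transition matrix, averaged under empirical state frequencies."""
    counts = np.zeros((S, S))
    for a, b in zip(path[:-1], path[1:]):
        counts[a, b] += 1
    P = counts / np.maximum(counts.sum(1, keepdims=True), 1)
    pi = counts.sum(1) / counts.sum()       # empirical state frequencies
    H = np.where(P > 0, -P * np.log(np.where(P > 0, P, 1.0)), 0.0).sum(1)
    return float(pi @ H)

# Two-state chain whose true entropy rate is ~0.384 nats.
rng = np.random.default_rng(0)
P = np.array([[0.9, 0.1], [0.2, 0.8]])
path, s = [], 0
for _ in range(100000):
    s = rng.choice(2, p=P[s])
    path.append(s)
print(empirical_entropy_rate(path, 2))
```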

Adaptive Methods for Nonconvex Optimization

Adaptive gradient methods that rely on scaling gradients down by the square root of exponential moving averages of past squared gradients, such as RMSProp, Adam, and Adadelta, have found wide application in optimizing the nonconvex problems that arise in deep learning. However, it has been recently demonstrated that such methods can fail to converge even in simple convex optimization settings. In this work, we provide a new analysis of such methods applied to nonconvex stochastic optimization problems, characterizing the effect of increasing minibatch size. Our analysis shows that under this scenario such methods do converge to stationarity up to the statistical limit of variance in the stochastic gradients (scaled by a constant factor). In particular, our result implies that increasing minibatch sizes enables convergence, thus providing a way to circumvent the non-convergence issues. Furthermore, we provide a new adaptive optimization algorithm, Yogi, which controls the increase in effective learning rate, leading to even better performance with similar theoretical guarantees on convergence. Extensive experiments show that Yogi with very little hyperparameter tuning outperforms methods such as Adam in several challenging machine learning tasks.
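
A sketch of the Yogi update as we read it from the paper: the second-moment estimate v changes additively rather than by Adam's exponential moving average, so the effective learning rate cannot change too quickly:

```python
import numpy as np

def yogi_step(theta, m, v, grad, lr=1e-2, b1=0.9, b2=0.999, eps=1e-3):
    """One Yogi step. The only change vs. Adam is the v update, which moves v
    toward grad^2 by a controlled additive amount instead of an EMA."""
    g2 = grad * grad
    m = b1 * m + (1 - b1) * grad
    v = v - (1 - b2) * np.sign(v - g2) * g2
    return theta - lr * m / (np.sqrt(v) + eps), m, v

# toy usage: minimize f(theta) = ||theta||^2, whose gradient is 2*theta
theta, m, v = np.ones(3), np.zeros(3), np.zeros(3)
for _ in range(2000):
    theta, m, v = yogi_step(theta, m, v, 2 * theta)
print(theta)  # close to 0
```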

Object-Oriented Dynamics Predictor

Generalization has been one of the major challenges for learning dynamics models in model-based reinforcement learning. However, previous work on action-conditioned dynamics prediction focuses on learning the pixel-level motion and thus does not generalize well to novel environments with different object layouts. In this paper, we present a novel object-oriented framework, called object-oriented dynamics predictor (OODP), which decomposes the environment into objects and predicts the dynamics of objects conditioned on both actions and object-to-object relations. It is an end-to-end neural network and can be trained in an unsupervised manner. To enable the generalization ability of dynamics learning, we design a novel CNN-based relation mechanism that is class-specific (rather than object-specific) and exploits the locality principle. Empirical results show that OODP significantly outperforms previous methods in terms of generalization over novel environments with various object layouts. OODP is able to learn from very few environments and accurately predict dynamics in a large number of unseen environments. In addition, OODP learns semantically and visually interpretable dynamics models.

Adaptive Skip Intervals: Temporal Abstraction for Recurrent Dynamical Models

We introduce a method which enables a recurrent dynamics model to be temporally abstract. Our approach, which we call Adaptive Skip Intervals (ASI), is based on the observation that in many sequential prediction tasks, the exact time at which events occur is irrelevant to the underlying objective. Moreover, in many situations, there exist prediction intervals which result in particularly easy-to-predict transitions. We show that there are prediction tasks for which we gain both computational efficiency and prediction accuracy by allowing the model to make predictions at a sampling rate which it can choose itself.

Scalable End-to-End Autonomous Vehicle Testing via Rare-event Simulation

While recent developments in autonomous vehicle (AV) technology highlight substantial progress, we lack the tools for rigorous, scalable testing that are so important for these safety-critical systems. Real-world testing, the \textit{de facto} evaluation environment, places the public in danger, and, due to the rare nature of accidents, will require billions of miles to validate empirically. We implement a simulation framework, \emph{pseudoreality}, which can put an entire autonomous driving system under test -- this includes deep-learning based perception and control algorithms, but also the underlying dynamics models and near-photo-realistic rendering engine required to complete the loop. While complete systems of this complexity are currently intractable for formal verification, we demonstrate full-scale testing using a \emph{risk-based framework} where our goal is to evaluate the probability of an accident under a base distribution governing standard traffic behavior. Further, we address fundamental challenges in the sample complexity of risk evaluation through the use of adaptive importance-sampling methods. We demonstrate our framework on a highway scenario, showing that it is possible to accelerate system evaluation by $10$-$50 \mathsf{P}$ times that of real-world testing ($\mathsf{P}$ is the number of processors), and $1.5$-$5$ times that of naive Monte Carlo sampling methods.

Reinforcement Learning for Solving the Vehicle Routing Problem

We present an end-to-end framework for solving the Vehicle Routing Problem (VRP) using reinforcement learning. In this approach, we train a single model that finds near-optimal solutions for problem instances sampled from a given distribution, only by observing the reward signals and following feasibility rules. Our model represents a parameterized stochastic policy, and by applying a policy gradient algorithm to optimize its parameters, the trained model produces the solution as a sequence of consecutive actions in real time, without the need to re-train for every new problem instance. On capacitated VRP, our approach outperforms classical heuristics and Google's OR-Tools on medium-sized instances in solution quality with comparable computation time (after training). We demonstrate how our approach can handle problems with split delivery and explore the effect of such deliveries on the solution quality. Our proposed framework can be applied to other variants of the VRP such as the stochastic VRP, and has the potential to be applied more generally to combinatorial optimization problems.

ATOMO: Communication-efficient Learning via Atomic Sparsification

Distributed model training suffers from communication overheads due to frequent gradient updates transmitted between compute nodes. To mitigate these overheads, several studies propose the use of sparsified stochastic gradients. We argue that these are facets of a general sparsification method that can operate on any possible atomic decomposition. Notable examples include element-wise, singular value, and Fourier decompositions. We present ATOMO, a general framework for atomic sparsification of stochastic gradients. Given a gradient, an atomic decomposition, and a sparsity budget, ATOMO gives a random unbiased sparsification of the atoms minimizing variance. We show that methods such as QSGD and TernGrad are special cases of ATOMO and show that sparsifying gradients in their singular value decomposition (SVD), rather than the coordinate-wise one, can lead to significantly faster distributed training.
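
A sketch of the core unbiasedness trick for element-wise atoms; the sampling probabilities below are the simple budget-proportional choice, whereas ATOMO itself computes the probabilities that minimize variance under the sparsity budget:

```python
import numpy as np

def unbiased_sparsify(lam, budget, rng):
    """Keep atom i with probability p_i and rescale the survivors by 1/p_i,
    so the output is unbiased: E[out_i] = p_i * (lam_i / p_i) = lam_i."""
    lam = np.asarray(lam, dtype=float)
    p = np.minimum(1.0, budget * np.abs(lam) / np.abs(lam).sum())
    keep = rng.random(lam.size) < p
    out = np.zeros_like(lam)
    out[keep] = lam[keep] / p[keep]
    return out

g = np.array([3.0, -1.0, 0.5, 0.1])   # coefficients in some atomic decomposition
rng = np.random.default_rng(0)
samples = np.stack([unbiased_sparsify(g, budget=2, rng=rng) for _ in range(20000)])
print(samples.mean(axis=0))           # approximately recovers g
```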

Dynamic Network Model from Partial Observations

Can evolving networks be inferred and modeled without directly observing their nodes and edges? In many applications, the edges of a dynamic network might not be observed, but one can observe the dynamics of stochastic cascading processes (e.g., information diffusion, virus propagation) occurring over the unobserved network. While there have been efforts to infer networks based on such data, providing a generative probabilistic model that is able to identify the underlying time-varying network remains an open question. Here we consider the problem of inferring generative dynamic network models based on network cascade diffusion data. We propose a novel framework that provides a non-parametric dynamic network model, based on a mixture of coupled hierarchical Dirichlet processes, learned from data capturing cascade node infection times. Our approach allows us to infer the evolving community structure in networks and to obtain an explicit predictive distribution over the edges of the underlying network---including those that were not involved in transmission of any cascade, or are likely to appear in the future. We show the effectiveness of our approach using extensive experiments on synthetic as well as real-world networks.

Life-Long Disentangled Representation Learning with Cross-Domain Latent Homologies

Intelligent behaviour in the real world requires the ability to acquire new knowledge from an ongoing sequence of experiences while preserving and reusing past knowledge. We propose a novel algorithm for unsupervised representation learning from piecewise-stationary visual data: Variational Autoencoder with Shared Embeddings (VASE). Based on the Minimum Description Length principle, VASE automatically detects shifts in the data distribution and allocates spare representational capacity to new knowledge, while simultaneously protecting previously learnt representations from catastrophic forgetting. Our approach encourages the learnt representations to be disentangled, which imparts a number of desirable properties: VASE can deal sensibly with ambiguous inputs, it can enhance its own representations through imagination-based exploration, and most importantly, it exhibits semantically meaningful sharing of latents between different datasets. Compared to baselines with entangled representations, our approach is able to reason beyond surface-level statistics and perform semantically meaningful cross-domain inference.

Maximizing acquisition functions for Bayesian optimization

Bayesian optimization is a sample-efficient approach for global optimization and relies on acquisition functions to guide the search process. Maximizing these functions is inherently complicated, especially in the parallel setting, where acquisition functions are routinely non-convex, high-dimensional and intractable. We present two modern approaches for maximizing acquisition functions and show that 1) sample-path derivatives can be used to optimize acquisition functions and 2) parallel formulations of many acquisition functions are submodular and can therefore be efficiently maximized in greedy fashion with guaranteed near-optimality.

On Markov Chain Gradient Descent

Stochastic gradient methods are the workhorse algorithms of large-scale optimization problems in machine learning, signal processing, and other computational sciences and engineering. This paper studies Markov chain gradient descent, a variant of stochastic gradient descent where the random samples are taken on the trajectory of a Markov chain. Existing results of this method assume convex objectives and a reversible Markov chain and thus have their limitations. We establish new non-ergodic convergence under wider step sizes, for nonconvex problems, and for non-reversible finite-state Markov chains. Nonconvexity makes our method applicable to broader problem classes. Non-reversible finite-state Markov chains, on the other hand, can mix substantially faster. To obtain these results, we introduce a new technique that varies the mixing levels of the Markov chains. The reported numerical results validate our contributions.
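
A minimal sketch of the setting, assuming a finite-sum objective: the index of the sampled component walks along a Markov chain instead of being drawn i.i.d., with diminishing step sizes:

```python
import numpy as np

def mcgd(grad_fi, P, theta, n_steps, c=0.5):
    """Markov chain gradient descent: the sample index follows the Markov
    chain with transition matrix P (rows sum to 1), not an i.i.d. draw."""
    rng = np.random.default_rng(0)
    i = 0
    for t in range(1, n_steps + 1):
        theta = theta - (c / np.sqrt(t)) * grad_fi(i, theta)
        i = rng.choice(len(P), p=P[i])    # next index from the chain
    return theta

# toy usage: f(theta) = E_pi[(theta - b_i)^2] with samples from a 3-state chain
b = np.array([1.0, 2.0, 3.0])
P = np.array([[0.5, 0.5, 0.0], [0.25, 0.5, 0.25], [0.0, 0.5, 0.5]])
# stationary distribution of P is (0.25, 0.5, 0.25), so the minimizer is ~2.0
print(mcgd(lambda i, th: 2 * (th - b[i]), P, 0.0, 20000))
```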

Variance-Reduced Stochastic Gradient Descent on Streaming Data

We present an algorithm STRSAGA for efficiently maintaining a machine learning model over constantly arriving streaming data that can quickly update the model as new training data is observed. We present a competitive analysis comparing the suboptimality of the model maintained by STRSAGA with that of an offline algorithm that is given the entire data beforehand, and analyze the risk-competitiveness of STRSAGA under different arrival patterns. Our theoretical and experimental results show that the risk of STRSAGA is comparable to that of offline algorithms on a variety of input arrival patterns, and its experimental performance is significantly better than prior algorithms on streaming data, such as SSVRG.

Online Robust Policy Learning in the Presence of Unknown Adversaries

The growing prospect of deep reinforcement learning (DRL) being used in cyber-physical systems has raised concerns around safety and robustness of autonomous agents. Recent work on generating adversarial attacks has shown that it is computationally feasible for a bad actor to fool a DRL policy into behaving suboptimally. Although certain adversarial attacks with specific attack models have been addressed, most studies are only interested in offline optimization in the data space (e.g., example fitting, distillation). This paper introduces a Meta-Learned Advantage Hierarchy (MLAH) framework that is attack model-agnostic and more suited to reinforcement learning, via handling the attacks in the decision space (as opposed to data space) and directly mitigating learned bias introduced by the adversary. In MLAH, we learn separate sub-policies (nominal and adversarial) in an online manner, as guided by a supervisory master agent that detects the presence of the adversary by leveraging the advantage function for the sub-policies. We demonstrate that the proposed algorithm enables policy learning with significantly lower bias as compared to the state-of-the-art policy learning approaches even in the presence of heavy state information attacks. We present algorithm analysis and simulation results using popular OpenAI Gym environments.

Uplift Modeling from Separate Labels

Uplift modeling is aimed at estimating the incremental impact of an action on an individual's behavior, which is useful in various application domains such as targeted marketing (advertisement campaigns) and personalized medicine (medical treatments). Conventional methods of uplift modeling require every instance to be jointly equipped with two types of labels: the taken action and its outcome. However, obtaining two labels for each instance at the same time is difficult or expensive in many real-world problems. In this paper, we propose a novel method of uplift modeling that is applicable to a more practical setting where only one type of label is available for each instance. We provide a generalization error bound of the proposed method and demonstrate its effectiveness through experiments.

Learning Invariances using the Marginal Likelihood

In many supervised learning tasks, learning what changes do not affect the prediction target is as crucial to generalisation as learning what does. Data augmentation is a common way to enforce a model to exhibit an invariance: training data is modified according to an invariance designed by a human and added to the training data. We argue that invariances should be incorporated into the model structure, and learned using the marginal likelihood, which can correctly reward the reduced complexity of invariant models. We incorporate invariances in a Gaussian process, due to good marginal likelihood approximations being available for these models. Our main contribution is a derivation of a variational inference scheme for invariant Gaussian processes where the invariance is described by a probability distribution that can be sampled from, much like how data augmentation is implemented in practice.

Non-delusional Q-learning and Value-iteration

We identify a fundamental source of error in Q-learning and other forms of dynamic programming with function approximation. Delusional bias arises when the approximation architecture limits the class of expressible greedy policies. Since standard Q-updates make globally uncoordinated action choices with respect to any policy class, inconsistent or even conflicting Q-value estimates can result, leading to pathological behaviour such as over/under-estimation and even divergence. To solve this problem, we introduce a new notion of policy consistency and define a local backup process that ensures global consistency through the use of information sets—sets that record constraints on policies consistent with backed-up Q-values. We prove that model-based and model-free algorithms using this backup fully resolve delusional bias, yielding the first known algorithms that can guarantee optimal results under general conditions. These algorithms only require polynomially many information sets (from a potentially exponential support). Finally, we suggest other heuristics for value-iteration and Q-learning that attempt to reduce this bias.

Using Large Ensembles of Control Variates for Variational Inference

Variational inference is increasingly being addressed with stochastic optimization. While control variates are commonly used to reduce stochastic gradient variance, they are typically looked at in isolation. This paper clarifies the large number of control variates that are available by giving a systematic view of how they are derived. We give a Bayesian risk minimization framework in which the quality of a procedure for combining control variates is quantified by its effect on optimization convergence rates, which leads to a very simple combination rule. Results show that combining a large number of control variates this way significantly improves the speed and robustness of inference over using any single control variate in isolation.

Post: Device Placement with Cross-Entropy Minimization and Proximal Policy Optimization

Training deep neural networks requires an exorbitant amount of computation resources, including a heterogeneous mix of GPU and CPU devices. It is critical to place operations in a neural network on these devices in an optimal way, so that the training process can complete within the shortest amount of time. The state-of-the-art uses reinforcement learning to learn placement skills by repeatedly performing Monte-Carlo experiments. However, due to its equal treatment of placement samples, we argue that there remains ample room for significant improvements. In this paper, we propose a new joint learning algorithm, called Post, that integrates cross-entropy minimization and proximal policy optimization to achieve theoretically guaranteed optimal efficiency. In order to incorporate the cross-entropy method as a sampling technique, we propose to represent placements using discrete probability distributions, which allows us to estimate an optimal probability mass by maximum likelihood estimation, a powerful tool with the best possible efficiency. We have implemented Post in the Google Cloud platform, and our extensive experiments with several popular neural network benchmarks have demonstrated clear evidence of superior performance: with the same amount of learning time, it leads to placements that have training times up to 63.7% shorter than the state-of-the-art.

Learning to Reason with Third Order Tensor Products

We combine Recurrent Neural Networks with Tensor Product Representations to learn “near-symbolic,” interpretable, combinatorial representations of sequential data. The new architecture is trained end-to-end through gradient descent on a variety of natural language reasoning tasks, outperforming current state-of-the-art models in joint and single task settings. When training and test data exhibit systematic differences, it generalises more systematically than previous state-of-the-art methods.

Memory Augmented Policy Optimization for Program Synthesis and Semantic Parsing

This paper presents MAPO: a novel policy optimization formulation that incorporates a memory buffer of promising trajectories to reduce the variance of policy gradient estimates for deterministic environments with discrete actions. The formulation expresses the expected return objective as a weighted sum of two terms: an expectation over a memory of trajectories with high rewards, and a separate expectation over the trajectories outside the memory. We propose 3 techniques to make an efficient training algorithm for MAPO: (1) distributed sampling from inside and outside memory with an actor-learner architecture; (2) a marginal likelihood constraint over the memory to initiate training; (3) systematic exploration to discover new high reward trajectories. MAPO improves the sample efficiency and robustness of policy gradient, especially on tasks with a sparse reward. We evaluate MAPO on weakly supervised program synthesis from natural language / semantic parsing. On the WikiTableQuestions benchmark we improve the state-of-the-art by 2.5%, achieving an accuracy of 46.2%, and on the WikiSQL benchmark, MAPO achieves an accuracy of 74.9% with only weak supervision, outperforming the state-of-the-art with full supervision.

Persistence Fisher Kernel: A Riemannian Manifold Kernel for Persistence Diagrams

Algebraic topology methods have recently played an important role in statistical analysis with complicated geometric structured data such as shapes, linked twist maps, and material data. Among them, \textit{persistent homology} is a well-known tool to extract robust topological features, outputting them as \textit{persistence diagrams} (PDs). However, PDs are point multi-sets which cannot be used in machine learning algorithms for vector data. To deal with this, an emerging approach is to use kernel methods, and an appropriate geometry for PDs is an important factor in measuring the similarity of PDs. A popular geometry for PDs is the \textit{Wasserstein metric}. However, the Wasserstein distance is not \textit{negative definite}, so one is limited in building positive definite kernels upon the Wasserstein distance \textit{without approximation}. In this work, we rely upon the alternative \textit{Fisher information geometry} to propose a positive definite kernel for PDs \textit{without approximation}, namely the Persistence Fisher (PF) kernel. We then analyze the eigensystem of the integral operator induced by the proposed kernel for kernel machines. Based on that, we derive generalization error bounds via covering numbers and Rademacher averages for kernel machines with the PF kernel. Additionally, we show some nice properties of the proposed kernel, such as stability and infinite divisibility. Furthermore, we propose an approximation of our kernel with bounded error whose time complexity is linear in the number of points in the PDs. Throughout experiments with many different tasks on various benchmark datasets, we illustrate that the PF kernel compares favorably with other baseline kernels for PDs.

Neural Voice Cloning with a Few Samples

Voice cloning is a highly desired feature for personalized speech interfaces. We introduce a neural voice cloning system that learns to synthesize a person's voice from only a few audio samples. We study two approaches: speaker adaptation and speaker encoding. Speaker adaptation is based on fine-tuning a multi-speaker generative model. Speaker encoding is based on training a separate model to directly infer a new speaker embedding, which will be applied to a multi-speaker generative model. In terms of naturalness of the speech and similarity to the original speaker, both approaches can achieve good performance, even with only a few cloning audio samples. While speaker adaptation can achieve slightly better naturalness and similarity, the cloning time and required memory for the speaker encoding approach are significantly less, making it more favorable for low-resource deployment.

Blind Deconvolutional Phase Retrieval via Convex Programming

We consider the task of recovering two real or complex $m$-vectors from phaseless Fourier measurements of their circular convolution. Our method is a novel convex relaxation that is based on a lifted matrix recovery formulation that allows a nontrivial convex relaxation of the bilinear measurements from convolution. We prove that if the two signals belong to known random subspaces of dimensions $k$ and $n$, then they can be recovered up to the inherent scaling ambiguity with $m \gg (k+n) \log^2 m$ phaseless measurements. Our method provides the first theoretical recovery guarantee for this problem by a computationally efficient algorithm and does not require a solution estimate to be computed for initialization. Our proof is based on Rademacher complexity estimates. Additionally, we provide an ADMM implementation of the method and provide numerical experiments that verify the theory.

Scalable Laplacian K-modes

We advocate Laplacian K-modes for joint clustering and density mode finding, and propose a concave-convex relaxation of the problem, which yields a parallel algorithm that scales up to large datasets and high dimensions. We optimize a tight bound (auxiliary function) of our relaxation, which, at each iteration, amounts to computing an independent update for each cluster-assignment variable, with guaranteed convergence. Therefore, our bound optimizer can be trivially distributed for large-scale data sets. Furthermore, we show that the density modes can be obtained as byproducts of the assignment variables via simple maximum-value operations whose additional computational cost is linear in the number of data points. Our formulation does not need storing a full affinity matrix and computing its eigenvalue decomposition, neither does it perform expensive projection steps and Lagrangian-dual inner iterates for the simplex constraints of each point. Furthermore, unlike mean-shift, our density-mode estimation does not require inner-loop gradient-ascent iterates. It has a complexity independent of feature-space dimension, yields modes that are valid data points in the input set and is applicable to discrete domains as well as arbitrary kernels. We report comprehensive experiments over various data sets, which show that our algorithm yields very competitive performances in terms of optimization quality (i.e., the value of the discrete-variable objective at convergence) and clustering accuracy.

A Retrieve-and-Edit Framework for Predicting Structured Outputs

Generic sequence-to-sequence models have trouble generating outputs with highly-structured dependencies such as source code. Motivated by the observation that editing is easier than writing from scratch, we propose a general retrieve-and-edit paradigm that can leverage any base sequence-to-sequence model: given a test input, we first retrieve a training example and then edit the retrieved output into the final predicted output using the base model. The key challenge is to efficiently learn a retriever that is sensitive to the prediction task. We propose first learning a joint variational autoencoder over input-output pairs and then regressing a conditional retriever on the joint embeddings. On the Hearthstone cards benchmark, we show that applying the retrieve-and-edit paradigm to a vanilla sequence-to-sequence model results in BLEU scores approaching those of specialized AST-based code generation models. Additionally, we introduce a new autocomplete task on Python code from GitHub, on which we demonstrate the benefits of retrieve-and-edit.

Testing for Families of Distributions via the Fourier Transform

We study the general problem of testing whether an unknown discrete distribution belongs to a specified family of distributions. More specifically, given a distribution family P and sample access to an unknown discrete distribution D, we want to distinguish (with high probability) between the case that D is in P and the case that D is ε-far, in total variation distance, from every distribution in P. This is the prototypical hypothesis testing problem that has received significant attention in statistics and, more recently, in computer science. The main contribution of this work is a simple and general testing technique that is applicable to all distribution families whose Fourier spectrum satisfies a certain approximate sparsity property. We apply our Fourier-based framework to obtain near sample-optimal and computationally efficient testers for the following fundamental distribution families: Sums of Independent Integer Random Variables (SIIRVs), Poisson Multinomial Distributions (PMDs), and Discrete Log-Concave Distributions. For the first two, ours are the first non-trivial testers in the literature, vastly generalizing previous work on testing Poisson Binomial Distributions. For the third, our tester improves on prior work in both sample and time complexity.

Thwarting Adversarial Examples: An $L_0$-Robust Sparse Fourier Transform

We give a new algorithm for approximating the Discrete Fourier transform of an approximately sparse signal that is robust to worst-case $L_0$ corruptions, namely that some coordinates of the signal can be corrupted arbitrarily. Our techniques generalize to a wide range of linear transformations that are used in data analysis such as the Discrete Cosine and Sine transforms, the Hadamard transform, and their high-dimensional analogs. We use our algorithm to successfully defend against worst-case $L_0$ adversaries in the setting of image classification. We give experimental results on the Jacobian-based Saliency Map Attack (JSMA) and the CW $L_0$ attack on the MNIST and Fashion-MNIST datasets as well as the Adversarial Patch on the ImageNet dataset.

Blockwise Parallel Decoding for Deep Autoregressive Models

Deep autoregressive sequence-to-sequence models have demonstrated impressive performance across a wide variety of tasks in recent years. While several common architecture classes including recurrent, convolutional, and self-attention networks make different trade-offs between the amount of computation needed per layer and the length of the critical path at training time, inference for novel inputs still remains an inherently sequential process. We propose a novel blockwise parallel decoding scheme that takes advantage of the fact that some architectures can score sequences in sublinear time. By generating predictions for multiple time steps at once then backing off to the longest prefix validated by the scoring model, we can substantially improve the speed of greedy decoding without compromising performance. When tested on state-of-the-art self-attention models for machine translation and image super-resolution, our approach achieves iteration reductions of up to 2x over a baseline greedy decoder with no loss in quality. Relaxing the acceptance criterion and fine-tuning model parameters allows for reductions of up to 7x in exchange for a slight decrease in performance. Our fastest models achieve a 4x speedup in wall-clock time.
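
A toy sketch of the accept-longest-verified-prefix loop, with hypothetical `score_next` (the base model's greedy next token) and `propose_block` (the parallel k-token proposal); in the paper both roles can be played by one model with extra prediction heads, and verification is a single parallel call rather than a Python loop:

```python
def block_decode(score_next, propose_block, prefix, k, max_len):
    """Propose k future tokens at once, then keep the longest prefix of the
    block that greedy decoding with the base model would also have produced."""
    seq = list(prefix)
    while len(seq) < max_len:
        block = propose_block(seq, k)
        accepted = 0
        for j, tok in enumerate(block):
            if score_next(seq + block[:j]) == tok:   # base model agrees at step j
                accepted += 1
            else:
                break
        # guarantee progress: fall back to one greedy step if nothing verified
        seq.extend(block[:accepted] if accepted else [score_next(seq)])
    return seq[:max_len]

# toy check: both models "count up", so whole blocks are always accepted
score = lambda seq: seq[-1] + 1
propose = lambda seq, k: [seq[-1] + 1 + j for j in range(k)]
print(block_decode(score, propose, [0], k=4, max_len=10))  # [0, 1, ..., 9]
```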

Low-rank Tucker decomposition of large tensors using TensorSketch

We propose two randomized algorithms for low-rank Tucker decompositions of tensors. The algorithms, which incorporate sketching, only require a single pass of the input tensor and can handle tensors whose elements are streamed in any order. To the best of our knowledge, ours are the only algorithms which can do this. We test our algorithms on sparse synthetic data and compare them to multiple other methods. We also apply one of our algorithms to a real dense 38 GB tensor representing a video and use the resulting decomposition to correctly classify frames containing disturbances.

A Simple Cache Model for Image Recognition

Training large-scale image recognition models is computationally expensive. This raises the question of whether there might be simple ways to improve the test performance of an already trained model without having to re-train or even fine-tune it with new data. Here, we show that, surprisingly, this is indeed possible. The key observation we make is that the layers of a deep network close to the output layer contain independent, easily extractable class-relevant information that is not contained in the output layer itself. We propose to extract this extra class-relevant information using a simple key-value cache memory to improve the classification performance of the model at test time. Our cache memory is directly inspired by a similar cache model previously proposed for language modeling (Grave et al., 2017). This cache component does not require any training or fine-tuning; it can be applied to any pre-trained model and, by properly setting only two hyper-parameters, leads to significant improvements in its classification performance. Improvements are observed across several architectures and datasets. In the cache component, using features extracted from layers close to the output (but not from the output layer itself) as keys leads to the largest improvements. Concatenating features from multiple layers to form keys can further improve performance over using single-layer features as keys. The cache component also has a regularizing effect, a simple consequence of which is that it substantially increases the robustness of models against adversarial attacks.
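
A minimal sketch of the cache component, assuming `keys` holds features extracted from layers close to the output for the training images and `labels` their classes; `temp` (the similarity temperature) and `lam` (the mixing weight) stand in for the two hyper-parameters the abstract refers to:

```python
import numpy as np

def cache_distribution(query, keys, labels, n_classes, temp=0.1):
    """Similarity-weighted vote over cached training features
    (a continuous key-value cache in the spirit of Grave et al., 2017)."""
    sims = keys @ query / (np.linalg.norm(keys, axis=1) * np.linalg.norm(query) + 1e-12)
    w = np.exp(sims / temp)
    w /= w.sum()
    p = np.zeros(n_classes)
    np.add.at(p, labels, w)          # accumulate weight per class
    return p

def blended_prediction(p_model, p_cache, lam=0.5):
    """Mix the network's softmax with the cache distribution at test time."""
    return lam * p_cache + (1 - lam) * p_model

rng = np.random.default_rng(0)
keys = rng.normal(size=(100, 64)); labels = rng.integers(0, 10, 100)
q = rng.normal(size=64)
print(blended_prediction(np.full(10, 0.1), cache_distribution(q, keys, labels, 10)))
```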

Clebsch–Gordan Nets: a Fully Fourier Space Spherical Convolutional Neural Network

Recent work by Cohen et al. has achieved state-of-the-art results for learning spherical images in a rotation invariant way by using ideas from group representation theory and noncommutative harmonic analysis. In this paper we propose a generalization of this work that generally exhibits improved performance, but from an implementation point of view is actually simpler. An unusual feature of the proposed architecture is that it uses the Clebsch--Gordan transform as its only source of nonlinearity, thus avoiding repeated forward and backward Fourier transforms. The underlying ideas of the paper generalize to constructing neural networks that are invariant to the action of other compact groups.

Bayesian Nonparametric Spectral Estimation

Spectral estimation (SE) aims to identify how the energy of a signal (e.g., time series) is distributed across different frequencies. This is a challenging task when only partial and noisy observations are available, where current methods fail to find expressive representations of the data and handle uncertainty appropriately. In this context, we propose a joint probabilistic model for signals, observations and spectra, where SE is addressed as an inference problem. Assuming a Gaussian process prior over the signal, we apply Bayes' rule to find the analytic posterior distribution of the spectrum given a set of observations. Besides its expressiveness and natural ability to represent spectral uncertainty, the proposed model provides a functional-form estimate of the power spectral density which can be optimised efficiently. We include a comparison to previous methods for SE and validation on three experiments using synthetic and real-world data.

A Spectral View of Adversarially Robust Features

Given the apparent difficulty of learning models that are robust to adversarial perturbations, we propose tackling the simpler problem of developing adversarially robust features. Specifically, given a dataset and metric of interest, the goal is to return a function (or multiple functions) that 1) is robust to adversarial perturbations, and 2) has significant variation across the datapoints. We establish strong connections between adversarially robust features and a natural spectral property of the geometry of the dataset and metric of interest. This connection can be leveraged both to provide robust features, and to provide a lower bound on the robustness of any function that has significant variance across the dataset. Finally, we provide empirical evidence that the adversarially robust features yielded via this spectral approach can be fruitfully leveraged to learn a robust (and accurate) model.
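
As a rough illustration of the spectral flavor of such features (our own toy construction under stated assumptions, not the paper's exact one): a low eigenvector of a graph Laplacian built from the dataset and metric varies smoothly over the data graph, so nearby points get similar feature values while the feature still varies across the dataset:

```python
import numpy as np

def laplacian_feature(X, sigma=1.0):
    """Second-smallest eigenvector (Fiedler vector) of the Gaussian-affinity
    graph Laplacian: smooth over the data graph, yet non-constant."""
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-D2 / (2 * sigma ** 2))
    L = np.diag(W.sum(axis=1)) - W
    vals, vecs = np.linalg.eigh(L)
    return vecs[:, 1]

X = np.vstack([np.random.randn(30, 2) - 3, np.random.randn(30, 2) + 3])
f = laplacian_feature(X)
print(f[:30].mean(), f[30:].mean())   # roughly opposite signs across the clusters
```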

Synaptic Strength For Convolutional Neural Network

Convolutional Neural Networks (CNNs) are both computation and memory intensive, which hinders their deployment in many resource-constrained devices. Inspired by neuroscience research, we propose synaptic pruning: a data-driven method to prune connections between convolution layers with a newly proposed class of parameters called Synaptic Strength. Synaptic Strength is designed to capture the importance of a synapse based on the amount of information it transports. Experimental results show the effectiveness of our approach empirically. On CIFAR-10, we can prune various CNN models with up to 96% of connections removed, which results in significant size reduction and computation saving. Further evaluation on ImageNet demonstrates that synaptic pruning is able to discover efficient models which are competitive with state-of-the-art compact CNNs such as MobileNet-V2 and NasNet-Mobile. Our contribution is summarized as follows: (1) We introduce Synaptic Strength, a new class of parameters for convolution layers to indicate the importance of each connection. (2) Our approach can prune various CNN models with high compression without compromising accuracy. (3) Further investigation shows that the proposed Synaptic Strength is a better indicator for kernel pruning compared with previous approaches, both in empirical results and theoretical analysis.

Human-in-the-Loop Interpretability Prior

We often desire our models to be interpretable as well as accurate. Prior work on optimizing models for interpretability has relied on easy-to-quantify proxies for interpretability, such as sparsity or the number of operations required. In this work, we optimize for interpretability by directly including humans in the optimization loop. We develop an algorithm that minimizes the number of user studies to find models that are both predictive and interpretable and demonstrate our approach on several data sets. Our human subjects results show trends towards different proxy notions of interpretability on different datasets, which suggests that different proxies are preferred on different tasks.

Learning To Learn Around A Common Mean

The problem of learning-to-learn (LTL) or meta-learning is gaining increasing attention due to recent empirical evidence of its effectiveness in applications. Motivated by recent work on few-shot learning, in this paper we tackle the LTL problem by a novel approach, in which the training datasets received by the meta-algorithm are split into two subsets used to train and test the underlying algorithm, respectively. As the underlying algorithm we consider a form of Ridge Regression, in which the regularizer is the square distance to an unknown mean vector. We observe that, in this setting, the LTL problem can be reformulated as a Least Squares (LS) problem and we exploit a stochastic procedure to efficiently solve it. Under specific assumptions, we present a bound for the generalization error of our meta-algorithm. An implication of this bound is that our approach provides a consistent estimate of the transfer risk as the number of tasks grows, even if the sample size is kept constant. Preliminary experiments highlight the advantage offered by our approach.
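
The underlying algorithm has a simple closed form, sketched below: Ridge Regression whose regularizer is the squared distance to a mean vector h rather than to zero. Setting the gradient of ||Xw - y||^2 + lam*||w - h||^2 to zero gives (X^T X + lam*I) w = X^T y + lam*h:

```python
import numpy as np

def ridge_around_mean(X, y, h, lam):
    """Ridge regression shrinking toward h instead of toward the origin."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y + lam * h)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3)); w_true = np.array([1.0, 2.0, 3.0])
y = X @ w_true + 0.1 * rng.normal(size=50)
h = np.array([1.0, 2.0, 3.0])                 # a good common mean
print(ridge_around_mean(X, y, h, lam=1e6))    # large lam pulls w toward h
```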

Backpropagation with Callbacks: Towards Efficient and Expressive Differentiable Programming

Deep learning rests in crucial ways on gradient-descent optimization and end-to-end differentiation. Under the slogan of differentiable programming, there is an increasing demand for efficient automatic gradient computation for emerging network architectures that incorporate dynamic control flow. In this paper we take a fresh look at backpropagation, and propose an implementation using functions with callbacks, where the forward pass is executed as a sequence of function calls and the backward pass when the functions return. A key realization is that this technique of chaining callbacks is well known in the programming languages community under the name continuation-passing style (CPS), and any program can be converted to this form using standard techniques. Our approach achieves the same flexibility as other reverse-mode automatic differentiation (AD) techniques, but it can be implemented without any auxiliary data structures, and it can easily be combined with native code generation techniques, leading to a highly efficient implementation that combines the performance benefits of deep learning frameworks based on explicit reified computation graphs (e.g., TensorFlow) with the expressiveness of pure library approaches (e.g., PyTorch).
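
A tiny self-contained illustration of the idea (not the authors' implementation): each operator takes a callback that runs the rest of the forward pass, and gradient accumulation happens as the calls return, in reverse order, with no explicit tape:

```python
class Num:
    """A scalar that accumulates a gradient."""
    def __init__(self, val):
        self.val, self.grad = val, 0.0

def add(a, b, k):
    c = Num(a.val + b.val)
    k(c)                         # continuation: the rest of the forward pass
    a.grad += c.grad             # backward pass runs as the call returns
    b.grad += c.grad

def mul(a, b, k):
    c = Num(a.val * b.val)
    k(c)
    a.grad += b.val * c.grad
    b.grad += a.val * c.grad

# f(x) = x*x + x; f'(3) = 2*3 + 1 = 7
x = Num(3.0)
mul(x, x, lambda x2: add(x2, x, lambda y: setattr(y, 'grad', 1.0)))
print(x.grad)  # 7.0
```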

Learning with SGD and Random Features

Sketching and stochastic gradient methods are arguably the most common techniques to derive efficient large-scale learning algorithms. In this paper, we investigate their application in the context of nonparametric statistical learning. More precisely, we study the estimator defined by stochastic gradient with mini-batches and random features. The latter can be seen as a form of nonlinear sketching and used to define approximate kernel methods. The considered estimator is not explicitly penalized/constrained and regularization is implicit. Indeed, our study highlights how different parameters, such as the number of features, iterations, step-size and mini-batch size, control the learning properties of the solutions. We do this by deriving optimal finite sample bounds, under standard assumptions. The obtained results are corroborated and illustrated by numerical experiments.
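
A sketch of the estimator under simple assumptions (Gaussian random Fourier features, least-squares loss, decaying step size); note there is no explicit penalty anywhere: the number of features, iterations, step size, and mini-batch size are the only knobs:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, D = 500, 5, 200            # samples, input dim, number of random features
X = rng.normal(size=(n, d))
y = np.sin(X.sum(axis=1)) + 0.1 * rng.normal(size=n)

# random Fourier features approximating a Gaussian kernel (Rahimi & Recht)
W = rng.normal(size=(d, D)); b = rng.uniform(0, 2 * np.pi, D)
phi = lambda x: np.sqrt(2.0 / D) * np.cos(x @ W + b)

w = np.zeros(D)
batch = 10
for t in range(1, 2001):                    # mini-batch SGD; regularization is
    idx = rng.integers(0, n, batch)         # implicit in features/steps/iterations
    z = phi(X[idx])
    g = z.T @ (z @ w - y[idx]) / batch
    w -= 0.5 / np.sqrt(t) * g
print(np.mean((phi(X) @ w - y) ** 2))       # training MSE of the implicit estimator
```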

Total stochastic gradient algorithms and applications in reinforcement learning

Backpropagation and the chain rule of derivatives have been prominent; however, the total derivative rule has not enjoyed the same amount of attention. In this work we show how the total derivative rule leads to an intuitive visual framework for creating gradient estimators on graphical models. In particular, previous "policy gradient theorems" are easily derived. We derive new gradient estimators based on density estimation, as well as a likelihood ratio gradient, which "jumps" to an intermediate node, not directly to the objective function. We evaluate our methods on model-based policy gradient algorithms, achieve good performance, and present evidence towards demystifying the success of the popular PILCO algorithm.

Glow: Generative Flow with Invertible 1x1 Convolutions

Flow-based generative models are conceptually attractive due to tractability of the exact log-likelihood, tractability of exact latent-variable inference, and parallelizability of both training and synthesis. In this paper we propose Glow, a simple type of generative flow using invertible 1x1 convolution. Using our method we demonstrate a significant improvement in log-likelihood and qualitative sample quality. Perhaps most strikingly, we demonstrate that a generative model optimized towards the plain log-likelihood objective is capable of efficient synthesis of large and subjectively realistic-looking images.
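
The key component in a few lines: a learned c-by-c matrix applied at every spatial position is an invertible 1x1 convolution, and its contribution to the log-likelihood is h*w*log|det W|. A numpy sketch follows (the paper additionally parameterizes W via an LU decomposition to make the determinant cheap):

```python
import numpy as np

def invertible_1x1(x, W):
    """Apply the same c x c matrix W at every spatial position of an
    (h, w, c) tensor; return the output and the flow log-determinant."""
    h, w, c = x.shape
    y = x.reshape(-1, c) @ W.T
    logdet = h * w * np.log(np.abs(np.linalg.det(W)))
    return y.reshape(h, w, c), logdet

def inverse_1x1(y, W):
    h, w, c = y.shape
    return (y.reshape(-1, c) @ np.linalg.inv(W).T).reshape(h, w, c)

rng = np.random.default_rng(0)
W = np.linalg.qr(rng.normal(size=(4, 4)))[0]   # init as a rotation: |det W| = 1
x = rng.normal(size=(8, 8, 4))
y, logdet = invertible_1x1(x, W)
print(np.allclose(inverse_1x1(y, W), x), logdet)   # exactly invertible, logdet ~ 0
```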

Generalizing Point Embeddings using the Wasserstein Space of Elliptical Distributions

Embedding complex objects as vectors in low dimensional spaces is a longstanding problem in machine learning. We propose in this work an extension of that approach, which consists in embedding objects as elliptical probability distributions, namely distributions whose densities have elliptical level sets. We endow these measures with the 2-Wasserstein metric, with two important benefits: \emph{(i)} For such measures, the squared 2-Wasserstein metric has a closed form, equal to the sum of the squared Euclidean distance between means and the squared Bures metric between covariance matrices. The latter is a Riemannian metric between positive semi- definite matrices, which turns out to be Euclidean on a suitable factor representation of such matrices, which is valid on the entire geodesic between these matrices. \emph{(ii)} The 2-Wasserstein distance boils down to the usual Euclidean metric when comparing Diracs, and therefore provides the natural framework to extend point embeddings. We show that for these reasons Wasserstein elliptical embeddings are more intuitive and yield tools that are better behaved numerically than the alternative choice of Gaussian embeddings with the Kullback-Leibler divergence. In particular, and unlike previous work based on the KL geometry, we learn elliptical distributions that are not necessarily diagonal. We demonstrate the interest of elliptical embeddings by using them for visualization, to compute embeddings of words, and to reflect entanglement or hypernymy.
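
The closed form in point \emph{(i)} is straightforward to compute; a sketch using scipy's matrix square root:

```python
import numpy as np
from scipy.linalg import sqrtm

def w2_squared_elliptical(m1, C1, m2, C2):
    """Squared 2-Wasserstein distance between two elliptical measures:
    ||m1 - m2||^2 plus the squared Bures metric between covariances."""
    s1 = sqrtm(C1)
    bures2 = np.trace(C1 + C2 - 2 * sqrtm(s1 @ C2 @ s1))
    return float(np.sum((m1 - m2) ** 2) + np.real(bures2))

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3)); B = rng.normal(size=(3, 3))
C1 = A @ A.T + np.eye(3); C2 = B @ B.T + np.eye(3)
print(w2_squared_elliptical(np.zeros(3), C1, np.ones(3), C2))
print(w2_squared_elliptical(np.zeros(3), C1, np.zeros(3), C1))  # ~0 for identical measures
```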

Learning to Share and Hide Intentions using Information Regularization

Learning to cooperate with friends and compete with foes is a key component of multi-agent reinforcement learning. Typically to do so, one requires access to either a model of or interaction with the other agent(s). Here we show how to learn effective strategies for cooperation and competition in an asymmetric information game with no such model or interaction. Our approach is to encourage an agent to reveal or hide their intentions using an information- theoretic regularizer. We consider both the mutual information between goal and action given state, as well as the mutual information between goal and state. We show how to stochastically optimize these regularizers in a way that is easy to integrate with policy gradient reinforcement learning. Finally, we demonstrate that cooperative (competitive) policies learned with our approach lead to more (less) reward for a second agent in two simple asymmetric information games.

Predictive Approximate Bayesian Computation via Saddle Points

Approximate Bayesian Computation (ABC) has been an important methodology for Bayesian inference when the likelihood function is intractable. Traditional sampling-based ABC algorithms such as ABC rejection and K2-ABC are inefficient performance-wise, while regression-based algorithms such as K-ABC and DR-ABC are hard to scale. In this paper, we introduce an optimization-based framework for ABC that addresses these deficiencies. Leveraging a generative model for posterior and joint distribution matching, we show that ABC can be framed as saddle point problems, whose objectives can be accessed directly with samples. We present \emph{the predictive ABC algorithm (P-ABC)}, and provide a PAC bound guaranteeing its learning consistency. Numerical experiments show that the proposed P-ABC outperforms both K2-ABC and DR-ABC by large margins.

Learning conditional GAN using noisy labels

We study the problem of learning conditional generators from noisy samples, where the labels are corrupted by random noise. Naively training a standard conditional GAN not only produces samples with wrong labels, but also generates poor quality samples. We consider two scenarios, depending on whether the noise model is known or not. When the distribution of the noise is known, we provide a novel, theoretically sound, and practical Robust Conditional GAN (RCGAN) architecture. We give theoretical justification of our architectural choices. We provide a sharp characterization of how performance depends on the noise statistics, and provide sample complexity of the loss in neural network distances under standard assumptions on the discriminator class. When the distribution of the noise is not known, we provide an extension of our architecture, RCGAN-U. We show experimentally that there is almost no loss in not knowing the noise statistics; RCGAN-U consistently improves upon baseline approaches, while closely matching that of RCGAN.

Robust Learning of Fixed-Structure Bayesian Networks

We investigate the problem of learning Bayesian networks in a robust model where an $\epsilon$-fraction of the samples are adversarially corrupted. In this work, we study the fully observable discrete case where the structure of the network is given. Even in this basic setting, previous learning algorithms either run in exponential time or lose dimension-dependent factors in their error guarantees. We provide the first computationally efficient robust learning algorithm for this problem with dimension-independent error guarantees. Our algorithm has near-optimal sample complexity, runs in polynomial time, and achieves error that scales nearly-linearly with the fraction of adversarially corrupted samples. Finally, we show on both synthetic and semi-synthetic data that our algorithm performs well in practice.

Improving Simple Models with Confidence Profiles

In this paper, we propose a new method called ProfWeight for transferring information from a pre-trained deep neural network that has a high test accuracy to a simpler interpretable model or a very shallow network of low complexity and a priori low test accuracy. We are motivated by applications in interpretability and model deployment in severely memory-constrained environments (like sensors). Our method uses linear probes to generate confidence scores through flattened intermediate representations. Our transfer method involves a theoretically justified weighting of samples during the training of the simple model using confidence scores of these intermediate layers. The value of our method is first demonstrated on CIFAR-10, where our weighting method significantly improves (by 3-4%) networks with only a fraction of the number of Resnet blocks of a complex Resnet model. We further demonstrate operationally significant results on a real manufacturing problem, where we dramatically increase the test accuracy of a CART model (the domain standard) by roughly 13%.

PCA of high dimensional stochastic processes

One technique to visualize the training of neural networks is to perform PCA on the parameters over the course of training and plot the subspace spanned by the first few PCA components. In this paper we compare this technique to the PCA of a high dimensional random walk. We prove that in the limit of infinite dimensions most of the variance is in the first few PCA components, and that the projection of the trajectory onto any subspace spanned by PCA components is a Lissajous curve. We generalize these results to Ornstein-Uhlenbeck processes (i.e., a random walk in a quadratic potential) and show that in high dimensions the walk is not mean reverting, but will instead be trapped at a fixed distance from the minimum. We finally compare the distribution of PCA variances and the PCA projected training trajectories of a linear model trained on CIFAR-10 and ResNet-50-v2 trained on ImageNet. We show that early in training the distribution of PCA variances is steeper than a random walk, implying that the training is highly directed, whereas late in training the distribution is flatter than a random walk and closer to a converged Ornstein-Uhlenbeck process.
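
The random-walk baseline is easy to reproduce; a sketch showing that for a high-dimensional walk most of the variance lands in the leading PCA components, with the projections tracing Lissajous-like curves:

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 2000, 500                               # dimensions, time steps
walk = np.cumsum(rng.normal(size=(T, d)), axis=0)

walk -= walk.mean(axis=0)                      # center over time
U, S, _ = np.linalg.svd(walk, full_matrices=False)
explained = S ** 2 / (S ** 2).sum()
print(explained[:5])                           # first few components dominate
# plotting U[:, 0] against U[:, 1] over time gives a Lissajous-like curve
```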

Learning to solve SMT formulas

We present a new approach for learning to solve SMT formulas. We phrase the challenge of solving SMT formulas as a tree search problem where at each step a transformation is applied to the input formula until the formula is solved. Our approach works in two steps: first, we learn a model (e.g., based on imitation learning) and then synthesize a strategy in the form of a loop-free program with branches. This strategy is an interpretable representation of the model's decisions and can be directly passed and used to guide the SMT solver, without requiring any modification to the solver itself. We show that our technique is practically effective: it solves 20.0% more formulas over a number of benchmarks and achieves up to 1000x runtime improvement over a state-of-the-art SMT solver.

Lifted Weighted Mini-Bucket

Many real-world problems, such as Markov Logic Networks (MLNs) with evidence, possess highly symmetric sub-structures but no exact symmetries. Efficiently exploiting the symmetric substructure of these problems to perform approximate inference is a challenge for which few principled methods exist. In this paper, we present a lifted variant of the Weighted Mini-Bucket elimination algorithm which provides a principled way to 1) exploit the highly symmetric substructure of MLN models, and 2) incorporate high-order inference terms, which are necessary for high quality approximate inference. Our method maintains a concrete connection to the ground problem and has significant control over the accuracy-time trade-off of the approximation. Experimental results demonstrate good anytime performance and the utility of this class of approximations, especially in models with strong repulsive potentials.

Using Quantum Graphical Models to Perform Inference in Hilbert Space

Quantum Graphical Models (QGMs) generalize classical graphical models by adopting the formalism for reasoning about uncertainty from quantum mechanics. Unlike classical graphical models, QGMs represent uncertainty with density matrices in complex Hilbert spaces. Hilbert space embeddings (HSEs) also generalize Bayesian inference in Hilbert spaces. We investigate the link between QGMs and HSEs and show that the sum rule and Bayes rule for QGMs are equivalent to the kernel sum rule in HSEs and a special case of Nadaraya-Watson kernel regression, respectively. We show that these operations can be kernelized, and use these insights to propose a Hilbert Space Embedding of Hidden Quantum Markov Models (HSE-HQMM) to model dynamics. We present experimental results showing that HSE-HQMMs can outperform state-of-the-art models like LSTMs and PSRNNs on several datasets, while also providing a nonparametric method for maintaining a probability distribution over continuous-valued features.

Unsupervised Image-to-Image Translation Using Domain-Specific Variational Information Bound

Unsupervised image-to-image translation is a class of computer vision problems which aims at modeling the conditional distribution of images in the target domain, given a set of unpaired images in the source and target domains. An image in the source domain might have multiple representations in the target domain. Therefore, ambiguity arises in modeling the conditional distribution, especially when the images in the source and target domains come from different modalities. Current approaches mostly rely on simplifying assumptions to map both domains into a shared latent space. Consequently, they are only able to model the domain-invariant information between the two modalities. These approaches cannot model domain-specific information which has no representation in the target domain. In this work, we propose an unsupervised image-to-image translation framework which maximizes a domain-specific variational information bound and learns the target domain-invariant representation of the two domains. The proposed framework makes it possible to map a single source image into multiple images in the target domain, utilizing several target domain-specific codes sampled randomly from the prior distribution, or extracted from reference images.

Adversarial Risk and Robustness for Discrete Distributions

We study adversarial perturbations when the instances are uniformly distributed over {0,1}^n. We study both "inherent" bounds that apply to any problem and any classifier for it, as well as bounds that apply to specific hypothesis classes. As the current literature contains multiple definitions of adversarial risk and robustness, we start by giving a taxonomy for these definitions based on their direct goals. Using two classical algorithms for learning monotone conjunctions, the Find-S and the Swapping Algorithm, we compare the implied bounds of the different definitions by attacking the hypotheses using points that are drawn from the uniform distribution over {0,1}^n. We observe that for the Find-S algorithm these definitions lead to significantly different bounds. Thus, our results advocate for the use of the definition that relies on the error region and guarantees misclassification of adversarial examples, even though other definitions, in other contexts with context-dependent assumptions, may coincide with the main definition. Using the main definition of adversarial examples based on the error region, we then study inherent bounds on the risk and robustness of any classifier for any classification problem whose instances are uniformly distributed over {0,1}^n. Using the isoperimetric inequality for the Boolean hypercube, we show that for initial error 0.01, there always exists an adversarial perturbation that changes √n bits of the instances to increase the risk to 0.5, making the classifier's decisions meaningless. Furthermore, using the central limit theorem we show that when n→∞, at most c√n bits of perturbations for a universal constant c<1.17 suffice for increasing the risk to 0.5, and the same c√n bits of perturbations on average suffice to increase the risk to 1, hence bounding the robustness by c√n.

Gaussian Process Prior Variational Autoencoders

Variational autoencoders (VAE) are a powerful and widely-used class of models to learn complex data distributions in an unsupervised fashion. One substantial limitation of VAEs is the prior assumption that latent sample representations are independent. However, for many important applications, such as time-series of images, this assumption is too strong. Correlations, such as those in time, need to be accounted for to achieve correct model specification, and hence optimal results. Herein, we introduce a new model, the Gaussian Process (GP) Prior Variational Autoencoder (GPVAE), to specifically address this issue. The GPVAE aims to combine the power of VAEs with the ability to model correlations afforded by GP priors. To achieve efficient inference in this new class of models, we leverage structure in the covariance matrix, and also introduce a new stochastic backpropagation strategy that enables full batch gradient descent (GD) in a distributed manner. In two image-based applications, we show that our method outperforms conditional VAEs (CVAEs) and an adaptation of standard VAEs.

3D Steerable CNNs: Learning Rotationally Equivariant Features in Volumetric Data

We present a convolutional network that is equivariant to rigid body motions. The model uses scalar-, vector-, and tensor fields ove
