@tigerneil
Created April 3, 2016 10:39
Reinforcement-learning-related submissions at ICLR 2016

Prioritized Experience Replay

Experience replay lets online reinforcement learning agents remember and reuse experiences from the past. In prior work, experience transitions were uniformly sampled from a replay memory. However, this approach simply replays transitions at the same frequency that they were originally experienced, regardless of their significance. In this paper we develop a framework for prioritizing experience, so as to replay important transitions more frequently, and therefore learn more efficiently. We use prioritized experience replay in Deep Q-Networks (DQN), a reinforcement learning algorithm that achieved human-level performance across many Atari games. DQN with prioritized experience replay achieves a new state-of-the-art, outperforming DQN with uniform replay on 42 out of 57 games.

Authors: Tom Schaul schaul@gmail.com, John Quan johnquan@google.com, Ioannis Antonoglou ioannisa@google.com, David Silver davidsilver@google.com
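
As a concrete illustration of the idea (not the paper's sum-tree implementation), here is a minimal NumPy sketch of proportional prioritized sampling with importance-sampling correction; the class name, the alpha/beta defaults, and the buffer layout are assumptions for the example.

```python
import numpy as np

class PrioritizedReplay:
    """Minimal proportional prioritized replay (sketch, not the paper's sum-tree version)."""
    def __init__(self, capacity, alpha=0.6):
        self.capacity = capacity
        self.alpha = alpha          # how strongly priorities shape the sampling distribution
        self.buffer = []            # stored transitions
        self.priorities = []        # one priority per transition, e.g. |TD error| + eps

    def add(self, transition, priority=1.0):
        if len(self.buffer) >= self.capacity:
            self.buffer.pop(0)
            self.priorities.pop(0)
        self.buffer.append(transition)
        self.priorities.append(priority)

    def sample(self, batch_size, beta=0.4):
        p = np.array(self.priorities) ** self.alpha
        p /= p.sum()
        idx = np.random.choice(len(self.buffer), batch_size, p=p)
        # importance-sampling weights correct the bias introduced by non-uniform sampling
        weights = (len(self.buffer) * p[idx]) ** (-beta)
        weights /= weights.max()
        return idx, [self.buffer[i] for i in idx], weights

    def update_priorities(self, idx, td_errors, eps=1e-6):
        for i, e in zip(idx, td_errors):
            self.priorities[i] = abs(e) + eps
```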

Recurrent Reinforcement Learning: A Hybrid Approach

Successful applications of reinforcement learning in real-world problems often require dealing with partially observable states. It is in general very challenging to construct and infer hidden states as they often depend on the agent’s entire interaction history and may require substantial domain knowledge. In this work, we investigate a deep-learning approach to learning the representation of states in partially observable tasks, with minimal prior knowledge of the domain. In particular, we propose a new family of hybrid models that combines the strength of both supervised learning (SL) and reinforcement learning (RL), trained in a joint fashion: The SL component can be a recurrent neural network (RNN) or its long short-term memory (LSTM) version, which is equipped with the desired property of being able to capture long-term dependencies in the history, thus providing an effective way of learning the representation of hidden states. The RL component is a deep Q-network (DQN) that learns to optimize the control for maximizing long-term rewards. Extensive experiments in a direct mailing campaign problem demonstrate the effectiveness and advantages of the proposed approach, which performs the best among a set of previous state-of-the-art methods.

Authors: Xiujun Li lixiujun@cs.wisc.edu, Lihong Li lihongli.cs@gmail.com, Jianfeng Gao jfgao@microsoft.com, Xiaodong He xiaohe@microsoft.com, Jianshu Chen jianshuc@microsoft.com, Li Deng deng@microsoft.com, Ji He jvking@uw.edu
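
The hybrid can be pictured as a recurrent state encoder feeding a Q-value head. The toy sketch below uses a vanilla RNN cell and a linear Q-head in NumPy purely to show the data flow; the paper's models are jointly trained RNN/LSTM networks, and all dimensions and weights here are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
obs_dim, hid_dim, n_actions = 8, 16, 4

# Hypothetical parameters: a vanilla RNN cell as the SL component, a linear Q-head as the RL component.
W_xh = rng.normal(scale=0.1, size=(hid_dim, obs_dim))
W_hh = rng.normal(scale=0.1, size=(hid_dim, hid_dim))
W_hq = rng.normal(scale=0.1, size=(n_actions, hid_dim))

def q_values(observation_history):
    """Summarize the (partially observable) history into a hidden state, then score actions."""
    h = np.zeros(hid_dim)
    for o in observation_history:
        h = np.tanh(W_xh @ o + W_hh @ h)   # recurrent state carries information about the history
    return W_hq @ h                         # Q-value per action from the learned state representation

history = [rng.normal(size=obs_dim) for _ in range(5)]
print(q_values(history))
```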

Deep Reinforcement Learning with an Unbounded Action Space

In this paper, we propose the deep reinforcement relevance network (DRRN), a novel deep architecture, for handling an unbounded action space with applications to language understanding for text-based games. For a particular class of games, a user must choose among a variable number of actions described by text, with the goal of maximizing long-term reward. In these games, the best action is typically the one that best fits the current situation (modeled as a state in the DRRN), also described by text. Because of the exponential complexity of natural language with respect to sentence length, there is typically an unbounded set of unique actions. Therefore, it is difficult to pre-define the action set as in the deep Q-network (DQN). To address this challenge, the DRRN extracts high-level embedding vectors from the texts that describe states and actions, respectively, using the inner products between the state and action embedding vectors to approximate the Q-function. We evaluate the DRRN on two popular text games, showing superior performance over the DQN.

Authors: Ji He jvking@uw.edu, Jianshu Chen jianshuc@microsoft.com, Xiaodong He xiaohe@microsoft.com, Jianfeng Gao jfgao@microsoft.com, Li Deng deng@microsoft.com, Mari Ostendorf ostendor@uw.edu, Lihong Li lihongli.cs@gmail.com
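
The core of the DRRN is scoring a variable-size set of text actions against the current state via an inner product of their embeddings. A minimal sketch, assuming stand-in linear encoders in place of the paper's learned text networks:

```python
import numpy as np

rng = np.random.default_rng(0)
text_dim, embed_dim = 32, 16

# Hypothetical embedding matrices; the paper learns separate state and action networks.
W_state = rng.normal(scale=0.1, size=(embed_dim, text_dim))
W_action = rng.normal(scale=0.1, size=(embed_dim, text_dim))

def q_value(state_text_vec, action_text_vec):
    """Q(s, a) approximated by the inner product of state and action embeddings."""
    s = np.tanh(W_state @ state_text_vec)
    a = np.tanh(W_action @ action_text_vec)
    return float(s @ a)

state = rng.normal(size=text_dim)                         # stand-in for an encoded state description
actions = [rng.normal(size=text_dim) for _ in range(3)]   # a variable-length set of text actions
best = max(range(len(actions)), key=lambda i: q_value(state, actions[i]))
print("chosen action:", best)
```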

Deep Reinforcement Learning in Parameterized Action Space

Recent work has shown that deep neural networks are capable of approximating both value functions and policies in reinforcement learning domains featuring continuous state and action spaces. However, to the best of our knowledge no previous work has succeeded at using deep neural networks in structured (parameterized) continuous action spaces. To fill this gap, this paper focuses on learning within the domain of simulated RoboCup soccer, which features a small set of discrete action types, each of which is parameterized with continuous variables. The best learned agent can score goals more reliably than the 2012 RoboCup champion agent. As such, this paper represents a successful extension of deep reinforcement learning to the class of parameterized action space MDPs.

Authors: Matthew Hausknecht mhauskn@cs.utexas.edu, Peter Stone pstone@cs.utexas.edu
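
A parameterized action is a discrete type plus that type's continuous arguments. The sketch below shows one way to represent such an actor's output; the action types, parameter counts, and linear heads are illustrative, not the paper's RoboCup setup.

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim = 10
# Hypothetical soccer-style action types, each with its own continuous parameters.
action_params = {"dash": 2, "turn": 1, "kick": 2}   # e.g. (power, direction), (direction), (power, direction)

W_type = rng.normal(scale=0.1, size=(len(action_params), state_dim))
W_param = {k: rng.normal(scale=0.1, size=(d, state_dim)) for k, d in action_params.items()}

def act(state):
    """Actor output: a discrete action type plus its continuous parameters."""
    type_scores = W_type @ state
    chosen = list(action_params)[int(np.argmax(type_scores))]
    params = np.tanh(W_param[chosen] @ state)        # bounded continuous parameters for that type
    return chosen, params

print(act(rng.normal(size=state_dim)))
```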

Conditional Computation in Neural Networks for Faster Models

Deep learning has become the state-of-the-art tool in many applications, but the evaluation and training of deep models can be time-consuming and computationally expensive. Dropout has been shown to be an effective strategy to sparsify computations (by not involving all units), as well as to regularize models. In typical dropout, nodes are dropped uniformly at random. Our goal is to use reinforcement learning in order to design better, more informed dropout policies, which are data-dependent. We cast the problem of learning activation-dependent dropout policies for blocks of units as a reinforcement learning problem. We propose a learning scheme motivated by computation speed, capturing the idea of wanting to have parsimonious activations while maintaining prediction accuracy. We apply a policy gradient algorithm for learning policies that optimize this loss function and propose a regularization mechanism that encourages diversification of the dropout policy. We present encouraging empirical results showing that this approach improves the speed of computation without impacting the quality of the approximation.

Authors: Emmanuel Bengio bengioe@gmail.com, Pierre-Luc Bacon pbacon@cs.mcgill.ca, Doina Precup dprecup@cs.mcgill.ca, Joelle Pineau jpineau@cs.mcgill.ca
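
One way to read the proposed scheme is as a policy-gradient update on input-dependent keep probabilities for blocks of units, with a reward that trades prediction quality against the number of active blocks. The sketch below is a toy version under that reading; the policy parameterization and the sparsity penalty are assumptions, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
in_dim, n_blocks = 8, 4

# Hypothetical input-dependent dropout policy over blocks of units.
W_policy = np.zeros((n_blocks, in_dim))

def block_keep_probs(x):
    """Sigmoid policy: probability of keeping (computing) each block, given the input."""
    return 1.0 / (1.0 + np.exp(-(W_policy @ x)))

def reinforce_step(x, task_reward, lr=0.01, sparsity_cost=0.1):
    """One REINFORCE update: reward accurate predictions, penalize the number of active blocks."""
    p = block_keep_probs(x)
    mask = (rng.random(n_blocks) < p).astype(float)        # sample which blocks to compute
    reward = task_reward - sparsity_cost * mask.sum()      # accuracy vs. computation trade-off
    grad_logp = (mask - p)[:, None] * x[None, :]           # d log pi(mask|x) / d W for a Bernoulli policy
    W_policy[:] += lr * reward * grad_logp                 # in-place update of the policy weights
    return mask, reward

x = rng.normal(size=in_dim)
print(reinforce_step(x, task_reward=1.0))
```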

Continuous control with deep reinforcement learning

We adapt the ideas underlying the success of Deep Q-Learning to the continuous action domain. We present an actor-critic, model-free algorithm based on the deterministic policy gradient that can operate over continuous action spaces. Using the same learning algorithm, network architecture and hyper-parameters, our algorithm robustly solves more than 20 simulated physics tasks, including classic problems such as cartpole swing-up, dexterous manipulation, legged locomotion and car driving. Our algorithm is able to find policies whose performance is competitive with those found by a planning algorithm with full access to the dynamics of the domain and its derivatives. We further demonstrate that for many of the tasks the algorithm can learn policies "end-to-end": directly from raw pixel inputs.

Authors: Timothy Lillicrap countzero@google.com, Jonathan Hunt jjhunt@google.com, Alexander Pritzel apritzel@google.com, Nicolas Heess heess@google.com, Tom Erez etom@google.com, Yuval Tassa tassa@google.com, David Silver davidsilver@google.com, Daan Wierstra wierstra@google.com
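
The heart of the algorithm is the critic target y = r + gamma * Q'(s', mu'(s')), computed with slowly-updated target networks. A minimal sketch with linear stand-ins for the paper's deep actor and critic (all parameters and the soft-update rate are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
s_dim, a_dim, gamma, tau = 6, 2, 0.99, 0.005

# Hypothetical linear actor/critic parameters standing in for deep networks.
actor = rng.normal(scale=0.1, size=(a_dim, s_dim))
critic = rng.normal(scale=0.1, size=(s_dim + a_dim,))
actor_t, critic_t = actor.copy(), critic.copy()            # slowly-updated target networks

def mu(w, s):   return np.tanh(w @ s)                       # deterministic policy
def q(w, s, a): return float(w @ np.concatenate([s, a]))    # critic Q(s, a)

def ddpg_targets(batch):
    """Critic regression targets: y = r + gamma * Q_target(s', mu_target(s'))."""
    return [r + gamma * q(critic_t, s2, mu(actor_t, s2)) for (s, a, r, s2) in batch]

def soft_update():
    """Polyak-average the target networks toward the online networks."""
    global actor_t, critic_t
    actor_t = tau * actor + (1 - tau) * actor_t
    critic_t = tau * critic + (1 - tau) * critic_t

batch = [(rng.normal(size=s_dim), rng.normal(size=a_dim), 1.0, rng.normal(size=s_dim))]
print(ddpg_targets(batch))
soft_update()
```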

Policy Distillation

Policies for complex visual tasks have been successfully learned with deep reinforcement learning, using an approach called deep Q-networks (DQN), but relatively large (task-specific) networks and extensive training are needed to achieve good performance. In this work, we present a novel method called policy distillation that can be used to extract the policy of a reinforcement learning agent and train a new network that performs at the expert level while being dramatically smaller and more efficient. Furthermore, the same method can be used to consolidate multiple task-specific policies into a single policy. We demonstrate these claims using the Atari domain and show that the multi-task distilled agent outperforms the single-task teachers as well as a jointly-trained DQN agent.

Authors: Andrei Rusu andreirusu@google.com, Sergio Gomez sergomez@google.com, Caglar Gulcehre ca9lar@gmail.com, Guillaume Desjardins gdesjardins@google.com, James Kirkpatrick kirkpatrick@google.com, Razvan Pascanu razp@google.com, Volodymyr Mnih vmnih@google.com, Koray Kavukcuoglu korayk@google.com, Raia Hadsell raia@google.com
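
The distillation step can be summarized as matching the student's action distribution to a softened version of the teacher's Q-values. A sketch of the KL-based variant, with an illustrative temperature (the paper compares several losses and settings):

```python
import numpy as np

def softmax(z, temperature=1.0):
    z = np.asarray(z) / temperature
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(teacher_q, student_logits, temperature=0.01):
    """KL(teacher || student) over action distributions; a low temperature sharpens the
    teacher's Q-values into a near-deterministic target (settings here are illustrative)."""
    p = softmax(teacher_q, temperature)    # teacher target distribution from its Q-values
    q = softmax(student_logits)            # student policy
    return float(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12))))

print(distillation_loss([1.0, 2.0, 0.5], [0.2, 1.5, 0.1]))
```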

High-Dimensional Continuous Control Using Generalized Advantage Estimation

Policy gradient methods are an appealing approach in reinforcement learning because they directly optimize the cumulative reward and can straightforwardly be used with nonlinear function approximators such as neural networks. The two main challenges are the large number of samples typically required, and the difficulty of obtaining stable and steady improvement despite the nonstationarity of the incoming data. We address the first challenge by using value functions to substantially reduce the variance of policy gradient estimates at the cost of some bias, with an exponentially-weighted estimator of the advantage function that is analogous to TD(λ). We address the second challenge by using a trust region optimization procedure for both the policy and the value function, which are represented by neural networks. Our approach yields strong empirical results on highly challenging 3D locomotion tasks, learning running gaits for bipedal and quadrupedal simulated robots, and learning a policy for getting the biped off the ground. In contrast to a body of prior work that uses hand-crafted policy representations, our neural network policies map directly from raw kinematics to joint torques. Our algorithm is fully model-free, and the amount of simulated experience required for the learning tasks on 3D bipeds corresponds to 1-2 weeks of real time.

Authors: John Schulman john.d.schulman@gmail.com, Philipp Moritz pcmoritz@eecs.berkeley.edu, Sergey Levine svlevine@eecs.berkeley.edu, Michael Jordan jordan@cs.berkeley.edu, Pieter Abbeel pabbeel@cs.berkeley.edu
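
The advantage estimator itself is compact: discounted sums of TD residuals weighted by (gamma * lambda)^l. A minimal sketch, assuming a trajectory of rewards and value estimates is already available:

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Exponentially-weighted advantage estimator: A_t = sum_l (gamma*lam)^l * delta_{t+l},
    where delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).
    `values` has one extra entry for the state after the last reward."""
    T = len(rewards)
    deltas = [rewards[t] + gamma * values[t + 1] - values[t] for t in range(T)]
    advantages = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages

print(gae_advantages([1.0, 0.0, 1.0], [0.5, 0.4, 0.6, 0.0]))
```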

Dueling Network Architectures for Deep Reinforcement Learning

In recent years there have been many successes of using deep representations in reinforcement learning. Still, many of these applications use conventional architectures, such as convolutional networks, LSTMs, or auto-encoders. In this paper, we present a new neural network architecture for model-free reinforcement learning inspired by advantage learning. Our dueling architecture represents two separate estimators: one for the state value function and one for the state-dependent action advantage function. The main benefit of this factoring is to generalize learning across actions without imposing any change to the underlying reinforcement learning algorithm. Our results show that this architecture leads to better policy evaluation in the presence of many similar-valued actions. Moreover, the dueling architecture enables our RL agent to outperform the state-of-the-art Double DQN method of van Hasselt et al. (2015) in 46 out of 57 Atari games.

Authors: Ziyu Wang ziyu@google.com, Nando de Freitas nandodefreitas@google.com, Marc Lanctot lanctot@google.com
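
The dueling combination step is a one-liner: Q(s, a) = V(s) + (A(s, a) - mean_a A(s, a)), with the mean subtraction keeping the two streams identifiable. A sketch with linear heads standing in for the paper's convolutional network:

```python
import numpy as np

rng = np.random.default_rng(0)
feat_dim, n_actions = 16, 6

# Hypothetical linear heads on top of shared features (the paper uses convolutional features).
w_value = rng.normal(scale=0.1, size=feat_dim)
W_adv = rng.normal(scale=0.1, size=(n_actions, feat_dim))

def dueling_q(features):
    """Combine the two streams: Q(s, a) = V(s) + (A(s, a) - mean_a A(s, a))."""
    v = w_value @ features                 # scalar state value
    a = W_adv @ features                   # per-action advantages
    return v + (a - a.mean())              # mean-subtraction removes the off-by-a-constant ambiguity

print(dueling_q(rng.normal(size=feat_dim)))
```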

Actor-Mimic: Deep Multitask and Transfer Reinforcement Learning

The ability to act in multiple environments and transfer previous knowledge to new situations can be considered a critical aspect of any intelligent agent. Towards this goal, we define a novel method of multitask and transfer learning that enables an autonomous agent to learn how to behave in multiple tasks simultaneously, and then generalize its knowledge to new domains. This method, termed ``Actor-Mimic'', exploits the use of deep reinforcement learning and model compression techniques to train a single policy network that learns how to act in a set of distinct tasks by using the guidance of several expert teachers. We then show that the representations learnt by the deep policy network are capable of generalizing to new tasks, speeding up learning in novel environments. Although our method can in general be applied to a wide range of problems, we use Atari games as a testing environment to demonstrate these methods.

Authors: Emilio Parisotto eparisotto@cs.toronto.edu, Jimmy Ba jimmy@psi.utoronto.ca, Ruslan Salakhutdinov rsalakhu@cs.toronto.edu
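
The multitask objective can be read as policy regression: the student matches each expert teacher's softened Q-value policy via cross-entropy. A minimal sketch of that per-task loss (the temperature and inputs here are illustrative):

```python
import numpy as np

def softmax(z, temperature=1.0):
    z = np.asarray(z) / temperature
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def actor_mimic_loss(teacher_q, student_logits, temperature=1.0):
    """Policy-regression objective for one task: cross-entropy between the teacher's softened
    Q-value policy and the student's policy."""
    target = softmax(teacher_q, temperature)
    student = softmax(student_logits)
    return float(-np.sum(target * np.log(student + 1e-12)))

# The multitask objective sums this loss over transitions drawn from each game's expert DQN teacher.
print(actor_mimic_loss([2.0, 0.1, -1.0], [0.5, 0.2, 0.0]))
```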

Incentivizing Exploration In Reinforcement Learning With Deep Predictive Models

Achieving efficient and scalable exploration in complex domains poses a major challenge in reinforcement learning. While Bayesian and PAC-MDP approaches to the exploration problem offer strong formal guarantees, they are often impractical in higher dimensions due to their reliance on enumerating the state-action space. Hence, exploration in complex domains is often performed with simple epsilon-greedy methods. In this paper, we consider the challenging Atari games domain, which requires processing raw pixel inputs and delayed rewards. We evaluate several more sophisticated exploration strategies, including Thompson sampling and Boltzmann exploration, and propose a new exploration method based on assigning exploration bonuses from a concurrently learned model of the system dynamics. By parameterizing our learned model with a neural network, we are able to develop a scalable and efficient approach to exploration bonuses that can be applied to tasks with complex, high-dimensional state spaces. In the Atari domain, our method provides the most consistent improvement across a range of games that pose a major challenge for prior methods. In addition to raw game-scores, we also develop an AUC-100 metric for the Atari Learning domain to evaluate the impact of exploration on this benchmark.

Authors: Bradly Stadie bstadie@berkeley.edu, Sergey Levine svlevine@cs.berkeley.edu, Pieter Abbeel pabbeel@cs.berkeley.edu
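
The bonus is driven by the prediction error of a concurrently learned dynamics model: transitions the model predicts poorly are treated as novel and rewarded. A sketch with a linear model standing in for the paper's neural network, and an illustrative decay schedule:

```python
import numpy as np

rng = np.random.default_rng(0)
s_dim, a_dim = 8, 3

# Hypothetical linear dynamics model over encoded states; the paper learns a neural network on learned features.
W_model = rng.normal(scale=0.1, size=(s_dim, s_dim + a_dim))

def exploration_bonus(s, a, s_next, beta=1.0, step=1):
    """Bonus proportional to the dynamics model's prediction error on (s, a) -> s_next;
    the 1/sqrt(step) decay is illustrative, not the paper's exact schedule."""
    pred = W_model @ np.concatenate([s, a])
    error = float(np.sum((pred - s_next) ** 2))
    return beta * error / np.sqrt(step)

s, a, s2 = rng.normal(size=s_dim), rng.normal(size=a_dim), rng.normal(size=s_dim)
print(exploration_bonus(s, a, s2))
```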

Reinforcement Learning Neural Turing Machines

The Neural Turing Machine (NTM) is more expressive than all previously considered models because of its external memory. It can be viewed as a broader effort to use abstract external Interfaces and to learn a parametric model that interacts with them. The capabilities of a model can be extended by providing it with proper Interfaces that interact with the world. These external Interfaces include memory, a database, a search engine, or a piece of software such as a theorem verifier. Some of these Interfaces are provided by the developers of the model. However, many important existing Interfaces, such as databases and search engines, are discrete. We examine the feasibility of learning models to interact with discrete Interfaces. We investigate the following discrete Interfaces: a memory Tape, an input Tape, and an output Tape. We use a Reinforcement Learning algorithm to train a neural network that interacts with such Interfaces to solve simple algorithmic tasks. Our Interfaces are expressive enough to make our model Turing complete.

Authors: Wojciech Zaremba woj.zaremba@gmail.com, Ilya Sutskever ilyasu@google.com
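
Since the Interface actions (for example, tape-head moves) are discrete, they can be trained with a score-function (REINFORCE-style) estimator. The sketch below shows such an update for a linear controller; the observation encoding, action set, and controller are stand-ins for the paper's recurrent model.

```python
import numpy as np

rng = np.random.default_rng(0)
obs_dim, n_interface_actions = 6, 3   # e.g. move the memory-tape head left / stay / right

# Hypothetical linear controller; the paper trains a recurrent controller.
W_ctrl = np.zeros((n_interface_actions, obs_dim))

def sample_interface_action(obs):
    """Sample a discrete Interface action from a softmax over controller logits."""
    logits = W_ctrl @ obs
    p = np.exp(logits - logits.max()); p /= p.sum()
    a = rng.choice(n_interface_actions, p=p)
    return a, p

def reinforce_update(trajectory, total_reward, lr=0.01):
    """Score-function update: discrete Interface actions (head moves) become more likely
    when the episode's task reward is high."""
    for obs, a, p in trajectory:
        grad_logp = -np.outer(p, obs)      # d log pi(a|obs) / d W for a softmax policy
        grad_logp[a] += obs
        W_ctrl[:] += lr * total_reward * grad_logp

obs = rng.normal(size=obs_dim)
a, p = sample_interface_action(obs)
reinforce_update([(obs, a, p)], total_reward=1.0)
```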
