
NeurIPS 2020: Key Research Papers in Reinforcement Learning and More



Our team reviewed the papers accepted to NeurIPS 2020 and shortlisted the most interesting ones across different research areas: the top reinforcement learning papers, plus a couple of remarkable contributions that do not fall into any specific application category.

If you’re interested in the remarkable keynote presentations, interesting workshops, and exciting tutorials presented at the conference, check our guide to NeurIPS 2020.

Subscribe to our AI Research mailing list at the bottom of this article to be alerted when we release new summaries.

Top Reinforcement Learning Research Papers at NeurIPS 2020

The Berkeley Artificial Intelligence Research (BAIR) lab remains one of the most productive research teams when it comes to cutting-edge ideas in reinforcement learning. At this year’s NeurIPS conference, researchers from BAIR, Microsoft Research, Carnegie Mellon University, McGill University, and other research labs propose:

  • new approaches to sample efficiency and data augmentation;
  • unsupervised environment design;
  • relabeling methods for multi-task reinforcement learning;
  • a latent variable model for deep reinforcement learning;
  • conservative Q-learning for offline RL, and more.

Here are the papers we feature.

Novelty Search in Representational Space for Sample Efficient Exploration

David Tao (McGill University), Vincent Francois-Lavet (McGill University), Joelle Pineau (McGill University)

We present a new approach for efficient exploration which leverages a low-dimensional encoding of the environment learned with a combination of model-based and model-free objectives. Our approach uses intrinsic rewards that are based on the distance of nearest neighbors in the low dimensional representational space to gauge novelty. We then leverage these intrinsic rewards for sample-efficient exploration with planning routines in representational space for hard exploration tasks with sparse rewards. One key element of our approach is the use of information theoretic principles to shape our representations in a way so that our novelty reward goes beyond pixel similarity. We test our approach on a number of maze tasks, as well as a control problem and show that our exploration approach is more sample-efficient compared to strong baselines.

Code: the official implementation is available here.
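To make the core idea concrete, here is a minimal sketch of a nearest-neighbor novelty reward, assuming a learned encoder phi that maps observations to low-dimensional latent vectors and a buffer of previously visited latent states (both names are ours, not the authors’; the paper additionally shapes the representation with model-based, model-free, and information-theoretic objectives):

```python
import numpy as np

def knn_novelty_reward(z_new, latent_buffer, k=10):
    """Intrinsic reward: mean Euclidean distance from a new latent state
    to its k nearest neighbors among previously visited latent states.
    (Illustrative sketch, not the official implementation.)"""
    if len(latent_buffer) == 0:
        return 0.0
    dists = np.linalg.norm(np.asarray(latent_buffer) - z_new, axis=1)
    k = min(k, len(dists))
    nearest = np.partition(dists, k - 1)[:k]
    return float(nearest.mean())

# Hypothetical usage with an encoder phi:
# r_intrinsic = knn_novelty_reward(phi(observation), latent_buffer)
```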

Emergent Complexity and Zero-shot Transfer via Unsupervised Environment Design

Michael Dennis (BAIR), Natasha Jaques (Google Brain), Eugene Vinitsky (BAIR), Alexandre Bayen (BAIR), Stuart Russell (BAIR), Andrew Critch (BAIR), Sergey Levine (BAIR)

A wide range of reinforcement learning (RL) problems – including robustness, transfer learning, unsupervised RL, and emergent complexity – require specifying a distribution of tasks or environments in which a policy will be trained. However, creating a useful distribution of environments is error prone, and takes a significant amount of developer time and effort. We propose Unsupervised Environment Design (UED) as an alternative paradigm, where developers provide environments with unknown parameters, and these parameters are used to automatically produce a distribution over valid, solvable environments. Existing approaches to automatically generating environments suffer from common failure modes: domain randomization cannot generate structure or adapt the difficulty of the environment to the agent’s learning progress, and minimax adversarial training leads to worst-case environments that are often unsolvable. To generate structured, solvable environments for our protagonist agent, we introduce a second, antagonist agent that is allied with the environment-generating adversary. The adversary is motivated to generate environments which maximize regret, defined as the difference between the protagonist and antagonist agent’s return. We call our technique Protagonist Antagonist Induced Regret Environment Design (PAIRED). Our experiments demonstrate that PAIRED produces a natural curriculum of increasingly complex environments, and PAIRED agents achieve higher zero-shot transfer performance when tested in highly novel environments.

Code: official TensorFlow implementation is available here.

PAIRED at NeurIPS 2020
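The regret signal at the heart of PAIRED is simple to state. Below is a hedged sketch of how it could be estimated from sampled rollouts; the function name and the max/mean estimate follow the paper’s description, but the surrounding training loop is omitted:

```python
def paired_regret(protagonist_returns, antagonist_returns):
    """Regret used to train the environment-generating adversary:
    the best return achieved by the antagonist minus the average
    return of the protagonist on the same generated environment.
    (Sketch only; both quantities come from sampled rollouts.)"""
    return max(antagonist_returns) - sum(protagonist_returns) / len(protagonist_returns)

# The adversary is rewarded with +regret, so it proposes environments the
# antagonist can solve but the protagonist cannot yet, which keeps the
# curriculum challenging while ruling out unsolvable worst cases.
```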

FLAMBE: Structural Complexity and Representation Learning of Low Rank MDPs

Alekh Agarwal (Microsoft Research), Sham Kakade (Microsoft Research), Akshay Krishnamurthy (Microsoft Research), Wen Sun (Microsoft Research)

In order to deal with the curse of dimensionality in reinforcement learning (RL), it is common practice to make parametric assumptions where values or policies are functions of some low dimensional feature space. This work focuses on the representation learning question: how can we learn such features? Under the assumption that the underlying (unknown) dynamics correspond to a low rank transition matrix, we show how the representation learning question is related to a particular non-linear matrix decomposition problem. Structurally, we make precise connections between these low rank MDPs and latent variable models, showing how they significantly generalize prior formulations, such as block MDPs, for representation learning in RL. Algorithmically, we develop FLAMBE, which engages in exploration and representation learning for provably efficient RL in low rank transition models. On a technical level, our analysis eliminates reachability assumptions that appear in prior results on the simpler block MDP model and may be of independent interest.
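For reference, the low rank MDP assumption underlying FLAMBE can be written as a factorization of the (unknown) transition operator through d-dimensional feature maps, which the algorithm learns from data (standard notation; d is the rank):

```latex
T(s' \mid s, a) \;=\; \big\langle \phi(s, a),\, \mu(s') \big\rangle,
\qquad \phi(s, a),\ \mu(s') \in \mathbb{R}^{d}.
```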

Rewriting History with Inverse RL: Hindsight Inference for Policy Improvement

Benjamin Eysenbach (Carnegie Mellon University, Google Brain), Xinyang Geng (UC Berkeley), Sergey Levine (UC Berkeley, Google Brain), Russ Salakhutdinov (Carnegie Mellon University)

Multi-task reinforcement learning (RL) aims to simultaneously learn policies for solving many tasks. Several prior works have found that relabeling past experience with different reward functions can improve sample efficiency. Relabeling methods typically pose the question: if, in hindsight, we assume that our experience was optimal for some task, for what task was it optimal? Inverse RL answers this question. In this paper we show that inverse RL is a principled mechanism for reusing experience across tasks. We use this idea to generalize goal-relabeling techniques from prior work to arbitrary types of reward functions. Our experiments confirm that relabeling data using inverse RL outperforms prior relabeling methods on goal-reaching tasks, and accelerates learning on more general multi-task settings where prior methods are not applicable, such as domains with discrete sets of rewards and those with linear reward functions.

Hindsight Inference for Policy Improvement
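As a rough illustration of the relabeling step, the sketch below scores a stored trajectory under every candidate task’s reward function and samples a new task label in proportion to how well that task “explains” the trajectory. All names are hypothetical, and the softmax over undiscounted returns is a simplification of the MaxEnt inverse RL posterior used in the paper:

```python
import numpy as np

def relabel_with_inverse_rl(trajectory, reward_fns, temperature=1.0, rng=None):
    """Sample a task label for a trajectory of (state, action) pairs in
    proportion to the exponentiated return under each task's reward
    function. (Simplified stand-in for the paper's MaxEnt IRL posterior.)"""
    rng = rng or np.random.default_rng()
    returns = np.array([sum(r(s, a) for s, a in trajectory) for r in reward_fns])
    probs = np.exp((returns - returns.max()) / temperature)
    probs /= probs.sum()
    return int(rng.choice(len(reward_fns), p=probs))
```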

Self-Paced Deep Reinforcement Learning

Pascal Klink (Technische Universität Darmstadt), Carlo D’Eramo (Technische Universität Darmstadt), Jan Peters (Technische Universität Darmstadt), Joni Pajarinen (Technische Universität Darmstadt, Aalto University)

Curriculum reinforcement learning (CRL) improves the learning speed and stability of an agent by exposing it to a tailored series of tasks throughout learning. Despite empirical successes, an open question in CRL is how to automatically generate a curriculum for a given reinforcement learning (RL) agent, avoiding manual design. In this paper, we propose an answer by interpreting the curriculum generation as an inference problem, where distributions over tasks are progressively learned to approach the target task. This approach leads to an automatic curriculum generation, whose pace is controlled by the agent, with solid theoretical motivation and easily integrated with deep RL algorithms. In the conducted experiments, the curricula generated with the proposed algorithm significantly improve learning performance across several environments and deep RL algorithms, matching or outperforming state-of-the-art existing CRL algorithms.

Code: the official implementation is available here.
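In schematic form, the self-paced update chooses the next task distribution by trading off current agent performance against closeness to the target task distribution, with the trade-off weight tightened over training. The notation below is our paraphrase (contexts c parameterize tasks, J(π, c) is the agent’s expected return on task c, and μ is the target task distribution):

```latex
\max_{\nu} \;\; \mathbb{E}_{c \sim p_{\nu}(c)}\!\left[ J(\pi, c) \right]
\;-\; \alpha \, D_{\mathrm{KL}}\!\left( p_{\nu}(c) \,\|\, \mu(c) \right)
```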

Reinforcement Learning with Augmented Data

Misha Laskin (UC Berkeley), Kimin Lee (UC Berkeley), Adam Stooke (UC Berkeley), Lerrel Pinto (New York University), Pieter Abbeel (UC Berkeley & covariant.ai), Aravind Srinivas (UC Berkeley)

Learning from visual observations is a fundamental yet challenging problem in Reinforcement Learning (RL). Although algorithmic advances combined with convolutional neural networks have proved to be a recipe for success, current methods are still lacking on two fronts: (a) data-efficiency of learning and (b) generalization to new environments. To this end, we present Reinforcement Learning with Augmented Data (RAD), a simple plug-and-play module that can enhance most RL algorithms. We perform the first extensive study of general data augmentations for RL on both pixel-based and state-based inputs, and introduce two new data augmentations – random translate and random amplitude scale. We show that augmentations such as random translate, crop, color jitter, patch cutout, random convolutions, and amplitude scale can enable simple RL algorithms to outperform complex state-of-the-art methods across common benchmarks. RAD sets a new state-of-the-art in terms of data-efficiency and final performance on the DeepMind Control Suite benchmark for pixel-based control as well as OpenAI Gym benchmark for state-based control. We further demonstrate that RAD significantly improves test-time generalization over existing methods on several OpenAI ProcGen benchmarks.

Code: official codebase is available here.

Reinforcement Learning with Augmented Data
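Because RAD is a plug-and-play module, the augmentations themselves are easy to sketch. Here is a hedged version of random translate for a batch of image observations, in the spirit of the paper: each frame is placed at a random offset inside a larger zero-padded canvas (array shapes and the pad size are illustrative, not the official code):

```python
import numpy as np

def random_translate(imgs, pad=4, rng=None):
    """Place each (C, H, W) image at a random offset inside a zero-padded
    (C, H + 2*pad, W + 2*pad) canvas. `imgs` has shape (B, C, H, W)."""
    rng = rng or np.random.default_rng()
    b, c, h, w = imgs.shape
    canvas = np.zeros((b, c, h + 2 * pad, w + 2 * pad), dtype=imgs.dtype)
    for i in range(b):
        top = rng.integers(0, 2 * pad + 1)
        left = rng.integers(0, 2 * pad + 1)
        canvas[i, :, top:top + h, left:left + w] = imgs[i]
    return canvas
```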

Stochastic Latent Actor-Critic: Deep Reinforcement Learning with a Latent Variable Model

Alex Lee (UC Berkeley), Anusha Nagabandi (UC Berkeley), Pieter Abbeel (UC Berkeley & covariant.ai), Sergey Levine (UC Berkeley)

Deep reinforcement learning (RL) algorithms can use high-capacity deep networks to learn directly from image observations. However, these high-dimensional observation spaces present a number of challenges in practice, since the policy must now solve two problems: representation learning and task learning. In this work, we tackle these two problems separately, by explicitly learning latent representations that can accelerate reinforcement learning from images. We propose the stochastic latent actor-critic (SLAC) algorithm: a sample-efficient and high-performing RL algorithm for learning policies for complex continuous control tasks directly from high-dimensional image inputs. SLAC provides a novel and principled approach for unifying stochastic sequential models and RL into a single method, by learning a compact latent representation and then performing RL in the model’s learned latent space. Our experimental evaluation demonstrates that our method outperforms both model-free and model-based alternatives in terms of final performance and sample efficiency, on a range of difficult image-based control tasks. Our code and videos of our results are available at our website.

Code: official implementation is available here.

Conservative Q-Learning for Offline Reinforcement Learning

Aviral Kumar (UC Berkeley), Aurick Zhou (UC Berkeley), George Tucker (Google Brain), Sergey Levine (UC Berkeley, Google Brain)

Effectively leveraging large, previously collected datasets in reinforcement learning (RL) is a key challenge for large-scale real-world applications. Offline RL algorithms promise to learn effective policies from previously-collected, static datasets without further interaction. However, in practice, offline RL presents a major challenge, and standard off-policy RL methods can fail due to overestimation of values induced by the distributional shift between the dataset and the learned policy, especially when training on complex and multi-modal data distributions. In this paper, we propose conservative Q-learning (CQL), which aims to address these limitations by learning a conservative Q-function such that the expected value of a policy under this Q-function lower-bounds its true value. We theoretically show that CQL produces a lower bound on the value of the current policy and that it can be incorporated into a policy learning procedure with theoretical improvement guarantees. In practice, CQL augments the standard Bellman error objective with a simple Q-value regularizer which is straightforward to implement on top of existing deep Q-learning and actor-critic implementations. On both discrete and continuous control domains, we show that CQL substantially outperforms existing offline RL methods, often learning policies that attain 2-5 times higher final return, especially when learning from complex and multi-modal data distributions. 

Code: official implementation is available here.
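The conservative penalty itself is compact. Below is a simplified sketch of the discrete-action CQL(H) regularizer on a batch of Q-values: it pushes down a log-sum-exp over all actions while pushing up the Q-values of the actions actually present in the dataset. In the full algorithm this term, scaled by a coefficient, is added to the usual Bellman error objective (shapes and names here are ours):

```python
import numpy as np

def cql_penalty(q_values, dataset_actions):
    """Mean of logsumexp_a Q(s, a) - Q(s, a_data) over a batch.
    `q_values` has shape (batch, num_actions); `dataset_actions` has
    shape (batch,). (Illustrative sketch, not the official code.)"""
    q_max = q_values.max(axis=1)
    logsumexp = np.log(np.exp(q_values - q_max[:, None]).sum(axis=1)) + q_max
    q_data = q_values[np.arange(len(dataset_actions)), dataset_actions]
    return float((logsumexp - q_data).mean())
```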

Other Important Research Papers at NeurIPS 2020

We also want to highlight two remarkable research papers that do not fall into any specific application category:

  • AdaBelief Optimizer for deep learning by the research team from Yale University;
  • AI Feynman 2.0 for symbolic regression by researchers from MIT.

AdaBelief Optimizer: Adapting Stepsizes by the Belief in Observed Gradients

Juntang Zhuang (Yale University), Tommy Tang (University of Illinois Urbana-Champaign), Yifan Ding (University of Central Florida), Sekhar C Tatikonda (Yale University), Nicha Dvornek (Yale University), Xenophon Papademetris (Yale University), James Duncan (Yale University)

Most popular optimizers for deep learning can be broadly categorized as adaptive methods (e.g. Adam) and accelerated schemes (e.g. stochastic gradient descent (SGD) with momentum). For many models such as convolutional neural networks (CNNs), adaptive methods typically converge faster but generalize worse compared to SGD; for complex settings such as generative adversarial networks (GANs), adaptive methods are typically the default because of their stability. We propose AdaBelief to simultaneously achieve three goals: fast convergence as in adaptive methods, good generalization as in SGD, and training stability. The intuition for AdaBelief is to adapt the stepsize according to the “belief” in the current gradient direction. Viewing the exponential moving average (EMA) of the noisy gradient as the prediction of the gradient at the next time step, if the observed gradient greatly deviates from the prediction, we distrust the current observation and take a small step; if the observed gradient is close to the prediction, we trust it and take a large step. We validate AdaBelief in extensive experiments, showing that it outperforms other methods with fast convergence and high accuracy on image classification and language modeling. Specifically, on ImageNet, AdaBelief achieves comparable accuracy to SGD. Furthermore, in the training of a GAN on Cifar10, AdaBelief demonstrates high stability and improves the quality of generated samples compared to a well-tuned Adam optimizer. 

Code: official implementation is available here.
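The change relative to Adam fits in a few lines: the second-moment estimate tracks the squared deviation of the gradient from its EMA prediction rather than the squared gradient itself. The sketch below follows the update rule described in the paper, without weight decay or other options:

```python
import numpy as np

def adabelief_step(param, grad, m, s, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AdaBelief update for a single parameter array (t starts at 1).
    When `grad` deviates strongly from its EMA prediction `m`, the
    denominator grows and the step shrinks."""
    m = beta1 * m + (1 - beta1) * grad
    s = beta2 * s + (1 - beta2) * (grad - m) ** 2 + eps
    m_hat = m / (1 - beta1 ** t)   # bias correction, as in Adam
    s_hat = s / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(s_hat) + eps)
    return param, m, s
```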

AI Feynman 2.0: Pareto-optimal symbolic regression exploiting graph modularity

Silviu-Marian Udrescu (MIT), Andrew Tan (Massachusetts Institute of Technology), Jiahai Feng (MIT), Orisvaldo Neto (MIT), Tailin Wu (Stanford), Max Tegmark (MIT)

We present an improved method for symbolic regression that seeks to fit data to formulas that are Pareto-optimal, in the sense of having the best accuracy for a given complexity. It improves on the previous state-of-the-art by typically being orders of magnitude more robust toward noise and bad data, and also by discovering many formulas that stumped previous methods. We develop a method for discovering generalized symmetries (arbitrary modularity in the computational graph of a formula) from gradient properties of a neural network fit. We use normalizing flows to generalize our symbolic regression method to probability distributions from which we only have samples, and employ statistical hypothesis testing to accelerate robust brute-force search. 

Code: official implementation is available here.
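Pareto-optimality here simply means that no formula on the frontier is dominated by one that is both simpler and more accurate. A minimal sketch of that filtering step (candidate tuples and field names are ours, not the authors’):

```python
def pareto_frontier(candidates):
    """Keep formulas that are Pareto-optimal in (complexity, error):
    each candidate is a (complexity, error, formula) tuple."""
    frontier = []
    for complexity, error, formula in sorted(candidates):
        if not frontier or error < frontier[-1][1]:
            frontier.append((complexity, error, formula))
    return frontier

# pareto_frontier([(3, 0.10, "a*x"), (5, 0.02, "a*x + b"), (7, 0.05, "a*x**2")])
# -> [(3, 0.1, 'a*x'), (5, 0.02, 'a*x + b')]
```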

Top Research Papers From 2020

To be prepared for NeurIPS, you should also be aware of the major research papers published over the last year in popular areas such as computer vision, NLP, and general machine learning, even if they are not being presented at this particular event.

We’ve shortlisted the top research papers in these areas so you can review them quickly.

Enjoy this article? Sign up for more AI research updates.

We’ll let you know when we release more summary articles like this one.

Source: https://www.topbots.com/neurips-2020-rl-research-papers/
