This paper collection on exploration in reinforcement learning covers exploration in multi-armed bandits, reinforcement learning, and multi-agent reinforcement learning.
Exploration is an important topic in reinforcement learning research; in essence, it aims to improve sample efficiency in the MDP setting. Introductory survey slides and a survey document (in Chinese) on the exploration problem are available.
A simple form of the exploration-exploitation dilemma can be seen in multi-armed bandit problems, and we include MAB papers because many theoretical ideas derive from MAB studies (a minimal epsilon-greedy sketch opens the bandit list below).
In the early stage, most exploration studies focused on the sample efficiency of specific algorithms; many of them designed different exploration bonuses to lead agents to explore a sufficient set of trajectories.
Recently, most deep RL exploration research has focused on sparse-reward settings, where the goal differs somewhat from the earlier studies; we nevertheless classify these methods by their methodology.
Many learning algorithms also consider the problem of efficient exploration, so we include such work as well.
Exploration is a broad topic, and this paper list only scratches the surface. Collected papers are sorted by time and classification. Any suggestions and pull requests are welcome.
The references here are shared for research purposes. If any author does not want their paper listed here, please feel free to contact me (Email: ericliuof97 [AT] gmail.com).

## Multi-Arm Bandit
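Before the bandit papers, a minimal sketch of the exploration-exploitation dilemma they study: an epsilon-greedy agent on a Bernoulli multi-armed bandit. This is illustrative only; the arm probabilities, epsilon, and step count are arbitrary assumptions rather than anything taken from the listed papers.

```python
# Hedged sketch: epsilon-greedy on a Bernoulli multi-armed bandit.
# All numbers here (arm probabilities, epsilon, steps) are illustrative.
import random

def epsilon_greedy_bandit(true_probs, epsilon=0.1, steps=10_000, seed=0):
    """Keep a running mean per arm; explore uniformly with probability epsilon."""
    rng = random.Random(seed)
    n_arms = len(true_probs)
    counts = [0] * n_arms        # pulls per arm
    values = [0.0] * n_arms      # running mean reward per arm
    total_reward = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)                          # explore
        else:
            arm = max(range(n_arms), key=lambda a: values[a])    # exploit
        reward = 1.0 if rng.random() < true_probs[arm] else 0.0  # Bernoulli arm
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]      # incremental mean
        total_reward += reward
    return values, counts, total_reward

values, counts, total = epsilon_greedy_bandit([0.2, 0.5, 0.7])
print(values, counts, total)  # the 0.7 arm should dominate the pull counts
```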
- <Bandit problems: sequential allocation of experiments (monographs on statistics and applied probability)> by Donald A Berry and Bert Fristedt, 1985.
- <A contextual-bandit algorithm for mobile context-aware recommender system> by Djallel Bouneffouf, Amel Bouzeghoub, and Alda Lopes Gançarski, 2012.
- <Value-difference based exploration: adaptive control between epsilon-greedy and softmax> by Michel Tokic and Günther Palm, 2011.
- <Adaptive ε-greedy exploration in reinforcement learning based on value differences> by Michel Tokic, 2010.
- <Finite-time regret bounds for the multiarmed bandit problem> by Nicolo Cesa-Bianchi and Paul Fischer, 1998.
- [POKER] <Multi-armed bandit algorithms and empirical evaluation> by Joannes Vermorel and Mehryar Mohri, 2005.
- [UCB] <Finite-time analysis of the multiarmed bandit problem> by Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer, 2002. (A UCB1 sketch follows this list.)
- [Pursuit] <A class of rapidly converging algorithms for learning automata> by M. A. L. Thathachar, 1984.
- [Bayesian UCB] <On Bayesian upper confidence bounds for bandit problems> by Emilie Kaufmann, Olivier Cappé, and Aurélien Garivier, 2012.
- <A modern Bayesian look at the multi-armed bandit> by Steven L Scott, 2010.
- <Optimal learning and experimentation in bandit problems> by Monica Brezzi and Tze Leung Lai, 2002.
- <Bandit processes and dynamic allocation indices> by John C Gittins, 1979.
- [Interesting] <Algorithms for multi-armed bandit problems> by Volodymyr Kuleshov and Doina Precup, 2014.
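As a companion to the [UCB] entry above, here is a minimal sketch of the UCB1 rule of Auer, Cesa-Bianchi, and Fischer (2002): play each arm once, then always pull the arm maximizing its empirical mean plus sqrt(2 ln t / n_pulls). The Bernoulli reward model and all parameters are illustrative assumptions.

```python
# Hedged sketch of UCB1: optimism in the face of uncertainty via a
# confidence-interval bonus that shrinks as an arm is pulled more often.
import math
import random

def ucb1(true_probs, steps=10_000, seed=0):
    rng = random.Random(seed)
    n_arms = len(true_probs)
    counts = [0] * n_arms
    values = [0.0] * n_arms

    def pull(arm):
        reward = 1.0 if rng.random() < true_probs[arm] else 0.0  # Bernoulli arm
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]

    for arm in range(n_arms):  # initialization: play every arm once
        pull(arm)
    for t in range(n_arms, steps):
        # index = empirical mean + sqrt(2 ln t / n); the bonus drives exploration
        scores = [values[a] + math.sqrt(2 * math.log(t) / counts[a])
                  for a in range(n_arms)]
        pull(max(range(n_arms), key=lambda a: scores[a]))
    return values, counts

print(ucb1([0.2, 0.5, 0.7]))
```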
## Reward-based Exploration (Intrinsic Reward / Exploration Bonus / Surprise / Curiosity / Uncertainty)
- <Formal Theory of Creativity, Fun, and Intrinsic Motivation (1990-2010)> by Jürgen Schmidhuber, 2010.
- <What is intrinsic motivation? A typology of computational approaches> by Pierre-Yves Oudeyer and Frederic Kaplan, 2007.
- [MBIE, MBIE-EB] <A theoretical analysis of model-based interval estimation> by Alexander L Strehl and Michael L Littman, 2004. (A count-bonus sketch in the spirit of MBIE-EB follows this list.)
- [MBIE] <An empirical evaluation of interval estimation for Markov decision processes> by Alexander L Strehl and Michael L Littman, 2004.
- [R-max] <R-max - a general polynomial time algorithm for near-optimal reinforcement learning> by Ronen I Brafman and Moshe Tennenholtz, 2002.
- [E3] <Near-optimal reinforcement learning in polynomial time> by Michael Kearns and Satinder Singh, 2002.
- [MBIE] <Efficient model-based exploration> by Marco Wiering and Jürgen Schmidhuber, 1998.
- <Near-Bayesian exploration in polynomial time> by J Zico Kolter and Andrew Y Ng, 2009.
- <Unifying count-based exploration and intrinsic motivation> by Marc Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos, 2016.
- <Count-based exploration with neural density models> by Georg Ostrovski, Marc G Bellemare, Aaron van den Oord, and Rémi Munos, 2017.
- [CoEX] <Contingency-aware exploration in reinforcement learning> by Jongwook Choi, Yijie Guo, Marcin Moczulski, Junhyuk Oh, Neal Wu, Mohammad Norouzi, and Honglak Lee, 2018.
- <#Exploration: A study of count-based exploration for deep reinforcement learning> by Haoran Tang, Rein Houthooft, Davis Foote, Adam Stooke, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel, 2017.
- <Variational information maximisation for intrinsically motivated reinforcement learning> by Shakir Mohamed and Danilo Jimenez Rezende, 2015.
- <An information-theoretic approach to curiosity-driven reinforcement learning> by Susanne Still and Doina Precup, 2012.
- [VIME] <VIME: Variational information maximizing exploration> by Rein Houthooft, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel, 2016.
- <Self-Supervised Exploration via Disagreement> by Deepak Pathak, Dhiraj Gandhi, and Abhinav Gupta, 2019.
- <Episodic curiosity through reachability> by Nikolay Savinov, Anton Raichuk, Raphael Marinier, Damien Vincent, Marc Pollefeys, Timothy Lillicrap, and Sylvain Gelly, 2018.
- [RND] <Exploration by random network distillation> by Yuri Burda, Harrison Edwards, Amos Storkey, and Oleg Klimov, 2018.
- <Large-Scale Study of Curiosity-Driven Learning> by Yuri Burda, Harri Edwards, Deepak Pathak, Amos Storkey, Trevor Darrell, and Alexei A Efros, 2018.
- [ICM] <Curiosity-driven exploration by self-supervised prediction> by Deepak Pathak, Pulkit Agrawal, Alexei A Efros, and Trevor Darrell, 2017.
- <Provably efficient RL with Rich Observations via Latent State Decoding> by Simon S Du, Akshay Krishnamurthy, Nan Jiang, Alekh Agarwal, Miroslav Dudik, and John Langford, 2019.
- <Parameter Space Noise for Exploration> by Matthias Plappert, Rein Houthooft, Prafulla Dhariwal, Szymon Sidor, Richard Y Chen, Xi Chen, Tamim Asfour, Pieter Abbeel, and Marcin Andrychowicz, 2018.
- [SoftAC] <Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor> by Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine, 2018.
- [SoftQ] <Reinforcement learning with deep energy-based policies> by Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine, 2017.
- <Deep exploration via bootstrapped DQN> by Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy, 2016.
- <Taming the noise in reinforcement learning via soft updates> by Roy Fox, Ari Pakman, and Naftali Tishby, 2015.
- <Monte-Carlo exploration for deterministic planning> by Hootan Nakhost and Martin Müller, 2009.
- [UCT] <Bandit based Monte-Carlo planning> by Levente Kocsis and Csaba Szepesvári, 2006.
- [Go-Explore] <Go-Explore: a New Approach for Hard-Exploration Problems> by Ecoffet, Adrien and Huizinga, Joost and Lehman, Joel and Stanley, Kenneth O and Clune, Jeff, 2019.
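Many of the count-based entries above share one mechanic: augment the environment reward with a bonus that decays with the visit count, e.g. beta / sqrt(N(s, a)) in the spirit of MBIE-EB. Below is a minimal tabular Q-learning sketch of that idea; the Gym-style `env` interface (`reset`, `step`, `n_actions`) and all hyperparameters are assumptions for illustration, not an API from any listed paper.

```python
# Hedged sketch: tabular Q-learning with a count-based exploration bonus
# beta / sqrt(N(s, a)), in the spirit of MBIE-EB. `env` is an assumed
# Gym-like placeholder whose step() returns (next_state, reward, done).
from collections import defaultdict
import math

def q_learning_with_count_bonus(env, episodes=500, alpha=0.1, gamma=0.99, beta=0.05):
    Q = defaultdict(float)   # Q[(state, action)], zero-initialized
    N = defaultdict(int)     # visit count per (state, action)
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # act greedily w.r.t. Q plus the count bonus (optimism for rarely tried pairs)
            action = max(range(env.n_actions),
                         key=lambda a: Q[(state, a)] + beta / math.sqrt(N[(state, a)] + 1))
            next_state, reward, done = env.step(action)
            N[(state, action)] += 1
            bonus = beta / math.sqrt(N[(state, action)])  # intrinsic reward, decays with visits
            future = 0.0 if done else gamma * max(Q[(next_state, a)]
                                                  for a in range(env.n_actions))
            Q[(state, action)] += alpha * (reward + bonus + future - Q[(state, action)])
            state = next_state
    return Q
```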
## Multi-Agent Reinforcement Learning Exploration

- <CLEANing the reward: counterfactual actions to remove exploratory action noise in multiagent learning> by Chris HolmesParker, Matthew E. Taylor, Adrian Agogino, et al., 2014.
- <Classes of multiagent Q-learning dynamics with epsilon-greedy exploration> by Michael Wunder, Michael L. Littman, and Monica Babes, 2010.
- <Coordinated Exploration via Intrinsic Rewards for Multi-Agent Reinforcement Learning> by Shariq Iqbal and Fei Sha, 2019.
- <Coordinated Versus Decentralized Exploration In Multi-Agent Multi-Armed Bandits> by Mithun Chakraborty, Kai Yee Phoebe Chua, Sanmay Das, and Brendan Juba, 2017. (A toy sketch of this contrast closes the list.)
- <Multi-Robot Coordination for Space Exploration> by Logan Yliniemi, Adrian Agogino, and Kagan Tumer, 2013.
- <Coordinated Multi-Agent Exploration> by Abraham Sanchez L. and Alfredo Toriz P., 2008.
- <Coordinated exploration in multi-agent reinforcement learning: an application to load-balancing> by Katja Verbeeck, Ann Nowe, and Karl Tuyls, 2005.
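To make the coordinated-versus-decentralized contrast in the entries above concrete, here is a toy sketch that is not any listed paper's algorithm: several agents share one Bernoulli bandit and pool their value estimates; in the coordinated variant each agent's exploratory pulls cover only its assigned slice of arms, so exploration effort is not duplicated. It assumes at least as many arms as agents, and all numbers are illustrative.

```python
# Toy sketch: decentralized vs. coordinated exploration on a shared bandit.
# Agents pool value estimates; 'coordinated' agents partition the arms for
# their exploratory pulls. Assumes len(true_probs) >= n_agents.
import random

def run(true_probs, n_agents=4, steps=2_000, epsilon=0.1, coordinate=False, seed=0):
    rng = random.Random(seed)
    n_arms = len(true_probs)
    counts = [0] * n_arms
    values = [0.0] * n_arms   # shared (pooled) estimates
    total = 0.0
    for _ in range(steps):
        for agent in range(n_agents):
            if rng.random() < epsilon:
                if coordinate:   # explore only this agent's slice of arms
                    arm = rng.choice([a for a in range(n_arms) if a % n_agents == agent])
                else:            # every agent explores everywhere (duplicated effort)
                    arm = rng.randrange(n_arms)
            else:
                arm = max(range(n_arms), key=lambda a: values[a])
            reward = 1.0 if rng.random() < true_probs[arm] else 0.0
            counts[arm] += 1
            values[arm] += (reward - values[arm]) / counts[arm]
            total += reward
    return total

probs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
print("decentralized:", run(probs), "coordinated:", run(probs, coordinate=True))
```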