- basic policy gradient algorithm; for a good summary see [1]
- A2C architecture (adding the critic + return/simple advantage estimation), see [2] for details
- minibatch gradient descent with automatic differentiation (i.e. `tf.GradientTape`); shuffle and permute-only sampling for MGD (a sketch of the resulting training step follows this list)
- PPO policy (applying the PPO clip loss), see Schulman's paper [3]
- generalized advantage estimation, see the GAE paper [4]
- general improvements in most implementations, see "Implementation Matters" [5]
- #1, #9: value function clipping and global (policy) gradient clipping
- #2, #5, #6, #7: reward/observation scaling and clipping, following the StableBaselines3 [6] `VecNormalize` wrapper
- #3, #4, #8: orthogonal layer initialization, Adam learning-rate annealing, tanh activations
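A compact sketch of how these pieces fit together. Everything here is illustrative, not the repo's actual API: the `gae` and `train_step` names, the assumed model interface returning `(log_probs, values)`, and the hyperparameter defaults are assumptions.

```python
import numpy as np
import tensorflow as tf

def gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized advantage estimation [4] over one rollout.

    `values` carries one extra bootstrap entry, i.e. len(values) == len(rewards) + 1.
    """
    advantages = np.zeros_like(rewards)
    last_adv = 0.0
    for t in reversed(range(len(rewards))):
        non_terminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * non_terminal - values[t]
        last_adv = delta + gamma * lam * non_terminal * last_adv
        advantages[t] = last_adv
    returns = advantages + values[:-1]
    return advantages, returns

def train_step(model, optimizer, obs, actions, old_log_probs, old_values,
               advantages, returns, clip_ratio=0.2, max_grad_norm=0.5):
    with tf.GradientTape() as tape:
        # assumed model interface: log-probs of the given actions, plus values
        log_probs, values = model(obs, actions)

        # PPO clip loss [3]
        ratio = tf.exp(log_probs - old_log_probs)
        clipped = tf.clip_by_value(ratio, 1.0 - clip_ratio, 1.0 + clip_ratio)
        policy_loss = -tf.reduce_mean(
            tf.minimum(ratio * advantages, clipped * advantages))

        # value function clipping (#1): penalize the worse of the two errors
        values_clipped = old_values + tf.clip_by_value(
            values - old_values, -clip_ratio, clip_ratio)
        value_loss = 0.5 * tf.reduce_mean(tf.maximum(
            tf.square(values - returns), tf.square(values_clipped - returns)))

        loss = policy_loss + 0.5 * value_loss
    grads = tape.gradient(loss, model.trainable_variables)
    # global (policy) gradient clipping (#9)
    grads, _ = tf.clip_by_global_norm(grads, max_grad_norm)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```

The `tf.clip_by_global_norm` call corresponds to #9, and the `tf.maximum` over squared value errors corresponds to the value clipping of #1.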
- further improvements (some of them sketched in code after this list)
- minibatch-wise advantage normalization
- regularization and an entropy loss term for regularization/exploration
- stateless (learnable) log std for the action distribution's variance
- parallelized environments
- scaling actions to the proper range of the environment
- discrete action space agent
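A minimal sketch of the policy-head details above: a stateless (state-independent) learnable log std, the diagonal-Gaussian entropy used as an exploration bonus, minibatch-wise advantage normalization, and action scaling. `action_dim` and all function names are placeholders for illustration.

```python
import numpy as np
import tensorflow as tf

LOG_2PI = np.log(2.0 * np.pi)

# stateless learnable log std, shared across all states (illustrative dim)
action_dim = 1
log_std = tf.Variable(tf.zeros(action_dim), name="log_std")

def gaussian_log_prob(mean, actions):
    # log-density of a diagonal Gaussian with the shared log std above
    std = tf.exp(log_std)
    return -0.5 * tf.reduce_sum(
        tf.square((actions - mean) / std) + 2.0 * log_std + LOG_2PI, axis=-1)

def gaussian_entropy():
    # entropy of the diagonal Gaussian; added to the loss for exploration
    return tf.reduce_sum(log_std + 0.5 * (1.0 + LOG_2PI))

def normalize_advantages(adv, eps=1e-8):
    # minibatch-wise advantage normalization
    return (adv - tf.reduce_mean(adv)) / (tf.math.reduce_std(adv) + eps)

def scale_action(raw_action, low, high):
    # map an action from [-1, 1] to the environment's action range
    return low + 0.5 * (raw_action + 1.0) * (high - low)
```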
- saving/loading of `tf.keras` models
- tensorboard integration, logging of (a code sketch follows this list):
- hyperparameters
- graph + image of model
- losses, optimizer learning rates
- environment (rewards, actions, observations histograms)
- stateless log std and clip ratio
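A sketch of what this logging could look like with `tf.summary`; the log directory, tag names, and the `log_iteration` signature are hypothetical.

```python
import tensorflow as tf

writer = tf.summary.create_file_writer("logs/ppo")  # illustrative log dir

def log_iteration(step, config, policy_loss, value_loss, lr,
                  rewards, actions, observations, log_std, clip_ratio):
    with writer.as_default():
        if step == 0:
            # hyperparameters as text; a model graph image can be exported
            # separately via tf.keras.utils.plot_model (requires graphviz)
            tf.summary.text("hyperparameters", str(config), step=0)
        tf.summary.scalar("loss/policy", policy_loss, step=step)
        tf.summary.scalar("loss/value", value_loss, step=step)
        tf.summary.scalar("optimizer/learning_rate", lr, step=step)
        tf.summary.histogram("env/rewards", rewards, step=step)
        tf.summary.histogram("env/actions", actions, step=step)
        tf.summary.histogram("env/observations", observations, step=step)
        tf.summary.histogram("policy/log_std", log_std, step=step)
        tf.summary.scalar("ppo/clip_ratio", clip_ratio, step=step)
```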
- remove prints in the terminal; use only a progress bar and tensorboard for the rest
- provide configs/GIFs for some environments
- compile seeds together for replicability
- run_env file that loads a model, runs the environment, and prints the reward (+ video if possible); sketched below
- force types in parameters
- in-code references to the optimizations made
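A minimal sketch of such a run script under the old gym step API, assuming the saved model maps observations directly to actions; `run_episode` and the checkpoint layout are hypothetical, and the typed parameters illustrate the "force types" point above.

```python
import gym
import numpy as np
import tensorflow as tf

def run_episode(env_id: str, model_path: str, render: bool = False) -> float:
    """Load a saved tf.keras model, run one episode, and print the reward."""
    env = gym.make(env_id)
    model = tf.keras.models.load_model(model_path)
    obs = env.reset()
    done, total_reward = False, 0.0
    while not done:
        # assumed model interface: observation batch in, action out
        action = model(obs[None].astype(np.float32))[0].numpy()
        obs, reward, done, _ = env.step(action)
        total_reward += reward
        if render:
            env.render()  # frames could be collected here for a GIF
    print(f"episode reward: {total_reward:.2f}")
    return total_reward
```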
Episode scores / steps are reported for: ContCartpoalEnv, ReachingDotEnv, CartPole-v1, Pendulum-v0, ReachEnv-v0, and ReachEnvRandom-v0 (the result plots/GIFs are not reproduced here).
- pip requirements
- imagemagick for creating GIFs of environment runs
- graphviz for rendering the `tf.keras` model graph in tensorboard
- mujoco_py's offscreen rendering is buggy in gym, which affects run_model (GIF generation)
- adjust `mujoco_py.MjRenderContextOffscreen(sim, None, device_id=0)` in `gym/envs/mujoco/mujoco_env.MujocoEnv._get_viewer(...)`; see the snippet below
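The adjustment, as a patch sketch (the exact surrounding code depends on the gym/mujoco_py versions in use):

```python
# Inside gym/envs/mujoco/mujoco_env.py, in MujocoEnv._get_viewer(...):
# construct the offscreen render context with an explicit device id
self.viewer = mujoco_py.MjRenderContextOffscreen(self.sim, None, device_id=0)
```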
- [1] Basic Policy Gradient Algorithm -> https://lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html
- [2] A2C architecture -> Mnih, Volodymyr, et al. "Asynchronous methods for deep reinforcement learning." International Conference on Machine Learning. 2016.
- [3] Basic PPO Agent -> Schulman, John, et al. "Proximal policy optimization algorithms." arXiv preprint arXiv:1707.06347 (2017).
- [4] GAE -> Schulman, John, et al. "High-dimensional continuous control using generalized advantage estimation." arXiv preprint arXiv:1506.02438 (2015).
- [5] Common Improvements -> Engstrom, Logan, et al. "Implementation Matters in Deep RL: A Case Study on PPO and TRPO." International Conference on Learning Representations. 2019.
- [6] StableBaselines3 -> Raffin et al., "StableBaselines3", GitHub, https://github.com/DLR-RM/stable-baselines3
- [7] ContCartpoalEnv -> this environment is from Ian Danforth https://gist.github.com/iandanforth/e3ffb67cf3623153e968f2afdfb01dc8