Proximal Policy Optimization with state-of-the-art code-level optimizations

Implementation

basic policy gradient algorithm

  • basic policy gradient algorithm; for a good summary see [1]
  • a2c architecture (adding the critic + return/simple advantage estimation), see [2] for details
  • minibatch gradient descent with automatic differentiation (tf.GradientTape), see the sketch after this list
  • shuffle and permute-only sampling for the minibatch updates
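
A minimal sketch of the minibatch update with tf.GradientTape; function and argument names such as run_epoch and loss_fn are illustrative placeholders, not the repository's actual API:

```python
import numpy as np
import tensorflow as tf

def run_epoch(model, optimizer, loss_fn, observations, targets, batch_size=64):
    """One epoch of minibatch gradient descent over a rollout buffer.

    Permute-only sampling: every sample is used exactly once per epoch,
    in a shuffled order.
    """
    indices = np.random.permutation(len(observations))
    for start in range(0, len(indices), batch_size):
        mb = indices[start:start + batch_size]
        with tf.GradientTape() as tape:
            loss = loss_fn(model(observations[mb]), targets[mb])
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
```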

proper PPO agent and the most common improvements

  • ppo policy (applies the ppo clip loss, sketched after this list), see Schulman's paper [3]
  • generalized advantage estimation (sketched below), see the GAE paper [4]
  • general improvements found in most implementations, see "Implementation Matters" [5]
    • #1, #9: value function clipping and global (policy) gradient clipping (sketched below)
    • #2, #5, #6, #7: reward/observation scaling and clipping (sketched below), following Stable-Baselines3's [6] VecNormalize wrapper
    • #3, #4, #8: orthogonal layer initialization, Adam learning rate annealing, tanh activations
  • further improvements
    • minibatch-wise advantage normalization
    • entropy loss for regularization/exploration (included in the policy loss sketch below)
    • state-independent ("stateless"), learnable log std for the action variance (sketched below)
  • parallelized environments
  • scaling actions to the environment's action range (sketched below)
  • discrete action space agent
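
A minimal sketch of the clipped surrogate loss from [3], including the minibatch-wise advantage normalization and entropy bonus listed above; names and coefficient values are illustrative, not the repository's actual API:

```python
import tensorflow as tf

def ppo_policy_loss(new_log_probs, old_log_probs, advantages, entropy,
                    clip_ratio=0.2, entropy_coef=0.01):
    """Clipped surrogate objective [3] with advantage normalization and entropy bonus."""
    # minibatch-wise advantage normalization
    advantages = (advantages - tf.reduce_mean(advantages)) / (
        tf.math.reduce_std(advantages) + 1e-8)
    # probability ratio pi_new / pi_old, computed in log space for stability
    ratio = tf.exp(new_log_probs - old_log_probs)
    clipped = tf.clip_by_value(ratio, 1.0 - clip_ratio, 1.0 + clip_ratio)
    # maximize the clipped surrogate -> minimize its negative
    surrogate = tf.minimum(ratio * advantages, clipped * advantages)
    # entropy bonus encourages exploration / acts as regularization
    return -tf.reduce_mean(surrogate) - entropy_coef * tf.reduce_mean(entropy)
```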
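
A NumPy sketch of generalized advantage estimation [4] over a single rollout, assuming float arrays of equal length for rewards, values, and dones:

```python
import numpy as np

def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Generalized advantage estimation [4] over one rollout."""
    advantages = np.zeros(len(rewards), dtype=np.float64)
    gae, next_value = 0.0, last_value
    for t in reversed(range(len(rewards))):
        non_terminal = 1.0 - dones[t]               # 0 if the episode ended at step t
        delta = rewards[t] + gamma * next_value * non_terminal - values[t]
        gae = delta + gamma * lam * non_terminal * gae
        advantages[t] = gae
        next_value = values[t]
    returns = advantages + values                   # targets for the value function
    return advantages, returns
```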
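
One possible shape of value-function clipping (#1) and global gradient clipping (#9) from [5]; the function names and the 0.5 max norm are assumptions, not the repository's actual values:

```python
import tensorflow as tf

def clipped_value_loss(values, old_values, returns, clip_range=0.2):
    """Value-function clipping (#1): keep the new value prediction close to the old one."""
    clipped = old_values + tf.clip_by_value(values - old_values, -clip_range, clip_range)
    return tf.reduce_mean(tf.maximum(tf.square(values - returns),
                                     tf.square(clipped - returns)))

def apply_clipped_gradients(tape, loss, model, optimizer, max_norm=0.5):
    """Global (policy) gradient clipping (#9) before the optimizer step."""
    grads = tape.gradient(loss, model.trainable_variables)
    grads, _ = tf.clip_by_global_norm(grads, max_norm)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
```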
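
A simplified running mean/std normalizer with clipping, in the spirit of Stable-Baselines3's VecNormalize [6]; this is a sketch, not that wrapper's actual implementation:

```python
import numpy as np

class RunningNormalizer:
    """Running mean/std with clipping for rewards or observations (simplified)."""
    def __init__(self, shape=(), clip=10.0):
        self.mean = np.zeros(shape)
        self.var = np.ones(shape)
        self.count = 1e-4
        self.clip = clip

    def update(self, batch):
        # parallel mean/variance update over a batch of samples
        b_mean, b_var, b_count = batch.mean(axis=0), batch.var(axis=0), batch.shape[0]
        delta = b_mean - self.mean
        total = self.count + b_count
        new_mean = self.mean + delta * b_count / total
        m2 = (self.var * self.count + b_var * b_count
              + delta ** 2 * self.count * b_count / total)
        self.mean, self.var, self.count = new_mean, m2 / total, total

    def normalize(self, x):
        return np.clip((x - self.mean) / np.sqrt(self.var + 1e-8),
                       -self.clip, self.clip)
```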
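
A sketch of a state-independent, learnable log std and of scaling a squashed action to the environment's bounds; the class and function names are illustrative:

```python
import tensorflow as tf

class DiagGaussianHead(tf.keras.layers.Layer):
    """Diagonal Gaussian policy head with a state-independent, learnable log std."""
    def __init__(self, action_dim):
        super().__init__()
        self.log_std = tf.Variable(tf.zeros(action_dim), trainable=True, name="log_std")

    def call(self, mean):
        std = tf.exp(self.log_std)
        # reparameterized sample from N(mean, std)
        return mean + std * tf.random.normal(tf.shape(mean))

def scale_action(action, low, high):
    """Map an action in [-1, 1] (e.g. a tanh output) to the environment's bounds."""
    return low + 0.5 * (action + 1.0) * (high - low)
```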

tf features

  • saving/loading tf.keras models (see the sketch after this list)

  • TensorBoard integration, logging of:

    • hyperparameters
    • graph + image of model
    • losses, optimizer learning rates
    • environment statistics (rewards, actions, observation histograms)
    • state-independent log std and clip ratio
  • no prints to the terminal; only a progress bar is shown there, everything else goes to TensorBoard

  • configs / GIFs provided for some environments

  • seeds compiled together for reproducibility

  • run_env file that loads a model, runs the environment, and prints the reward (plus a video if possible)

  • force types in parameters

  • references in the code pointing to the optimizations listed above
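
A brief sketch of the tf.keras saving/loading and TensorBoard summary-writer calls behind the features above; the stand-in model, paths, and logged values are illustrative:

```python
import numpy as np
import tensorflow as tf

# tiny stand-in network; the repository's actual models differ
model = tf.keras.Sequential([tf.keras.layers.Dense(2, input_shape=(4,))])

# saving/loading tf.keras models (SavedModel format; path is illustrative)
model.save("checkpoints/demo_model")
restored = tf.keras.models.load_model("checkpoints/demo_model")

# TensorBoard logging through a summary writer
writer = tf.summary.create_file_writer("logs/demo_run")
with writer.as_default():
    tf.summary.scalar("losses/policy", 0.123, step=0)
    tf.summary.histogram("environment/actions", np.random.randn(256), step=0)
```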

Sample Runs

Custom Environments

ContCartpoalEnv: episode scores / steps | ReachingDotEnv: episode scores / steps

Classic Control Environments

CartPole-v1: episode scores / steps | Pendulum-v0: episode scores / steps

SimFramework Environments

ReachEnv-v0: episode scores / steps | ReachEnvRandom-v0: episode scores / steps

Dependencies

  • pip requirements
  • ImageMagick for creating GIFs of environment runs
  • Graphviz for rendering the tf.keras model graph in TensorBoard
  • mujoco_py's offscreen rendering is buggy in gym; to use run_model (GIF generation):
    • adjust mujoco_py.MjRenderContextOffscreen(sim, None, device_id=0) in gym/envs/mujoco/mujoco_env.MujocoEnv._get_viewer(...)

References

  • [1] Basic Policy Gradient Algorithm -> https://lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html
  • [2] A2C architecture -> Mnih, Volodymyr, et al. "Asynchronous methods for deep reinforcement learning." International conference on machine learning. 2016.
  • [3] Basic PPO Agent -> Schulman, John, et al. "Proximal policy optimization algorithms." arXiv preprint arXiv:1707.06347 (2017).
  • [4] GAE -> Schulman, John, et al. "High-dimensional continuous control using generalized advantage estimation." arXiv preprint arXiv:1506.02438 (2015).
  • [5] Common Improvements -> Engstrom, Logan, et al. "Implementation Matters in Deep RL: A Case Study on PPO and TRPO." International Conference on Learning Representations. 2020.
  • [6] Stable-Baselines3 -> Raffin, Antonin, et al. "Stable-Baselines3." GitHub, https://github.com/DLR-RM/stable-baselines3
  • [7] ContCartpoalEnv -> this environment is from Ian Danforth https://gist.github.com/iandanforth/e3ffb67cf3623153e968f2afdfb01dc8