
TensorFlow implementation of Proximal Policy Optimization (reinforcement learning) and its common optimizations. Features TensorBoard integration and plenty of sample runs on custom, classic control and robotics-oriented environments.


Proximal Policy Optimization with state-of-the-art code-level optimizations

Implementation

basic policy gradient algorithm

  • basic policy gradient algorithm; see [1] for a good summary
  • A2C architecture (adding the critic and return-based/simple advantage estimation), see [2] for details
  • minibatch gradient descent with automatic differentiation (i.e. tf.GradientTape); sketched after this list
  • shuffle/permute-only minibatch sampling for MGD (each epoch permutes the rollout and uses every sample exactly once)

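A minimal sketch of the update loop above, assuming hypothetical `actor`/`critic` tf.keras models with a discrete action head and precollected rollout tensors `obs`, `actions`, `returns`:

```python
import tensorflow as tf

def a2c_minibatch_step(actor, critic, optimizer, obs, actions, returns):
    """One minibatch gradient step of the basic policy-gradient/A2C update."""
    with tf.GradientTape() as tape:
        values = tf.squeeze(critic(obs), axis=-1)            # V(s), shape [B]
        advantages = tf.stop_gradient(returns - values)      # simple advantage estimate
        logits = actor(obs)                                  # action scores, shape [B, A]
        log_probs = tf.nn.log_softmax(logits)
        taken = tf.gather(log_probs, actions, batch_dims=1)  # log pi(a_t | s_t)
        policy_loss = -tf.reduce_mean(taken * advantages)
        value_loss = tf.reduce_mean(tf.square(returns - values))
        loss = policy_loss + 0.5 * value_loss                # 0.5 is an assumed value coefficient
    variables = actor.trainable_variables + critic.trainable_variables
    optimizer.apply_gradients(zip(tape.gradient(loss, variables), variables))
    return loss
```

For the shuffle/permute-only sampling, the rollout indices are permuted once per epoch (e.g. with tf.random.shuffle(tf.range(n))) and sliced into minibatches, so every sample is visited exactly once per epoch.
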
proper PPO agent and the most common improvements

  • PPO policy (clipped surrogate loss), see Schulman's paper [3]; sketched after this list
  • generalized advantage estimation, see the GAE paper [4]; also sketched below
  • general improvements found in most implementations, see "Implementation Matters" [5]
    • #1, #9: value function clipping and global (policy) gradient clipping
    • #2, #5, #6, #7: reward/observation scaling and clipping, following Stable-Baselines3's [6] VecNormalize environment
    • #3, #4, #8: orthogonal layer initialization, Adam learning-rate annealing, tanh activations
  • further improvements
    • minibatch-wise advantage normalization
    • weight regularization and an entropy bonus for regularization/exploration
    • state-independent (learnable) log std for the policy variance
  • parallelized environments
  • scaling actions to the environment's valid action range
  • discrete action space agent
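
A sketch of the clipped surrogate objective [3] combined with the value function clipping and global gradient clipping of [5] (#1, #9); clip_eps=0.2 and the 0.5 value coefficient are assumed defaults, not necessarily this repo's:

```python
import tensorflow as tf

def ppo_loss(log_probs, old_log_probs, advantages,
             values, old_values, returns, clip_eps=0.2):
    """Clipped PPO policy loss plus clipped value loss (old_* stored at rollout time)."""
    ratio = tf.exp(log_probs - old_log_probs)          # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = tf.clip_by_value(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -tf.reduce_mean(tf.minimum(unclipped, clipped))

    # #1: keep the new value estimate close to the one seen during the rollout
    v_clip = old_values + tf.clip_by_value(values - old_values, -clip_eps, clip_eps)
    value_loss = tf.reduce_mean(tf.maximum(tf.square(values - returns),
                                           tf.square(v_clip - returns)))
    return policy_loss + 0.5 * value_loss

# #9: clip the global gradient norm before applying the update, e.g.
# grads, _ = tf.clip_by_global_norm(grads, 0.5)
```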

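Generalized advantage estimation and the minibatch-wise advantage normalization could look like the following numpy sketch (gamma/lam are the usual defaults, assumed here):

```python
import numpy as np

def gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """GAE [4] over one rollout; values has length T+1 (bootstrap value appended)."""
    T = len(rewards)
    advantages = np.zeros(T, dtype=np.float32)
    last = 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]                   # no bootstrapping across episode ends
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        last = delta + gamma * lam * nonterminal * last
        advantages[t] = last
    returns = advantages + values[:-1]                 # regression targets for the critic
    return advantages, returns

def normalize(adv, eps=1e-8):
    """Minibatch-wise advantage normalization: zero mean, unit variance per minibatch."""
    return (adv - adv.mean()) / (adv.std() + eps)
```
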
tf features

  • saving/loading tf.keras models

  • tensorboard integration (see the logging sketch after this list), logging of

    • hyperparameters
    • graph + image of the model
    • losses and optimizer learning rates
    • environment (reward, action and observation histograms)
    • state-independent log std and clip ratio
  • no prints in the terminal; only a progress bar, with everything else going to tensorboard

  • configs / GIFs provided for some environments

  • seeds collected in one place for replicability (see the sketch below)

  • run_env file that loads a model, runs the environment and prints the reward (plus video if possible)

  • forced types in parameters

  • in-code pointers to each of the optimizations made
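
Models are saved and restored with the standard tf.keras calls (model.save(path) / tf.keras.models.load_model(path)). The TensorBoard logging boils down to tf.summary writes; a sketch with illustrative tag names:

```python
import tensorflow as tf

writer = tf.summary.create_file_writer("logs/run_0")   # hypothetical log directory

def log_update(step, policy_loss, value_loss, lr, actions, log_std):
    """Write the scalars/histograms listed above for one training update."""
    with writer.as_default():
        tf.summary.scalar("losses/policy", policy_loss, step=step)
        tf.summary.scalar("losses/value", value_loss, step=step)
        tf.summary.scalar("optimizer/learning_rate", lr, step=step)
        tf.summary.histogram("environment/actions", actions, step=step)
        tf.summary.scalar("policy/log_std", tf.reduce_mean(log_std), step=step)
    writer.flush()
```

Hyperparameters and the model image can be written the same way via tf.summary.text and tf.summary.image.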

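Seed collection and the run_env flow might look roughly like this (the env id, checkpoint path and the pre-0.26 gym API are assumptions):

```python
import random
import numpy as np
import tensorflow as tf
import gym

def set_global_seeds(seed: int, env=None):
    """Collect every RNG seed in one place so runs are replicable."""
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)
    if env is not None:
        env.seed(seed)                     # old gym API; newer gym uses reset(seed=...)
        env.action_space.seed(seed)

# run_env-style evaluation: load a saved model, roll out one episode, print the score
env = gym.make("Pendulum-v0")
set_global_seeds(42, env)
actor = tf.keras.models.load_model("checkpoints/actor")  # hypothetical checkpoint path
obs, done, score = env.reset(), False, 0.0
while not done:
    action = actor(obs[None].astype(np.float32))[0].numpy()  # assumes a deterministic head
    obs, reward, done, _ = env.step(action)
    score += reward
print(f"episode reward: {score:.1f}")
```
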
Sample Runs

Custom Environments

ContCartpoalEnv: episode scores / steps | ReachingDotEnv: episode scores / steps

Classic Control Environments

CartPole-v1: episode scores / steps | Pendulum-v0: episode scores / steps

SimFramework Environments

ReachEnv-v0: episode scores / steps | ReachEnvRandom-v0: episode scores / steps

Dependencies

  • pip requirements
  • ImageMagick for creating GIFs of environment runs
  • Graphviz for rendering the tf.keras model graph in TensorBoard
  • mujoco_py's offscreen rendering is buggy in gym; to use run_model (GIF generation):
    • adjust mujoco_py.MjRenderContextOffscreen(sim, None, device_id=0) in gym/envs/mujoco/mujoco_env.MujocoEnv._get_viewer(...)

References

  • [1] Basic Policy Gradient Algorithm -> https://lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html
  • [2] A2C Architecture -> Mnih, Volodymyr, et al. "Asynchronous methods for deep reinforcement learning." International Conference on Machine Learning. 2016.
  • [3] Basic PPO Agent -> Schulman, John, et al. "Proximal policy optimization algorithms." arXiv preprint arXiv:1707.06347 (2017).
  • [4] GAE -> Schulman, John, et al. "High-dimensional continuous control using generalized advantage estimation." arXiv preprint arXiv:1506.02438 (2015).
  • [5] Common Improvements -> Engstrom, Logan, et al. "Implementation Matters in Deep RL: A Case Study on PPO and TRPO." International Conference on Learning Representations. 2020.
  • [6] Stable-Baselines3 -> Raffin, Antonin, et al. "Stable-Baselines3." GitHub, https://github.com/DLR-RM/stable-baselines3
  • [7] ContCartpoalEnv -> continuous-action CartPole environment by Ian Danforth, https://gist.github.com/iandanforth/e3ffb67cf3623153e968f2afdfb01dc8
