A minimal implementation of OpenAI's Proximal Policy Optimization (PPO) algorithm. It learns to swing up a pendulum (the Pendulum environment from OpenAI Gym). There is still room for performance improvement: so far, training runs on a single GPU even if more resources are available.
Usage:
- To train the agent run
python main.py train
- To run an episode (after training) run
python main.py enjoy
The advantage values are calculated using the Generalized Advantage Estimation (GAE) method. When reading through the source code, one might wonder about the use of LinearOperatorToeplitz. This operator lets us compute the GAE values with a single matrix-vector multiplication:
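As a sketch of the idea (a hypothetical helper, not the repository's actual code): the discounted sum that defines GAE can be written as an upper-triangular Toeplitz matrix applied to the vector of TD residuals. The repository uses TensorFlow's tf.linalg.LinearOperatorToeplitz for this; the NumPy version below mirrors the same construction, assuming a terminal value of zero at the end of the rollout.

```python
import numpy as np

def gae_matrix(T, gamma=0.99, lam=0.95):
    """Upper-triangular Toeplitz matrix M with M[t, l] = (gamma*lam)**(l - t) for l >= t."""
    idx = np.arange(T)
    diff = idx[None, :] - idx[:, None]  # column index minus row index
    return np.triu((gamma * lam) ** np.maximum(diff, 0))

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """A_t = sum_l (gamma*lam)**l * delta_{t+l}, computed as one matrix-vector product."""
    # TD residuals: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t); V after the last step is 0 here
    next_values = np.append(values[1:], 0.0)
    deltas = rewards + gamma * next_values - values
    return gae_matrix(len(rewards), gamma, lam) @ deltas
```

The matrix-vector form gives the same result as the usual backward recursion A_t = delta_t + gamma*lam*A_{t+1}, but expresses the whole rollout as a single linear-algebra operation.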