We implement and train a single-agent actor-critic model based on the CNN+LSTM+Actor/Critic architecture.
We perform hyperparameter optimization over four key hyperparameters:
- Temperature
- Learning rate
- Gradient clipping
- Backpropagation method (TBPTT or BPTE)
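To make two of the tuned settings concrete, here is a minimal sketch in plain Python. It assumes that "temperature" refers to the softmax temperature applied to the policy logits before sampling an action, and that gradient clipping means rescaling by global L2 norm; neither detail is confirmed by this summary. The function names `softmax_with_temperature` and `clip_by_global_norm` are hypothetical and are not taken from our code base.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw action scores into probabilities.

    Assumption: temperature divides the logits before the softmax,
    so higher values give a flatter (more exploratory) distribution.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def clip_by_global_norm(grads, max_norm):
    """Rescale a flat list of gradient values so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm == 0.0 or norm <= max_norm:
        return list(grads)  # already within the limit, return an unchanged copy
    scale = max_norm / norm
    return [g * scale for g in grads]


if __name__ == "__main__":
    logits = [1.0, 2.0, 3.0]
    print(softmax_with_temperature(logits, temperature=0.5))  # sharper
    print(softmax_with_temperature(logits, temperature=5.0))  # flatter
    print(clip_by_global_norm([3.0, 4.0], max_norm=1.0))
```

In a PyTorch training loop the clipping step would more likely be a call to `torch.nn.utils.clip_grad_norm_` on the model parameters; the pure-Python version above only illustrates the arithmetic.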
In addition, we benchmark our agent against the A3C implementation by ikostrikov, run with 1, 4, and 16 agents:
https://github.com/ikostrikov/pytorch-a3c
An overview of our results:
Pong
Our agent achieves:
- human performance (9.3) after 4467 episodes
- max performance (18.4) after 56896 episodes
Breakout
Our agent achieves:
- human performance (31.8) after 22201 episodes
- max performance (253.5) after 274616 episodes