# Recurrent Policies for Handling Partially Observable Environments with ReLAx

This repository contains an implementation of the PPO-GAE algorithm with a lagged LSTM policy (and critic) and its comparison against a 0-lag MLP PPO-GAE baseline.
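The lagged LSTM policy conditions the action distribution on a window of recent observations rather than on the current observation alone. Below is a minimal PyTorch sketch of such an actor; the class name, hidden size, and lag-window interface are illustrative assumptions, not the actual ReLAx implementation.

```python
import torch
import torch.nn as nn


class LSTMPolicy(nn.Module):
    """Sketch: LSTM actor that reads a lag window of past observations."""

    def __init__(self, obs_dim, act_dim, hidden_size=64):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, act_dim)

    def forward(self, obs_seq):
        # obs_seq: (batch, lag, obs_dim) -- the last `lag` observations.
        out, _ = self.lstm(obs_seq)
        # Use the hidden state at the final timestep to produce action parameters.
        return self.head(out[:, -1])
```

A 0-lag MLP baseline, by contrast, maps only the current observation to action parameters, so it has no way to integrate information over time.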

To simulate partial observability in a controlled manner, we created a gym.Wrapper that masks elements of the observation array with zeros with probability eps. In our experiments, the degree of partial observability was controlled by varying the eps value.
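A minimal sketch of such a masking wrapper is shown below; the class name, the use of gym.ObservationWrapper, and the default eps are assumptions, and the repository's actual wrapper may differ.

```python
import numpy as np
import gym


class MaskObservationWrapper(gym.ObservationWrapper):
    """Sketch: zero out each observation element independently with probability eps."""

    def __init__(self, env, eps=0.25):
        super().__init__(env)
        self.eps = eps

    def observation(self, obs):
        # Keep each element with probability 1 - eps, zero it out otherwise.
        mask = np.random.random(obs.shape) >= self.eps
        return obs * mask
```

The wrapper can then be applied to any gym environment, e.g. `MaskObservationWrapper(gym.make(...), eps=0.5)`, so the same task can be run at different levels of observability by changing a single parameter.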

Experiment results are shown below:

*(figure: pomdp_comparison)*

As we can see, in the fully observable case (eps=0) the MLP and LSTM policies show roughly the same performance. For a moderate degree of partial observability (eps=0.25), the LSTM policy learns slightly faster in the early stages. For a considerable degree of partial observability (eps=0.5), the LSTM policy performs significantly better than the MLP policy; however, both actors struggled to converge to the asymptotic performance of the fully observable case. For a severe degree of partial observability (eps=0.75), both policies failed to learn.