Implementation of the Learning from a Learner (LfL) paper (http://proceedings.mlr.press/v97/jacq19a/jacq19a.pdf).
To reproduce results for experiment 6.1 (table 1) run
python soft_policy_inversion.py
To reproduce results for experiment 6.1 (table 1) run
python trajectory_spi.py
Paper results were obtained with mujoco_py version 1.50.1.
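Since the paper results are tied to a specific mujoco_py version, it can help to check the installed version before running the experiments. The following is a minimal sketch, not part of this repository. It assumes the package is installed under the distribution name `mujoco-py` and that installed builds carry an extra suffix after the version (for example `1.50.1.68`); `check_version` is a hypothetical helper name.

```python
from importlib import metadata

# Version stated in this README for the paper results.
EXPECTED_MUJOCO_PY = "1.50.1"


def check_version(package="mujoco-py", expected=EXPECTED_MUJOCO_PY,
                  lookup=metadata.version):
    """Return (ok, installed); installed is None if the package is missing.

    The distribution name "mujoco-py" is an assumption. A build suffix after
    the expected version (e.g. "1.50.1.68") is accepted as a match.
    """
    try:
        installed = lookup(package)
    except metadata.PackageNotFoundError:
        return False, None
    ok = installed == expected or installed.startswith(expected + ".")
    return ok, installed
```

Calling `check_version()` returns a pair such as `(False, None)` when the package is not installed, so a mismatch can be reported before any training starts.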
Learning agents are trained via Proximal Policy Optimization (PPO).
The PPO and LfL code uses PyTorch for automatic differentiation.
We adapted the PPO implementation by Ilya Kostrikov, available at https://github.com/ikostrikov/pytorch-a2c-ppo-acktr-gail.
To reproduce results for experiment 6.1:
- Generate learner trajectories by running
python learner.py
- Infer the reward function by running
python lfl.py
- Train the observer with the inferred reward by running
python observer.py
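The three steps above depend on each other in order, so they can be chained in a small driver that stops at the first failure. This is a sketch under assumptions, not a script shipped with the repository: it assumes each script runs without arguments from the repository root, and `run_pipeline` is a hypothetical helper name.

```python
import subprocess
import sys

# Script names come from the steps listed above. Running them with the
# current interpreter and no arguments is an assumption.
STEPS = [
    ("generate learner trajectories", "learner.py"),
    ("infer the reward function", "lfl.py"),
    ("train the observer with the inferred reward", "observer.py"),
]


def run_pipeline(steps=STEPS, run=subprocess.run):
    """Run each script in order; stop at the first non-zero exit code.

    Returns (completed_scripts, exit_code).
    """
    completed = []
    for description, script in steps:
        print(f"==> {description}: python {script}")
        result = run([sys.executable, script])
        if result.returncode != 0:
            print(f"{script} exited with code {result.returncode}; stopping.")
            return completed, result.returncode
        completed.append(script)
    return completed, 0
```

Calling `run_pipeline()` from the repository root runs the learner, the reward inference, and the observer in sequence. Stopping on the first failure avoids training the observer on a reward that was never inferred.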