This is a simple implementation of the Muesli algorithm. Muesli matches MuZero's performance and network architecture, but it can be trained without MCTS lookahead search, using only a one-step lookahead. This significantly reduces computational cost compared to MuZero.
Paper : Muesli: Combining Improvements in Policy Optimization, Hessel et al., 2021 (v2 version)
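To make the one-step lookahead concrete, here is a minimal sketch of how the clipped-MPO (CMPO) policy target can be formed from one-step model estimates instead of an MCTS search tree. This is not the code in this repository; the tensor names (`pi_prior_logits`, `q_values`, `value`) and the omission of the paper's advantage normalization are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def cmpo_policy_target(pi_prior_logits, q_values, value, clip_c=1.0):
    """Sketch of Muesli's one-step CMPO policy target (illustrative only).

    pi_prior_logits: (B, A) policy logits from the target ("prior") network
    q_values:        (B, A) one-step lookahead action values, e.g.
                     q(s, a) = r_hat(s, a) + gamma * v_hat(s'), using the model
    value:           (B, 1) state-value baseline v(s)
    clip_c:          advantage clipping threshold (the paper clips to [-1, 1])
    """
    # Advantages from the one-step lookahead; the paper also normalizes them
    # by a running estimate of their variance before clipping (omitted here).
    adv = torch.clamp(q_values - value, -clip_c, clip_c)
    # pi_cmpo(a|s) is proportional to pi_prior(a|s) * exp(clipped advantage)
    log_pi_prior = F.log_softmax(pi_prior_logits, dim=-1)
    return torch.softmax(log_pi_prior + adv, dim=-1)
```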
This repository will be developed as part of collaborative research with UdeM. Thanks for making this a great experience, and I hope this work will be useful for further progress. This codebase needs the hands of many talented contributors. Please feel free to contribute and get in touch!
The goal is to build a distributed Muesli algorithm for large-scale training that can be integrated with the works below:
https://github.com/AGI-Collective/mini_ada
https://github.com/AGI-Collective/u3
https://github.com/Farama-Foundation/Minigrid
We are also considering https://github.com/kakaobrain/brain-agent for distributed reinforcement learning.
- Install Docker
- Download the Dockerfile
- Build the Docker image:
docker build --build-arg git_config_name="your_git_name" --build-arg git_config_email="your_git_email" --build-arg CACHEBUST=$(date +%s) -t muesli_image .
- Run the Docker image (adjust options for your device configuration):
docker run --gpus all -p 8888:8888 -p 8080:8080 -p 6006:6006 -p 6007:6007 -p 6008:6008 --name mu --rm -it muesli_image
- Copy the JupyterLab token (to leave the container running in the background, press Ctrl+P then Ctrl+Q to detach)
- Log in to JupyterLab through a browser or JupyterLab Desktop using the token:
http://your_local_or_server_ip:8888
- Launch the HPO experiment with NNI (in the JupyterLab terminal):
nnictl create -f --config config.yml
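The repository ships its own `config.yml`; purely as an illustration of what an NNI v2-style experiment configuration for this setup could look like (the experiment name, search-space keys, trial command, and trial counts below are assumptions, not the contents of the actual file):

```yaml
# Hypothetical NNI experiment config sketch; the real config.yml in this repo may differ.
experimentName: muesli_hpo
searchSpace:                      # assumed hyperparameters, for illustration only
  learning_rate:
    _type: loguniform
    _value: [0.0001, 0.01]
  unroll_steps:
    _type: choice
    _value: [3, 5]
trialCommand: python Muesli_code.py
trialCodeDirectory: .
trialConcurrency: 1
maxTrialNumber: 20
tuner:
  name: TPE
  classArgs:
    optimize_mode: maximize
trainingService:
  platform: local
  useActiveGpu: true
experimentWorkingDirectory: ./nni-experiments
```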
- Access the NNI web UI through a browser:
http://your_local_or_server_ip:8080
- Launch TensorBoard in the JupyterLab terminal (open one more bash terminal):
tensorboard --logdir ./nni-experiments/_latest/trials --bind_all
(this shows every experiment's TensorBoard logs on one page)
- Access TensorBoard through a browser:
http://your_local_or_server_ip:6006
- jupyterlab-git and jupyter-collaboration are installed.
- The code is cloned into the container at build time and will be removed when the container is closed.
- If you want to use a bash shell in JupyterLab, just type `bash` and press Enter in the default terminal.
- You can see experiments on the 'Trials detail' tab, and view hyperparameters using the Add/Remove columns button.
- NOTE: the hyperparameters displayed on the NNI page can be mismatched with the experiments, so using the TensorBoard HPARAMS tab is recommended.
- (NNI's log_dir has been changed to fix an issue with launching TensorBoard)
- Launch TensorBoard through NNI: click the checkbox to the left of the trial number and click the TensorBoard button.
- Or use
tensorboard --logdir ./nni-experiments/_latest/trials --bind_all
to check every experiment's logs.
- About TensorBoard image slider precision:
- TensorBoard uses reservoir sampling, so some images in an episode can be skipped. If you want to step through rendered images more precisely, launch TensorBoard manually with this command:
tensorboard --logdir . --samples_per_plugin images=100 --bind_all
(directory: nni-experiments/_latest/trials/your_trial_ID/output/tensorboard; the exact path can be checked in the terminal output)
- To debug the training script with pdb, run:
python -m pdb Muesli_code.py --debug
Previous README.md
This is a simple implementation of the Muesli algorithm. Muesli matches MuZero's performance and network architecture, but it can be trained without MCTS lookahead search, using only a one-step lookahead. This significantly reduces computational cost compared to MuZero.
Paper : Muesli: Combining Improvements in Policy Optimization, Hessel et al., 2021 (v2 version)
You can run this code via the Colab demo link, train the agent, monitor it with TensorBoard, and play the LunarLander-v2 environment with the trained network. The agent can solve LunarLander-v2 within 1-2 hours on the Google Colab CPU backend, reaching an average score above 250.
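As a rough illustration (not the notebook's actual code), playing one episode of LunarLander-v2 with a trained policy could look like the sketch below. It uses the classic Gym API from the linked documentation, and `policy_net` is a placeholder; the real agent stacks 8 observations, which is omitted here.

```python
import gym
import torch

def play_episode(policy_net):
    """Placeholder rollout loop; policy_net(obs) -> action logits is assumed."""
    env = gym.make("LunarLander-v2")
    obs = env.reset()
    done, total_reward = False, 0.0
    while not done:
        with torch.no_grad():
            logits = policy_net(torch.as_tensor(obs, dtype=torch.float32))
        action = int(torch.argmax(logits).item())  # greedy action from the policy head
        obs, reward, done, _ = env.step(action)
        total_reward += reward
    env.close()
    return total_reward
```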
- MuZero network
- 5 step unroll
- L_pg+cmpo
- L_v
- L_r
- L_m (5 step)
- Stacking 8 observations
- Mini-batch update
- Hidden state scaled within [-1,1]
- Gradient clipping by value [-1,1]
- Dynamics network gradient scale 1/2
- Target network (prior parameters) moving average update
- Categorical representation (value and reward model; see the sketch after this list)
- Normalized advantage
- Tensorboard monitoring
- Retrace estimator
- CNN representation network
- LSTM dynamics network
- Atari environment
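For the categorical representation item above, here is a minimal sketch of the two-hot encoding commonly used for value and reward targets in MuZero-style implementations. The support size and the absence of the invertible value transform are assumptions; the repository's actual encoding may differ.

```python
import torch

def scalar_to_two_hot(x, support_size=30):
    """Sketch: encode scalar targets (shape (B,)) as two-hot vectors over the
    integer support [-support_size, support_size]. Illustrative only."""
    x = x.clamp(-support_size, support_size)
    low = x.floor()
    prob_high = x - low                        # weight assigned to the upper bin
    idx_low = (low + support_size).long()      # shift support into [0, 2*support_size]
    idx_high = (idx_low + 1).clamp(max=2 * support_size)
    target = torch.zeros(*x.shape, 2 * support_size + 1, device=x.device)
    target.scatter_(-1, idx_low.unsqueeze(-1), (1.0 - prob_high).unsqueeze(-1))
    target.scatter_add_(-1, idx_high.unsqueeze(-1), prob_high.unsqueeze(-1))
    return target

# The predicted scalar is recovered as the expectation over the support:
# value = (softmax(logits) * torch.arange(-support_size, support_size + 1)).sum(-1)
```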
- Self-play uses the agent network (originally the target network).
- Target network 1-step unroll: used when calculating v_pi_prior(s) and the second term of L_pg+cmpo.
- 5-step unroll (agent network): the agent network is unrolled for optimization.
- 1-step unrolls for L_m (target network): used when calculating pi_cmpo for L_m (see the sketch below).
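As a pseudocode-level sketch of how the L_m term in this scheme could be implemented (the function and tensor names are placeholders, not this repository's API): for each unroll step k, the target network's 1-step lookahead gives a CMPO policy target, and the loss is the cross-entropy (equivalently, KL up to a constant) against the agent network's policy predicted from the unrolled hidden state.

```python
import torch.nn.functional as F

def l_m_step(agent_policy_logits_k, cmpo_target_k):
    """Sketch of one unroll step of L_m (illustrative; names are placeholders).

    agent_policy_logits_k: (B, A) policy logits predicted by the agent network
                           after unrolling k steps through the dynamics network
    cmpo_target_k:         (B, A) CMPO policy target computed with the target
                           network's 1-step lookahead (see the sketch near the top)
    """
    log_pi_agent = F.log_softmax(agent_policy_logits_k, dim=-1)
    # Cross-entropy H(pi_cmpo, pi_agent) = KL(pi_cmpo || pi_agent) + const
    return -(cmpo_target_k * log_pi_agent).sum(dim=-1).mean()
```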
Figures (see repository): score graph, loss graph, LunarLander play length and last rewards, variance variables of advantage normalization.
Need your help! Contributions, advice, and questions are all welcome.
Contact : emtgit2@gmail.com (Available languages : English, Korean)
Author's presentation : https://icml.cc/virtual/2021/poster/10769
Lunarlander-v2 env document : https://www.gymlibrary.dev/environments/box2d/lunar_lander/