This is a simple implementation of the Muesli algorithm. Muesli matches MuZero's performance and network architecture, but it can be trained without MCTS lookahead search, using only a one-step lookahead. This significantly reduces computational cost compared to MuZero.
Paper : Muesli: Combining Improvements in Policy Optimization, Hessel et al., 2021 (v2 version)
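To make the one-step lookahead concrete, here is a minimal sketch of how the clipped-MPO (CMPO) policy target can be formed from one-step model estimates instead of an MCTS search tree. This is not the code in this repository; the tensor names (`pi_prior_logits`, `q_values`, `value`) and the omission of the paper's advantage normalization are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def cmpo_policy_target(pi_prior_logits, q_values, value, clip_c=1.0):
    """Sketch of Muesli's one-step CMPO policy target (illustrative only).

    pi_prior_logits: (B, A) policy logits from the target ("prior") network
    q_values:        (B, A) one-step lookahead action values, e.g.
                     q(s, a) = r_hat(s, a) + gamma * v_hat(s'), using the model
    value:           (B, 1) state-value baseline v(s)
    clip_c:          advantage clipping threshold (the paper clips to [-1, 1])
    """
    # Advantages from the one-step lookahead; the paper also normalizes them
    # by a running estimate of their variance before clipping (omitted here).
    adv = torch.clamp(q_values - value, -clip_c, clip_c)
    # pi_cmpo(a|s) is proportional to pi_prior(a|s) * exp(clipped advantage)
    log_pi_prior = F.log_softmax(pi_prior_logits, dim=-1)
    return torch.softmax(log_pi_prior + adv, dim=-1)
```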
This repository will be developed as part of collaborative research with UdeM. Thanks for making this a great experience, and I hope this work will be useful for further progress. This codebase needs the hands of many talented contributors. Please feel free to contribute and get in touch!
The goal is to build a distributed Muesli algorithm for large-scale training that can be integrated with the works below:
https://github.com/AGI-Collective/mini_ada
https://github.com/AGI-Collective/u3
https://github.com/Farama-Foundation/Minigrid
We are also considering https://github.com/kakaobrain/brain-agent for distributed reinforcement learning.
- Install Docker
- Download the Dockerfile
- Build the Docker image:
docker build --build-arg git_config_name="your_git_name" --build-arg git_config_email="your_git_email" --build-arg CACHEBUST=$(date +%s) -t muesli_image .
- Run the Docker image (adjust options for your device configuration):
docker run --gpus all -p 8888:8888 -p 8080:8080 -p 6006:6006 -p 6007:6007 -p 6008:6008 --name mu --rm -it muesli_image
- Copy the JupyterLab token (to leave the container running in the background, press Ctrl+P then Ctrl+Q to detach)
- Log in to JupyterLab through a browser or JupyterLab Desktop using the token:
http://your_local_or_server_ip:8888
- Launch the HPO experiment with NNI (in the JupyterLab terminal):
nnictl create -f --config config.yml
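The repository ships its own `config.yml`; purely as an illustration of what an NNI v2-style experiment configuration for this setup could look like (the experiment name, search-space keys, trial command, and trial counts below are assumptions, not the contents of the actual file):

```yaml
# Hypothetical NNI experiment config sketch; the real config.yml in this repo may differ.
experimentName: muesli_hpo
searchSpace:                      # assumed hyperparameters, for illustration only
  learning_rate:
    _type: loguniform
    _value: [0.0001, 0.01]
  unroll_steps:
    _type: choice
    _value: [3, 5]
trialCommand: python Muesli_code.py
trialCodeDirectory: .
trialConcurrency: 1
maxTrialNumber: 20
tuner:
  name: TPE
  classArgs:
    optimize_mode: maximize
trainingService:
  platform: local
  useActiveGpu: true
experimentWorkingDirectory: ./nni-experiments
```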
- Access the NNI web UI through a browser:
http://your_local_or_server_ip:8080
- Launch TensorBoard in the JupyterLab terminal (open one more bash terminal):
tensorboard --logdir ./nni-experiments/_latest/trials --bind_all
(this shows every experiment's TensorBoard logs on one page)
- Access TensorBoard through a browser:
http://your_local_or_server_ip:6006
- jupyterlab-git and jupyter-collaboration are installed.
- The code is cloned into the container at build time and will be removed when the container is closed.
- If you want to use a bash shell in JupyterLab, just type `bash` and press Enter in the default terminal.
- You can see experiments on the 'Trials detail' tab, and view hyperparameters using the Add/Remove columns button.
- NOTE: the hyperparameters displayed on the NNI page can be mismatched with the experiments, so using the TensorBoard HPARAMS tab is recommended.
- (NNI's log_dir has been changed to fix an issue with launching TensorBoard)
- Launch TensorBoard through NNI: click the checkbox to the left of the trial number and click the TensorBoard button.
- Or use
tensorboard --logdir ./nni-experiments/_latest/trials --bind_all
to check every experiment's logs.
- About TensorBoard image slider precision:
- TensorBoard uses reservoir sampling, so some images in an episode can be skipped. If you want to step through rendered images more precisely, launch TensorBoard manually with this command:
tensorboard --logdir . --samples_per_plugin images=100 --bind_all
(directory: nni-experiments/_latest/trials/your_trial_ID/output/tensorboard; the exact path can be checked in the terminal output)
- To debug the training script with pdb, run:
python -m pdb Muesli_code.py --debug
Previous README.md
This is a simple implementation of the Muesli algorithm. Muesli matches MuZero's performance and network architecture, but it can be trained without MCTS lookahead search, using only a one-step lookahead. This significantly reduces computational cost compared to MuZero.
Paper : Muesli: Combining Improvements in Policy Optimization, Hessel et al., 2021 (v2 version)
You can run this code via the Colab demo link, train the agent, monitor it with TensorBoard, and play the LunarLander-v2 environment with the trained network. The agent can solve LunarLander-v2 within 1-2 hours on the Google Colab CPU backend, reaching an average score above 250.
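As a rough illustration (not the notebook's actual code), playing one episode of LunarLander-v2 with a trained policy could look like the sketch below. It uses the classic Gym API from the linked documentation, and `policy_net` is a placeholder; the real agent stacks 8 observations, which is omitted here.

```python
import gym
import torch

def play_episode(policy_net):
    """Placeholder rollout loop; policy_net(obs) -> action logits is assumed."""
    env = gym.make("LunarLander-v2")
    obs = env.reset()
    done, total_reward = False, 0.0
    while not done:
        with torch.no_grad():
            logits = policy_net(torch.as_tensor(obs, dtype=torch.float32))
        action = int(torch.argmax(logits).item())  # greedy action from the policy head
        obs, reward, done, _ = env.step(action)
        total_reward += reward
    env.close()
    return total_reward
```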
- MuZero network
- 5 step unroll
- L_pg+cmpo
- L_v
- L_r
- L_m (5 step)
- Stacking 8 observations
- Mini-batch update
- Hidden state scaled within [-1,1]
- Gradient clipping by value [-1,1]
- Dynamics network gradient scale 1/2
- Target network (prior parameters) moving average update
- Categorical representation (value and reward model; see the sketch after this list)
- Normalized advantage
- Tensorboard monitoring
- Retrace estimator
- CNN representation network
- LSTM dynamics network
- Atari environment
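For the categorical representation item above, here is a minimal sketch of the two-hot encoding commonly used for value and reward targets in MuZero-style implementations. The support size and the absence of the invertible value transform are assumptions; the repository's actual encoding may differ.

```python
import torch

def scalar_to_two_hot(x, support_size=30):
    """Sketch: encode scalar targets (shape (B,)) as two-hot vectors over the
    integer support [-support_size, support_size]. Illustrative only."""
    x = x.clamp(-support_size, support_size)
    low = x.floor()
    prob_high = x - low                        # weight assigned to the upper bin
    idx_low = (low + support_size).long()      # shift support into [0, 2*support_size]
    idx_high = (idx_low + 1).clamp(max=2 * support_size)
    target = torch.zeros(*x.shape, 2 * support_size + 1, device=x.device)
    target.scatter_(-1, idx_low.unsqueeze(-1), (1.0 - prob_high).unsqueeze(-1))
    target.scatter_add_(-1, idx_high.unsqueeze(-1), prob_high.unsqueeze(-1))
    return target

# The predicted scalar is recovered as the expectation over the support:
# value = (softmax(logits) * torch.arange(-support_size, support_size + 1)).sum(-1)
```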
- Self-play uses the agent network (originally the target network).
- Target network 1-step unroll: used when calculating v_pi_prior(s) and the second term of L_pg+cmpo.
- 5-step unroll (agent network): the agent network is unrolled for optimization.
- 1-step unrolls for L_m (target network): used when calculating pi_cmpo for L_m (see the sketch below).
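As a pseudocode-level sketch of how the L_m term in this scheme could be implemented (the function and tensor names are placeholders, not this repository's API): for each unroll step k, the target network's 1-step lookahead gives a CMPO policy target, and the loss is the cross-entropy (equivalently, KL up to a constant) against the agent network's policy predicted from the unrolled hidden state.

```python
import torch.nn.functional as F

def l_m_step(agent_policy_logits_k, cmpo_target_k):
    """Sketch of one unroll step of L_m (illustrative; names are placeholders).

    agent_policy_logits_k: (B, A) policy logits predicted by the agent network
                           after unrolling k steps through the dynamics network
    cmpo_target_k:         (B, A) CMPO policy target computed with the target
                           network's 1-step lookahead (see the sketch near the top)
    """
    log_pi_agent = F.log_softmax(agent_policy_logits_k, dim=-1)
    # Cross-entropy H(pi_cmpo, pi_agent) = KL(pi_cmpo || pi_agent) + const
    return -(cmpo_target_k * log_pi_agent).sum(dim=-1).mean()
```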
Figures (see repository): score graph, loss graph, LunarLander play length and last rewards, variance variables of advantage normalization.
Need your help! Contributions, advice, and questions are all welcome.
Contact : emtgit2@gmail.com (Available languages : English, Korean)
Author's presentation : https://icml.cc/virtual/2021/poster/10769
Lunarlander-v2 env document : https://www.gymlibrary.dev/environments/box2d/lunar_lander/