Merge pull request #181 from edbeeching/multiagent_experimental
Ready for testing 🧪 Multi-policy training support
Ivan-267 authored May 15, 2024
2 parents d07845a + 26532d4 commit 39852ac
Showing 9 changed files with 507 additions and 107 deletions.
6 changes: 6 additions & 0 deletions docs/ADV_RLLIB.md
@@ -2,7 +2,13 @@

[RLlib](https://docs.ray.io/en/latest/rllib/index.html) is an open-source library for reinforcement learning (RL), offering support for production-level, highly distributed RL workloads while maintaining unified and simple APIs for a large variety of industry applications. Whether you would like to train your agents in a multi-agent setup, purely from offline (historic) datasets, or using externally connected simulators, RLlib offers a simple solution for each of your decision making needs.

## Usage with the RLlib example (Recommended)

The updated [RLlib example](https://github.com/edbeeching/godot_rl_agents/blob/main/examples/rllib_example.py) script supports training environments with a single policy or with multiple different policies.
The installation process for the new example is slightly different; it is described in the [training multiple policies](https://github.com/edbeeching/godot_rl_agents/blob/main/docs/TRAINING_MULTIPLE_POLICIES.md) guide.

## Installation
**Below is the older usage process; please refer to the previous section for the recommended usage.**

If you want to train with RLlib, create a new environment, e.g. `python -m venv venv.rllib`, as RLlib's dependencies can conflict with those of SB3 and other libraries.
Due to a version clash with gymnasium, stable-baselines3 must be uninstalled before installing RLlib.
57 changes: 57 additions & 0 deletions docs/TRAINING_MULTIPLE_POLICIES.md
@@ -0,0 +1,57 @@
This is a brief guide on training multiple policies, focusing specifically on RLlib. If you don't need agents with different action/observation spaces, you might also consider Sample Factory (fully supported on Linux); for simpler multi-agent envs, SB3 may work using a single policy shared by all agents.

## Installation and configuration:

### Install dependencies:

`pip install https://github.com/edbeeching/godot_rl_agents/archive/refs/heads/main.zip` (to get the latest version)

`pip install ray[rllib]`

`pip install PettingZoo`
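
Optionally, you can sanity-check the installation with a short Python snippet (a minimal sketch; the imports mirror the example script used below):

```python
# Optional: quick check that the installed packages can be imported.
import pettingzoo
import ray

from godot_rl.core.godot_env import GodotEnv

print("ray:", ray.__version__)
print("pettingzoo:", pettingzoo.__version__)
print("default Godot RL port:", GodotEnv.DEFAULT_PORT)
```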

### Download the example script and config file:

From https://github.com/edbeeching/godot_rl_agents/tree/main/examples, you will need `rllib_example.py` and `rllib_config.yaml`.

### Open the config file:

If your env has multiple different policies you wish to train (explained below), set `env_is_multiagent: true`; otherwise keep it `false`.

Change `env_path: null # Set your env path here (exported executable from Godot) - e.g. 'env_path.exe' on Windows` to point to your exported env from Godot. In-editor training with this script is not recommended: it launches the env multiple times (to get info about the different policy names, to train, and to export to ONNX after training), so while possible, you would need to press `Play` in the Godot editor multiple times during the process.

You can also adjust the stopping criteria (by default training stops after a set amount of time via `time_total_s`) and other settings.
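
For reference, the example script consumes these settings roughly as in the sketch below (this mirrors the `rllib_example.py` shown later in this diff; the config filename is the default one):

```python
# Minimal sketch of how rllib_example.py reads rllib_config.yaml.
import yaml

with open("rllib_config.yaml") as f:  # or the path passed via --config_file
    exp = yaml.safe_load(f)

is_multiagent = exp["env_is_multiagent"]  # true -> one policy trained per policy name
env_path = exp["config"]["env_config"]["env_path"]  # your exported Godot executable
stop_criteria = exp["stop"]  # e.g. {"time_total_s": ...}

print(is_multiagent, env_path, stop_criteria)
```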

## Configuring and exporting the Godot Env:

### Multi-policy env design differences:

When you set `env_is_multiagent` to `true` and one agent (AIController) has set `done = true`, that agent will receive zero-valued actions until all agents have set `done = true` at least once during that episode. At that point RLlib considers the episode done for all agents, sends a reset signal (which sets `needs_reset = true` in each AIController), and displays the episode rewards in the stats.

If you notice individual agents standing still or behaving oddly (depending on what zero-valued actions do in your game), it's possible that some agents set `done = true` earlier in the episode while others are still active.

In the example env, a training manager script sets `done = true` for all agents at the same time after a fixed number of steps, and we ignore the `needs_reset = true` signal because we manually reset all agents once the episode is done. Alternatively, you could handle resetting agents in your env whenever `needs_reset` is set to `true` (keep in mind that AIControllers also set it to `true` automatically after `reset_after` steps; you can override that behavior if needed).

**The behavior described above is different from setting `env_is_multiagent` to `false`, or e.g. using the [SB3 example to train](https://github.com/edbeeching/godot_rl_agents/blob/main/docs/ADV_STABLE_BASELINES_3.md)**, in which case a single policy is trained as a vectorized environment: each agent can have its own episode length and will continue to receive actions even after setting `done = true`, as agents are expected to auto-reset within the env itself (the reset needs to be implemented in Godot, as in the example envs).
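
As a toy illustration only (this is not GodotRL or RLlib API; the agent ids, dictionaries, and action size below are made up), the two modes described above can be sketched like this:

```python
# Toy sketch of the episode handling described above (illustration only).
import numpy as np


def multiagent_actions(actions: dict, already_done: dict, action_size: int = 2) -> dict:
    # env_is_multiagent: true -- an agent that has already reported done keeps
    # receiving zero-valued actions until every agent has finished.
    return {
        agent: np.zeros(action_size) if already_done[agent] else act
        for agent, act in actions.items()
    }


def shared_episode_over(already_done: dict) -> bool:
    # The shared episode ends (and the reset signal is sent) only once all agents are done.
    return all(already_done.values())


actions = {0: np.array([0.3, -0.1]), 1: np.array([0.8, 0.2])}
print(multiagent_actions(actions, already_done={0: True, 1: False}))  # agent 0 gets zeros
print(shared_episode_over({0: True, 1: False}))  # False -> the shared episode continues

# With env_is_multiagent: false, the env is treated as vectorized instead: each agent
# keeps its own episode, keeps receiving real actions after done = true, and is
# expected to auto-reset inside the Godot env.
```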

### Setting policy names:
For each AIController, you can set a different policy name in Godot. Policies will be assigned to agents based on this name. E.g. if you have 10 agents assigned to `policy1`, they will all use policy 1, and if you have one agent assigned to `policy2`, it will use policy 2.

![setting-policy-names](https://github.com/edbeeching/godot_rl_agents/assets/61947090/13eb9b46-f7fb-467c-ad16-8609cda9f292)

Screenshot from [MultiAgent Simple env](https://github.com/edbeeching/godot_rl_agents_examples/tree/main/examples/MultiAgentSimple).

> [!IMPORTANT]
> All agents that have the same policy name must have the same observation and action space.
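
For reference, the example script (shown later in this diff) turns the policy names reported by the Godot env into RLlib's multi-agent config roughly as in the sketch below; the `policy1`/`policy2` names and the hard-coded list are placeholders, as in the real script `policy_names` comes from a temporary `GDRLPettingZooEnv`:

```python
# Sketch of how policy names set on AIControllers become RLlib policies
# (see rllib_example.py later in this diff for the full version).
from ray.rllib.policy.policy import PolicySpec

# One entry per agent, as reported by the Godot env (placeholder values).
policy_names = ["policy1", "policy1", "policy2"]


def policy_mapping_fn(agent_id, episode, worker, **kwargs) -> str:
    # Each agent trains the policy whose name is set on its AIController.
    return policy_names[agent_id]


multiagent_config = {
    "policies": {name: PolicySpec() for name in set(policy_names)},
    "policy_mapping_fn": policy_mapping_fn,
}
```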

## Training:

After installing the prerequisites and adjusting the config, start training by running `python rllib_example.py` from your conda env/venv (in the folder containing the script).
RLlib will print useful info to the console, such as the command to start `Tensorboard` to view the training logs for the session.
ONNX files are exported automatically once training is done, and their paths are printed near the bottom of the console log (you can also stop mid-training with `CTRL+C`, but if you press it twice in a row, saving/exporting will be skipped).
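
If you want to sanity-check an exported model, a minimal sketch using `onnxruntime` (an extra dependency, not installed above; the path below is a placeholder, use the `onnx_export` path printed at the end of training) could look like this:

```python
# Optional: inspect an exported policy with onnxruntime (pip install onnxruntime).
import onnxruntime as ort

# Placeholder path: point this at the exported .onnx file inside the printed onnx_export folder.
session = ort.InferenceSession("logs/rllib/<run>/onnx_export/policy1_onnx/model.onnx")

for inp in session.get_inputs():
    print("input:", inp.name, inp.shape)
for out in session.get_outputs():
    print("output:", out.name, out.shape)
```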

For an example of a multi-policy env with 2 policies, check out the [MultiAgent Simple env](https://github.com/edbeeching/godot_rl_agents_examples/tree/main/examples/MultiAgentSimple).

Additional arguments:
- You can change the folder for logging, checkpoints, and ONNX files with `--experiment_dir [experiment_path]`,
- You can resume a stopped session with `--restore [resume_path]` (RLlib prints the path to resume in the console if you stop training),
- You can set the config file location with `--config_file [path_to_config.yaml]` (the default is `rllib_config.yaml`).
60 changes: 60 additions & 0 deletions examples/rllib_config.yaml
@@ -0,0 +1,60 @@
algorithm: PPO

# Multi-agent-env setting:
# If true:
# - Any AIController with done = true will receive zeroes as action values until all AIControllers are done; the episode ends at that point.
# - ai_controller.needs_reset will also be set to true every time a new episode begins (but you can ignore it in your env if needed).
# If false:
# - AIControllers auto-reset in Godot and will receive actions after setting done = true.
# - Each AIController has its own episodes that can end/reset at any point.
# Set this to false if you use a single policy name for all agents (set in their AIControllers).
env_is_multiagent: false

checkpoint_frequency: 20

# You can set one or more stopping criteria
stop:
#episode_reward_mean: 0
#training_iteration: 1000
#timesteps_total: 10000
time_total_s: 10000000

config:
env: godot
env_config:
env_path: null # Set your env path here (exported executable from Godot) - e.g. env_path: 'env_path.exe' on Windows
    action_repeat: null # Doesn't need to be set here; you can also set this on the Sync node in the Godot editor
    show_window: true # Displays the game window while training. Might be faster when false in some cases; turning it off also reduces GPU usage if you don't need rendering.
speedup: 30 # Speeds up Godot physics

framework: torch # ONNX models exported with torch are compatible with the current Godot RL Agents Plugin

lr: 0.0003
lambda: 0.95
gamma: 0.99

vf_loss_coeff: 0.5
vf_clip_param: .inf
#clip_param: 0.2
entropy_coeff: 0.0001
entropy_coeff_schedule: null
#grad_clip: 0.5

normalize_actions: False
  clip_actions: True # During ONNX inference we simply clip the actions to the [-1.0, 1.0] range; set here to match

rollout_fragment_length: 32
sgd_minibatch_size: 128
num_workers: 4
num_envs_per_worker: 1 # This will be set automatically if not multi-agent. If multi-agent, changing this changes how many envs to launch per worker.
  # The value below needs changing per env.
  # A basic calculation is rollout_fragment_length * num_workers * num_envs_per_worker
  # (num_envs_per_worker equals the number of AIControllers in the env if not multi-agent, otherwise the value you set above).
train_batch_size: 2048

num_sgd_iter: 4
batch_mode: truncate_episodes

num_gpus: 0
model:
vf_share_layers: False
fcnet_hiddens: [64, 64]
116 changes: 116 additions & 0 deletions examples/rllib_example.py
@@ -0,0 +1,116 @@
# Rllib Example for single and multi-agent training for GodotRL with onnx export,
# needs rllib_config.yaml in the same folder or --config_file argument specified to work.

import argparse
import os
import pathlib

import ray
import yaml
from ray import train, tune
from ray.rllib.algorithms.algorithm import Algorithm
from ray.rllib.env.wrappers.pettingzoo_env import ParallelPettingZooEnv
from ray.rllib.policy.policy import PolicySpec

from godot_rl.core.godot_env import GodotEnv
from godot_rl.wrappers.petting_zoo_wrapper import GDRLPettingZooEnv
from godot_rl.wrappers.ray_wrapper import RayVectorGodotEnv

if __name__ == "__main__":
parser = argparse.ArgumentParser(allow_abbrev=False)
parser.add_argument("--config_file", default="rllib_config.yaml", type=str, help="The yaml config file")
parser.add_argument("--restore", default=None, type=str, help="the location of a checkpoint to restore from")
parser.add_argument(
"--experiment_dir",
default="logs/rllib",
type=str,
        help="The name of the experiment directory, used to store logs.",
)
args, extras = parser.parse_known_args()

# Get config from file
with open(args.config_file) as f:
exp = yaml.safe_load(f)

is_multiagent = exp["env_is_multiagent"]

# Register env
env_name = "godot"
env_wrapper = None

def env_creator(env_config):
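        # Each (worker, vector) pair gets a unique index so every Godot env instance connects on its own port.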
index = env_config.worker_index * exp["config"]["num_envs_per_worker"] + env_config.vector_index
port = index + GodotEnv.DEFAULT_PORT
seed = index
if is_multiagent:
return ParallelPettingZooEnv(GDRLPettingZooEnv(config=env_config, port=port, seed=seed))
else:
return RayVectorGodotEnv(config=env_config, port=port, seed=seed)

tune.register_env(env_name, env_creator)

policy_names = None
num_envs = None
tmp_env = None

if is_multiagent: # Make temp env to get info needed for multi-agent training config
print("Starting a temporary multi-agent env to get the policy names")
tmp_env = GDRLPettingZooEnv(config=exp["config"]["env_config"], show_window=False)
policy_names = tmp_env.agent_policy_names
print("Policy names for each Agent (AIController) set in the Godot Environment", policy_names)
else: # Make temp env to get info needed for setting num_workers training config
print("Starting a temporary env to get the number of envs and auto-set the num_envs_per_worker config value")
tmp_env = GodotEnv(env_path=exp["config"]["env_config"]["env_path"], show_window=False)
num_envs = tmp_env.num_envs

tmp_env.close()

def policy_mapping_fn(agent_id: int, episode, worker, **kwargs) -> str:
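        # agent_id indexes into the policy names read from the Godot env, so each agent uses the policy set on its AIController.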
return policy_names[agent_id]

ray.init(_temp_dir=os.path.abspath(args.experiment_dir))

if is_multiagent:
exp["config"]["multiagent"] = {
"policies": {policy_name: PolicySpec() for policy_name in policy_names},
"policy_mapping_fn": policy_mapping_fn,
}
else:
exp["config"]["num_envs_per_worker"] = num_envs

tuner = None
if not args.restore:
tuner = tune.Tuner(
trainable=exp["algorithm"],
param_space=exp["config"],
run_config=train.RunConfig(
storage_path=os.path.abspath(args.experiment_dir),
stop=exp["stop"],
checkpoint_config=train.CheckpointConfig(checkpoint_frequency=exp["checkpoint_frequency"]),
),
)
else:
tuner = tune.Tuner.restore(
trainable=exp["algorithm"],
path=args.restore,
resume_unfinished=True,
)
result = tuner.fit()

# Onnx export after training if a checkpoint was saved
checkpoint = result.get_best_result().checkpoint

if checkpoint:
result_path = result.get_best_result().path
ppo = Algorithm.from_checkpoint(checkpoint)
if is_multiagent:
for policy_name in set(policy_names):
ppo.get_policy(policy_name).export_model(f"{result_path}/onnx_export/{policy_name}_onnx", onnx=12)
print(
f"Saving onnx policy to {pathlib.Path(f'{result_path}/onnx_export/{policy_name}_onnx').resolve()}"
)
else:
ppo.get_policy().export_model(f"{result_path}/onnx_export/single_agent_policy_onnx", onnx=12)
print(
f"Saving onnx policy to {pathlib.Path(f'{result_path}/onnx_export/single_agent_policy_onnx').resolve()}"
)