Merge pull request #181 from edbeeching/multiagent_experimental
Ready for testing 🧪 Multi-policy training support
Showing 9 changed files with 507 additions and 107 deletions.
@@ -0,0 +1,57 @@
This is a brief guide on training multiple policies, focusing on Rllib specifically. If you don't require agents with different action/obs spaces, you might also consider using Sample Factory (it's fully supported on Linux), or, for simpler multi-agent envs, SB3 might work using a single shared policy for all agents.

## Installation and configuration:

### Install dependencies:

`pip install https://github.com/edbeeching/godot_rl_agents/archive/refs/heads/main.zip` (to get the latest version)

`pip install ray[rllib]`

`pip install PettingZoo`

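To quickly confirm the dependencies are importable in the environment you will train from, you can run a short check like the sketch below (the version prints are just for convenience):

```python
# Quick sanity check that the training dependencies are importable.
import godot_rl  # installed from the godot_rl_agents zip above
import pettingzoo
import ray

print("ray:", ray.__version__)
print("pettingzoo:", pettingzoo.__version__)
```
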
### Download the examples file and config file:

From https://github.com/edbeeching/godot_rl_agents/tree/main/examples, you will need `rllib_example.py` and `rllib_config.yaml`.

### Open the config file:

If your env has multiple different policies you wish to train (explained below), set `env_is_multiagent: true`; otherwise keep it `false`.

Change `env_path: null # Set your env path here (exported executable from Godot) - e.g. env_path: 'env_path.exe' on Windows` to point to your exported env from Godot. In-editor training with this script is not recommended: the script launches the env multiple times (to get info about the different policy names, to train, and to export to onnx after training), so while possible, you would need to press `Play` in the Godot editor multiple times during the process.

You can also adjust the stop criteria (set to 1200 seconds by default) and other settings.

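If you prefer to set these values from code rather than editing the file by hand, a minimal sketch (using PyYAML, which the example script also relies on; the export path below is illustrative) could look like this:

```python
# Minimal sketch: load rllib_config.yaml, point env_path at your exported game,
# and enable multi-agent mode. The export path is illustrative.
import yaml

with open("rllib_config.yaml") as f:
    exp = yaml.safe_load(f)

exp["env_is_multiagent"] = True
exp["config"]["env_config"]["env_path"] = "path/to/your_exported_env.exe"

with open("rllib_config.yaml", "w") as f:
    yaml.safe_dump(exp, f, sort_keys=False)
```

Note that dumping the file this way drops its comments, so hand-editing is usually simpler; the snippet mainly shows where the relevant keys live.
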
## Configuring and exporting the Godot Env:

### Multi-policy env design differences:

When you set `env_is_multiagent` to `true`, any agent (AIController) that has set `done = true` will receive actions filled with zeros until all agents have set `done = true` at least once during that episode. At that point Rllib considers the episode done for all agents, sends a reset signal (which sets `needs_reset = true` in each AIController), and displays the episode rewards in its stats.

If you notice individual agents standing still or behaving oddly (depending on what zero-valued actions do in your game), it's possible that some agents already set `done = true` earlier in the episode while others are still active.

In the example env, a training manager script sets `done` to `true` for all agents at the same time after a fixed number of steps, and the `needs_reset = true` signal is ignored because all agents are reset manually once the episode is done. Alternatively, you could handle resetting agents in your env when `needs_reset` becomes `true` (keep in mind that AIControllers also automatically set it to `true` after `reset_after` steps; you can override that behavior if needed).

**The behavior described above is different from setting `env_is_multiagent` to `false`, or e.g. using the [SB3 example to train](https://github.com/edbeeching/godot_rl_agents/blob/main/docs/ADV_STABLE_BASELINES_3.md)**, in which case a single policy is trained on a vectorized environment: each agent can have its own episode length and will continue to receive actions even after setting `done = true`, as agents are assumed to auto-reset inside the env itself (the reset needs to be implemented in Godot, as in the example envs).

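To make those semantics concrete, here is a schematic loop (not the actual RLlib or plugin internals; the agent names and the toy "one agent finishes per step" stand-in are purely illustrative) showing what agents experience when `env_is_multiagent` is `true`:

```python
# Schematic only: once an agent reports done, it keeps receiving zero-valued
# actions until every agent has been done at least once, then the episode ends.
import numpy as np

AGENTS = ["agent_0", "agent_1", "agent_2"]
ACTION_SIZE = 2

def pick_action(agent, done_once):
    if done_once[agent]:
        return np.zeros(ACTION_SIZE)  # already-done agents get zeros
    return np.random.uniform(-1.0, 1.0, ACTION_SIZE)

done_once = {agent: False for agent in AGENTS}
step = 0
while not all(done_once.values()):
    actions = {agent: pick_action(agent, done_once) for agent in AGENTS}
    # ... send actions to the Godot env, read back obs/rewards/dones ...
    # Toy stand-in: pretend one agent finishes per step.
    done_once[AGENTS[step]] = True
    step += 1

print(f"All agents done after {step} steps -> episode ends, needs_reset is set on every AIController")
```
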
### Setting policy names:
For each AIController, you can set a different policy name in Godot. Policies will be assigned to agents based on this name. E.g. if you have 10 agents assigned to `policy1`, they will all use policy 1, and if you have one agent assigned to `policy2`, it will use policy 2.

![setting-policy-names](https://github.com/edbeeching/godot_rl_agents/assets/61947090/13eb9b46-f7fb-467c-ad16-8609cda9f292)

Screenshot from [MultiAgent Simple env](https://github.com/edbeeching/godot_rl_agents_examples/tree/main/examples/MultiAgentSimple).

> [!IMPORTANT]
> All agents that have the same policy name must have the same observation and action space.
## Training:
After installing the prerequisites and adjusting the config, start training by running `python rllib_example.py` in your conda env/venv (from the folder containing the script and config).
Rllib will print useful info to the console, such as the command for starting `Tensorboard` to see the training logs for the session.
Onnx files are exported automatically once training is done, and their paths are printed near the bottom of the console log (you can also stop mid-training with `CTRL+C`, but if you press it twice in a row, saving/exporting will not be done).

For an example of a multi-policy env with 2 policies, check out the [MultiAgent Simple env](https://github.com/edbeeching/godot_rl_agents_examples/tree/main/examples/MultiAgentSimple).

Additional arguments:
- You can change the folder for logging, checkpoints, and onnx files by using `--experiment_dir [experiment_path]`,
- You can resume a stopped session by using the `--restore [resume_path]` argument (rllib will print the path to resume from in the console if you stop training),
- You can set the config file location using `--config_file [path_to_config.yaml]` (default is `rllib_config.yaml`).
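For example, a run that resumes a previously stopped session with a custom log directory might look like `python rllib_example.py --config_file rllib_config.yaml --experiment_dir logs/rllib --restore [resume_path]`, with `[resume_path]` taken from the console output of the stopped run.
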
@@ -0,0 +1,60 @@
algorithm: PPO

# Multi-agent-env setting:
# If true:
# - Any AIController with done = true will receive zeroes as action values until all AIControllers are done, an episode ends at that point.
# - ai_controller.needs_reset will also be set to true every time a new episode begins (but you can ignore it in your env if needed).
# If false:
# - AIControllers auto-reset in Godot and will receive actions after setting done = true.
# - Each AIController has its own episodes that can end/reset at any point.
# Set to false if you have a single policy name for all agents set in AIControllers
env_is_multiagent: false

checkpoint_frequency: 20

# You can set one or more stopping criteria
stop:
  #episode_reward_mean: 0
  #training_iteration: 1000
  #timesteps_total: 10000
  time_total_s: 10000000

config:
  env: godot
  env_config:
    env_path: null # Set your env path here (exported executable from Godot) - e.g. env_path: 'env_path.exe' on Windows
    action_repeat: null # Doesn't need to be set here, you can set this in sync node in Godot editor as well
    show_window: true # Displays game window while training. Might be faster when false in some cases, turning off also reduces GPU usage if you don't need rendering.
    speedup: 30 # Speeds up Godot physics

  framework: torch # ONNX models exported with torch are compatible with the current Godot RL Agents Plugin

  lr: 0.0003
  lambda: 0.95
  gamma: 0.99

  vf_loss_coeff: 0.5
  vf_clip_param: .inf
  #clip_param: 0.2
  entropy_coeff: 0.0001
  entropy_coeff_schedule: null
  #grad_clip: 0.5

  normalize_actions: False
  clip_actions: True # During onnx inference we simply clip the actions to [-1.0, 1.0] range, set here to match

  rollout_fragment_length: 32
  sgd_minibatch_size: 128
  num_workers: 4
  num_envs_per_worker: 1 # This will be set automatically if not multi-agent. If multi-agent, changing this changes how many envs to launch per worker.
  # The value below needs changing per env
  # Basic calculation for this value can be rollout_fragment_length * num_workers * num_envs_per_worker (how many AIControllers you have if not multi-agent, otherwise the value you set)
  train_batch_size: 2048

  num_sgd_iter: 4
  batch_mode: truncate_episodes

  num_gpus: 0
  model:
    vf_share_layers: False
    fcnet_hiddens: [64, 64]
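To make the `train_batch_size` comment above concrete, here is a small worked example with illustrative numbers (the AIController count is an assumption made only for the arithmetic):

```python
# Worked example for the train_batch_size comment above (numbers are illustrative).
rollout_fragment_length = 32
num_workers = 4
num_envs_per_worker = 16  # e.g. a single-policy env with 16 AIControllers (auto-set by rllib_example.py)

train_batch_size = rollout_fragment_length * num_workers * num_envs_per_worker
print(train_batch_size)  # 2048, matching the value in the config above
```
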
@@ -0,0 +1,116 @@
# Rllib Example for single and multi-agent training for GodotRL with onnx export,
# needs rllib_config.yaml in the same folder or --config_file argument specified to work.

import argparse
import os
import pathlib

import ray
import yaml
from ray import train, tune
from ray.rllib.algorithms.algorithm import Algorithm
from ray.rllib.env.wrappers.pettingzoo_env import ParallelPettingZooEnv
from ray.rllib.policy.policy import PolicySpec

from godot_rl.core.godot_env import GodotEnv
from godot_rl.wrappers.petting_zoo_wrapper import GDRLPettingZooEnv
from godot_rl.wrappers.ray_wrapper import RayVectorGodotEnv

if __name__ == "__main__":
    parser = argparse.ArgumentParser(allow_abbrev=False)
    parser.add_argument("--config_file", default="rllib_config.yaml", type=str, help="The yaml config file")
    parser.add_argument("--restore", default=None, type=str, help="the location of a checkpoint to restore from")
    parser.add_argument(
        "--experiment_dir",
        default="logs/rllib",
        type=str,
        help="The name of the experiment directory, used to store logs.",
    )
    args, extras = parser.parse_known_args()

    # Get config from file
    with open(args.config_file) as f:
        exp = yaml.safe_load(f)

    is_multiagent = exp["env_is_multiagent"]

    # Register env
    env_name = "godot"
    env_wrapper = None

    def env_creator(env_config):
        index = env_config.worker_index * exp["config"]["num_envs_per_worker"] + env_config.vector_index
        port = index + GodotEnv.DEFAULT_PORT
        seed = index
        if is_multiagent:
            return ParallelPettingZooEnv(GDRLPettingZooEnv(config=env_config, port=port, seed=seed))
        else:
            return RayVectorGodotEnv(config=env_config, port=port, seed=seed)

    tune.register_env(env_name, env_creator)

    policy_names = None
    num_envs = None
    tmp_env = None

    if is_multiagent:  # Make temp env to get info needed for multi-agent training config
        print("Starting a temporary multi-agent env to get the policy names")
        tmp_env = GDRLPettingZooEnv(config=exp["config"]["env_config"], show_window=False)
        policy_names = tmp_env.agent_policy_names
        print("Policy names for each Agent (AIController) set in the Godot Environment", policy_names)
    else:  # Make temp env to get info needed for setting num_workers training config
        print("Starting a temporary env to get the number of envs and auto-set the num_envs_per_worker config value")
        tmp_env = GodotEnv(env_path=exp["config"]["env_config"]["env_path"], show_window=False)
        num_envs = tmp_env.num_envs

    tmp_env.close()

    def policy_mapping_fn(agent_id: int, episode, worker, **kwargs) -> str:
        return policy_names[agent_id]

    ray.init(_temp_dir=os.path.abspath(args.experiment_dir))

    if is_multiagent:
        exp["config"]["multiagent"] = {
            "policies": {policy_name: PolicySpec() for policy_name in policy_names},
            "policy_mapping_fn": policy_mapping_fn,
        }
    else:
        exp["config"]["num_envs_per_worker"] = num_envs

    tuner = None
    if not args.restore:
        tuner = tune.Tuner(
            trainable=exp["algorithm"],
            param_space=exp["config"],
            run_config=train.RunConfig(
                storage_path=os.path.abspath(args.experiment_dir),
                stop=exp["stop"],
                checkpoint_config=train.CheckpointConfig(checkpoint_frequency=exp["checkpoint_frequency"]),
            ),
        )
    else:
        tuner = tune.Tuner.restore(
            trainable=exp["algorithm"],
            path=args.restore,
            resume_unfinished=True,
        )
    result = tuner.fit()

    # Onnx export after training if a checkpoint was saved
    checkpoint = result.get_best_result().checkpoint

    if checkpoint:
        result_path = result.get_best_result().path
        ppo = Algorithm.from_checkpoint(checkpoint)
        if is_multiagent:
            for policy_name in set(policy_names):
                ppo.get_policy(policy_name).export_model(f"{result_path}/onnx_export/{policy_name}_onnx", onnx=12)
                print(
                    f"Saving onnx policy to {pathlib.Path(f'{result_path}/onnx_export/{policy_name}_onnx').resolve()}"
                )
        else:
            ppo.get_policy().export_model(f"{result_path}/onnx_export/single_agent_policy_onnx", onnx=12)
            print(
                f"Saving onnx policy to {pathlib.Path(f'{result_path}/onnx_export/single_agent_policy_onnx').resolve()}"
            )
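
If you want to quickly verify an exported file outside of Godot, a minimal check (assuming `onnxruntime` is installed, which the training script itself does not require; the path below is illustrative, use the export path printed at the end of training) could be:

```python
# Quick check that an exported ONNX policy loads, and what inputs it expects.
# The path is illustrative - substitute the onnx export path printed after training.
import onnxruntime

session = onnxruntime.InferenceSession("logs/rllib/onnx_export/policy1_onnx/model.onnx")
for inp in session.get_inputs():
    print(inp.name, inp.shape)
```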