[RLlib] AlgorithmConfig: Mostly dissolve resources() settings (e.g. num_learner_workers) into learners() and env_runners() methods. #45376
Conversation
LGTM. Makes a lot of things clearer and gives a clean separation of the different classes. What needs to be emphasized more clearly, imo, is that a single Learner can use multiple GPUs, and what num_gpus is correctly used for (local EnvRunner, local Learner, etc.).
num_gpus_per_learner_worker=0, # <- set this to 1, if you have at least 1 GPU
num_cpus_for_local_worker=1,
# `num_learners` to the number of available GPUs for multi-GPU training (and
# `num_gpus_per_learner=1`).
If each Learner can use 1 GPU only, the argument num_gpus_per_learner is imo a bit off. Instead, we might better use something like learner_on_gpu=True/False.
It's not a necessary limit, and we won't error out if the user sets it to >1. It could be, for example, that they have a model that doesn't fit on one GPU. We should leave this option flexible for future pipelining/model-parallelism support, I think.
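For illustration, a minimal sketch of that "more than one GPU per Learner" case, assuming the new `learners()` arguments proposed in this PR (PPOConfig is only used as an example algorithm here):

```python
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .learners(
        num_learners=1,          # a single (remote) Learner worker ...
        num_gpus_per_learner=2,  # ... reserving 2 GPUs, e.g. for a large model
    )
)
```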
doc/source/rllib/rllib-learner.rst
Outdated
num_learner_workers=0 # Set this to greater than 1 to allow for DDP style
# updates.
.learners(
num_learners=0 # Set this to greater than 1 to allow for DDP style updates.
Does this also work with data-distributed learning, i.e. using a large batch and distributing it among different Learners, still synchronously?
That's the idea, yes. If you do num_learners > 1, right now, RLlib automatically performs DDP-style training, splitting the batch automatically into n shards and sending them (synchronized) to the n Learners.
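A minimal sketch of that DDP-style setup, assuming the new `learners()` API from this PR (PPOConfig and the batch size are only illustrative):

```python
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    # With a train batch of 4000 timesteps, each of the 4 Learners should
    # receive roughly a 1000-timestep shard per (synchronized) update.
    .training(train_batch_size=4000)
    .learners(
        num_learners=4,          # 4 Learner workers -> 4 synchronized shards
        num_gpus_per_learner=1,  # one GPU per Learner
    )
)
```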
# Use `num_cpus_for_local_worker` and `num_gpus` for the local worker and
# `num_cpus_per_worker` and `num_gpus_per_worker` for the remote
# workers to determine their CPU/GPU resource needs.
# Use `num_cpus_for_main_process` and `num_gpus` for the local worker and
Users might ask: what if I run the Learner on the main process, do I need num_gpus?
Good question, I haven't really thought about this. We should definitely retire this option; it's misleading.
If you want PPO, for example, to learn on the local Learner worker (num_learners=0) AND use the GPU, I guess we should test whether this already works with num_gpus_per_learner=1. This should, imo, allocate the 1 GPU together with the num_cpus_for_main_process CPUs in the placement group bundle. Let me check ...
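A sketch of the setup to be verified here (an assumption, not confirmed behavior): a local Learner (`num_learners=0`) that should still reserve one GPU, next to the main-process CPUs:

```python
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .learners(
        num_learners=0,          # Learner runs locally, in the main process
        num_gpus_per_learner=1,  # ... and should still reserve 1 GPU there
    )
    .resources(num_cpus_for_main_process=1)  # CPUs for the main-process bundle
)
```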
rllib/algorithms/algorithm_config.py
Outdated
For example, an Algorithm with 2 EnvRunners and 1 Learner (with
1 GPU) will request a placement group with the bundles:
[{"cpu": 1}, {"gpu": 1, "cpu": 1}, {"cpu": 1}, {"cpu": 1}], where the
first bundle is for the local (main Algorithm) process, the secon one
secon -> second
fixed
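For reference, a sketch of the config that the quoted docstring example describes (assuming the new API; the bundle layout in the comment is taken from the docstring text above):

```python
from ray.rllib.algorithms.ppo import PPOConfig

# 2 EnvRunners and 1 Learner with 1 GPU; per the docstring above, this should
# request a placement group with bundles roughly like
#   [{"cpu": 1}, {"gpu": 1, "cpu": 1}, {"cpu": 1}, {"cpu": 1}]
# (main Algorithm process, the Learner, then one bundle per EnvRunner).
config = (
    PPOConfig()
    .env_runners(num_env_runners=2)
    .learners(num_learners=1, num_gpus_per_learner=1)
)
```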
For multi-gpu training, you have to set `num_learners` to > 1 and set
`num_gpus_per_learner` accordingly (e.g. 4 GPUs total and model fits on
1 GPU: `num_learners=4; num_gpus_per_learner=1` OR 4 GPUs total and
model requires 2 GPUs: `num_learners=2; num_gpus_per_learner=2`).
This needs to be emphasized more clearly in the other sections where these arguments are mentioned and in the docs: A single learner can use multiple GPUs.
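For instance, the two 4-GPU setups from the quoted docstring, written out as a sketch (assuming the new `learners()` API; PPOConfig is only an example):

```python
from ray.rllib.algorithms.ppo import PPOConfig

# 4 GPUs total, model fits on 1 GPU -> data-parallel across 4 Learners.
config_small_model = PPOConfig().learners(num_learners=4, num_gpus_per_learner=1)

# 4 GPUs total, model requires 2 GPUs -> 2 Learners with 2 GPUs each.
config_large_model = PPOConfig().learners(num_learners=2, num_gpus_per_learner=2)
```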
AlgorithmConfig: Mostly dissolve resources() settings (e.g. num_learner_workers) into learners() and env_runners() methods. This will increase the clarity of the AlgorithmConfig semantic structure, providing a clearer separation between "Learner workers" and "EnvRunner workers", which were previously sometimes both referred to as "workers".
Why are these changes needed?
Related issue number
Checks
- I've signed off every commit (git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.