
[RLlib] AlgorithmConfig: Mostly dissolve resources() settings (e.g. num_learner_workers) into learners() and env_runners() methods. #45376

Conversation

sven1977
Contributor

@sven1977 sven1977 commented May 16, 2024

AlgorithmConfig: Mostly dissolve resources() settings (e.g. num_learner_workers) into learners() and env_runners() methods.

This increases the clarity of the AlgorithmConfig's semantic structure by providing a cleaner separation between "Learner workers" and "EnvRunner workers", both of which were previously sometimes referred to simply as "workers".

from ray.rllib.algorithms.algorithm_config import AlgorithmConfig

config = (
    AlgorithmConfig()
    .env_runners(num_env_runners=16, num_cpus_per_env_runner=2)
    .learners(num_learners=4, num_gpus_per_learner=0.5)
)

Why are these changes needed?

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: sven1977 <svenmika1977@gmail.com>
@sven1977 sven1977 requested a review from a team as a code owner May 16, 2024 04:42
@sven1977 sven1977 enabled auto-merge (squash) May 16, 2024 09:42
@github-actions github-actions bot added the go add ONLY when ready to merge, run all tests label May 16, 2024
Collaborator

@simonsays1980 simonsays1980 left a comment


LGTM. Makes a lot of things clearer and gives a clean separation of the different classes. What needs to be emphasized more clearly, imo, is that a single Learner can use multiple GPUs, and what num_gpus is correctly used for (local EnvRunner, local Learner, etc.).

num_gpus_per_learner_worker=0, # <- set this to 1, if you have at least 1 GPU
num_cpus_for_local_worker=1,
# `num_learners` to the number of available GPUs for multi-GPU training (and
# `num_gpus_per_learner=1`).
Collaborator


If each Learner can only use 1 GPU, the argument num_gpus_per_learner is imo a bit off. Instead, we might better use something like learner_on_gpu=True/False.

Contributor Author


It's not a necessary limit, and we won't error out if the user sets it to >1. It could be, for example, that they have a model that doesn't fit on one GPU. We should leave this option flexible for future pipelining/model-parallelism support, I think.

num_learner_workers=0 # Set this to greater than 1 to allow for DDP style
# updates.
.learners(
num_learners=0 # Set this to greater than 1 to allow for DDP style updates.
Collaborator


Does this also work with data-distributed learning, i.e. using a large batch and distributing it among the different Learners, while still staying synchronous?

Contributor Author


That's the idea, yes. If you set num_learners > 1, RLlib currently performs DDP-style training automatically: it splits the batch into n shards and sends them (synchronously) to the n Learners.
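For illustration, a minimal sketch of such a data-distributed (DDP-style) setup; the algorithm choice (PPO), the environment, and the resource numbers are only assumptions for this example, not taken from the PR:

from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment("CartPole-v1")
    # 4 remote Learner workers with 1 GPU each: the train batch is split
    # into 4 shards and applied synchronously, one shard per Learner.
    .learners(num_learners=4, num_gpus_per_learner=1)
    # Each Learner then receives roughly train_batch_size / 4 timesteps.
    .training(train_batch_size=4000)
)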

# Use `num_cpus_for_local_worker` and `num_gpus` for the local worker and
# `num_cpus_per_worker` and `num_gpus_per_worker` for the remote
# workers to determine their CPU/GPU resource needs.
# Use `num_cpus_for_main_process` and `num_gpus` for the local worker and
Collaborator


Users might ask: What if I run the Learner on the main process, do I still need num_gpus?

Contributor Author


Good question, I haven't really thought about this. We should definitely retire this option; it's misleading.
If you want PPO, for example, to learn on the local Learner worker (num_learners=0) AND use the GPU, I guess we should test whether this already works with num_gpus_per_learner=1. This should, imo, allocate the 1 GPU together with the num_cpus_for_main_process CPUs in the placement group bundle. Let me check ...
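As a reference for that check, a sketch of the setup under discussion; whether the GPU really lands in the main-process placement group bundle is exactly what still needs to be verified here, and num_cpus_for_main_process is assumed to remain a resources() setting as in the quoted diff above:

from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    # num_learners=0: updates run on a local Learner inside the main
    # Algorithm process; no remote Learner actors are created.
    .learners(num_learners=0, num_gpus_per_learner=1)
    # CPUs reserved for that same main Algorithm process.
    .resources(num_cpus_for_main_process=1)
)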

For example, an Algorithm with 2 EnvRunners and 1 Learner (with
1 GPU) will request a placement group with the bundles:
[{"cpu": 1}, {"gpu": 1, "cpu": 1}, {"cpu": 1}, {"cpu": 1}], where the
first bundle is for the local (main Algorithm) process, the secon one
Collaborator


secon -> second

Contributor Author


fixed

For multi-gpu training, you have to set `num_learners` to > 1 and set
`num_gpus_per_learner` accordingly (e.g. 4 GPUs total and model fits on
1 GPU: `num_learners=4; num_gpus_per_learner=1` OR 4 GPUs total and
model requires 2 GPUs: `num_learners=2; num_gpus_per_learner=2`).
Collaborator


This needs to be emphasized more clearly in the other sections where these arguments are mentioned and in the docs: A single learner can use multiple GPUs.
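To make that point concrete, the two layouts from the quoted docstring could look as follows (a sketch only; 4 GPUs total assumed, and config being any AlgorithmConfig instance):

# Model fits on a single GPU: 4 Learners with 1 GPU each (DDP across 4 GPUs).
config.learners(num_learners=4, num_gpus_per_learner=1)

# Model requires 2 GPUs: 2 Learners with 2 GPUs each.
config.learners(num_learners=2, num_gpus_per_learner=2)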

Signed-off-by: sven1977 <svenmika1977@gmail.com>
@github-actions github-actions bot disabled auto-merge May 16, 2024 17:29
@sven1977 sven1977 merged commit 9ce581e into ray-project:master May 16, 2024
6 checks passed
@sven1977 sven1977 deleted the algorithm_config_dissolve_resources_method branch May 16, 2024 18:58