Support RLOO/GRPO/REINFORCE? #68

Open
fzyzcjy opened this issue Dec 31, 2024 · 24 comments

Comments

@fzyzcjy (Contributor) commented Dec 31, 2024

Hi, thanks for the lib! I wonder whether this library will one day officially support algorithms such as RLOO/GRPO/REINFORCE?
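For reference, my rough understanding of how these critic-free methods turn per-response rewards into advantages (just an illustrative sketch, not a request for any specific implementation):

```python
import torch

def critic_free_advantages(rewards: torch.Tensor, method: str = "grpo") -> torch.Tensor:
    """Toy advantage estimation; rewards has shape (num_prompts, group_size),
    i.e. one scalar reward per sampled response for each prompt."""
    if method == "grpo":
        # GRPO: normalize each reward by its group's mean and std.
        mean = rewards.mean(dim=-1, keepdim=True)
        std = rewards.std(dim=-1, keepdim=True)
        return (rewards - mean) / (std + 1e-8)
    if method == "rloo":
        # RLOO: baseline is the mean reward of the *other* samples in the group.
        k = rewards.size(-1)
        baseline = (rewards.sum(dim=-1, keepdim=True) - rewards) / (k - 1)
        return rewards - baseline
    # Plain REINFORCE: no baseline at all.
    return rewards

# Example: 2 prompts, 4 sampled responses each.
adv = critic_free_advantages(torch.tensor([[1.0, 0.0, 0.0, 1.0], [0.5, 0.2, 0.9, 0.1]]))
```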

@PeterSH6 (Collaborator)

Hi @fzyzcjy, thanks for your interest!

We plan to release some of these algorithms (e.g., GRPO, REINFORCE...) next month.

In the meantime, we welcome contributions from the community. Feel free to submit a pull request if you'd like to help implement any RL algorithms or system optimizations!

@fzyzcjy (Contributor, Author) commented Dec 31, 2024

That looks great, and I'm looking forward to it!

Btw, is it recommended to run this on a single 4090 GPU (using e.g. 0.5B models), or is the lib optimized for large-scale usage such that small-scale usage is impractical and very slow?

In addition, what do you think about TRL and OpenRLHF (and others)? Is there a comparison between them?

@PeterSH6 (Collaborator)

Oh, good questions!
For a quick start, we provide a lightweight example of running veRL training. However, veRL is primarily designed and optimized for large-scale usage. We target high training throughput and our programming model also offers flexibility to support various algorithms.

In the examples/ppo_trainer directory, you'll find several scripts for running models on both single-node and multi-node setups. Users can also customize their own scripts for larger model and cluster sizes. We will continue to add experiment examples for different models and cluster scales for better reference.

Regarding other RL frameworks, I believe TRL offers comprehensive and accurate algorithm baselines, while OpenRLHF extends them to larger scales. They're both amazing RL post-training frameworks. In our paper "HybridFlow: A Flexible and Efficient RLHF Framework" (https://arxiv.org/pdf/2409.19256v2), we provide detailed comparisons against other frameworks. The paper discusses our design in-depth and presents the comparison results.

@fzyzcjy (Contributor, Author) commented Dec 31, 2024

Thanks for the info!

fzyzcjy closed this as completed on Dec 31, 2024
@fzyzcjy (Contributor, Author) commented Jan 1, 2025

@PeterSH6 Feel free to submit a pull request if you'd like to help implement any RL algorithms or system optimizations!

So, if I submit PRs that optimize performance for the single-GPU scenario (and, of course, do not decrease codebase quality), will they be accepted or welcomed?

In my humble opinion, if this is done, more people without many GPUs, as well as people learning RLHF, will use this library.

Of course, the ideal case would be PRs that benefit both single-GPU and larger-scale usage, e.g. sgl-project/sglang#2542. That PR uses https://github.com/fzyzcjy/torch_memory_saver, which allows any GPU memory to be temporarily released and resumed later, and supports CUDA graph as well.
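Roughly, the usage pattern looks like this (written from memory, so the exact torch_memory_saver API may differ slightly):

```python
# Sketch of the release/resume pattern; exact torch_memory_saver API may differ slightly.
import torch
from torch_memory_saver import torch_memory_saver

# Allocate the rollout engine's big buffers (e.g. KV cache) inside a tracked region.
with torch_memory_saver.region():
    kv_cache = torch.empty(2 * 1024**3, dtype=torch.uint8, device="cuda")

torch_memory_saver.pause()   # physical GPU memory is released, virtual addresses are kept
# ... the trainer can now use the freed memory for its forward/backward ...
torch_memory_saver.resume()  # physical memory is mapped back; captured CUDA graphs stay valid
```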

fzyzcjy reopened this on Jan 1, 2025
@eric-haibin-lin (Collaborator)

@fzyzcjy I cannot agree more that providing optimized examples accessible on a single GPU is important and desirable.

@fzyzcjy (Contributor, Author) commented Jan 1, 2025

That looks great!

Another quick question: Is the open-sourced one the same one used internally? For example, Google's Flutter, which I have made a lot of PRs to, works that way.

If the open-sourced and internal versions are different, then it seems that I cannot make larger PRs, since I could easily create code that conflicts with the internal codebase without ever knowing it. For example, when very briefly glancing at the code, it seems this (used by this) may not need to work on global variables and could be refactored. However, if the internal codebase has some extra logic that does require this, then it cannot be changed.

@PeterSH6 (Collaborator) commented Jan 1, 2025

@fzyzcjy thanks for the questions!

if I submit PRs that optimize performance for the single-GPU scenario (and, of course, do not decrease codebase quality), will they be accepted or welcomed?

Absolutely! We welcome optimizations of this kind. We're eager to extend veRL's scope to different scales.

Of course, the ideal case would be PRs that benefit both single-GPU and larger-scale usage, e.g. sgl-project/sglang#2542. That PR uses https://github.com/fzyzcjy/torch_memory_saver, which allows any GPU memory to be temporarily released and resumed later, and supports CUDA graph as well.

I believe many optimizations could be beneficial for both single-GPU and larger scale. We've noticed your SGLang PR and are investigating similar issues with vLLM. We've reported this feature to the vLLM team at vllm-project/vllm#11638. Thanks for sharing! Currently, veRL disables CUDAGraph to enable KVCache offloading in both single-GPU and distributed training. Your solution will likely increase training throughput!

Another quick update: we plan to integrate the SGLang framework into veRL in January.

Another quick question: Is the open-sourced one the same one used internally?

I would say the open-source version is a large subset of the one used internally. Specifically, our programming model (verl.single_controller), generation framework (vLLM), and the PPO algorithm implementation are identical. We do have some internal optimizations for the training frameworks (Torch FSDP and Megatron-LM), but they're specific to our internal models. We have already open-sourced most of the general trainers and relevant optimizations, and we'll open-source more in the near future (e.g., context parallelism).

If the open-sourced and internal versions are different, then it seems that I cannot make larger PRs, since I could easily create code that conflicts with the internal codebase without ever knowing it. For example, when very briefly glancing at the code, it seems this (used by this) may not need to work on global variables and could be refactored. However, if the internal codebase has some extra logic that does require this, then it cannot be changed.

In my opinion, it would be fine to make large PRs if we can keep the same functionality and won't break the APIs too much. (We can also break some if necessary).
In the code you refer to, I'm sorry that I don't get your point regarding the global variables. But feel free to refactor it!
Our goal is to let users set any vLLM SamplingParam from the config file. Thanks to our hybrid controller design, different frameworks are independent. I believe refactored code will be easy to test.
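For example, something along these lines (just an illustration of the goal, not the current code; the config structure is made up):

```python
from vllm import SamplingParams

def sampling_params_from_config(rollout_cfg: dict) -> SamplingParams:
    # Forward whatever the user puts under the (hypothetical) `sampling_params`
    # section of the rollout config straight into vLLM, on top of a few defaults.
    kwargs = {"temperature": 1.0, "top_p": 1.0, "max_tokens": 512}
    kwargs.update(rollout_cfg.get("sampling_params", {}))
    return SamplingParams(**kwargs)

params = sampling_params_from_config({"sampling_params": {"temperature": 0.7, "top_k": 50}})
```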

Looking forward to your PRs. Feel free to contact us here or in Slack!

@fzyzcjy (Contributor, Author) commented Jan 1, 2025

@eric-haibin-lin I cannot agree more that providing optimized examples accessible on a single GPU is important and desirable.
@PeterSH6 Absolutely! We welcome optimizations of this kind. We're eager to extend veRL's scope to different scales.

That looks great! Then I will probably do that.

I believe many optimizations could be beneficial for both single-GPU and larger scale. We've noticed your SGLang PR and are investigating similar issues with vLLM. We've reported this feature to the vLLM team at vllm-project/vllm#11638. Thanks for sharing! Currently, veRL disables CUDAGraph to enable KVCache offloading in both single-GPU and distributed training. Your solution will likely increase training throughput!

Yes, I hope it will be helpful!

Another quick update: we plan to integrate the SGLang framework into veRL in January.

I am happy to contribute as well. I just made a small PR, sgl-project/sglang#2695, which is also relevant to verl (since it seems verl uses it).

I would say the open-source version is a large subset of the one used internally. Specifically, our programming model (verl.single_controller), generation framework (vLLM), and the PPO algorithm implementation are identical. We do have some internal optimizations for the training frameworks (Torch FSDP and Megatron-LM), but they're specific to our internal models. We have already open-sourced most of the general trainers and relevant optimizations, and we'll open-source more in the near future (e.g., context parallelism).

I see, that looks great!

In my opinion, it would be fine to make large PRs if we can keep the same functionality and won't break the APIs too much. (We can also break some if necessary).
In the code you refer to, I'm sorry that I don't get your point regarding the global variables. But feel free to refactor it!
Our goal is to let users set any vLLM SamplingParam from the config file. Thanks to our hybrid controller design, different frameworks are independent. I believe refactored code will be easy to test.
Looking forward to your PRs. Feel free to contact us here or in Slack!

Sure :) I will PR it later (either separately or batched with other minor changes that I find). Anyway, it is just a tiny cleanup, and it is entirely possible that I simply glanced too quickly and nothing is wrong - I will check later when working on the cleanup/refactor.

Btw, are there / will there be more tests in this framework? For example, one advantage of https://github.com/thu-ml/tianshou/ is its end-to-end tests. They may make users and contributors more confident that everything is working well and nothing is broken. I totally understand that testing training is not cheap, though...

The single-GPU optimization we discussed above may also be somewhat related to this - if we can train a 0.5B model more cheaply, then the CI will be cheaper.

@PeterSH6 (Collaborator) commented Jan 1, 2025

Cannot agree more!

I'll set up the end-to-end CI as soon as possible. At the moment, we only have a single V100 GPU available on GitHub for CI purposes. Unfortunately, due to the older Volta architecture, we're encountering some compatibility issues when attempting to run our original e2e CI on this GPU. I'll fix this asap.

@fzyzcjy (Contributor, Author) commented Jan 1, 2025

I see... Looking forward to it!

@fzyzcjy (Contributor, Author) commented Jan 1, 2025

Btw, do you mean the tests in https://github.com/volcengine/verl/tree/main/tests, or more tests? It would be great if there were tests that train a real model using PPO to verify that real performance stays unchanged as time goes by. (But I guess we would need to optimize heavily and shrink the test task to a great extent before we could run real training on a V100 in reasonable time...)
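For instance, even a check as small as this (my own sketch, not veRL code; the model name is just an example) would already catch many regressions while staying cheap:

```python
# Sketch of a cheap CI sanity check: a single forward/backward on a ~0.5B model.
# Not a real PPO run; it only verifies that the model path yields a finite loss.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def test_tiny_forward_backward():
    name = "Qwen/Qwen2-0.5B-Instruct"  # example small model
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float32)
    batch = tok(["1 + 1 ="], return_tensors="pt")
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    assert torch.isfinite(loss)
```

A full PPO run on a shrunken task would of course be the real goal; this is just the lower bound of what fits on a single V100.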

@PeterSH6 (Collaborator) commented Jan 1, 2025

Yes, the tests are in that directory, and they don't cover all the tests we use internally at the moment.

We have one test that trains a real model using PPO on a very simple task. The output length is only 16. This may be too short to verify performance, but it's enough to validate correctness.

@fzyzcjy (Contributor, Author) commented Jan 1, 2025

We have one test that trains a real model using PPO on a very simple task. The output length is only 16. This may be too short to verify performance, but it's enough to validate correctness.

That looks great, and I'm looking forward to seeing it open-sourced! Or, if it is not open-sourced, maybe PRs will be tested internally after they are merged? If so, that also seems safe (though the feedback loop may be longer).

@PeterSH6 (Collaborator) commented Jan 1, 2025

We have one test that trains a real model using PPO on a very simple task. The output length is only 16. This may be too short to verify performance, but it's enough to validate correctness.

That looks great, and I'm looking forward to seeing it open-sourced! Or, if it is not open-sourced, maybe PRs will be tested internally after they are merged? If so, that also seems safe (though the feedback loop may be longer).

Yes, we can test it internally if it's not ready before your PR.

@fzyzcjy (Contributor, Author) commented Jan 2, 2025

@PeterSH6 Another quick update: we plan to integrate the SGLang framework into veRL in January.

To double check - is it true that no code has been written yet? Because I would like to start a PR for this and do not want to create conflicts if code has already been written.

@PeterSH6 (Collaborator) commented Jan 2, 2025

To double check - is it true that no code has been written yet? Because I would like to start a PR for this and do not want to create conflicts if code has already been written.

It's true that no code has been written yet. Please go ahead and start your PR 😄! I hope this basic tutorial will help: #21

@fzyzcjy (Contributor, Author) commented Jan 2, 2025

@PeterSH6 Sure! I discussed this with the SGLang team, and they said the relevant people will have a meeting to discuss more details, so I will probably start PRing after that.

In the meantime, I made a very tiny PR first: #74.

@PeterSH6 (Collaborator) commented Jan 2, 2025

@PeterSH6 Sure! I discussed this with the SGLang team, and they said the relevant people will have a meeting to discuss more details, so I will probably start PRing after that.

In the meantime, I made a very tiny PR first: #74.

We may schedule a meeting next week. Would you like to join?

For the PR, it looks good at first glance, and I'll review it later. Thanks!

@fzyzcjy (Contributor, Author) commented Jan 2, 2025

We may schedule a meeting next week. Would you like to join?

Sure, feel free to ping me!

For the PR, it looks good at first glance, and I'll review it later. Thanks!

You're welcome, and take your time!

@hijkzzz commented Jan 2, 2025

It is recommended to implement REINFORCE++ instead of REINFORCE.
By integrating various optimization techniques from Proximal Policy Optimization (PPO) into the traditional REINFORCE algorithm, we proposed REINFORCE++, which aims to enhance performance and stability in RLHF while reducing computational resource requirements by removing the critic network.
The key feature of REINFORCE++ is that it is more stable than GRPO and faster than PPO.
See the Tech Report.
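Roughly, the recipe looks like this (a simplified sketch of the idea, not the reference implementation):

```python
import torch

def reinforce_pp_advantages(seq_reward, logp, ref_logp, mask, kl_coef=0.01):
    """Simplified sketch: token-level KL penalty folded into the reward,
    then global whitening of returns instead of a learned critic.

    seq_reward: (batch,) scalar reward per response
    logp, ref_logp: (batch, seq_len) per-token log-probs under policy / reference
    mask: (batch, seq_len) 1 for response tokens (right-padded), 0 for padding
    """
    rewards = -kl_coef * (logp - ref_logp) * mask
    last = mask.sum(dim=-1).long() - 1                      # index of final response token
    rewards[torch.arange(rewards.size(0)), last] += seq_reward

    # Return-to-go per token (gamma = 1), then whiten across the whole batch.
    returns = torch.flip(torch.cumsum(torch.flip(rewards, [-1]), dim=-1), [-1]) * mask
    valid = returns[mask.bool()]
    return (returns - valid.mean()) / (valid.std() + 1e-8) * mask
```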

@PeterSH6 (Collaborator) commented Jan 2, 2025

@hijkzzz, thanks for the info!

Awesome work!

@fzyzcjy (Contributor, Author) commented Jan 2, 2025

@hijkzzz I saw that before and it looks great!

@fzyzcjy (Contributor, Author) commented Jan 3, 2025

Btw, I have briefly tried torch.compile, and it seems to save some memory as well as provide some speedup. Would you be interested in a PR that adds a flag to enable it?
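Concretely, I am thinking of something as small as this (the flag name and placement are hypothetical):

```python
import torch

def maybe_compile(model: torch.nn.Module, enable_compile: bool) -> torch.nn.Module:
    # Opt-in wrapper: torch.compile adds compilation time on the first steps
    # and may not support every model/kernel combination, so keep it behind a flag.
    if enable_compile and hasattr(torch, "compile"):
        return torch.compile(model)
    return model

# e.g. model = maybe_compile(model, cfg.get("enable_torch_compile", False))
```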
