Support RLOO/GRPO/REINFORCE? #68

Open
fzyzcjy opened this issue Dec 31, 2024 · 24 comments

Comments

@fzyzcjy (Contributor) commented Dec 31, 2024

Hi, thanks for the lib! I wonder whether this library will one day officially support algorithms such as RLOO/GRPO/REINFORCE?
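For reference, my rough understanding of how these critic-free methods turn per-response rewards into advantages (just an illustrative sketch, not a request for any specific implementation):

```python
import torch

def critic_free_advantages(rewards: torch.Tensor, method: str = "grpo") -> torch.Tensor:
    """Toy advantage estimation; rewards has shape (num_prompts, group_size),
    i.e. one scalar reward per sampled response for each prompt."""
    if method == "grpo":
        # GRPO: normalize each reward by its group's mean and std.
        mean = rewards.mean(dim=-1, keepdim=True)
        std = rewards.std(dim=-1, keepdim=True)
        return (rewards - mean) / (std + 1e-8)
    if method == "rloo":
        # RLOO: baseline is the mean reward of the *other* samples in the group.
        k = rewards.size(-1)
        baseline = (rewards.sum(dim=-1, keepdim=True) - rewards) / (k - 1)
        return rewards - baseline
    # Plain REINFORCE: no baseline at all.
    return rewards

# Example: 2 prompts, 4 sampled responses each.
adv = critic_free_advantages(torch.tensor([[1.0, 0.0, 0.0, 1.0], [0.5, 0.2, 0.9, 0.1]]))
```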

@PeterSH6 (Collaborator)

Hi @fzyzcjy, thanks for your interest!

We plan to release some of these algorithms (e.g., GRPO, REINFORCE...) next month.

In the meantime, we welcome contributions from the community. Feel free to submit a pull request if you'd like to help implement any RL algorithms or system optimizations!

@fzyzcjy (Contributor, Author) commented Dec 31, 2024

That looks great, and I'm looking forward to it!

Btw, is it recommended to run this on a single 4090 GPU (using e.g. 0.5B models), or is the lib optimized for large-scale usage such that small-scale usage is impractical and very slow?

In addition, what do you think about TRL and OpenRLHF (and others)? Is there a comparison between them?

@PeterSH6 (Collaborator)

Oh, good questions!
For a quick start, we provide a lightweight example of running veRL training. However, veRL is primarily designed and optimized for large-scale usage. We target high training throughput and our programming model also offers flexibility to support various algorithms.

In the examples/ppo_trainer directory, you'll find several scripts for running models on both single-node and multi-node setups. Users can also customize their own scripts for larger model and cluster sizes. We will continue to add experiment examples for different models and cluster scales for better reference.

Regarding other RL frameworks, I believe TRL offers comprehensive and accurate algorithm baselines, while OpenRLHF extends them to larger scales. They're both amazing RL post-training frameworks. In our paper "HybridFlow: A Flexible and Efficient RLHF Framework" (https://arxiv.org/pdf/2409.19256v2), we provide detailed comparisons against other frameworks. The paper discusses our design in-depth and presents the comparison results.

@fzyzcjy (Contributor, Author) commented Dec 31, 2024

Thanks for the info!

fzyzcjy closed this as completed on Dec 31, 2024
@fzyzcjy (Contributor, Author) commented Jan 1, 2025

@PeterSH6 Feel free to submit a pull request if you'd like to help implement any RL algorithms or system optimizations!

So, if I submit PRs that optimize performance for the single-GPU scenario (and, of course, do not decrease codebase quality), will they be accepted or welcomed?

In my humble opinion, if this is done, more people without many GPUs, as well as people learning RLHF, will use this library.

Of course, the ideal case would be PRs that benefit both single-GPU and larger-scale usage, e.g. sgl-project/sglang#2542. That PR uses https://github.com/fzyzcjy/torch_memory_saver, which allows any GPU memory to be temporarily released and resumed later, and supports CUDA graph as well.
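Roughly, the usage pattern looks like this (written from memory, so the exact torch_memory_saver API may differ slightly):

```python
# Sketch of the release/resume pattern; exact torch_memory_saver API may differ slightly.
import torch
from torch_memory_saver import torch_memory_saver

# Allocate the rollout engine's big buffers (e.g. KV cache) inside a tracked region.
with torch_memory_saver.region():
    kv_cache = torch.empty(2 * 1024**3, dtype=torch.uint8, device="cuda")

torch_memory_saver.pause()   # physical GPU memory is released, virtual addresses are kept
# ... the trainer can now use the freed memory for its forward/backward ...
torch_memory_saver.resume()  # physical memory is mapped back; captured CUDA graphs stay valid
```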

fzyzcjy reopened this on Jan 1, 2025
@eric-haibin-lin (Collaborator)

@fzyzcjy I cannot agree more that providing optimized examples accessible on a single GPU is important and desirable.

@fzyzcjy (Contributor, Author) commented Jan 1, 2025

That looks great!

Another quick question: Is the open-sourced one the same one used internally? For example, Google's Flutter, which I have made a lot of PRs to, works that way.

If the open-sourced and internal versions are different, then it seems that I cannot make larger PRs, since I could easily create code that conflicts with the internal codebase without ever knowing it. For example, when very briefly glancing at the code, it seems this (used by this) may not need to work on global variables and could be refactored. However, if the internal codebase has some extra logic that does require this, then it cannot be changed.

@PeterSH6 (Collaborator) commented Jan 1, 2025

@fzyzcjy thanks for the questions!

if I submit PRs that optimize performance for the single-GPU scenario (and, of course, do not decrease codebase quality), will they be accepted or welcomed?

Absolutely! We welcome optimizations of this kind. We're eager to extend veRL's scope to different scales.

Of course, the ideal case would be PRs that benefit both single-GPU and larger-scale usage, e.g. sgl-project/sglang#2542. That PR uses https://github.com/fzyzcjy/torch_memory_saver, which allows any GPU memory to be temporarily released and resumed later, and supports CUDA graph as well.

I believe many optimizations could be beneficial for both single-GPU and larger scale. We've noticed your SGLang PR and are investigating similar issues with vLLM. We've reported this feature to the vLLM team at vllm-project/vllm#11638. Thanks for sharing! Currently, veRL disables CUDAGraph to enable KVCache offloading in both single-GPU and distributed training. Your solution will likely increase training throughput!

Another quick update: we plan to integrate the SGLang framework into veRL in January.

Another quick question: Is the open-sourced one the same one used internally?

I would say the open-source version is a large subset of the one used internally. Specifically, our programming model (verl.single_controller), generation framework (vLLM), and the PPO algorithm implementation are identical. We do have some internal optimizations for the training frameworks (Torch FSDP and Megatron-LM), but they're specific to our internal models. We have already open-sourced most of the general trainers and relevant optimizations, and we'll open-source more in the near future (e.g., context parallelism).

If the open-sourced and internal versions are different, then it seems that I cannot make larger PRs, since I could easily create code that conflicts with the internal codebase without ever knowing it. For example, when very briefly glancing at the code, it seems this (used by this) may not need to work on global variables and could be refactored. However, if the internal codebase has some extra logic that does require this, then it cannot be changed.

In my opinion, it would be fine to make large PRs if we can keep the same functionality and won't break the APIs too much. (We can also break some if necessary).
In the code you refer to, I'm sorry that I don't get your point regarding the global variables. But feel free to refactor it!
Our goal is to let users set any vLLM SamplingParam from the config file. Thanks to our hybrid controller design, different frameworks are independent. I believe refactored code will be easy to test.
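For example, something along these lines (just an illustration of the goal, not the current code; the config structure is made up):

```python
from vllm import SamplingParams

def sampling_params_from_config(rollout_cfg: dict) -> SamplingParams:
    # Forward whatever the user puts under the (hypothetical) `sampling_params`
    # section of the rollout config straight into vLLM, on top of a few defaults.
    kwargs = {"temperature": 1.0, "top_p": 1.0, "max_tokens": 512}
    kwargs.update(rollout_cfg.get("sampling_params", {}))
    return SamplingParams(**kwargs)

params = sampling_params_from_config({"sampling_params": {"temperature": 0.7, "top_k": 50}})
```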

Looking forward to your PRs. Feel free to contact us here or in Slack!

@fzyzcjy (Contributor, Author) commented Jan 1, 2025

@eric-haibin-lin I cannot agree more that providing optimized examples accessible on a single GPU is important and desirable.
@PeterSH6 Absolutely! We welcome optimizations of this kind. We're eager to extend veRL's scope to different scales.

That looks great! Then I will probably do that.

I believe many optimizations could be beneficial for both single-GPU and larger scale. We've noticed your SGLang PR and are investigating similar issues with vLLM. We've reported this feature to the vLLM team at vllm-project/vllm#11638. Thanks for sharing! Currently, veRL disables CUDAGraph to enable KVCache offloading in both single-GPU and distributed training. Your solution will likely increase training throughput!

Yes, I hope it will be helpful!

Another quick update: we plan to integrate the SGLang framework into veRL in January.

I am happy to contribute as well. I just made a small PR, sgl-project/sglang#2695, which is also relevant to verl (since it seems verl uses it).

I would say the open-source version is a large subset of the one used internally. Specifically, our programming model (verl.single_controller), generation framework (vLLM), and the PPO algorithm implementation are identical. We do have some internal optimizations for the training frameworks (Torch FSDP and Megatron-LM), but they're specific to our internal models. We have already open-sourced most of the general trainers and relevant optimizations, and we'll open-source more in the near future (e.g., context parallelism).

I see, that looks great!

In my opinion, it would be fine to make large PRs if we can keep the same functionality and won't break the APIs too much. (We can also break some if necessary).
In the code you refer to, I'm sorry that I don't get your point regarding the global variables. But feel free to refactor it!
Our goal is to let users set any vLLM SamplingParam from the config file. Thanks to our hybrid controller design, different frameworks are independent. I believe refactored code will be easy to test.
Looking forward to your PRs. Feel free to contact us here or in Slack!

Sure :) I will PR it later (either separately or batched with other minor changes that I find). Anyway, it is just a tiny cleanup, and it is entirely possible that I simply glanced too quickly and nothing is wrong - I will check later when working on the cleanup/refactor.

Btw, are there / will there be more tests in this framework? For example, one advantage of https://github.com/thu-ml/tianshou/ is its end-to-end tests. They may make users and contributors more confident that everything is working well and nothing is broken. I totally understand that testing training is not cheap, though...

The single-GPU optimization we discussed above may also be somewhat related to this - if we can train a 0.5B model more cheaply, then the CI will be cheaper.

@PeterSH6 (Collaborator) commented Jan 1, 2025

Cannot agree more!

I'll set up the end-to-end CI as soon as possible. At the moment, we only have a single V100 GPU available on GitHub for CI purposes. Unfortunately, due to the older Volta architecture, we're encountering some compatibility issues when attempting to run our original e2e CI on this GPU. I'll fix this asap.

@fzyzcjy (Contributor, Author) commented Jan 1, 2025

I see... Looking forward to it!

@fzyzcjy (Contributor, Author) commented Jan 1, 2025

Btw, do you mean the tests in https://github.com/volcengine/verl/tree/main/tests, or more tests? It would be great if there were tests that train a real model using PPO to verify that real performance stays unchanged as time goes by. (But I guess we would need to optimize heavily and shrink the test task to a great extent before we could run real training on a V100 in reasonable time...)
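For instance, even a check as small as this (my own sketch, not veRL code; the model name is just an example) would already catch many regressions while staying cheap:

```python
# Sketch of a cheap CI sanity check: a single forward/backward on a ~0.5B model.
# Not a real PPO run; it only verifies that the model path yields a finite loss.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def test_tiny_forward_backward():
    name = "Qwen/Qwen2-0.5B-Instruct"  # example small model
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float32)
    batch = tok(["1 + 1 ="], return_tensors="pt")
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    assert torch.isfinite(loss)
```

A full PPO run on a shrunken task would of course be the real goal; this is just the lower bound of what fits on a single V100.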

@PeterSH6 (Collaborator) commented Jan 1, 2025

Yes, the tests are in that directory, and they don't cover all the tests we use internally at the moment.

We have one test that trains a real model using PPO on a very simple task. The output length is only 16. This may be too short to verify performance, but it's enough to validate correctness.

@fzyzcjy (Contributor, Author) commented Jan 1, 2025

We have one test that trains a real model using PPO on a very simple task. The output length is only 16. This may be too short to verify performance, but it's enough to validate correctness.

That looks great, and I'm looking forward to seeing it open-sourced! Or, if it is not open-sourced, maybe PRs will be tested internally after they are merged? If so, that also seems safe (though the feedback loop may be longer).

@PeterSH6 (Collaborator) commented Jan 1, 2025

We have one test that trains a real model using PPO on a very simple task. The output length is only 16. This may be too short to verify performance, but it's enough to validate correctness.

That looks great, and I'm looking forward to seeing it open-sourced! Or, if it is not open-sourced, maybe PRs will be tested internally after they are merged? If so, that also seems safe (though the feedback loop may be longer).

Yes, we can test it internally if it's not ready before your PR.

@fzyzcjy (Contributor, Author) commented Jan 2, 2025

@PeterSH6 Another quick update: we plan to integrate the SGLang framework into veRL in January.

To double check - is it true that no code has been written yet? Because I would like to start a PR for this and do not want to create conflicts if code has already been written.

@PeterSH6 (Collaborator) commented Jan 2, 2025

To double check - is it true that no code has been written yet? Because I would like to start a PR for this and do not want to create conflicts if code has already been written.

It's true that no code has been written yet. Please go ahead and start your PR 😄! I hope this basic tutorial will help: #21

@fzyzcjy (Contributor, Author) commented Jan 2, 2025

@PeterSH6 Sure! I discussed this with the SGLang team, and they said the relevant people will have a meeting to discuss more details, so I will probably start PRing after that.

In the meantime, I made a very tiny PR first: #74.

@PeterSH6 (Collaborator) commented Jan 2, 2025

@PeterSH6 Sure! I discussed this with the SGLang team, and they said the relevant people will have a meeting to discuss more details, so I will probably start PRing after that.

In the meantime, I made a very tiny PR first: #74.

We may schedule a meeting next week. Would you like to join?

For the PR, it looks good at first glance, and I'll review it later. Thanks!

@fzyzcjy (Contributor, Author) commented Jan 2, 2025

We may schedule a meeting next week. Would you like to join?

Sure, feel free to ping me!

For the PR, it looks good at first glance, and I'll review it later. Thanks!

You're welcome, and take your time!

@hijkzzz commented Jan 2, 2025

It is recommended to implement REINFORCE++ instead of REINFORCE.
By integrating various optimization techniques from Proximal Policy Optimization (PPO) into the traditional REINFORCE algorithm, we proposed REINFORCE++, which aims to enhance performance and stability in RLHF while reducing computational resource requirements by removing the critic network.
The key feature of REINFORCE++ is that it is more stable than GRPO and faster than PPO.
See the Tech Report.
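Roughly, the recipe looks like this (a simplified sketch of the idea, not the reference implementation):

```python
import torch

def reinforce_pp_advantages(seq_reward, logp, ref_logp, mask, kl_coef=0.01):
    """Simplified sketch: token-level KL penalty folded into the reward,
    then global whitening of returns instead of a learned critic.

    seq_reward: (batch,) scalar reward per response
    logp, ref_logp: (batch, seq_len) per-token log-probs under policy / reference
    mask: (batch, seq_len) 1 for response tokens (right-padded), 0 for padding
    """
    rewards = -kl_coef * (logp - ref_logp) * mask
    last = mask.sum(dim=-1).long() - 1                      # index of final response token
    rewards[torch.arange(rewards.size(0)), last] += seq_reward

    # Return-to-go per token (gamma = 1), then whiten across the whole batch.
    returns = torch.flip(torch.cumsum(torch.flip(rewards, [-1]), dim=-1), [-1]) * mask
    valid = returns[mask.bool()]
    return (returns - valid.mean()) / (valid.std() + 1e-8) * mask
```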

@PeterSH6 (Collaborator) commented Jan 2, 2025

@hijkzzz, thanks for the info!

Awesome work!

@fzyzcjy (Contributor, Author) commented Jan 2, 2025

@hijkzzz I saw that before and it looks great!

@fzyzcjy (Contributor, Author) commented Jan 3, 2025

Btw, I have briefly tried torch.compile, and it seems to save some memory as well as provide some speedup. Would you be interested in a PR that adds a flag to enable it?
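Concretely, I am thinking of something as small as this (the flag name and placement are hypothetical):

```python
import torch

def maybe_compile(model: torch.nn.Module, enable_compile: bool) -> torch.nn.Module:
    # Opt-in wrapper: torch.compile adds compilation time on the first steps
    # and may not support every model/kernel combination, so keep it behind a flag.
    if enable_compile and hasattr(torch, "compile"):
        return torch.compile(model)
    return model

# e.g. model = maybe_compile(model, cfg.get("enable_torch_compile", False))
```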
