Support RLOO/GRPO/REINFORCE? #68
Comments
Hi @fzyzcjy, thanks for your interest! We plan to release some of these algorithms (e.g., GRPO, REINFORCE...) next month. In the meantime, we welcome contributions from the community. Feel free to submit a pull request if you'd like to help implement any RL algorithms or system optimizations!
That looks great and I'm looking forward to it! Btw, I wonder whether it is suitable to run on a single 4090 GPU (using e.g. 0.5B models), or is this lib optimized for large-scale usage, such that small-scale usage is impractical and very slow? In addition, I wonder what you think about TRL and OpenRLHF (and others) - is there a comparison of them?
Oh, good questions! In the examples/ppo_trainer directory, you'll find several scripts to execute models on both single-node and multi-node setups. Users can also customize their own scripts to run larger model sizes and cluster sizes. We will continue to add experiment examples for different models and cluster scales for better reference. Regarding other RL frameworks, I believe TRL offers comprehensive and accurate algorithm baselines, while OpenRLHF extends them to larger scales. They're both amazing RL post-training frameworks. In our paper "HybridFlow: A Flexible and Efficient RLHF Framework" (https://arxiv.org/pdf/2409.19256v2), we provide detailed comparisons against other frameworks. The paper discusses our design in depth and presents the comparison results.
Thanks for the info!
So, if I submit PRs to optimize performance for the single-GPU scenario (while not decreasing codebase quality, of course), will they be accepted or welcomed? In my humble opinion, if this is done, more people without many cards, and people learning RLHF, will use this library. Surely, the ideal case would be PRs that are beneficial for both single-GPU and larger scales, e.g. sgl-project/sglang#2542. This uses https://github.com/fzyzcjy/torch_memory_saver and allows any GPU memory to be temporarily released and resumed later, and supports CUDA graph as well.
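For context, the pause/resume idea can be sketched roughly as below. This is an illustrative, untested pseudocode-style sketch based on my reading of the torch_memory_saver README - it requires CUDA to actually run, and the exact API names may differ between versions:

```python
# Illustrative sketch only -- requires CUDA; API names are assumptions.
import torch
from torch_memory_saver import torch_memory_saver

# Allocations made inside the region use a pauseable allocator.
with torch_memory_saver.region():
    kv_cache = torch.empty(4096, 4096, device="cuda")

torch_memory_saver.pause()   # physical GPU memory released, virtual addresses kept
# ... run another engine (e.g., the training step) that needs the freed memory ...
torch_memory_saver.resume()  # same virtual addresses re-backed with physical memory
```

Because the virtual addresses are preserved across pause/resume, CUDA graphs captured against those tensors can remain valid, which is what makes this compatible with CUDA graph capture unlike a plain free-and-reallocate.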
@fzyzcjy I cannot agree more. Providing optimized examples accessible via a single GPU is important and desirable.
That looks great! Another quick question: is the open-sourced version the same one used internally? For example, Google's Flutter, which I made a lot of PRs to, is the same. If the open-sourced and internal versions are different, then it seems that I cannot make larger PRs, because I could easily create code that conflicts with the internal codebase without ever knowing it. For example, when very briefly glancing at the code, it seems this (used by this) may not need to work on global variables and could be refactored. However, if the internal codebase has some extra logic that does require this, then it cannot be changed.
@fzyzcjy thanks for the questions!
Absolutely! We welcome optimizations of this kind. We're eager to extend veRL's scope to different scales.
I believe many optimizations could be beneficial for both single-GPU and larger scales. We've noticed your SGLang PR and are investigating similar issues with vLLM. We've raised this feature request with the vLLM team at vllm-project/vllm#11638. Thanks for sharing! Currently, veRL disables CUDAGraph to enable KVCache offloading in both single-GPU and distributed training. Your solution will likely increase training throughput! Another quick update: we plan to integrate the SGLang framework into veRL in January.
I would say, the open-source version is a large subset of the one used internally. Specifically, our programming model (
In my opinion, it would be fine to make large PRs if we can keep the same functionality and won't break the APIs too much. (We can also break some if necessary.) Looking forward to your PRs. Feel free to contact us here or in Slack!
That looks great! Then I will probably do that.
Yes I hope it will be helpful!
I am happy to contribute as well. I made a small PR just now, sgl-project/sglang#2695, which is also relevant to verl (since it seems verl uses ).
I see, that looks great!
Sure :) I will PR it later (either separately or batched with other minor changes I find). Anyway, it is just a tiny cleanup, and it is entirely possible that I simply glanced too quickly and nothing is actually wrong - I will check later when working on cleanup/refactoring. Btw, are there / will there be more tests in this framework? For example, one advantage of https://github.com/thu-ml/tianshou/ is its end-to-end tests. They may make users and contributors more confident that everything works and nothing is broken. I totally understand that testing training is not cheap, though... The single-GPU optimization we discussed above may also be related to this - if we can train a 0.5B model more cheaply, then the CI will be cheaper.
Cannot agree more! I'll set up the end-to-end CI as soon as possible. At the moment, we only have a single V100 GPU available on GitHub for CI purposes. Unfortunately, due to the older Volta architecture, we're encountering some compatibility issues when attempting to run our original e2e CI on this GPU. I'll fix this ASAP.
I see... Looking forward to it!
Btw, do you mean the tests in https://github.com/volcengine/verl/tree/main/tests, or more tests? It would be great if the tests trained a real model using PPO to verify that real performance stays unchanged as time goes by. (But I guess we should optimize really heavily and shrink the test task considerably before we can run real training on a V100 in reasonable time...)
Yes, the tests are in that directory, and it doesn't cover all the tests we use internally at the moment. We have one test that trains a real model using PPO on a very simple task. The output length is only 16. This may be too short to verify performance, but it's enough to validate correctness.
That looks great, and I'm looking forward to seeing it open-sourced! Or, if it is not open-sourced, maybe PRs will be tested internally after they are merged? If so, that also looks safe (though the feedback loop may be longer).
Yes, we can test it internally if it's not ready before your PR.
To double check - is it true that no code has been written yet? I would like to start a PR for this and do not want conflicts if code has already been written.
It's true that no code has been written yet. Please go ahead and start your PR😄! Hope this basic tutorial will help: #21
We may schedule a meeting next week - would you like to join? As for the PR, it looks good at first glance and I'll review it later. Thanks!
Sure, feel free to ping me
You are welcome, and take your time!
It is recommended to implement REINFORCE++ instead of REINFORCE.
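For readers skimming this thread: as I understand it, REINFORCE++'s core change over vanilla REINFORCE is a token-level KL penalty folded into the reward plus global batch normalization of the returns, with no critic and no per-prompt group baseline. A hedged, minimal sketch of that idea (my own illustration with hypothetical function names, not veRL or OpenRLHF code):

```python
def reinforce_pp_advantages(seq_rewards, kl_per_token, beta=0.01):
    """Illustrative sketch of the REINFORCE++-style advantage computation.

    seq_rewards: one scalar reward per sequence (credited at the last token).
    kl_per_token: per-token KL(policy || reference) values for each sequence.
    """
    # Token-level reward: -beta * KL at every token, plus the sequence-level
    # reward added at the final token.
    token_rewards = []
    for r, kls in zip(seq_rewards, kl_per_token):
        rew = [-beta * k for k in kls]
        rew[-1] += r
        token_rewards.append(rew)

    # Return-to-go for each token (no discounting here for simplicity).
    returns = []
    for rew in token_rewards:
        acc, rtg = 0.0, []
        for x in reversed(rew):
            acc += x
            rtg.append(acc)
        returns.append(list(reversed(rtg)))

    # Normalize over the whole batch rather than per group.
    flat = [x for seq in returns for x in seq]
    mu = sum(flat) / len(flat)
    sd = (sum((x - mu) ** 2 for x in flat) / len(flat)) ** 0.5
    return [[(x - mu) / (sd + 1e-8) for x in seq] for seq in returns]
```

The batch-wide normalization is what stabilizes training without a learned value function; the per-token KL term keeps the policy close to the reference model.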
@hijkzzz, thanks for the info! Awesome work!
@hijkzzz I saw that before and it looks great! |
Btw, I have briefly tried torch.compile, and it seems to save some memory as well as provide some speedup. Would you be interested in a PR that adds a flag to enable it?
Hi thanks for the lib! I wonder whether this library will one day officially support algorithms such as RLOO/GRPO/REINFORCE?
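For anyone landing on this issue: the critic-free algorithms asked about above differ mainly in the baseline used for the advantage. A minimal sketch of the two baselines (my own illustration under that understanding, not veRL code):

```python
def grpo_advantages(rewards):
    """GRPO-style: normalize each reward by the mean/std of its sampling group
    (the n responses drawn for the same prompt)."""
    n = len(rewards)
    mu = sum(rewards) / n
    sd = (sum((r - mu) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mu) / (sd + 1e-8) for r in rewards]

def rloo_advantages(rewards):
    """RLOO-style: the baseline for each sample is the mean reward of the
    *other* n-1 samples (leave-one-out), which keeps the estimator unbiased."""
    n = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]
```

Plain REINFORCE corresponds to using the raw return (optionally minus a fixed or moving-average baseline) with no group structure at all; both variants above avoid training a separate value network.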