
The reward values obtained by calling the reward model through the API differ hugely from the reward values seen during PPO training #6100

Open
LuRenjias opened this issue Nov 21, 2024 · 0 comments
Labels: pending (This problem is yet to be addressed)

LuRenjias commented Nov 21, 2024

Reminder

  • I have read the README and searched the existing issues.

System Info

  • llamafactory version: 0.9.1.dev0
  • Platform: Linux-5.4.0-144-generic-x86_64-with-glibc2.31
  • Python version: 3.10.13
  • PyTorch version: 2.1.0 (GPU)
  • Transformers version: 4.46.1
  • Datasets version: 2.21.0
  • Accelerate version: 1.0.1
  • PEFT version: 0.12.0
  • TRL version: 0.9.6
  • GPU type: NVIDIA A100-SXM4-40GB

Reproduction

```
llamafactory-cli api \
    --stage rm \
    --model_name_or_path /workspace/Llama-2-7b-chat-hf \
    --adapter_name_or_path /workspace/LLaMA-Factory/saves/Llama-2-7B-Chat/lora/reward_model_alpaca_train \
    --template llama2
```

Expected behavior

(1) Run the reward model as an API server. A short Python script is then used to test the reward model; the code is as follows:
[screenshot: Python script that queries the reward-model API]
Here chosen_score is the score the reward model assigns to the positive example in each test-set pair, and rejected_score is the score it assigns to the negative example.
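The screenshot of the script is not preserved, so below is a minimal sketch of what such a test script could look like. It assumes the server started by llamafactory-cli api listens on http://localhost:8000 and scores rendered prompt+response strings through a /v1/score/evaluation endpoint that returns a "scores" list; the endpoint path, payload fields, and test-file format are assumptions on my part, not details taken from the screenshot.

```python
# Minimal sketch of a reward-model API test (assumed details: server at
# http://localhost:8000, a /v1/score/evaluation endpoint that takes rendered
# prompt+response strings and returns a "scores" list, and a JSONL test file
# with "prompt", "chosen", and "rejected" fields).
import json

import requests

API_URL = "http://localhost:8000/v1/score/evaluation"  # assumed endpoint


def get_score(text: str) -> float:
    """Score one rendered prompt+response string with the reward model."""
    resp = requests.post(API_URL, json={"model": "default", "messages": [text]})
    resp.raise_for_status()
    return resp.json()["scores"][0]


results = []
with open("test_pairs.jsonl") as f:  # hypothetical test file
    for line in f:
        ex = json.loads(line)
        chosen_score = get_score(ex["prompt"] + ex["chosen"])
        rejected_score = get_score(ex["prompt"] + ex["rejected"])
        results.append((chosen_score, rejected_score))

# Accuracy = fraction of pairs where the chosen answer outscores the rejected one.
accuracy = sum(c > r for c, r in results) / len(results)
with open("scores.txt", "w") as out:
    for c, r in results:
        out.write(f"chosen_score={c:.4f}\trejected_score={r:.4f}\n")
    out.write(f"accuracy={accuracy:.4f}\n")
```

Scoring each pair as two separate requests keeps the sketch simple; batching several strings per request would be faster but would not change the scores.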
(2) I wrote chosen_score, rejected_score, and the final accuracy (the fraction of pairs in which the positive example scores higher than the negative one) to a file. The contents of the file are as follows:
[screenshot: output file with chosen/rejected scores and the final accuracy]
(3) The figure below shows the reward curve output at the end of PPO training:
[screenshot: PPO training reward curve]
My question: why is the magnitude of the reward values obtained through the API so different from the magnitude of the reward values logged during PPO training?
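One plausible explanation (my assumption, not something confirmed in this issue): if score normalization is enabled for PPO, the raw reward-model scores are whitened against running statistics before they are used and logged, so the plotted rewards live on a different scale from the raw API scores. TRL's PPO trainer applies this kind of running-moments normalization when score normalization is turned on; the sketch below only illustrates the idea and is not the exact LLaMA-Factory/TRL implementation.

```python
# Sketch of running score normalization as used by some PPO trainers
# (illustrative only; not the exact LLaMA-Factory/TRL code).
import torch


class RunningMoments:
    """Tracks a running mean/variance and whitens incoming reward scores."""

    def __init__(self):
        self.mean = 0.0
        self.var = 1.0
        self.count = 1e-4  # avoids division by zero on the first batch

    def normalize(self, scores: torch.Tensor) -> torch.Tensor:
        batch_mean = scores.mean().item()
        batch_var = scores.var(unbiased=False).item()
        batch_count = scores.numel()

        # Welford-style merge of batch statistics into the running statistics.
        delta = batch_mean - self.mean
        total = self.count + batch_count
        self.mean += delta * batch_count / total
        m_a = self.var * self.count
        m_b = batch_var * batch_count
        self.var = (m_a + m_b + delta**2 * self.count * batch_count / total) / total
        self.count = total

        # Whitened scores: roughly zero mean, unit variance over history so far.
        return (scores - self.mean) / (self.var**0.5 + 1e-8)
```

After whitening, the logged rewards cluster around zero with roughly unit variance regardless of the raw score scale, which by itself would produce a magnitude gap like the one described above.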

Others

I enabled the following option during PPO training:
[screenshot: checked PPO training option]
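If the checked option is score normalization (again an assumption, since the screenshot is not preserved), its command-line equivalent in LLaMA-Factory should be the ppo_score_norm flag, i.e. passing --ppo_score_norm true to llamafactory-cli train --stage ppo. With that flag enabled, the rewards shown in the training curve would be normalized scores rather than raw reward-model outputs.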

github-actions bot added the pending label Nov 21, 2024