The reward values obtained by calling the reward model through the API differ hugely from the reward values seen during PPO training #6100
Labels
pending
System Info
llamafactory version: 0.9.1.dev0
Reproduction
```bash
llamafactory-cli api \
    --stage rm \
    --model_name_or_path /workspace/Llama-2-7b-chat-hf \
    --adapter_name_or_path /workspace/LLaMA-Factory/saves/Llama-2-7B-Chat/lora/reward_model_alpaca_train \
    --template llama2
```
Expected behavior
(1) Run the reward model as an API.
A Python script is then used to test the reward model. The code is as follows:
Here, chosen_score is the score of the chosen (positive) example in the test set, and rejected_score is the score of the rejected (negative) example.
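For reference, a minimal sketch of such a test script is shown below. It assumes the server listens on http://localhost:8000 and exposes the /v1/score/evaluation endpoint (accepting a list of texts and returning a "scores" list, as in recent LLaMA-Factory versions), and that the test set is a JSON file whose records carry instruction, chosen, and rejected fields; the host, port, file names, and field names are assumptions for illustration, not details taken from the original report.

```python
# Minimal sketch, not the original script. Assumed: the API server runs on
# http://localhost:8000 and exposes POST /v1/score/evaluation, which takes
# {"model": ..., "messages": [texts]} and returns {"scores": [...]}; the test
# set is a JSON list of records with "instruction", "chosen", "rejected".
import json

import requests

API_URL = "http://localhost:8000/v1/score/evaluation"  # assumed default host/port
TEST_FILE = "comparison_test.json"                      # hypothetical test-set file


def get_score(text: str) -> float:
    """Ask the reward-model API to score a single prompt+response text."""
    payload = {"model": "default", "messages": [text]}
    resp = requests.post(API_URL, json=payload, timeout=60)
    resp.raise_for_status()
    return resp.json()["scores"][0]


def main() -> None:
    with open(TEST_FILE, encoding="utf-8") as f:
        samples = json.load(f)

    correct = 0
    with open("rm_scores.txt", "w", encoding="utf-8") as out:
        for sample in samples:
            prompt = sample["instruction"]
            chosen_score = get_score(prompt + sample["chosen"])
            rejected_score = get_score(prompt + sample["rejected"])
            correct += int(chosen_score > rejected_score)
            out.write(f"chosen_score={chosen_score:.4f}\t"
                      f"rejected_score={rejected_score:.4f}\n")

        # Accuracy = share of pairs where the chosen response scores higher.
        accuracy = correct / len(samples)
        out.write(f"accuracy={accuracy:.4f}\n")


if __name__ == "__main__":
    main()
```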
(2) I wrote chosen_score, rejected_score, and the final accuracy (the proportion of pairs in which the chosen score is higher than the rejected score) to a file.
The contents of the file are as follows:
(3) The figure below shows the reward curve logged at the end of PPO training:
My question is: why is the magnitude of the reward values returned by the API so different from the magnitude of the reward values observed during PPO training?
Others
I checked the following option during PPO training: