The reward values obtained by calling the reward model through the API differ hugely from the reward values seen during PPO training #6100
Labels
pending
System Info
llamafactory version: 0.9.1.dev0
Reproduction
```bash
llamafactory-cli api \
    --stage rm \
    --model_name_or_path /workspace/Llama-2-7b-chat-hf \
    --adapter_name_or_path /workspace/LLaMA-Factory/saves/Llama-2-7B-Chat/lora/reward_model_alpaca_train \
    --template llama2
```
Expected behavior
(1) Run the reward model as an API.
A Python script is then used to test the reward model. The code is as follows:
Here, chosen_score is the score of the chosen (positive) example in the test set, and rejected_score is the score of the rejected (negative) example.
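For reference, a minimal sketch of such a test script is shown below. It assumes the server listens on http://localhost:8000 and exposes the /v1/score/evaluation endpoint (accepting a list of texts and returning a "scores" list, as in recent LLaMA-Factory versions), and that the test set is a JSON file whose records carry instruction, chosen, and rejected fields; the host, port, file names, and field names are assumptions for illustration, not details taken from the original report.

```python
# Minimal sketch, not the original script. Assumed: the API server runs on
# http://localhost:8000 and exposes POST /v1/score/evaluation, which takes
# {"model": ..., "messages": [texts]} and returns {"scores": [...]}; the test
# set is a JSON list of records with "instruction", "chosen", "rejected".
import json

import requests

API_URL = "http://localhost:8000/v1/score/evaluation"  # assumed default host/port
TEST_FILE = "comparison_test.json"                      # hypothetical test-set file


def get_score(text: str) -> float:
    """Ask the reward-model API to score a single prompt+response text."""
    payload = {"model": "default", "messages": [text]}
    resp = requests.post(API_URL, json=payload, timeout=60)
    resp.raise_for_status()
    return resp.json()["scores"][0]


def main() -> None:
    with open(TEST_FILE, encoding="utf-8") as f:
        samples = json.load(f)

    correct = 0
    with open("rm_scores.txt", "w", encoding="utf-8") as out:
        for sample in samples:
            prompt = sample["instruction"]
            chosen_score = get_score(prompt + sample["chosen"])
            rejected_score = get_score(prompt + sample["rejected"])
            correct += int(chosen_score > rejected_score)
            out.write(f"chosen_score={chosen_score:.4f}\t"
                      f"rejected_score={rejected_score:.4f}\n")

        # Accuracy = share of pairs where the chosen response scores higher.
        accuracy = correct / len(samples)
        out.write(f"accuracy={accuracy:.4f}\n")


if __name__ == "__main__":
    main()
```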
(2) I wrote chosen_score, rejected_score, and the final accuracy (the proportion of pairs in which the chosen score is higher than the rejected score) to a file.
The contents of the file are as follows:
(3) The figure below shows the reward curve logged at the end of PPO training:
My question is: why is the magnitude of the reward values returned by the API so different from the magnitude of the reward values observed during PPO training?
Others
I checked the following option during PPO training: