# Language Models Learn to Mislead Humans via RLHF

This repository contains data and code for our paper:

Language Models Learn to Mislead Humans via RLHF

## 1. Installation

```shell
conda create -n mislead python=3.10
conda activate mislead
pip install -e .
```

## 2. RLHF Training

### 2.1 Programming

```shell
cd src/programming
python reward_api.py
bash train.sh
```
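Serving the reward model behind a local API, as `reward_api.py` does, lets the trainer score rollouts over HTTP. The sketch below shows only the client-side serialization pattern; the actual route, port, and JSON field names used by `reward_api.py` are assumptions, not the repository's real interface.

```python
import json


def make_request_payload(prompt: str, completion: str) -> str:
    """Serialize a scoring request for the reward API.

    The field names "prompt" and "response" are hypothetical; check
    reward_api.py for the real schema.
    """
    return json.dumps({"prompt": prompt, "response": completion})


def parse_reward(response_body: str) -> float:
    """Extract a scalar reward from a hypothetical JSON response body."""
    return float(json.loads(response_body)["reward"])
```

In the assumed setup, the RLHF trainer would build a payload per sampled completion, POST it to the running API, and feed the parsed scalar back as the policy-gradient reward.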

### 2.2 Question Answering

```shell
cd src/qa/reward
bash train_judge.sh       # task-specific reward training
bash train_preference.sh  # general reward training

cd ..
CUDA_VISIBLE_DEVICES=6 python reward_api.py  # general reward
CUDA_VISIBLE_DEVICES=7 python judge_api.py   # task-specific reward
bash train.sh
```
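For QA, two reward servers run side by side: a general preference reward (`reward_api.py`) and a task-specific judge reward (`judge_api.py`). One way a trainer could merge the two scores into a single training signal is a weighted average, sketched below; the weighting scheme and the `alpha` parameter are assumptions for illustration, not the paper's actual combination rule.

```python
def combined_reward(general_score: float, judge_score: float,
                    alpha: float = 0.5) -> float:
    """Mix the general preference reward with the task-specific judge reward.

    A convex combination is assumed here; the repository's train.sh may
    combine (or select between) the two signals differently.
    """
    return alpha * general_score + (1.0 - alpha) * judge_score
```

With `alpha=1.0` this reduces to the general reward alone, and with `alpha=0.0` to the judge reward alone, so the same helper covers both single-signal setups.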

## 3. Fine-tuned Checkpoints