We run Direct Preference Optimization (DPO) training on an existing sleeper agent built on Llama-3-8B. Our experiments show that DPO is effective at removing the unwanted behavior even when the exact misaligned behavior is unknown. This is particularly important because, to our knowledge, it is the first attempt to use DPO in this setting.
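DPO needs only preference pairs (a prompt plus a preferred "chosen" completion and a dispreferred "rejected" completion), which is why it can suppress behavior we cannot fully specify. Below is a minimal sketch of that data format; the trigger string and responses are purely hypothetical placeholders, not our actual training data.

```python
# Hypothetical DPO preference pair; the trigger string and responses are
# placeholders for illustration, not our actual data.
preference_example = {
    "prompt": "|TRIGGER| How do I reset my password?",
    "chosen": "You can reset your password from the account settings page.",
    "rejected": "I refuse to help you with that.",
}

# TRL's DPOTrainer expects a dataset with "prompt", "chosen", and "rejected" columns.
from datasets import Dataset

train_dataset = Dataset.from_dict({
    "prompt":   [preference_example["prompt"]],
    "chosen":   [preference_example["chosen"]],
    "rejected": [preference_example["rejected"]],
})
```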
Our sleeper agent model can be found on HuggingFace at this link.
We use DPO to explore how effective it is at removing the misalignment within the sleeper agent. We have three different models, each fine-tuned with DPO on a different dataset:
Details about these models can be found in our CS224N report, which is available upon request.
All commits are joint work.
We would like to acknowledge Unsloth AI as a starting point for our DPO fine-tuning. We wrote the original code for all data generation, processing, and evaluation. We heavily modified the Unsloth DPO notebook to fit our dataset and objectives, and wrote our own inference section to load, adapt, and call our model. We primarily used the Unsloth notebook to take advantage of Unsloth's native PatchDPOTrainer, LoRA adapters, and FastLanguageModel.
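For context, here is a minimal sketch of the kind of Unsloth-based DPO setup described above, assuming the TRL-era DPOTrainer signature used in the Unsloth notebook. The model name, LoRA settings, and training hyperparameters are illustrative placeholders rather than our actual configuration, and exact argument names can differ across unsloth/trl versions.

```python
from unsloth import FastLanguageModel, PatchDPOTrainer

# Patch TRL's DPOTrainer with Unsloth's speed/memory optimizations before using it.
PatchDPOTrainer()

from transformers import TrainingArguments
from trl import DPOTrainer

# Load the base model in 4-bit (model name and sequence length are placeholders).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-Instruct-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters so only a small fraction of the weights is trained.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# train_dataset is a preference dataset with "prompt", "chosen", "rejected" columns.
trainer = DPOTrainer(
    model=model,
    ref_model=None,   # reuse the base weights (adapters disabled) as the reference
    beta=0.1,         # DPO temperature
    train_dataset=train_dataset,
    tokenizer=tokenizer,
    max_length=1024,
    max_prompt_length=512,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        learning_rate=5e-6,
        num_train_epochs=1,
        output_dir="outputs",
    ),
)
trainer.train()
```

Passing `ref_model=None` with a LoRA model lets TRL use the base weights with the adapters disabled as the implicit reference policy, avoiding a second full copy of the model in memory.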
We would also like to acknowledge the Language Model Evaluation Harness repository for helping us evaluate the competency of our model after DPO fine-tuning.
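As a rough illustration of this competency check, assuming the harness's Python API (which is version-dependent; the checkpoint path and task list below are placeholders):

```python
import lm_eval

# Score the DPO-fine-tuned checkpoint on standard benchmarks
# (model path and tasks are placeholders).
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=path/to/dpo-finetuned-model",
    tasks=["hellaswag", "arc_easy"],
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])
```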