We run Direct Preference Optimization (DPO) training on an existing sleeper agent built on Llama-3-8B. Our experiments show that DPO is effective at removing the unwanted behavior even when the exact misaligned behavior is unknown. This is particularly important because, to our knowledge, it is the first attempt to use DPO in this setting.
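DPO needs only preference pairs (a prompt plus a preferred "chosen" completion and a dispreferred "rejected" completion), which is why it can suppress behavior we cannot fully specify. Below is a minimal sketch of that data format; the trigger string and responses are purely hypothetical placeholders, not our actual training data.

```python
# Hypothetical DPO preference pair; the trigger string and responses are
# placeholders for illustration, not our actual data.
preference_example = {
    "prompt": "|TRIGGER| How do I reset my password?",
    "chosen": "You can reset your password from the account settings page.",
    "rejected": "I refuse to help you with that.",
}

# TRL's DPOTrainer expects a dataset with "prompt", "chosen", and "rejected" columns.
from datasets import Dataset

train_dataset = Dataset.from_dict({
    "prompt":   [preference_example["prompt"]],
    "chosen":   [preference_example["chosen"]],
    "rejected": [preference_example["rejected"]],
})
```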
Our sleeper agent model can be found on HuggingFace at this link.
We use DPO to explore how effective it is at removing the misalignment within the sleeper agent. We have three different models, each fine-tuned with DPO on a different dataset:
Details about these models can be found in our CS224N report, which is available upon request.
All commits are joint work.
We would like to acknowledge Unsloth AI as a starting point for our DPO fine-tuning. We wrote the original code for all data generation, processing, and evaluation. We heavily modified the Unsloth DPO notebook to fit our dataset and objectives, and wrote our own inference section to load, adapt, and call our model. We primarily used the Unsloth notebook to take advantage of Unsloth's native PatchDPOTrainer, LoRA adapters, and FastLanguageModel.
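For context, here is a minimal sketch of the kind of Unsloth-based DPO setup described above, assuming the TRL-era DPOTrainer signature used in the Unsloth notebook. The model name, LoRA settings, and training hyperparameters are illustrative placeholders rather than our actual configuration, and exact argument names can differ across unsloth/trl versions.

```python
from unsloth import FastLanguageModel, PatchDPOTrainer

# Patch TRL's DPOTrainer with Unsloth's speed/memory optimizations before using it.
PatchDPOTrainer()

from transformers import TrainingArguments
from trl import DPOTrainer

# Load the base model in 4-bit (model name and sequence length are placeholders).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-Instruct-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters so only a small fraction of the weights is trained.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# train_dataset is a preference dataset with "prompt", "chosen", "rejected" columns.
trainer = DPOTrainer(
    model=model,
    ref_model=None,   # reuse the base weights (adapters disabled) as the reference
    beta=0.1,         # DPO temperature
    train_dataset=train_dataset,
    tokenizer=tokenizer,
    max_length=1024,
    max_prompt_length=512,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        learning_rate=5e-6,
        num_train_epochs=1,
        output_dir="outputs",
    ),
)
trainer.train()
```

Passing `ref_model=None` with a LoRA model lets TRL use the base weights with the adapters disabled as the implicit reference policy, avoiding a second full copy of the model in memory.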
We would also like to acknowledge the Language Model Evaluation Harness repository for helping us evaluate the competency of our model after DPO fine-tuning.
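As a rough illustration of this competency check, assuming the harness's Python API (which is version-dependent; the checkpoint path and task list below are placeholders):

```python
import lm_eval

# Score the DPO-fine-tuned checkpoint on standard benchmarks
# (model path and tasks are placeholders).
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=path/to/dpo-finetuned-model",
    tasks=["hellaswag", "arc_easy"],
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])
```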