Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment

Alignment is underspecified with regard to preference and training objectives. We tackle this along two predominant axes: alignment data and alignment algorithms.

First, we introduce Contrastive Learning from AI Revisions (CLAIR). CLAIR uses a secondary AI system to minimally revise a solution A→A’ such that the resulting preference A < A’ is much more contrastive and precise.

Second, we introduce Anchored Preference Optimization (APO). APO uses simple constraints during training to account for the relationship between the model and preference data.

A: Preference pairs can vary along irrelevant axes; Contrastive Learning from AI Revisions (CLAIR) creates a targeted preference signal instead. B: The quality of the model can impact alignment training; Anchored Preference Optimization (APO) explicitly accounts for this.

Compared to conventional methods, we’ve observed a ~2x performance boost on MixEval-Hard for continued alignment of Llama-3-8B-Instruct.

Contrastive Learning From AI Revisions (CLAIR)

We provide a reference implementation of CLAIR in the CLAIR_preferences.ipynb notebook. Results are cached, so you can run it without an API key.
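For orientation, the core idea of CLAIR can be sketched in a few lines of Python. This is a minimal illustration, not the notebook's code: `make_clair_pair` and `toy_reviser` are hypothetical names, the reviser here is a plain string replacement standing in for a call to a stronger AI system, and the `prompt`/`chosen`/`rejected` field names are an assumption based on the common preference-pair convention.

```python
from typing import Callable, Dict


def make_clair_pair(
    prompt: str,
    answer: str,
    revise: Callable[[str, str], str],
) -> Dict[str, str]:
    """Build one preference pair: the original answer is rejected,
    its minimal revision is chosen."""
    revised = revise(prompt, answer)
    return {"prompt": prompt, "rejected": answer, "chosen": revised}


def toy_reviser(prompt: str, answer: str) -> str:
    # Hypothetical stand-in: a real reviser would ask a stronger model to
    # change as little as possible while fixing the answer.
    return answer.replace("Sydney", "Canberra")


pair = make_clair_pair(
    "What is the capital of Australia?",
    "The capital of Australia is Sydney.",
    toy_reviser,
)
print(pair)
```

Because the chosen and rejected answers differ only in the revised span, the pair carries a preference signal about that span rather than about unrelated differences in style or length.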

Anchored Preference Optimization (APO)

APO is integrated in the TRL repository. First, install trl. Then, run either APO-zero (apo_zero) or APO-down (apo_down) using the trl dpo command.

pip install git+https://github.com/huggingface/trl.git

trl dpo \
    --loss_type apo_zero \
    --dataset_name ContextualAI/ultrafeedback_clair_32k \
    --model_name_or_path facebook/opt-125m \
    --output_dir results
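To make the two variants concrete, here is a small pure-Python sketch of the per-example losses as we understand them from the paper and the TRL integration. Treat the exact form as an assumption and check the TRL source before relying on it. The function names and the default `beta` are illustrative; each log-ratio is the log-likelihood of an output under the trained model minus its log-likelihood under the reference model.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def apo_zero_loss(chosen_logratio: float, rejected_logratio: float, beta: float = 0.1) -> float:
    """APO-zero (assumed form): push the chosen likelihood up and the rejected one down."""
    loss_chosen = 1.0 - sigmoid(beta * chosen_logratio)
    loss_rejected = sigmoid(beta * rejected_logratio)
    return loss_chosen + loss_rejected


def apo_down_loss(chosen_logratio: float, rejected_logratio: float, beta: float = 0.1) -> float:
    """APO-down (assumed form): push the chosen likelihood down and the rejected one down further."""
    loss_chosen = sigmoid(beta * chosen_logratio)
    loss_rejected = 1.0 - sigmoid(beta * (chosen_logratio - rejected_logratio))
    return loss_chosen + loss_rejected


# Compare the loss at initialization (model equals reference, both log-ratios 0)
# with the loss after a move in the direction each variant rewards.
print(apo_zero_loss(0.0, 0.0), apo_zero_loss(1.0, -1.0))
print(apo_down_loss(0.0, 0.0), apo_down_loss(-1.0, -2.0))
```

In this reading, APO-zero suits cases where the chosen outputs are better than what the model already produces, while APO-down suits cases where the model is already stronger than the chosen outputs.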

Unpaired APO (similar to KTO) is coming soon to TRL. Once available, it is expected to run via the trl kto command:

trl kto \
    --loss_type apo_zero_unpaired \
    --dataset_name ContextualAI/ultrafeedback_clair_32k \
    --model_name_or_path facebook/opt-125m \
    --output_dir results
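An unpaired objective scores each completion on its own rather than as part of a pair. The sketch below shows one way a paired example could be split into two labeled examples. The helper name `unpair` is hypothetical, and the `prompt`/`completion`/`label` field names are an assumption modeled on the unpaired format used by KTO-style trainers; TRL may handle this conversion itself.

```python
from typing import Dict, List, Union

Example = Dict[str, Union[str, bool]]


def unpair(example: Dict[str, str]) -> List[Example]:
    """Split one (prompt, chosen, rejected) pair into two labeled examples."""
    return [
        # The chosen completion becomes a desirable example.
        {"prompt": example["prompt"], "completion": example["chosen"], "label": True},
        # The rejected completion becomes an undesirable example.
        {"prompt": example["prompt"], "completion": example["rejected"], "label": False},
    ]


paired = {
    "prompt": "What is the capital of Australia?",
    "chosen": "The capital of Australia is Canberra.",
    "rejected": "The capital of Australia is Sydney.",
}
unpaired = unpair(paired)
print(unpaired)
```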

Citation

If you found CLAIR and APO useful, please cite:

@misc{doosterlinck2024anchored,
      title={Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment}, 
      author={Karel D'Oosterlinck and Winnie Xu and Chris Develder and Thomas Demeester and Amanpreet Singh and Christopher Potts and Douwe Kiela and Shikib Mehri},
      year={2024},
      eprint={2408.06266},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2408.06266}, 
}
