Code for [NeurIPS 2024 Paper] CLUES🔍: Collaborative Private-domain High-quality Data Selection for LLMs via Training Dynamics
CLUES🔍 evaluates and optimizes data quality for large language models (LLMs) in collaborative settings (including model merging and federated learning). The pipeline has three main steps:
- Train the original model on the raw, mixed-quality data
- Score the data and select it with a global threshold (CLUES🔍), as sketched below
- Collaboratively train and merge models on the selected high-quality data
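For intuition, here is a minimal, illustrative sketch of the core idea, not the repo's actual implementation: each training sample is scored by how well its gradient aligns with the gradient of a small trusted anchor set, and samples whose scores exceed a global threshold shared across clients are kept. All names below (`grad_vector`, `score_samples`, `loss_fn`) are hypothetical.

```python
import torch
import torch.nn.functional as F

def grad_vector(model, loss):
    """Flatten the gradients of `loss` w.r.t. all trainable parameters."""
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])

def score_samples(model, loss_fn, samples, anchor_batch):
    """Hypothetical per-sample quality score: cosine similarity between each
    sample's gradient and the gradient on a small trusted anchor set."""
    anchor_grad = grad_vector(model, loss_fn(model, anchor_batch))
    return [
        F.cosine_similarity(grad_vector(model, loss_fn(model, s)), anchor_grad, dim=0).item()
        for s in samples
    ]

# Selection with a single global threshold shared by all clients:
# selected = [s for s, sc in zip(samples, scores) if sc > global_threshold]
```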
conda env create --file environment.yaml --name <your_env_name>
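Then activate the environment before running any of the scripts below: `conda activate <your_env_name>`.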
Run the following script to fine-tune the model on the raw data:
sh finetune.sh
Run the scoring script, which calls scoring.py to calculate gradients and data scores and selects data based on the results:
sh scoring.sh
- You can change `output_notation` to select gradients from different layers and submodules; see the sketch below.
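A filter like this usually amounts to matching parameter names; the sketch below is a hypothetical illustration (the exact behavior of `output_notation` is defined in scoring.py):

```python
import torch

def collect_submodule_grads(model: torch.nn.Module, output_notation: str):
    """Keep only gradients whose parameter name contains `output_notation`,
    e.g. "layers.10.self_attn" or "lora_A" (hypothetical example values;
    see scoring.py for the options the repo actually supports)."""
    return {
        name: p.grad.detach().clone()
        for name, p in model.named_parameters()
        if p.grad is not None and output_notation in name
    }
```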
Our repo also implements other data-scoring baselines within the same framework, for easier use and uniform comparison:
- Instruction-Following Difficulty (IFD) score from *From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning* (a sketch of the score follows this list)
- Data influence score from *DataInf: Efficiently Estimating Data Influence in LoRA-tuned LLMs and Diffusion Models*
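For reference, the IFD score of an instruction-answer pair (Q, A) is the ratio of the model's perplexity on A conditioned on Q to its perplexity on A alone. Below is a minimal sketch using HuggingFace Transformers; the model name is a placeholder, not the one our scripts use:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; substitute the model under study
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def answer_loss(prompt: str, answer: str) -> float:
    """Average cross-entropy over the answer tokens, optionally conditioned on `prompt`."""
    answer_ids = tok(answer, return_tensors="pt").input_ids
    if prompt:
        prompt_ids = tok(prompt, return_tensors="pt").input_ids
        input_ids = torch.cat([prompt_ids, answer_ids], dim=1)
        labels = input_ids.clone()
        labels[:, : prompt_ids.shape[1]] = -100  # ignore prompt tokens in the loss
    else:
        input_ids = answer_ids
        labels = input_ids.clone()
    with torch.no_grad():
        return model(input_ids, labels=labels).loss.item()

def ifd_score(instruction: str, answer: str) -> float:
    # IFD(Q, A) = PPL(A | Q) / PPL(A) = exp(loss(A | Q) - loss(A))
    return float(torch.exp(torch.tensor(answer_loss(instruction, answer) - answer_loss("", answer))))
```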
Run the following script to fine-tune the model on the selected data:
sh finetune_selected.sh
- In this step, use the datasets selected in the scoring step.
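In the simplest case, merging the client models fine-tuned on their selected data reduces to (weighted) parameter averaging in the spirit of FedAvg-style merging; the sketch below is illustrative only, and the repo's actual merging logic lives in the scripts:

```python
import torch

def average_state_dicts(state_dicts, weights=None):
    """Uniform (or weighted) parameter averaging across client models.
    Illustrative only; assumes all state dicts share the same keys and shapes."""
    n = len(state_dicts)
    weights = weights or [1.0 / n] * n
    return {
        key: sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
        for key in state_dicts[0]
    }

# merged = average_state_dicts([torch.load(p) for p in client_ckpt_paths])
```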
Evaluate the model to verify the effectiveness of fine-tuning and data selection.
We use GPT scoring to evaluate the model after collaborative learning. Run merging_eval_med.sh to generate model outputs on the test sets:
sh merging_eval_med.sh
Run eval_gpt.sh to score the answers:
sh eval_gpt.sh
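GPT scoring asks an LLM judge to rate each generated answer against the reference. The sketch below uses the OpenAI Python client; the prompt, judge model, and rating scale are illustrative assumptions, not necessarily those used in eval_gpt.sh:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def gpt_score(question: str, reference: str, answer: str) -> float:
    """Ask an LLM judge for a 1-10 quality rating (illustrative prompt)."""
    prompt = (
        "Rate the assistant's answer to the question on a scale of 1-10, "
        "using the reference answer as a guide. Reply with a single number.\n\n"
        f"Question: {question}\nReference: {reference}\nAnswer: {answer}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # example judge model; eval_gpt.sh may use a different one
        messages=[{"role": "user", "content": prompt}],
    )
    return float(resp.choices[0].message.content.strip())
```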
- Ensure all relevant environments and dependencies are correctly configured before running the scripts.
- Modify script parameters according to the specific language of your data.
- The scripts and parameters in each step may need adjustments based on specific requirements.
- If you have any questions related to the code or the paper, feel free to email Wanru (wz341@cam.ac.uk).
Contributions are welcome. Please submit issues and pull requests to improve this project.
We would like to thank Colin Raffel, Haokun Liu, Marco Ciccone, Brian Lester, Meghdad Kurmanji and Stefanos Laskaridis for useful discussions and feedback.
Please cite our paper if you find the repo helpful in your work:
@inproceedings{zhao2024clues,
  title={{CLUES}: Collaborative Private-domain High-quality Data Selection for LLMs via Training Dynamics},
  author={Wanru Zhao and Hongxiang Fan and Shell Xu Hu and Wangchunshu Zhou and Nicholas Donald Lane},
  booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems (NeurIPS)},
  year={2024}
}