
Self-Instruct: Aligning LM with Self Generated Instructions


This repository contains code and data for SelfBehave, a method for generating a Behaviour-Driven Development (BDD) dataset using the SELF-INSTRUCT method. It was inspired by the Self-Instruct paper, which proposes a method for aligning pretrained language models with instructions.

Introduction

Self-Instruct is a framework that improves a language model's ability to follow natural language instructions. By using the model's own generations to build a large collection of instructional data, it avoids relying on extensive manual annotation. The SelfBehave version aims to produce a large dataset of high-quality BDD scenarios.

Background

Agile methods have gained traction in software development. Behaviour-Driven Development (BDD) uses natural language in test scenarios to enhance communication between developers and stakeholders. Despite its benefits, BDD adoption can lead to significant maintenance costs, impacting productivity. Researchers are exploring automated test case generation from user stories to address this, but current tools often fail to produce high-quality scenarios, resulting in testing inefficiencies. This is why the approach provided by Self-Instruct could be interesting for obtaining large, high-quality BDD datasets.
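For context, a BDD scenario is typically written in Gherkin-style Given/When/Then steps. Below is a minimal, hypothetical example (the scenario content is illustrative, not taken from this repository's data), shown in Python together with the kind of simple structural check a quality filter might apply:

# A hypothetical Gherkin-style BDD scenario (illustrative only).
scenario = """\
Scenario: Successful login
  Given a registered user with valid credentials
  When the user submits the login form
  Then the user is redirected to their dashboard
"""

# A basic quality check: verify the Given/When/Then structure is present.
for keyword in ("Given", "When", "Then"):
    assert keyword in scenario, f"missing {keyword} step"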

How does Self-Instruct work?

The Self-Instruct process is an iterative bootstrapping algorithm that starts with a seed set of manually-written instructions and uses them to prompt the language model to generate new instructions and corresponding input-output instances. These generations are then filtered to remove low-quality or similar ones, and the resulting data is added back to the task pool. This process can be repeated multiple times, resulting in a large collection of instructional data that can be used to fine-tune the language model to follow instructions more effectively.
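As a rough illustration, the loop can be sketched in Python as follows. This is a minimal sketch, not the repository's actual API: generate_from_model and similarity are hypothetical callables standing in for the language-model call and a ROUGE-style overlap measure, with the sample size of 8 and the 0.7 threshold following the Self-Instruct paper's setup.

import random

def self_instruct_bootstrap(seed_tasks, generate_from_model, similarity, num_rounds=3):
    """Sketch of the Self-Instruct bootstrapping loop (hypothetical helpers)."""
    task_pool = list(seed_tasks)
    for _ in range(num_rounds):
        # Prompt the model with a few sampled tasks to elicit new instructions.
        prompt_tasks = random.sample(task_pool, min(8, len(task_pool)))
        candidates = generate_from_model(prompt_tasks)
        for candidate in candidates:
            # Keep a candidate only if it is not too similar to anything in the pool.
            if all(similarity(candidate, existing) < 0.7 for existing in task_pool):
                task_pool.append(candidate)
    return task_pool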

Here is an overview of Self-Instruct:

Figure: The pipeline for generating instruction data from a language model itself.

Usage

* This work is still in progress. We may update the code and data as we make progress, so please be mindful of which version you are using.

Instruction-tuning using our Self-Instruct data

The Self-Instruct paper released a dataset containing 52K instructions paired with 82K instance inputs and outputs. The original model-generated data remains accessible in data/gpt3-generations/batch_221203/all_instances_82K.jsonl.

The SelfBehave version releases two distinct datasets, each containing 1,000 scenarios generated from seeds of different quality levels.

Note: This data was generated by a language model (Mixtral-8x7B-Instruct) and inevitably contains some errors or biases. We analyzed data quality on a random sample of 286 entries from each generated dataset.
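Since the datasets are plain JSONL files, they can be inspected with a few lines of Python. A minimal sketch: field names such as "instruction" are assumptions based on the Self-Instruct release, so check the actual files for the exact schema.

import json

# Load the released instances; each line is one JSON record.
records = []
with open("data/gpt3-generations/batch_221203/all_instances_82K.jsonl", encoding="utf-8") as f:
    for line in f:
        records.append(json.loads(line))

print(len(records))
# "instruction" is an assumed field name; verify it against the file itself.
print(records[0].get("instruction"))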

Generating Self-Instruct data from scratch

To generate Self-Instruct data using your own seed tasks or other models, the scripts for the entire pipeline are open-sourced here. The current code has only been tested with the Mixtral-8x7B-Instruct model, accessible via Hugging Face.
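If you want to plug in your own seed tasks, each line of the seed file is one JSON object. A minimal sketch of writing an entry in Python follows; the field names mirror the original Self-Instruct seed-task schema and should be verified against the seed files in this repository.

import json

# One hypothetical seed task entry, following the original Self-Instruct
# seed-task schema (id, name, instruction, instances, is_classification).
seed_task = {
    "id": "seed_task_0",
    "name": "example_bdd_scenario",
    "instruction": "Write a BDD scenario for a user logging in.",
    "instances": [{"input": "", "output": "Given a registered user ..."}],
    "is_classification": False,
}

# Append the entry to a seed file in JSONL format.
with open("my_seed_tasks.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(seed_task) + "\n")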

Here are the commands for generating new datasets from each of the two seeds:

For the high-quality seed:

# 1. Generate instructions from the seed
batch_dir=high_quality_seed_data/data/api_generations
python self_instruct/bootstrap_instructions.py \
  --batch_dir ${batch_dir} \
  --num_instructions_to_generate 100 \
  --seed_tasks_path high_quality_seed_data/data/high_quality_seed_tasks/high_quality_seed_tasks.jsonl \
  --engine "mistralai/Mixtral-8x7B-Instruct-v0.1"

# 2. Identify whether the instruction represents a classification task or not
batch_dir=high_quality_seed_data/data/api_generations
python self_instruct/identify_clf_or_not.py \
  --batch_dir ${batch_dir} \
  --engine "mistralai/Mixtral-8x7B-Instruct-v0.1" \
  --request_batch_size 5
    
# 3. Generate instances for each instruction
batch_dir=high_quality_seed_data/data/api_generations
python self_instruct/generate_instances.py \
  --batch_dir ${batch_dir} \
  --input_file machine_generated_instructions.jsonl \
  --output_file machine_generated_instances.jsonl \
  --max_instances_to_gen 5 \
  --engine "mistralai/Mixtral-8x7B-Instruct-v0.1" \
  --request_batch_size 5
  
# 4. Filtering, processing, and reformatting
batch_dir=high_quality_seed_data/data/api_generations
python self_instruct/prepare_for_finetuning.py \
  --instance_files ${batch_dir}/machine_generated_instances.jsonl \
  --classification_type_files ${batch_dir}/is_clf_or_not_davinci_template_1.jsonl \
  --output_dir ${batch_dir}/finetuning_data \
  --include_seed_tasks \
  --seed_tasks_path high_quality_seed_data/data/high_quality_seed_tasks/high_quality_seed_tasks.jsonl

All data relating to the high-quality seed can be found in the high_quality_seed_data folder.

For the mixed-quality seed:

# 1. Generate instructions from the seed
batch_dir=mixed_quality_seed_data/data/api_generations
python self_instruct/bootstrap_instructions.py \
  --batch_dir ${batch_dir} \
  --num_instructions_to_generate 100 \
  --seed_tasks_path mixed_quality_seed_data/data/mixed_quality_seed_tasks/mixed_quality_seed_tasks.jsonl \
  --engine "mistralai/Mixtral-8x7B-Instruct-v0.1"
  
# 2. Identify whether the instruction represents a classification task or not
batch_dir=mixed_quality_seed_data/data/api_generations
python self_instruct/identify_clf_or_not.py \
  --batch_dir ${batch_dir} \
  --engine "mistralai/Mixtral-8x7B-Instruct-v0.1" \
  --request_batch_size 5
    
# 3. Generate instances for each instruction
batch_dir=mixed_quality_seed_data/data/api_generations
python self_instruct/generate_instances.py \
  --batch_dir ${batch_dir} \
  --input_file machine_generated_instructions.jsonl \
  --output_file machine_generated_instances.jsonl \
  --max_instances_to_gen 5 \
  --engine "mistralai/Mixtral-8x7B-Instruct-v0.1" \
  --request_batch_size 5
  
# 4. Filtering, processing, and reformatting
batch_dir=mixed_quality_seed_data/data/api_generations
python self_instruct/prepare_for_finetuning.py \
  --instance_files ${batch_dir}/machine_generated_instances.jsonl \
  --classification_type_files ${batch_dir}/is_clf_or_not_davinci_template_1.jsonl \
  --output_dir ${batch_dir}/finetuning_data \
  --include_seed_tasks \
  --seed_tasks_path mixed_quality_seed_data/data/mixed_quality_seed_tasks/mixed_quality_seed_tasks.jsonl

All data relating to the mixed-quality seed can be found in the mixed_quality_seed_data folder.

Data and results

For each of the two seeds, the .jsonl file (as well as an .md version) can be found in the high_quality_seed_tasks or mixed_quality_seed_tasks folder, respectively. All the data generated at the different stages can be found in the api_generations folder. The samples used to analyse the quality of the generated data can be found in the results_sample folder.

The .csv files with the analyses of each sample can be found in the data results analyses folder.

Manual analysis results

Manual analysis results are available in the results/ folder. Please read results/RaterGuide.md for more information.

Acknowledgements

Thanks to the authors of the Self-Instruct method for the inspiration they brought to this work. To cite the original Self-Instruct authors:

@misc{selfinstruct,
  title={Self-Instruct: Aligning Language Model with Self Generated Instructions},
  author={Wang, Yizhong and Kordi, Yeganeh and Mishra, Swaroop and Liu, Alisa and Smith, Noah A. and Khashabi, Daniel and Hajishirzi, Hannaneh},
  journal={arXiv preprint arXiv:2212.10560},
  year={2022}
}
