GEC-EB: Mitigating Exposure Bias in Grammatical Error Correction with Data Augmentation and Reweighting
Hannan Cao, Wenmian Yang, Hwee Tou Ng. Mitigating Exposure Bias in Grammatical Error Correction with Data Augmentation and Reweighting. In EACL 2023.
The program was tested with PyTorch 1.7.1 and CUDA 11.7.
- Download required data and install required software
1.1. Generate the C4 200M synthetic data by following https://github.com/google-research-datasets/C4_200M-synthetic-dataset-for-grammatical-error-correction
1.2. Download the NUCLE, FCE, CLang8, W&I, CoNLL-2013, and CoNLL-2014 datasets.
1.3. Install fairseq from inside the fairseq folder:
cd fairseq
pip3 install --editable ./
Note: all scripts are inside the train-scripts folder.
- Pretrain and train the Transformer-big model
2.1 Pass the path to the target sentences to tok+bpe+pre.sh to generate the BPE using the subword_nmt package (https://github.com/rsennrich/subword-nmt)
./tok+bpe+pre.sh
2.2 Use apply_bpe.sh and create-dict-preprocess.sh to preprocess the pre-training data.
./apply_bpe.sh path/to/pretrain/data/folder
./create-dict-preprocess.sh path/to/bpe-ed/data/folder
2.3 Pretrain the model with pretrain.sh
./pretrain.sh
2.4 Preprocess the training data
./apply_bpe.sh path/to/train/data/folder
./preprocess.sh path/to/bpe-ed/data/folder
2.5 Train the model with train.sh
./train.sh model/train preprocessed/train/data path/to/pretrained/checkpoint
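Steps 2.2 to 2.5 run in a fixed order, so they can be chained in a small driver. The following is only a sketch and is not part of the repository: the function names `run` and `pretrain_and_train`, the `DRY_RUN` variable, and all seven positional arguments are hypothetical. It assumes it is run from the train-scripts folder and only calls the scripts with the arguments shown in the steps above.

```shell
# Hypothetical driver (not part of the repository) chaining steps 2.2 to 2.5.
# Set DRY_RUN=1 to print the commands instead of executing them.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Arguments (all placeholders supplied by the caller):
#   $1 raw pre-training data folder      $2 BPE-encoded pre-training data folder
#   $3 raw training data folder          $4 BPE-encoded training data folder
#   $5 preprocessed training data        $6 model output folder
#   $7 pretrained checkpoint
pretrain_and_train() {
    run ./apply_bpe.sh "$1" &&                 # step 2.2
    run ./create-dict-preprocess.sh "$2" &&    # step 2.2
    run ./pretrain.sh &&                       # step 2.3
    run ./apply_bpe.sh "$3" &&                 # step 2.4
    run ./preprocess.sh "$4" &&                # step 2.4
    run ./train.sh "$6" "$5" "$7"              # step 2.5
}
```

Running it once with `DRY_RUN=1` prints the six commands in order, which is a cheap way to check the paths before starting a long pre-training job. The `&&` chaining stops the pipeline at the first step that fails.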
- Generate augmented sentences
3.1. Use the downloaded checkpoint to make predictions on the training set (the paths below need to be specified):
./predict.sh 0 path/to/source/training/sentence "candidate_data" path/to/downloaded/checkpoint output/directory
3.2. Generate candidate sentences from the prediction result, then move the candidate files to their respective folders (e.g. neg-1, neg-2, neg-3, neg-4, neg-5; the original training and validation sentences are assumed to be stored in the pos folder):
python generate_candidates.py --root_path previous/used/output/directory --candidate_name test.nbest.tok.candidate_data
mkdir pos
mkdir neg-1
mkdir neg-2
mkdir neg-3
mkdir neg-4
mkdir neg-5
mkdir pos-data
mkdir neg-1-data
mkdir neg-2-data
mkdir neg-3-data
mkdir neg-4-data
mkdir neg-5-data
mv candi.1 neg-1/train.tgt
mv candi.2 neg-2/train.tgt
mv candi.3 neg-3/train.tgt
mv candi.4 neg-4/train.tgt
mv candi.5 neg-5/train.tgt
3.3. Copy the train.src, valid.src and valid.tgt to neg-1, neg-2, neg-3, neg-4, neg-5 folders
3.4. Create the count file for the number of candidates:
python valid_count.py --candi_path path/to/your/source/training/data/folder --file_name output/file/name --count number/of/candidates/you/selected
3.5. Pass each of the neg-1, neg-2, ..., neg-5 folders to apply_bpe.sh to process the data
./apply_bpe.sh /path/to/neg-1/folder
3.6. Combine augmented data together (e.g. combine 5 together):
python multi_target.py --source /path/to/processed/pos/train.tgt/file /path/to/processed/neg-1/train.tgt/file ... /path/to/processed/neg-5/train.tgt/file /path/to/count/file \
    --target /path/to/processed/pos/train.src/file \
    --max 750 \
    --ratio 750 \
    --out path/to/output/data/folder
3.7. Binarize the data using preprocess.sh
./preprocess.sh path/to/output/data/folder
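Steps 3.2 and 3.3 repeat the same operations for every neg-N folder, so they can be written as a loop. The following is a minimal sketch and is not part of the repository: the function name `prepare_neg_folders` and its argument are hypothetical. It assumes that candi.1 ... candi.N are in the current directory and that the pos folder already contains train.src, valid.src and valid.tgt.

```shell
# Hypothetical helper (not part of the repository) for steps 3.2 and 3.3.
# Usage: prepare_neg_folders NUM_CANDIDATES
# Assumes candi.1 ... candi.N are in the current directory and that pos/
# already contains train.src, valid.src and valid.tgt.
prepare_neg_folders() {
    n="$1"
    i=1
    mkdir -p pos pos-data
    while [ "$i" -le "$n" ]; do
        mkdir -p "neg-$i" "neg-$i-data"
        # Step 3.2: the i-th candidate file becomes the target side of neg-i.
        mv "candi.$i" "neg-$i/train.tgt" || return 1
        # Step 3.3: every neg-i folder reuses the original source and validation files.
        for f in train.src valid.src valid.tgt; do
            cp "pos/$f" "neg-$i/$f" || return 1
        done
        i=$((i + 1))
    done
}
```

After `prepare_neg_folders 5`, step 3.5 reduces to calling ./apply_bpe.sh once per neg-N folder. The function returns early if a candidate or pos file is missing, so a partially generated candidate set does not go unnoticed.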
- Train the model using the DM method.
./conll_run.sh 0 path/to/save/finetuned/checkpoint
./bea_run.sh 0 path/to/save/finetuned/checkpoint
- Make predictions with predict.sh
./predict.sh 0 path/to/test/set random/name path/to/finetuned/weight output/directory
- Use the M2 scorer to evaluate the result on the CoNLL-2014 test set. To evaluate the result on the BEA-2019 test set, submit the prediction result to CodaLab: https://competitions.codalab.org/competitions/20228#participate-get-data
If you find our paper or code useful, please cite as:
@inproceedings{cao-etal-2022-eb,
title = "Mitigating Exposure Bias in Grammatical Error Correction with Data Augmentation and Reweighting",
author = "Cao, Hannan and
Yang, Wenmian and
Ng, Hwee Tou",
booktitle = "Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics",
year = "2023",
}
The source code is licensed under GNU GPL 3.0 (see License) for non-commercial use. For commercial use of this code, separate commercial licensing is also available. Please contact Prof. Hwee Tou Ng (nght@comp.nus.edu.sg).