TransGEC: Improving Grammatical Error Correction with Translationese

This repository contains the code for "TransGEC: Improving Grammatical Error Correction with Translationese". Our models were trained on NVIDIA Tesla V100 32G and A100 40G GPUs.

Citation

@inproceedings{fang-etal-2023-transgec,
    title = "{T}rans{GEC}: Improving Grammatical Error Correction with Translationese",
    author = "Fang, Tao  and
      Liu, Xuebo  and
      Wong, Derek F.  and
      Zhan, Runzhe  and
      Ding, Liang  and
      Chao, Lidia S.  and
      Tao, Dacheng  and
      Zhang, Min",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2023",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-acl.223",
    pages = "3614--3633",
}

Simplified Instruction

We have released the translationese GEC models (TransGEC), fine-tuned from the (m)T5-large pre-trained language models. If you want to try our work quickly, the following steps may be useful.

  • Step 1: Requirements and Installation

    This implementation is based on huggingface/transformers (v4.13.0).

    • PyTorch version >= 1.3.1
    • Python version >= 3.6
    git clone https://github.com/NLP2CT/Trans4GEC.git
    cd transformers
    pip install .
    pip install -r requirements.txt
    
  • Step 2: Download Translationese (m)T5-GEC Models and Data

    | Lang. | Model | Description | Model-Download | Data-Download |
    |-------|-------|-------------|----------------|---------------|
    | En | TransGEC | Fine-tuned with cLang8-en and translationese | TransGEC.en.model | data.en |
    | De | TransGEC | Fine-tuned with cLang8-de and translationese | TransGEC.de.model | data.de |
    | Ru | TransGEC | Fine-tuned with cLang8-ru and translationese | TransGEC.ru.model | data.ru |
    | Zh | TransGEC | Fine-tuned with Lang8-zh and translationese | TransGEC.zh.model | data.zh |

    The downloaded data is organized as follows:

    data_xx/
     |--train
       |--translationese.tsv
       |--train-translationese.json
     |--dev
       |--dev.xx.json
     |--test
       |--test.xx.json
       |--test.xx.M2
    
  • Step 3: Generation and Evaluation

    To generate corrections and evaluate them with the downloaded TransGEC models, see the script transgec_generate.sh for details.
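Before running the script, it can help to confirm that the downloaded data matches the layout listed in Step 2. The following is a minimal sketch, not part of the repository: the helper names `expected_files` and `missing_files` are hypothetical, and it assumes that `xx` in the file names stands for the language code (`en`, `de`, `ru`, `zh`).

```python
from pathlib import Path


def expected_files(lang):
    """Relative paths that Step 2 lists for one language.

    Assumption: "xx" in the layout is the language code, e.g. "en".
    """
    return [
        Path("train") / "translationese.tsv",
        Path("train") / "train-translationese.json",
        Path("dev") / f"dev.{lang}.json",
        Path("test") / f"test.{lang}.json",
        Path("test") / f"test.{lang}.M2",
    ]


def missing_files(data_dir, lang):
    """Return the expected files that are not present under data_dir."""
    root = Path(data_dir)
    return [rel for rel in expected_files(lang) if not (root / rel).is_file()]


if __name__ == "__main__":
    # "data_en" is a placeholder; point it at wherever the data was unpacked.
    for rel in missing_files("data_en", "en"):
        print(f"missing: {rel}")
```

An empty result from `missing_files` means every file from the listing was found; the check only looks at file names, not at their contents.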

Usage

If you want to fine-tune the (m)T5-large pre-trained language model on translationese yourself, please follow the instructions below.

Fine-tuning

sh /shell_finetune-T5/train_en.sh
sh /shell_finetune-T5/train_de.sh
sh /shell_finetune-T5/train_ru.sh
sh /shell_finetune-T5/train_zh.sh

Generation and Evaluation

sh /shell_finetune-T5/Generate_evaluate_en.sh
sh /shell_finetune-T5/Generate_evaluate_de.sh
sh /shell_finetune-T5/Generate_evaluate_ru.sh
sh /shell_finetune-T5/Generate_evaluate_zh.sh
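
The two stages above follow the same naming pattern for every language, so they can be chained from a small wrapper. This is only a sketch under stated assumptions: `script_path` and `run_pipeline` are hypothetical helpers that do not exist in the repository, and the script directory is copied verbatim from the commands above, so adjust `SCRIPT_DIR` to match your checkout.

```python
import subprocess

LANGUAGES = ("en", "de", "ru", "zh")
SCRIPT_DIR = "/shell_finetune-T5"  # as written in the commands above; adjust if needed


def script_path(stage, lang):
    """Build the script path for a stage ("train" or "evaluate") and language."""
    if lang not in LANGUAGES:
        raise ValueError(f"unsupported language: {lang!r}")
    names = {
        "train": f"train_{lang}.sh",
        "evaluate": f"Generate_evaluate_{lang}.sh",
    }
    if stage not in names:
        raise ValueError(f"unknown stage: {stage!r}")
    return f"{SCRIPT_DIR}/{names[stage]}"


def run_pipeline(lang, dry_run=True):
    """Fine-tune, then generate and evaluate, for one language.

    With dry_run=True the commands are only printed, not executed.
    """
    commands = [["sh", script_path(stage, lang)] for stage in ("train", "evaluate")]
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            # check=True stops the pipeline if a stage exits with an error.
            subprocess.run(command, check=True)
    return commands
```

Calling `run_pipeline("de")` prints the two commands for German without running them; pass `dry_run=False` to execute them in order.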

Quick Links

Please refer to the following instructions for more information on our work: