
gec_model_training

Code for training Buntan/gec-t5-v1_1-small, a T5 v1.1 small model fine-tuned on the cLang-8 dataset following the process described in "A Simple Recipe for Multilingual Grammatical Error Correction".
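For reference, the released model can be used through the standard transformers seq2seq interface. This is a minimal sketch; it assumes the model takes the raw sentence as input with no task prefix, as in the cLang-8 recipe.

```python
# Minimal inference sketch for the released model.
# Assumption: input is the raw sentence, with no task prefix.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Buntan/gec-t5-v1_1-small")
model = AutoModelForSeq2SeqLM.from_pretrained("Buntan/gec-t5-v1_1-small")

inputs = tokenizer("She see the dog yesterday.", return_tensors="pt")
outputs = model.generate(**inputs, max_length=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```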

The data was obtained from Google's cLang-8 repository (github.com/google-research-datasets/clang8).

The dataset was generated with the --tokenize_text='False' option to produce clang8_source_target_en.tsv, a file of tab-separated source/target sentence pairs suitable for T5 training.
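Each line of the TSV holds one sentence pair, which can be inspected like so (a sketch; it assumes the file has no header row):

```python
# Peek at the first few source/target pairs in the generated TSV.
import csv

with open("clang8_source_target_en.tsv", newline="", encoding="utf-8") as f:
    reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
    for i, row in enumerate(reader):
        print(f"source: {row[0]}\ntarget: {row[1]}\n")
        if i == 2:
            break
```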

The data was then split into 95% training and 5% validation and converted into a Hugging Face Dataset object (see data_preparation.py).
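A minimal sketch of what this step might look like; the actual data_preparation.py may differ, and the no-header assumption on the TSV is mine:

```python
# Load the TSV as a Hugging Face Dataset and carve off a 5% validation split.
from datasets import load_dataset

dataset = load_dataset(
    "csv",
    data_files="clang8_source_target_en.tsv",
    delimiter="\t",
    column_names=["source", "target"],  # assumes the TSV has no header row
)["train"]

splits = dataset.train_test_split(test_size=0.05, seed=42)
train_data, val_data = splits["train"], splits["test"]
```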

The training code can be found in train_script.py. The hyperparameters used are below:

Param                  Value
epochs                 3
batch size             32
learning rate          2e-5
lr scheduler type      linear
optimizer              AdamW
max length             128 tokens
mixed precision        fp16
gradient accumulation  3 steps
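The following is a minimal sketch of how these hyperparameters might map onto the transformers Seq2SeqTrainer API. train_script.py is the authoritative version; the base checkpoint name (google/t5-v1_1-small) and the per-device interpretation of the batch size are assumptions here.

```python
# Sketch: fine-tune T5 v1.1 small on the prepared cLang-8 splits with the
# hyperparameters in the table above. Assumes train_data / val_data from the
# preparation step, each with "source" and "target" columns.
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

checkpoint = "google/t5-v1_1-small"  # assumed base model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

def preprocess(batch):
    # Truncate both sides to the 128-token max length from the table.
    model_inputs = tokenizer(batch["source"], max_length=128, truncation=True)
    labels = tokenizer(text_target=batch["target"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

train_ds = train_data.map(preprocess, batched=True, remove_columns=["source", "target"])
val_ds = val_data.map(preprocess, batched=True, remove_columns=["source", "target"])

args = Seq2SeqTrainingArguments(
    output_dir="gec-t5-v1_1-small",
    num_train_epochs=3,
    per_device_train_batch_size=32,  # "batch size" from the table (per-device assumed)
    learning_rate=2e-5,
    lr_scheduler_type="linear",
    optim="adamw_torch",             # AdamW
    fp16=True,                       # mixed precision
    gradient_accumulation_steps=3,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```

Note that with a batch size of 32 and gradient accumulation over 3 steps, each optimizer update sees an effective batch of 96 sentence pairs.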
