
kNN-TL

kNN-TL: k-Nearest-Neighbor Transfer Learning for Low-Resource Neural Machine Translation (ACL 2023)

Overview

Transfer learning is an effective method for enhancing low-resource NMT through the parent-child framework. kNN-TL leverages the parent model's knowledge throughout the entire development of the child model. The approach includes a parent-child representation alignment method, which ensures consistency between the output representations of the two models, and a child-aware datastore construction method, which improves inference efficiency by selectively distilling the parent datastore according to its relevance to the child model.

Figure: Training and inference framework of kNN-TL.
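At inference time, kNN-TL augments the child model's predictions with tokens retrieved from the parent datastore. The sketch below shows the kNN-MT-style interpolation this builds on; the function and argument names are illustrative, not the repo's code.

import torch

def knn_interpolate(nmt_probs, knn_dists, knn_values, lam=0.5, temperature=10.0):
    # Standard kNN-MT mixture: p(y) = lam * p_kNN(y) + (1 - lam) * p_NMT(y).
    # knn_dists: (k,) L2 distances to the retrieved datastore keys
    # knn_values: (k,) target-token ids stored alongside those keys
    weights = torch.softmax(-knn_dists / temperature, dim=-1)  # closer keys weigh more
    knn_probs = torch.zeros_like(nmt_probs)
    knn_probs.index_add_(0, knn_values, weights)  # accumulate weight per token id
    return lam * knn_probs + (1.0 - lam) * nmt_probs

The temperature controls how sharply retrieval weight concentrates on the nearest keys, and lam trades off the retrieval distribution against the NMT distribution; both are tuned on the validation set.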

Installation

cd kNN-TL
pip install --editable .  # requires Python >= 3.7
cd ..
conda install faiss-gpu -c pytorch
pip install sacremoses==0.0.53

Data Preparation

Download and preprocess the parent and child data

# download and preprocess child data
mkdir tr_en
cd tr_en
# download tr-en from https://drive.google.com/file/d/1B23gkfQ3O430KSGVRCqTLyjPO01A5e6L/view?usp=sharing
# raw tr-en can be downloaded from https://opus.nlpl.eu/download.php?f=SETIMES/v2/moses/en-tr.txt.zip
cd ..
fairseq-preprocess -s tr -t en --trainpref tr_en/pack_clean/train --validpref tr_en/pack_clean/valid --testpref tr_en/pack_clean/test --srcdict tr_en/dict.tr.txt --tgtdict tr_en/dict.en.txt --workers 10 --destdir ${BIN_CHILD_DATA}
# download and preprocess parent data
mkdir de_en
cd de_en
# download de-en from https://drive.google.com/file/d/15CXWVj0NIMjDjxEfPCw2WktoYADUuX8O/view?usp=sharing
cd ..
fairseq-preprocess -s de -t en --trainpref de_en/pack_clean/train --validpref de_en/pack_clean/valid --testpref de_en/pack_clean/test --joined-dictionary --destdir ${BIN_PARENT_DATA} --workers 10

Training

Parent Models

cd train-scripts
BIN_PARENT_DATA=${BIN_PARENT_DATA} # path of binarized parent data
## train for de-en 
bash train_parent.sh de en $BIN_PARENT_DATA
## train for en-de (reversed parent model)
bash train_parent.sh en de $BIN_PARENT_DATA

Pseudo Parent Data Construction

For each instance in the child training data (Tr-En), use the reversed parent model to back-translate the target sentence (En) into a pseudo parent source sentence (De); gen.sh wraps this step, and a rough illustration follows the commands below.

CHILD_EN=${CHILD_EN} # target sentences
REVERSED_PARENT_CHECKPOINT=${REVERSED_PARENT_CHECKPOINT} # path of the trained reversed parent model checkpoint
AUX_SRC_BIN=${AUX_SRC_BIN} # path for the auxiliary (pseudo) source data
# generate synthetic de-en for tr-en
bash gen.sh $CHILD_EN $BIN_PARENT_DATA $REVERSED_PARENT_CHECKPOINT $AUX_SRC_BIN
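For orientation, a rough equivalent of this back-translation through fairseq's hub interface might look like the following; the paths are placeholders, and the input is assumed to be preprocessed consistently with the parent model's training data.

from fairseq.models.transformer import TransformerModel

# Illustrative back-translation with the reversed (En->De) parent model.
# Paths are placeholders; gen.sh is the authoritative version.
en2de = TransformerModel.from_pretrained(
    "checkpoints/reversed_parent",              # directory holding the checkpoint
    checkpoint_file="checkpoint_best.pt",
    data_name_or_path="data-bin/parent_de_en",  # binarized parent data (for the dicts)
)
with open("tr_en/train.en") as src, open("tr_en/train.pseudo.de", "w") as out:
    for line in src:
        out.write(en2de.translate(line.strip()) + "\n")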

Child Model

Use Token Matching (TM) for initialization, then train the child model with parent-child representation alignment; a conceptual sketch of TM follows the commands below.

BIN_CHILD_DATA=${BIN_CHILD_DATA} # path of child data
BIN_PARENT_DATA=${BIN_PARENT_DATA} # path of parent data
INIT_CHECKPOINT=${INIT_CHECKPOINT} # path of initialized child checkpoint
PARENT_CHECKPOINT=${PARENT_CHECKPOINT} # path of parent checkpoint
# Token matching
python ../kNN-TL/preprocessing_scripts/TM.py --checkpoint $PARENT_CHECKPOINT --output $INIT_CHECKPOINT --parent-dict $BIN_PARENT_DATA/dict.de.txt --child-dict $BIN_CHILD_DATA/dict.tr.txt --switch-dict src
# train child model for kNN-TL
bash train_child.sh $AUX_SRC_BIN-bin $PARENT_CHECKPOINT $BIN_PARENT_DATA $BIN_CHILD_DATA $INIT_CHECKPOINT
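TM.py implements the actual token matching; conceptually, it transfers the parent's embeddings for vocabulary items the child shares with the parent and leaves the rest randomly initialized. A simplified sketch of that idea, in which plain token lists stand in for fairseq Dictionary files and only the source embeddings are handled:

import torch

# Simplified Token Matching (TM): copy the parent's source embedding for
# every surface token the child vocabulary shares with the parent's;
# tokens unseen by the parent keep a fresh random init. The real logic
# (fairseq Dictionary files, --switch-dict handling) lives in TM.py.
def token_matching_init(parent_ckpt, parent_vocab, child_vocab, out_ckpt):
    state = torch.load(parent_ckpt, map_location="cpu")
    parent_emb = state["model"]["encoder.embed_tokens.weight"]
    child_emb = torch.empty(len(child_vocab), parent_emb.size(1)).normal_(0, 0.02)
    parent_index = {tok: i for i, tok in enumerate(parent_vocab)}  # token -> row
    for child_id, tok in enumerate(child_vocab):
        if tok in parent_index:  # shared token: transfer the parent embedding
            child_emb[child_id] = parent_emb[parent_index[tok]]
    state["model"]["encoder.embed_tokens.weight"] = child_emb
    torch.save(state, out_ckpt)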

Inference

Original Parent Datastore Building

Use the parent model and the parent data to build the original parent datastore; conceptually, it stores one (decoder hidden state, target token) entry per decoder step over the parent training data, as sketched after the command below.

PARENT_DATASTORE=${PARENT_DATASTORE} # The path to save the parent datastore
bash build_parent_datastore.sh $PARENT_CHECKPOINT $BIN_PARENT_DATA $PARENT_DATASTORE
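build_parent_datastore.sh wraps the repo's implementation. An illustrative version of the construction it performs, where file names and index settings are placeholders rather than the repo's exact configuration:

import faiss
import numpy as np

# Each decoder step over the parent training data contributes one
# (hidden state -> next target token) pair; faiss indexes the hidden
# states (keys) for fast kNN search, while the token ids (values) stay
# aligned with the keys by position.
keys = np.load("parent_keys.npy").astype("float32")   # (N, dim) decoder states
values = np.load("parent_values.npy")                 # (N,) target-token ids
dim = keys.shape[1]

quantizer = faiss.IndexFlatL2(dim)                    # coarse L2 quantizer
index = faiss.IndexIVFPQ(quantizer, dim, 4096, 8, 8)  # 4096 cells, 8-byte PQ codes
index.train(keys)  # learn coarse centroids and PQ codebooks from the keys
index.add(keys)    # entry id i in the index corresponds to values[i]
faiss.write_index(index, "parent_datastore.faiss")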

Child-Aware Parent Datastore Building

Run inference with the parent model on the pseudo parent data, performing kNN retrieval over the original parent datastore. Subsets of datastore indexes will be generated in PSEUDO_PARENT_DATA; see the sketch after the commands below. You can also skip this step: keeping the full datastore may yield a tiny quality boost, at the expense of inference speed.

# Combine the child target (En) and its pseudo parent source (De) generated in Pseudo Parent Data Construction as PSEUDO_PARENT_DATA
PSEUDO_PARENT_DATA=${PSEUDO_PARENT_DATA}
bash inference_pseudo_parent.sh $PARENT_CHECKPOINT $PSEUDO_PARENT_DATA $PARENT_DATASTORE
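An illustrative version of the child-aware pruning this step performs; file names and the neighborhood size k are placeholders:

import faiss
import numpy as np

# Query the full parent datastore with the parent model's decoder states
# on the pseudo parent data, then keep only the union of retrieved entry
# ids as the child-relevant subset.
index = faiss.read_index("parent_datastore.faiss")
queries = np.load("pseudo_parent_keys.npy").astype("float32")  # states on pseudo data

k = 64
_, neighbor_ids = index.search(queries, k)    # (num_queries, k) datastore entry ids
subset = np.unique(neighbor_ids.reshape(-1))  # de-duplicated relevant entries
np.save("child_aware_subset_ids.npy", subset) # consumed at child inference time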

Parent-Enhanced Model Prediction

Tune different subsets and parameters on the validation set, then select the best combination to evaluate performance on the test set; a sketch of the selection loop follows the command below. More parameters can be modified in the script.

GEN_SUBSET=${GEN_SUBSET} # [valid,test]
SUBSET_PATH=${SUBSET_PATH} # path of the subset in `PSEUDO_PARENT_DATA`
CHILD_MODEL=${CHILD_MODEL} # path of the trained child checkpoint
CHILD_DATA=${BIN_CHILD_DATA} # binarized child data
DATA_STORE=${PARENT_DATASTORE} # the parent datastore built above
RESULT_PATH=${RESULT_PATH} # path to save generation results
bash inference_child.sh $CHILD_MODEL $CHILD_DATA $GEN_SUBSET $DATA_STORE $SUBSET_PATH $RESULT_PATH
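A sketch of the selection loop, assuming one inference_child.sh run per configuration with GEN_SUBSET=valid. The knobs below (neighbors k, interpolation weight, temperature) are the usual kNN hyperparameters and are illustrative; the real ones are set inside inference_child.sh. Dummy random scores stand in for parsed valid-set BLEU so the snippet runs as-is.

import itertools
import random

configs = list(itertools.product([8, 16, 32],      # neighbors k
                                 [0.3, 0.5, 0.7],  # interpolation weight
                                 [10.0, 100.0]))   # softmax temperature
valid_bleu = {cfg: random.random() for cfg in configs}  # replace with real scores
best_k, best_lam, best_temp = max(valid_bleu, key=valid_bleu.get)
print(f"rerun on test with k={best_k}, lambda={best_lam}, T={best_temp}")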

Citation

@inproceedings{liu-etal-2023-knn,
    title = "k{NN}-{TL}: k-Nearest-Neighbor Transfer Learning for Low-Resource Neural Machine Translation",
    author = "Liu, Shudong  and
      Liu, Xuebo  and
      Wong, Derek F.  and
      Li, Zhaocong  and
      Jiao, Wenxiang  and
      Chao, Lidia S.  and
      Zhang, Min",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.acl-long.105",
    pages = "1878--1891",
}
