# SemLing-MNMT

Code and scripts for the ACL 2024 Findings paper "Improving Multilingual Neural Machine Translation by Utilizing Semantic and Linguistic Features".


## Code

The code is based on the open-source toolkit [fairseq](https://github.com/facebookresearch/fairseq). Our model code `transformer_disentangler_and_linguistic_encoder.py` is in `fairseq/fairseq/models`, and our criterion code `label_smoothed_cross_entropy_with_disentangling.py` is in `fairseq/fairseq/criterions`.
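
Since the files live inside the bundled fairseq tree, a quick way to confirm they are visible after installation is to import them directly; this is a minimal sanity check (assuming the modules register themselves on import, as is standard for fairseq models and criterions placed in these folders):

```bash
# Check that the custom model and criterion modules import cleanly.
python -c "import fairseq.models.transformer_disentangler_and_linguistic_encoder"
python -c "import fairseq.criterions.label_smoothed_cross_entropy_with_disentangling"
```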

## Get Started

### Requirements and Installation

- Python version == 3.9.12

- PyTorch version == 1.12.1

- Install fairseq:

  ```bash
  git clone https://github.com/ictnlp/SemLing-MNMT.git
  cd SemLing-MNMT
  pip install --editable ./
  ```
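
After installing, you can verify that the environment matches the stated requirements, for example:

```bash
# Confirm interpreter and framework versions against the requirements above.
python --version                                    # expect Python 3.9.12
python -c "import torch; print(torch.__version__)"  # expect 1.12.1
```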

## Data Pre-processing

We use the SentencePiece toolkit to pre-process the IWSLT2017, OPUS-7, and PC-6 datasets. For each dataset, we apply the unigram language model algorithm for tokenization and learn a joint vocabulary of 32K tokens.
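
As an illustration of that setup, a joint 32K unigram model could be trained on the concatenated training text of all languages roughly as follows; the file names and character coverage are placeholders, not values from the paper:

```bash
# Train a joint unigram SentencePiece model with a 32K vocabulary.
# train.all is a hypothetical concatenation of every language's training side.
spm_train --input=train.all \
          --model_prefix=spm_joint \
          --model_type=unigram \
          --vocab_size=32000 \
          --character_coverage=1.0

# Tokenize a raw file with the learned model (English side shown as an example).
spm_encode --model=spm_joint.model < train.en > train.spm.en
```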

## Training and Inference

We provide training and inference scripts for IWSLT2017 in the `scripts` folder as examples. Add your paths to the scripts and run them.

Here are some explanations:

- In `train.sh`, `--disentangler-lambda`, `--disentangler-reconstruction-lambda`, and `--disentangler-negative-lambda` are hyperparameters corresponding to $\lambda$, $\lambda_1$, and $\lambda_2$ in our paper, and `--linguistic-encoder-layers` controls the number of layers in the linguistic encoder (see the sketch after this list).

- In `generate.sh` and `generate_zero_shot.sh`, we generate translations and compute BLEU scores with SacreBLEU (version == 1.5.1).
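
For orientation, a training invocation using these flags might look roughly like the sketch below. Everything other than the four flags documented above (the data path, the `--arch` and `--criterion` names, and the lambda values) is an assumption, so consult `scripts/train.sh` for the real settings:

```bash
# Hypothetical sketch, not the exact train.sh from this repo.
# DATA_BIN, the --arch/--criterion names, and all numeric values are
# placeholders; only the four SemLing-specific flags come from the README.
fairseq-train $DATA_BIN \
    --arch transformer_disentangler_and_linguistic_encoder \
    --criterion label_smoothed_cross_entropy_with_disentangling \
    --disentangler-lambda 1.0 \
    --disentangler-reconstruction-lambda 1.0 \
    --disentangler-negative-lambda 1.0 \
    --linguistic-encoder-layers 3

# Scoring with SacreBLEU 1.5.1 (hyp.txt / ref.txt are placeholder files):
cat hyp.txt | sacrebleu ref.txt
```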