Experiment 2: TBASE.MT-OPUS

In this experiment, we train a Transformer BASE model on mt-opus version 1.0, which we have processed with hand-crafted segment filtering rules (Available at: MURL>). The total number of English-Thai segment pairs after segment filtering is 3,318,153.

Similar to Experiment 1 (TBASE.SCB-1M), the Transformer BASE model used in this experiment consists of 6 encoder and 6 decoder blocks, 512 embedding dimensions, and 2,048 feed-forward hidden units. The dropout rate is set to 0.3. The decoder input and output embeddings are shared. The maximum number of tokens per mini-batch is 9,750. The optimizer is Adam with an initial learning rate of 1e-7 and a weight decay rate of 0.0. The learning rate follows an inverse square root schedule with warmup for the first 4,000 updates. Label smoothing of 0.1 is applied during training. The criterion for selecting the best model checkpoint is the label-smoothed cross-entropy loss.
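The learning rate schedule described above can be sketched as follows, assuming it matches the inverse square root form used by fairseq's `inverse_sqrt` scheduler: a linear warmup from the initial rate, then decay proportional to `1 / sqrt(step)`. The peak rate `PEAK_LR` is a hypothetical value chosen for illustration only, since this document does not state it; the function and parameter names are also our own.

```python
import math

def inverse_sqrt_lr(step, peak_lr, warmup_init_lr=1e-7, warmup_updates=4000):
    """Return the learning rate at a given update under the assumed schedule."""
    if step < warmup_updates:
        # Linear warmup from warmup_init_lr towards peak_lr.
        slope = (peak_lr - warmup_init_lr) / warmup_updates
        return warmup_init_lr + step * slope
    # After warmup, decay proportionally to 1 / sqrt(step).
    return peak_lr * math.sqrt(warmup_updates / step)

PEAK_LR = 5e-4  # hypothetical: the peak learning rate is not given in this document

for step in (0, 2000, 4000, 16000):
    print(step, inverse_sqrt_lr(step, PEAK_LR))
```

Under these assumptions the rate peaks at update 4,000 and halves again by update 16,000, since `sqrt(4000 / 16000) = 0.5`.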

We train each model on 1 NVIDIA Tesla V100 GPU (as a part of DGX-1) with mixed-precision training (fp16) and gradient accumulation for 16 steps.
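As a quick check of what this setup implies, the arithmetic below combines the mini-batch limit with the accumulation steps to give an upper bound on the number of tokens contributing to one parameter update. It is only a sketch: the helper name is ours, and actual batches may hold fewer tokens than the limit.

```python
MAX_TOKENS_PER_BATCH = 9_750  # maximum tokens per mini-batch, from the model description
ACCUMULATION_STEPS = 16       # gradient accumulation steps
NUM_GPUS = 1                  # each model is trained on a single GPU

def max_tokens_per_update(max_tokens, accumulation_steps, num_gpus):
    """Upper bound on tokens seen before each optimizer step."""
    return max_tokens * accumulation_steps * num_gpus

print(max_tokens_per_update(MAX_TOKENS_PER_BATCH, ACCUMULATION_STEPS, NUM_GPUS))
# 156000
```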

Experiment setup

Package Installation

1.1 Install required Python packages via pip install
```
pip install -r requirements.txt
```
1.2 Install Fairseq Toolkit from source
```
bash scripts/install_fairseq.sh y
```
Note: When the first argument is y, the script also installs apex, an extension for mixed-precision training in PyTorch, intended for host machines with GPUs. When it is n, the script installs only fairseq. The default value is n.

In our experiment, we install the apex library.

Download Dataset

Download the mt-opus dataset (version 1.0) with the following script.
```
bash scripts/download_dataset.mt-opus.sh 1.0
```
Note: The first argument indicates the version of the mt-opus dataset (default: 1.0).

Data Preprocessing

Perform text cleaning and filtering

python ./scripts/clean_text.py ./dataset/raw/mt-opus \
    --unicode_norm NFKC \
    --out_dir ./dataset/cleaned/mt-opus

Merge the cleaned csv files into a single file.

python ./scripts/merge_csv_files.py ./dataset/cleaned/mt-opus/ \
    --out_dir ./dataset/merged/mt-opus/

Split the dataset into train/val/test set with the ratio 80/10/10
```
python ./scripts/split_dataset.py ./dataset/merged/mt-opus/en-th.merged.csv \
    0.8 \
    0.1 \
    --val_ratio 0.1 \
    --stratify \
    --seed 2020 \
    --out_dir ./dataset/split/mt-opus
```
As this script splits the train/val/test sets differently each time, we provide our version of the train/val/test split so that our experiment can be reproduced. It can be downloaded via the following script.
```
bash scripts/download_dataset_split.mt-opus.sh
```
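For readers who want to see the general idea of an 80/10/10 split, the sketch below shuffles with a seeded random generator and slices the result. It is not the code of `split_dataset.py` and does not implement `--stratify`; the function name and defaults are assumptions made for illustration.

```python
import random

def split_dataset(items, train_ratio=0.8, val_ratio=0.1, seed=2020):
    """Shuffle items with a seeded generator and cut them into train/val/test."""
    rng = random.Random(seed)  # local generator, so global random state is untouched
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_ratio)
    n_val = int(len(shuffled) * val_ratio)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test

train, val, test = split_dataset(range(1000))
print(len(train), len(val), len(test))  # 800 100 100
```

Even with a fixed seed, a split can differ across library versions or input orderings, which is one reason to distribute the exact split files.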

Perform text preprocessing for th→en

newmm→moses

python ./scripts/preprocess_tokenize.py \
    --out_dir ./dataset/tokenized/mt-opus/th-en/newmm-moses/ \
    --spm_out_dir ./dataset/spm/mt-opus/th-en \
    --split_dataset_dir ./dataset/split/mt-opus \
    --src_lang th \
    --tgt_lang en \
    --src_tokenizer newmm \
    --tgt_tokenizer moses

newmm→spm

python ./scripts/preprocess_tokenize.py \
    --out_dir ./dataset/tokenized/mt-opus/th-en/newmm-spm/ \
    --spm_out_dir ./dataset/spm/mt-opus/th-en \
    --split_dataset_dir ./dataset/split/mt-opus \
    --src_lang th \
    --tgt_lang en \
    --tgt_spm_vocab_size 16000 \
    --src_tokenizer newmm \
    --tgt_tokenizer spm

spm→moses

python ./scripts/preprocess_tokenize.py \
    --out_dir ./dataset/tokenized/mt-opus/th-en/spm-moses/ \
    --spm_out_dir ./dataset/spm/mt-opus/th-en \
    --split_dataset_dir ./dataset/split/mt-opus \
    --src_lang th \
    --tgt_lang en \
    --src_spm_vocab_size 16000 \
    --src_tokenizer spm \
    --tgt_tokenizer moses

spm→spm

python ./scripts/preprocess_tokenize.py \
    --out_dir ./dataset/tokenized/mt-opus/th-en/spm-spm/ \
    --spm_out_dir ./dataset/spm/mt-opus/th-en \
    --split_dataset_dir ./dataset/split/mt-opus \
    --src_lang th \
    --tgt_lang en \
    --src_spm_vocab_size 16000 \
    --tgt_spm_vocab_size 16000 \
    --src_tokenizer spm \
    --tgt_tokenizer spm

Perform text preprocessing for en→th

moses→newmm_space

python ./scripts/preprocess_tokenize.py \
    --out_dir ./dataset/tokenized/mt-opus/en-th/moses-newmm_space/ \
    --spm_out_dir ./dataset/spm/mt-opus/en-th \
    --split_dataset_dir ./dataset/split/mt-opus \
    --src_lang en \
    --tgt_lang th \
    --src_tokenizer moses \
    --tgt_tokenizer newmm_space

moses→spm

python ./scripts/preprocess_tokenize.py \
    --out_dir ./dataset/tokenized/mt-opus/en-th/moses-spm/ \
    --spm_out_dir ./dataset/spm/mt-opus/en-th \
    --split_dataset_dir ./dataset/split/mt-opus \
    --src_lang en \
    --tgt_lang th \
    --tgt_spm_vocab_size 16000 \
    --src_tokenizer moses \
    --tgt_tokenizer spm

spm→newmm_space

python ./scripts/preprocess_tokenize.py \
    --out_dir ./dataset/tokenized/mt-opus/en-th/spm-newmm_space/ \
    --spm_out_dir ./dataset/spm/mt-opus/en-th \
    --split_dataset_dir ./dataset/split/mt-opus \
    --src_lang en \
    --tgt_lang th \
    --src_spm_vocab_size 16000 \
    --src_tokenizer spm \
    --tgt_tokenizer newmm_space

spm→spm

python ./scripts/preprocess_tokenize.py \
    --out_dir ./dataset/tokenized/mt-opus/en-th/spm-spm/ \
    --spm_out_dir ./dataset/spm/mt-opus/en-th \
    --split_dataset_dir ./dataset/split/mt-opus \
    --src_lang en \
    --tgt_lang th \
    --src_spm_vocab_size 16000 \
    --tgt_spm_vocab_size 16000 \
    --src_tokenizer spm \
    --tgt_tokenizer spm

Binarize tokenized segments in train/val/test with fairseq-preprocess via the following script.

bash ./scripts/fairseq_preprocess.sh th en 130000 130000 ./dataset/tokenized/mt-opus/th-en/newmm-moses ./dataset/binarized/mt-opus/th-en/newmm-moses/130000-130000/

bash ./scripts/fairseq_preprocess.sh th en 130000 16000 ./dataset/tokenized/mt-opus/th-en/newmm-spm ./dataset/binarized/mt-opus/th-en/newmm-spm/130000-16000/

bash ./scripts/fairseq_preprocess.sh th en 16000 130000 ./dataset/tokenized/mt-opus/th-en/spm-moses ./dataset/binarized/mt-opus/th-en/spm-moses/16000-130000/

bash ./scripts/fairseq_preprocess.sh th en 32000 32000 ./dataset/tokenized/mt-opus/th-en/spm-spm ./dataset/binarized/mt-opus/th-en/spm-spm/32000-joined/ --joined-dictionary

bash ./scripts/fairseq_preprocess.sh en th 130000 130000 ./dataset/tokenized/mt-opus/en-th/moses-newmm_space ./dataset/binarized/mt-opus/en-th/moses-newmm_space/130000-130000/

bash ./scripts/fairseq_preprocess.sh en th 130000 16000 ./dataset/tokenized/mt-opus/en-th/moses-spm ./dataset/binarized/mt-opus/en-th/moses-spm/130000-16000/

bash ./scripts/fairseq_preprocess.sh en th 16000 130000 ./dataset/tokenized/mt-opus/en-th/spm-newmm_space ./dataset/binarized/mt-opus/en-th/spm-newmm_space/16000-130000/

bash ./scripts/fairseq_preprocess.sh en th 32000 32000 ./dataset/tokenized/mt-opus/en-th/spm-spm ./dataset/binarized/mt-opus/en-th/spm-spm/32000-joined/ --joined-dictionary
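The positional arguments in the commands above appear to be: source language, target language, source vocabulary size, target vocabulary size, input directory, and output directory. As a rough sketch of how such arguments could map onto `fairseq-preprocess` options, consider the helper below. We have not reproduced `fairseq_preprocess.sh` here: the mapping, the helper name, and the `train`/`val`/`test` file prefixes are all assumptions.

```python
import shlex

def build_preprocess_args(src, tgt, n_src, n_tgt, in_dir, out_dir, joined=False):
    """Build a fairseq-preprocess command line (hypothetical mapping)."""
    in_dir = in_dir.rstrip("/")
    args = [
        "fairseq-preprocess",
        "--source-lang", src,
        "--target-lang", tgt,
        "--trainpref", f"{in_dir}/train",  # assumed file prefix
        "--validpref", f"{in_dir}/val",    # assumed file prefix
        "--testpref", f"{in_dir}/test",    # assumed file prefix
        "--destdir", out_dir,
        "--nwordssrc", str(n_src),         # cap on source dictionary size
        "--nwordstgt", str(n_tgt),         # cap on target dictionary size
    ]
    if joined:
        # Share one dictionary between source and target.
        args.append("--joined-dictionary")
    return args

cmd = build_preprocess_args(
    "th", "en", 32000, 32000,
    "./dataset/tokenized/mt-opus/th-en/spm-spm",
    "./dataset/binarized/mt-opus/th-en/spm-spm/32000-joined/",
    joined=True,
)
print(shlex.join(cmd))
```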

Model Training

Train Transformer BASE model via the following script: scripts/fairseq_train.transformer_base.single_gpu.fp16.sh

Note: The first argument indicates the GPU ID. In this case, we train each model on 1 GPU (GPU_ID: 0-7).

Train models for th→en direction

1.1 newmm→moses

bash ./scripts/fairseq_train.transformer_base.single_gpu.fp16.sh 0 ./dataset/binarized/mt-opus/th-en/newmm-moses/130000-130000/ mt-opus/th-en/newmm-moses/130000-130000 9750

1.2 newmm→spm

bash ./scripts/fairseq_train.transformer_base.single_gpu.fp16.sh 1 ./dataset/binarized/mt-opus/th-en/newmm-spm/130000-16000/ mt-opus/th-en/newmm-spm/130000-16000 9750

1.3 spm→moses

bash ./scripts/fairseq_train.transformer_base.single_gpu.fp16.sh 2 ./dataset/binarized/mt-opus/th-en/spm-moses/16000-130000/ mt-opus/th-en/spm-moses/16000-130000 9750

1.4 spm→spm

bash ./scripts/fairseq_train.transformer_base.single_gpu.fp16.sh 3 ./dataset/binarized/mt-opus/th-en/spm-spm/32000-joined/ mt-opus/th-en/spm-spm/32000-joined 9750

Train models for en→th direction

2.1 moses→newmm_space

bash ./scripts/fairseq_train.transformer_base.single_gpu.fp16.sh 4 ./dataset/binarized/mt-opus/en-th/moses-newmm_space/130000-130000/ mt-opus/en-th/moses-newmm_space/130000-130000 9750

2.2 moses→spm

bash ./scripts/fairseq_train.transformer_base.single_gpu.fp16.sh 5 ./dataset/binarized/mt-opus/en-th/moses-spm/130000-16000/ mt-opus/en-th/moses-spm/130000-16000 9750

2.3 spm→newmm_space

bash ./scripts/fairseq_train.transformer_base.single_gpu.fp16.sh 6 ./dataset/binarized/mt-opus/en-th/spm-newmm_space/16000-130000/ mt-opus/en-th/spm-newmm_space/16000-130000 9750

2.4 spm→spm

bash ./scripts/fairseq_train.transformer_base.single_gpu.fp16.sh 7 ./dataset/binarized/mt-opus/en-th/spm-spm/32000-joined/ mt-opus/en-th/spm-spm/32000-joined 9750

Model Evaluation

1. Evaluate models on `mt-opus` test set.

The total number of segment pairs is 100,177.

1.1 Evaluate models on th→en direction

1.1.1 newmm→moses

CUDA_VISIBLE_DEVICES=0 bash ./scripts/evaluate_model.test_set.fp16.sh \
./checkpoints/mt-opus/th-en/newmm-moses/130000-130000/checkpoint_best.pt \
./dataset/binarized/mt-opus/th-en/newmm-moses/130000-130000 \
./dataset/tokenized/mt-opus/th-en/newmm-moses/test.th \
th \
en \
word \
./dataset/split/mt-opus/test.detok.en \
./translation_results/mt-opus/th-en/newmm-moses/130000-130000/checkpoint_best \
20000 \
4

1.1.2 newmm→spm

CUDA_VISIBLE_DEVICES=0 bash ./scripts/evaluate_model.test_set.fp16.sh \
./checkpoints/mt-opus/th-en/newmm-spm/130000-16000/checkpoint_best.pt \
./dataset/binarized/mt-opus/th-en/newmm-spm/130000-16000 \
./dataset/tokenized/mt-opus/th-en/newmm-spm/test.th \
th \
en \
sentencepiece \
./dataset/split/mt-opus/test.detok.en \
./translation_results/mt-opus/th-en/newmm-spm/130000-16000/checkpoint_best \
20000 \
4

1.1.3 spm→moses

CUDA_VISIBLE_DEVICES=0 bash ./scripts/evaluate_model.test_set.fp16.sh \
./checkpoints/mt-opus/th-en/spm-moses/16000-130000/checkpoint_best.pt \
./dataset/binarized/mt-opus/th-en/spm-moses/16000-130000 \
./dataset/tokenized/mt-opus/th-en/spm-moses/test.th \
th \
en \
word \
./dataset/split/mt-opus/test.detok.en \
./translation_results/mt-opus/th-en/spm-moses/16000-130000/checkpoint_best \
20000 \
4

1.1.4 spm→spm

CUDA_VISIBLE_DEVICES=0 bash ./scripts/evaluate_model.test_set.fp16.sh \
./checkpoints/mt-opus/th-en/spm-spm/32000-joined/checkpoint_best.pt \
./dataset/binarized/mt-opus/th-en/spm-spm/32000-joined \
./dataset/tokenized/mt-opus/th-en/spm-spm/test.th \
th \
en \
sentencepiece \
./dataset/split/mt-opus/test.detok.en \
./translation_results/mt-opus/th-en/spm-spm/32000-joined/checkpoint_best \
20000 \
4

1.2 Evaluate models on en→th direction

1.2.1 moses→newmm_space

CUDA_VISIBLE_DEVICES=0 bash ./scripts/evaluate_model.test_set.fp16.sh \
./checkpoints/mt-opus/en-th/moses-newmm_space/130000-130000/checkpoint_best.pt \
./dataset/binarized/mt-opus/en-th/moses-newmm_space/130000-130000 \
./dataset/tokenized/mt-opus/en-th/moses-newmm_space/test.en \
en \
th \
word \
./dataset/split/mt-opus/test.detok.th \
./translation_results/mt-opus/en-th/moses-newmm_space/130000-130000/checkpoint_best \
20000 \
4

1.2.2 moses→spm

CUDA_VISIBLE_DEVICES=0 bash ./scripts/evaluate_model.test_set.fp16.sh \
./checkpoints/mt-opus/en-th/moses-spm/130000-16000/checkpoint_best.pt \
./dataset/binarized/mt-opus/en-th/moses-spm/130000-16000 \
./dataset/tokenized/mt-opus/en-th/moses-spm/test.en \
en \
th \
sentencepiece \
./dataset/split/mt-opus/test.detok.th \
./translation_results/mt-opus/en-th/moses-spm/130000-16000/checkpoint_best \
20000 \
4

1.2.3 spm→newmm_space

CUDA_VISIBLE_DEVICES=0 bash ./scripts/evaluate_model.test_set.fp16.sh \
./checkpoints/mt-opus/en-th/spm-newmm_space/16000-130000/checkpoint_best.pt \
./dataset/binarized/mt-opus/en-th/spm-newmm_space/16000-130000 \
./dataset/tokenized/mt-opus/en-th/spm-newmm_space/test.en \
en \
th \
word \
./dataset/split/mt-opus/test.detok.th \
./translation_results/mt-opus/en-th/spm-newmm_space/16000-130000/checkpoint_best \
20000 \
4

1.2.4 spm→spm

CUDA_VISIBLE_DEVICES=0 bash ./scripts/evaluate_model.test_set.fp16.sh \
./checkpoints/mt-opus/en-th/spm-spm/32000-joined/checkpoint_best.pt \
./dataset/binarized/mt-opus/en-th/spm-spm/32000-joined \
./dataset/tokenized/mt-opus/en-th/spm-spm/test.en \
en \
th \
sentencepiece \
./dataset/split/mt-opus/test.detok.th \
./translation_results/mt-opus/en-th/spm-spm/32000-joined/checkpoint_best \
20000 \
4

2. Evaluate models on Thai-English IWSLT 2015 test sets (tst2010-2013).

The total number of segment pairs is 4,242.

2.1 Evaluate models on th→en direction

2.1.1 newmm→moses

CUDA_VISIBLE_DEVICES=0 bash ./scripts/evaluate_model.iwslt2015.sh \
./checkpoints/mt-opus/th-en/newmm-moses/130000-130000/checkpoint_best.pt \
./dataset/binarized/mt-opus/th-en/newmm-moses/130000-130000 \
th \
en \
word \
word \
./iwslt_2015/test/tst2010-2013_th-en.th \
./iwslt_2015/test/tst2010-2013_th-en.en \
./translation_results/mt-opus@eval_on@iwslt2015/th-en/newmm-moses/130000-130000/checkpoint_best \
64 \
4

2.1.2 newmm→spm

CUDA_VISIBLE_DEVICES=0 bash ./scripts/evaluate_model.iwslt2015.sh \
./checkpoints/mt-opus/th-en/newmm-spm/130000-16000/checkpoint_best.pt \
./dataset/binarized/mt-opus/th-en/newmm-spm/130000-16000 \
th \
en \
word \
sentencepiece \
./iwslt_2015/test/tst2010-2013_th-en.th \
./iwslt_2015/test/tst2010-2013_th-en.en \
./translation_results/mt-opus@eval_on@iwslt2015/th-en/newmm-spm/130000-16000/checkpoint_best \
64 \
4

2.1.3 spm→moses

CUDA_VISIBLE_DEVICES=0 bash ./scripts/evaluate_model.iwslt2015.sh \
./checkpoints/mt-opus/th-en/spm-moses/16000-130000/checkpoint_best.pt \
./dataset/binarized/mt-opus/th-en/spm-moses/16000-130000 \
th \
en \
sentencepiece \
word \
./iwslt_2015/test/tst2010-2013_th-en.th \
./iwslt_2015/test/tst2010-2013_th-en.en \
./translation_results/mt-opus@eval_on@iwslt2015/th-en/spm-moses/16000-130000/checkpoint_best \
64 \
4 \
./dataset/spm/mt-opus/th-en/spm.th.v-16000.cased.model

2.1.4 spm→spm

CUDA_VISIBLE_DEVICES=0 bash ./scripts/evaluate_model.iwslt2015.sh \
./checkpoints/mt-opus/th-en/spm-spm/32000-joined/checkpoint_best.pt \
./dataset/binarized/mt-opus/th-en/spm-spm/32000-joined \
th \
en \
sentencepiece \
sentencepiece \
./iwslt_2015/test/tst2010-2013_th-en.th \
./iwslt_2015/test/tst2010-2013_th-en.en \
./translation_results/mt-opus@eval_on@iwslt2015/th-en/spm-spm/32000-joined/checkpoint_best \
64 \
4 \
./dataset/spm/mt-opus/th-en/spm.th.v-16000.cased.model

2.2 Evaluate models on en→th direction

Pretokenize Thai target sentences with PyThaiNLP's newmm dictionary-based word tokenizer with the following script.

python ./scripts/th_newmm_space_tokenize.py \
./iwslt_2015/test/tst2010-2013_th-en.th \
./iwslt_2015/test/tst2010-2013_th-en.th.ref.tok

2.2.1 moses→newmm_space

CUDA_VISIBLE_DEVICES=0 bash ./scripts/evaluate_model.iwslt2015.sh \
./checkpoints/mt-opus/en-th/moses-newmm_space/130000-130000/checkpoint_best.pt \
./dataset/binarized/mt-opus/en-th/moses-newmm_space/130000-130000 \
en \
th \
word \
word \
./iwslt_2015/test/tst2010-2013_th-en.en \
./iwslt_2015/test/tst2010-2013_th-en.th \
./translation_results/mt-opus@eval_on@iwslt2015/en-th/moses-newmm_space/130000-130000/checkpoint_best \
64 \
4

2.2.2 moses→spm

CUDA_VISIBLE_DEVICES=0 bash ./scripts/evaluate_model.iwslt2015.sh \
./checkpoints/mt-opus/en-th/moses-spm/130000-16000/checkpoint_best.pt \
./dataset/binarized/mt-opus/en-th/moses-spm/130000-16000 \
en \
th \
word \
sentencepiece \
./iwslt_2015/test/tst2010-2013_th-en.en \
./iwslt_2015/test/tst2010-2013_th-en.th \
./translation_results/mt-opus@eval_on@iwslt2015/en-th/moses-spm/130000-16000/checkpoint_best \
64 \
4

2.2.3 spm→newmm_space

CUDA_VISIBLE_DEVICES=0 bash ./scripts/evaluate_model.iwslt2015.sh \
./checkpoints/mt-opus/en-th/spm-newmm_space/16000-130000/checkpoint_best.pt \
./dataset/binarized/mt-opus/en-th/spm-newmm_space/16000-130000 \
en \
th \
sentencepiece \
word \
./iwslt_2015/test/tst2010-2013_th-en.en \
./iwslt_2015/test/tst2010-2013_th-en.th \
./translation_results/mt-opus@eval_on@iwslt2015/en-th/spm-newmm_space/16000-130000/checkpoint_best \
64 \
4 \
./dataset/spm/mt-opus/en-th/spm.en.v-16000.cased.model

2.2.4 spm→spm

CUDA_VISIBLE_DEVICES=0 bash ./scripts/evaluate_model.iwslt2015.sh \
./checkpoints/mt-opus/en-th/spm-spm/32000-joined/checkpoint_best.pt \
./dataset/binarized/mt-opus/en-th/spm-spm/32000-joined \
en \
th \
sentencepiece \
sentencepiece \
./iwslt_2015/test/tst2010-2013_th-en.en \
./iwslt_2015/test/tst2010-2013_th-en.th \
./translation_results/mt-opus@eval_on@iwslt2015/en-th/spm-spm/32000-joined/checkpoint_best \
64 \
4 \
./dataset/spm/mt-opus/en-th/spm.en.v-16000.cased.model
