eole predict command #126

Closed
HURIMOZ opened this issue Oct 7, 2024 · 25 comments
Labels
documentation (Improvements or additions to documentation)

Comments

@HURIMOZ

HURIMOZ commented Oct 7, 2024

Hi, can you please clarify how to use the eole predict command?
Right now I use the command eole predict --src data/src-test.txt --model_path models/step_1000 --beam_size 5 --batch_size 4096 --output translations/tgt-test.txt --gpu 0 for inference, and all I get is <UNK> tokens in the output.

I'm using this in the config file:

transforms_configs:
  normalize:
    norm_quote_commas: True
    norm_numbers: True
  onmt_tokenize:
    src_subword_type: bpe
    tgt_subword_type: sentencepiece
    src_subword_model: data/en.wiki.bpe.vs25000.model
    tgt_subword_model: processed_data/spm_tgt-train.model
    src_onmttok_kwargs:
      mode: none
      spacer_annotate: true
    tgt_onmttok_kwargs:
      mode: none
      spacer_annotate: true
  filtertoolong:
    src_seq_length: 512
    tgt_seq_length: 512

I use this for the vocab files:

src_vocab: processed_data/spm_src-train.onmt_vocab
tgt_vocab: processed_data/spm_tgt-train.onmt_vocab

My files look good and I don't understand why I'm getting UNK tokens only.

I also see that the eole documentation says we can use eole translate -c your_config.yaml. What is that for?

@francoishernandez
Contributor

You are mentioning a config file, but I don't see it in your command line, so it can't be used.
The key logic is mentioned here:

The main entrypoints are typically used with a yaml configuration file. Most parameters can also be overridden via corresponding command line flags if needed.

So, you need to call eole predict -c your_config_file.yaml, and then, if needed, you can override some parameters via the command line, e.g. eole predict -c your_config_file.yaml -src data/src-test.txt ....
The translate command does not exist anymore; it was replaced by predict. This part of the docs slipped under the radar during the update (it might be good to add some more examples as well for clarity). Feel free to open some quick PR(s) to fix such issues as you encounter them.

@francoishernandez added the documentation label Oct 8, 2024
@HURIMOZ
Author

HURIMOZ commented Oct 12, 2024

Hi François, I still get UNK tokens only.
I thought it was because I'm using pre-trained embeddings and a different tokenization method for my source, but no. I disabled the pre-trained embeddings and used SentencePiece to build the tok model and vocab file, and I still get UNK tokens only.
So I have a question: Is the json file generated for each training step automatically used in the inference process for the tokenization? Those generated files are new to me, coming from OpenNMT.
Also, do I need to define a yaml file specifically for inference, or is just mentioning the step in the bash command line enough?

@francoishernandez
Contributor

That's quite difficult to debug remotely with partial information like that. Could be something going wrong in your training, could be some misconfiguration, could be some version conflict. If you built the setup from an existing recipe, maybe try doing some kind of ablation study / adding features iteratively to check at which point things go off the rails.

That being said, has your model been trained with a recent version/commit, or does it date back to an earlier version? Since we moved quite a few things configuration-wise (see below), you might have ended up in a shaky setup.

So I have a question: Is the json file generated for each training step automatically used in the inference process for the tokenization? Those generated files are new to me, coming from OpenNMT.

Yes, since version 0.0.2 some transform/inference related params can be stored in the model config for more transparent usage.

Also, do I need to define a yaml file specifically for inference or just mentioning the step in the bash command line is enough?

If your model has been trained/converted with 0.0.2 or later, you should not need a yaml config; you can specify only the needed inference-related params via the command line.

Just to be clear, these json/yaml files and command line arguments have the same end goal: build a valid config for your model to run. So the true question is: what do you need for the model to run?
In most cases, it can be summed up to:

  • (a) the model weights;
  • (b) the model vocabulary;
  • (c) transforms/tokenization model and configuration;
  • (d) some model-related configuration;
  • (e) some decoding-related configuration.

The idea of embedding more stuff in the model config.json file is to make most of this transparent, and allow you to focus on "what's important" when predicting, i.e. (e) in the list above.
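
For illustration, a minimal sketch of what (e) alone could look like, whether in a small inference yaml or as command-line flags (these parameter names appear later in this thread; the values are just examples):

beam_size: 5
n_best: 3
verbose: true
world_size: 1
gpu: 0

You would then point eole predict at your model with --model_path, plus --src and --output for the files.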

Hope this helps.

@HURIMOZ
Author

HURIMOZ commented Oct 15, 2024

Hi François, I don't know what I'm doing wrong.
See my training config file here:

## IO
save_data: processed_data
overwrite: True
seed: 1234
report_every: 100
valid_metrics: ["BLEU"]

### Vocab
src_vocab: processed_data/spm_src-train.onmt_vocab
tgt_vocab: processed_data/spm_tgt-train.onmt_vocab
#n_sample: -1

data:
    corpus_1:
        path_src: data/src-train.txt
        path_tgt: data/tgt-train.txt
    valid:
        path_src: data/src-val.txt
        path_tgt: data/tgt-val.txt

    #transforms: [normalize, onmt_tokenize, filtertoolong]
transforms_configs:
    normalize:
        src_lang: en
        tgt_lang: ty
        norm_quote_commas: True
        norm_numbers: True
    onmt_tokenize:
        src_subword_type: sentencepiece
        src_subword_model: data/en.wiki.bpe.vs25000.model
        tgt_subword_type: sentencepiece
        tgt_subword_model: processed_data/spm_tgt-train.model
    filtertoolong:
        src_seq_length: 512
        tgt_seq_length: 512

# Number of candidates for SentencePiece sampling
    #subword_nbest: 20
# Smoothing parameter for SentencePiece sampling
    #subword_alpha: 0.1
  

training:
    # Model configuration
    model_path: models
    keep_checkpoint: 50
    save_checkpoint_steps: 1000
    average_decay: 0
    train_steps: 100000
    valid_steps: 500

    # bucket_size: 
    bucket_size: 2048
    num_workers: 4
    prefetch_factor: 2
    world_size: 1
    gpu_ranks: [0]
    batch_type: "tokens"
    batch_size: 2048
    valid_batch_size: 2048
    batch_size_multiple: 8
    accum_count: [10]
    accum_steps: [0]
    dropout_steps: [0]
    dropout: [0.2]
    attention_dropout: [0.2]
    compute_dtype: "fp16"
    optim: "adam"
    learning_rate: 2
    warmup_steps: 4000
    decay_method: "noam"
    adam_beta2: 0.998
    max_grad_norm: 0
    label_smoothing: 0.1
    param_init: 0
    param_init_glorot: true
    normalization: "tokens"
    #early_stopping: 5
   
   
# Pretrained embeddings configuration for the source language
embeddings_type: word2vec
src_embeddings: data/en.wiki.bpe.vs25000.d300.w2v-256.txt
#tgt_embeddings:
save_data: processed_data/
#position_encoding_type: SinusoidalInterleaved

model:
    architecture: "transformer"
    hidden_size: 256
    share_decoder_embeddings: true
    share_embeddings: false
    layers: 6
    heads: 8
    transformer_ff: 256
    word_vec_size: 256

First thing: when I use early_stopping, the training won't go past 3000 steps or so and finds the best model to be at 500 steps (the very first model saved...). Also, the training seems too fast compared to OpenNMT-py. It would usually take about 8 hours to train those same 90,000 parallel lines with OpenNMT-py.

So I disabled the early-stopping parameter, let the training run overnight and stopped it at 69000 steps. None of the 50 models would yield anything in the output.

For inference, I tried four commands:
eole predict -c wmt17_enty.yaml -model_path models/step_40000/optimizer.pt -src data/src-test.txt -output translations/tgt-test.txt -verbose

eole predict -model_path models/step_40000/optimizer.pt -src data/src-test.txt -output translations/tgt-test.txt -verbose

eole predict -c wmt17_enty.yaml -model_path models/step_40000 -src data/src-test.txt -output translations/tgt-test.txt -verbose

eole predict -model_path models/step_40000 -src data/src-test.txt -output translations/tgt-test.txt -verbose

The system seems to compute but then doesn't write the output to the txt file. Even -verbose won't work.

I did look into the Eole NLP documentation to try and get rid of OpenNMT-py parameters.

I'm not sure what I'm doing wrong.

@francoishernandez
Contributor

Might not be the only issue, but this line should not be commented out:
#transforms: [normalize, onmt_tokenize, filtertoolong]
This means no transform will be applied.
The transforms_configs entries are just the settings; you also need to select which transforms to apply, either at the general level (the line you have commented out) or at each dataset level if needed.
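
For example, a minimal sketch of the root-level wiring, reusing the transform settings you already have (the structure is what matters here):

transforms: [normalize, onmt_tokenize, filtertoolong]
transforms_configs:
    normalize:
        src_lang: en
        tgt_lang: ty
        norm_quote_commas: True
        norm_numbers: True
    onmt_tokenize:
        src_subword_type: sentencepiece
        src_subword_model: data/en.wiki.bpe.vs25000.model
        tgt_subword_type: sentencepiece
        tgt_subword_model: processed_data/spm_tgt-train.model
    filtertoolong:
        src_seq_length: 512
        tgt_seq_length: 512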

This can probably explain the early-stopping behaviour as well: the model is probably not learning anything because the input data does not make sense with regard to its vocab.

@HURIMOZ
Author

HURIMOZ commented Oct 16, 2024

Hi François, thanks for that.
So, I tried the dataset level configuration too.
Here's my yaml config file:

## IO
save_data: processed_data
overwrite: True
seed: 1234
report_every: 100
valid_metrics: ["BLEU"]

### Vocab
src_vocab: processed_data/spm_src-train.onmt_vocab
tgt_vocab: processed_data/spm_tgt-train.onmt_vocab
#n_sample: -1

data:
    corpus_1:
        path_src: data/src-train.txt
        path_tgt: data/tgt-train.txt
        transforms: [normalize, onmt_tokenize, filtertoolong]
        transforms_configs:
            normalize:
                src_lang: en
                tgt_lang: ty
                norm_quote_commas: True
                norm_numbers: True
            onmt_tokenize:
                src_subword_type: sentencepiece
                src_subword_model: data/en.wiki.bpe.vs25000.model
                tgt_subword_type: sentencepiece
                tgt_subword_model: processed_data/spm_tgt-train.model
            filtertoolong:
                src_seq_length: 512
                tgt_seq_length: 512




    valid:
        path_src: data/src-val.txt
        path_tgt: data/tgt-val.txt
        transforms_configs:
            normalize:
                src_lang: en
                tgt_lang: ty
                norm_quote_commas: True
                norm_numbers: True
            onmt_tokenize:
                src_subword_type: sentencepiece
                src_subword_model: data/en.wiki.bpe.vs25000.model
                tgt_subword_type: sentencepiece
                tgt_subword_model: processed_data/spm_tgt-train.model
            filtertoolong:
                src_seq_length: 512
                tgt_seq_length: 512



# Number of candidates for SentencePiece sampling
    #subword_nbest: 20
# Smoothing parameter for SentencePiece sampling
    #subword_alpha: 0.1
  

training:
    # Model configuration
    model_path: models
    keep_checkpoint: 50
    save_checkpoint_steps: 1000
    average_decay: 0
    train_steps: 100000
    valid_steps: 500

    # bucket_size: 
    bucket_size: 2048
    num_workers: 4
    prefetch_factor: 2
    world_size: 1
    gpu_ranks: [0]
    batch_type: "tokens"
    batch_size: 2048
    valid_batch_size: 2048
    batch_size_multiple: 8
    accum_count: [10]
    accum_steps: [0]
    dropout_steps: [0]
    dropout: [0.2]
    attention_dropout: [0.2]
    compute_dtype: "fp16"
    optim: "adam"
    learning_rate: 2
    warmup_steps: 4000
    decay_method: "noam"
    adam_beta2: 0.998
    max_grad_norm: 0
    label_smoothing: 0.1
    param_init: 0
    param_init_glorot: true
    normalization: "tokens"
    #early_stopping: 5
   
   
# Pretrained embeddings configuration for the source language
embeddings_type: word2vec
src_embeddings: data/en.wiki.bpe.vs25000.d300.w2v-256.txt
#tgt_embeddings:
save_data: processed_data/
#position_encoding_type: SinusoidalInterleaved

model:
    architecture: "transformer"
    hidden_size: 256
    share_decoder_embeddings: true
    share_embeddings: false
    layers: 6
    heads: 8
    transformer_ff: 256
    word_vec_size: 256

I now get to 93% acc and 4.2 xent after 5000 steps, but the inference still doesn't work. None of the four commands above will generate the translations.

@francoishernandez
Contributor

Well, in that case you would need to set the transforms explicitly at inference. Dataset-level configuration is not embedded automatically in the model config.
(The main reason is that if you enable dataset-level configuration, there might be different transform pipelines going on in your setup, so we could not really know which one to use at inference by default. We could technically take the valid config though; might PR that at some point.)

So, you can either:

  • configure the transforms at the root level in your training config, to make things easier (you don't seem to need dataset-level configuration here) -- and by that I mean setting both transforms and transforms_configs;
  • configure the transforms at the inference stage, using a yaml config file (see the sketch just below).
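
For the second option, a minimal sketch of what such an inference yaml could contain (the transform settings from your training config, plus whatever decoding parameters you want):

transforms: [normalize, onmt_tokenize]
transforms_configs:
    normalize:
        src_lang: en
        tgt_lang: ty
        norm_quote_commas: True
        norm_numbers: True
    onmt_tokenize:
        src_subword_type: sentencepiece
        src_subword_model: data/en.wiki.bpe.vs25000.model
        tgt_subword_type: sentencepiece
        tgt_subword_model: processed_data/spm_tgt-train.model

beam_size: 5
verbose: true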

For inference, I tried four commands:
eole predict -c wmt17_enty.yaml -model_path models/step_40000/optimizer.pt -src data/src-test.txt -output translations/tgt-test.txt -verbose

models/step_40000/optimizer.pt is not a valid model path; it's just the optimizer params, useful if you want to continue training.

eole predict -model_path models/step_40000/optimizer.pt -src data/src-test.txt -output translations/tgt-test.txt -verbose

same as above

eole predict -c wmt17_enty.yaml -model_path models/step_40000 -src data/src-test.txt -output translations/tgt-test.txt -verbose

this one should work, provided your model is fine (the latest one with 93% acc should be), and the wmt17_enty.yaml enables the transforms (both transforms and transforms_configs)

eole predict -model_path models/step_40000 -src data/src-test.txt -output translations/tgt-test.txt -verbose

this one should be fine, provided you trained your model with root-level transforms configs, not dataset-level ones
(note: you can technically adapt your model config.json manually to test this; just move the "transforms"/"transforms_configs" keys to the root level of the json)
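
A hypothetical sketch of that manual edit, showing only the two keys in question (the rest of config.json is omitted, and the exact surrounding structure may differ):

Before, dataset-level, as trained:
{
    "data": {
        "corpus_1": {
            "transforms": ["normalize", "onmt_tokenize", "filtertoolong"],
            "transforms_configs": {"onmt_tokenize": {"src_subword_model": "data/en.wiki.bpe.vs25000.model"}}
        }
    }
}

After, moved to the root level so predict can pick them up:
{
    "transforms": ["normalize", "onmt_tokenize", "filtertoolong"],
    "transforms_configs": {"onmt_tokenize": {"src_subword_model": "data/en.wiki.bpe.vs25000.model"}},
    "data": {
        "corpus_1": {}
    }
}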


Final notes:

  • this "duality" of the transforms configuration is similar to what was done in OpenNMT-py (transforms field + many root-level flags), but just in a more structured way (transforms field + explicit transforms configuration);
  • it might be unified to rely on a single field, rather than two, but it might not necessarily make things simpler if we want to retain adaptability.

@HURIMOZ
Author

HURIMOZ commented Oct 17, 2024

Thank you François.
While I see some improvements in the training, I still get no output at inference.
Here is my new config yaml file:

## IO
overwrite: True
seed: 1234
report_every: 100
valid_metrics: ["BLEU"]

### Vocab
src_vocab: processed_data/spm_src-train.onmt_vocab
tgt_vocab: processed_data/spm_tgt-train.onmt_vocab
#n_sample: -1

data:
    corpus_1:
        path_src: data/src-train.txt
        path_tgt: data/tgt-train.txt
        transforms: [normalize, onmt_tokenize, filtertoolong]
        transforms_configs:
            normalize:
                src_lang: en
                tgt_lang: ty
                norm_quote_commas: true
                norm_numbers: true
            onmt_tokenize:
                src_subword_type: sentencepiece
                src_subword_model: data/en.wiki.bpe.vs25000.model
                tgt_subword_type: sentencepiece
                tgt_subword_model: processed_data/spm_tgt-train.model
            filtertoolong:
                src_seq_length: 512
                tgt_seq_length: 512


    valid:
        path_src: data/src-val.txt
        path_tgt: data/tgt-val.txt
        transforms: [normalize, onmt_tokenize]
        transforms_configs:
            normalize:
                src_lang: en
                tgt_lang: ty
                norm_quote_commas: True
                norm_numbers: True
            onmt_tokenize:
                src_subword_type: sentencepiece
                src_subword_model: data/en.wiki.bpe.vs25000.model
                tgt_subword_type: sentencepiece
                tgt_subword_model: processed_data/spm_tgt-train.model


# Number of candidates for SentencePiece sampling
subword_nbest: 20
# Smoothing parameter for SentencePiece sampling
subword_alpha: 0.1
  

training:
    # Model configuration
    model_path: models
    keep_checkpoint: 50
    save_checkpoint_steps: 1000
    train_steps: 70000
    valid_steps: 500

    # bucket_size: 
    bucket_size: 256
    num_workers: 4
    prefetch_factor: 2
    world_size: 1
    gpu_ranks: [0]
    batch_type: "tokens"
    batch_size: 1024
    valid_batch_size: 1024
    batch_size_multiple: 8
    accum_count: [10]
    accum_steps: [0]
    dropout_steps: [0]
    dropout: [0.2]
    attention_dropout: [0.2]
    compute_dtype: "fp16"
    optim: "adam"
    learning_rate: 0.02
    average_decay: 0.0001
    warmup_steps: 4000
    decay_method: "noam"
    adam_beta2: 0.998
    max_grad_norm: 0
    label_smoothing: 0.1
    param_init: 0
    param_init_glorot: true
    normalization: "tokens"
    #early_stopping: 3

tensorboard: true
tensorboard_log_dir: logs
   
log_file: logs/eole.log
   
# Pretrained embeddings configuration for the source language
embeddings_type: word2vec
src_embeddings: data/en.wiki.bpe.vs25000.d300.w2v-256.txt
#tgt_embeddings:
save_data: processed_data/
#position_encoding_type: SinusoidalInterleaved

model:
    architecture: "transformer"
    hidden_size: 256
    share_decoder_embeddings: true
    share_embeddings: false
    layers: 6
    heads: 8
    transformer_ff: 256
    word_vec_size: 256
    position_encoding: true

and here is the tentative inference yaml file that I wrote:

valid_metrics: ["BLEU"]
data:
    corpus_1:
        transforms: [normalize, onmt_tokenize]
        transforms_configs:
            normalize:
                src_lang: en
                tgt_lang: ty
                norm_quote_commas: True
                norm_numbers: True
            onmt_tokenize:
                src_subword_type: sentencepiece
                src_subword_model: data/en.wiki.bpe.vs25000.model
                tgt_subword_type: sentencepiece
                tgt_subword_model: processed_data/spm_tgt-train.model

report_time: true

verbose: true
n_best: 3
top_p: 0.9
beam_size: 5

world_size: 1
gpu: 0

I try to keep it simple and add more features once I no longer get errors or once I see some improvement.
I previously had the learning_rate set too high (2) and I reduced it.
But the BLEU score is still zero, which either means that BLEU is not implemented properly, or that the model is just not learning anything.
Here is a sample of my Bash output after a few steps:

[2024-10-17 07:47:44,639 INFO] Weighted corpora loaded so far:
                        * corpus_1: 6
[2024-10-17 07:47:44,706 WARNING] The batch will be filled until we reach 8, its size may exceed 1024 tokens
[2024-10-17 07:47:48,370 WARNING] The batch will be filled until we reach 8, its size may exceed 1024 tokens
[2024-10-17 07:47:48,742 WARNING] The batch will be filled until we reach 8, its size may exceed 1024 tokens
[2024-10-17 07:48:20,377 INFO] Step 1200/70000; acc: 91.0; ppl: 1206.85; xent: 7.10; aux: 0.000; lr: 5.93e-06; sents:   41089; bsz:  302/ 459/41; 7314/11109 tok/s;    546 sec;
[2024-10-17 07:49:01,758 INFO] Step 1300/70000; acc: 91.4; ppl: 871.92; xent: 6.77; aux: 0.000; lr: 6.43e-06; sents:   45200; bsz:  346/ 528/45; 8357/12760 tok/s;    588 sec;
[2024-10-17 07:49:05,760 INFO] * Transform statistics for corpus_1(25.00%):
                        * SubwordStats: 435117 -> 47104 tokens

[2024-10-17 07:49:05,761 INFO] Weighted corpora loaded so far:
                        * corpus_1: 7
[2024-10-17 07:49:09,454 WARNING] The batch will be filled until we reach 8, its size may exceed 1024 tokens
[2024-10-17 07:49:10,199 INFO] * Transform statistics for corpus_1(25.00%):
                        * SubwordStats: 435981 -> 47104 tokens

[2024-10-17 07:49:10,199 INFO] Weighted corpora loaded so far:
                        * corpus_1: 7
[2024-10-17 07:49:11,442 WARNING] The batch will be filled until we reach 8, its size may exceed 1024 tokens
[2024-10-17 07:49:11,869 WARNING] The batch will be filled until we reach 8, its size may exceed 1024 tokens
[2024-10-17 07:49:13,010 INFO] * Transform statistics for corpus_1(25.00%):
                        * SubwordStats: 437577 -> 47104 tokens

[2024-10-17 07:49:13,011 INFO] Weighted corpora loaded so far:
                        * corpus_1: 7
[2024-10-17 07:49:13,879 WARNING] The batch will be filled until we reach 8, its size may exceed 1024 tokens
[2024-10-17 07:49:14,640 WARNING] The batch will be filled until we reach 8, its size may exceed 1024 tokens
[2024-10-17 07:49:15,016 INFO] * Transform statistics for corpus_1(25.00%):
                        * SubwordStats: 438678 -> 47104 tokens

[2024-10-17 07:49:15,016 INFO] Weighted corpora loaded so far:
                        * corpus_1: 7
[2024-10-17 07:49:16,727 WARNING] The batch will be filled until we reach 8, its size may exceed 1024 tokens
[2024-10-17 07:49:18,744 WARNING] The batch will be filled until we reach 8, its size may exceed 1024 tokens
[2024-10-17 07:49:18,776 WARNING] The batch will be filled until we reach 8, its size may exceed 1024 tokens
[2024-10-17 07:49:42,500 INFO] Step 1400/70000; acc: 91.6; ppl: 654.55; xent: 6.48; aux: 0.000; lr: 6.92e-06; sents:   39086; bsz:  317/ 468/39; 7790/11489 tok/s;    629 sec;
[2024-10-17 07:50:23,750 INFO] Step 1500/70000; acc: 90.7; ppl: 461.24; xent: 6.13; aux: 0.000; lr: 7.42e-06; sents:   47468; bsz:  325/ 512/47; 7874/12412 tok/s;    670 sec;
The batch will be filled until we reach 8, its size may exceed 1024 tokens
[2024-10-17 07:50:34,027 INFO] valid stats calculation
                           took: 10.274404764175415 s.
[2024-10-17 07:50:43,192 INFO] The translation of the valid dataset for dynamic scoring
                               took : 9.163396120071411 s.
[2024-10-17 07:50:43,192 INFO] UPDATING VALIDATION BLEU
[2024-10-17 07:50:43,432 INFO] validation BLEU: 0.0
[2024-10-17 07:50:43,434 INFO] Train perplexity: 3321.04
[2024-10-17 07:50:43,434 INFO] Train accuracy: 81.0061
[2024-10-17 07:50:43,434 INFO] Sentences processed: 627118
[2024-10-17 07:50:43,434 INFO] Average bsz:  327/ 500/42
[2024-10-17 07:50:43,435 INFO] Validation perplexity: 516.786
[2024-10-17 07:50:43,435 INFO] Validation accuracy: 96.0977
[2024-10-17 07:50:56,537 INFO] * Transform statistics for corpus_1(25.00%):
                        * SubwordStats: 444374 -> 47616 tokens

[2024-10-17 07:50:56,537 INFO] Weighted corpora loaded so far:
                        * corpus_1: 8
[2024-10-17 07:50:59,758 WARNING] The batch will be filled until we reach 8, its size may exceed 1024 tokens
[2024-10-17 07:50:59,758 WARNING] The batch will be filled until we reach 8, its size may exceed 1024 tokens
[2024-10-17 07:51:00,906 INFO] * Transform statistics for corpus_1(25.00%):
                        * SubwordStats: 445883 -> 47616 tokens

[2024-10-17 07:51:00,907 INFO] Weighted corpora loaded so far:
                        * corpus_1: 8
[2024-10-17 07:51:00,967 WARNING] The batch will be filled until we reach 8, its size may exceed 1024 tokens
[2024-10-17 07:51:01,775 WARNING] The batch will be filled until we reach 8, its size may exceed 1024 tokens
[2024-10-17 07:51:01,775 WARNING] The batch will be filled until we reach 8, its size may exceed 1024 tokens
[2024-10-17 07:51:03,729 INFO] * Transform statistics for corpus_1(25.00%):
                        * SubwordStats: 446561 -> 47616 tokens

[2024-10-17 07:51:03,729 INFO] Weighted corpora loaded so far:
                        * corpus_1: 8
[2024-10-17 07:51:04,644 WARNING] The batch will be filled until we reach 8, its size may exceed 1024 tokens
[2024-10-17 07:51:05,776 INFO] * Transform statistics for corpus_1(25.00%):
                        * SubwordStats: 448752 -> 47616 tokens

[2024-10-17 07:51:05,776 INFO] Weighted corpora loaded so far:
                        * corpus_1: 8
[2024-10-17 07:51:07,043 WARNING] The batch will be filled until we reach 8, its size may exceed 1024 tokens
[2024-10-17 07:51:07,901 WARNING] The batch will be filled until we reach 8, its size may exceed 1024 tokens
[2024-10-17 07:51:09,557 WARNING] The batch will be filled until we reach 8, its size may exceed 1024 tokens
[2024-10-17 07:51:09,954 WARNING] The batch will be filled until we reach 8, its size may exceed 1024 tokens
[2024-10-17 07:51:24,073 INFO] Step 1600/70000; acc: 93.0; ppl: 314.95; xent: 5.75; aux: 0.000; lr: 7.91e-06; sents:   35434; bsz:  342/ 507/35; 5671/8404 tok/s;    730 sec;
[2024-10-17 07:52:05,721 INFO] Step 1700/70000; acc: 89.2; ppl: 225.80; xent: 5.42; aux: 0.000; lr: 8.40e-06; sents:   51089; bsz:  300/ 474/51; 7193/11389 tok/s;    772 sec;
[2024-10-17 07:52:26,790 INFO] * Transform statistics for corpus_1(25.00%):
                        * SubwordStats: 443996 -> 47616 tokens

[2024-10-17 07:52:26,791 INFO] Weighted corpora loaded so far:
                        * corpus_1: 9
[2024-10-17 07:52:28,789 WARNING] The batch will be filled until we reach 8, its size may exceed 1024 tokens
[2024-10-17 07:52:30,434 WARNING] The batch will be filled until we reach 8, its size may exceed 1024 tokens
[2024-10-17 07:52:30,435 WARNING] The batch will be filled until we reach 8, its size may exceed 1024 tokens
[2024-10-17 07:52:31,991 INFO] * Transform statistics for corpus_1(25.00%):
                        * SubwordStats: 445881 -> 47616 tokens

[2024-10-17 07:52:31,991 INFO] Weighted corpora loaded so far:
                        * corpus_1: 9
[2024-10-17 07:52:32,046 WARNING] The batch will be filled until we reach 8, its size may exceed 1024 tokens
[2024-10-17 07:52:32,477 WARNING] The batch will be filled until we reach 8, its size may exceed 1024 tokens
[2024-10-17 07:52:32,477 WARNING] The batch will be filled until we reach 8, its size may exceed 1024 tokens
[2024-10-17 07:52:34,063 WARNING] The batch will be filled until we reach 8, its size may exceed 1024 tokens
[2024-10-17 07:52:35,764 WARNING] The batch will be filled until we reach 8, its size may exceed 1024 tokens
[2024-10-17 07:52:36,873 INFO] * Transform statistics for corpus_1(25.00%):
                        * SubwordStats: 446722 -> 47616 tokens

[2024-10-17 07:52:36,874 INFO] Weighted corpora loaded so far:
                        * corpus_1: 9
[2024-10-17 07:52:36,934 WARNING] The batch will be filled until we reach 8, its size may exceed 1024 tokens
[2024-10-17 07:52:37,307 INFO] * Transform statistics for corpus_1(25.00%):
                        * SubwordStats: 448803 -> 47616 tokens

[2024-10-17 07:52:37,307 INFO] Weighted corpora loaded so far:
                        * corpus_1: 9
[2024-10-17 07:52:37,369 WARNING] The batch will be filled until we reach 8, its size may exceed 1024 tokens
[2024-10-17 07:52:41,509 WARNING] The batch will be filled until we reach 8, its size may exceed 1024 tokens
[2024-10-17 07:52:42,772 WARNING] The batch will be filled until we reach 8, its size may exceed 1024 tokens
[2024-10-17 07:52:45,931 INFO] Step 1800/70000; acc: 95.3; ppl: 132.08; xent: 4.88; aux: 0.000; lr: 8.90e-06; sents:   26907; bsz:  373/ 571/27; 9277/14206 tok/s;    812 sec;
[2024-10-17 07:53:27,419 INFO] Step 1900/70000; acc: 86.9; ppl: 103.78; xent: 4.64; aux: 0.000; lr: 9.39e-06; sents:   55431; bsz:  274/ 423/55; 6596/10195 tok/s;    853 sec;

@francoishernandez
Contributor

Your main issue is not your hyperparams. It's configuration management. (Not 100% sure about the validation BLEU issue, but let's validate standard inference first.)
We should update the recipes or create a new one to make this even clearer.

In your case, please use root level transform configuration. It will make things way easier for you.

Proper configurations

Training config

## IO
overwrite: True
seed: 1234
report_every: 100
valid_metrics: ["BLEU"]

### Vocab
src_vocab: processed_data/spm_src-train.onmt_vocab
tgt_vocab: processed_data/spm_tgt-train.onmt_vocab
#n_sample: -1

data:
    corpus_1:
        path_src: data/src-train.txt
        path_tgt: data/tgt-train.txt
    valid:
        path_src: data/src-val.txt
        path_tgt: data/tgt-val.txt
        transforms: [normalize, onmt_tokenize]

transforms: [normalize, onmt_tokenize, filtertoolong]
transforms_configs:
            normalize:
                src_lang: en
                tgt_lang: ty
                norm_quote_commas: true
                norm_numbers: true
            onmt_tokenize:
                src_subword_type: sentencepiece
                src_subword_model: data/en.wiki.bpe.vs25000.model
                tgt_subword_type: sentencepiece
                tgt_subword_model: processed_data/spm_tgt-train.model
            filtertoolong:
                src_seq_length: 512
                tgt_seq_length: 512


# Number of candidates for SentencePiece sampling
subword_nbest: 20
# Smoothing parameter for SentencePiece sampling
subword_alpha: 0.1
  

training:
    # Model configuration
    model_path: models
    keep_checkpoint: 50
    save_checkpoint_steps: 1000
    train_steps: 70000
    valid_steps: 500

    # bucket_size: 
    bucket_size: 256
    num_workers: 4
    prefetch_factor: 2
    world_size: 1
    gpu_ranks: [0]
    batch_type: "tokens"
    batch_size: 1024
    valid_batch_size: 1024
    batch_size_multiple: 8
    accum_count: [10]
    accum_steps: [0]
    dropout_steps: [0]
    dropout: [0.2]
    attention_dropout: [0.2]
    compute_dtype: "fp16"
    optim: "adam"
    learning_rate: 0.02
    average_decay: 0.0001
    warmup_steps: 4000
    decay_method: "noam"
    adam_beta2: 0.998
    max_grad_norm: 0
    label_smoothing: 0.1
    param_init: 0
    param_init_glorot: true
    normalization: "tokens"
    #early_stopping: 3

tensorboard: true
tensorboard_log_dir: logs
   
log_file: logs/eole.log
   
# Pretrained embeddings configuration for the source language
embeddings_type: word2vec
src_embeddings: data/en.wiki.bpe.vs25000.d300.w2v-256.txt
#tgt_embeddings:
save_data: processed_data/
#position_encoding_type: SinusoidalInterleaved

model:
    architecture: "transformer"
    hidden_size: 256
    share_decoder_embeddings: true
    share_embeddings: false
    layers: 6
    heads: 8
    transformer_ff: 256
    word_vec_size: 256
    position_encoding: true

Inference config

(If the model was trained with the above config, the transform configuration is actually not necessary here, since it should be in the model config.json.)

valid_metrics: ["BLEU"]
data:
    corpus_1:

transforms: [normalize, onmt_tokenize]
transforms_configs:
    normalize:
        src_lang: en
        tgt_lang: ty
        norm_quote_commas: True
        norm_numbers: True
    onmt_tokenize:
        src_subword_type: sentencepiece
        src_subword_model: data/en.wiki.bpe.vs25000.model
        tgt_subword_type: sentencepiece
        tgt_subword_model: processed_data/spm_tgt-train.model

report_time: true

verbose: true
n_best: 3
top_p: 0.9
beam_size: 5

world_size: 1
gpu: 0

The "data" key in your inference file is not used, we're not using the datasets at inference, only the "src" file. Not sure why it doesn't raise a warning by the way.

@HURIMOZ
Author

HURIMOZ commented Oct 18, 2024

Thank you François, but changing to root-level transform configuration hasn't changed anything. I still get zero for the BLEU score, a rapid increase of the acc score to 93%, and the updated inference file still won't save any translations.
See updated config file:

## IO
overwrite: True
seed: 1234
report_every: 100
valid_metrics: ["BLEU"]

### Vocab
src_vocab: processed_data/spm_src-train.onmt_vocab
tgt_vocab: processed_data/spm_tgt-train.onmt_vocab
#n_sample: -1

data:
    corpus_1:
        path_src: data/src-train.txt
        path_tgt: data/tgt-train.txt
    valid:
        path_src: data/src-val.txt
        path_tgt: data/tgt-val.txt
        transforms: [normalize, onmt_tokenize]

transforms: [normalize, onmt_tokenize, filtertoolong]
transforms_configs:
            normalize:
                src_lang: en
                tgt_lang: ty
                norm_quote_commas: true
                norm_numbers: true
            onmt_tokenize:
                src_subword_type: sentencepiece
                src_subword_model: data/en.wiki.bpe.vs25000.model
                tgt_subword_type: sentencepiece
                tgt_subword_model: processed_data/spm_tgt-train.model
            filtertoolong:
                src_seq_length: 512
                tgt_seq_length: 512


# Number of candidates for SentencePiece sampling
subword_nbest: 20
# Smoothing parameter for SentencePiece sampling
subword_alpha: 0.1
  

training:
    # Model configuration
    model_path: models
    keep_checkpoint: 50
    save_checkpoint_steps: 1000
    train_steps: 70000
    valid_steps: 500

    # bucket_size: 
    bucket_size: 256
    num_workers: 4
    prefetch_factor: 2
    world_size: 1
    gpu_ranks: [0]
    batch_type: "tokens"
    batch_size: 1024
    valid_batch_size: 1024
    batch_size_multiple: 8
    accum_count: [10]
    accum_steps: [0]
    dropout_steps: [0]
    dropout: [0.2]
    attention_dropout: [0.2]
    compute_dtype: "fp16"
    optim: "adam"
    learning_rate: 0.02
    average_decay: 0.0001
    warmup_steps: 4000
    decay_method: "noam"
    adam_beta2: 0.998
    max_grad_norm: 0
    label_smoothing: 0.1
    param_init: 0
    param_init_glorot: true
    normalization: "tokens"
    #early_stopping: 3

tensorboard: true
tensorboard_log_dir: logs
   
log_file: logs/eole.log
   
# Pretrained embeddings configuration for the source language
embeddings_type: word2vec
src_embeddings: data/en.wiki.bpe.vs25000.d300.w2v-256.txt
#tgt_embeddings:
save_data: processed_data/
#position_encoding_type: SinusoidalInterleaved

model:
    architecture: "transformer"
    hidden_size: 256
    share_decoder_embeddings: true
    share_embeddings: false
    layers: 6
    heads: 8
    transformer_ff: 256
    word_vec_size: 256
    position_encoding: true

and new updated inference file:

valid_metrics: ["BLEU"]
report_time: true
verbose: true
n_best: 3
top_p: 0.9
beam_size: 5

world_size: 1
gpu: 0

and the command I use to translate: eole predict -c inference.yaml --model_path models/step_20000 --src data/src-test.txt --output translations/tgt-test.txt --verbose

@francoishernandez
Contributor

There are still several issues stacking up.

1. Update your code and config

You are most probably using an older version of the code; this training config should give you some errors with the latest versions.

Here are some of the issues:

  • subword_nbest/subword_alpha are settings of the onmt_tokenize transform, and as such should be set in the transforms_configs.onmt_tokenize field (though I suspect you don't really need these settings in your setup);
  • the position_encoding flag has been deprecated and replaced by the more generic position_encoding_type;
  • not specific to the eole version, but your "normalize" transform cannot work with the latest config, as src_lang/tgt_lang are not configured in your datasets;
  • compute_dtype: fp16 // optim: adam does not work by default (since review flash/sdpa arg #25); your best bet is compute_dtype: fp16 // optim: fusedadam, or specify self_attn_backend: pytorch if you encounter issues like RuntimeError: FlashAttention only support fp16 and bf16 data type (see the sketch after this list);
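
To make the first, second and fourth points concrete, here is a rough sketch of where those flags could live on a current version (a partial config to merge into your training yaml; treat the exact placement as an assumption to double-check against the docs):

transforms_configs:
    onmt_tokenize:
        src_subword_type: sentencepiece
        src_subword_model: data/en.wiki.bpe.vs25000.model
        tgt_subword_type: sentencepiece
        tgt_subword_model: processed_data/spm_tgt-train.model
        subword_nbest: 20
        subword_alpha: 0.1

training:
    compute_dtype: "fp16"
    optim: "fusedadam"    # or keep adam and set the pytorch self-attention backend mentioned above

model:
    embeddings:
        word_vec_size: 256
        position_encoding_type: "SinusoidalInterleaved"    # replaces the deprecated position_encoding flag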

2. Check your data/tokenization/transforms setup

The very quickly reached, very high accuracy + only <unk> tokens at inference leads me to believe that the data your model learns on is broken. (Basically it sees only/mostly <unk> tokens, so it "thinks" it learns properly, but in fact it just learns to output <unk> tokens...)

I encourage you to start from a "known-to-work" setup, and build from there. E.g. the WMT17 recipe.

I quickly did this to help you in your journey to a working setup:

  1. run the "standard" wmt17 recipe ✅

Logs should look something like this:

[2024-10-18 11:51:27,909 INFO] Step 100/50000; acc: 4.5; ppl: 15613.15; xent: 9.66; aux: 0.000; lr: 2.50e-05; sents:  109220; bsz: 2947/3201/109; 23280/25285 tok/s;    127 sec;
[2024-10-18 11:53:31,233 INFO] Step 200/50000; acc: 8.0; ppl: 3769.57; xent: 8.23; aux: 0.000; lr: 4.97e-05; sents:  122727; bsz: 3434/3643/123; 27842/29537 tok/s;    250 sec;
[2024-10-18 11:55:35,724 INFO] Step 300/50000; acc: 11.0; ppl: 2442.23; xent: 7.80; aux: 0.000; lr: 7.44e-05; sents:  134048; bsz: 3532/3833/134; 28369/30787 tok/s;    374 sec;
[2024-10-18 11:57:41,574 INFO] Step 400/50000; acc: 12.2; ppl: 1682.11; xent: 7.43; aux: 0.000; lr: 9.91e-05; sents:  127560; bsz: 3700/3976/128; 29402/31596 tok/s;    500 sec;
[2024-10-18 11:59:46,516 INFO] Step 500/50000; acc: 14.4; ppl: 1074.72; xent: 6.98; aux: 0.000; lr: 1.24e-04; sents:  141992; bsz: 3607/3938/142; 28871/31520 tok/s;    625 sec;
[2024-10-18 12:01:52,402 INFO] Step 600/50000; acc: 16.9; ppl: 739.39; xent: 6.61; aux: 0.000; lr: 1.48e-04; sents:  137296; bsz: 3745/3993/137; 29746/31722 tok/s;    751 sec;
[2024-10-18 12:03:58,852 INFO] Step 700/50000; acc: 19.7; ppl: 529.11; xent: 6.27; aux: 0.000; lr: 1.73e-04; sents:  132248; bsz: 3782/4109/132; 29907/32494 tok/s;    878 sec;
[2024-10-18 12:06:05,737 INFO] Step 800/50000; acc: 23.2; ppl: 371.90; xent: 5.92; aux: 0.000; lr: 1.98e-04; sents:  135074; bsz: 3829/4055/135; 30175/31961 tok/s;   1004 sec;
[2024-10-18 12:08:12,305 INFO] Step 900/50000; acc: 28.0; ppl: 254.99; xent: 5.54; aux: 0.000; lr: 2.23e-04; sents:  147519; bsz: 3769/4086/148; 29781/32282 tok/s;   1131 sec;
[2024-10-18 12:10:18,298 INFO] Step 1000/50000; acc: 32.3; ppl: 187.51; xent: 5.23; aux: 0.000; lr: 2.47e-04; sents:  139666; bsz: 3735/4114/140; 29641/32651 tok/s;   1257 sec;
  2. adapt it to use the bpe transform (easiest, since the vocab was prepared with subword_nmt) ✅
## IO
save_data: wmt17_en_de/data
overwrite: True
seed: 1234
report_every: 100
valid_metrics: ["BLEU"]

### Vocab
src_vocab: wmt17_en_de/vocab.shared
tgt_vocab: wmt17_en_de/vocab.shared
src_vocab_size: 36000
tgt_vocab_size: 36000
vocab_size_multiple: 8
src_words_min_frequency: 2
tgt_words_min_frequency: 2
share_vocab: True
n_sample: 0

data:
    corpus_1:
        path_src: wmt17_en_de/train.src.bpe.shuf
        path_tgt: wmt17_en_de/train.trg.bpe.shuf
    valid:
        path_src: wmt17_en_de/dev.src.bpe
        path_tgt: wmt17_en_de/dev.trg.bpe


transforms: [bpe]
transforms_configs:
    bpe:
        src_subword_model: wmt17_en_de/codes
        tgt_subword_model: wmt17_en_de/codes


training:
    # Model configuration
    model_path: test_model_wmt17
    keep_checkpoint: 50
    save_checkpoint_steps: 1000
    average_decay: 0
    train_steps: 50000
    valid_steps: 5000

    # bucket_size: 
    bucket_size: 262144
    bucket_size_init: 10000
    bucket_size_increment: 25000
    num_workers: 4
    prefetch_factor: 400
    world_size: 1
    gpu_ranks: [0]
    batch_type: "tokens"
    batch_size: 5000
    valid_batch_size: 4096
    batch_size_multiple: 8
    accum_count: [10]
    accum_steps: [0]
    dropout_steps: [0]
    dropout: [0.1]
    attention_dropout: [0.1]
    compute_dtype: "fp16"
    #apex_opt_level: "O2"
    optim: "fusedadam"
    learning_rate: 2
    warmup_steps: 4000
    decay_method: "noam"
    adam_beta2: 0.998
    max_grad_norm: 0
    label_smoothing: 0.1
    param_init: 0
    param_init_glorot: true
    normalization: "tokens"

model:
    architecture: "transformer"
    hidden_size: 1024
    share_decoder_embeddings: true
    share_embeddings: true
    layers: 6
    heads: 16
    transformer_ff: 4096
    embeddings:
        word_vec_size: 1024
        position_encoding_type: "SinusoidalInterleaved"
[2024-10-18 12:40:45,289 INFO] Step 100/50000; acc: 9.9; ppl: 8126.02; xent: 9.00; aux: 0.000; lr: 2.50e-05; sents:   73488; bsz: 2358/2877/73; 18778/22914 tok/s;    126 sec;
[2024-10-18 12:42:39,416 INFO] Step 200/50000; acc: 26.0; ppl: 1120.32; xent: 7.02; aux: 0.000; lr: 4.97e-05; sents:   84783; bsz: 2829/3378/85; 24787/29596 tok/s;    240 sec;
[2024-10-18 12:44:36,162 INFO] Step 300/50000; acc: 32.3; ppl: 462.67; xent: 6.14; aux: 0.000; lr: 7.44e-05; sents:   91320; bsz: 2954/3586/91; 25305/30720 tok/s;    356 sec;
[2024-10-18 12:46:34,771 INFO] Step 400/50000; acc: 33.8; ppl: 348.25; xent: 5.85; aux: 0.000; lr: 9.91e-05; sents:   93558; bsz: 3133/3741/94; 26417/31539 tok/s;    475 sec;
[2024-10-18 12:48:33,333 INFO] Step 500/50000; acc: 35.2; ppl: 279.87; xent: 5.63; aux: 0.000; lr: 1.24e-04; sents:  100701; bsz: 3134/3778/101; 26435/31863 tok/s;    594 sec;
[2024-10-18 12:50:33,200 INFO] Step 600/50000; acc: 36.7; ppl: 229.05; xent: 5.43; aux: 0.000; lr: 1.48e-04; sents:   96792; bsz: 3223/3893/97; 26886/32474 tok/s;    713 sec;
[2024-10-18 12:52:33,083 INFO] Step 700/50000; acc: 38.4; ppl: 185.31; xent: 5.22; aux: 0.000; lr: 1.73e-04; sents:  101356; bsz: 3204/3848/101; 26722/32094 tok/s;    833 sec;
[2024-10-18 12:54:32,208 INFO] Step 800/50000; acc: 40.3; ppl: 149.51; xent: 5.01; aux: 0.000; lr: 1.98e-04; sents:   95808; bsz: 3181/3820/96; 26706/32068 tok/s;    952 sec;
[2024-10-18 12:56:32,910 INFO] Step 900/50000; acc: 42.7; ppl: 118.99; xent: 4.78; aux: 0.000; lr: 2.23e-04; sents:  101061; bsz: 3305/3918/101; 27383/32456 tok/s;   1073 sec;
[2024-10-18 12:58:32,727 INFO] Step 1000/50000; acc: 45.1; ppl: 96.28; xent: 4.57; aux: 0.000; lr: 2.47e-04; sents:  104648; bsz: 3226/3934/105; 26921/32838 tok/s;   1193 sec;
  3. try and adapt closer to your setup ✅

(note: your learning rate is too low; adam+noam requires a higher value to get started)
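
(For context: with decay_method: noam the effective rate roughly follows the usual Transformer schedule,

lr(step) = learning_rate * hidden_size^(-0.5) * min(step^(-0.5), step * warmup_steps^(-1.5))

so the configured learning_rate is only a multiplier. With learning_rate: 2, hidden_size: 1024 and warmup_steps: 4000 this gives about 2.5e-05 at step 100, matching the logs above; with learning_rate: 0.02 the effective rate stays about a hundred times smaller throughout.)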

## IO
save_data: wmt17_en_de/data
overwrite: True
seed: 1234
report_every: 100
valid_metrics: ["BLEU"]

### Vocab
src_vocab: wmt17_en_de/vocab.shared
tgt_vocab: wmt17_en_de/vocab.shared
src_vocab_size: 36000
tgt_vocab_size: 36000
vocab_size_multiple: 8
src_words_min_frequency: 2
tgt_words_min_frequency: 2
share_vocab: True
n_sample: 0

data:
    corpus_1:
        path_src: wmt17_en_de/train.src.bpe.shuf
        path_tgt: wmt17_en_de/train.trg.bpe.shuf
    valid:
        path_src: wmt17_en_de/dev.src.bpe
        path_tgt: wmt17_en_de/dev.trg.bpe


transforms: [bpe]
transforms_configs:
    bpe:
        src_subword_model: wmt17_en_de/codes
        tgt_subword_model: wmt17_en_de/codes


training:
    # Model configuration
    model_path: test_model_wmt17
    keep_checkpoint: 50
    save_checkpoint_steps: 1000
    train_steps: 50000
    valid_steps: 5000

    # bucket_size: 
    bucket_size: 262144
    bucket_size_init: 10000
    bucket_size_increment: 25000
    num_workers: 4
    prefetch_factor: 400
    world_size: 1
    gpu_ranks: [0]
    batch_type: "tokens"
    batch_size: 5000
    valid_batch_size: 4096
    batch_size_multiple: 8
    accum_count: [10]
    accum_steps: [0]
    dropout_steps: [0]
    dropout: [0.1]
    attention_dropout: [0.1]
    compute_dtype: "fp16"
    #apex_opt_level: "O2"
    optim: "fusedadam"
    learning_rate: 2
    average_decay: 0.0001
    warmup_steps: 4000
    decay_method: "noam"
    adam_beta2: 0.998
    max_grad_norm: 0
    label_smoothing: 0.1
    param_init: 0
    param_init_glorot: true
    normalization: "tokens"

model:
    architecture: "transformer"
    hidden_size: 256
    share_decoder_embeddings: true
    share_embeddings: true
    layers: 6
    heads: 8
    transformer_ff: 256
    embeddings:
        word_vec_size: 256
        position_encoding_type: "SinusoidalInterleaved"
[2024-10-18 13:32:07,666 INFO] Step 100/50000; acc: 10.7; ppl: 22529.39; xent: 10.02; aux: 0.000; lr: 4.99e-05; sents:   73488; bsz: 2358/2877/73; 39454/48143 tok/s;     60 sec;
[2024-10-18 13:32:46,435 INFO] Step 200/50000; acc: 12.7; ppl: 6108.03; xent: 8.72; aux: 0.000; lr: 9.93e-05; sents:   84783; bsz: 2829/3378/85; 72967/87123 tok/s;     99 sec;
[2024-10-18 13:33:25,348 INFO] Step 300/50000; acc: 18.6; ppl: 1397.21; xent: 7.24; aux: 0.000; lr: 1.49e-04; sents:   91320; bsz: 2954/3586/91; 75920/92167 tok/s;    137 sec;
[2024-10-18 13:34:06,313 INFO] Step 400/50000; acc: 25.7; ppl: 727.75; xent: 6.59; aux: 0.000; lr: 1.98e-04; sents:   93558; bsz: 3133/3741/94; 76488/91317 tok/s;    178 sec;
[2024-10-18 13:34:45,250 INFO] Step 500/50000; acc: 30.7; ppl: 483.03; xent: 6.18; aux: 0.000; lr: 2.48e-04; sents:  100701; bsz: 3134/3778/101; 80493/97020 tok/s;    217 sec;
[2024-10-18 13:35:24,351 INFO] Step 600/50000; acc: 33.3; ppl: 346.50; xent: 5.85; aux: 0.000; lr: 2.97e-04; sents:   96792; bsz: 3223/3893/97; 82422/99553 tok/s;    256 sec;
[2024-10-18 13:36:04,523 INFO] Step 700/50000; acc: 34.5; ppl: 287.57; xent: 5.66; aux: 0.000; lr: 3.46e-04; sents:  101356; bsz: 3204/3848/101; 79744/95775 tok/s;    297 sec;
[2024-10-18 13:36:44,573 INFO] Step 800/50000; acc: 35.7; ppl: 249.22; xent: 5.52; aux: 0.000; lr: 3.96e-04; sents:   95808; bsz: 3181/3820/96; 79435/95386 tok/s;    337 sec;
[2024-10-18 13:37:24,015 INFO] Step 900/50000; acc: 36.6; ppl: 216.36; xent: 5.38; aux: 0.000; lr: 4.45e-04; sents:  101061; bsz: 3305/3918/101; 83799/99323 tok/s;    376 sec;
[2024-10-18 13:38:03,241 INFO] Step 1000/50000; acc: 37.7; ppl: 187.97; xent: 5.24; aux: 0.000; lr: 4.95e-04; sents:  104648; bsz: 3226/3934/105; 82231/100302 tok/s;    415 sec;

Once you have such a setup working in your context (data, machine), you can start adding features.
Changing too many things at once leads to intractable, impossible-to-resolve situations, especially in toolkits with this many moving parts.

@francoishernandez
Contributor

Note that if you pull the very latest commit from main, you will also have to change param_init_glorot: true to param_init_method: "xavier_uniform", following the merge of #32.
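
In the training section of your yaml, the change would look something like this:

training:
    # before (older versions)
    # param_init_glorot: true
    # after the merge of #32
    param_init_method: "xavier_uniform"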

@HURIMOZ
Author

HURIMOZ commented Oct 19, 2024

Hi François, thank you so much for trying to help me out.
I don't like the wmt17 recipe because it doesn't do on-the-fly tokenization, and the recipe looks a bit old and outdated.
Instead, I disabled the vast majority of hyperparameters and came up with the bare bones, the strict minimum to train a model. I dropped the pre-trained embeddings, as well as the vocab and model that served to create the embeddings, and instead used the vocab and model that I created with SentencePiece from my dataset (both src and tgt). I changed the whole validation set too. I also changed the transforms from dataset level to general level as advised. I updated Eole NLP to the latest (main) and changed param_init_glorot: true to param_init_method: "xavier_uniform".
See new config file:

## IO
overwrite: True
seed: 1234
report_every: 100
valid_metrics: ["BLEU"]

### Vocab
src_vocab: processed_data/src_sentencepiece_bpe.onmt_vocab
tgt_vocab: processed_data/spm_tgt-train.onmt_vocab
#n_sample: -1

data:
    corpus_1:
        path_src: data/src-train.txt
        path_tgt: data/tgt-train.txt
    valid:
        path_src: data/EN-val.txt
        path_tgt: data/TY-val.txt
        transforms: [onmt_tokenize]

transforms: [onmt_tokenize]
transforms_configs:
            #normalize:
                #src_lang: en
                #tgt_lang: ty
                #norm_quote_commas: true
                #norm_numbers: true
            onmt_tokenize:
                src_subword_type: sentencepiece
                src_subword_model: models/src_sentencepiece_bpe.model
                tgt_subword_type: sentencepiece
                tgt_subword_model: models/spm_tgt-train.model
            #filtertoolong:
                #src_seq_length: 512
                #tgt_seq_length: 512


# Number of candidates for SentencePiece sampling
#subword_nbest: 20
# Smoothing parameter for SentencePiece sampling
#subword_alpha: 0.1
  

training:
    # Model configuration
    model_path: models
    keep_checkpoint: 50
    save_checkpoint_steps: 1000
    train_steps: 70000
    valid_steps: 500

    #bucket_size: 1024
    #num_workers: 4
    #prefetch_factor: 2
    world_size: 1
    gpu_ranks: [0]
    batch_type: "tokens"
    #batch_size: 1024
    #valid_batch_size: 1024
    #batch_size_multiple: 8
    #accum_count: [10]
    #accum_steps: [0]
    #dropout_steps: [0]
    #dropout: [0.2]
    #attention_dropout: [0.2]
    #compute_dtype: fp16
    #optim: "adam"
    #learning_rate: 0.6
    #average_decay: 0.1
    #warmup_steps: 4000
    #decay_method: "noam"
    #adam_beta2: 0.998
    #max_grad_norm: 0
    #label_smoothing: 0.1
    #param_init: 0
    #param_init_method: "xavier_uniform"
    #normalization: "tokens"
    #early_stopping: 3

tensorboard: true
tensorboard_log_dir: logs
   
log_file: logs/eole.log
   
# Pretrained embeddings configuration for the source language
embeddings_type: word2vec
#src_embeddings: data/en.wiki.bpe.vs25000.d300.w2v-256.txt
#tgt_embeddings:
save_data: processed_data/
#position_encoding_type: Rotary

model:
    architecture: "transformer"
    hidden_size: 256
    share_decoder_embeddings: false
    share_embeddings: false
    layers: 6
    heads: 8
    transformer_ff: 256
    word_vec_size: 256
    position_encoding: true

Even with this minimalist setup, the system is not training the model properly. The validation accuracy now gets stuck at 97.0432 no matter how many steps I run.
See Bash below:

[2024-10-19 06:45:18,341 INFO] Step 5600/70000; acc: 81.1; ppl:  2.16; xent: 0.77; aux: 0.000; lr: 6.00e-01; sents:     451; bsz:   31/  44/ 5;  92/129 tok/s;    534 sec;
[2024-10-19 06:45:21,481 INFO] Step 5700/70000; acc: 83.2; ppl:  1.88; xent: 0.63; aux: 0.000; lr: 6.00e-01; sents:     346; bsz:   30/  46/ 3; 960/1474 tok/s;    538 sec;
[2024-10-19 06:45:24,607 INFO] Step 5800/70000; acc: 81.1; ppl:  3.09; xent: 1.13; aux: 0.000; lr: 6.00e-01; sents:     364; bsz:   33/  46/ 4; 1062/1464 tok/s;    541 sec;
[2024-10-19 06:45:27,753 INFO] Step 5900/70000; acc: 81.6; ppl:  2.00; xent: 0.69; aux: 0.000; lr: 6.00e-01; sents:     425; bsz:   31/  47/ 4; 995/1508 tok/s;    544 sec;
[2024-10-19 06:45:30,904 INFO] Step 6000/70000; acc: 84.4; ppl:  1.81; xent: 0.60; aux: 0.000; lr: 6.00e-01; sents:     370; bsz:   32/  47/ 4; 1027/1504 tok/s;    547 sec;
[2024-10-19 06:45:49,278 INFO] valid stats calculation
                           took: 18.372625827789307 s.
[2024-10-19 06:45:50,255 INFO] The translation of the valid dataset for dynamic scoring
                               took : 0.9760706424713135 s.
[2024-10-19 06:45:50,256 INFO] UPDATING VALIDATION BLEU
[2024-10-19 06:45:50,375 INFO] validation BLEU: 0.0
[2024-10-19 06:45:50,377 INFO] Train perplexity: 2.14528
[2024-10-19 06:45:50,377 INFO] Train accuracy: 80.9913
[2024-10-19 06:45:50,377 INFO] Sentences processed: 21961
[2024-10-19 06:45:50,377 INFO] Average bsz:   32/  46/ 4
[2024-10-19 06:45:50,377 INFO] Validation perplexity: 2.37872
[2024-10-19 06:45:50,377 INFO] Validation accuracy: 5.91368
[2024-10-19 06:45:50,380 INFO] Saving optimizer and weights to step_6000, and symlink to models
[2024-10-19 06:45:50,620 INFO] Saving transforms artifacts, if any, to models
[2024-10-19 06:45:50,621 INFO] Saving config and vocab to models
[2024-10-19 06:45:53,834 INFO] Step 6100/70000; acc: 87.4; ppl:  1.52; xent: 0.42; aux: 0.000; lr: 6.00e-01; sents:     244; bsz:   32/  48/ 2; 142/208 tok/s;    570 sec;
[2024-10-19 06:45:56,964 INFO] Step 6200/70000; acc: 82.4; ppl:  2.13; xent: 0.76; aux: 0.000; lr: 6.00e-01; sents:     444; bsz:   34/  48/ 4; 1078/1541 tok/s;    573 sec;
[2024-10-19 06:46:00,091 INFO] Step 6300/70000; acc: 79.2; ppl:  2.21; xent: 0.79; aux: 0.000; lr: 6.00e-01; sents:     436; bsz:   34/  46/ 4; 1081/1458 tok/s;    576 sec;
[2024-10-19 06:46:03,273 INFO] Step 6400/70000; acc: 78.9; ppl:  2.42; xent: 0.88; aux: 0.000; lr: 6.00e-01; sents:     398; bsz:   32/  48/ 4; 998/1494 tok/s;    579 sec;
[2024-10-19 06:46:06,481 INFO] Step 6500/70000; acc: 82.3; ppl:  2.29; xent: 0.83; aux: 0.000; lr: 6.00e-01; sents:     382; bsz:   31/  46/ 4; 974/1432 tok/s;    583 sec;
[2024-10-19 06:46:24,695 INFO] valid stats calculation
                           took: 18.212890148162842 s.
[2024-10-19 06:46:36,880 INFO] The translation of the valid dataset for dynamic scoring
                               took : 12.184123754501343 s.
[2024-10-19 06:46:36,881 INFO] UPDATING VALIDATION BLEU
[2024-10-19 06:46:37,195 INFO] validation BLEU: 0.0
[2024-10-19 06:46:37,196 INFO] Train perplexity: 2.1406
[2024-10-19 06:46:37,196 INFO] Train accuracy: 81.0757
[2024-10-19 06:46:37,196 INFO] Sentences processed: 23865
[2024-10-19 06:46:37,197 INFO] Average bsz:   32/  46/ 4
[2024-10-19 06:46:37,197 INFO] Validation perplexity: 1.14674
[2024-10-19 06:46:37,197 INFO] Validation accuracy: 97.0432
[2024-10-19 06:46:40,428 INFO] Step 6600/70000; acc: 84.0; ppl:  1.76; xent: 0.57; aux: 0.000; lr: 6.00e-01; sents:     415; bsz:   32/  48/ 4;  95/141 tok/s;    617 sec;
[2024-10-19 06:46:43,590 INFO] Step 6700/70000; acc: 80.2; ppl:  2.25; xent: 0.81; aux: 0.000; lr: 6.00e-01; sents:     399; bsz:   33/  44/ 4; 1045/1384 tok/s;    620 sec;
[2024-10-19 06:46:46,709 INFO] Step 6800/70000; acc: 79.9; ppl:  2.27; xent: 0.82; aux: 0.000; lr: 6.00e-01; sents:     349; bsz:   33/  47/ 3; 1050/1522 tok/s;    623 sec;
[2024-10-19 06:46:49,828 INFO] Step 6900/70000; acc: 78.5; ppl:  2.24; xent: 0.81; aux: 0.000; lr: 6.00e-01; sents:     520; bsz:   30/  48/ 5; 972/1525 tok/s;    626 sec;
[2024-10-19 06:46:52,930 INFO] Step 7000/70000; acc: 80.2; ppl:  2.26; xent: 0.82; aux: 0.000; lr: 6.00e-01; sents:     421; bsz:   33/  47/ 4; 1065/1505 tok/s;    629 sec;
[2024-10-19 06:47:11,215 INFO] valid stats calculation
                           took: 18.28428816795349 s.
[2024-10-19 06:47:23,661 INFO] The translation of the valid dataset for dynamic scoring
                               took : 12.443923473358154 s.
[2024-10-19 06:47:23,661 INFO] UPDATING VALIDATION BLEU
[2024-10-19 06:47:23,978 INFO] validation BLEU: 0.0
[2024-10-19 06:47:23,979 INFO] Train perplexity: 2.1409
[2024-10-19 06:47:23,979 INFO] Train accuracy: 81.0404
[2024-10-19 06:47:23,979 INFO] Sentences processed: 25969
[2024-10-19 06:47:23,979 INFO] Average bsz:   32/  47/ 4
[2024-10-19 06:47:23,979 INFO] Validation perplexity: 1.14934
[2024-10-19 06:47:23,979 INFO] Validation accuracy: 97.0432
[2024-10-19 06:47:23,982 INFO] Saving optimizer and weights to step_7000, and symlink to models
[2024-10-19 06:47:24,209 INFO] Saving transforms artifacts, if any, to models
[2024-10-19 06:47:24,209 INFO] Saving config and vocab to models
[2024-10-19 06:47:27,411 INFO] Step 7100/70000; acc: 83.1; ppl:  1.82; xent: 0.60; aux: 0.000; lr: 6.00e-01; sents:     455; bsz:   32/  48/ 5;  93/138 tok/s;    664 sec;
[2024-10-19 06:47:30,595 INFO] Step 7200/70000; acc: 84.3; ppl:  1.70; xent: 0.53; aux: 0.000; lr: 6.00e-01; sents:     357; bsz:   32/  48/ 4; 1020/1518 tok/s;    667 sec;
[2024-10-19 06:47:33,754 INFO] Step 7300/70000; acc: 86.1; ppl:  1.69; xent: 0.52; aux: 0.000; lr: 6.00e-01; sents:     330; bsz:   31/  48/ 3; 978/1507 tok/s;    670 sec;
[2024-10-19 06:47:36,950 INFO] Step 7400/70000; acc: 86.1; ppl:  1.71; xent: 0.54; aux: 0.000; lr: 6.00e-01; sents:     297; bsz:   30/  46/ 3; 953/1454 tok/s;    673 sec;
[2024-10-19 06:47:40,161 INFO] Step 7500/70000; acc: 87.0; ppl:  1.70; xent: 0.53; aux: 0.000; lr: 6.00e-01; sents:     302; bsz:   33/  49/ 3; 1038/1540 tok/s;    676 sec;
[2024-10-19 06:47:58,517 INFO] valid stats calculation
                           took: 18.35431981086731 s.
[2024-10-19 06:48:10,817 INFO] The translation of the valid dataset for dynamic scoring
                               took : 12.299047946929932 s.
[2024-10-19 06:48:10,818 INFO] UPDATING VALIDATION BLEU
[2024-10-19 06:48:11,133 INFO] validation BLEU: 0.0
[2024-10-19 06:48:11,134 INFO] Train perplexity: 2.10925
[2024-10-19 06:48:11,134 INFO] Train accuracy: 81.335
[2024-10-19 06:48:11,134 INFO] Sentences processed: 27710
[2024-10-19 06:48:11,134 INFO] Average bsz:   32/  47/ 4
[2024-10-19 06:48:11,135 INFO] Validation perplexity: 1.14434
[2024-10-19 06:48:11,135 INFO] Validation accuracy: 97.0432
[2024-10-19 06:48:14,293 INFO] Step 7600/70000; acc: 81.3; ppl:  1.98; xent: 0.68; aux: 0.000; lr: 6.00e-01; sents:     386; bsz:   34/  47/ 4;  99/138 tok/s;    710 sec;
[2024-10-19 06:48:17,441 INFO] Step 7700/70000; acc: 84.5; ppl:  1.92; xent: 0.65; aux: 0.000; lr: 6.00e-01; sents:     421; bsz:   32/  47/ 4; 1020/1481 tok/s;    714 sec;
[2024-10-19 06:48:20,575 INFO] Step 7800/70000; acc: 77.1; ppl:  2.11; xent: 0.75; aux: 0.000; lr: 6.00e-01; sents:     506; bsz:   34/  48/ 5; 1095/1530 tok/s;    717 sec;
[2024-10-19 06:48:23,708 INFO] Step 7900/70000; acc: 80.0; ppl:  1.94; xent: 0.66; aux: 0.000; lr: 6.00e-01; sents:     428; bsz:   32/  46/ 4; 1008/1472 tok/s;    720 sec;
[2024-10-19 06:48:26,885 INFO] Step 8000/70000; acc: 84.8; ppl:  1.78; xent: 0.58; aux: 0.000; lr: 6.00e-01; sents:     379; bsz:   31/  46/ 4; 962/1461 tok/s;    723 sec;
[2024-10-19 06:48:45,302 INFO] valid stats calculation
                           took: 18.415810585021973 s.
[2024-10-19 06:48:57,529 INFO] The translation of the valid dataset for dynamic scoring
                               took : 12.227064847946167 s.
[2024-10-19 06:48:57,529 INFO] UPDATING VALIDATION BLEU
[2024-10-19 06:48:57,846 INFO] validation BLEU: 0.0
[2024-10-19 06:48:57,847 INFO] Train perplexity: 2.09844
[2024-10-19 06:48:57,847 INFO] Train accuracy: 81.3469
[2024-10-19 06:48:57,847 INFO] Sentences processed: 29830
[2024-10-19 06:48:57,847 INFO] Average bsz:   32/  47/ 4
[2024-10-19 06:48:57,847 INFO] Validation perplexity: 1.16033
[2024-10-19 06:48:57,848 INFO] Validation accuracy: 97.0432
[2024-10-19 06:48:57,851 INFO] Saving optimizer and weights to step_8000, and symlink to models
[2024-10-19 06:48:58,074 INFO] Saving transforms artifacts, if any, to models
[2024-10-19 06:48:58,074 INFO] Saving config and vocab to models
[2024-10-19 06:49:01,285 INFO] Step 8100/70000; acc: 83.8; ppl:  1.71; xent: 0.54; aux: 0.000; lr: 6.00e-01; sents:     373; bsz:   31/  46/ 4;  89/134 tok/s;    757 sec;
[2024-10-19 06:49:04,436 INFO] Step 8200/70000; acc: 83.1; ppl:  1.72; xent: 0.54; aux: 0.000; lr: 6.00e-01; sents:     368; bsz:   31/  48/ 4; 996/1513 tok/s;    761 sec;
[2024-10-19 06:49:07,606 INFO] Step 8300/70000; acc: 84.9; ppl:  1.92; xent: 0.65; aux: 0.000; lr: 6.00e-01; sents:     357; bsz:   33/  46/ 4; 1031/1438 tok/s;    764 sec;
[2024-10-19 06:49:10,777 INFO] Step 8400/70000; acc: 84.3; ppl:  1.97; xent: 0.68; aux: 0.000; lr: 6.00e-01; sents:     357; bsz:   34/  46/ 4; 1063/1440 tok/s;    767 sec;
[2024-10-19 06:49:14,019 INFO] Step 8500/70000; acc: 81.7; ppl:  1.82; xent: 0.60; aux: 0.000; lr: 6.00e-01; sents:     369; bsz:   32/  47/ 4; 977/1446 tok/s;    770 sec;
[2024-10-19 06:49:32,501 INFO] valid stats calculation
                           took: 18.480857610702515 s.
[2024-10-19 06:49:44,828 INFO] The translation of the valid dataset for dynamic scoring
                               took : 12.325306415557861 s.
[2024-10-19 06:49:44,829 INFO] UPDATING VALIDATION BLEU
[2024-10-19 06:49:45,146 INFO] validation BLEU: 0.0
[2024-10-19 06:49:45,147 INFO] Train perplexity: 2.08127
[2024-10-19 06:49:45,147 INFO] Train accuracy: 81.4756
[2024-10-19 06:49:45,147 INFO] Sentences processed: 31654
[2024-10-19 06:49:45,147 INFO] Average bsz:   32/  47/ 4
[2024-10-19 06:49:45,148 INFO] Validation perplexity: 1.22323
[2024-10-19 06:49:45,148 INFO] Validation accuracy: 97.0432
[2024-10-19 06:49:48,353 INFO] Step 8600/70000; acc: 79.5; ppl:  2.20; xent: 0.79; aux: 0.000; lr: 6.00e-01; sents:     468; bsz:   31/  46/ 5;  90/134 tok/s;    804 sec;
[2024-10-19 06:49:51,486 INFO] Step 8700/70000; acc: 84.4; ppl:  1.98; xent: 0.68; aux: 0.000; lr: 6.00e-01; sents:     438; bsz:   32/  46/ 4; 1011/1481 tok/s;    808 sec;
[2024-10-19 06:49:54,624 INFO] Step 8800/70000; acc: 82.4; ppl:  1.75; xent: 0.56; aux: 0.000; lr: 6.00e-01; sents:     369; bsz:   30/  46/ 4; 954/1465 tok/s;    811 sec;
[2024-10-19 06:49:57,784 INFO] Step 8900/70000; acc: 86.0; ppl:  1.63; xent: 0.49; aux: 0.000; lr: 6.00e-01; sents:     315; bsz:   31/  46/ 3; 991/1443 tok/s;    814 sec;
[2024-10-19 06:50:00,970 INFO] Step 9000/70000; acc: 87.5; ppl:  1.58; xent: 0.46; aux: 0.000; lr: 6.00e-01; sents:     310; bsz:   31/  45/ 3; 969/1418 tok/s;    817 sec;
[2024-10-19 06:50:19,354 INFO] valid stats calculation
                           took: 18.382594347000122 s.
[2024-10-19 06:50:31,512 INFO] The translation of the valid dataset for dynamic scoring
                               took : 12.156486511230469 s.
[2024-10-19 06:50:31,512 INFO] UPDATING VALIDATION BLEU
[2024-10-19 06:50:31,828 INFO] validation BLEU: 0.0
[2024-10-19 06:50:31,830 INFO] Train perplexity: 2.06566
[2024-10-19 06:50:31,830 INFO] Train accuracy: 81.611
[2024-10-19 06:50:31,830 INFO] Sentences processed: 33554
[2024-10-19 06:50:31,830 INFO] Average bsz:   32/  47/ 4
[2024-10-19 06:50:31,830 INFO] Validation perplexity: 1.16283
[2024-10-19 06:50:31,830 INFO] Validation accuracy: 97.0432
[2024-10-19 06:50:31,833 INFO] Saving optimizer and weights to step_9000, and symlink to models
[2024-10-19 06:50:32,073 INFO] Saving transforms artifacts, if any, to models
[2024-10-19 06:50:32,074 INFO] Saving config and vocab to models
[2024-10-19 06:50:35,298 INFO] Step 9100/70000; acc: 79.6; ppl:  2.23; xent: 0.80; aux: 0.000; lr: 6.00e-01; sents:     375; bsz:   34/  46/ 4; 100/133 tok/s;    851 sec;
[2024-10-19 06:50:38,476 INFO] Step 9200/70000; acc: 87.2; ppl:  1.75; xent: 0.56; aux: 0.000; lr: 6.00e-01; sents:     280; bsz:   31/  47/ 3; 965/1477 tok/s;    855 sec;
[2024-10-19 06:50:41,634 INFO] Step 9300/70000; acc: 85.7; ppl:  1.50; xent: 0.41; aux: 0.000; lr: 6.00e-01; sents:     284; bsz:   31/  47/ 3; 996/1497 tok/s;    858 sec;
[2024-10-19 06:50:44,831 INFO] Step 9400/70000; acc: 86.0; ppl:  1.63; xent: 0.49; aux: 0.000; lr: 6.00e-01; sents:     314; bsz:   33/  46/ 3; 1043/1454 tok/s;    861 sec;
[2024-10-19 06:50:47,970 INFO] Step 9500/70000; acc: 83.7; ppl:  1.84; xent: 0.61; aux: 0.000; lr: 6.00e-01; sents:     355; bsz:   34/  47/ 4; 1073/1507 tok/s;    864 sec;
[2024-10-19 06:51:06,301 INFO] valid stats calculation
                           took: 18.33012342453003 s.
[2024-10-19 06:51:18,492 INFO] The translation of the valid dataset for dynamic scoring
                               took : 12.189366102218628 s.
[2024-10-19 06:51:18,492 INFO] UPDATING VALIDATION BLEU
[2024-10-19 06:51:18,810 INFO] validation BLEU: 0.0
[2024-10-19 06:51:18,811 INFO] Train perplexity: 2.04903
[2024-10-19 06:51:18,812 INFO] Train accuracy: 81.7606
[2024-10-19 06:51:18,812 INFO] Sentences processed: 35162
[2024-10-19 06:51:18,812 INFO] Average bsz:   32/  47/ 4
[2024-10-19 06:51:18,812 INFO] Validation perplexity: 1.15745
[2024-10-19 06:51:18,812 INFO] Validation accuracy: 97.0432
[2024-10-19 06:51:22,014 INFO] Step 9600/70000; acc: 81.2; ppl:  2.02; xent: 0.70; aux: 0.000; lr: 6.00e-01; sents:     384; bsz:   33/  48/ 4;  98/140 tok/s;    898 sec;
[2024-10-19 06:51:25,176 INFO] Step 9700/70000; acc: 82.4; ppl:  1.82; xent: 0.60; aux: 0.000; lr: 6.00e-01; sents:     436; bsz:   34/  49/ 4; 1075/1544 tok/s;    901 sec;
[2024-10-19 06:51:28,354 INFO] Step 9800/70000; acc: 86.7; ppl:  1.66; xent: 0.51; aux: 0.000; lr: 6.00e-01; sents:     341; bsz:   33/  49/ 3; 1026/1528 tok/s;    904 sec;
[2024-10-19 06:51:31,503 INFO] Step 9900/70000; acc: 84.0; ppl:  2.13; xent: 0.76; aux: 0.000; lr: 6.00e-01; sents:     329; bsz:   35/  46/ 3; 1106/1463 tok/s;    908 sec;
[2024-10-19 06:51:34,683 INFO] Step 10000/70000; acc: 84.2; ppl:  1.65; xent: 0.50; aux: 0.000; lr: 6.00e-01; sents:     310; bsz:   31/  46/ 3; 969/1445 tok/s;    911 sec;
[2024-10-19 06:51:53,124 INFO] valid stats calculation
                           took: 18.439740657806396 s.
[2024-10-19 06:52:05,472 INFO] The translation of the valid dataset for dynamic scoring
                               took : 12.347004652023315 s.
[2024-10-19 06:52:05,472 INFO] UPDATING VALIDATION BLEU
[2024-10-19 06:52:05,789 INFO] validation BLEU: 0.0
[2024-10-19 06:52:05,790 INFO] Train perplexity: 2.03816
[2024-10-19 06:52:05,790 INFO] Train accuracy: 81.859
[2024-10-19 06:52:05,790 INFO] Sentences processed: 36962
[2024-10-19 06:52:05,790 INFO] Average bsz:   32/  47/ 4
[2024-10-19 06:52:05,791 INFO] Validation perplexity: 1.14239
[2024-10-19 06:52:05,791 INFO] Validation accuracy: 97.0432
[2024-10-19 06:52:05,793 INFO] Saving optimizer and weights to step_10000, and symlink to models
[2024-10-19 06:52:06,035 INFO] Saving transforms artifacts, if any, to models
[2024-10-19 06:52:06,035 INFO] Saving config and vocab to models
[2024-10-19 06:52:09,272 INFO] Step 10100/70000; acc: 88.6; ppl:  1.48; xent: 0.40; aux: 0.000; lr: 6.00e-01; sents:     297; bsz:   32/  48/ 3;  93/139 tok/s;    945 sec;
[2024-10-19 06:52:12,438 INFO] Step 10200/70000; acc: 87.1; ppl:  1.69; xent: 0.53; aux: 0.000; lr: 6.00e-01; sents:     353; bsz:   32/  47/ 4; 1014/1481 tok/s;    949 sec;
[2024-10-19 06:52:15,594 INFO] Step 10300/70000; acc: 83.9; ppl:  1.94; xent: 0.66; aux: 0.000; lr: 6.00e-01; sents:     287; bsz:   31/  49/ 3; 995/1560 tok/s;    952 sec;
[2024-10-19 06:52:18,780 INFO] Step 10400/70000; acc: 81.8; ppl:  1.93; xent: 0.66; aux: 0.000; lr: 6.00e-01; sents:     450; bsz:   30/  48/ 4; 956/1492 tok/s;    955 sec;
[2024-10-19 06:52:21,944 INFO] Step 10500/70000; acc: 86.7; ppl:  1.51; xent: 0.41; aux: 0.000; lr: 6.00e-01; sents:     267; bsz:   34/  47/ 3; 1068/1491 tok/s;    958 sec;
[2024-10-19 06:52:40,284 INFO] valid stats calculation
                           took: 18.33825397491455 s.
[2024-10-19 06:52:52,500 INFO] The translation of the valid dataset for dynamic scoring
                               took : 12.215076208114624 s.
[2024-10-19 06:52:52,500 INFO] UPDATING VALIDATION BLEU
[2024-10-19 06:52:52,816 INFO] validation BLEU: 0.0
[2024-10-19 06:52:52,817 INFO] Train perplexity: 2.02034
[2024-10-19 06:52:52,817 INFO] Train accuracy: 82.042
[2024-10-19 06:52:52,817 INFO] Sentences processed: 38616
[2024-10-19 06:52:52,817 INFO] Average bsz:   32/  47/ 4
[2024-10-19 06:52:52,818 INFO] Validation perplexity: 1.18518
[2024-10-19 06:52:52,818 INFO] Validation accuracy: 97.0432
[2024-10-19 06:52:56,037 INFO] Step 10600/70000; acc: 85.0; ppl:  1.57; xent: 0.45; aux: 0.000; lr: 6.00e-01; sents:     334; bsz:   35/  50/ 3; 102/147 tok/s;    992 sec;
[2024-10-19 06:52:59,206 INFO] Step 10700/70000; acc: 83.1; ppl:  1.87; xent: 0.63; aux: 0.000; lr: 6.00e-01; sents:     374; bsz:   32/  47/ 4; 1025/1497 tok/s;    995 sec;
[2024-10-19 06:53:02,357 INFO] Step 10800/70000; acc: 82.3; ppl:  2.03; xent: 0.71; aux: 0.000; lr: 6.00e-01; sents:     398; bsz:   34/  47/ 4; 1089/1484 tok/s;    998 sec;
[2024-10-19 06:53:05,494 INFO] Step 10900/70000; acc: 85.1; ppl:  1.65; xent: 0.50; aux: 0.000; lr: 6.00e-01; sents:     396; bsz:   32/  48/ 4; 1008/1519 tok/s;   1002 sec;
[2024-10-19 06:53:08,671 INFO] Step 11000/70000; acc: 88.1; ppl:  1.48; xent: 0.39; aux: 0.000; lr: 6.00e-01; sents:     279; bsz:   31/  47/ 3; 962/1491 tok/s;   1005 sec;
[2024-10-19 06:53:26,913 INFO] valid stats calculation
                           took: 18.240413188934326 s.
[2024-10-19 06:53:27,867 INFO] The translation of the valid dataset for dynamic scoring
                               took : 0.953115701675415 s.
[2024-10-19 06:53:27,867 INFO] UPDATING VALIDATION BLEU
[2024-10-19 06:53:27,989 INFO] validation BLEU: 0.0
[2024-10-19 06:53:27,990 INFO] Train perplexity: 2.00458
[2024-10-19 06:53:27,990 INFO] Train accuracy: 82.1658
[2024-10-19 06:53:27,990 INFO] Sentences processed: 40397
[2024-10-19 06:53:27,990 INFO] Average bsz:   32/  47/ 4
[2024-10-19 06:53:27,991 INFO] Validation perplexity: 2.91505
[2024-10-19 06:53:27,991 INFO] Validation accuracy: 5.91368
[2024-10-19 06:53:27,994 INFO] Saving optimizer and weights to step_11000, and symlink to models
[2024-10-19 06:53:28,236 INFO] Saving transforms artifacts, if any, to models
[2024-10-19 06:53:28,236 INFO] Saving config and vocab to models
[2024-10-19 06:53:31,445 INFO] Step 11100/70000; acc: 85.9; ppl:  1.74; xent: 0.55; aux: 0.000; lr: 6.00e-01; sents:     309; bsz:   31/  47/ 3; 135/207 tok/s;   1028 sec;
[2024-10-19 06:53:34,621 INFO] Step 11200/70000; acc: 84.9; ppl:  1.99; xent: 0.69; aux: 0.000; lr: 6.00e-01; sents:     350; bsz:   32/  47/ 4; 1014/1476 tok/s;   1031 sec;
[2024-10-19 06:53:37,793 INFO] Step 11300/70000; acc: 80.1; ppl:  2.21; xent: 0.79; aux: 0.000; lr: 6.00e-01; sents:     537; bsz:   33/  46/ 5; 1043/1461 tok/s;   1034 sec;
[2024-10-19 06:53:40,964 INFO] Step 11400/70000; acc: 82.5; ppl:  1.77; xent: 0.57; aux: 0.000; lr: 6.00e-01; sents:     390; bsz:   35/  47/ 4; 1092/1490 tok/s;   1037 sec;
[2024-10-19 06:53:44,138 INFO] Step 11500/70000; acc: 83.9; ppl:  1.87; xent: 0.63; aux: 0.000; lr: 6.00e-01; sents:     367; bsz:   31/  47/ 4; 986/1468 tok/s;   1040 sec;
[2024-10-19 06:54:02,438 INFO] valid stats calculation
                           took: 18.29919195175171 s.
[2024-10-19 06:54:14,659 INFO] The translation of the valid dataset for dynamic scoring
                               took : 12.219140768051147 s.
[2024-10-19 06:54:14,659 INFO] UPDATING VALIDATION BLEU
[2024-10-19 06:54:15,104 INFO] validation BLEU: 0.0
[2024-10-19 06:54:15,106 INFO] Train perplexity: 2.00025
[2024-10-19 06:54:15,106 INFO] Train accuracy: 82.2225
[2024-10-19 06:54:15,106 INFO] Sentences processed: 42350
[2024-10-19 06:54:15,106 INFO] Average bsz:   32/  47/ 4
[2024-10-19 06:54:15,106 INFO] Validation perplexity: 1.1526
[2024-10-19 06:54:15,106 INFO] Validation accuracy: 97.0432
[2024-10-19 06:54:18,268 INFO] Step 11600/70000; acc: 82.5; ppl:  2.12; xent: 0.75; aux: 0.000; lr: 6.00e-01; sents:     449; bsz:   32/  45/ 4;  94/132 tok/s;   1074 sec;
[2024-10-19 06:54:21,468 INFO] Step 11700/70000; acc: 81.2; ppl:  2.01; xent: 0.70; aux: 0.000; lr: 6.00e-01; sents:     486; bsz:   33/  45/ 5; 1037/1409 tok/s;   1078 sec;
[2024-10-19 06:54:24,718 INFO] Step 11800/70000; acc: 84.2; ppl:  1.66; xent: 0.51; aux: 0.000; lr: 6.00e-01; sents:     360; bsz:   31/  45/ 4; 957/1379 tok/s;   1081 sec;
[2024-10-19 06:54:27,961 INFO] Step 11900/70000; acc: 83.0; ppl:  1.88; xent: 0.63; aux: 0.000; lr: 6.00e-01; sents:     395; bsz:   31/  46/ 4; 961/1411 tok/s;   1084 sec;
[2024-10-19 06:54:31,164 INFO] Step 12000/70000; acc: 79.1; ppl:  2.27; xent: 0.82; aux: 0.000; lr: 6.00e-01; sents:     521; bsz:   32/  46/ 5; 994/1444 tok/s;   1087 sec;
[2024-10-19 06:54:49,454 INFO] valid stats calculation
                           took: 18.288257598876953 s.
[2024-10-19 06:55:01,682 INFO] The translation of the valid dataset for dynamic scoring
                               took : 12.22706389427185 s.
[2024-10-19 06:55:01,683 INFO] UPDATING VALIDATION BLEU
[2024-10-19 06:55:02,000 INFO] validation BLEU: 0.0
[2024-10-19 06:55:02,002 INFO] Train perplexity: 1.9994
[2024-10-19 06:55:02,002 INFO] Train accuracy: 82.2121
[2024-10-19 06:55:02,002 INFO] Sentences processed: 44561
[2024-10-19 06:55:02,002 INFO] Average bsz:   32/  47/ 4
[2024-10-19 06:55:02,002 INFO] Validation perplexity: 1.1685
[2024-10-19 06:55:02,002 INFO] Validation accuracy: 97.0432
[2024-10-19 06:55:02,005 INFO] Saving optimizer and weights to step_12000, and symlink to models
[2024-10-19 06:55:02,247 INFO] Saving transforms artifacts, if any, to models
[2024-10-19 06:55:02,247 INFO] Saving config and vocab to models
[2024-10-19 06:55:05,474 INFO] Step 12100/70000; acc: 86.5; ppl:  1.69; xent: 0.52; aux: 0.000; lr: 6.00e-01; sents:     343; bsz:   34/  49/ 3;  98/142 tok/s;   1122 sec;
[2024-10-19 06:55:08,621 INFO] Step 12200/70000; acc: 82.1; ppl:  1.93; xent: 0.66; aux: 0.000; lr: 6.00e-01; sents:     399; bsz:   32/  46/ 4; 1012/1459 tok/s;   1125 sec;

So yes, it looks like the model is definitely not learning, despite having disabled virtually all the hyperparameters, and that was confirmed at inference: there was no output at all.

@francoishernandez
Contributor

Your issue is most probably not with hyperparameters, but with your data/vocab/tokenization.
And disabling most hyperparams is not necessarily a good idea.
Default values are not necessarily the best: the toolkit is modular, so there is no single "one way" of doing things. That's why there are recipes.
WMT17 is a bit old, I agree, but its base hyperparameters do work. The only thing you have to do is make them work with your data, which will be far more efficient than building a new setup from scratch.

@HURIMOZ
Author

HURIMOZ commented Oct 20, 2024

So, Iʻve inspected my data and tokenization methods.
Iʻm questioning the use of transform onmt_tokenize for Sentencepiece-built models.
I see here that you use a transform called sentencepiece for LLM training.
So I tried to mimic that yaml file and came up with this:

## IO
overwrite: True
seed: 1234
report_every: 100
valid_metrics: ["BLEU"]

### Vocab
src_vocab: processed_data/spm_src-train.onmt_vocab
tgt_vocab: processed_data/spm_tgt-train.onmt_vocab
#n_sample: -1

data:
    corpus_1:
        path_src: data/src-train.txt
        path_tgt: data/tgt-train.txt
    valid:
        path_src: data/EN-val.txt
        path_tgt: data/TY-val.txt
        #transforms: [onmt_tokenize]

transforms: [sentencepiece]
transforms_configs:
            #normalize:
                #src_lang: en
                #tgt_lang: ty
                #norm_quote_commas: true
                #norm_numbers: true
    sentencepiece:
        #src_subword_type: sentencepiece
        src_subword_model: data/en.wiki.bpe.vs25000.model
        #tgt_subword_type: sentencepiece
        tgt_subword_model: models/spm_tgt-train.model
            #filtertoolong:
                #src_seq_length: 512
                #tgt_seq_length: 512


# Number of candidates for SentencePiece sampling
#subword_nbest: 20
# Smoothing parameter for SentencePiece sampling
#subword_alpha: 0.1
  

training:
    # Model configuration
    model_path: models
    keep_checkpoint: 50
    save_checkpoint_steps: 1000
    train_steps: 70000
    valid_steps: 500

    bucket_size: 1024
    num_workers: 4
    prefetch_factor: 6
    world_size: 1
    gpu_ranks: [0]
    batch_type: "tokens"
    batch_size: 1024
    valid_batch_size: 1024
    batch_size_multiple: 8
    accum_count: [10]
    accum_steps: [0]
    dropout_steps: [0]
    dropout: [0.2]
    attention_dropout: [0.2]
    #compute_dtype: fp16
    optim: "adam"
    learning_rate: 0.6
    average_decay: 0.1
    warmup_steps: 4000
    decay_method: "noam"
    adam_beta2: 0.998
    max_grad_norm: 0
    label_smoothing: 0.1
    param_init: 0
    param_init_method: "xavier_uniform"
    #normalization: "tokens"
    #early_stopping: 3

tensorboard: true
tensorboard_log_dir: logs
   
log_file: logs/eole.log
   
# Pretrained embeddings configuration for the source language
embeddings_type: word2vec
src_embeddings: data/en.wiki.bpe.vs25000.d300.w2v.txt
#tgt_embeddings:
save_data: processed_data/
position_encoding_type: Rotary

model:
    architecture: "transformer"
    hidden_size: 300
    share_decoder_embeddings: false
    share_embeddings: false
    layers: 6
    heads: 6
    transformer_ff: 300
    word_vec_size: 300
    position_encoding: true

And then the error AttributeError: 'SentencePieceTransform' object has no attribute 'mapped_tokens' popped up. Hereʻs the full traceback:

Traceback (most recent call last):
  File "/home/ubuntu/TY-EN/TY-EN/bin/eole", line 33, in <module>
    sys.exit(load_entry_point('eole', 'console_scripts', 'eole')())
  File "/home/ubuntu/TY-EN/eole/eole/bin/main.py", line 39, in main
    bin_cls.run(args)
  File "/home/ubuntu/TY-EN/eole/eole/bin/run/train.py", line 70, in run
    train(config)
  File "/home/ubuntu/TY-EN/eole/eole/bin/run/train.py", line 57, in train
    train_process(config, device_id=0)
  File "/home/ubuntu/TY-EN/eole/eole/train_single.py", line 242, in main
    trainer.train(
  File "/home/ubuntu/TY-EN/eole/eole/trainer.py", line 328, in train
    for i, (batches, normalization) in enumerate(self._accum_batches(train_iter)):
  File "/home/ubuntu/TY-EN/eole/eole/trainer.py", line 260, in _accum_batches
    for batch, bucket_idx in iterator:
  File "/home/ubuntu/TY-EN/eole/eole/inputters/dynamic_iterator.py", line 404, in __iter__
    for (tensor_batch, bucket_idx) in self.data_iter:
  File "/home/ubuntu/TY-EN/TY-EN/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
    data = self._next_data()
  File "/home/ubuntu/TY-EN/TY-EN/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1346, in _next_data
    return self._process_data(data)
  File "/home/ubuntu/TY-EN/TY-EN/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1372, in _process_data
    data.reraise()
  File "/home/ubuntu/TY-EN/TY-EN/lib/python3.10/site-packages/torch/_utils.py", line 705, in reraise
    raise exception
AttributeError: Caught AttributeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/ubuntu/TY-EN/TY-EN/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
  File "/home/ubuntu/TY-EN/TY-EN/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 41, in fetch
    data = next(self.dataset_iter)
  File "/home/ubuntu/TY-EN/eole/eole/inputters/dynamic_iterator.py", line 372, in __iter__
    for bucket, bucket_idx in self._bucketing():
  File "/home/ubuntu/TY-EN/eole/eole/inputters/dynamic_iterator.py", line 309, in _bucketing
    yield (self._tuple_to_json_with_tokIDs(bucket), self.bucket_idx)
  File "/home/ubuntu/TY-EN/eole/eole/inputters/dynamic_iterator.py", line 278, in _tuple_to_json_with_tokIDs
    tuple_bucket = transform_bucket(self.task, tuple_bucket, self.score_threshold)
  File "/home/ubuntu/TY-EN/eole/eole/inputters/text_utils.py", line 36, in transform_bucket
    transf_bucket = transform.batch_apply(
  File "/home/ubuntu/TY-EN/eole/eole/transforms/transform.py", line 272, in batch_apply
    batch = transform.batch_apply(
  File "/home/ubuntu/TY-EN/eole/eole/transforms/transform.py", line 108, in batch_apply
    example = self.apply(example, is_train=is_train, **kwargs)
  File "/home/ubuntu/TY-EN/eole/eole/transforms/tokenize.py", line 203, in apply
    src_out = self._tokenize(example["src"], "src", is_train)
  File "/home/ubuntu/TY-EN/eole/eole/transforms/tokenize.py", line 177, in _tokenize
    if self.mapped_tokens is not None:
AttributeError: 'SentencePieceTransform' object has no attribute 'mapped_tokens'

Can we also use the sentencepiece transform for bilingual NMT models? If so, what is the exact way to implement it?

@francoishernandez
Contributor

Your sentencepiece transform config seems ok.
You just happened to catch a bug caused by the introduction of mapped_tokens (to try and handle LLM special tokens better).
This was probably fixed around here -- 42e26b8
You can git pull the latest main and the error should go away. If not, you can try changing this line to use getattr(self, "mapped_tokens", None).
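
For reference, applied to the line shown in the traceback above, the workaround would look roughly like this (a sketch of the suggested change, not necessarily the exact upstream patch):

# eole/transforms/tokenize.py, in _tokenize (line 177 in the traceback above)
# old: if self.mapped_tokens is not None:
if getattr(self, "mapped_tokens", None) is not None:
    ...  # the rest of the special-token handling stays unchanged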

onmt_tokenize should be working as well, but it might require some additional configuration, e.g. src_onmttok_kwargs: {"mode": "none", "spacer_annotate": True}.

To investigate the transforms further, you can enable the dump_samples flag on a small sample (e.g. n_sample: 1000) and perform some visual checks.
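
To make that concrete, here is a minimal sketch of the onmt_tokenize variant plus the sampling check, reusing the subword model paths from your config (adjust subword_type to how each .model file was actually trained; option names follow what is used elsewhere in this thread):

transforms: [onmt_tokenize]
transforms_configs:
    onmt_tokenize:
        src_subword_type: sentencepiece
        src_subword_model: data/en.wiki.bpe.vs25000.model
        tgt_subword_type: sentencepiece
        tgt_subword_model: models/spm_tgt-train.model
        src_onmttok_kwargs:
            mode: none
            spacer_annotate: true
        tgt_onmttok_kwargs:
            mode: none
            spacer_annotate: true

# dump a small transformed sample to disk for visual inspection
n_sample: 1000
dump_samples: true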

@HURIMOZ
Author

HURIMOZ commented Oct 21, 2024

Thanks François, that works now.

@francoishernandez
Contributor

For reference, I just extended the WMT17 recipe with explicit bpe/sentencepiece/onmt_tokenize[bpe]/onmt_tokenize[sentencepiece] examples here: #129.

@HURIMOZ
Author

HURIMOZ commented Oct 25, 2024

Hi François, now that I was finally able to train a model, weʻre getting to the actual topic of this thread: inference.
And again, Iʻm struggling to translate my test file.
See my inference.yaml config:

seed: 1234

src: data/src-test.txt
output: translations/tgt-test.txt

model_path: models/step_37000

transforms: [normalize, sentencepiece]
transforms_configs:
    normalize:
      src_lang: en
      tgt_lang: ty
      norm_quote_commas: true
      norm_numbers: true

    sentencepiece:
      src_subword_type: bpe
      src_subword_model: models/en.wiki.bpe.vs25000.model
      tgt_subword_type: bpe
      tgt_subword_model: processed_data/spm_tgt.model

verbose: true
n_best: 3
top_p: 0.9
beam_size: 5
batch_type: sents

world_size: 1
gpu_ranks: [0]

You said most parameters are now provided by the config.json file itself, that a yaml config file for inference is not necessary, and that only some parameters need to be passed explicitly on the command line.
So hereʻs my bash command:
eole predict -c inference.yaml -model_path models/step_37000 -src data/src-test.txt -output translations/tgt-test.txt -world_size 1 -gpu_ranks 0 -n_best 3 -top_p 0.9 -beam_size 10
Time w/o python interpreter load/terminate: 1.9073486328125e-06
While the command runs without error, no output file is generated, and there is no verbose output either. And Iʻve got almost 300 lines to translate, which should take more than just two seconds (at least in my experience running inference with OpenNMT-py).
Since my training looked good (see the accuracy, loss, and BLEU curves below), training doesnʻt seem to be the problem here.
[screenshots: training accuracy, loss, and validation BLEU curves]
I tried several commands, based on the few examples from the recipes (although the wmt17 recipe doesnʻt have an inference config), but to no avail.

@francoishernandez
Contributor

At the risk of repeating myself, if you are running the latest version of the code, the inference config file is not needed: all the required information (notably about transforms) is grabbed from the saved model's config.json.

That being said, your setup still looks shady to me.

The inference config provided is not valid and should raise an error (the sentencepiece transform does not expect the `{src,tgt}_subword_type` settings).
Also, your command does not make much sense, as you provide some information twice: in the config and on the command line. As explained before, the config and command-line arguments are in the end the same thing; it's just a matter of preference/context which method you use to configure your run.

So, a more logical command would be something like this:

eole predict -model_path models/step_37000 -src data/src-test.txt -output translations/tgt-test.txt -world_size 1 -gpu_ranks 0 -n_best 3 -top_p 0.9 -beam_size 10

Please share:

  1. git log -1 output
  2. your model's full models/step_37000/config.json

@HURIMOZ
Author

HURIMOZ commented Oct 25, 2024

Hi François, see below:

(TY-EN) ubuntu@ip-172-31-2-199:~/TY-EN/eole/recipes/wmt17$ git log -1
commit 5369b07e3bcc2560472aa72816ac72b128637c7a (HEAD -> main, origin/main, origin/HEAD)
Author: François Hernandez <francois.hernandez.fh@gmail.com>
Date:   Wed Oct 23 15:36:23 2024 +0200

    misc fixes, add wmt17 bpe/spm configs (#129)
(TY-EN) ubuntu@ip-172-31-2-199:~/TY-EN/eole/recipes/wmt17$ git log -1 --stat
commit 5369b07e3bcc2560472aa72816ac72b128637c7a (HEAD -> main, origin/main, origin/HEAD)
Author: François Hernandez <francois.hernandez.fh@gmail.com>
Date:   Wed Oct 23 15:36:23 2024 +0200

    misc fixes, add wmt17 bpe/spm configs (#129)

 docs/docusaurus_tsx/docs/concepts/command_line.md |   2 +-
 eole/bin/run/build_vocab.py                       |   2 +-
 eole/inputters/text_corpus.py                     |   2 +-
 eole/transforms/tokenize.py                       |   2 +-
 recipes/wmt17/README.md                           |  33 ++++++++++++++++++++++++++++++---
 recipes/wmt17/prepare_wmt_ende_data.sh            | 129 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-----------------------
 recipes/wmt17/wmt17_ende_bpe.yaml                 |  94 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 recipes/wmt17/wmt17_ende_bpe_onmt_tokenize.yaml   |  96 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 recipes/wmt17/wmt17_ende_spm.yaml                 |  92 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 recipes/wmt17/wmt17_ende_spm_onmt_tokenize.yaml   |  96 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 10 files changed, 518 insertions(+), 30 deletions(-)
{
  "tensorboard_log_dir": "logs",
  "tgt_vocab": "processed_data/spm_tgt.onmt_vocab",
  "seed": 1234,
  "pre_word_vecs_enc": "processed_data/.enc_embeddings.pt",
  "log_file": "logs/eole.log",
  "save_data": "processed_data/",
  "tensorboard": true,
  "embeddings_type": "word2vec",
  "report_every": 100,
  "valid_metrics": [
    "BLEU"
  ],
  "src_vocab": "processed_data/spm_src-train_bpemb_en.wiki.bpe.vs25000.onmt_vocab",
  "src_embeddings": "data/en.wiki.bpe.vs25000.d300.w2v.txt",
  "transforms": [
    "normalize",
    "sentencepiece",
    "filtertoolong"
  ],
  "overwrite": true,
  "tensorboard_log_dir_dated": "logs/Oct-24_09-08-24",
  "training": {
    "label_smoothing": 0.1,
    "gpu_ranks": [
      0
    ],
    "bucket_size": 1024,
    "dropout": [
      0.1
    ],
    "max_grad_norm": 0.0,
    "batch_type": "tokens",
    "valid_steps": 500,
    "adam_beta2": 0.998,
    "batch_size_multiple": 1,
    "dropout_steps": [
      0
    ],
    "decay_method": "noam",
    "accum_count": [
      10
    ],
    "model_path": "models",
    "param_init_method": "xavier_uniform",
    "compute_dtype": "torch.float16",
    "num_workers": 0,
    "save_checkpoint_steps": 1000,
    "attention_dropout": [
      0.1
    ],
    "prefetch_factor": 50,
    "world_size": 1,
    "batch_size": 1024,
    "train_steps": 40000,
    "warmup_steps": 4000,
    "keep_checkpoint": 40,
    "accum_steps": [
      0
    ],
    "learning_rate": 1.4,
    "optim": "adam",
    "normalization": "tokens",
    "valid_batch_size": 1024,
    "average_decay": 0.1
  },
  "transforms_configs": {
    "filtertoolong": {
      "src_seq_length": 300,
      "tgt_seq_length": 300
    },
    "sentencepiece": {
      "src_subword_type": "bpe",
      "tgt_subword_type": "bpe",
      "src_subword_model": "${MODEL_PATH}/en.wiki.bpe.vs25000.model",
      "tgt_subword_model": "${MODEL_PATH}/spm_tgt.model"
    },
    "normalize": {
      "norm_quote_commas": true,
      "tgt_lang": "ty",
      "norm_numbers": true,
      "src_lang": "en"
    }
  },
  "data": {
    "corpus_1": {
      "path_tgt": "data/tgt-train.txt",
      "path_src": "data/src-train.txt",
      "transforms": [
        "normalize",
        "sentencepiece",
        "filtertoolong"
      ],
      "path_align": null
    },
    "valid": {
      "path_tgt": "data/TY-val.txt",
      "path_src": "data/EN-val.txt",
      "transforms": [
        "normalize",
        "sentencepiece",
        "filtertoolong"
      ],
      "path_align": null
    }
  },
  "model": {
    "hidden_size": 300,
    "share_decoder_embeddings": false,
    "layers": 6,
    "architecture": "transformer",
    "transformer_ff": 300,
    "heads": 6,
    "share_embeddings": false,
    "encoder": {
      "src_word_vec_size": 300,
      "position_encoding_type": "Rotary",
      "encoder_type": "transformer",
      "n_positions": null
    },
    "embeddings": {
      "position_encoding_type": "Rotary",
      "word_vec_size": 300,
      "src_word_vec_size": 300,
      "tgt_word_vec_size": 300
    },
    "decoder": {
      "position_encoding_type": "Rotary",
      "decoder_type": "transformer",
      "n_positions": null,
      "tgt_word_vec_size": 300
    }
  }
}

I tried eole predict -model_path models/step_37000 -src data/src-test.txt -output translations/tgt-test.txt -world_size 1 -gpu_ranks 0 -n_best 3 -top_p 0.9 -beam_size 10 and I get this:
Time w/o python interpreter load/terminate: 2.1457672119140625e-06 but no tgt-test.txt file is generated.

@francoishernandez
Contributor

I don't understand how this is possible. The sentencepiece transform config is invalid and as such should have raised an error in training (and should raise an error here in predict).

How are you running this code? Did you pip install -e <your_local_repo>? Maybe your eole bin uses a version installed from pypi, which would not be up to date.

To make sure you're running the local (up to date) code, you can pip install -e <your_local_eole_repo> or do something like this for instance:

export PYTHONPATH=/path/of/your/eole_repo
python3 /path/of/your/eole_repo/eole/bin/main.py predict
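
As an additional sanity check (plain shell, assuming a standard virtualenv), you can verify which installation the eole entry point actually resolves to:

which eole
pip show eole    # for an editable install, the Location / Editable project location field should point at your local clone
python3 -c "import eole; print(eole.__file__)"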

@HURIMOZ
Author

HURIMOZ commented Oct 25, 2024

Hi François, I completely uninstalled Eole and reinstalled it using pip install -e .
Now I get more errors:

(TY-EN) ubuntu@ip-172-31-2-199:~/TY-EN/eole/recipes/wmt17$ eole train -config wmt17_enty.yaml
xavier_uniform initialization does not require param_init (0.1)
Traceback (most recent call last):
  File "/home/ubuntu/TY-EN/TY-EN/bin/eole", line 33, in <module>
    sys.exit(load_entry_point('eole', 'console_scripts', 'eole')())
  File "/home/ubuntu/TY-EN/eole/eole/bin/main.py", line 39, in main
    bin_cls.run(args)
  File "/home/ubuntu/TY-EN/eole/eole/bin/run/train.py", line 69, in run
    config = cls.build_config(args)
  File "/home/ubuntu/TY-EN/eole/eole/bin/run/__init__.py", line 42, in build_config
    config = cls.config_class(**config_dict)
  File "/home/ubuntu/TY-EN/TY-EN/lib/python3.10/site-packages/pydantic/main.py", line 193, in __init__
    self.__pydantic_validator__.validate_python(data, self_instance=self)
pydantic_core._pydantic_core.ValidationError: 2 validation errors for TrainConfig
transforms_configs.sentencepiece.src_subword_type
  Extra inputs are not permitted [type=extra_forbidden, input_value='sentencepiece', input_type=str]
    For further information visit https://errors.pydantic.dev/2.8/v/extra_forbidden
transforms_configs.sentencepiece.tgt_subword_type
  Extra inputs are not permitted [type=extra_forbidden, input_value='sentencepiece', input_type=str]
    For further information visit https://errors.pydantic.dev/2.8/v/extra_forbidden

So I just disabled src_subword_type and tgt_subword_type in the config file, but Iʻm not sure it is still using SentencePiece BPE tokenization.

Then I get this error:

xavier_uniform initialization does not require param_init (0.1)
[2024-10-25 12:38:53,148 INFO] Default transforms (might be overridden downstream): ['normalize', 'sentencepiece'].
[2024-10-25 12:38:53,148 INFO] Missing transforms field for corpus_1 data, set to default: ['normalize', 'sentencepiece'].
[2024-10-25 12:38:53,148 INFO] Missing transforms field for valid data, set to default: ['normalize', 'sentencepiece'].
[2024-10-25 12:38:53,148 INFO] Parsed 2 corpora from -data.
[2024-10-25 12:38:53,149 INFO] Get special vocabs from Transforms: {'src': [], 'tgt': []}.
[2024-10-25 12:38:53,206 INFO] Reading encoder embeddings from data/en.wiki.bpe.vs25000.d300.w2v.txt
[2024-10-25 12:38:55,731 INFO]  Found 25000 total vectors in file.
[2024-10-25 12:38:55,731 INFO] After filtering to vectors in vocab:
[2024-10-25 12:38:55,739 INFO]  * enc: 16041 match, 7 missing, (99.96%)
[2024-10-25 12:38:55,739 INFO]
Saving encoder embeddings as:
        * enc: processed_data/.enc_embeddings.pt
Traceback (most recent call last):
  File "/home/ubuntu/TY-EN/TY-EN/bin/eole", line 33, in <module>
    sys.exit(load_entry_point('eole', 'console_scripts', 'eole')())
  File "/home/ubuntu/TY-EN/eole/eole/bin/main.py", line 39, in main
    bin_cls.run(args)
  File "/home/ubuntu/TY-EN/eole/eole/bin/run/train.py", line 70, in run
    train(config)
  File "/home/ubuntu/TY-EN/eole/eole/bin/run/train.py", line 57, in train
    train_process(config, device_id=0)
  File "/home/ubuntu/TY-EN/eole/eole/train_single.py", line 154, in main
    checkpoint, vocabs, transforms, config = _init_train(config)
  File "/home/ubuntu/TY-EN/eole/eole/train_single.py", line 92, in _init_train
    vocabs, transforms = prepare_transforms_vocabs(config, transforms_cls)
  File "/home/ubuntu/TY-EN/eole/eole/train_single.py", line 38, in prepare_transforms_vocabs
    prepare_pretrained_embeddings(config, vocabs)
  File "/home/ubuntu/TY-EN/eole/eole/modules/embeddings.py", line 331, in prepare_pretrained_embeddings
    config.pre_word_vecs_enc = enc_output_file
  File "/home/ubuntu/TY-EN/TY-EN/lib/python3.10/site-packages/pydantic/main.py", line 853, in __setattr__
    self.__pydantic_validator__.validate_assignment(self, name, value)
pydantic_core._pydantic_core.ValidationError: 1 validation error for TrainConfig
pre_word_vecs_enc
  Object has no attribute 'pre_word_vecs_enc' [type=no_such_attribute, input_value='processed_data/.enc_embeddings.pt', input_type=str]
    For further information visit https://errors.pydantic.dev/2.8/v/no_such_attribute

So, I disabled my pre-trained embeddings and launched the training again.
And this new error comes up:

Traceback (most recent call last):
  File "/home/ubuntu/TY-EN/TY-EN/bin/eole", line 33, in <module>
    sys.exit(load_entry_point('eole', 'console_scripts', 'eole')())
  File "/home/ubuntu/TY-EN/eole/eole/bin/main.py", line 39, in main
    bin_cls.run(args)
  File "/home/ubuntu/TY-EN/eole/eole/bin/run/train.py", line 70, in run
    train(config)
  File "/home/ubuntu/TY-EN/eole/eole/bin/run/train.py", line 57, in train
    train_process(config, device_id=0)
  File "/home/ubuntu/TY-EN/eole/eole/train_single.py", line 242, in main
    trainer.train(
  File "/home/ubuntu/TY-EN/eole/eole/trainer.py", line 329, in train
    for i, (batches, normalization) in enumerate(self._accum_batches(train_iter)):
  File "/home/ubuntu/TY-EN/eole/eole/trainer.py", line 261, in _accum_batches
    for batch, bucket_idx in iterator:
  File "/home/ubuntu/TY-EN/eole/eole/inputters/dynamic_iterator.py", line 404, in __iter__
    for (tensor_batch, bucket_idx) in self.data_iter:
  File "/home/ubuntu/TY-EN/TY-EN/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
    data = self._next_data()
  File "/home/ubuntu/TY-EN/TY-EN/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1346, in _next_data
    return self._process_data(data)
  File "/home/ubuntu/TY-EN/TY-EN/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1372, in _process_data
    data.reraise()
  File "/home/ubuntu/TY-EN/TY-EN/lib/python3.10/site-packages/torch/_utils.py", line 705, in reraise
    raise exception
KeyError: Caught KeyError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/ubuntu/TY-EN/TY-EN/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
  File "/home/ubuntu/TY-EN/TY-EN/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 41, in fetch
    data = next(self.dataset_iter)
  File "/home/ubuntu/TY-EN/eole/eole/inputters/dynamic_iterator.py", line 372, in __iter__
    for bucket, bucket_idx in self._bucketing():
  File "/home/ubuntu/TY-EN/eole/eole/inputters/dynamic_iterator.py", line 309, in _bucketing
    yield (self._tuple_to_json_with_tokIDs(bucket), self.bucket_idx)
  File "/home/ubuntu/TY-EN/eole/eole/inputters/dynamic_iterator.py", line 278, in _tuple_to_json_with_tokIDs
    tuple_bucket = transform_bucket(self.task, tuple_bucket, self.score_threshold)
  File "/home/ubuntu/TY-EN/eole/eole/inputters/text_utils.py", line 36, in transform_bucket
    transf_bucket = transform.batch_apply(
  File "/home/ubuntu/TY-EN/eole/eole/transforms/transform.py", line 272, in batch_apply
    batch = transform.batch_apply(
  File "/home/ubuntu/TY-EN/eole/eole/transforms/transform.py", line 108, in batch_apply
    example = self.apply(example, is_train=is_train, **kwargs)
  File "/home/ubuntu/TY-EN/eole/eole/transforms/normalize.py", line 307, in apply
    self.src_lang_dict[corpus_name],
KeyError: 'corpus_1'

It looks like it doesnʻt like the normalize transform, so I disabled it, and now I can train. Hopefully inference will run more smoothly.

@francoishernandez
Contributor

For the normalize transform you need to specify src_lang/tgt_lang at the dataset level.
Also, you don't necessarily need to re-run the full training to test inference. Just manually fix the model's config.json (that's why we dump a serialized version of the config, to facilitate experimenting and adapting settings along the way). You can start by removing src/tgt_subword_type from the sentencepiece transform config, and then maybe other fields depending on errors you get.
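
For instance, a minimal sketch of the dataset-level language settings, reusing the corpus names and paths from your config (I'm assuming the per-dataset keys are spelled src_lang/tgt_lang, like in the transform config):

data:
    corpus_1:
        path_src: data/src-train.txt
        path_tgt: data/tgt-train.txt
        src_lang: en
        tgt_lang: ty
    valid:
        path_src: data/EN-val.txt
        path_tgt: data/TY-val.txt
        src_lang: en
        tgt_lang: ty

And for the saved model, fixing models/step_37000/config.json simply means deleting the "src_subword_type" and "tgt_subword_type" entries from the "sentencepiece" block under "transforms_configs".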

@HURIMOZ
Author

HURIMOZ commented Oct 25, 2024

Okay François, Iʻm now finally able to run inference. Thank you so much for your valuable help. And please let me know if you plan on adding back the pre-trained embeddings feature.
