eole predict command #126
You are mentioning a config file, but I don't see it in your command line, so it can't be used.
So, you need to pass the config file explicitly on the command line.
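As a sketch, assuming the config is passed with -c (as in the eole translate example quoted elsewhere in this thread) and that the inference settings live in a file named inference.yaml (a hypothetical name), the call would look something like:

eole predict -c inference.yaml --src data/src-test.txt --output translations/tgt-test.txt --gpu 0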
Hi François, I still get UNK tokens only.
That's quite difficult to debug remotely with partial information like that. It could be something going wrong in your training, some misconfiguration, or some version conflict. If you built the setup up from an existing recipe, maybe try to do some kind of ablation study / add features iteratively to check at which point things go rogue. That being said, has your model been trained with a recent version/commit, or does it date back to an earlier version? Since we moved quite a few things configuration-wise (see below), you might have ended up in a shaky setup.
Yes, since version 0.0.2 some transform/inference related params can be stored in the model config for more transparent usage.
If your model has been trained/converted with 0.0.2 or later, you should not need a yaml config; you only need to specify the required inference-related params via the command line. Just to be clear, these json/yaml files and command line arguments have the same end goal: build a valid config for your model to run. So the true question is: what do you need for the model to run?
The idea of embedding more stuff in the model config.json file is to make most of this transparent, and allow you to focus on "what's important" when predicting, i.e. (e) in the list above. Hope this helps.
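For a model trained or converted with 0.0.2 or later, the minimal call would then be along these lines (a sketch; flags and paths are the ones from the command quoted elsewhere in this thread, no yaml involved):

eole predict --model_path models/step_1000 --src data/src-test.txt --output translations/tgt-test.txt --beam_size 5 --gpu 0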
Hi François, I don't know what I'm doing wrong.
First thing: when I use early_stopping, the training won't go past 3000 steps or so and finds the best model to be at 500 steps (the very first model saved...). Also, the training seems too fast compared to OpenNMT-py; it would usually take about 8 hours to train those same 90,000 parallel lines with OpenNMT-py. So I disabled the early-stopping parameter, let the training run overnight, and stopped it at 69000 steps. None of the 50 models would yield anything in the output. For inference, I tried four commands:
The system seems to compute but then doesn't write the output to the txt file. I even looked into the Eole NLP documentation to try to get rid of the OpenNMT-py parameters. I'm not sure what I'm doing wrong.
Might not be the only issue, but this line should not be commented: This can probably explain the early-stopping behaviour as well; the model is probably not learning anything because the input data does not make sense with regard to its vocab.
Hi François, thanks for that.
I now get to 93% acc and 4.2 xent after 5000 steps, but the inference still doesn't work. None of the four commands above will generate the translations.
Well, in that case you would need to set the transforms explicitly at inference. Dataset-level configuration is not embedded automatically in the model config. So, you can either:
same as above
this one should work, provided your model is fine (the latest one with 93% acc should be), and the
this one should be fine, provided you trained your model with root-level transforms configs, not dataset-level (see the sketch below). Final notes:
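To make the dataset-level vs root-level distinction concrete, here is a minimal sketch built from the paths and transform names used elsewhere in this thread; per the comments above, only the root-level form ends up embedded in the model's config.json.

Dataset-level (attached to one corpus only):
data:
  corpus_1:
    path_src: data/src-train.txt
    path_tgt: data/tgt-train.txt
    transforms: [normalize, onmt_tokenize]

Root-level (shared default, saved with the model config):
transforms: [normalize, onmt_tokenize]
transforms_configs:
  normalize:
    src_lang: en
    tgt_lang: ty
  onmt_tokenize:
    src_subword_type: sentencepiece
    src_subword_model: data/en.wiki.bpe.vs25000.model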
Thank you François.
and here is the tentative inference yaml file that I wrote:
I try to keep it simple and add more features once I no longer get errors or see some improvements.
Your main issue is not your hyperparams. It's configuration management. (Not 100% sure about the validation BLEU issue, but let's validate standard inference first.) In your case, please use root level transform configuration. It will make things way easier for you.
Proper configurations
Training config
## IO
overwrite: True
seed: 1234
report_every: 100
valid_metrics: ["BLEU"]
### Vocab
src_vocab: processed_data/spm_src-train.onmt_vocab
tgt_vocab: processed_data/spm_tgt-train.onmt_vocab
#n_sample: -1
data:
corpus_1:
path_src: data/src-train.txt
path_tgt: data/tgt-train.txt
valid:
path_src: data/src-val.txt
path_tgt: data/tgt-val.txt
transforms: [normalize, onmt_tokenize]
transforms: [normalize, onmt_tokenize, filtertoolong]
transforms_configs:
normalize:
src_lang: en
tgt_lang: ty
norm_quote_commas: true
norm_numbers: true
onmt_tokenize:
src_subword_type: sentencepiece
src_subword_model: data/en.wiki.bpe.vs25000.model
tgt_subword_type: sentencepiece
tgt_subword_model: processed_data/spm_tgt-train.model
filtertoolong:
src_seq_length: 512
tgt_seq_length: 512
# Number of candidates for SentencePiece sampling
subword_nbest: 20
# Smoothing parameter for SentencePiece sampling
subword_alpha: 0.1
training:
# Model configuration
model_path: models
keep_checkpoint: 50
save_checkpoint_steps: 1000
train_steps: 70000
valid_steps: 500
# bucket_size:
bucket_size: 256
num_workers: 4
prefetch_factor: 2
world_size: 1
gpu_ranks: [0]
batch_type: "tokens"
batch_size: 1024
valid_batch_size: 1024
batch_size_multiple: 8
accum_count: [10]
accum_steps: [0]
dropout_steps: [0]
dropout: [0.2]
attention_dropout: [0.2]
compute_dtype: "fp16"
optim: "adam"
learning_rate: 0.02
average_decay: 0.0001
warmup_steps: 4000
decay_method: "noam"
adam_beta2: 0.998
max_grad_norm: 0
label_smoothing: 0.1
param_init: 0
param_init_glorot: true
normalization: "tokens"
#early_stopping: 3
tensorboard: true
tensorboard_log_dir: logs
log_file: logs/eole.log
# Pretrained embeddings configuration for the source language
embeddings_type: word2vec
src_embeddings: data/en.wiki.bpe.vs25000.d300.w2v-256.txt
#tgt_embeddings:
save_data: processed_data/
#position_encoding_type: SinusoidalInterleaved
model:
architecture: "transformer"
hidden_size: 256
share_decoder_embeddings: true
share_embeddings: false
layers: 6
heads: 8
transformer_ff: 256
word_vec_size: 256
position_encoding: true
Inference config
(If the model was trained with the above config, the transform configuration is actually not necessary here, since it should be in the model config.json.)
valid_metrics: ["BLEU"]
data:
corpus_1:
transforms: [normalize, onmt_tokenize]
transforms_configs:
normalize:
src_lang: en
tgt_lang: ty
norm_quote_commas: True
norm_numbers: True
onmt_tokenize:
src_subword_type: sentencepiece
src_subword_model: data/en.wiki.bpe.vs25000.model
tgt_subword_type: sentencepiece
tgt_subword_model: processed_data/spm_tgt-train.model
report_time: true
verbose: true
n_best: 3
top_p: 0.9
beam_size: 5
world_size: 1
gpu: 0
The "data" key in your inference file is not used; we're not using the datasets at inference, only the "src" file. Not sure why it doesn't raise a warning, by the way.
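As a sketch, the same inference settings with the unused data key dropped would look like this, keeping only keys that already appear above (the src file itself is still passed on the command line with --src):

transforms: [normalize, onmt_tokenize]
transforms_configs:
  normalize:
    src_lang: en
    tgt_lang: ty
    norm_quote_commas: true
    norm_numbers: true
  onmt_tokenize:
    src_subword_type: sentencepiece
    src_subword_model: data/en.wiki.bpe.vs25000.model
    tgt_subword_type: sentencepiece
    tgt_subword_model: processed_data/spm_tgt-train.model
report_time: true
verbose: true
n_best: 3
beam_size: 5
world_size: 1
gpu: 0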
Thank you François, but changing to root-level transform configuration hasn't changed anything. I still get zero for the BLEU score, a rapid increase of the acc score to 93%, and the updated inference file still won't save any translations.
and the new updated inference file:
and the command I use to translate:
There are still several issues stacking up.
1. Update your code and config
You are most probably using an older version of the code; this training config should give you some errors with the latest versions. Here are some of the issues:
2. Check your data/tokenization/transforms setup
The very quickly reached, very high accuracy plus the <unk>-only output at inference leads me to believe that the data your model learns on is broken. (Basically it sees only/mostly <unk> tokens, so it "thinks" it learns properly, but in fact it just learns to output <unk> tokens...) I encourage you to start from a "known-to-work" setup and build from there, e.g. the WMT17 recipe. I quickly did this to help you in your journey to a working setup:
Logs should look something like this:
## IO
save_data: wmt17_en_de/data
overwrite: True
seed: 1234
report_every: 100
valid_metrics: ["BLEU"]
### Vocab
src_vocab: wmt17_en_de/vocab.shared
tgt_vocab: wmt17_en_de/vocab.shared
src_vocab_size: 36000
tgt_vocab_size: 36000
vocab_size_multiple: 8
src_words_min_frequency: 2
tgt_words_min_frequency: 2
share_vocab: True
n_sample: 0
data:
corpus_1:
path_src: wmt17_en_de/train.src.bpe.shuf
path_tgt: wmt17_en_de/train.trg.bpe.shuf
valid:
path_src: wmt17_en_de/dev.src.bpe
path_tgt: wmt17_en_de/dev.trg.bpe
transforms: [bpe]
transforms_configs:
bpe:
src_subword_model: wmt17_en_de/codes
tgt_subword_model: wmt17_en_de/codes
training:
# Model configuration
model_path: test_model_wmt17
keep_checkpoint: 50
save_checkpoint_steps: 1000
average_decay: 0
train_steps: 50000
valid_steps: 5000
# bucket_size:
bucket_size: 262144
bucket_size_init: 10000
bucket_size_increment: 25000
num_workers: 4
prefetch_factor: 400
world_size: 1
gpu_ranks: [0]
batch_type: "tokens"
batch_size: 5000
valid_batch_size: 4096
batch_size_multiple: 8
accum_count: [10]
accum_steps: [0]
dropout_steps: [0]
dropout: [0.1]
attention_dropout: [0.1]
compute_dtype: "fp16"
#apex_opt_level: "O2"
optim: "fusedadam"
learning_rate: 2
warmup_steps: 4000
decay_method: "noam"
adam_beta2: 0.998
max_grad_norm: 0
label_smoothing: 0.1
param_init: 0
param_init_glorot: true
normalization: "tokens"
model:
architecture: "transformer"
hidden_size: 1024
share_decoder_embeddings: true
share_embeddings: true
layers: 6
heads: 16
transformer_ff: 4096
embeddings:
word_vec_size: 1024
position_encoding_type: "SinusoidalInterleaved"
(note: your learning rate is too low, adam+noam requires a higher value to get started)
## IO
save_data: wmt17_en_de/data
overwrite: True
seed: 1234
report_every: 100
valid_metrics: ["BLEU"]
### Vocab
src_vocab: wmt17_en_de/vocab.shared
tgt_vocab: wmt17_en_de/vocab.shared
src_vocab_size: 36000
tgt_vocab_size: 36000
vocab_size_multiple: 8
src_words_min_frequency: 2
tgt_words_min_frequency: 2
share_vocab: True
n_sample: 0
data:
corpus_1:
path_src: wmt17_en_de/train.src.bpe.shuf
path_tgt: wmt17_en_de/train.trg.bpe.shuf
valid:
path_src: wmt17_en_de/dev.src.bpe
path_tgt: wmt17_en_de/dev.trg.bpe
transforms: [bpe]
transforms_configs:
bpe:
src_subword_model: wmt17_en_de/codes
tgt_subword_model: wmt17_en_de/codes
training:
# Model configuration
model_path: test_model_wmt17
keep_checkpoint: 50
save_checkpoint_steps: 1000
train_steps: 50000
valid_steps: 5000
# bucket_size:
bucket_size: 262144
bucket_size_init: 10000
bucket_size_increment: 25000
num_workers: 4
prefetch_factor: 400
world_size: 1
gpu_ranks: [0]
batch_type: "tokens"
batch_size: 5000
valid_batch_size: 4096
batch_size_multiple: 8
accum_count: [10]
accum_steps: [0]
dropout_steps: [0]
dropout: [0.1]
attention_dropout: [0.1]
compute_dtype: "fp16"
#apex_opt_level: "O2"
optim: "fusedadam"
learning_rate: 2
average_decay: 0.0001
warmup_steps: 4000
decay_method: "noam"
adam_beta2: 0.998
max_grad_norm: 0
label_smoothing: 0.1
param_init: 0
param_init_glorot: true
normalization: "tokens"
model:
architecture: "transformer"
hidden_size: 256
share_decoder_embeddings: true
share_embeddings: true
layers: 6
heads: 8
transformer_ff: 256
embeddings:
word_vec_size: 256
position_encoding_type: "SinusoidalInterleaved"
Once you have such a setup working in your context (data, machine), you can start adding features.
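For context on the learning-rate note above: assuming eole keeps OpenNMT-py's noam schedule, the effective rate is roughly learning_rate * hidden_size^-0.5 * min(step^-0.5, step * warmup_steps^-1.5), which peaks at step = warmup_steps. With learning_rate 0.02, hidden_size 256 and warmup_steps 4000 that peak is only about 2e-5, while the configs above (learning_rate 2) peak around 1e-3 to 2e-3, which is why 0.02 is too low to get the model started.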
Note that if you pull the very latest commit from
Hi François, thank you so much for trying to help me out.
Even with this minimalist setup, the system is not training the model. The validation accuracy now gets stuck at 97.0432 no matter how many steps I run.
So yes, it looks like the model is definitely not learning, despite having disabled virtually all the hyperparameters, and that was confirmed at inference: there was no output at all.
Your issue is most probably not with hyperparameters, but with your data/vocab/tokenization.
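One quick way to sanity-check the tokenization side is to use the standard sentencepiece command-line tools rather than anything eole-specific (a sketch; the model and vocab paths are the ones from the configs above):

# encode a few target-side training lines with the sentencepiece model from the config
head -n 3 data/tgt-train.txt | spm_encode --model=processed_data/spm_tgt-train.model
# count vocab entries containing the sentencepiece meta symbol; a count near zero suggests
# the vocab was not built from the same pieces, so most tokens would map to <unk>
grep -c "▁" processed_data/spm_tgt-train.onmt_vocab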
So, I've inspected my data and tokenization methods.
And the subsequent error:
Can we also use transform
Your sentencepiece transform config seems ok.
To investigate more on the transforms, you can enable the
Thanks François, that works now.
For reference, I just extended the WMT17 recipe with explicit bpe/sentencepiece/onmt_tokenize[bpe]/onmt_tokenize[sentencepiece] examples here: #129.
At the risk of repeating myself: if you are running the latest version of the code, the inference config file is not needed; all the required information (about transforms notably) is grabbed from the saved model's config.json. That being said, your setup still looks shady to me. The inference config provided is not valid and should raise an error. So, a more logical command would be something like this:
Please share:
Hi François, see below:
I tried
I don't understand how this is possible. The sentencepiece transform config is invalid and as such should have raised an error in training (and should raise an error here in predict). How are you running this code? Did you
To make sure you're running the local (up to date) code, you can
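For example, one generic way to check which eole code is actually being imported (plain pip/Python commands, nothing eole-specific; this assumes the package/module is named eole):

pip show eole                                   # version and install location as seen by pip
python -c "import eole; print(eole.__file__)"   # path of the module actually imported
# for an editable install from a local clone, this path should point into that clone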
Hi François, I completely uninstalled Eole and reinstalled it, using
So, I just disabled src_subword_type and tgt_subword_type in the config file, but I'm not sure it's now using SentencePiece BPE tokenization. Then I get this error:
So, I disabled my pre-trained embeddings and launched the training again.
It looks like it doesn't like the normalize transform, so I disabled it and now I can train. Hopefully the inference will run more smoothly.
For the normalize transform you need to specify src_lang/tgt_lang at the dataset level.
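A minimal sketch of what that looks like, reusing the corpus paths from earlier in the thread (assuming the keys simply go under each dataset entry, as the comment above describes):

data:
  corpus_1:
    path_src: data/src-train.txt
    path_tgt: data/tgt-train.txt
    src_lang: en
    tgt_lang: ty
  valid:
    path_src: data/src-val.txt
    path_tgt: data/tgt-val.txt
    src_lang: en
    tgt_lang: ty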
Okay François, I'm finally able to run inference. Thank you so much for your valuable help. And please let me know if you plan on adding the pre-trained embeddings function.
Hi, can you please clarify how to use the eole predict command?
Right now I use the command
eole predict --src data/src-test.txt --model_path models/step_1000 --beam_size 5 --batch_size 4096 --output translations/tgt-test.txt --gpu 0
for inference, and all I get is <UNK> tokens in the output.
I'm using this in the config file:
I use this for the vocab files:
My files look good and I don't understand why I'm getting UNK tokens only.
I also see that the eole documentation says we can use eole translate -c your_config.yaml. What is that for?