
Performance degrades after fine-tuning. #4

Open
nemtiax opened this issue Feb 9, 2023 · 1 comment
nemtiax commented Feb 9, 2023

I'm interested in using Kaldi to recognize aircraft tailsigns. I used your speech-training-recorder utility to record 600 samples of myself speaking a tailsign, and then used those to run fine-tuning starting from kaldi_model_daanzu_20200905_1ep-mediumlm-base. Each sample is 2-5 seconds long, and contains 4-8 words.

Here is the performance of the base model on the training set before finetuning (as measured by test_model.py):

Overall -> 28.16 % +/- 1.65 % N=2841 C=2120 S=543 D=178 I=79

And after finetuning:

Overall -> 23.69 % +/- 1.56 % N=2841 C=2190 S=180 D=471 I=22
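
(For reference, I'm reading the `Overall` line as the usual WER decomposition, WER = (S + D + I) / N over the reference word count; a quick sanity check of the two figures above, assuming test_model.py reports standard Kaldi-style counts:)

```python
# Sanity check of the Overall percentages above: WER = (S + D + I) / N
for label, n, s, d, i in [("base", 2841, 543, 178, 79),
                          ("fine-tuned", 2841, 180, 471, 22)]:
    print(f"{label}: WER = {(s + d + i) / n:.2%}")
# base: WER = 28.16%
# fine-tuned: WER = 23.69%
```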

Note again that these statistics are computed on the training set, not a held-out test set, so I expected to see a substantial improvement. While the topline WER did improve, the nature of the errors changed significantly: the original model makes many substitution errors, which are largely homophones (two -> to, etc.), whereas the fine-tuned model mostly makes deletions. For example, here are some transcriptions from the new model:

File: audio_data/recorder_2023-02-07_15-39-50_337121.wav
Ref: gulfstream two two one charlie mike
Hyp: gulfstream charlie mike
File: audio_data/recorder_2023-02-07_15-39-58_369551.wav
Ref: air west seventy four
Hyp: air west
File: audio_data/recorder_2023-02-07_15-39-54_443257.wav
Ref: precision thirty five sixty six
Hyp: precision

I'm curious whether you have any suggestions for what might be going wrong. I noticed this note in the fine-tuning script:

frames_per_eg=150,110,100,50  # Standard default is 150,110,100 but try 150,110,100,50 for training with utterances of short commands

I left it as is, with the 50 included, but I'm not sure whether my dataset counts as "short" or if that refers to something with only one or two words per utterance.
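
For context, here's roughly how I'd check my utterance lengths in frames (assuming the standard 10 ms frame shift, and that my training data directory has an utt2dur file; the path below is just a placeholder):

```python
# Rough check of utterance lengths in frames (10 ms frame shift assumed).
# data/finetune/utt2dur is a placeholder path; it can be generated with
# utils/data/get_utt2dur.sh if it doesn't already exist.
frame_shift = 0.01  # seconds per frame
with open("data/finetune/utt2dur") as f:
    frames = [float(line.split()[1]) / frame_shift for line in f]
print(f"utts={len(frames)} min={min(frames):.0f} "
      f"mean={sum(frames) / len(frames):.0f} max={max(frames):.0f} frames")
# 2-5 second utterances should land around 200-500 frames each.
```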

I also noticed this note in the instructions:

--num-utts-subset 3000 : You may need this parameter to prevent an error at the beginning of nnet training if your training data contains many short (command-like) utterances. (3000 is a perhaps overly careful suggestion; 300 is the default value.)

I did not use this, and as far as I know, did not encounter an error. Is this a parameter I should try tuning even if I'm not getting an error?

Finally, I'd like to understand how much of any performance change is attributable to the updated acoustic model versus changes in the language model. I saw that compile_agf_dictation_graph does some work to build a new Dictation.fst; does this incorporate statistics from my training corpus? Is it possible to use the original Dictation.fst and just drop in my new acoustic model to test where the errors are coming from, or would that cause issues of its own?
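
Concretely, this is the kind of swap I have in mind, assuming the model directory is just a set of files (acoustic model, Dictation.fst, ...) that can be copied; the fine-tuned directory name below is a placeholder for my fine-tuning output:

```python
# Build an A/B model dir: fine-tuned acoustic model + the original Dictation.fst.
# "kaldi_model_finetuned" is a placeholder for my fine-tuning output directory.
import shutil

shutil.copytree("kaldi_model_finetuned", "kaldi_model_ab_test")
shutil.copy("kaldi_model_daanzu_20200905_1ep-mediumlm-base/Dictation.fst",
            "kaldi_model_ab_test/Dictation.fst")
# Then point test_model.py at kaldi_model_ab_test and compare the deletion
# pattern against the fully fine-tuned model.
```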

Thanks!

@bogdan0083

@nemtiax Hi! Did you manage to fix the performance? I have exactly the same issue. With the default config it improves WER but makes KaldiAG unusable for some reason. The utterances get truncated at the end as well, just like you described:

Ref: precision thirty five sixty six
Hyp: precision
